Color Schemes

I wrote a new minimalist color scheme (for emacs) today. If you’re into that sort of thing, the code is on github, and down below is a screenshot.


A JIT Compiler for Brainf*ck

I recently read this article by Josh Haberman where he demonstrates a JIT for the Brainf*ck language (henceforth referred to as BF). However it wasn’t very different from a regular AOT compiler — the entire BF script is read in and gets compiled (at runtime) into one giant function. I figured it would be a fun exercise to write a more traditional BF VM (if there is such a thing), specifically one that interprets a BF script and JITs only the hotter loops. The result is up on GitHub and this blog post will try to explain and justify some of the design decisions so as to convince myself that this wasn’t a complete waste of time.

The Bytecode

Neither the interpreter nor the JIT understands BF source; instead they work with a bytecode based IR. The bc_from_source function in bytecode.c parses BF source and generates a stream of bytecodes representing it. The term bytecode is somewhat misleading since a bytecode is four bytes long, usually followed by one or more four-byte payloads.

Translating to bytecode is important for performance. Individual BF instructions do very little, and a sequence like +++ can be executed more quickly in one go (by increasing the current cell’s value by 3, in this case) than by three separate bytecode dispatches. To this end, bc_from_source attempts to do some basic peephole optimizations by recognizing common BF idioms. For instance, one way to add the contents of the current cell to a cell three indices away (and clear the current cell in the process) is [->>>+<<<]. bc_from_source recognizes this pattern and emits a specialized bytecode, BC_MOVE_VALUE, which is specially handled by the interpreter and the JIT.
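To make this concrete, here is a rough sketch of what executing BC_MOVE_VALUE could look like; the function name, payload layout and byte-sized cells are illustrative assumptions, not necessarily what bytecode.c and the interpreter actually use:

#include <stdint.h>

/* Sketch: executing a BC_MOVE_VALUE bytecode.  `cell` points at the
 * current BF cell and `offset` is the four-byte payload following the
 * bytecode; the BF idiom [->>>+<<<] corresponds to offset == 3. */
static void execute_move_value(uint8_t *cell, int32_t offset) {
  cell[offset] = (uint8_t)(cell[offset] + *cell); /* add current cell into the target */
  *cell = 0;                                      /* ... and clear the current cell   */
}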

The Interpreter

The interpreter is quite simplistic. It uses GCC’s labels as values feature to provide a quick dispatch mechanism. Every loop has a heat counter associated with it, whose numerical value is stored as a payload following the BC_LOOP_BEGIN bytecode. The interpreter checks this value and calls into the JIT once a loop is hot. The JIT does its thing and stores away a pointer to the generated code in the program_t structure. The interpreter then calls into this generated code. The compiler also patches the BC_LOOP_BEGIN to a BC_COMPILED_LOOP so that the next iteration directly executes the JITed version of the loop.
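For readers who haven’t seen the labels-as-values trick, a stripped-down dispatch loop might look something like the sketch below. The bytecode names come from the post; HOT_THRESHOLD, compile_loop and the payload layout are assumptions made purely for illustration:

#include <stdint.h>

enum { BC_LOOP_BEGIN, BC_COMPILED_LOOP /* , ... */ };

#define HOT_THRESHOLD 50                 /* made-up threshold             */
extern void compile_loop(uint32_t *pc);  /* stand-in for invoking the JIT */

void interpret(uint32_t *pc) {
  /* One label per bytecode; the table is truncated for brevity. */
  static void *dispatch[] = { &&do_loop_begin, &&do_compiled_loop };

  goto *dispatch[*pc];

do_loop_begin:
  if (++pc[1] > HOT_THRESHOLD) {         /* pc[1]: the loop's heat counter  */
    compile_loop(pc);                    /* JIT the loop ...                */
    pc[0] = BC_COMPILED_LOOP;            /* ... and patch BC_LOOP_BEGIN     */
  }
  pc += 2;                               /* skip the opcode and its payload */
  goto *dispatch[*pc];                   /* dispatch the next bytecode      */

do_compiled_loop:
  /* in the real VM: call the code pointer the JIT stored in program_t */
  return;
}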

The Compiler

The compiler uses dynasm to generate a code generator, and as a consequence most of the compiler really lives in codegen.dasc. This might change if I later add some analysis or optimization passes which run only when compiling the bytecode, but for now the codegen practically is the compiler. The only compiler-specific optimization is that the compiler tries not to emit a redundant cmp instruction if the last instruction makes it unnecessary (by virtue of being an add instruction with a negative constant as one operand).
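As a sketch of that peephole (the structure and names below are mine, not how codegen.dasc is necessarily organized), the code generator can carry a flag recording whether the CPU flags already reflect the current cell, and consult it before emitting the cmp:

/* Illustrative only: the dynasm emission is shown as comments. */
typedef struct {
  int flags_track_cell;  /* did the last emitted insn set ZF for the current cell? */
  /* ... dynasm state ... */
} codegen_t;

static void emit_add_to_current_cell(codegen_t *cg, int delta) {
  (void)delta;
  /* | add byte [CELL_PTR], delta   -- also sets ZF for the new value */
  cg->flags_track_cell = 1;
}

static void emit_loop_back_branch(codegen_t *cg) {
  if (!cg->flags_track_cell) {
    /* | cmp byte [CELL_PTR], 0 */
  }
  /* | jnz =>loop_head */
  cg->flags_track_cell = 0;
}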

Performance

Speed is basically what it all boils down to. Running mandelbrot.bf takes 1.25 seconds on average, which is on par with compiling BF to C and then compiling the resultant C file with gcc -O3. While I’d call this pretty good, I think there is still some room for improvement; perhaps studying the machine code generated for typical BF programs will yield some insight.

Future Work

There are two improvements I can think of that I believe will improve performance, but haven’t had time to implement:

  • A register allocator should improve performance, but probably not as much as in normal programming languages — BF is a very cache-friendly language. 🙂
  • Some basic auto-vectorization. Maybe we can say "this loop simply adds these two blocks of cells together, one by one" and emit some SSE instructions to that effect.

Lambda Calculus for the Imperative Programmer I

This post is first in a series of articles on lambda calculus. You can see the whole series by looking up the tag lcip.

The Untyped Lambda Calculus

At the very core, the untyped lambda calculus (referred to as λ from this point on) is a model of computation. In other words, it formalizes the notion of computability and, since it is Turing Complete, it is possible to implement any algorithm using just the primitives given to us by λ (this last part is known as the Church-Turing Thesis).

The definition of λ gives us two things: it teaches us how to construct "lambda terms" and it teaches us how to manipulate the terms we create. The surprise is not in the rules themselves — they are all very simple and intuitive — but in how they come together to create a powerful model of computation.

λ Terms

A lambda term can either be a variable, an application or an abstraction. λ terms are inductively defined; any term resulting from one or more applications of the following rules is a lambda term:

  • (Variable) A textual symbol or token (like x) is a λ term. A variable is usually printed as itself.
  • (Application) Two λ terms can be combined into a new λ term by invoking this rule. The textual representation of an application is usually the textual representation of the first λ term followed by the textual representation of the second.
  • (Abstraction) A λ term can be combined with a variable to give a new λ term. An abstraction is generally textually represented as λ x. T, where x is the representation of the variable and T is the representation of the term.

In C:

enum {
  TERM_VARIABLE,
  TERM_ABSTRACTION,
  TERM_APPLICATION
};

typedef struct term {
  int type;
  union {
    struct {
      /* variables  */
      const char *var_name;
    };
    struct {
      /* applications  */
      struct term *left, *right;
    };
    struct {
      /* abstractions  */
      const char *arg_name;
      struct term *body;
    };
  };
} term_t;

void term_print(FILE *fptr, term_t *term) {
  switch (term->type) {
    case TERM_VARIABLE:
      fprintf(fptr, "%s", term->var_name);
      break;

    case TERM_APPLICATION:
      fprintf(fptr, "( ");
      term_print(fptr, term->left);
      fprintf(fptr, " ");
      term_print(fptr, term->right);
      fprintf(fptr, " )");
      break;

    case TERM_ABSTRACTION:
      fprintf(fptr, "( \\ %s. ", term->arg_name);
      term_print(fptr, term->body);
      fprintf(fptr, " )");
      break;
  }
}

For an abstraction λ x. T, x is said to be bound inside T. Variables (or symbols) that aren’t bound this way are said to be free. In the expression λ x. ((λ y. (((λ t. x) y) z)) y), z and the last y are free variables (the parentheses exist to denote the order in which the terms are combined). That z is free is obvious: since the expression doesn’t have a sub-term of the form λ z. T, z can’t really be bound anywhere. The last y too is free since none of its parent abstractions (there is only one, the outermost one) abstract over y. The second last y, however, is bound since its parent abstracts over y.

A direct (but rather slow) way to list the free variables in an expression in C:

static void list_free_recursive(stack_t *stack, term_t *term) {
  switch (term->type) {
    case TERM_VARIABLE: {
      if (!stack_contains(stack, term->var_name))
        printf("%s\n", term->var_name);
      break;
    }

    case TERM_APPLICATION: {
      list_free_recursive(stack, term->left);
      list_free_recursive(stack, term->right);
      break;
    }

    case TERM_ABSTRACTION: {
      stack_push(stack, term->arg_name);
      list_free_recursive(stack, term->body);
      stack_pop(stack);
      break;
    }
  }
}

void term_list_free_variables(term_t *term) {
  stack_t stack;
  stack_init(&stack);
  list_free_recursive(&stack, term);
}

The exact symbols used to represent bound variables don’t have any semantic meaning, in the same way x does not have any semantic meaning in int foo(int x) { return x; } (if you think about C compilation in terms of source files and compilation units, it isn’t hard to convince yourself that only external identifiers have any real meaning). In other words, you can always rename bound variables as long as you are consistent about it — λ x. (λ y. x) is exactly the same as λ t. (λ y. t). This property, called Alpha Equivalence, means that testing two terms for equality isn’t a simple recursive comparison — λ x. x is equal to λ y. y semantically but not textually.
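For instance, an alpha-equivalence check over the term_t representation above can compare bound variables by the position of their binders instead of by name. This is only a sketch (the fixed MAX_DEPTH bound and the missing overflow check are shortcuts of mine):

#include <string.h>

#define MAX_DEPTH 256

/* env_a and env_b are parallel stacks of binder names; depth is how many
 * binders we are currently under. */
static int alpha_equal_rec(term_t *a, term_t *b,
                           const char *env_a[], const char *env_b[], int depth) {
  if (a->type != b->type) return 0;

  switch (a->type) {
    case TERM_VARIABLE:
      /* Look for the innermost binder (if any) of each variable. */
      for (int i = depth - 1; i >= 0; i--) {
        int bound_a = strcmp(env_a[i], a->var_name) == 0;
        int bound_b = strcmp(env_b[i], b->var_name) == 0;
        if (bound_a || bound_b)
          return bound_a && bound_b;  /* must be bound by the same binder */
      }
      /* Both free: free variables must match by name. */
      return strcmp(a->var_name, b->var_name) == 0;

    case TERM_APPLICATION:
      return alpha_equal_rec(a->left, b->left, env_a, env_b, depth) &&
             alpha_equal_rec(a->right, b->right, env_a, env_b, depth);

    case TERM_ABSTRACTION:
      env_a[depth] = a->arg_name;
      env_b[depth] = b->arg_name;
      return alpha_equal_rec(a->body, b->body, env_a, env_b, depth + 1);
  }
  return 0;
}

int term_alpha_equal(term_t *a, term_t *b) {
  const char *env_a[MAX_DEPTH], *env_b[MAX_DEPTH];
  return alpha_equal_rec(a, b, env_a, env_b, 0);
}

With this, term_alpha_equal applied to λ x. x and λ y. y returns 1, while a plain name-by-name recursive comparison would not.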

Sometimes De Bruijn indices are used to indicate bound variables. Instead of textual variables, you use an integer to denote how far out the abstraction introducing a variable is. In λ x. (λ y. y x), for example, the y in (λ y. y x) is introduced by the first abstraction when moving outwards, while the x is introduced by the second. This makes its De Bruijn form λ. (λ. 1 2). Note that since the "variables" now directly reference the abstraction they’re interested in, the terms themselves don’t have to explicitly name a symbol. De Bruijn indices ensure that equal terms also look equal textually.
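As a sketch, a printer for the De Bruijn form needs nothing more than a stack of the enclosing binder names (free variables are simply printed by name, and the backslash stands in for λ as in term_print above):

#include <stdio.h>
#include <string.h>

/* `binders` holds the names of the enclosing abstractions, innermost
 * last; indices are 1-based, matching the text above.  No bounds check
 * on `binders`, for brevity. */
static void debruijn_print(FILE *fptr, term_t *term,
                           const char *binders[], int depth) {
  switch (term->type) {
    case TERM_VARIABLE:
      for (int i = depth - 1; i >= 0; i--)
        if (strcmp(binders[i], term->var_name) == 0) {
          fprintf(fptr, "%d", depth - i);  /* distance to the binder */
          return;
        }
      fprintf(fptr, "%s", term->var_name); /* free variable */
      break;

    case TERM_APPLICATION:
      fprintf(fptr, "( ");
      debruijn_print(fptr, term->left, binders, depth);
      fprintf(fptr, " ");
      debruijn_print(fptr, term->right, binders, depth);
      fprintf(fptr, " )");
      break;

    case TERM_ABSTRACTION:
      binders[depth] = term->arg_name;
      fprintf(fptr, "( \\. ");
      debruijn_print(fptr, term->body, binders, depth + 1);
      fprintf(fptr, " )");
      break;
  }
}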

λ terms with no free variables are called closed terms or combinators and terms with at least one free variable are called open terms. We need to be a little careful when dealing with open terms since we have to ensure we don’t end up accidentally binding variables. For instance, if T is closed, we can always use it in an abstraction like λ x. (λ y. (y (T x))). However, if T is open because it has x as a free variable, we may end up doing something unintentional (why?).

β Reductions

If anything, this is the property that should make it intuitively clear how λ models functional programming languages. It gives us the semantics of one step or transition in the evaluation or reduction of a given λ term.

A β reduction reduces (λ x. T) A to [x := A] T, where [a := B] C means "replace all open a in C by B". Consider (λ x. (λ x. x) x) y as an example — mechanically applying the rule tells us that this term can be reduced to [x := y] ((λ x. x) x). There is only one open x in the term (λ x. x) x, which when replaced by y gives us (λ x. x) y.

A naive implementation of [var := new_term] term in C:

term_t *term_replace_open(term_t *term, const char *var,
                          term_t *new_term) {
  switch (term->type) {
    case TERM_VARIABLE:
      if (strcmp(term->var_name, var) == 0) return new_term;
      return term;

    case TERM_APPLICATION: {
      term_t *left =
        term_replace_open(term->left, var, new_term);
      term_t *right =
        term_replace_open(term->right, var, new_term);
      return term_make_application(left, right);
    }

    case TERM_ABSTRACTION:
      if (strcmp(term->arg_name, var) == 0) return term;
      return term_make_abstraction(
          term->arg_name,
          term_replace_open(term->body, var, new_term));
  }
}

A sequence of β reductions is the λ equivalent of "executing" a program.
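With term_replace_open from above, a single β step that contracts a top-level redex is only a few lines; this is a sketch of mine, not part of the original listing:

/* If `term` is an application whose left side is an abstraction,
 * contract the redex (λ x. T) A  ==>  [x := A] T; otherwise return the
 * term unchanged. */
term_t *term_beta_step(term_t *term) {
  if (term->type == TERM_APPLICATION &&
      term->left->type == TERM_ABSTRACTION) {
    return term_replace_open(term->left->body,
                             term->left->arg_name,
                             term->right);
  }
  return term;
}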

To make sure you’ve understood how β reductions work, try to (mentally or with pen and paper) intuitively guess what the following expressions reduce to after a β reduction:

  • (λ x. (λ y. (x y))) (λ x. x)
  • (λ x. (x x)) (λ x. (x x))
  • (λ x. (λ x. (x x))) (λ x. x)

What can you infer from the second example?

A good way to develop intuition and ease in manipulating λ terms is to have a look at the various ways to encode "real world values" like integers and booleans in λ. One way to do this is using Church encoding. However, before that you need to take a look at the conventions generally used when representing λ terms (copied from this place):

  • Outermost parentheses are dropped: M N instead of (M N)
  • Applications are assumed to be left associative: M N P may be written instead of ((M N) P)
  • The body of an abstraction extends as far right as possible: λx. M N means λx. (M N) and not (λx. M) N.
  • A sequence of abstractions is contracted: λx. λy. λz. N is abbreviated as λxyz. N

Evaluation Strategies

Given a term to reduce, a redex, the β reduction rule tells us how to get a reduct (the result of applying the rule) but nothing about how to choose the redex to be reduced. For example, the expression (λ xy. x y) ((λ x. x) (λ x. x)) (from now on we’ll follow the notation specified above) has two reduction sites. We can see the expression either as (λ xy. x y) T (with ((λ x. x) (λ x. x)) as T) and try to reduce it to [x := T] (λ y. x y); or as (λ xy. x y) ((λ x. x) S) (with λ x. x as S) and try to reduce it to (λ xy. x y) ([x := S] x).

For a more convincing case, consider (λ xy. y) ((λ x. x x) (λ x. x x)) (λ x. x); can you find a policy for choosing redexes that will put the evaluator in an infinite loop? Can you find a policy that doesn’t?

The process of choosing the next redex in a λ term is dictated by the various reduction strategies a particular implementation can adopt.

Next posts

In the next posts we’ll discuss an implementation for λ, and then move on to typed variants.


Lock-free programming, first steps: mostly lock-free fixed-size vector

I recently got interested in lock-free programming techniques, and began playing with the idea of a lock-free, fixed size vector. While the final result isn’t completely lock-free, it is an optimization over a naively locked version. The following is a graph showing the amount of time taken for x threads to complete three high-contention benchmarks on a shared vector:

The Interface

The (unsurprising) interface is specified by a serial reference implementation, which is implemented subject to respecting the ACID properties in a multithreaded context:

      FixedVectorSemantics(T of Type, Size of Integer) {
        Fields:
          Buffer of Array(size = Size, type = T)
          Length of Integer

        Methods:
          push_back of (Value of Pointer(T)) → Integer {
            if Length < Size then:
              Buffer[Length] ← Value
              OldLength ← Length
              Length ← (Length + 1)
              return OldLength
            else:
              return -1
          }

          pop_back of (Pointer(T) × Integer) {
            if Length > 0 then:
              Length ← Length - 1
              return (Buffer[Length], Length)
            else:
              return OutOfRange
          }

          get of (Index of Integer) → Pointer(T) {
            if Index < Length:
              return Buffer[Index]
            else:
              return OutOfRange
          }
      }
    

There are a few things worth noting about the interface:

Pointer-only store

A FixedVector only stores pointers. This isn’t incidental — the atomic primitives it uses only operate on word-sized data. Moreover, the implementation steals the least significant bit of pointers in some cases. FixedVector can conceivably be used to store any data that is, in size, smaller than or equal to a pointer, as long as the least significant bit of the datum is always guaranteed to be 0.
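Concretely, the bit-stealing looks something like this sketch (the names are mine, not the actual implementation's):

#include <stdint.h>

/* Stored pointers are at least 2-byte aligned, so bit 0 is free to act
 * as the lock/sentinel marker. */
#define LOCK_BIT ((uintptr_t)1)

static inline uintptr_t lock_word(uintptr_t word)   { return word | LOCK_BIT;  }
static inline uintptr_t unlock_word(uintptr_t word) { return word & ~LOCK_BIT; }
static inline int       is_locked(uintptr_t word)   { return (int)(word & LOCK_BIT); }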

push_back and pop_back return indices

In a serial environment, the index at which an element has been inserted by push_back and the index of the element retrieved by pop_back can be inferred by inspecting Length immediately afterwards. In a multi-threaded environment, however, this is racy.

Absence of a put method

Unlike the previous two issues, this one is only incidental — nothing prevents the addition of a put method; I just haven’t gotten around to it yet.

The ACID Properties

The FixedVector implementation satisfies the ACID properties under the following interpretations:

Atomicity

A method either succeeds (and the change is then ensured by durability) or fails (and effects no observable change).

Consistency

FixedVector doesn’t provide any property that I could group under this heading. Note that it does not provide the guarantee that invoking get on an index less than Length will not return OutOfRange. Providing this guarantee will force FixedVector to not provide other, more important guarantees (see the implementation of push_back).

Isolation

Let N requests, R0, R1, …, RN, be executed concurrently on a FixedVector; and let those requests take the FixedVector from state S0 to S1 and return values V0, V1, …, VN respectively. Then, there is a way to order the requests such that the same change in state can be obtained by invoking them serially in that order, with each request returning the same value as before. Maintaining this property was the toughest aspect of the implementation.

Durability

A request that has gone through (the function called has returned) can’t be reordered with a request that comes in "later". This is a rather loose constraint, since different processors may see different views of the same FixedVector (even though x86’s cache coherency ensures every processor usually gets a consistent view).

As an example, consider the following four operations invoked concurrently on an empty vector:

  1. push_back(5)
  2. push_back(6)
  3. push_back(7)
  4. pop_back

Once all the four requests have been completed, the state of the vector may be one of the following:

  Vector contents                Value returned from pop_back
  Any permutation of [5, 6, 7]   OutOfRange
  Any permutation of [5, 7]      6
  Any permutation of [6, 7]      5
  Any permutation of [5, 6]      7

No other state is legal.

The Implementation

I approach thinking about a certain problem by asking myself two mutually exclusive questions: Is this trivial? If so, how? and Is this non-trivial? If so, why? These, followed up with some experimentation, usually lead to a solution.

The implementation makes heavy use of atomic reads and writes, and the usual compare and swap. The operations are mostly lock-free; only pop_back does some fine-grained locking.
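The pseudocode in the following sections uses AtomicRead, AtomicWrite, CompareAndSwap and MemoryFence; on GCC these map roughly onto the compiler builtins as in the sketch below (the real code may well spell them differently):

#include <stdint.h>

static inline uintptr_t atomic_read(volatile uintptr_t *p) {
  return __atomic_load_n(p, __ATOMIC_SEQ_CST);
}

static inline void atomic_write(volatile uintptr_t *p, uintptr_t value) {
  __atomic_store_n(p, value, __ATOMIC_SEQ_CST);
}

/* Returns non-zero iff *p was `expected` and has been set to `desired`. */
static inline int compare_and_swap(volatile uintptr_t *p,
                                   uintptr_t expected, uintptr_t desired) {
  return __sync_bool_compare_and_swap(p, expected, desired);
}

static inline void memory_fence(void) {
  __sync_synchronize();  /* full barrier */
}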

Implementing push_back

The simplest code that could possibly work for push_back looks like this:

  Len = AtomicRead(Length)
  CompareAndSwap(&Length, Len, Len + 1)
    if failed: Retry
  AtomicWrite(Buffer[Len], { value to push })
      

Turns out, this is good enough, with one modification:

  Len = AtomicRead(Length)
  CompareAndSwap(&Length, Len, Len + 1)
    if failed: Retry
  MemoryFence()
  AtomicWrite(Buffer[Len], { value to push })
      

The MemoryFence ensures that the write to Buffer[Len] is not reordered ahead of the write to Length. The CPU is normally free to do such a reordering since it preserves the semantics of a single-threaded program. The problem with this reordering is easy to see from the following interleaving:

 Thread 0: push_back(5)                   Thread 1: push_back(10)
   Len = Length                             Len = Length
   CompareAndSwap(&Length, Len, Len + 1)    CompareAndSwap(&Length, Len, Len + 1)
   Buffer[Len] = 5                          Buffer[Len] = 10

 Initial vector state: Length = 0, Data = []

 Step 1: Both threads execute Len = Length and see Len = 0.  Thread 1's
         CPU has reordered its store, so its next instruction is
         Buffer[0] = 10, ahead of its CompareAndSwap.
                                            State: Length = 0, Data = []
 Step 2: Thread 0 executes CompareAndSwap(&Length, 0, 1), which succeeds.
         Thread 1 is stalled.               State: Length = 1, Data = []
 Step 3: Thread 0 executes Buffer[0] = 5; push_back(5) is done.
         Thread 1 is still stalled.         State: Length = 1, Data = [5]
 Step 4: Thread 1's reordered store Buffer[0] = 10 finally lands,
         overwriting the 5.                 State: Length = 1, Data = [10]
 Step 5: Thread 1 executes CompareAndSwap(&Length, 0, 1), which fails
         (Length is 1); it retries.         State: Length = 1, Data = [10]
 Step 6: Thread 1's second attempt succeeds (execution skipped for
         brevity).                          State: Length = 2, Data = [10, 10]

The state of an empty vector after a serial push_back(5) and a push_back(10) (in either of the two possible orders) can never be [10, 10] — a missing fence breaks isolation. However, with the memory fence in place, the CAS is always executed (and fails when required) before the atomic write and this problem is avoided.

There is another subtlety, however. Consider a concurrent invocation of n + m + 1 (m >= 0) pushes and n pops on a vector with k elements (with k >= n). Assume the first push_back sets the length to k + t + 1 (with t >= 0) and stalls, after which all the other requests execute in a way that the length of the vector when the first request is resumed is greater than k + t. Since the first request, on resumption, writes a value to Buffer[k + t], it essentially overwrites a value that was written to Buffer by another push_back! I found this troubling, more so because, in this case, we end up pop_backing a random value that hasn’t yet been pushed! Fortunately, this issue can be solved with no modification to push_back at all, as we shall see. We’ll return to this issue after discussing the implementation for pop_back.

Implementing pop_back

I found implementing pop_back a great deal harder than implementing push_back, primarily because of the ABA problem. The problem stems from the fact that a pop_back implementation needs to atomically both modify Length and read Buffer[Length - 1]. If we modify Length first, we might just allow a push_back to overwrite the slot and then read and return a value different from the one we popped; and if we read Buffer[Length - 1] first, we might just allow a pop_back followed by a push_back to overwrite the value we are about to pop and, again, return a value different from what we popped.

An example of the first issue (when we modify Length before reading Buffer[Length - 1]):

 Thread 0: pop_back                        Thread 1: push_back(10)
   Len = Length                              Len = Length
   CompareAndSwap(&Length, Len, Len - 1)     CompareAndSwap(&Length, Len, Len + 1)
   OldValue = Buffer[Len - 1]                MemoryFence()
   Return OldValue                           Buffer[Len] = 10

 Initial vector state: Length = 1, Data = [42]

 Step 1: Thread 0 executes Len = Length and sees Len = 1.
                                             State: Length = 1, Data = [42]
 Step 2: Thread 0 executes CompareAndSwap(&Length, 1, 0), which succeeds.
                                             State: Length = 0, Data = []
 Step 3: Thread 1 runs its entire push_back(10): it reads Length = 0,
         CASes Length from 0 to 1 and writes Buffer[0] = 10.
                                             State: Length = 1, Data = [10]
 Step 4: Thread 0 resumes, reads OldValue = Buffer[0] and "succeeds",
         returning 10.                       State: Length = 1, Data = [10]

We end up breaking isolation in this case — no ordering of push_back(10) and pop_back on a vector [42] can have the pop_back return 10 and change the state of the vector to [10]. A similar example can be made for the case where we read Buffer[Length - 1] before modifying Length.

I eventually had to settle for a (fine-grained) locking implementation. The lock here is a bit set in the word about to be popped. You cannot pop and return a word that has the lock bit set. The algorithm looks like this (Lock(word) returns word with the lock bit set):

  Len = AtomicRead(Length)
  if Len == 0 then return OutOfRange
  OldValue = AtomicRead(Buffer[Len - 1])
  if OldValue has lock bit set: Retry -- (A)

  CompareAndSwap(&Buffer[Len - 1], OldValue, Lock(OldValue))
    if failed: Retry
  MemoryFence()

  CompareAndSwap(&Length, Len, Len - 1)
    if failed:
      AtomicStore(&Buffer[Len - 1], OldValue)
      Retry
  return OldValue
      

First of all, pop_back isn’t lock free, because of the Retry in (A). In short, a pop_back requested when the length of the vector is k effectively locks out pop_backs of all elements at index k - 1 and below — popping any of those elements first involves popping the element at index k - 1, which (A) forbids us from doing until the in-flight pop_back either succeeds or fails. I have a hunch that making pop_back completely lock-free will be a significant performance boost.

Note that the lock bit isn’t reset if the pop_back succeeds — this is crucial to the correctness of this algorithm. This prevents a pop_back between a push_back’s CAS on Length and the subsequent write to Buffer[OldLength]. You see, since we’re pushing a value at location i (say), there must have been a successful pop that took the length from i + 1 to i at some point. That pop_back must have left the lock bit set for Buffer[i], and this is how a concurrent pop_back is locked out till the push_back is finished. This takes care of the situation discussed earlier in the push_back section. We take care of the case where a push_back is done at a new index — one which hasn’t ever held a value — by initializing the entire buffer with a sentinel value at allocation. This sentinel value has the lock bit set.

The MemoryFence serves a purpose similar to that in push_back — if the CAS on Length is reordered ahead of the CAS on the lock bit, we might end up returning a different value than we popped.

Implementing get

Implementing get is the easiest of all, since it has the luxury of being a read-only operation:

  get(index):
    Len = AtomicRead(Length)
    if index >= Len: return OutOfRange
    Value = AtomicLoad(&Buffer[index])
    if Value != Sentinel:
      return Value
    else:
      return OutOfRange
      

In the case we do see a value that is being concurrently popped, we can pretend as if the get was reordered to before that particular pop_back request.

My Conclusions

Lock-free programming seems like a different world — multiple cores make ordinary things behave in unexpected ways. I have tested my implementation on an x86 (an Intel i7), a coherent paradise; I’m sure testing on a looser architecture will reveal subtle, overlooked issues. And just so you know, there are better ways to implement a lock-free vector, written by people far smarter than me. I’m still learning. 🙂


Peeking inside LuaJIT

LuaJIT is a high-performance virtual machine for Lua. I’ve recently started playing with it and this blog post talks about what I’ve discovered so far.

LuaJIT is a tracing JIT as opposed to a method JIT — it works not by discovering and optimizing hot methods as a whole but by discovering and optimizing hot traces or paths of execution. LuaJIT has a rather interesting interpreter as well, but I won’t talk about it here except in context of the JIT.

The meat of the interpreter lies in the vm_<arch>.dasc files, which contain a specification of the interpreter (and other assembly) stubs in an assembly-like language. So, for instance, vm_x86.dasc contains a function like:

// Generate assembly code for a specific Lua bytecode.
static void build_ins(BuildCtx *ctx, BCOp op, int defop) {
  switch (op) {
  case BC_HHGTTG:   // hypothetical bytecode
    | mov eax, 42  // hypothetical implementation
    break;
  }
}

vm_x86.dasc is first lowered by dynasm.lua into a C program. The lines beginning with a | are mapped to calls to dasm_put in such a way that the C functions produced emit these assembly opcodes. This new C program is then linked with buildvm.c and run at compile time to generate the required assembly stubs. For instance, calling build_ins with BC_HHGTTG (after lowering) will emit the assembly opcodes for moving 42 to eax (whatever the semantic implication might be). This assembly is emitted into lj_vm.s and linked in with the rest of LuaJIT.

Detecting hot traces

Now, with that bit of detail out of our way, let us look at how Lua actually picks up the traces it wishes to compile. LuaJIT maintains the hotness of relevant instructions (jumps, calls) in a hashtable, but with a catch — it doesn’t resolve collisions. Since hotness is a heuristic and hot instructions aren’t dense (i.e. it is unlikely that all jumps in a program will be hot), this approach works quite well in practice. The following macro, used to access the hotness counter for a particular instruction from C, should make things clear:

// gg is the global state and pc points to the bytecode
// instruction whose hotcount we want.
#define hotcount_get(gg, pc) (gg)->hotcount[(u32ptr(pc)>>2) & (HOTCOUNT_SIZE-1)]

The interpreter updates and checks the hotness counters when executing a jump or a call. This is done using the hotloop and hotcall macros in vm_<arch>.dasc (we’ll only be looking at the x86 architecture so <arch> is x86):

|.macro hotloop, reg
|  mov reg, PC
|  shr reg, 1
|  and reg, HOTCOUNT_PCMASK
|  sub word [DISPATCH+reg+GG_DISP2HOT], HOTCOUNT_LOOP
|  jb ->vm_hotloop
|.endmacro

and later, in various places:

case BC_FORL:
  |.if JIT
  |  hotloop RB
  |.endif

dynasm lowers the macro invocations in a straightforward way — BC_FORL subtracts HOTCOUNT_LOOP ticks from the corresponding counter and then jumps to vm_hotloop once the counter under-flows:

|->vm_hotloop:            // Hot loop counter underflow.
|.if JIT
|  mov LFUNC:RB, [BASE-8]     // Same as curr_topL(L).
|  mov RB, LFUNC:RB->pc
|  movzx RD, byte [RB+PC2PROTO(framesize)]
|  lea RD, [BASE+RD*8]
|  mov L:RB, SAVE_L
|  mov L:RB->base, BASE
|  mov L:RB->top, RD
|  mov FCARG2, PC
|  lea FCARG1, [DISPATCH+GG_DISP2J]
|  mov aword [DISPATCH+DISPATCH_J(L)], L:RBa
|  mov SAVE_PC, PC
|  call extern lj_trace_hot@8     // (jit_State *J, const BCIns *pc)
|  jmp <3
|.endif

dynasm can be made aware of C structs and offsets of their fields using a .type directive. In the body of vm_hotloop, LFUNC:RB tells dynasm to treat the value in RB as an LFUNC. RB and LFUNC are defined earlier:

|.define RB,      ebp
// ...
|.type LFUNC,     GCfuncL

Of course, treating RB as an LFUNC in LFUNC:RB doesn’t do anything by itself; it is essentially documentation. In the next instruction (mov RB, LFUNC:RB->pc), however, this annotation allows us to say LFUNC:RB->pc and have dynasm automatically figure out the correct offset. Let’s step through vm_hotloop, noting that BASE points to the top of Lua’s evaluation stack, RB and RD are registers, FCARG1 and FCARG2 are the registers that are used as the first two arguments in the calling convention used when transitioning from Lua to C, and that SAVE_L and SAVE_PC are stack slots.

First, we pop the GCfuncL off the top of the stack and read pc, which points to the beginning of the bytecode for the closure:

|  mov LFUNC:RB, [BASE-8]
|  mov RB, LFUNC:RB->pc

LuaJIT follows a common convention of storing function literals or prototypes separately from function closures. This allows the VM to save space by sharing the byte-code between two closures of the same function prototype. In V8, for instance, you have SharedFunctionInfos holding the information specific to a function literal and Functions representing actual, executable closures.

In LuaJIT, function literals are represented using GCproto. In memory, a GCproto object is followed by the bytecode for the function literal (something shared by all closures, represented by GCfuncLs). Thus, given a GCfuncL, we can extract the corresponding GCproto by subtracting sizeof(GCproto) from the pointer to the beginning of the bytecode array. PC2PROTO uses this technique to access the framesize field in the GCproto structure and uses it to compute the first free slot in the stack:

|  movzx RD, byte [RB+PC2PROTO(framesize)]
|  lea RD, [BASE+RD*8]
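In plain C, the PC2PROTO arithmetic boils down to something like this sketch (the struct is a stand-in with a single field, not LuaJIT’s real GCproto):

#include <stdint.h>

typedef struct GCproto {
  uint8_t framesize;
  /* ... the rest of the real GCproto ... */
} GCproto;

/* The bytecode array is laid out immediately after the GCproto header,
 * so recovering the prototype from a bytecode pointer is a subtraction. */
static GCproto *proto_from_pc(uint32_t *pc) {
  return (GCproto *)((char *)pc - sizeof(GCproto));
}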

Then we fill up the fields in lua_State (defined in lj_obj.h):

|  mov L:RB, SAVE_L
|  mov L:RB->base, BASE
|  mov L:RB->top, RD

set the argument registers, save a few things in stack slots and call into lj_trace_hot:

|  mov FCARG2, PC
|  lea FCARG1, [DISPATCH+GG_DISP2J]
|  mov aword [DISPATCH+DISPATCH_J(L)], L:RBa
|  mov SAVE_PC, PC
|  call extern lj_trace_hot@8
|  jmp <3

which puts LuaJIT into recording mode.

Recording Traces

A trace is the linear sequence of bytecodes encountered when executing a particular code path. Traces are recorded (the recording begins when lj_trace_hot is called) by coordinating with the interpreter. The interpreter uses a vector of pointers to assembly stubs, indexed by the bytecode instructions they implement — in principle, interpreting a Lua program proceeds by decoding the bytecodes one by one and jumping to the corresponding assembly stubs. Traces are recorded by making each element in the dispatch vector point to lj_vm_record (dynasm adds the lj_ prefix; in vm_x86.dasc the symbol is just vm_record). A simplified vm_record looks like this:

|->vm_record:
|  mov L:RB, SAVE_L
|  mov L:RB->base, BASE
|  mov FCARG2, PC
|  mov FCARG1, L:RB
|  call extern lj_dispatch_ins@8
|  mov BASE, L:RB->base
|  movzx RA, PC_RA
|  movzx OP, PC_OP
|  movzx RD, PC_RD
|  jmp aword [DISPATCH+OP*8+GG_DISP2STATIC]

We see that it basically boils down to a call to lj_dispatch_ins followed by a jump to whatever [DISPATCH+OP*8+GG_DISP2STATIC] points to. LuaJIT saves a backup copy of the dispatch vectors at a distance of GG_DISP2STATIC from the original dispatch vector (which now only contains pointers to lj_vm_record), and lj_vm_record uses this backup vector to jump to the real assembly stub.

Let’s get off the armchair now and look at some actual code and understand some subtleties.

Once a loop or a call is measured to be hot, the interpreter calls lj_trace_hot which is the entry point into recording, compiling and installing a trace. lj_trace_hot sets the state of the central jit_State object to LJ_TRACE_START and hands off control to lj_trace_ins.

The tracing subsystem of LuaJIT has five states: START, RECORD, END, ASM and ERR. lj_trace_ins is a finite automaton which moves between these states based on the bytecode instructions read off the execution trace. The overall scheme is simple:

START   -------------------> RECORD ---> ERR
          set up dispatch      |             
              vector           |            [IDLE]
                               v             /
                              END -----> ASM

Of course, this doesn’t happen all at once — the state is remembered and transitions are made (or not) as calls to lj_dispatch_ins uncover more and more of the currently executing trace. The bytecode stream the tracer sees is translated into an intermediate SSA representation which is then optimized and compiled into assembly.

Tracing gives us only one linear path through the code, disregarding other, equally valid paths that may have been taken. A tracing JIT usually deals with this by installing guards which check assumptions and bail out from the trace (to the interpreter, in LuaJIT’s case) on violated assumptions. For instance, consider this simple Lua program:

total = 0

function test()
  if total < 500 then
      total = total + 1
   end
end

for i=1,501 do
   test()
end

This compiles to the following bytecode (stripping out some parts non-essential to our understanding):

;; function test()
0001    GGET     0   0
0002    KSHORT   1 500
0003    ISGE     0   1
0004    JMP      0 => 0008
0005    GGET     0   0
0006    ADDVN    0   0   0
0007    GSET     0   0
0008 => RET0     0   1

;; body (instructions 0001 to 0007 removed)
0008    FORI     0 => 0012
0009 => GGET     4   2      ; "test"
0010    CALL     4   1   1
0011    FORL     0 => 0009
0012 => RET0     0   1

(You can get luajit to emit bytecode in this form using -b -l and have it dump all the trace information using -jdump.)

Nothing unexpected here — you have a for loop (delimited by FORI and FORL) running 501 times calling test in each iteration. LuaJIT extracts the following (again, trimmed down) trace (in SSA form):

;; Instructions 0001 to 0013 removed
0014 >  p32 HREFK  0013  "total" @11
0015 >  num HLOAD  0014
0016 >  num LT     0015  +500
0017  + num ADD    0015  +1
0018    num HSTORE 0014  0017
0019  + int ADD    0001  +1
0020 >  int LE     0019  +501
0021 ------ LOOP ------------
0022 >  num LT     0017  +500
0023  + num ADD    0017  +1
0024    num HSTORE 0014  0023
0025  + int ADD    0019  +1
0026 >  int LE     0025  +501
0027    int PHI    0019  0025
0028    num PHI    0017  0023

The instructions following LOOP are unsurprising: a normal loop with the usual SSA phi nodes. Note that (as is usual in tracing JITs) a version of test has been inlined completely into a tight loop. Instruction 0016 is rather interesting; the > means that a particular instruction is a guard and that the trace bails out to the interpreter if the condition doesn’t hold. In this case, we bail out if 0015 (which is the value of total, as you can make out from instructions 0014 and 0015) is greater than or equal to 500. This is interesting because the tracing compiler doesn’t try to be a smartass and infer that doing nothing when total is >= 500 is also correct behavior. All it knows is that when total is < 500, the correct behaviour is to add 1 to total, because that is what it has observed in the trace. Notably, if the total >= 500 path gets taken often enough, it would be marked hot and be compiled into a side-trace (you should try reproducing this).

lj_record_ins in lj_record.c records each bytecode instruction into SSA before it is executed. One architectural subtlety is knowing which guard to emit — given a condition IF A < B THEN {0} ELSE {1} END, do you emit a guard for A < B or for NOT (A < B)? We get that information only after the condition is evaluated: if A < B is true during the trace recording phase, then we need a guard asserting (A < B), and vice versa. Instead of implementing a partial interpreter in the tracer, LuaJIT does this by postponing the insertion of the guard instructions till just before lj_record_ins is called with the next bytecode. It knows the result of evaluating the condition by then.

Another subtlety involves tracing across already-compiled traces. LuaJIT keeps track of compiled traces by replacing the normal loop or call bytecodes with special marker bytecodes. These special bytecodes signal the existence and provide the location of a compiled trace beginning at that point. However, when tracing, we’d like to be able to see the function bytecodes. This is done by temporarily patching the special JIT marker bytecode (BC_JFUNCF or BC_JLOOP, for instance) with the original bytecode, tracing through the original bytecode and patching back the JIT marker bytecode once the tracing is done. To see this happening, have a look at rec_func_jit (in lj_record.c) and trace_pendpatch in lj_trace.c.

Snapshots

Any VM that involves transitioning between two (or more!) modes of execution faces the problem of converting between unoptimized and optimized stack layouts. This is true of LuaJIT too, with a slightly different meaning of stack. We’d like observational equivalence between an interpreted trace and the same trace compiled. We’d like the compiled trace to have the same side effects and map the operand stack the same way as the interpreted trace. However, this equivalence need not hold inside the compiled trace; we only need to make sure that it holds at the trace exits (guarded or not).

LuaJIT solves this by maintaining a mapping from stack location to SSA IR instructions. Such mappings are called snapshots in the LuaJIT codebase. Using a snapshot, it can then reconstruct the operand stack just as it would have been had the trace been interpreted (snap_restoreval and lj_snap_restore in lj_snap.c).
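To give a rough picture of what such a mapping could look like, here is an illustrative sketch — not LuaJIT’s actual SnapShot/SnapEntry encoding, which is considerably more compact:

#include <stdint.h>

typedef struct {
  uint16_t slot;    /* operand stack slot to restore          */
  uint16_t ir_ref;  /* IR instruction whose result goes there */
} snap_entry_t;

typedef struct {
  snap_entry_t *entries;
  int n_entries;
} snapshot_t;

/* Stand-in for fetching the value an IR instruction produced (from a
 * register or spill slot of the exiting trace). */
extern uint64_t read_ir_value(uint16_t ir_ref);

/* On a trace exit, rebuild the interpreter's view of the operand stack. */
static void restore_stack(uint64_t *stack, const snapshot_t *snap) {
  for (int i = 0; i < snap->n_entries; i++)
    stack[snap->entries[i].slot] = read_ir_value(snap->entries[i].ir_ref);
}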

This is slow, but quoting Mike Pall: "State restoration using this data-driven approach is slow of course. But repeatedly taken side exits quickly trigger the generation of side traces. The snapshot is used to initialize the IR of the side trace with the necessary state using pseudo-loads. These can be optimized together with the remainder of the side trace. The pseudo-loads are unified with the machine state of the parent trace by the backend to enable zero-cost linking to side traces."


Classical Axioms in Coq

For the past few days I’ve been working through an amazing text, Software Foundations, which aims to develop “basic concepts of functional programming, logic, operational semantics, lambda-calculus, and static type systems, using the Coq proof assistant”. This post is an annotated solution (I know, I’ve been doing a lot of those lately) to an interesting problem stated in the text.

The Problem

The following five statements are often considered as characterizations of classical logic (as opposed to constructive logic, which is what is “built in” to Coq). We can’t prove them in Coq, but we can consistently add any one of them as an unproven axiom if we wish to work in classical logic. Prove that these five propositions are equivalent.

Definition peirce := forall P Q: Prop,
  ((P->Q)->P)->P.
Definition classic := forall P:Prop,
  ~~P -> P.
Definition excluded_middle := forall P:Prop,
  P \/ ~P.
Definition de_morgan_not_and_not := forall P Q:Prop,
  ~(~P/\~Q) -> P\/Q.
Definition implies_to_or := forall P Q:Prop,
  (P->Q) -> (~P\/Q).

Since logical equivalence is transitive and symmetric, it is sufficient to prove each of the following:

  • peirce ↔ classic
  • classic ↔ de_morgan_not_and_not
  • classic ↔ excluded_middle
  • excluded_middle ↔ implies_to_or

We prove each of these by proving implication in both directions.

peirce ↔ classic

We start with the usual niceties (I’ll omit these in later proofs):

Theorem peirce_classic: peirce <-> classic.
Proof.
  unfold iff. unfold peirce. unfold classic. unfold not.
  split.

This, followed by

  intros Peirce P Classic.

drops us into our first goal, incidentally the easier of the two:

  Peirce : forall P Q : Prop, ((P -> Q) -> P) -> P
  P : Prop
  Classic : (P -> False) -> False
  ============================
   P

This sub-proof seemed easier to me because here we’re free to choose Q. We can get P from Peirce if we can somehow manage a (P -> Q) -> P for a Q of our choice. Can we somehow use (P -> False) -> False to make a (P -> Q) -> P? To make a (P -> Q) -> P, we need to make a P from a given P -> Q. We already have a (P -> False) -> False. If we set Q to False, using (P -> False) -> False we’ll get False and using False we can get anything!
This is all the intuition we need to prove Peirce → Classic:

  assert ((P -> False) -> P) as P_False_P.
    intros P_False.
    apply Classic in P_False.
    inversion P_False. (** P_False is false now *)
  apply Peirce in P_False_P. (** P_False_P is now P *)
  apply P_False_P.

I found proving Classic → Peirce a bit harder since we are no longer free to choose Q to our convenience.

  Classic : forall P : Prop, ((P -> False) -> False) -> P
  P : Prop
  Q : Prop
  P_Q_P : (P -> Q) -> P
  ============================
   P

We first need to decide whether to start with Classic or with P_Q_P; both of them result in P and are fair game at first sight. If we try to use P_Q_P we’d have to somehow come up with a term for P -> Q, something that given a P gives a Q. But Q could be anything (pre-decided, out of our control)! This seems a great deal harder (if not impossible) than providing a term for (P -> False) -> False which has no such problems.

The intuition for getting a (P -> False) -> False is this: we have a (term of type) P -> False and need to somehow transform this to (a term of type) False. We can do this if we can get a term of type P from somewhere (the constructive analog of conjuring up a classical contradiction). P_Q_P can be used to get a P, but we don’t know how to create Q. However, this time around, we have with us a term (a hypothesis) of type P -> False and since False -> Z for any Z, we can have P -> Z for any Z. In other words, inside P -> Q, we can first use the P (antecedent) on the P -> False we already have to create a False and once we have False we can create anything (including Q) from it. Inside the (P -> False) -> False term, we first use the above technique to create a P and then use that P on the given P -> False to create a False.

  (** From Classic to Peirce *)
  intros Classic P Q P_Q_P.
  assert ((P -> False) -> False) as P_Classic.
    intros P_False.
    assert (P -> Q) as P_Q.
      intros Given_P.
      apply P_False in Given_P.
      inversion Given_P. (** Given_P is false *)
    apply P_Q_P in P_Q. (** P_Q_P is now P *)
    apply P_False in P_Q.
    apply P_Q.
  apply Classic in P_Classic. (** P_Classic is P now *)
  apply P_Classic.

and this also proves peirce ↔ classic.

Qed.

classic ↔ de_morgan_not_and_not

This is easier than the last proof. When proving classic → de_morgan_not_and_not, Coq says

  Classic : forall P : Prop, ((P -> False) -> False) -> P
  P : Prop
  Q : Prop
  P_False_Q_False_False : (P -> False) /\ (Q -> False) -> False
  P_Or_Q_False : P \/ Q -> False
  ============================
   False

The trick for this proof is that given P \/ Q -> False, we can separately make both P -> False and Q -> False by simply injecting the P or Q into P \/ Q and then invoking the P \/ Q -> False term on it.

  assert ((P -> False) /\ (Q -> False)).
    split.
    (** P is False *)
    intros.
    apply or_introl with (Q:=Q) in H.
    apply P_Or_Q_False in H. apply H.

    (** Q is False *)
    intro.
    apply or_intror with (P:=P) in H.
    apply P_Or_Q_False in H. apply H.

Once we have a (P -> False) /\ (Q -> False), getting a False is easy.

  apply P_False_Q_False_False in H. apply H.

de_morgan_not_and_not → classic is easy too:

  De_Morgan : forall P Q : Prop,
              ((P -> False) /\ (Q -> False) -> False) -> P \/ Q
  P : Prop
  Classic : (P -> False) -> False
  ============================
   P

The trick here is to specialize De_Morgan with both P and Q set to P.

  assert ((P -> False) /\ (P -> False) -> False).
    intros.
    inversion H as [P_False0 P_False1].
      apply Classic in P_False0. apply P_False0.

  apply De_Morgan in H.
  inversion H as [Ans_0 | Ans_1].
    apply Ans_0.
    apply Ans_1.

and we’re done proving classic ↔ de_morgan_not_and_not.

Qed.

classic ↔ excluded_middle

I spent the most time proving this one, or rather, in proving classic → excluded_middle. After the usual introductions,

  Classic : forall P : Prop, ((P -> False) -> False) -> P
  P : Prop
  ============================
   P \/ (P -> False)

We could try to get a P out of Classic and then inject that P into P \/ (P -> False). However, this approach didn’t look like it’d work out: I don’t know if P is true (which is very close to what we’re trying to prove — even though we don’t know whether there is a term for P or not, either there is one for P or one for P -> False) and hence I should not be able to arbitrarily deduce the existence of a term of that type.

Since Classic is universally quantified over P, we apply it and explore:

  apply Classic.
  intros H.

gives

  Classic : forall P : Prop, ((P -> False) -> False) -> P
  P : Prop
  H : P \/ (P -> False) -> False
  ============================
   False

We can’t use Classic here, since if we did, we’d have to instantiate P with False and have to provide a term of the type (False -> False) -> False. Given a False -> False, the only way to get a False is to provide a False to the antecedent (False -> False). But, if we had a way to get a False from somewhere (given a False -> False) we could use that directly (without going inside (False -> False) -> False) since it is trivial to create a False -> False term out of nowhere.

The only other possibility is to come up with a P \/ (P -> False) and use H on it.

  assert ((P -> False) -> False) as P_False_False.
    intro P_False.

gives

  Classic : forall P : Prop, ((P -> False) -> False) -> P
  P : Prop
  H : P \/ (P -> False) -> False
  P_False : P -> False
  ============================
   False

We’re finally at the point where things get obvious. We can now inject P_False into a P \/ (P -> False) and use that to get the False we need.

    apply or_intror with (P:=P) in P_False.
    apply H in P_False.
    apply P_False.

gives

  Classic : forall P : Prop, ((P -> False) -> False) -> P
  P : Prop
  H : P \/ (P -> False) -> False
  P_False_False : (P -> False) -> False
  ============================
   False

This goal is slightly tricky, but we’re almost done. We use Classic on P_False_False to get a P which, once injected into P \/ (P -> False), we use with H to get a False.

  apply Classic in P_False_False.
  apply or_introl with (Q:=(P -> False)) in P_False_False.
  apply H in P_False_False.
  apply P_False_False.

excluded_middle → classic is much easier:

  intros H Prp Double_Neg.
  specialize H with Prp.

gives

  Prp : Prop
  H : Prp \/ (Prp -> False)
  Double_Neg : (Prp -> False) -> False
  ============================
   Prp

H tells us that P is either true or false. In case it is true, the conclusion directly follows. In case it isn’t, we use Double_Neg to get a False, from which anything follows:

  inversion H as [P | NegP].
  (** P is true *)
  apply P.
  (** P is false *)
  apply Double_Neg in NegP. inversion NegP.

And this proves classic ↔ excluded_middle!

Qed.

excluded_middle ↔ implies_to_or

This is probably the easiest one. After usual introductions,

  Excluded_Middle : forall P : Prop, P \/ (P -> False)
  P : Prop
  Q : Prop
  Implication : P -> Q
  ============================
   (P -> False) \/ Q

Excluded_Middle either gives us a term of type P from which we can get a Q using Implication or it gives us a term of type (P -> False). In either case, we have a way to construct (P -> False) \/ Q.

  specialize Excluded_Middle with P.
  inversion Excluded_Middle.
    apply Implication in H. right. apply H.
    left. apply H.

Proving implies_to_or → excluded_middle is easy too, after introductions:

  Implies_To_Or : forall P Q : Prop, (P -> Q) -> (P -> False) \/ Q
  P : Prop
  ============================
   P \/ (P -> False)

Note that Implies_To_Or with P := P and Q := P proves our goal for (almost) free.

  specialize Implies_To_Or with P P.
  assert (P -> P).
    intros. apply H.
  apply Implies_To_Or in H.
  inversion H as [HL | HR].
    right. apply HL.
    left. apply HR.

and this proves excluded_middle ↔ implies_to_or.

The source is up on github.


Typechecking TAL0 in Agda.

This post is an annotated solution to problem 4.2.11 in the book Advanced Topics in Types and Programming Languages from the chapter Typed Assembly Language by Greg Morrisett.

Introducing TAL0

I’ll do my best to give an informal introduction to TAL0 right here — if you wish to read something more polished I suggest you consult the original source.

TAL0 aims to provide control flow safety to simple, imperative programs. Control flow safety implies that a program is allowed to jump only to well-defined points, labels, in the code. This is easy to verify if all jump instructions take a constant label as an argument — we only have to verify that the label exists. But we’d like to go further than that and be able to dynamically jump to a location held in a register. With this freedom, the problem becomes a little more interesting.

TAL0 is a toy assembly language. You have a finite number of registers, each of which can hold either an integer or a label. You have the following instructions:

  • jump label jumps to label label.
  • reg0 ←r reg1 moves the value in reg1 to reg0.
  • reg ←l label moves label label to the register reg.
  • reg ←n integer moves integer integer to the register reg.
  • reg0 ← reg1 ⟨ op ⟩ reg2 performs operation op (anything with the type ℕ → ℕ → ℕ) on the integers stored in reg1 and reg2 and stores the result in reg0.
  • if reg0 jump reg1 jumps to the label stored in reg1 if the value in reg0 is not 0.

Informally, we’d like to design a type system that ensures that all the above operations are always well-defined. The type-system should ensure that reg0 ← reg1 ⟨ op ⟩ reg2 doesn’t try to add 42 to a label, for example.

Labels are sequences of instructions. I have taken the liberty of constraining sequences to always end with a static jump instruction. I don’t think we lose any generality due to this.

The only mutable “state” in the system is the registers.

The type system

In essence, the approach described in the paper assigns a type to the register file (the array of registers) at each instruction.

A label’s type is a vector of types, one for each register. This label-type, which we’ll call Γ, encodes what the sequence of instructions expects the register file to look like on entry. If, for instance, Γ says that r0 contains a label and r1 an integer on entry, we know that if r1 jump r0 is safe as the first instruction in the sequence. We then morph Γ as we go through the instructions in the sequence based on how each instruction affects the register file. For example, we know that after executing r0 ←n 42, Γ[r0] (the type at r0 in Γ) is Nat. In other words, we assign a (possibly different) type to the register file at each instruction. The type of each register (at a specific instruction) can either be Nat, denoting that it contains an integer, or block { r0 : t0, r1 : t1 ... }, denoting that it contains a label to a block (or sequence) of instructions which, on entry, expects r0 to contain a value of type t0, r1 to contain a value of type t1 and so on.

There is one subtlety, however. What type do you assign to this sequence of instructions?

  if r1 jump r0
  jump some_label

You see, going naively by the above scheme, the type r0 has at the point of the if_jump_ has to contain the type of the register file at that point — itself. A little more formally, if Γ is the type of the register file at the location of the if_jump_, then Γ[r0] = Γ. One obvious solution is to allow Γ to be an infinite type. Another approach (the one used in the paper) uses a polymorphic type and instantiates it with itself when required. I found this approach a little hard to implement in Agda. Specifically, I ran into issues with positivity of types, something I don’t fully understand, and decided to implement a simpler (but more limited) approach.

I use an anything type that denotes that a particular register can contain, really, anything. The type-system is still sound since TAL0 can only manipulate values of this type opaquely — it can neither jump to registers of type anything nor can it do math on them. In the motivating example, we’d set Γ[r0] = label { r0 = anything, r1 = ... }. The type of r0 is a label which, on entry, allows r0 to contain any value whatsoever. Since r0 can contain anything before the jump, it can very well contain a value of type label { r0 = anything, r1 = ... }. The crucial point is that we don’t really allow any kind of self-nesting or recursion here. I found it helpful to think of anything as the result of an infinite recursion.

The implementation

The first step is declaring Γ, the type of a register file. This is a mutually recursive definition:

mutual
  Γ′ : ℕ → Set
  Γ′ n = Vec Type n

  Γ = Γ′ size

  data Type : Set where
    natType : Type
    blockType : Γ′ size → Type
    anything : Type

Here size is a value of type ℕ which is basically the number of registers we have. Secondly, we define a function safeToJump' that checks if it is feasible to jump to the label contained in a specific register from the current point in the instruction stream:

safeToJump : {n : ℕ} → (to : Γ′ n) → (from : Γ′ n) → Set
safeToJump [] [] = ⊤
safeToJump (anything ∷ xs) (x' ∷ xs') = safeToJump xs xs'
safeToJump (x ∷ xs) (x' ∷ xs') =
  if (equalT x x') then (safeToJump xs xs') else ⊥

safeToJump' : Γ → Reg → Set
safeToJump' γ idx with lookup idx γ
... | natType = ⊥
... | blockType γ′ = safeToJump γ′ γ
... | anything = ⊥

equalT is a trivial function that compares types for equality. safeToJump' merely asserts that it is never safe to jump to an integer or an anything. When it sees a block-type it does a straightforward deep comparison, keeping in mind the anything semantics discussed above. Returning ⊤ and ⊥ allows us to have an implicit argument of the type safeToJump' γ r and have Agda fill in the value automatically for ⊤ and fail typechecking for ⊥.

The meat of the implementation is a Block data-type. A value of this type is also the proof of the consistency of the stream of instructions it represents. Each sequence is expressed as a discrete value and a program is a mutually recursive set of such values. For instance, here are two labels which just keep jumping to each other:

mutual
  α : Block (natType ∷ natType ∷ [])
  α = jump (♯ β)
  β : Block (natType ∷ natType ∷ [])
  β = jump (♯ α)

Agda’s mixfix syntax allows us to really go wild when defining Block:

data Block : Γ → Set where
  jump : {γ₁ γ₂ : Γ} → ∞ (Block γ₁) -> { _ : safeToJump γ₁ γ₂ } → Block γ₂
  _←r_▷_ : {γ : Γ} → (dest : Reg) → (src : Reg) →
            Block (γ [ dest ]≔ (lookup src γ)) → Block γ
  _←l_▷_ : {γ γ₁ : Γ} → (dest : Reg) → (src : ∞ (Block γ₁)) →
            Block (γ [ dest ]≔ (blockType γ₁)) → Block γ
  _←n_▷_ : {γ : Γ} → (dest : Reg) → (src : ℕ) →
            Block (γ [ dest ]≔ natType) → Block γ
  _←_⟨_⟩_▷_ : {γ : Γ} → (dest : Reg) → (op₁ : Reg) →
               (fn : ℕ → ℕ → ℕ) → (op₂ : Reg) →
               ⦃ _ : lookup op₁ γ ≡ natType ⦄ →
               ⦃ _ : lookup op₂ γ ≡ natType ⦄ →
               Block (γ [ dest ]≔ natType) → Block γ
  if_jump_▷ : {γ : Γ} → (src : Reg) → (dest : Reg) →
              ⦃ _ : lookup src γ ≡ natType ⦄ →
              { _ : safeToJump' γ dest } →
              Block γ → Block γ

Coinduction (the ∞) is needed to allow, for instance, jump instructions to the same block. We want to be able to type b = jump (♯ b). The ⦃ ... ⦄ is Agda syntax for instance arguments.

We build the blocks bottom up. Every instruction (other than jump, which is always the last instruction) takes a block of type Γ₁ and produces a block of type Γ₂, where running the instruction will change a register file of type Γ₂ to a register file of type Γ₁.

The individual rules are rather simple and I will annotate only one of them: the _←_⟨_⟩_▷_ rule says that we expect both op₁ and op₂ to contain integers and the rest of the block will see an integer at the destination register. It results in a block that may not contain an integer at the destination register. The approach may look a little mysterious (and ass-backwards), but Agda is perfectly capable of doing this kind of type inference, if it knows the expected type of the resultant Block. It isn’t hard to see how — given γ, it is always possible to compute γ [ dest ]≔ natType.

And yeah, this is about it! The full code (with a few trivial additions, like the equalT function) is available at https://github.com/sanjoy/Snippets/blob/master/TAL0.agda

Programming in Agda is awesome to the point of being unfair.
