Neither the interpreter nor the JIT understands BF source; instead they work with a bytecode-based IR. The bc_from_source
function in bytecode.c parses BF source and generates a stream of bytecodes representing it. The term bytecode is somewhat misleading since a bytecode is four bytes long, usually followed by one or more four-byte payloads.
Translating to bytecode is important for performance. Individual BF instructions do very little, and a sequence like +++
can be executed more quickly in one go (by increasing the current cell's value by 3
, in this case) than by three separate bytecode dispatches. To this end, bc_from_source
attempts to do some basic peephole optimizations by recognizing common BF idioms. For instance, one way to add the contents of the current cell to a cell three indices away (and clear the current cell in the process) is [->>>+<<<]
. bc_from_source
recognizes this pattern and emits a specialized bytecode, BC_MOVE_VALUE
, which is specially handled by the interpreter and the JIT.
The interpreter is quite simplistic. It uses GCC's labels-as-values feature to provide a quick dispatch mechanism. Every loop has a heat counter associated with it, whose numerical value is stored as a payload following the BC_LOOP_BEGIN
bytecode. The interpreter checks this value and calls into the JIT once a loop is hot. The JIT does its thing and stores away a pointer to the generated code in the program_t
structure. The interpreter then calls into this generated code. The compiler also patches the BC_LOOP_BEGIN
to a BC_COMPILED_LOOP
so that the next iteration directly executes the JITed version of the loop.
The compiler uses dynasm to generate a code generator; as a consequence, most of the compiler really lives in codegen.dasc. This might change if I later add some analysis or optimization passes which run only when compiling the bytecode, but for now the codegen practically is the compiler. The only compiler-specific optimization is that the compiler tries not to emit a redundant cmp
instruction if the last instruction makes it unnecessary (by virtue of being an add
instruction with a negative constant as one operand).
Speed is basically what it all boils down to. Running mandelbrot.bf takes 1.25 seconds on average, which is on par with compiling BF to C and then compiling the resultant C file with gcc -O3
. While I’d call this pretty good, I think there is still some room for improvement, perhaps studying the machine code generated for typical BF programs will yield some insight.
There are two improvements I can think of that I believe will improve performance, but haven't had time to implement:
At the very core, the untyped lambda calculus (referred to as λ from this point on) is a model of computation. In other words, it formalizes the notion of computability and, since it is Turing Complete, it is possible to implement any algorithm using just the primitives given to us by λ (this last part is known as the Church-Turing Thesis).
The definition of λ gives us two things: it teaches us how to construct "lambda terms" and it teaches us how to manipulate the terms we create. The surprise is not in the rules themselves — they are all very simple and intuitive — but in how they come together to create a powerful model of computation.
A lambda term can either be a variable, an application or an abstraction. λ terms are inductively defined; any term resulting from zero or more applications of the following rules is a lambda term:

A variable (e.g. x
) is a λ term. A variable is usually printed as itself.

An application of one λ term to another is a λ term, printed as ( M N ).

An abstraction over a λ term is a λ term, printed as λ x. T
; where x
is the representation of the variable and T
is the representation of the term.

In C:
enum { TERM_VARIABLE, TERM_ABSTRACTION, TERM_APPLICATION };

typedef struct term {
  int type;
  union {
    struct { /* variables */
      const char *var_name;
    };
    struct { /* applications */
      struct term *left, *right;
    };
    struct { /* abstractions */
      const char *arg_name;
      struct term *body;
    };
  };
} term_t;

void term_print(FILE *fptr, term_t *term) {
  switch (term->type) {
    case TERM_VARIABLE:
      fprintf(fptr, "%s", term->var_name);
      break;

    case TERM_APPLICATION:
      fprintf(fptr, "( ");
      term_print(fptr, term->left);
      fprintf(fptr, " ");
      term_print(fptr, term->right);
      fprintf(fptr, " )");
      break;

    case TERM_ABSTRACTION:
      fprintf(fptr, "( \\ %s. ", term->arg_name);
      term_print(fptr, term->body);
      fprintf(fptr, " )");
      break;
  }
}
For an abstraction λ x. T
, x
is said to be bound inside T
. Variables (or symbols) that aren’t bound this way are said to be free. In the expression λ x. (λ y. (((λ t. x) y) z) y)
, z
and the last y
are free variables (the parentheses exist to denote the order in which the terms are combined). That z
is free is obvious, since the expression doesn’t have a subterm of the form λ z. T
, z
can’t really be bound anywhere. The last y
too is free since none of its parent abstractions (there is only one, the outermost one) abstract over y
The second-to-last y
, however, is bound since its parent abstracts over y
.
A direct (but rather slow) way to list the free variables in an expression in C:
static void list_free_recursive(stack_t *stack, term_t *term) {
  switch (term->type) {
    case TERM_VARIABLE: {
      if (!stack_contains(stack, term->var_name))
        printf("%s\n", term->var_name);
      break;
    }

    case TERM_APPLICATION: {
      list_free_recursive(stack, term->left);
      list_free_recursive(stack, term->right);
      break;
    }

    case TERM_ABSTRACTION: {
      stack_push(stack, term->arg_name);
      list_free_recursive(stack, term->body);
      stack_pop(stack);
      break;
    }
  }
}

void term_list_free_variables(term_t *term) {
  stack_t stack;
  stack_init(&stack);
  list_free_recursive(&stack, term);
}
The exact symbols used to represent bound variables don't have any semantic meaning, in the same way x
does not have any semantic meaning in int foo(int x) { return x; }
(if you think about C compilation in terms of source files and compilation units, it isn't hard to convince yourself that only external identifiers have any real meaning). In other words, you can always rename bound variables as long as you are consistent about it — λ x. (λ y. x)
is exactly the same as λ t. (λ y. t)
. This property, called Alpha Equivalence, means that testing two terms for equality isn’t a simple recursive comparison — λ x. x
is equal to λ y. y
semantically but not textually.
Sometimes De Bruijn indices are used to indicate bound variables. Instead of textual variables, you use an integer to denote how far out the abstraction introducing a variable is. In λ x. (λ y. y x)
, for example, the y
in (λ y. y x)
is introduced by the first λ
when moving outwards, while the x
is introduced by the second. This makes its De Bruijn form λ. (λ. 1 2)
. Note that since the "variables" now directly reference the λ
they’re interested in, the λ
terms themselves don't have to explicitly name a symbol. De Bruijn indices ensure that equal terms also look equal textually.
λ terms with no free variables are called closed terms or combinators and terms with at least one free variable are called open terms. We need to be a little careful when dealing with open terms since we have to ensure we don’t end up accidentally binding variables. For instance, if T
is closed, we can always use it in an abstraction like λ x. (λ y. (y (T x)))
. However, if T
is open because it has x
as a free variable, we may end up doing something unintentional (why?).
If anything, this is the property that should make it intuitively clear how λ models functional programming languages. It gives us the semantics of one step or transition in the evaluation or reduction of a given λ term.
A β reduction reduces (λ x. T) A
to [x := A] T
, where [a := B] C
means "replace all open a
in C
by B
". Consider (λ x. (λ x. x) x) y
as an example — mechanically applying the rule tells us that this term can be reduced to [x := y] (λ x. x) x
. There is only one open x
in the term (λ x. x) x
, which when replaced by y
gives us (λ x. x) y
.
A naive implementation of [var := new_term] term
in C:
term_t *term_replace_open(term_t *term, const char *var, term_t *new_term) {
  switch (term->type) {
    case TERM_VARIABLE:
      if (strcmp(term->var_name, var) == 0)
        return new_term;
      return term;

    case TERM_APPLICATION: {
      term_t *left = term_replace_open(term->left, var, new_term);
      term_t *right = term_replace_open(term->right, var, new_term);
      return term_make_application(left, right);
    }

    case TERM_ABSTRACTION:
      if (strcmp(term->arg_name, var) == 0)
        return term;
      return term_make_abstraction(
          term->arg_name,
          term_replace_open(term->body, var, new_term));
  }
}
A sequence of β reductions is the λ equivalent of "executing" a program.
To make sure you’ve understood how β reductions work, try to (mentally or with pen and paper) intuitively guess what the following expressions reduce to after a β reduction:
(λ x. (λ y. (x y))) (λ x. x)
(λ x. (x x)) (λ x. (x x))
(λ x. (λ x. (x x))) (λ x. x)
What can you infer from the second example?
A good way to develop intuition and ease in manipulating λ terms is to have a look at the various ways to encode "real world" values like integers and booleans in λ. One way to do this is Church encoding. However, before that you need to take a look at the conventions generally used when representing λ terms (copied from this place):
Outermost parentheses are dropped: M N
instead of (M N)
.

Applications are left-associative: M N P
may be written instead of ((M N) P)
.

The body of an abstraction extends as far right as possible: λx. M N
means λx. (M N)
and not (λx. M) N
.

A sequence of abstractions is contracted: λx. λy. λz. N
is abbreviated as λxyz. N
Given a term to reduce, a redex, the β reduction rule tells us how to get a reduct (the result of applying the rule) but nothing about how to choose the redex to be reduced. For example, the expression (λ xy. x y) ((λ x. x) (λ x. x))
(from now on we’ll follow the notation specified above) has two reduction sites. We can see the expression either as (λ xy. x y) T
(with ((λ x. x) (λ x. x))
as T
) and try to reduce it to [x := T] (λ y. x y)
; or as (λ xy. x y) ((λ x. x) S)
(with λ x. x
as S
) and try to reduce it to (λ xy. x y) ([x := S] x)
.
For a more convincing case, consider (λ xy. y) ((λ x. x x) (λ x. x x)) (λ x. x)
; can you find a policy for choosing redexes that will put the evaluator in an infinite loop? Can you find a policy that doesn't?
The process of choosing the next redex in a λ term is dictated by the reduction strategy a particular implementation adopts.
Next posts
In the next posts we’ll discuss an implementation for λ, and then move onto typed variants.
The (unsurprising) interface is specified by a serial reference implementation, which is then implemented subject to the ACID properties in a multithreaded context:
FixedVectorSemantics(T of Type, Size of Integer) {
  Fields:
    Buffer of Array(size = Size, type = T)
    Length of Integer

  Methods:
    push_back of (Value of Pointer(T)) → Integer {
      if Length < Size then:
        Buffer[Length] ← Value
        OldLength ← Length
        Length ← (Length + 1)
        return OldLength
      else:
        return -1
    }

    pop_back of () → (Pointer(T) × Integer) {
      if Length > 0 then:
        Length ← Length - 1
        return (Buffer[Length], Length)
      else:
        return OutOfRange
    }

    get of (Index of Integer) → Pointer(T) {
      if Index < Length:
        return Buffer[Index]
      else:
        return OutOfRange
    }
}
There are a few things worth noting about the interface:
A FixedVector
only stores pointers. This isn't incidental — the atomic primitives it uses only operate on word-sized data. Moreover, the implementation steals the least significant bit of pointers in some cases. FixedVector
can conceivably be used to store any data that is, in size, smaller than or equal to a pointer, as long as the least significant bit of the datum is always guaranteed to be 0
.
push_back
and pop_back
return indices. In a serial environment, the index at which an element has been inserted by push_back
and the index of the element retrieved by pop_back
can be inferred by inspecting Length
immediately afterwards. In a multithreaded environment, however, this is racy.
There is no put
method. Unlike the previous two issues, this is only incidental — nothing prevents the addition of a put
method; I just haven't gotten around to it yet.
The FixedVector
implementation satisfies the ACID properties under the following interpretations:
Atomicity

A method either succeeds (and the change is then ensured by durability) or fails (and effects no observable change).
Consistency

FixedVector
doesn't provide any property that I could group under this heading. Note that it does not provide the guarantee that invoking get
on an index less than Length
will not return OutOfRange
. Providing this guarantee will force FixedVector
to not provide other, more important guarantees (see the implementation of push_back
).
Isolation

Let N requests, R_0, R_1 … R_{N-1}, be executed concurrently on a FixedVector
; and let those requests take the FixedVector
from state S_0 to S_1 and return the values V_0, V_1 … V_{N-1} respectively. Then there is some serial order of the requests that, starting from S_0, also produces S_1 and has each request return the same value as before. Maintaining this property was the toughest aspect of the implementation.
Durability

A request that has gone through (the function call has returned) can't be reordered with a request that comes in "later". This is a rather loose constraint, since different processors may see different views of the same FixedVector
(even though x86's cache coherency ensures every processor usually gets a consistent view).
As an example, consider the following four operations invoked concurrently on an empty vector:
push_back(5)
push_back(6)
push_back(7)
pop_back
Once all four requests have completed, the state of the vector may be one of the following:

Vector contents               | Value returned from pop_back
------------------------------+-----------------------------
Any permutation of [5, 6, 7]  | OutOfRange
Any permutation of [5, 7]     | 6
Any permutation of [6, 7]     | 5
Any permutation of [5, 6]     | 7
No other state is legal.
I approach thinking about a certain problem by asking myself two mutually exclusive questions: "Is this trivial? If so, how?" and "Is this non-trivial? If so, why?". These, followed up with some experimentation, usually lead to a solution.
The implementation makes heavy use of atomic reads and writes, and the usual compare-and-swap. The operations are mostly lock-free; only pop_back
does some fine-grained locking.
Implementing push_back
The simplest code that could possibly work for push_back
looks like this:
Len = AtomicRead(Length)
CompareAndSwap(&Length, Len, Len + 1)
if failed: Retry
AtomicWrite(Buffer[Len], { value to push })
Turns out, this is good enough, with one modification:
Len = AtomicRead(Length)
CompareAndSwap(&Length, Len, Len + 1)
if failed: Retry
MemoryFence()
AtomicWrite(Buffer[Len], { value to push })
The MemoryFence
ensures that the write to Buffer[Len]
is not reordered ahead of the write to Length
. The CPU is normally free to do such a reordering since it preserves the semantics of a single-threaded program. The problem with this reordering is easy to see (the line executing in a particular step is marked with an asterisk):
Thread 0                          | Thread 1
----------------------------------+----------------------------------
push_back(5):                     | push_back(10):
*  Len = Length                   | *  Len = Length
   CompareAndSwap(                |    CompareAndSwap(
       &Length, Len, Len + 1)     |        &Length, Len, Len + 1)
   Buffer[Len] = 5                |    Buffer[Len] = 10

Vector State: Length = 0, Data = []
----------------------------------+----------------------------------
push_back(5):                     | [Thread Stalled]
   Len = Length  -- Len is 0 now  | push_back(10):
*  CompareAndSwap(                |    Len = Length  -- Len is 0 now
       &Length, 0, 1)             |    CompareAndSwap(
   Buffer[0] = 5                  |        &Length, 0, 1)
                                  | *  Buffer[0] = 10  -- reordered!

Vector State: Length = 0, Data = []
----------------------------------+----------------------------------
push_back(5):                     | [Thread Stalled]
   Len = Length                   | push_back(10):
   CompareAndSwap(                |    Len = Length
       &Length, 0, 1)             |    CompareAndSwap(
*  Buffer[0] = 5                  |        &Length, 0, 1)
                                  | *  Buffer[0] = 10

Vector State: Length = 1, Data = []
----------------------------------+----------------------------------
push_back(5): Successful          | push_back(10):
                                  |    Len = Length
                                  |    CompareAndSwap(
                                  |        &Length, 0, 1)
                                  | *  Buffer[0] = 10

Vector State: Length = 1, Data = [5]
----------------------------------+----------------------------------
push_back(5): Successful          | push_back(10):
                                  |    Len = Length
                                  | *  CompareAndSwap(
                                  |        &Length, 0, 1)
                                  |    Buffer[0] = 10  -- already done

Vector State: Length = 1, Data = [10]
----------------------------------+----------------------------------
push_back(5): Successful          | push_back(10):
                                  |    CompareAndSwap(
                                  |        &Length, 0, 1)
                                  |    -- fails, retries
                                  |    Buffer[0] = 10

Vector State: Length = 1, Data = [10]
----------------------------------+----------------------------------
push_back(5): Successful          | push_back(10): Successful in second
                                  | try, execution skipped for brevity

Vector State: Length = 2, Data = [10, 10]
The state of an empty vector after a serial push_back(5)
and a push_back(10)
(in either of the two possible orders) can never be [10, 10]
— a missing fence breaks isolation. However, with the memory fence in place, the CAS
is always executed (and fails when required) before the atomic write, and this problem is avoided.
There is another subtlety, however. Consider a concurrent invocation of n + m + 1 (m >= 0) pushes and n pops on a vector with k elements (with k >= n). Assume the first push_back
sets the length to k + t + 1 (with t >= 0) and stalls, after which all the other requests execute in a way that the length of the vector when the first request is resumed is greater than k. Since the first request, on resumption, writes a value to Buffer[k]
, it essentially overwrites a value that was written to Buffer
by another push_back
! I found this troubling, more so because, in this case, we end up pop_back
ing a random value that hasn’t yet been pushed! Fortunately, this issue can be solved with no modification to push_back
at all, as we shall see. We’ll return to this issue after discussing the implementation for pop_back
.
Implementing pop_back
I found implementing pop_back
a great deal harder than implementing push_back
, primarily because of the ABA problem. The problem stems from the fact that a pop_back
implementation needs to atomically both modify Length
and read Buffer[Length - 1]
. If we modify Length
first, we might just allow a push_back
and then read and return a value different from the one we popped; and if we read Buffer[Length - 1]
first, we might just allow a pop_back
followed by a push_back
to overwrite the value we are about to pop and, again, return a value different from what we popped.
An example of the first issue (when we modify Length
before reading Buffer[Length - 1]
):
Thread 0                          | Thread 1
----------------------------------+----------------------------------
pop_back:                         | push_back(10):
*  Len = Length                   | *  Len = Length
   CompareAndSwap(                |    CompareAndSwap(
       &Length, Len, Len - 1)     |        &Length, Len, Len + 1)
   OldValue = Buffer[Len - 1]     |    MemoryFence()
   Return OldValue                |    Buffer[Len] = 10

Vector State: Length = 1, Data = [42]
----------------------------------+----------------------------------
pop_back:                         | [Thread Stalled]
   Len = Length                   | push_back(10):
*  CompareAndSwap(                | *  Len = Length
       &Length, 1, 0)             |    CompareAndSwap(
   OldValue = Buffer[Len - 1]     |        &Length, Len, Len + 1)
   Return OldValue                |    MemoryFence()
                                  |    Buffer[Len] = 10

Vector State: Length = 1, Data = [42]
----------------------------------+----------------------------------
pop_back:                         | [Thread Stalled]
   Len = Length                   | push_back(10):
   CompareAndSwap(                | *  Len = Length
       &Length, 1, 0)             |    CompareAndSwap(
*  OldValue = Buffer[Len - 1]     |        &Length, Len, Len + 1)
   Return OldValue                |    MemoryFence()
                                  |    Buffer[Len] = 10

Vector State: Length = 0, Data = []
----------------------------------+----------------------------------
pop_back:                         | push_back(10): Succeeds
*  OldValue = Buffer[Len - 1]     |
   Return OldValue                |

Vector State: Length = 1, Data = [10]
----------------------------------+----------------------------------
pop_back: "Succeeds", returns 10  | push_back(10): Succeeds

Vector State: Length = 1, Data = [10]
We end up breaking isolation in this case — no ordering of push_back(10)
and pop_back
on a vector [42]
can have the pop_back
return 10
and change the state of the vector to [10]
A similar example can be made for the case where we read Buffer[Length - 1]
before modifying Length
.
I eventually had to settle for a (fine-grained) locking implementation. The lock here is a bit set in the word about to be popped. You cannot pop and return a word that has the lock bit set. The algorithm looks like this (Lock(word)
returns word
with the lock bit set):
Len = AtomicRead(Length)
if Len == 0 then return OutOfRange

OldValue = AtomicRead(Buffer[Len - 1])
if OldValue has lock bit set: Retry          -- (A)

CompareAndSwap(&Buffer[Len - 1], OldValue, Lock(OldValue))
if failed: Retry

MemoryFence()

CompareAndSwap(&Length, Len, Len - 1)
if failed:
  AtomicStore(&Buffer[Len - 1], OldValue)
  Retry

return OldValue
First of all, pop_back
isn't lock-free, because of the Retry
in (A)
. In short, a pop_back
requested when the length of the vector is k locks out every pop of an element with index <= k — since popping an element with index <= k eventually involves popping the element at index k, which (A)
forbids until the original pop_back
either succeeds or fails. I have a hunch that making pop_back
completely lock-free would be a significant performance boost.
Note that the lock bit isn't reset when a pop_back
succeeds — this is crucial to the correctness of the algorithm. It prevents a pop_back
between a push_back
's CAS
on Length
and the subsequent write to Buffer[OldLength]
. You see, since we're pushing a value at location i (say), there must have been a successful pop from location i + 1 to i at some point. That pop_back
must have left the lock bit set on Buffer[i]
, and this is how any pop_back
is locked out till the push_back
is finished. This takes care of the situation discussed earlier in the push_back
section. We take care of the case when a push_back
is done at a new index, one which hasn't ever seen a value, by initializing the entire buffer with a sentinel value at allocation. This sentinel value has the lock bit set.
The MemoryFence
serves a purpose similar to that in push_back
— if the CAS
on Length
is reordered ahead of the CAS
on the lock bit, we might end up returning a different value than we popped.
Implementing get
Implementing get
is the easiest of all, since it has the luxury of being a read-only operation:
get(index):
  Len = AtomicRead(Length)
  if index >= Len: return OutOfRange

  Value = AtomicLoad(&Buffer[index])
  if Value != Sentinel:
    return Value
  else:
    return OutOfRange
In the case we do see a value that is being concurrently popped, we can pretend as if the get
was reordered to before that particular pop_back
request.
Lock-free programming seems like a different world — multiple cores make ordinary things behave in unexpected ways. I have tested my implementation on an x86 (Intel i7), a coherent paradise; I'm sure testing on a looser architecture will reveal subtle, overlooked issues. And just so you know, there are better ways to effect a lock-free vector, written by people far smarter than me. I'm still learning.
LuaJIT is a tracing JIT as opposed to a method JIT — it works not by discovering and optimizing hot methods as a whole but by discovering and optimizing hot traces, or paths of execution. LuaJIT has a rather interesting interpreter as well, but I won't talk about it here except in the context of the JIT.
The meat of the interpreter lies in the vm_<arch>.dasc
files, which contain a specification of the interpreter (and other assembly) stubs in an assembly-like language. So, for instance, vm_x86.dasc
contains a function like:
// Generate assembly code for a specific Lua bytecode.
static void build_ins(BuildCtx *ctx, BCOp op, int defop) {
  switch (op) {
  case BC_HHGTTG:       // hypothetical bytecode
    | mov eax, 42       // hypothetical implementation
    break;
  }
}
vm_x86.dasc
is first lowered by dynasm.lua
into a C program. The lines beginning with | are mapped to calls to dasm_put
in a way that the C functions produced now emit these assembly opcodes. This new C program is then linked with buildvm.c
and run at compile time to generate the required assembly stubs. For instance, calling build_ins
with BC_HHGTTG
(after lowering) will emit the assembly opcodes for moving 42
to eax
(whatever the semantic implication might be). This assembly is emitted into lj_vm.s
and linked in with the rest of the Lua JIT.
Detecting hot traces
Now, with that bit of detail out of the way, let us look at how LuaJIT actually picks the traces it wishes to compile. LuaJIT maintains the hotness of relevant instructions (jumps, calls) in a hashtable, but with a catch — it doesn't resolve collisions. Since hotness is a heuristic and not dense (i.e. it is unlikely that all jumps in a program will be hot), this approach works quite well in practice. The following macro, used to access the hotness counter for a particular instruction from C, should make things clear:
// gg is the global state and pc points to the bytecode
// instruction whose hotcount we want.
#define hotcount_get(gg, pc) \
  (gg)->hotcount[(u32ptr(pc)>>2) & (HOTCOUNT_SIZE-1)]
The interpreter updates and checks the hotness counters when executing a jump or a call. This is done using the hotloop
and hotcall
macros in vm_<arch>.dasc
(we’ll only be looking at the x86 architecture so <arch>
is x86
):
|.macro hotloop, reg
|  mov reg, PC
|  shr reg, 1
|  and reg, HOTCOUNT_PCMASK
|  sub word [DISPATCH+reg+GG_DISP2HOT], HOTCOUNT_LOOP
|  jb >vm_hotloop
|.endmacro
and later, in various places:
case BC_FORL:
  |.if JIT
  |  hotloop RB
  |.endif
dynasm
lowers the macro invocations in a straightforward way — BC_FORL
subtracts HOTCOUNT_LOOP
ticks from the corresponding counter and then jumps to vm_hotloop
once the counter underflows:
|->vm_hotloop:                  // Hot loop counter underflow.
|.if JIT
|  mov LFUNC:RB, [BASE-8]       // Same as curr_topL(L).
|  mov RB, LFUNC:RB->pc
|  movzx RD, byte [RB+PC2PROTO(framesize)]
|  lea RD, [BASE+RD*8]
|  mov L:RB, SAVE_L
|  mov L:RB->base, BASE
|  mov L:RB->top, RD
|  mov FCARG2, PC
|  lea FCARG1, [DISPATCH+GG_DISP2J]
|  mov aword [DISPATCH+DISPATCH_J(L)], L:RBa
|  mov SAVE_PC, PC
|  call extern lj_trace_hot@8   // (jit_State *J, const BCIns *pc)
|  jmp <3
|.endif
dynasm
can be made aware of C struct
s and offsets of their fields using a .type
directive. In the body of vm_hotloop
, LFUNC:RB
tells dynasm
to treat the value in RB
as an LFUNC
. RB
and LFUNC
are defined earlier:
.define RB, ebp
// ...
.type LFUNC, GCfuncL
Of course, treating RB
as an LFUNC
in LFUNC:RB
doesn't do anything; it is essentially documentation. In the next instruction (mov RB, LFUNC:RB->pc
), however, this annotation allows us to say LFUNC:RB->pc
and have dynasm
automatically figure out the correct offset. Let’s step through vm_hotloop
, noting that BASE
points to the top of Lua’s evaluation stack, RB
and RD
are registers, FCARG1
and FCARG2
are the registers that are used as the first two arguments in the calling convention used when transitioning from Lua to C and that SAVE_L
and SAVE_PC
are stack slots.
First, we pop the GCfuncL
off the top of the stack and read pc
, which points to the beginning of the bytecode for the closure:
|  mov LFUNC:RB, [BASE-8]
|  mov RB, LFUNC:RB->pc
LuaJIT follows a common convention of storing function literals or prototypes separately from function closures. This allows the VM to save space by sharing the bytecode between two closures of the same function prototype. In V8, for instance, you have SharedFunctionInfo
s holding the information specific to a function literal and Function
s representing actual, executable, closures.
In LuaJIT, function literals are represented using GCproto
. In memory, a GCproto
object is followed by the bytecode for the function literal (something shared by all closures, represented by GCfuncL
s). Thus, given a GCfuncL
, we can extract the corresponding GCproto
by subtracting sizeof(GCproto)
from the pointer to the beginning of the bytecode array. PC2PROTO
uses this technique to access the framesize
field in the GCproto
structure and uses it to compute the first free slot in the stack:
|  movzx RD, byte [RB+PC2PROTO(framesize)]
|  lea RD, [BASE+RD*8]
Then we fill up the fields in lua_State
(defined in lj_obj.h
):
|  mov L:RB, SAVE_L
|  mov L:RB->base, BASE
|  mov L:RB->top, RD
set the argument registers, save a few things in stack slots and call into lj_trace_hot
:
|  mov FCARG2, PC
|  lea FCARG1, [DISPATCH+GG_DISP2J]
|  mov aword [DISPATCH+DISPATCH_J(L)], L:RBa
|  mov SAVE_PC, PC
|  call extern lj_trace_hot@8
|  jmp <3
which puts LuaJIT into recording mode.
Recording Traces
A trace is the linear sequence of bytecodes encountered when executing a particular code path. Traces are recorded (the recording begins when lj_trace_hot
is called) by coordinating with the interpreter. The interpreter uses a vector of pointers to assembly stubs, indexed by the bytecode instructions they implement — in principle, interpreting a Lua program proceeds by decoding the bytecodes one by one and jumping to the corresponding assembly stubs. Traces are recorded by making each element in the dispatch vector point to lj_vm_record
(dynasm
adds the lj_
prefix; in vm_x86.dasc
, the symbol is just vm_record
). A simplified vm_record
looks like this:
|->vm_record:
|.if JIT
|  mov L:RB, SAVE_L
|  mov L:RB->base, BASE
|  mov FCARG2, PC
|  mov FCARG1, L:RB
|  call extern lj_dispatch_ins@8
|  mov BASE, L:RB->base
|  movzx RA, PC_RA
|  movzx OP, PC_OP
|  movzx RD, PC_RD
|  jmp aword [DISPATCH+OP*8+GG_DISP2STATIC]
|.endif
We see that it basically boils down to a call to lj_dispatch_ins
followed by a jump to whatever [DISPATCH+OP*8+GG_DISP2STATIC]
points to. LuaJIT saves a backup copy of the dispatch vectors at a distance of GG_DISP2STATIC
from the original dispatch vector (which now only contains pointers to lj_vm_record
), and lj_vm_record
uses this backup vector to jump to the real assembly stub.
Let’s get off the armchair now and look at some actual code and understand some subtleties.
Once a loop or a call is measured to be hot, the interpreter calls lj_trace_hot
which is the entry point into recording, compiling and installing a trace. lj_trace_hot
sets the state field
of the central jit_State
object to LJ_TRACE_START
and hands off control to lj_trace_ins
.
The tracing subsystem of LuaJIT has five states: START
, RECORD
, END
, ASM
and ERR
. lj_trace_ins
is a finite automaton which moves between these states based on the bytecode instructions read off the execution trace. The overall scheme is simple:
START ---------> RECORD ---------> ERR
 set up dispatch    |               |
 vector             |             [IDLE]
                    v               /
                   END ---------> ASM
Of course, this doesn’t happen all at once — the state is remembered and transitions are made (or not) as calls to lj_dispatch_ins
uncover more and more of the currently executing trace. The bytecode stream the tracer sees is translated into an intermediate SSA representation which is then optimized and compiled into assembly.
Tracing gives us only one linear path through the code, disregarding other, equally valid paths that may have been taken. A tracing JIT usually deals with this by installing guards which check assumptions and bail out from the trace (to the interpreter, in LuaJIT’s case) on violated assumptions. For instance, consider this simple Lua program:
total = 0

function test()
  if total < 500 then
    total = total + 1
  end
end

for i=1,501 do
  test()
end
This compiles to the following bytecode (stripping out some parts nonessential to our understanding):
;; function test()
0001 GGET 0 0
0002 KSHORT 1 500
0003 ISGE 0 1
0004 JMP 0 => 0008
0005 GGET 0 0
0006 ADDVN 0 0 0
0007 GSET 0 0
0008 => RET0 0 1
;; body (instructions 0001 to 0007 removed)
0008 FORI 0 => 0012
0009 => GGET 4 2 ; "test"
0010 CALL 4 1 1
0011 FORL 0 => 0009
0012 => RET0 0 1
(You can get luajit
to emit bytecode in this form using -bl
and have it dump all the trace information using -jdump
.)
Nothing unexpected here — you have a for loop (delimited by FORI
and FORL
) running 501
times calling test
in each iteration. LuaJIT extracts the following (again, trimmed down) trace (in SSA form):
;; Instructions 0001 to 0013 removed
0014 > p32 HREFK 0013 "total" @11
0015 > num HLOAD 0014
0016 > num LT 0015 +500
0017 + num ADD 0015 +1
0018 num HSTORE 0014 0017
0019 + int ADD 0001 +1
0020 > int LE 0019 +501
0021 ------------ LOOP ------------
0022 > num LT 0017 +500
0023 + num ADD 0017 +1
0024 num HSTORE 0014 0023
0025 + int ADD 0019 +1
0026 > int LE 0025 +501
0027 int PHI 0019 0025
0028 num PHI 0017 0023
The instructions following LOOP
are unsurprising: a normal loop with the usual SSA phi nodes. Note that (as is usual in tracing JITs) a version of test
has been inlined completely into a tight loop. Instruction 0016
is rather interesting; the >
means that a particular instruction is a guard and that the trace bails out to the interpreter if the condition doesn’t hold. In this case, we bail out if 0015
(which is the value of total
, as you can make out from instructions 0014
and 0015
) is greater or equal to 500
. This is interesting because the tracing compiler doesn't try to be a smartass and infer that it should do nothing when total
is >=
500
, which also is correct behavior. All it knows is that when total
is <
500
, the correct behaviour is to add 1
to total
, because that is what it has observed in the trace. Notably, if the total
>=
500
path gets taken often enough, it would be marked hot and be compiled into a sidetrace (you should try reproducing this).
lj_record_ins
in lj_record.c
records each bytecode instruction into SSA before it is executed. One architectural subtlety is knowing which guard to emit — given a condition IF A < B THEN {0} ELSE {1} END
do you emit a guard for A < B
or for NOT (A < B)
? We get that information only after the condition is evaluated: if A < B is true during the trace-recording phase, then we need a guard asserting (A < B), and vice versa. Instead of implementing a partial interpreter in the tracer, LuaJIT does this by postponing the insertion of the guard instructions till just before lj_record_ins is called with the next bytecode. It knows the result of evaluating the condition by then.
Another subtlety involves tracing across already-compiled traces. LuaJIT keeps track of compiled traces by replacing the normal loop or call bytecodes with special marker bytecodes. These special bytecodes signal the existence and provide the location of a compiled trace beginning at that point. However, when tracing, we’d like to be able to see the function bytecodes. This is done by temporarily patching the special JIT marker bytecode (BC_JFUNCF
or BC_JLOOP
, for instance) with the original bytecode, tracing through the original bytecode and patching back the JIT marker bytecode once the tracing is done. To see this happening, have a look at rec_func_jit
(in lj_record.c
) and trace_pendpatch
in lj_trace.c
.
Snapshots
Any VM that involves transitioning between two (or more!) modes of execution faces the problem of converting between unoptimized and optimized stack layouts. This is true of LuaJIT too, with a slightly different meaning of stack. We’d like observational equivalence between an interpreted trace and the same trace compiled: the compiled trace should have the same side effects and map the operand stack the same way as the interpreted trace. However, this equivalence need not hold inside the compiled trace; we only need to make sure that it holds at the trace exits (guarded or not).
LuaJIT solves this by maintaining a mapping from stack location to SSA IR instructions. Such mappings are called snapshots in the LuaJIT codebase. Using a snapshot, it can then reconstruct the operand stack just like how it would have been had the trace been interpreted (snap_restoreval
and lj_snap_restore
in lj_snap.c
).
This is slow, but quoting Mike Pall: "State restoration using this data-driven approach is slow of course. But repeatedly taken side exits quickly trigger the generation of side traces. The snapshot is used to initialize the IR of the side trace with the necessary state using pseudo-loads. These can be optimized together with the remainder of the side trace. The pseudo-loads are unified with the machine state of the parent trace by the backend to enable zero-cost linking to side traces."
The Problem
The following five statements are often considered as characterizations of classical logic (as opposed to constructive logic, which is what is “built in” to Coq). We can’t prove them in Coq, but we can consistently add any one of them as an unproven axiom if we wish to work in classical logic. Prove that these five propositions are equivalent.
Definition peirce := forall P Q: Prop, ((P -> Q) -> P) -> P.
Definition classic := forall P: Prop, ~~P -> P.
Definition excluded_middle := forall P: Prop, P \/ ~P.
Definition de_morgan_not_and_not := forall P Q: Prop, ~(~P /\ ~Q) -> P \/ Q.
Definition implies_to_or := forall P Q: Prop, (P -> Q) -> (~P \/ Q).
Since logical equivalence is transitive and symmetric, it is sufficient to prove each of the following:
peirce
⇔ classic
classic
⇔ de_morgan_not_and_not
classic
⇔ excluded_middle
excluded_middle
⇔ implies_to_or
We prove each of these by proving implication in both directions.
peirce
⇔ classic
We start with the usual niceties (I’ll omit these in later proofs):
Theorem peirce_classic: peirce <-> classic.
Proof.
  unfold iff. unfold peirce. unfold classic. unfold not.
  split.
This, followed by
intros Peirce P Classic.
drops us into our first goal, incidentally the easier of the two:
Peirce : forall P Q : Prop, ((P -> Q) -> P) -> P
P : Prop
Classic : (P -> False) -> False
============================
 P
This subproof seemed easier to me because here we’re free to choose Q. We can get P from Peirce if we can somehow manage a (P -> Q) -> P for a Q of our choice. Can we somehow use (P -> False) -> False to make a (P -> Q) -> P? To make a (P -> Q) -> P, we need to make a P from a given P -> Q. We already have a (P -> False) -> False. If we set Q to False, using (P -> False) -> False we’ll get False, and using False we can get anything!
This is all the intuition we need to prove Peirce
⇒ Classic
:
assert ((P -> False) -> P) as P_False_P.
intros P_False.
apply Classic in P_False.
inversion P_False. (** P_False is false now *)
apply Peirce in P_False_P. (** P_False_P is now P *)
apply P_False_P.
I found proving Classic
⇒ Peirce
a bit harder since we are no longer free to choose Q
to our convenience.
Classic : forall P : Prop, ((P -> False) -> False) -> P
P : Prop
Q : Prop
P_Q_P : (P -> Q) -> P
============================
 P
We first need to decide whether to start with Classic or with P_Q_P; both of them result in P and are fair game at first sight. If we try to use P_Q_P we’d have to somehow come up with a term for P -> Q, something that given a P gives a Q. But Q could be anything (pre-decided, out of our control)! This seems a great deal harder (if not impossible) than providing a term for (P -> False) -> False, which has no such problems.
The intuition for getting a (P -> False) -> False is this: we have a (term of type) P -> False and need to somehow transform this to (a term of type) False. We can do this if we can get a term of type P from somewhere (the constructive analog of conjuring up a classical contradiction). P_Q_P can be used to get a P, but we don’t know how to create Q. However, this time around, we have with us a term (a hypothesis) of type P -> False and since False -> Z for any Z, we can have P -> Z for any Z. In other words, inside P -> Q, we can first use the P (antecedent) on the P -> False we already have to create a False, and once we have False we can create anything (including Q) from it. Inside the (P -> False) -> False term, we first use the above technique to create a P and then use that P on the given P -> False to create a False.
(** From Classic to Peirce *)
intros Classic P Q P_Q_P.
assert ((P -> False) -> False) as P_Classic.
intros P_False.
assert (P -> Q) as P_Q.
intros Given_P.
apply P_False in Given_P.
inversion Given_P. (** Given_P is false *)
apply P_Q_P in P_Q. (** P_Q_P is now P *)
apply P_False in P_Q.
apply P_Q.
apply Classic in P_Classic. (** P_Classic is P now *)
apply P_Classic.
and this also proves peirce
⇔ classic
.
Qed.
classic
⇔ de_morgan_not_and_not
This is easier than the last proof. When proving classic
⇒ de_morgan_not_and_not
, Coq says
Classic : forall P : Prop, ((P -> False) -> False) -> P
P : Prop
Q : Prop
P_False_Q_False_False : (P -> False) /\ (Q -> False) -> False
P_Or_Q_False : P \/ Q -> False
============================
 False
The trick for this proof is that given P \/ Q -> False, we can separately make both P -> False and Q -> False by simply injecting the P or Q into P \/ Q and then invoking the P \/ Q -> False term on it.
assert ((P -> False) /\ (Q -> False)).
split.
(** P is False *)
intros.
apply or_introl with (Q:=Q) in H.
apply P_Or_Q_False in H.
apply H.
(** Q is False *)
intro.
apply or_intror with (P:=P) in H.
apply P_Or_Q_False in H.
apply H.
Once we have a (P -> False) /\ (Q -> False), getting a False is easy.
apply P_False_Q_False_False in H.
apply H.
de_morgan_not_and_not ⇒ classic is easy too:
De_Morgan : forall P Q : Prop, ((P -> False) /\ (Q -> False) -> False) -> P \/ Q
P : Prop
Classic : (P -> False) -> False
============================
 P
The trick here is to specialize De_Morgan
with both P
and Q
set to P
.
assert ((P -> False) /\ (P -> False) -> False).
intros.
inversion H as [P_False0 P_False1].
apply Classic in P_False0.
apply P_False0.
apply De_Morgan in H.
inversion H as [Ans_0 | Ans_1].
apply Ans_0.
apply Ans_1.
and we’re done proving classic ⇔ de_morgan_not_and_not.
Qed.
classic
⇔ excluded_middle
I spent the most time proving this one, or rather, in proving classic
⇒ excluded_middle
. After the usual introductions,
Classic : forall P : Prop, ((P -> False) -> False) -> P
P : Prop
============================
 P \/ (P -> False)
We could try to get a P out of Classic and then inject that P into P \/ (P -> False). However, this approach didn’t look like it’d work out: I don’t know if P is true (which is very close to what we’re trying to prove — even though we don’t know whether there is a term for P or not, either there is one for P or one for P -> False) and hence I should not be able to arbitrarily deduce the existence of a term of that type.
Since Classic
is universally quantified over P
, we apply
it and explore:
apply Classic. intros H.
gives
Classic : forall P : Prop, ((P -> False) -> False) -> P
P : Prop
H : P \/ (P -> False) -> False
============================
 False
We can’t use Classic here, since if we did, we’d have to instantiate P with False and have to provide a term of the type (False -> False) -> False. Given a False -> False, the only way to get a False is to provide a False to the antecedent (False -> False). But, if we had a way to get a False from somewhere (given a False -> False) we could use that directly (without going inside (False -> False) -> False) since it is trivial to create a False -> False term out of nowhere.
The only other possibility is to come up with a P \/ (P -> False) and use H on it.
assert ((P -> False) -> False) as P_False_False.
intro P_False.
gives
Classic : forall P : Prop, ((P -> False) -> False) -> P
P : Prop
H : P \/ (P -> False) -> False
P_False : P -> False
============================
 False
We’re finally at the point where things get obvious. We can now inject P_False into a P \/ (P -> False) and use that to get the False we need.
apply or_intror with (P:=P) in P_False.
apply H in P_False.
apply P_False.
gives
Classic : forall P : Prop, ((P -> False) -> False) -> P
P : Prop
H : P \/ (P -> False) -> False
P_False_False : (P -> False) -> False
============================
 False
This goal is slightly tricky, but we’re almost done. We use Classic
on P_False_False
to get a P
which we use on H
to get a False
.
apply Classic in P_False_False.
apply or_introl with (Q:=(P -> False)) in P_False_False.
apply H in P_False_False.
apply P_False_False.
excluded_middle
⇒ classic
is much easier:
intros H Prp Double_Neg. specialize H with Prp.
gives
Prp : Prop
H : Prp \/ (Prp -> False)
Double_Neg : (Prp -> False) -> False
============================
 Prp
H tells us that Prp is either true or false. In case it is true, the conclusion directly follows. In case it isn’t, we use Double_Neg to get a False, from which anything follows:
inversion H as [P | NegP].
(** P is true *)
apply P.
(** P is false *)
apply Double_Neg in NegP.
inversion NegP.
And this proves classic
⇔ excluded_middle
!
Qed.
excluded_middle
⇔ implies_to_or
This is probably the easiest one. After usual introductions,
Excluded_Middle : forall P : Prop, P \/ (P -> False)
P : Prop
Q : Prop
Implication : P -> Q
============================
 (P -> False) \/ Q
Excluded_Middle either gives us a term of type P, from which we can get a Q using Implication, or it gives us a term of type (P -> False). In either case, we have a way to construct (P -> False) \/ Q.
specialize Excluded_Middle with P.
inversion Excluded_Middle.
apply Implication in H.
right. apply H.
left. apply H.
Proving implies_to_or
⇒ excluded_middle
is easy too, after introductions:
Implies_To_Or : forall P Q : Prop, (P -> Q) -> (P -> False) \/ Q
P : Prop
============================
 P \/ (P -> False)
Note that Implies_To_Or
with P := P
and Q := P
proves our goal for (almost) free.
specialize Implies_To_Or with P P.
assert (P -> P).
intros. apply H.
apply Implies_To_Or in H.
inversion H as [HL | HR].
right. apply HL.
left. apply HR.
and this proves excluded_middle
⇔ implies_to_or
.
The source is up on GitHub.
Introducing TAL0
I’ll do my best to give an informal introduction to TAL0 right here — if you wish to read something more polished I suggest you consult the original source.
TAL0 aims to provide control flow safety to simple, imperative programs. Control flow safety implies that a program is allowed to jump only to well-defined points, labels, in the code. This is easy to verify if all jump instructions take a constant label as an argument — we only have to verify that the label exists. But we’d like to go further than that and be able to dynamically jump to a location held in a register. With this freedom, the problem becomes a little more interesting.
TAL0 is a toy assembly language. You have a finite number of registers, each of which can hold either an integer or a label. You have the following instructions:
jump label: jumps to label label.
reg0 ←r reg1: moves the value in reg1 to reg0.
reg ←l label: moves label label to the register reg.
reg ←n integer: moves integer integer to the register reg.
reg0 ← reg1 ⟨ op ⟩ reg2: performs operation op (anything with the type ℕ → ℕ → ℕ) on the integers stored in reg1 and reg2 and stores the result in reg0.
if reg0 jump reg1: jumps to the label stored in reg1 if the value in reg0 is not 0.
Informally, we’d like to design a type system that ensures that all the above operations are always well-defined. The type system should ensure that reg0 ← reg1 ⟨ op ⟩ reg2 doesn’t try to add 42 to a label, for example.
Labels are sequences of instructions. I have taken the liberty to constrain sequences to always end with a static jump instruction. I don’t think we lose any generality due to this.
The only mutable “state” in the system is the registers.
The type system
In essence, the approach described in the paper assigns a type to the register file (the array of registers) at each instruction.
A label’s type is a vector of types, one for each register. This label-type, which we’ll call Γ, encodes what the sequence of instructions expects the register file to look like on entry. If, for instance, Γ says that r0 contains a label and r1 an integer on entry, we know that if r1 jump r0 is safe if it is the first instruction in the sequence. We then morph Γ
as we go through the instructions in the sequence based on how each instruction affects the register file. For example, we know that after executing r0 ←n 42
, Γ[r0]
(the type at r0
in Γ
) is Nat
. In other words, we assign a (possibly different) type to the register file at each instruction. The type of each register (at a specific instruction) can either be Nat
, denoting that it contains an integer or block { r0 : t0, r1 : t1 ... }
, denoting that it contains a label to a block (or sequence) of instructions which, on entry, expects r0
to contain a value of type t0
, r1
to contain a value of type t1
and so on.
There is one subtlety, however. What type do you assign to this sequence of instructions?
if r1 jump r0
jump some_label
You see, going naively by the above scheme, the type r0 has at the point of the if_jump_ has to contain itself. A little more formally, if Γ is the type of the register file at the location of the if_jump_, then Γ[r0] = Γ. One obvious solution is to allow Γ to be an infinite type. Another approach (the one used in the paper) uses a polymorphic type and instantiates it with itself when required. I found this approach a little hard to implement in Agda. Specifically, I ran into issues with positivity of types, something I don’t fully understand, and decided to implement a simpler (but more limited) approach.
I use an anything type that denotes that a particular register can contain, really, anything. The type system is still sound since TAL0 can only manipulate values of this type opaquely — it can neither jump to registers of type anything nor can it do math on them. In the motivating example, we’d set Γ[r0] = label { r0 = anything, r1 = ... }. The type of r0 is a label, one which allows r0 to contain any value whatsoever. Since r0 can contain anything before the jump, it can very well contain a value of type label { r0 = anything, r1 = ... }. The crucial point is that we don’t really allow any kind of self-nesting or recursion here. I found it helpful to think about anything being ⊥ — the result of an infinite recursion.
The implementation
The first step is declaring Γ
, the type of a register file. This is a mutually recursive definition:
mutual
  Γ′ : ℕ → Set
  Γ′ n = Vec Type n

  Γ = Γ′ size

  data Type : Set where
    natType : Type
    blockType : Γ′ size → Type
    anything : Type
Here size
is a value of type ℕ
which is basically the number of registers we have. Secondly, we define a function safeToJump'
that checks if it is feasible to jump to the label contained in a specific register from the current point in the instruction stream:
safeToJump : {n : ℕ} → (to : Γ′ n) → (from : Γ′ n) → Set
safeToJump [] [] = ⊤
safeToJump (anything ∷ xs) (x' ∷ xs') = safeToJump xs xs'
safeToJump (x ∷ xs) (x' ∷ xs') = if (equalT x x') then (safeToJump xs xs') else ⊥

safeToJump' : Γ → Reg → Set
safeToJump' γ idx with lookup idx γ
... | natType = ⊥
... | blockType γ′ = safeToJump γ′ γ
... | anything = ⊥
equalT
is a trivial function that compares types for equality. safeToJump'
merely asserts that it is never safe to jump to an integer or an anything
. When it sees a block type it does a straightforward deep comparison, keeping in mind the anything
semantics discussed above. Returning ⊥
and ⊤
allows us to have an implicit argument of the type safeToJump' γ r
and have Agda fill in the value automatically for ⊤
and fail typechecking for ⊥
.
The meat of the implementation is a Block
datatype. A value of this type is also the proof of the consistency of the stream of instructions it represents. Each sequence is expressed as a discrete value and a program is a mutually recursive set of such values. For instance, here are two labels which just keep jumping to each other:
mutual
  α : Block (natType ∷ natType ∷ [])
  α = jump (♯ β)

  β : Block (natType ∷ natType ∷ [])
  β = jump (♯ α)
Agda’s mixfix syntax allows us to really go wild when defining Block
:
data Block : Γ → Set where
  jump : {γ₀ γ₁ : Γ} → ∞ (Block γ₀) → { _ : safeToJump γ₀ γ₁ } → Block γ₁
  _←r_▶_ : {γ : Γ} → (dest : Reg) → (src : Reg) →
           Block (γ [ dest ]≔ (lookup src γ)) → Block γ
  _←l_▶_ : {γ γ₀ : Γ} → (dest : Reg) → (src : ∞ (Block γ₀)) →
           Block (γ [ dest ]≔ (blockType γ₀)) → Block γ
  _←n_▶_ : {γ : Γ} → (dest : Reg) → (src : ℕ) →
           Block (γ [ dest ]≔ natType) → Block γ
  _←_⟨_⟩_▶_ : {γ : Γ} → (dest : Reg) → (op₀ : Reg) → (fn : ℕ → ℕ → ℕ) → (op₁ : Reg) →
              ⦃ _ : lookup op₀ γ ≡ natType ⦄ → ⦃ _ : lookup op₁ γ ≡ natType ⦄ →
              Block (γ [ dest ]≔ natType) → Block γ
  if_jump_▶ : {γ : Γ} → (src : Reg) → (dest : Reg) →
              ⦃ _ : lookup src γ ≡ natType ⦄ → { _ : safeToJump' γ dest } →
              Block γ → Block γ
Coinduction (the ∞
) is needed to allow, for instance, jump instructions to the same block. We want to be able to type b = jump (♯ b)
. The ⦃ ... ⦄
is Agda syntax for instance arguments.
We build the blocks bottom up. Every instruction (other than jump
which is always the last instruction) takes a block of type Γ₀
and produces a block of type Γ₁
, where running the instruction will change a register file of type Γ₁
to a register file of type Γ₀
.
The individual rules are rather simple and I will annotate only one of them: the _←_⟨_⟩_▶_ rule says that we expect both op₀ and op₁ to contain integers and that the rest of the block will see an integer at the destination register. It results in a block that need not contain an integer at the destination register. The approach may look a little mysterious (and ass-backwards), but Agda is perfectly capable of doing this kind of type inference if it knows the expected type of the resultant Block. It isn’t hard to see how — given γ, it is always possible to compute γ [ dest ]≔ natType.
And yeah, this is about it! The full code (with a few trivial additions, like the equalT
function) is available at https://github.com/sanjoy/Snippets/blob/master/TAL0.agda
Programming in Agda is awesome to the point of being unfair.
The Cliffhanger
The last episode ended with a question on whether it is always possible to have a consistent morphism from (Int, intAlgebra)
to an arbitrary algebra. The answer, it turns out, is no.
This can be illustrated with a very important counterexample, the algebra ([Int], intListAlgebra)
with
intListAlgebra :: Func [Int] -> [Int]
intListAlgebra E = []
intListAlgebra (Func x xs) = x:xs
The limitation is actually more general — no Func
algebra of the form (Int, coerce)
can have a function to ([Int], intListAlgebra)
. For an informal proof, think about this (we’ll talk in the terminology introduced in the last post):
Consider the case when the initial Func Int
is of the form Func a b
. The first path morphs Func a b
to frob (coerce (Func a b))
while the second path maps it to a:frob b
. Let the datatype Int
be able to represent N unique values and let frob
map these N unique values to M unique values. In other words, let the cardinality of frob
‘s range be M. Then the number of possible values of frob (coerce (Func a b))
is at most M while the number of possible values of a:frob b
is N * M. Therefore the two paths can not represent the same mapping — their ranges are different.
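To see the argument fail on a concrete case, here is a sketch that replays it with the isArrow checker from the last post, reproduced below so the snippet is self-contained; frob n = [n] is one hypothetical candidate, but the counting argument says any frob fails somewhere:

```haskell
-- Self-contained sketch. isArrow is the commuting-square checker from
-- the last post; frob is a hypothetical candidate morphism.
data Func x = E | Func Int x

instance Functor Func where
  fmap _ E          = E
  fmap f (Func i x) = Func i (f x)

intAlgebra :: Func Int -> Int      -- one possible (Int, coerce)
intAlgebra E          = 0
intAlgebra (Func i _) = i

intListAlgebra :: Func [Int] -> [Int]
intListAlgebra E           = []
intListAlgebra (Func x xs) = x : xs

frob :: Int -> [Int]               -- hypothetical candidate arrow
frob n = [n]

-- True iff both paths around the square agree on v
isArrow :: Eq b => (Func a -> a) -> (Func b -> b) -> (a -> b) -> Func a -> Bool
isArrow algA algB arrow v = arrow (algA v) == algB (fmap arrow v)
```

For v = Func 1 2 the first path gives frob 1 = [1] while the second gives 1 : frob 2 = [1, 2], so isArrow intAlgebra intListAlgebra frob (Func 1 2) is False.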
Initial Algebras
However, the ([Int], intListAlgebra)
algebra is special in a way — we can always find a function from ([Int], intListAlgebra)
to any arbitrary Func
algebra. In Haskell, we should be able to implement this function:
createArrow :: (Func c -> c) -> [Int] -> c
The createArrow function is surprisingly easy to create. As earlier, we proceed with a case analysis of the Func [Int] we have to start with. The function we’re looking for is f :: [Int] -> c.
When the initial value is E
, we require (f (intListAlgebra E)) = (algC ((fmap f) E))
. Using the definition of fmap
and intListAlgebra
(which we already know), we get f [] = algC E
.
When the initial value is Func x y, we require (f (intListAlgebra (Func x y))) = (algC ((fmap f) (Func x y))). This simplifies to (again, using the known, fixed definitions) f (x:y) = algC (Func x (f y)).
The above two observations have just given us a nice, recursive, definition of f
! The definition of createArrow
automatically follows:
createArrow algC [] = algC E
createArrow algC (x:xs) = algC (Func x (createArrow algC xs))
Moreover, it looks a lot like a foldr
:
createArrow algC xs = foldr (\a b -> algC (Func a b)) (algC E) xs
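As a quick check, we can instantiate algC with a summing algebra (sumAlgebra is my name for it, not something from the post):

```haskell
data Func x = E | Func Int x

createArrow :: (Func c -> c) -> [Int] -> c
createArrow algC []     = algC E
createArrow algC (x:xs) = algC (Func x (createArrow algC xs))

-- Hypothetical sample algebra: add this layer's Int to the
-- already-folded result for the rest of the list.
sumAlgebra :: Func Int -> Int
sumAlgebra E          = 0
sumAlgebra (Func i x) = i + x
```

createArrow sumAlgebra [1, 2, 3] computes 1 + (2 + (3 + 0)) = 6, exactly what the foldr form gives.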
Such F-algebras, which have mappings to all other F-algebras, are called initial algebras, and ([Int], intListAlgebra) is one. It also is a terminal coalgebra (a coalgebra to which every other coalgebra has a function) but we won’t talk about those now. This dualism exists because Haskell doesn’t differentiate between inductive and coinductive data structures. I wish it did.
Lambek’s Lemma
Lambek’s lemma states that if a functor has an initial algebra, the initial algebra is a fixed point of the functor. Above, we can see that [Int] is, indeed, a fixed point of Func. Alternatively, it says the (Func [Int] -> [Int]) function we came up with is an isomorphism — it has an inverse; which is also something apparent from the above example.
Why I find F-algebras interesting
I decided to read up a little bit on this topic when I realized that I could use arrows between F-algebras to represent correctness-preserving compilations. For instance, consider a simple evaluator for arithmetic expressions:
data Arith a = Add a a | Sub a a | If a a a | Atom Int
-- Define fmap the usual way

data StackLang = PushSL Int | AddSL | SubSL | IfSL

stackAlg :: Arith [StackLang] -> [StackLang]
stackAlg (Add a b) = AddSL:(a ++ b)
stackAlg (Sub a b) = SubSL:(a ++ b)
stackAlg (If a b c) = IfSL:(a ++ b ++ c)
stackAlg (Atom a) = [PushSL a]

runStack :: [StackLang] -> Int
runStack stack = head $ horse stack []
  where horse (PushSL a:r) s = horse r (a:s)
        horse (AddSL:r) (x:y:s) = horse r (x + y:s)
        horse (SubSL:r) (x:y:s) = horse r (x - y:s)
        horse (IfSL:r) (x:t:f:s) = horse r ((if x /= 0 then t else f):s)
        horse [] s = s

eval :: Arith Int -> Int
eval a = ... -- usual
Now, if we can show that runStack is an arrow from (Arith [StackLang], stackAlg) to (Int, eval), we will have proved that one step of compiling a top-level expression to a stack-machine language preserves semantics. By structural induction, we can then prove the whole compilation process semantics-preserving.
We know that we always have an arrow (rather, a unique arrow) from the fixed point of Arith (finite expressions in this toy expression language) to any arbitrary algebra. Specifically, the arrow to (Int, eval) should tell us something about how we recursively combine results of evaluating subtrees into a value.
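Here is a sketch of checking that one-step square on a sample value. One assumption of mine that is not from the post: since stackAlg emits the operator before its operands, the sketch executes the compiled instruction list in reverse.

```haskell
-- Sketch: check runStack (stackAlg v) == eval (fmap runStack v) for one v.
data Arith a = Add a a | Sub a a | If a a a | Atom Int

instance Functor Arith where
  fmap f (Add a b)  = Add (f a) (f b)
  fmap f (Sub a b)  = Sub (f a) (f b)
  fmap f (If a b c) = If (f a) (f b) (f c)
  fmap _ (Atom n)   = Atom n

data StackLang = PushSL Int | AddSL | SubSL | IfSL

stackAlg :: Arith [StackLang] -> [StackLang]
stackAlg (Add a b)  = AddSL : (a ++ b)
stackAlg (Sub a b)  = SubSL : (a ++ b)
stackAlg (If a b c) = IfSL : (a ++ b ++ c)
stackAlg (Atom a)   = [PushSL a]

runStack :: [StackLang] -> Int
runStack stack = head (horse (reverse stack) [])  -- reverse: my assumption
  where
    horse (PushSL a : r) s           = horse r (a : s)
    horse (AddSL : r)    (x : y : s) = horse r (x + y : s)
    horse (SubSL : r)    (x : y : s) = horse r (x - y : s)
    horse (IfSL : r) (x : t : f : s) = horse r ((if x /= 0 then t else f) : s)
    horse [] s                       = s

eval :: Arith Int -> Int
eval (Add a b)  = a + b
eval (Sub a b)  = a - b
eval (If a b c) = if a /= 0 then b else c
eval (Atom n)   = n

-- One step of the square: compile-then-run vs run-subtrees-then-eval
commutes :: Arith [StackLang] -> Bool
commutes v = runStack (stackAlg v) == eval (fmap runStack v)

sample :: Arith [StackLang]
sample = Sub [PushSL 10] (stackAlg (Add [PushSL 1] [PushSL 2]))
```

commutes sample holds; both paths evaluate 10 - (1 + 2) to 7.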
I’m still quite new to all this, but, so far, Category Theory looks rather interesting and I’m probably going to spend some more time on it.
This post is light on math and uses Haskell to express formalisms. If you’re clear on what a Functor
is, you’re good to go.
Functors and Algebras
There is this regular old functor, with the type constructor Func
. We’ve made it an instance of Functor
with an appropriate fmap
following the functor laws.
An F-algebra is a tuple consisting of a Haskell type A (i.e. something with kind *) and a function f :: (Func A) -> A.
To make things concrete:
import Data.Char

data Func x = E | Func Int x deriving(Show)

instance Functor Func where
  fmap f (Func i x) = Func i (f x)
  fmap _ E = E

charAlgebra :: Func Char -> Char
charAlgebra E = chr 0
charAlgebra (Func _ x) = x
We have an algebra now: (Char, charAlgebra)
. Algebras are specific to the functor they operate on, so we call the above a Func
algebra. We can have a different Func
algebra, (Char, charAlgebra2)
using a different function from Func Char
to Char
:
charAlgebra2 :: Func Char -> Char
charAlgebra2 E = chr 0
charAlgebra2 (Func i _) = chr i
We will play with both of these.
Morphisms (arrows) between Algebras
We can also have morphisms that transform one Func
algebra to another. A morphism from Func
algebra (A, f)
to (B, g)
is a function morph :: A -> B
with some special constraints.
You see, given a Func algebra (A, f), the values we have to deal with are either of the type A or Func A. The constraint on morph tells us that the two ways we have of translating a value of type Func A to a value of type B are equivalent. Concretely, say we have a value v :: Func A. There are two ways we can get a value of type B out of it:
Reduce v to v' :: A using f :: Func A -> A, then convert v' :: A to v'' :: B using morph.
Map v to v' :: Func B using fmap morph, then reduce v' :: Func B to v'' :: B using g.
The restriction is that these two methods must give us the same v'' :: B. If morph satisfies this, we say that morph is a morphism between (A, f) and (B, g). It can be succinctly formalized by saying that the following diagram commutes in the Hask category. Even if you don’t speak Category Theory, the meaning of the diagram should be quite apparent
In this post (and later ones too) we’ll refer to mapping from Func A to A to B as the first path and mapping Func A to Func B to B as the second path.
In Haskell, the restriction implies that isArrow
is true for all values of sampleA
:
isArrow :: (Eq b) => (Func a -> a) -> (Func b -> b) -> (a -> b) -> (Func a) -> Bool
isArrow algA algB arrow sampleA =
  (arrow (algA sampleA)) == algB ((fmap arrow) sampleA)
isArrow
, while not a proof, will come in handy for quickly generating counterexamples.
To play with isArrow
, we first create two more Func
algebras, (Int, intAlgebra)
and (Int, intAlgebra2)
:
intAlgebra :: Func Int -> Int
intAlgebra E = 0
intAlgebra (Func i _) = i

intAlgebra2 :: Func Int -> Int
intAlgebra2 E = 0
intAlgebra2 (Func _ i) = i
and some values of type Func Int
:
intSamples = [E, Func 0 42, Func 42 0]
and then try to rule out some possibilities. We use the chr :: Int -> Char function as the morph function here. The choice of chr is only incidental — it just seems to be the most canonical way to convert an Int to a Char.
isArrow0 = isArrow intAlgebra charAlgebra chr
values0 = map isArrow0 intSamples

isArrow1 = isArrow intAlgebra2 charAlgebra chr
values1 = map isArrow1 intSamples

isArrow2 = isArrow intAlgebra charAlgebra2 chr
values2 = map isArrow2 intSamples

isArrow3 = isArrow intAlgebra2 charAlgebra2 chr
values3 = map isArrow3 intSamples
Out of the four values{0-3}, only values1 and values2 are True for all inputs. Therefore, they are the only ones that warrant our time.
values1
tells us that it is possible that chr
is a function from (Int, intAlgebra2)
to (Char, charAlgebra)
. Let’s see if we can convince ourselves that this is actually the case:
We start with a value v :: Func Int
. If v
is E
then the conclusion (the restriction on chr
) is trivially true. The only other possibility is that v
is of the form Func (i::Int) (x::Int)
. In the first path, we apply intAlgebra2
, which reduces v
to x
, after which we apply chr
on it, making the end result chr x
. In the second path we first apply fmap chr
on v
, which gives us Func i (chr x)
. Now, charAlgebra
simply picks up the second field, and the end result is chr x
again; same in both the paths. This proves that chr
is indeed a morphism from (Int, intAlgebra2)
to (Char, charAlgebra)
.
A similar argument can be made for isArrow2
, I think.
We have seen that chr
can’t be an arrow from intAlgebra
to charAlgebra
. In fact there is an arrow (and only one arrow) from intAlgebra
to charAlgebra
. To find it, assume that the arrow is arr
, and observe that in the case v = Func i x
, the first path first maps v
to i
and then to arr i
. The second path first maps it to Func i (arr x)
and then to arr x
. The only way arr i
can be equal to arr x
for all values of i
and x
is if arr
doesn’t depend on its argument at all! Thus we establish that arr
is of the form \_ -> constantChar.
We get the value of constantChar
by setting v
to E
. The first path first maps E
to 0
and then to arr 0 = constantChar
. The second path first maps E
to E
and then E
to chr 0
. Thus, the value of constantChar
is chr 0
.
Is it always possible, perhaps with some effort, to define an arrow from (Int, intAlgebra)
to some arbitrary algebra? We’ll explore this and more in the next post(s).
A few days ago, Paul Snively published a post which set me thinking on how Agda (which is the closest thing to Coq I have some knowledge of) fares in similar scenarios. Pretty well, it turns out.
I’ll start by solving the problem Paul solves in his post, which is also the first problem in the book Logical Labyrinths by Raymond M. Smullyan:
Context: Abercrombie is on a weird island where all knaves lie and all knights are truthful.
On the day of his arrival, Abercrombie came across three inhabitants, whom we will call A, B and C. He asked A: “Are you a knight or a knave?” A answered, but so indistinctly that Abercrombie could not understand what he said. He then asked B: “What did he say?” B replied: “He said that he is a knave.” At this point, C piped up and said: “Don’t believe that; it’s a lie!”. Was C a knight or a knave?
We begin by defining ≡
(equality) and ⊥
(a type with no values):
data _≡_ {A : Set} (a : A) : A → Set where
  refl : a ≡ a

data ⊥ : Set where
Now a Person
can either be a knight
or knave
. We model that
data Person : Set where
  knight : Person
  knave : Person
and the fact that a knight always speaks the truth and that a knave always lies
says : Person → Set → Set
says knight p = p
says knave p = p → ⊥
The intuition behind says is that says p predicate always gives us a predicate that is true. This is actually enough to represent and solve the first problem — we represent the problem as a type such that a value of that type is the solution we need:
data Solution₀ : Set where
  soln₀ : (a : Person) → (b : Person) → (c : Person) →
          (says b (says a (a ≡ knave))) →
          (says c (b ≡ knave)) →
          Solution₀
We encode the constraints of the problem in a very straightforward way; it is possible to construct a value of type Solution₀ if and only if we have proofs for (says b (says a (a ≡ knave))) and (says c (b ≡ knave)). We now ask Agda to search for a solution:
solution₀ : Solution₀
solution₀ = ?
and Agda fills in the ?:
solution₀ = soln₀ knight knave knight (λ ()) refl
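As a sanity check outside of Agda, the same puzzle can be brute-forced in a plain Boolean model (a sketch, not part of the Agda development; the helper says and the list solutions are my own names): knights can assert exactly the true claims, knaves exactly the false ones.

```python
from itertools import product

KNIGHT, KNAVE = "knight", "knave"

def says(person, claim):
    # A knight asserts exactly the true claims, a knave exactly the false ones.
    return claim if person == KNIGHT else not claim

# Enumerate all assignments consistent with the two statements in the puzzle.
solutions = [
    (a, b, c)
    for a, b, c in product([KNIGHT, KNAVE], repeat=3)
    if says(b, says(a, a == KNAVE))  # B: "A said that he is a knave"
    and says(c, b == KNAVE)          # C: "don't believe that; it's a lie!"
]

print(solutions)
```

Every consistent assignment makes C a knight (and B a knave), matching the term Agda found; A’s type is unconstrained by the statements, which is why Agda was free to pick knight for it.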
As cool as this undoubtedly is, for this particular problem the solution is the proof of the solution, and that isn’t as involved as I’d like it to be. Let’s look at something deeper (problem 1.11 from the same book).
Suppose that you visit the Island of Knights and Knaves because you have heard a rumor that there is gold buried there. You meet a native and you wish to find out from him whether there really is gold there, but you don’t know whether he is a knight or a knave. You are allowed to ask him only one question answerable by yes or no. What question would you ask?
The answer, of course, is to ask “if I were to ask you whether there is gold buried somewhere here, what would you say?”. The knights answer truthfully and the knaves end up answering truthfully since they lie about a lie and hence negate a negation. Is Agda smart enough to figure this out?
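Agda aside, this trick can be checked mechanically in a Boolean model (a sketch; says is my own helper, with knights asserting exactly the true claims and knaves exactly the false ones): the answer to the nested question always equals the underlying fact.

```python
def says(person, claim):
    # A knight asserts exactly the true claims, a knave exactly the false ones.
    return claim if person == "knight" else not claim

# "If I were to ask you whether there is gold, what would you say?"
# A knave lies about his own lie, so the nested answer is the truth.
for gold in (True, False):
    for person in ("knight", "knave"):
        assert says(person, says(person, gold)) == gold

print("the nested question always reveals the truth")
```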
A question to which only two answers are possible is basically a predicate and can be represented by a Set. However, since we’d like to be able to ask questions containing the phrase “you”, or even questions like “if you are a knight then is the sky red, else are you a knave?”, we represent questions as
Prp : Set₁
Prp = Person → Set
We also declare some standard ammo

record _∧_ (A : Set) (B : Set) : Set where
  constructor _,_
  field
    proj₁ : A
    proj₂ : B

prj₁ : {A : Set} {B : Set} → A ∧ B → A
prj₁ = _∧_.proj₁

prj₂ : {A : Set} {B : Set} → A ∧ B → B
prj₂ = _∧_.proj₂

_⇔_ : Set → Set → Set
_⇔_ a b = (a → b) ∧ (b → a)
to represent logical conjunction and if-and-only-if. With these, we are ready to represent our problem. We represent a slightly more general version of the problem: “is gold buried in your island” is really an arbitrary predicate, and we model it as such
data PredicateTransform : Set₁ where
  predicateTrans : (f : Prp → Prp) →
                   ((OldP : Prp) → (p : Person) → (OldP p) ⇔ (says p ((f OldP) p))) →
                   PredicateTransform
The way to construct a PredicateTransform value is by providing a function that maps a predicate to a new one, along with a proof that the old predicate is equivalent to an arbitrary Person asserting the new one. So, in the specific case of our question, the transform would hopefully map “is gold buried in your island” to a knighthood/knavehood-agnostic predicate and provide a proof of the equivalence. We try our old trick again
soln : PredicateTransform
soln = ?
which unfortunately doesn’t work this time. The reason it doesn’t is fundamental: remember how, for knaves, our solution would work by eliminating a double negation? You can’t do that in constructive logic (and hence in Agda). But all is not lost; we have the mighty ? to the rescue:

elimdoubleneg : {A : Set} → ((A → ⊥) → ⊥) → A
elimdoubleneg = ?
We can’t provide a body for elimdoubleneg
but that doesn’t stop us from keeping it around and using it to prove things. We just won’t get a complete proof, since such a proof lies outside the realm of constructive logic.
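For contrast, in a classical prover the missing body is a one-liner. A sketch in Lean 4, using core’s Classical.byContradiction (the theorem name elimDoubleNeg is mine):

```lean
-- Double-negation elimination: not provable constructively,
-- but immediate once classical reasoning is allowed.
theorem elimDoubleNeg {A : Prop} (h : (A → False) → False) : A :=
  Classical.byContradiction h
```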
The “solution” we came up with basically transforms the predicate “P” to “Would you assert the truth of predicate P?”. Succinctly,
soln₁ : PredicateTransform
soln₁ = predicateTrans f ?
  where f : Prp → Prp
        f q p = says p (q p)
where ? is something we have to (interactively) fill in. We proceed in the usual, high-school way: proving both implications separately and putting them together to prove the iff:
soln₁ : PredicateTransform
soln₁ = predicateTrans f proof
  where f : Prp → Prp
        f q p = says p (q p)

        proof₀ : (A : Prp) → (p : Person) → (A p) → (says p ((f A) p))
        proof₀ = ?

        proof₁ : (A : Prp) → (p : Person) → (says p ((f A) p)) → (A p)
        proof₁ = ?

        proof : (A : Prp) → (p : Person) → (A p) ⇔ (says p ((f A) p))
        proof A p = proof₀ A p , proof₁ A p
Note that a person can either be a knight or a knave. We tell Agda this and ask it to search for the individual proof terms:
proof₀ : (A : Prp) → (p : Person) → (A p) → (says p ((f A) p))
proof₀ A knight prf = prf
proof₀ A knave  prf = λ z → z prf

proof₁ : (A : Prp) → (p : Person) → (says p ((f A) p)) → (A p)
proof₁ A knight prf = prf
proof₁ A knave  prf = ?
The only place Agda gets stuck is the place where it would have to use elimdoubleneg. The type of prf here is (A knave → ⊥) → ⊥ (in emacs, hit C-c C-d to see it) and the type of the hole is A knave. We can use elimdoubleneg here

proof₁ A knave prf = elimdoubleneg prf
and we’re done!
Another interesting solution to the problem is transforming the predicate “P” to “Is P true iff you are a knight?”:
soln₂ : PredicateTransform
soln₂ = predicateTrans f ?
  where f : Prp → Prp
        f q p = (p ≡ knight) ⇔ (q p)
I leave this as an exercise.
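As a quick Boolean sanity check of this variant (a sketch only; the exercise proper is the Agda proof), one can verify that asking “is P true iff you are a knight?” also always yields the truth:

```python
def says(person, claim):
    # A knight asserts exactly the true claims, a knave exactly the false ones.
    return claim if person == "knight" else not claim

# Question: "is P true iff you are a knight?"
for p_holds in (True, False):
    for person in ("knight", "knave"):
        question = (person == "knight") == p_holds
        assert says(person, question) == p_holds

print("the iff-question also reveals the truth")
```

A knave faces a question whose truth value is the negation of P, and lying about it hands back P itself.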
The full source (and the solution to the “exercise”) is here. If you’ve made it this far please do leave a comment, especially if you think I’m mistaken somewhere.