Part 6 of the 'implementation of Hob' saga.
« Part 5 | Index | Part 7 »

Register allocation

Register allocation is the process of finding a suitable processor register (or stack location) for every value in a program. If you're completely unfamiliar with machine-level programming (and want to do something about that), I very much recommend this book.

(In this series — at least for now — I'll only be compiling to x86-64 assembly. That's a quite common processor type these days, and a little less painful to program than its 32-bit father.)

One can put as much effort as one wants into register allocation. Finding a very bad solution (just put everything on the stack right away) is very easy and results in rather ugly, big, and slow assembly code. On the other hand, finding a completely optimal solution is an NP-hard problem, and thus not something one solves before breakfast on a Sunday morning. In fact, the large gap between the scoping entry and the next one is there because I was stumped coming up with an elegant register allocation algorithm. (When faced with an intimidating problem in a hobby project, I tend to avoid the whole project.)

Closely related to register allocation are calling conventions. How do arguments get passed to functions, and what is the effect of a function call on the registers? For example, on x86 processors, C functions typically take their arguments on the stack, clobber the registers, and put their result in the %eax register. This means that the register allocator has to save any values that it will still need after the call on the stack.

The x86-64 architecture, having much more registers than plain x86, opens up a whole world of possible calling conventions. Because of the functional nature of Hob, I set out to make calls as cheap as possible. This means we'd like to avoid putting arguments on the stack, but also don't want to lose the contents of all our registers on every call.

(For those completely new to the x86-64 architecture, see this summary of the differences from x86.)

What I came up with (and, as mentioned before, I'm too new to this to really know what I'm doing) is this: The first eight arguments to functions are passed in the r8 to r15 registers, in reverse order, but always using the lower numbers (for example, three arguments are passed in r10, r9, r8). The reverse order is there because I want to specify function arguments to be evaluated left-to-right in this language. Functions return their return value in r8, and since the last argument is likely to be the result of a function call, that'll at least be in the right place already. These registers (even the ones not used for arguments) can be clobbered by the callee, and the caller must assume they are lost after a call. The other general-purpose registers are supposed to be saved on the stack and restored by every function that uses them, so that callers can assume them to still hold their value after a call.

Having callees save registers seems like a good strategy. A function that calls a lot of other functions will profit because it doesn't have to save registers multiple times, only once. A function that doesn't do much calling will not be penalised much because it's probably a small function that doesn't need that many registers.

The allocation algorithm I ended up using is absolutely not elegant, in fact it's kind of a mess, but it seems to do a good job and is not messy to the point where it really confuses me. I'll probably have to significantly revise it once I introduce garbage collection, but for what it's worth, it produces compilable and working programs, and I'm proud of that. Because the language is still missing a lot of constructs the test programs aren't very deep yet, but here's a cute one:

def true a b -> a
def false a b -> b
def if c then else -> c then else

def check x -> true

if (check 10) (if false 1 2) 3

It uses Church booleans to evaluate a simple conditional. It compiles, and properly returns 2.

The code generator, of which the register allocator is a part (see codegen.lisp), takes a block of instructions (see the previous entry) as input, and returns a list of x86-64 instructions that can be written to a assembly file (I'm using as for my assembler for now).

To do this, it makes two passes over the block. In the first pass, it records the lifetime (start and end instruction) of every virtual register in the block, and the location of function calls. It also looks for opportunities to 'chain' virtual registers. That is, if an instruction producing virtual register B takes as input the last occurrence of virtual register A, it records this information because it will probably be practical to re-use whatever real register A occupies to store B. There are two reasons this chaining is done in this extra pass, rather than during the real code generation. Firstly, registers are often involved in more than one chain, and it pays to pick the longest one. Secondly, in picking the register to use for a chain, it can be useful to know that the last virtual register of the chain has to end up in some specific location, because it is the argument to a function or is returned from the block.

The actual allocation of registers is done using two different strategies. Virtual registers that do not live beyond the next function call are, as much as possible, put into argument registers (which the callee clobbers anyway), using chaining to try and minimise the amount of useless move instructions. When no argument registers are available, or when a virtual register has to live beyond the next function call, the normal algorithm, which uses the callee-saved registers and the stack, comes into effect. It always prefers registers, since those are faster and result in smaller code. But when no registers are available, stack space is allocated. (Basically the 'linear scan' algorithm.) The stack pointer is not adjusted every time stack space is required, but the minimum amount of space needed is recorded, and when greater than zero, the code generator adds instructions to adjust the stack pointer before and after the block.

The compiler calls the code generator for every block it produces, and puts these blocks into a file, together with a simple entry _start point. All code at the top-level is put into a function called main, and _start simply calls this function and then makes an 'exit' system call using the result. This file is then run through as and ld to create an executable. All rather primitive, but it works. When it doesn't, I run gdb on the executable and try to figure out what the problem is.

Many functions compile to pleasantly simple assembly code. For example, the identity function (fn x -> x) becomes:

id_1: ret

... since both the first argument and the return value are in register r8. And this (pointless) function ...

def call_quadruple f x ->
  let x2 = x + x
    f (x2 + x2)

... becomes ...

call_quadruple_1:
  addl %r8d, %r8d
  addl %r8d, %r8d
  jmp *%r9

Note that it uses jmp, rather than call, because this is a tail call.

Part 6 of the 'implementation of Hob' saga.
« Part 5 | Index | Part 7 »

© Marijn Haverbeke, created March 25, 2009, last modified on November 27, 2009