Data Representation

In order to not allocate everything on the heap, I've been making some changes to the way the compiler handles data flow. It originally passed around values that indicated the "result" of the expression that a recursive call had compiled. For example 1 is compiled to something like lit 1, a variable is compiled to whatever that variable is bound to in the environment, and most operations store their result in a virtual register, say vreg #12, and return that.

It turned out to be a relatively simple change to have these function return lists of values instead. For example, something returning a pair will (usually) return a list with two values in it, each of which might be literals, virtual registers, or even memory addresses. Compiling unit (()) returns the empty list. Thus, the pair ((), 4) returns a list with just the literal 4, and ((), ()) returns the empty list.

When outputting high-level instructions (the format spit out by the compiler), these lists are flattened again. For example, integer addition is always applied to integers, meaning that there will be only one value in the argument lists, and the operands of the instruction can be single values. Similarly, if a function takes two pairs of integers as arguments, they just look like four regular arguments again at this point. The fact that they used to be grouped into structured values is completely lost again, and of no concern to the code generator.

Alongside of these value representations, the compiler uses type representations, which are used by abstract operations that need knowledge about a type. They handle things like destructuring/dereferencing — taking a value out of a pair means picking out the right value(s) from the list that represented the pair. Destructuring a heap value means taking the first (and only) value in the list representation of the value, and using that as a pointer to the start of the memory that holds the inner values. It was quite easy to code up a system where types contain in each other in a straight-forward way, so that an allocated value containing a pair, for example, directly contains the two elements of the pair, and returns these when dereferenced.

One hairy problem was determining which values are represented "directly", and which values are allocated on the heap. I'm using a simple heuristic, where the total size of the values held determines whether the type is direct or not. Currently the maximum size is two values, but I'll run some benchmarks later to see whether three (or even four) works better. Having this heuristic in place, no special code was needed for pairs, enum types, or even the unit type: Everything automatically gets the proper representation based on the size of its contents.

To determine the size of a type, one obviously has to know the size of its components. If a type happens to include itself as a component (for example, lists), it not only can't be represented directly, but the naive approach to finding out its size will recurse endlessly. Since a type might contain itself indirectly, it's rather hard to predict this.

The solution for that turned out to be to have types initialise themselves (which included constructing their subtypes) lazily. When starting this initialisation process, they set a flag, and when initialising a type with this flag set (meaning it's already being initialised further up the stack), it immediately determines that it won't have a direct representation, and can thus return its size as "a single pointer".