Compiling pattern matches

Pattern matching on algebraic datatypes is one of the main reasons ML or Haskell programs are so wonderfully succinct. In Hob, it looks something like this:

type Bool = True, False

def and
  False _ -> False
  True x -> x

match (and True False)
  True -> 1
  False -> 0

The first line defines an algebraic type, in this case an enum-like boolean type. Function definitions can provide several args -> body forms, with are pattern-matched against the arguments to pick a body to execute. The definition of and above returns False when the first argument is False, and returns its second argument otherwise.

match can be used to pattern-match in other situations. Here, a single value rather than a list of arguments is matched against.

Algebraic types can, of course, hold other values, as in this definition of a pair type:

type Pair = Pair Int32 Int32

let x = Pair 1 2
def second (Pair _ s) -> s
second x

(This pair can only hold integers — polymorphic datatypes haven't been implemented yet.) These matched patterns can be nested arbitrarily deep. When a match fails, an error is raised.

To compile these, the patterns have to somehow be compiled down to a bunch of tests and branches (or, even better, jump tables). This sounded very easy, but is actually quite involved.

When multiple values (for example function arguments) are being matched, it can make a difference for the amount of tests that have to be performed which one is matched first. So I'm using a simple heuristic to pick a value — each (necessary) test splits the set of possible branches into a number of (overlapping) sets. As an example, let us take this contrived program:

type Performance = Good, Average, Bad, Awful

def score
  Good  Good    -> 1
  Good  Average -> 2
  Good  _       -> 3
  _     Good    -> 4
  Awful Awful   -> 6
  _     _       -> 5

If we split on the first argument, we get five cases under Good, and three under Awful (note that cases that have either a variable name or the _ wildcard are included in both groups), as well two cases under 'other'. For the second argument, this would be four for Good, three for Awful, and three for Average. We pick the one where the maximum group size is smallest, in this case the second argument.

Having picked an argument to match on, we gather all the values that are present in the patterns it matches against (Good, Average, and Awful in this case), and recurse for each of these, as well as the fall-through case. How does that look in terms of data-transformations?

The patterns are kept as a list of (pattern array, body) pairs. 'Choosing' a branch means picking the patterns that match out of this list, and removing the used patterns from the arrays. For example:

([Good, Good], 1)
([Good, Average], 2)                 ([Good], 2)
([Good, _], 3)    == pick Average => ([Good], 1)
([_, Good], 4)         on arg 2      ([_], 5)
([Awful, Awful], 6)
([_, _], 5)

When matching against a 'composite' pattern, one that contains sub-patterns, the matched-against pattern is not simply removed from the vector, but is replaced by its arguments.

([Pair 1 _, Good], 1)  =====> ([1, _, Good], 1)
([Pair _ 2, Awful], 2)        ([_, 2, Awful], 2)

Using this technique, a match can be compiled down to a tree of simple choices — each time it branches, it selects a value to branch on (for example, the enum tags of the Performance datatype), and points to a new tree for every value that occurs, and optionally (when these values don't exhaustively cover the type) one for the 'other' case.

As soon as the 'top' pattern / body pair of a subtree always matches (only contains variables or underscores), we've reached a leaf. Leaves in this tree hold the actual branches, along with some information about the variables bound for them. For example, in the Pair example, the match-compiler has to note that s, in the body of second is bound to the first field of the first argument. From a tree like this, compiling to actual instructions is relatively simple...

Except for one thing: The 'branch picking' process shown above (there's a better name for this, but I lost the paper in which I found it) is not exclusive. A branch with variable or wildcard patterns will often end up in multiple sub-trees. It is possible to just compile it multiple times, of course, but that'll bloat up some kinds of code beyond reason. So we have to be careful, when compiling, to just jump to an already-compiled instance of the branch if one exists.

The tricky part is that, if the branch's patterns bind variables, those have to be compiled in such a way that they end up in the same virtual registers. Nested composite values will often be dereferenced and stored in a virtual register during the matching, and the path used to reach a branch's body can be wildly different. I solved this by noting that vregs can 'merge' harmlessly if they are guaranteed to hold the same value, even if they are assigned in two different places, and am using a scheme where, when dereferencing a value, I check whether any other branch dereferenced it before, and if so, re-use the virtual register the other branch used.

If the language had I/O, it could be used for (modest) real programs at this point!