Compiling pattern matches
Pattern matching on algebraic datatypes is one of the main reasons ML or Haskell programs are so wonderfully succinct. In Hob, it looks something like this:
type Bool = True, False
def and
False _ -> False
True x -> x
match (and True False)
True -> 1
False -> 0
The first line defines an algebraic type, in this case an enum-like
boolean type. Function definitions can provide several args -> body
forms, with are pattern-matched against the arguments to pick a body
to execute. The definition of and
above returns False
when the
first argument is False
, and returns its second argument otherwise.
match
can be used to pattern-match in other situations. Here, a
single value rather than a list of arguments is matched against.
Algebraic types can, of course, hold other values, as in this definition of a pair type:
type Pair = Pair Int32 Int32
let x = Pair 1 2
def second (Pair _ s) -> s
second x
(This pair can only hold integers — polymorphic datatypes haven't been implemented yet.) These matched patterns can be nested arbitrarily deep. When a match fails, an error is raised.
To compile these, the patterns have to somehow be compiled down to a bunch of tests and branches (or, even better, jump tables). This sounded very easy, but is actually quite involved.
When multiple values (for example function arguments) are being matched, it can make a difference for the amount of tests that have to be performed which one is matched first. So I'm using a simple heuristic to pick a value — each (necessary) test splits the set of possible branches into a number of (overlapping) sets. As an example, let us take this contrived program:
type Performance = Good, Average, Bad, Awful
def score
Good Good -> 1
Good Average -> 2
Good _ -> 3
_ Good -> 4
Awful Awful -> 6
_ _ -> 5
If we split on the first argument, we get five cases under Good
, and
three under Awful
(note that cases that have either a variable name
or the _
wildcard are included in both groups), as well two cases
under 'other'. For the second argument, this would be four for Good
,
three for Awful
, and three for Average
. We pick the one where the
maximum group size is smallest, in this case the second argument.
Having picked an argument to match on, we gather all the values that
are present in the patterns it matches against (Good
, Average
, and
Awful
in this case), and recurse for each of these, as well as the
fall-through case. How does that look in terms of
data-transformations?
The patterns are kept as a list of (pattern array, body) pairs. 'Choosing' a branch means picking the patterns that match out of this list, and removing the used patterns from the arrays. For example:
([Good, Good], 1)
([Good, Average], 2) ([Good], 2)
([Good, _], 3) == pick Average => ([Good], 1)
([_, Good], 4) on arg 2 ([_], 5)
([Awful, Awful], 6)
([_, _], 5)
When matching against a 'composite' pattern, one that contains sub-patterns, the matched-against pattern is not simply removed from the vector, but is replaced by its arguments.
([Pair 1 _, Good], 1) =====> ([1, _, Good], 1)
([Pair _ 2, Awful], 2) ([_, 2, Awful], 2)
Using this technique, a match can be compiled down to a tree of simple
choices — each time it branches, it selects a value to branch on (for
example, the enum tags of the Performance
datatype), and points to a
new tree for every value that occurs, and optionally (when these
values don't exhaustively cover the type) one for the 'other' case.
As soon as the 'top' pattern / body pair of a subtree always matches
(only contains variables or underscores), we've reached a leaf. Leaves
in this tree hold the actual branches, along with some information
about the variables bound for them. For example, in the Pair
example, the match-compiler has to note that s
, in the body of
second
is bound to the first field of the first argument. From a
tree like this, compiling to actual instructions is relatively
simple...
Except for one thing: The 'branch picking' process shown above (there's a better name for this, but I lost the paper in which I found it) is not exclusive. A branch with variable or wildcard patterns will often end up in multiple sub-trees. It is possible to just compile it multiple times, of course, but that'll bloat up some kinds of code beyond reason. So we have to be careful, when compiling, to just jump to an already-compiled instance of the branch if one exists.
The tricky part is that, if the branch's patterns bind variables, those have to be compiled in such a way that they end up in the same virtual registers. Nested composite values will often be dereferenced and stored in a virtual register during the matching, and the path used to reach a branch's body can be wildly different. I solved this by noting that vregs can 'merge' harmlessly if they are guaranteed to hold the same value, even if they are assigned in two different places, and am using a scheme where, when dereferencing a value, I check whether any other branch dereferenced it before, and if so, re-use the virtual register the other branch used.
If the language had I/O, it could be used for (modest) real programs at this point!