Faking algebraic (variant) data types in Common Lisp

Topics: Type definitions, functional programming, Common Lisp
Author: Marijn Haverbeke
Date: April 28th 2008

One of the things that really stuck me when I first learned Haskell was the concept of ML-style algebraic data types. I hadn't seriously played with any similar languages before, and the directness of the mapping between my intentions and the code that I was writing was striking. No superfluous inheritance hierarchies, methods, accessors, or (Jove preserve us) hacks with C unions and structs. Just show what the data looks like, and match values against patterns to make use of it.

If you haven't had the privilege of playing with these, here is a definition of a list-of-integer type in Haskell, with a length function:

data List = Nil | Cons Int List

len Nil = 0
len (Cons _ tail) = 1 + len tail

two = len (Cons 1 (Cons 2 Nil))

The data line says that values of type List have either the value Nil, or a value Cons holding an integer and another List value. The integer part could be made polymorphic, but since we're going to talk about Common Lisp, the static typing aspects don't interest us.

The len function on this type is then defined by handling exactly these two cases. The matching against a certain constructor (as the 'options' of an algebraic data type are usually called) also implicitly does the binding of the variables (tail) that we need.

Finally, the bottom line shows that values of this type are created simply by invoking the constructor like a function. Nil doesn't take arguments, and Cons takes two of them. For a lot of simple data types, creating and unpacking are the main use-cases, and it is great to be able to use such a tiny, straightforward data declaration instead of writing out a string of boilerplate classes, accessors, or initargs.

If we consider Common Lisp, algebraic types won't give us the invulnerable sense of compiler-checked correctness that we get in a statically typed language. Still, their succinctness and clarity could really brighten up some programs — types for ASTs are a typical example. In any case, I constantly find myself missing them when writing CL code, so let's see what they might look like.

The first question is how to represent instances of our algebraic data types. There are three sensible choices: classes, structs, or vectors. Vectors would be the most conservative and efficient choice, but have the major drawback that you can not associate type information with them, so that it becomes impossible to specialise methods on these types. 'Playing nice' with the type system would be a plus, so let's try without vectors.

CLOS classes are awesomely flexible, whereas structs are kind of limited and stupid. But in many cases you'll want to use these types for tiny little objects, the kind that you will be creating by the millions. Because of their simplicity, struct instances are significantly faster to create and access on every implementation I tested. They have one really big drawback though — redefining them causes undefined behaviour. This usually means you have to throw away your existing instances and recompile the code that depends on the type whenever you redefine a type. I decided this was acceptable, but you could easily modify the code to use classes instead.

What kind of interface should an implementation of algebraic data types provide? Two constructs seem to go a long way: defining new types, and matching against types. For the first, one can do...

(variant _list _nil (_cons car cdr))

(With variant being pleasantly shorter than algebraic-data-type and more readable than adt). This creates three struct types, _list, _nil, and _cons (underscores are there to prevent name clashes with standard symbols), with _list being an empty struct that just serves as a supertype for the other two and, more importantly, as a way to indicate what this data type is intended for.

Matching happens with what is basically a case statement:

(defun len (lst)
  (vcase lst
    (_nil 0)
    ((_cons cdr) (1+ (len cdr)))))

(len (_cons 1 (_cons 2 (_nil))))

A vcase form works much like a typecase form (and in fact expands to one), except that it allows extracting of member fields. When the first element in a case is a symbol, that clause is treated exactly like a typecase clause, but when it is a list, the first element of this list is a struct type, and further elements are slot bindings. Slots can be either a single symbol, in which case the bound variable has the same name as the slot, or a list of two symbols, the first being the variable name, the second the slot name. You don't have to extract slots you don't use.

(If variant values were classes, this could simply expand to slot-value forms, but the standard doesn't require slot-value to work on structs, so my implementation expands to regular struct accessors, taken from the package that defines the struct type, so that you don't have to export the accessors. There is a problematic corner case with structs that have a name that you imported from another package.)

vcase does not do nested patterns. For serious pattern matching, struct-based variant values are trivial to use with cl-match.

An implementation of the above, containing a few small extensions that turned out to be useful in practice (evcase for ensuring at least one case matches, vbind for when you already know the type you are matching against) can be found here.