Type inference, a first attempt
Here I'll give a practical description of Hindley-Milner type inference as applied to (an early incarnation of) Hob. For a more complete treatment of the technique, see for example this paper.
To do type inference, a program walks an abstract syntax tree and assigns types to every expression it contains, giving up when it encounters a type error. It does this, conceptually, by constructing a system of constraints based on the way terms are used, and then solving this system. If we have, for example, the following Hob program...
def succ x -> x + 1
succ 5
We start with the information that `+` is an operator that takes two numbers and returns a number. This means that `x` in `x + 1` must have type `number`, and the result of computing `x + 1` is also a number. Thus, `succ` has the type `number -> number`. Then, looking at `succ 5`, the checker concludes that `succ` is a function that takes a number. These conclusions can all be reconciled, yielding a typing for the program.
If the second line had said `succ "foo"` instead, that would have given rise to the constraint that `succ` takes a string argument. This would conflict with the other constraints, meaning we have a type error.
When the type-checker gets to an element, there is not always enough information to assign a complete type to it, so a type-checking function can not simply return a finished type for the expression it was given. For example, mutually recursive functions require both functions to be checked before their types are known. This kind of merging of partial information is done through an algorithm called 'unification'.
We define types to be either a 'type variable' or an instance of some type constructor, with zero or more types as arguments. Thus, `1` has the type `number`, `[1, 2]` the type `list number`, and `fn x -> x`, in a polymorphic type system, has the type `a -> a`, where `->` is a type constructor and `a` is a type variable.
- Unifying two variables results in both of them referring to the same type from then on.
- Unifying a variable with a type instance causes the variable to refer to that instance.
- Two instances of the same type constructor are unified by unifying their arguments.
- Instances of different constructors can not be unified, and raise a type error.
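To make this concrete, here is roughly what such a type representation could look like in Python. This is a sketch of my own, not Hob's actual code, and all the names in it are made up:

    class TypeVar:
        """A type variable. Its 'ref' field comes into play further down."""
        _counter = 0
        def __init__(self):
            TypeVar._counter += 1
            self.name = "t%d" % TypeVar._counter
            self.ref = None                     # None ('nil') means unbound

    class TypeOp:
        """An instance of a type constructor, e.g. number, list a, a -> b."""
        def __init__(self, name, args=()):
            self.name = name
            self.args = list(args)

    number = TypeOp("number")
    string = TypeOp("string")

    def func(arg, result):                      # the '->' type constructor
        return TypeOp("->", [arg, result])

    def list_of(elem):                          # the 'list' type constructor
        return TypeOp("list", [elem])

    a = TypeVar()
    identity_type = func(a, a)                  # the type of fn x -> x: a -> a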
The `typecheck` function takes an abstract syntax tree and an environment mapping (non-type) variables to types, and returns a type for that tree (which may contain variables that later get instantiated). It works approximately like this:
- For a literal, simply return the type of the literal.
- For a variable reference, look up the type of that variable in the environment and return it. If it is not found, the variable is undefined, which is an error.
- For an abstraction (`fn` expression), extend the environment by binding the function's arguments to fresh type variables (variables with names that haven't been used yet), and type-check the body with that environment. Then return a function type mapping the arguments' type variables to the body's type.
- For a function application, type-check the argument, type-check the function, create a type variable for the result type, and then unify the type obtained for the function with an instance of the `->` type constructor applied to the argument's type and the result type. Return the result type.
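A sketch of this function, building on the Python representation above, might look as follows. The tuple-based AST and the helper names are assumptions of mine, and `unify` is the procedure sketched further down:

    # AST nodes, for this sketch, are tuples:
    #   ("lit", value), ("var", name), ("fn", param, body), ("call", fn, arg)

    def typecheck(node, env):
        kind = node[0]
        if kind == "lit":                       # literal: return its type directly
            return number if isinstance(node[1], int) else string
        elif kind == "var":                     # variable: look it up in the environment
            if node[1] not in env:
                raise TypeError("undefined variable: " + node[1])
            return env[node[1]]
        elif kind == "fn":                      # abstraction: fresh variable for the argument
            param, body = node[1], node[2]
            arg_type = TypeVar()
            body_type = typecheck(body, dict(env, **{param: arg_type}))
            return func(arg_type, body_type)
        elif kind == "call":                    # application: unify with arg -> result
            fn_type = typecheck(node[1], env)
            arg_type = typecheck(node[2], env)
            result = TypeVar()
            unify(fn_type, func(arg_type, result))
            return result

With `+` bound to `number -> number -> number` in the initial environment (and calls curried), this reproduces the reasoning about `succ` from the start of the post.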
In this description, unification has side effects. Unifying a variable with some type instance replaces all occurrences of that variable with the type instance. Implementing this initially stumped me. There can be references to that variable scattered throughout (inside other types, or stored as the type of some expression), and when replacing the variable, all these references should be updated.
My first (rather naive) approach was to always keep types in 'boxes' (mutable structures with one field) and update both boxes to refer to the same type after they were unified. But this didn't even resemble a working solution. (Unifying box `a` and box `b` set both boxes to hold, say, `a`. Then unifying the second box with box `number` left the first box still pointing at `a`.) Next I devised a complicated scheme where types were referred to through tags, and each type kept a list of tags referring to it. Updating a type caused all these tags to also be updated. This worked, but was rather clunky.
In the eventual solution, type variable structures have a `ref` field, which refers to `nil` for unbound variables, to a type for bound variables, and to another variable for variable structures that are equivalent to another variable. Thus, these structures form chains, with the variable at the end being the 'canonical' variable: its `ref` field is either a type instance, indicating that all the variables referring to it are bound, or `nil`. This requires 'resolving' of types before unifying them, but otherwise left the code free of noise and blatant inefficiencies.
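In terms of the Python sketch, resolving and unification along those lines might look like this (again my own approximation, not Hob's code):

    def resolve(t):
        """Follow a variable's ref chain to the canonical variable or type."""
        while isinstance(t, TypeVar) and t.ref is not None:
            t = t.ref
        return t

    def unify(t1, t2):
        t1, t2 = resolve(t1), resolve(t2)
        if isinstance(t1, TypeVar):             # a variable: bind it, or link two variables
            if t1 is not t2:
                t1.ref = t2
        elif isinstance(t2, TypeVar):
            t2.ref = t1
        elif t1.name == t2.name and len(t1.args) == len(t2.args):
            for a1, a2 in zip(t1.args, t2.args):  # same constructor: unify the arguments
                unify(a1, a2)
        else:                                   # different constructors: a type error
            raise TypeError("can not unify %s and %s" % (t1.name, t2.name))

As written, this will happily build infinite types; the occurs check discussed near the end of the post takes care of that.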
The system described so far, though, is not polymorphic. In an expression like this:
let id = fn x -> x
print (id 5)
print (id "foo")
If type-checking of `let` is implemented along the lines of the other constructs, the first application will unify the type of `id` (`a -> a`) with `number -> b`, resulting in `number -> number`. Applying `id` to `"foo"` then yields a type error. This is not what we want. Having polymorphic functions (functions with type variables in their type) is only useful if they can actually be used in a polymorphic way.
The (usual) solution to this problem is to treat the types of variables introduced by `let` (and similar constructs) specially. Instead of normal types, they get assigned a 'type closure', and every time they are used (outside of their definition) a new type gets instantiated based on this type closure, meaning a copy of the type is made, with all the unbound type variables that were introduced by the variable's definition given new names. But only those: any type variables that were introduced outside of the definition must keep their identity. In this it works a lot like ordinary closures. (I didn't really 'get' this technique until I read chapter 31 of Programming Languages: Application and Interpretation).
So now the rule for `let` type-checking looks something like this:
- With some way to distinguish newly introduced type variables from existing ones, type-check the `let`'s definition bodies, and use the results to extend the environment by binding the bound variables to a type closure of their definition's type. Then use this environment to type-check the `let`'s body.
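One possible way to express this in the Python sketch: a 'type closure' records which variables were introduced by the definition (here found by comparing against the variables already present in the environment; whether Hob distinguishes them this way is an assumption on my part), and instantiation copies exactly those:

    class TypeClosure:
        """A type plus the variables that get fresh copies at every use."""
        def __init__(self, tp, generic_vars):
            self.type = tp
            self.generic_vars = generic_vars

    def free_vars(t):
        t = resolve(t)
        if isinstance(t, TypeVar):
            return {t}
        return set().union(*[free_vars(arg) for arg in t.args])

    def generalize(t, env):
        """Close over the variables that do not occur anywhere in the environment."""
        env_types = [v.type if isinstance(v, TypeClosure) else v for v in env.values()]
        env_vars = set().union(*[free_vars(e) for e in env_types])
        return TypeClosure(t, free_vars(t) - env_vars)

    def instantiate(closure):
        """Copy the closed-over variables; everything else keeps its identity."""
        mapping = {var: TypeVar() for var in closure.generic_vars}
        def copy(t):
            t = resolve(t)
            if isinstance(t, TypeVar):
                return mapping.get(t, t)
            return TypeOp(t.name, [copy(arg) for arg in t.args])
        return copy(closure.type)

    def check_let(name, definition, body, env):
        """The let rule: generalize the definition's type, then check the body."""
        def_type = typecheck(definition, env)
        return typecheck(body, dict(env, **{name: generalize(def_type, env)}))

For this to actually work, the variable case of `typecheck` (or, as described next, the resolver) also has to call `instantiate` whenever it finds a `TypeClosure` instead of a plain type.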
I extended the type 'resolver' that already dealt with chains of type variables to instantiate any type closures it ran across. The new type-checker handles the above example (if `print` is given a type), assigning the type `number` to `id 5`, and `string` to `id "foo"`.
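Glued together, the sketched pieces reproduce that behaviour. A hypothetical snippet, doing the instantiation by hand rather than through the resolver:

    env = {}
    id_type = typecheck(("fn", "x", ("var", "x")), env)   # a -> a
    env["id"] = generalize(id_type, env)                   # closed over a

    # Each use gets its own copy of a -> a:
    r1 = TypeVar()
    unify(instantiate(env["id"]), func(number, r1))        # id 5
    r2 = TypeVar()
    unify(instantiate(env["id"]), func(string, r2))        # id "foo"
    print(resolve(r1).name, resolve(r2).name)              # prints: number string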
You probably noticed the asymmetry of only applying this technique to `let`. Function definitions also introduce variables. With the new code, we still can't type-check something like this:
(fn id -> print (id 5); print (id "baz"))
(fn x -> x)
The `id` variable gets bound by the function, and therefore holds a regular type, not a type closure, so the above is a type error. It won't type-check in Haskell either (`(\id -> (id 5, id "baz")) (\x -> x)`).
The problem with such a function is that the type-checker knows nothing about the kind of `id` value that might get passed to it, and thus it is hard to assign a closure for it. (Also, depending on the way in which type polymorphism is implemented, such functions might be very hard to compile.)
Another relevant issue is infinite types, as in the expression `fn x -> x x`. This infers the type of `x` to be a function type with its own type as the argument type, which sends our type-checker into an infinite recursion of type expansions. The simplest way to prevent this is to fail when trying to replace a variable with a term that contains that same variable.
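In the Python sketch, that check (usually called the 'occurs check') could be a small helper consulted by `unify` just before it binds a variable:

    def occurs_in(var, t):
        """Does the canonical variable 'var' appear anywhere inside type t?"""
        t = resolve(t)
        if t is var:
            return True
        return isinstance(t, TypeOp) and any(occurs_in(var, arg) for arg in t.args)

    # In unify, before doing t1.ref = t2:
    #     if occurs_in(t1, t2):
    #         raise TypeError("can not construct an infinite type")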
A very necessary extension to a type system like this is Haskell-style type classes. For example, if we want to make integers and floating-point numbers different types, we also need different operators for addition and multiplication (see O'Caml), and a generic `print` function as used in the examples is impossible. I am leaving this for later.
Having a parser and a type-checker for our language makes it feel somewhat real already. Programs can not be executed yet, unfortunately, but at least we can get some feedback on whether they make sense.