I keep coming across people who consider parser technology a forbidding, scary field of programming. This is nonsense—a small parser can be very, very simple, and provide a wholesome exercise in recursive thinking. At the same time, it is true that you can make parsing extremely complicated. Mostly, this tends to happen when you generalize parsing techniques to work for different grammars. Since compiler textbooks usually describe general approaches to parsing, they may be to blame for putting people off.
This post describes a new parsing system I wrote for CodeMirror, a source code editor. It frames the system with some history, and digresses into some neat architectural details.
Editor features like syntax highlighting, bracket matching, code folding, and autocompletion all involve some level of parsing. Unfortunately, since editors have to handle many different languages, they require a generalized approach to parsing.
CodeMirror is in the process of being rewritten, and I wanted to improve the way it parses its content. Parsing inside of an editor comes with its own unique set of constraints, which can be hard to satisfy. Though I had been planning new approaches for years, all I had to show for it so far was a pile of dead ends.
The constraints that make the parsing problem in a code editor hard are roughly these:
The document is constantly changing.
You can't do anything expensive. If the parsing work takes too long, it'll introduce latency that makes editing feel sluggish and unresponsive.
The input is often not in a finished, syntactically correct form. But you still have to make some sense of it—nobody wants an editor where most features stop working when you have a syntax error in your document.
You often want to be able to mix several languages/grammars in a single document (think HTML with JavaScript and CSS embedded in it).
Keeping those in mind, let's go over the approaches I've tried.
The system as it exists in CodeMirror 5 now (which is pretty much what we've been using from the very beginning) is a simple one. For each language, you write a tokenizer which splits the input into pieces and labels each piece with some syntactic category (such as variable, keyword, or number). The tokenizers can be stateful, which allows them to secretly be full parsers if they want to.
This state must be copyable, so that the editor can strategically store tokenizer states from a previous run and, after a change, resume one close to that change to avoid re-tokenizing the entire document. Because we are usually only interested in the code in the visible viewport, this means the complexity of re-tokenizing is bounded by the distance between the change and the end of the viewport. Since most changes happen inside of that viewport, this works well in practice.
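In simplified form (this is a sketch of the idea, not CodeMirror 5's actual mode interface), such a stateful, copyable tokenizer looks something like this:

// A sketch of a stateful, copyable tokenizer. The state tracks whether
// we are inside a block comment, so that tokenizing can resume from a
// stored state after an edit.
function startState() { return {inComment: false} }

function copyState(state) { return {inComment: state.inComment} }

// Read one token from text at pos, returning its type and end position.
function nextToken(text, pos, state) {
  if (state.inComment) {
    let close = text.indexOf("*/", pos)
    if (close < 0) return {type: "comment", end: text.length}
    state.inComment = false
    return {type: "comment", end: close + 2}
  }
  if (text.startsWith("/*", pos)) { state.inComment = true; return {type: "comment", end: pos + 2} }
  let number = /^\d+/.exec(text.slice(pos))
  if (number) return {type: "number", end: pos + number[0].length}
  let word = /^[A-Za-z_]\w*/.exec(text.slice(pos))
  if (word) return {type: "variable", end: pos + word[0].length}
  return {type: null, end: pos + 1}
}

Because copying such a state is cheap, the editor can keep snapshots at strategic points and restart tokenizing from the one closest to a change.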
Such tokenizers are awkward to write directly, so over the years several attempts have been made to build abstractions over them. The first was the Common JavaScript Syntax Highlighting Specification, an attempt by the authors of Mozilla Skywriter (formerly Bespin, later merged into ACE) to define a declarative format for describing tokenizers as state machines with regular expressions (describing the tokens) as edges. The ACE project ended up with an incompatible but similar format (too entangled with their internals to use in CodeMirror, unfortunately). I did an implementation of the original spec for CodeMirror, and then another incompatible extension because the base spec was too limiting. There are a few CodeMirror modes still based on that code, but it was no real success.
I think the reason such state machines (and the somewhat related TextMate grammars which are in wide use in desktop editors) never felt like a great solution is that, once you get past trivial grammars (where their declarative simplicity does look really nice), they don't really help that much with abstraction. Manually designing complicated state machines is a chore. Regular expressions, which are bad enough on their own, become downright terrifying when you have to construct all your edges out of them, often stuffing multiple tokens into a single expression to avoid creating intermediate states. This “abstraction” has a tendency to produce uglier, less maintainable code than what you'd get when writing the tokenizer as plain code.
So in 2017, I started an ambitious project to create a better way to abstractly define incremental tokenizers. I had concluded that classical parser generators based on context-free grammars were never going to work in this context (for reasons that I'll come back to later on). But I kept coming across parsing expression grammars, which weren't based on context-free grammars and had some interesting properties, such as being able to combine multiple grammars to create a new grammar (which is great for mixed-language documents).
So I spent several months building a parsing system that took a PEG-like grammar, compiled it down to a state machine, and made it possible to run that state machine as a CodeMirror language mode.
This system is a marvel. It uses a moderately sophisticated optimizing compiler to generate the state machines. The result works quite well, and is used in several real-world systems today. But unfortunately, if I'm honest, it is a tragically bad idea taken way too far.
Parsing expression grammars are parsed by backtracking. And as such, they are very poorly suited for implementing a stateful tokenizer. In a backtracking system, you never know when you've definitely parsed a piece of content—later input might require you to backtrack again. So what I ended up with was actually not PEG at all, but a system where you had to explicitly annotate where the parser should look ahead. Though grammars written this way were relatively readable, they involved a lot of finicky, error-prone kludges to resolve local ambiguity.
Also, parsing PEG is just really inefficient. Such grammars are “scannerless”, meaning they don't make a distinction between tokenizing and parsing. When parsing in that way naively, you basically have to run your whole parsing logic for every input character, probably multiple times due to backtracking. A lot of the magic in the compiler was intended to recover the tokens that were implicit in the grammar, in order to regain some efficiency. But the system never came close to hand-written language modes in terms of speed.
So, though I knew I needed a new approach, I went into the CodeMirror 6 rewrite without any specific idea on what that approach would look like.
And then I saw tree-sitter, and was enlightened.
Tree-sitter is a parser system written with the code editor use case in mind, and is in the process of being integrated into the Atom editor. It takes a much more ambitious approach to what a parser inside an editor should do: It builds up a full, accurate syntax tree for the content.
You can do so much more with an actual syntax tree than with a sequence of tokens. Whereas tokens, possibly augmented with some information stored in the tokenizer state, allow you to sort of approximate understanding some aspects of the code's structure, a tree usually gives you precisely the information you need.
Most of the ideas that tree-sitter uses aren't new, in fact a paper from 2000 describes a somewhat similar system. But as far as I know, tree-sitter is the first system that puts them all together into a practical piece of software.
Unfortunately, tree-sitter is written in C, which is still awkward to run in the browser (and CodeMirror targets non-WASM browsers). It also generates very hefty grammar files because it makes the size/speed trade-off in a different way than a web system would.
But good ideas can be ported. Lezer is a JavaScript-based system heavily inspired by tree-sitter.
For a long time, I was firmly convinced that classical parser systems based on context-free grammars and LL or LR parsing algorithms were just not suitable for the editor use case. My arguments for this were...
Context-free grammars are a limiting abstraction that breaks down as soon as the language does anything funky. Needing the grammar to be LR or LL to please the parser generator further pins you into a corner.
This is not wrong. Expressing operator precedence in a pure context-free grammar requires writing a silly formulaic rule for each level of precedence. And when you need to implement something like automatic semicolon insertion or whitespace-sensitivity, which would be a couple of lines of code in a hand-written parser, you can't express that directly, and have to somehow escape the context-free abstraction.
Making such a grammar suitable for an LR parser generator can be even more tricky, and often requires you to have a rather deep understanding of how the parser generator works.
But like many things, once you get to know them, they aren't that bad. Parser generators can support precedence declarations, which make operator parsing a lot less terrible. They can even output decent error messages.
Supporting dynamic resolution of ambiguities through something like GLR parsing can provide a practical way out of situations that parser generators are traditionally bad at.
And contrary to some of the abstractions I mentioned before, this one actually gets us something. Context-free grammars, when combined with a proper parser generator, really do give us fast parsers from readable, compact grammar declarations.
A strict separation between the tokenizer and parser is problematic.
It is, in many languages (think of JavaScript's ambiguity between regular expressions and the division operator). It also tends to make mixed-language parsing harder.
But just because this type of parser is traditionally run with a completely separate tokenizer doesn't mean it has to be. Having the parse state drive the tokenizer is largely unproblematic. You can even have the parser generator set this up automatically, without user involvement.
Generated parsers are way too big.
A naively generated LR parser is huge, and many tools spit out embarrassingly big files. But with careful parser state deduplication and table compression such a parser can be made about as compact as a hand-written one.
Making such a parser error-tolerant is extremely cumbersome.
If you search the scholarly literature for approaches to error-tolerance in LR parser systems, you get a lot of results, with a lot of different approaches, but none of them are very practical. Most require the grammar writer to explicitly annotate the grammar with error-recovery strategies, bloating the grammar and putting the responsibility for getting it right on every grammar author.
Tree-sitter ingeniously abuses GLR parsing, where the parser can try multiple interpretations simultaneously, to integrate automatic error-correction without a lot of extra complexity. Lezer copies this approach.
I called my tree-sitter copycat project Lezer, which is the Dutch word for reader (and pronounced a lot like laser). It is a bit less advanced than tree-sitter in some areas, a bit more advanced in others, and simply different on quite a lot of points, as determined by a different set of priorities and tastes.
CodeMirror 6 will retain the ability to run a classical stateful tokenizer, but its recommended way to define a language mode is to write a Lezer grammar and wrap it in a CodeMirror-specific package that adds some editor-related metadata.
Lezer is an LR (with opt-in GLR) parser generator. It has support for incremental parsing, where you can cheaply re-parse a document after local changes have been made to it by reusing pieces of the old parse tree. It automatically tries to recover and continue parsing when it runs into a syntax error, leaving markers in the output tree that indicate where the recovery happened.
Lezer consists of an off-line parser generator tool, which takes a grammar description and outputs a JavaScript module containing a parser for that grammar, and a parser run-time system (which such output files depend on) to do the actual parsing. Only the run-time system and the generated parser need to be loaded by the editor.
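Schematically, the workflow looks something like this (the module name and exact method shape below are placeholders for illustration, not Lezer's published interface):

// Build step (offline): the parser generator compiles the grammar
// description into a JavaScript module of parse tables.
//   grammar description  -->  parser generator  -->  lang-parser.js
//
// Runtime (in the editor): only the generated module and the shared
// run-time system (a dependency of the generated module) are loaded.
import {parser} from "./lang-parser.js"  // generated tables for one language
let tree = parser.parse("let x = 10")    // the run-time interprets the tables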
The parser outputs non-abstract syntax trees, meaning that it just creates a raw tree structure containing the constructs it parsed (with information on where it found them), without organizing them into a clean, easy-to-use data structure.
The system is optimized for compactness, both in parser table size and syntax tree size. It needs to be practical to ship a bunch of parsers to a user on the web without producing megabytes of network traffic, and it needs to be realistic to keep syntax trees for large documents around without running out of memory.
The Lezer guide provides a more thorough introduction, as well as a description of its grammar notation. In this blog post, I want to go into the neat implementation details that aren't relevant in user documentation.
The point where I became convinced that I definitely needed to use or copy tree-sitter was when I understood its error recovery strategy.
Say you reach a point where you can no longer proceed normally because there is a syntax error. The rest of the input, after the error, is probably full of meaningful constructs that could still be parsed. We want those constructs in our syntax tree. But our regular parsing process is stuck—it doesn't know how to get from the error to a state where the parse can continue.
I definitely did not want to require the grammar author to add error recovery hints to their grammar. These tend to clutter up the grammar and are error-prone to write. Writing a grammar is hard enough without that distraction.
You can see error recovery as a search problem. There might be a parse state and input position (past the error) where the parse can meaningfully continue. We just have to find it.
The actions encoded in the parse tables, along with some recovery-specific actions that the parser wouldn't normally take, provide a kind of search tree. You start at the state(s) where the error occurred, and keep exploring new states from there.
But what does the accept condition look like? When do you know that you've found an acceptable solution? You could define that precisely, for example as the state that can handle the next N tokens without further errors. But we can also be vague.
The solution found by Max Brunsfeld in tree-sitter is to use the same mechanism that's used to parse ambiguous grammars. A GLR parser can split its parse stack and run both sides alongside each other for a while until it becomes clear which one works out.
That's pretty much exactly what a search algorithm does—it tracks a number of branches that it still has to explore, and continues to explore them, possibly pruning unpromising branches with some heuristic, until it finds a solution.
To be able to get good results, or at least some result, in messy situations like longer stretches of invalid input, each branch has a badness score associated with it, which is increased (linearly) each time a recovery action is taken, and decreased (asymptotically) every time it can consume a token normally.
What we want to do is, after an error, try all kinds of possible recovery tricks, which recursively branch off a large number of states. But then, after a bit of that, we should consolidate to one or, at most, a few parse states again, because parsing input in a whole bunch of different ways is expensive.
To get this effect, Lezer forbids states with a badness higher than a given multiple of the best state's badness (or some maximum threshold) from applying further recovery actions, effectively dropping those branches when they can't proceed normally. In the case where one branch finds a good way to continue, that branch's badness will converge to zero and eventually stop all worse branches. In cases where the input continues to make no sense, all branches will eventually get a badness score exceeding the maximum, and the parser will only continue one of them.
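In code, the badness bookkeeping amounts to something like this (the constants and formulas are made up for illustration, not Lezer's actual values):

// Illustrative badness bookkeeping for error-recovery branches.
const MAX_BADNESS = 500, MULTIPLIER = 3

// Taking a recovery action adds a fixed penalty...
function recovered(branch) { branch.badness += 100 }
// ...while consuming a token normally decays the score asymptotically.
function consumedToken(branch) { branch.badness = Math.floor(branch.badness * 0.8) }

// A branch may only take further recovery actions while its badness stays
// within a multiple of the best branch's badness and under a hard cap.
// Branches that fail this test and can't proceed normally are dropped.
function mayRecover(branch, branches) {
  let best = Math.min(...branches.map(b => b.badness))
  return branch.badness <= best * MULTIPLIER && branch.badness <= MAX_BADNESS
}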
The recovery strategies used are:
Skip the next token, and try again with the same state after that.
Invent a token—take any of the tokens that are valid in this state, and continue to the state that consuming them would produce. This is the main source of branching, since many states allow a lot of tokens.
Force the end of the innermost production that's currently being parsed.
There are situations where the result of this approach isn't entirely optimal, but it usually does well. The important thing is that it always keeps parsing, and does so in a way that remains tractable (exponential searches are quickly dampened). The system is biased a bit towards the token-skipping rule, so that if all else fails it'll, in effect, just continue skipping tokens until it stumbles into a situation where it can continue parsing.
When you have a parser that may be splitting its state—a lot—and building up parts of the tree multiple times, that duplicate tree building and the bookkeeping involved in it can cause a lot of unnecessary work.
The order in which an LR parser creates nodes is inner-to-outer. It will, for example, first create the nodes for the operands, and then later the node for the operator expression. This suggests an approach: what if, instead of building a tree structure right away, the parser just keeps a flat log of the nodes it creates? This can be an array in which the nodes are arranged in post-order, with children coming before parents.
The parser just appends to this array. When splitting the state, one state keeps the existing array, and the other gets a new empty array along with a pointer to the state that has the rest of the array, and the length of that array at the time of the split.
Now splitting involves no node copying at all. You do need to copy the state stack, which LR parsers use to track context, but that is generally shallow.
In addition, node allocation becomes as cheap as appending a few numbers to an array. For actions that don't result in tree nodes (Lezer allows you to mark rules as uninteresting, to keep the tree small), you don't have to do anything at all. The control stack stores the output array position at the start of each rule, and can use that to emit enough data to later reconstruct parent-child relationships.
After a parse finishes successfully, the final state's parent-array pointers can be used to find all the nodes that make up the tree, and construct an actual tree structure out of them.
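Sketched out, the shared flat log might look something like this (a simplification, not the actual data layout):

// A simplified sketch of the flat node log. Nodes are appended in
// post-order as (type, start, end, childCount) entries. Splitting the
// state copies nothing: the new branch just remembers which log it
// branched off from and how long that log was at the time.
class NodeLog {
  constructor(parent = null, parentLength = 0) {
    this.values = []
    this.parent = parent
    this.parentLength = parentLength
  }
  append(type, start, end, childCount) {
    this.values.push(type, start, end, childCount)
  }
  split() {
    return new NodeLog(this, this.values.length)
  }
  // After a branch finishes, walk the chain of parent logs to collect
  // the full post-order sequence for tree construction.
  flatten() {
    let parts = [], log = this, length = this.values.length
    while (log) {
      parts.unshift(log.values.slice(0, length))
      length = log.parentLength
      log = log.parent
    }
    return [].concat(...parts)
  }
}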
One tricky issue occurs when skipped content (whitespace and comments) produces nodes. If you have code like this...
if (true) something()
// Comment
otherStatement()
... the comment should not be part of the if statement's node. Yet the parser only knows for sure that it can finish that node after seeing the next statement (there might be an else still coming).
In cases like this, where the output array contains skipped nodes immediately in front of a reduction, the parser has to move them forward and store the end of the node before them. Fortunately, this occurs relatively rarely (unless you add nodes for whitespace, in which case it'll happen at the end of every rule that has a possible continuation).
A nice thing about the flat post-order tree representation is that it is compact. Tree structures constructed the usual way, as separately allocated nodes, incur a lot of extra overhead for pointers and allocation headers. They can also have terrible locality, since who knows how far from each other the memory allocator will put the nodes.
Unfortunately, we can't just use a flat representation for our syntax trees. The incremental parser has to be able to reuse parts of it without copying those parts into a different buffer.
But we can use it for parts of the tree. Storing the coarse structure as a classical tree, but the content of smaller nodes (say less than a few thousand characters long) as flat arrays, gives us the best of both worlds. Since most nodes, by number, live in the fine structure, this saves a large amount of overhead (and helps with locality).
That does mean that we can't reuse small nodes. But since their size is limited, the amount of work that is involved in re-parsing them is also limited. And by removing them from consideration, the incremental parser can avoid quite a bit of the work involved in preparing and scanning the tree for reuse.
A small node stores its content in a typed array of 16-bit unsigned integers. It uses 4 such numbers (64 bits) per node, storing a type, a start position, an end position, and a child count for each node. Unlike the array created by the parser, these arrays are in pre-order, because that makes forward iteration (which tends to be more common than backward iteration) cheaper. The child count is almost redundant (the end positions can mostly tell you which nodes are children), but Lezer supports zero-length nodes, which might land on the end of their parent node and make it ambiguous whether they belong to it or not.
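To make the layout concrete, here is a sketch of how such a pre-order buffer can be walked (the real encoding differs in its details):

// Sketch: walk a small-node buffer laid out as a Uint16Array of
// (type, start, end, childCount) records in pre-order.
function walkBuffer(buffer, names, index = 0, depth = 0) {
  let [type, start, end, childCount] = buffer.subarray(index, index + 4)
  console.log("  ".repeat(depth) + names[type] + " " + start + "-" + end)
  index += 4
  for (let i = 0; i < childCount; i++)
    index = walkBuffer(buffer, names, index, depth + 1)
  return index
}

// A tiny hand-built buffer: a BinaryExpr spanning 0-5 with two Number children.
let names = ["Number", "BinaryExpr"]
let buffer = new Uint16Array([1, 0, 5, 2,  0, 0, 1, 0,  0, 4, 5, 0])
walkBuffer(buffer, names)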
Client code, of course, doesn't want to deal with this representation. Lezer provides an abstract interface to searching in and walking through trees that hides the buffer structure, allowing you to conceptually work with a uniform tree of nodes.
Lezer, like tree-sitter, stores the result of repetitions in the grammar (produced by the * and + operators) as balanced subtrees. This means that, unless your input is pathological (say, a thousand applications of a single binary operator in a row), you tend to get shallow, well-balanced syntax trees, which are cheap to search and allow effective reuse.
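The balancing itself is conceptually simple; something along these lines (illustrative, not Lezer's exact algorithm):

// Sketch: group the children produced by a repetition into a balanced
// tree with a bounded number of children per node, so that a long list
// of statements doesn't become one enormous flat node.
const MAX_CHILDREN = 8 // illustrative constant

function balance(children) {
  if (children.length <= MAX_CHILDREN) return {type: "repeat", children}
  // Split into roughly equal chunks and balance each chunk recursively.
  let chunkSize = Math.ceil(children.length / MAX_CHILDREN)
  let groups = []
  for (let i = 0; i < children.length; i += chunkSize)
    groups.push(balance(children.slice(i, i + chunkSize)))
  return {type: "repeat", children: groups}
}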
Depending on the grammar's complexity, an LR parser generator creates between a dozen and a few thousand parse states for your grammar. These represent syntactic positions like “after the opening paren of an argument list” or “after an expression, possibly expecting some expression suffix”.
The parser generator can figure out which tokens are valid in a given state. It can also, for tokens specified as part of the grammar, automatically determine which tokens conflict (match the same input, or some prefix of each other).
A well-known example of conflicting tokens is the division operator versus regular expression syntax in JavaScript. But others are keywords that can also appear as property names, and the bitwise right shift operator (>>) versus two closing angle brackets in C++.
Lezer will not complain about overlapping tokens if the tokens do not appear in the same parse states. This implicitly resolves the regular expression and property name issues, without any user interaction.
When conflicting tokens do appear in the same place, such as division operators and C-style comments, you have to specify an explicit precedence ordering (comments take precedence) to tell the tool that you know what you're doing.
Contextual tokenization is implemented with a concept called token groups. Tokens that have unresolved conflicts with other tokens are assigned to one or more groups, where each group contains only non-conflicting tokens. Each state is assigned a single group (if it expects tokens that conflict with each other, that's an error). This group is passed to the tokenizer, which then takes care to only return tokens that are either in that group or don't conflict with any other tokens. The check is optimized by storing group membership in a bitset and seeing whether the right bit is set with a binary AND.
Tokens are compiled down to a single deterministic state machine, which is run on the input character stream. In cases like the regexp-versus-division issue, you don't want the machine to go running through regexp-specific states in a situation where you only allow division, since that would be wasteful. Therefore, each tokenizer state is also tagged with a bitset that tells you which groups the tokens reachable from that state belong to, and the tokenizer stops running when it hits a state that has no overlap with the allowed tokens for the parse state.
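The checks themselves are just bit operations, roughly like this (the representation is illustrative, not the real one):

// Illustrative bitset checks. Conflicting tokens are assigned to groups;
// for this sketch, tokens that conflict with nothing get a catch-all bit
// that every parse state allows.
const GROUP_ALWAYS = 1 << 0, GROUP_DIVISION = 1 << 1, GROUP_REGEXP = 1 << 2

// A token may be returned if its groups overlap the parse state's allowed set.
function tokenAllowed(tokenGroups, allowedGroups) {
  return (tokenGroups & (allowedGroups | GROUP_ALWAYS)) != 0
}

// Each tokenizer state carries the union of the groups of all tokens still
// reachable from it; scanning stops once that no longer overlaps with what
// the parse state allows.
function keepScanning(reachableGroups, allowedGroups) {
  return (reachableGroups & (allowedGroups | GROUP_ALWAYS)) != 0
}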
Almost all programming languages have special syntactic elements like whitespace and comments that may occur between any tokens. Encoding these directly in the grammar is extremely tedious for most languages.
Traditionally, tokenizers just skip such elements when reading the next token. That works well in most contexts, but makes it awkward to include the elements in the parse tree.
Lezer treats skipped things like they are part of the grammar (though in an optimized way to avoid increasing the size of the parse tables). It is possible to skip things that aren't single tokens (to implement something like nestable comments, for example, or to make sure your block comment nodes consist of smaller nodes so that you can incrementally parse giant block comments).
Each rule or group of rules may have its own set of skipped expressions, so that you can express different sublanguages in a grammar, for example something like the content of interpolated strings, without allowing spacing in places where the language doesn't allow it.
Each parse state has a pointer to a (shared) set of skip actions, which, for the skipped tokens or tokens that start a compound skipped expression, contains the actions to take for those tokens. For single-token skipped elements, that action just tells the parser to skip the token and stay in the same state. For compound elements, it causes the state that handles the rest of the element to be pushed onto the control stack.
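Put concretely, the lookup amounts to something like this (the data layout here is made up for illustration):

// Illustrative skip-action lookup. Each parse state points at a shared
// table mapping skippable token types to an action.
function trySkip(stack, state, tokenType) {
  let action = state.skipActions[tokenType]
  if (action == null) return false   // not skippable here
  if (action.pushState != null)      // compound element (e.g. a nestable
    stack.push(action.pushState)     // comment): handle its body in a
                                     // state pushed on the control stack
  return true                        // otherwise: consume the single token
                                     // and stay in the current state
}

// Whitespace is skipped in place; a block comment opener pushes a state
// that parses the comment's interior.
let statementState = {skipActions: {space: {}, commentStart: {pushState: "inBlockComment"}}}
let stack = []
trySkip(stack, statementState, "space")         // true, stack unchanged
trySkip(stack, statementState, "commentStart")  // true, pushes "inBlockComment"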
The languages that a tool like Lezer needs to handle are wildly different, from JavaScript to Haskell to CSS to YAML. As such, it is difficult to find a cross-language vocabulary to describe their constructs. In fact, it seems like that would be a separate multi-year project, and pull in a serious amount of complexity.
Yet it would be nice if the parser's output came with some information that can be interpreted without knowing which language you are working with.
After several iterations, what I decided on was a system where nodes have names, which only have a meaning within the language, and props, which are values associated with tags defined by external code. Integrating a language grammar into CodeMirror involves assigning values for some of these props to the node types used by the language—things like syntax highlighting style information and how to indent such nodes.
Since the number of node types in a language is limited, we can allocate an object for each node type to hold this information, and have all nodes of that type point to the same object.
To allow code outside the grammar to add props without mutating global state, parser instances can be extended with additional props, creating a copy that will output nodes with the props attached. This is especially useful in the context of mixed-language trees.
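In rough terms, the data model looks like this (the names are illustrative, not Lezer's exact API):

// Every node of a given type points at a single shared node-type object.
// Props are values attached to that object under tags defined by code
// outside the grammar.
let highlightProp = {id: 0}   // tag owned by the highlighter
let indentProp = {id: 1}      // tag owned by the indentation logic

let callExpression = {
  name: "CallExpression",     // only meaningful within this language
  props: {[highlightProp.id]: "call", [indentProp.id]: "indent-args"}
}

// External code can derive node types with extra props attached, leaving
// the originals untouched, so a parser instance can be extended without
// mutating global state.
function addProps(nodeTypes, prop, values) {
  return nodeTypes.map(type => values[type.name] == null ? type
    : {...type, props: {...type.props, [prop.id]: values[type.name]}})
}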
Lezer has support for a limited form of grammar nesting. If language A can appear inside a document in language B, and the end of the region covered by A can be unambiguously found by scanning for a specific token, Lezer can temporarily switch to another set of parse tables while parsing such a region.
The syntax tree will then contain nodes from both grammars. Having props directly attached to the nodes makes it much easier to work with such trees (as opposed to using a language-specific table that associates node names with metadata).
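As a sketch (the names are made up), the mechanism amounts to scanning ahead for that end token and handing the region to the inner grammar:

// Illustrative nesting: the outer grammar (say HTML) reaches a region in
// another language (say JavaScript inside a script tag), scans ahead for
// the token that unambiguously ends it, and parses the region with the
// inner grammar's tables.
function parseNestedRegion(doc, from, endToken, innerParse) {
  let end = doc.indexOf(endToken, from)
  if (end < 0) end = doc.length               // unterminated region
  let innerTree = innerParse(doc.slice(from, end))
  return {innerTree, end}                     // spliced into the outer tree
}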