23 Dec 2018

Part 1: Lexing mathematical expressions with F#

Part 2: Parsing mathematical expressions with Shunting Yard, F#, and C#

Part 3: Recursive descent parsing of mathematical expressions with C#

(Code for this post is available on GitHub.)

This post describes a backtracking lexer and combined recursive descent LL(1) parser and evaluator for mathematical expressions. We show how to stay with recursive descent, even for arbitrary precedence and associativity. No need to switch to Shunting Yard or Pratt parsing algorithms or to Yacc and family. All that's required is a small change to the grammar and parser for recursive descent to support left and right associative operators and precedence levels.

With recursive descent, grammar and parser follow from each other. The two are isomorphic, leading to easy to understand, easy to debug, and easy to extend parsers. Once we know how to parse an expression with precedence and associativity rules, the techniques apply to more involved grammars like those of programming languages.

Below is a starting Backus-Naur Form grammar for expressions. We'll evolve it into a grammar and parser with support for more precedence levels, for left and right associative operators, and for integers and floats:

Expression = Term | Expression "+" Term | Expression "-" Term Term = Factor | Term "*" Factor | Term "/" Factor Factor = Power | Factor "^" Power Power = Integer | "(" Expression ")" Integer = Digit | Integer Digit Digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

The grammar is LL(1) meaning that a parser consumes tokens left to right. When a list of tokens match a grammar rule, the parser groups these tokens in a leftmost derivation corresponding to left associativity. For right associative operators, like power, we must construct the rightmost derivation. The number one in parentheses refers to grammar rules defined such that looking one token ahead, but without consuming it, the parser can unambiguously identify one matching grammar rule. Organizing these rules by order of precedence enables support for precedence and associativity.

Rule names come from the domain described by the
grammar. Here algebra
where a term is an operand involving addition or subtraction and
factor an operand involving multiplication or division. Names might as
well have been `Addition`

and `Multiplication`

or `Rule0`

, `Rule1`

,
`RuleN`

with N being the rule's precedence level.

A lexer operates at the character level, grouping characters into
tokens, and would typically handle the `Integer`

and `Digit`

rules. Single token lookahead refers to the
operation of the parser and not the lexer. The lexer may look many
characters ahead. While inside a number, only looking at the next
digit, the lexer has no way to tell whether the number is an integer
or float. In fact, that's a common reason for separating lexing and
parsing, simplifying the parser.

Grammar rules appear off-by-one:
`Expression`

is about addition and subtraction,
`Term`

about multiplication and division, and so on. If
instead we expand the first rule into two, we end up with Term,
Factor, and so on aligned. The extra rule we
name `Primary`

:

Expression = Term Term = Factor | Term "+" Factor | Term "-" Factor Factor = Power | Factor "*" Power | Factor "/" Power Power = Primary | Power "^" Primary Primary = Integer | "(" Expression ")"

Rewriting rules is valid because a grammar comprising rules ```
A
= B
```

and `B = C | D`

is the same as ```
A = B | C |
D
```

. Another needed rewrite is rules with right associative
operators. For `Power`

, the self-reference must on the
right, as in `Primary "^" Power`

.

Even though the grammar with left and right self-references
captures precedence and associativity, a left self-reference causes a
recursive descent parser to enter an infinite
loop. With `Term`

, for instance, the parser would keep
calling itself without consuming tokens.

That's where
Extended
BNF comes in. It introduces `{ x }`

rule syntax
for `x`

repeated zero or more times. A parser would
implement such repetition with a loop instead of recursion. Rewriting
the grammar to the new syntax and including unary minus and float
support, we end up with the grammar below. In it names such
as `Addition`

, `Multiplication`

,
and `Power`

are short for operators with addition,
multiplication, and power-like precedence:

Expression = Addition Addition = Multiplication (("+" | "+") Multiplication)* Multiplication = Power (("*" | "/") Power)* Power = Unary ("^" Power)* Unary = '-' Unary | Primary Primary = Integer | Float | "(" Expression ")"

Instead of substituting left recursion with iteration, we could rewrite the grammar as all right recursive. Then the parser could apply a transformation upon evaluating left associative operators or to Abstract Syntax Tree (AST) nodes, depending on the parser's output. While possible, it's unnecessarily difficult.

A two-way mapping exists between grammar and recursive descent
parser, turning each rule into a parse method. The inside of each
parse method follows from the rule's definition as per Wirth's recipe
for translating EBNF into code (see Additional references
below). Placing a breakpoint on a parse method, the call stack reveals
the path from `Expression`

to that parse method,
illustrating the relationship between grammar and parser.

To keep the implementation simple, evaluation happens during parsing. Instead of each parse method returning an AST node as with the Shunting Yard implementation, each returns the numeric result of the computation thus far.

Below is a trace from parsing and evaluating an expression. Enter marks an entry into a parse method (20 in total) and the value of the current token. Exit marks the return of a parse method and the value passed up the call chain:

// Equivalent to 2+(3^(4^0.5)*5) > 2+3^4^0.5*5 Enter: Parse, Value: 2 Enter: ParseExpression, Value: 2 Enter: ParseAddition, Value: 2 Enter: ParseMultiplication, Value: 2 Enter: ParsePower, Value: 2 Enter: ParseUnary, Value: 2 Enter: ParsePrimary, Value: 2 Exit: Value: 2 Exit: Value: 2 Exit: Value: 2 Exit: Value: 2 Enter: ParseMultiplication, Value: 3 Enter: ParsePower, Value: 3 Enter: ParseUnary, Value: 3 Enter: ParsePrimary, Value: 3 Exit: Value: 3 Exit: Value: 3 Enter: ParsePower, Value: 4 Enter: ParseUnary, Value: 4 Enter: ParsePrimary, Value: 4 Exit: Value: 4 Exit: Value: 4 Enter: ParsePower, Value: 0.5 Enter: ParseUnary, Value: 0.5 Enter: ParsePrimary, Value: 0.5 Exit: Value: 0.5 Exit: Value: 0.5 Exit: Value: 0.5 Exit: Value: 2 Exit: Value: 9 Enter: ParsePower, Value: 5 Enter: ParseUnary, Value: 5 Enter: ParsePrimary, Value: 5 Exit: Value: 5 Exit: Value: 5 Exit: Value: 5 Exit: Value: 45 Exit: Value: 47 Exit: Value: 47 Exit: Value: 47

With grammar rules mapping to parse methods, the relationship of rules to parse methods, and proper handling of precedence and associativity, is visible.

A recursive descent parser is the most intuitive and simplest to implement. It dates to the early 1960s, but back then it was impractical to implement. The call stack was still a novel concept, function calls were expensive, and stack memory scarce. These limitations lead Dijkstra to develop the Shunting Yard algorithm with its explicit operator and operand stacks. Today, unless we're parsing large programs with many precedence levels, recursive descent should be the starting point. Their stack is the implicit call stack.

In 1973, Pratt developed the Top Down Operator Precedence parser, later called Pratt parser. It outperforms recursive descent and Shunting Yard in that regardless of precedence level, a Pratt parser requires a single parse method call only. It's well suited for grammars with many levels of precedence, and languages with support for dynamic operators, where a programmer may add new operators anywhere in the precedence hierarchy and with any associativity. A similar flexibility can be built into recursive descent by replacing or supplementing the fixed rules/parse methods with a lookup table.

As an example of a hybrid of a recursive descent parser for statements and a Pratt parser for expressions, have a look at the MonkeyLang parser.

Niklaus Wirth's Compiler construction, Chapter 2 through to Section 4.1 (12 pages). With examples, these pages cover almost everything needed to write a recursive descent parser for any language.

Per Vognsen's Programming an x64 compiler from scratch - part 2", from 2h30m to 3h28m, implements a simple expression parser in C. Also, Bitwise, Day 2: C Programming & Parsing from 1h00m and Bitwise, Day 3: More Programming & Parsing from 1h10m to 1h40m are worth a look, though they're overlapping.

Bob Nystrom's Crafting Interpreters, Chapter 6 details how to modify an expression grammar to encode precedence levels.

Eli Bendersky's Some problems of recursive descent parsers details how to transform a right recursive grammar into repetitions and how to handle left and right associative operators.