Recursive descent parsing of mathematical expressions with C#

23 Dec 2018

Part 1: Lexing mathematical expressions with F#
Part 2: Parsing mathematical expressions with Shunting Yard, F#, and C#
Part 3: Recursive descent parsing of mathematical expressions with C#

(Code for this post is available on GitHub.)

This post describes a backtracking lexer and combined recursive descent LL(1) parser and evaluator for mathematical expressions. We show how to stay with recursive descent, even for arbitrary operator precedence and associativity. There's no need to switch to Shunting Yard or Pratt parsing algorithms or to Yacc. All that's required is a small change to the grammar and parser for recursive descent to support left and right associative operators and precedence levels.

With recursive descent, grammar and parser follow from each other. The two are isomorphic, leading to easy to understand, easy to an debug, and easy to extend parser. Once we know how to parse an expression with precedence and associativity rules, the techniques carry over to more involved grammars like those of programming languages.

Textbook expression grammar

Below is a starting point Backus-Naur Form grammar for expressions. We'll evolve it into a grammar and parser with support for additional precedence levels, for left and right associative operators, and for integers and floats:

Expression = Term | Expression "+" Term | Expression "-" Term
Term = Factor | Term "*" Factor | Term "/" Factor 
Factor = Power | Factor "^" Power
Power = Integer | "(" Expression ")"
Integer = Digit | Integer Digit    
Digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

A couple of points to note about the grammar:

The grammar is LL(1), meaning that a parser would consume tokens left to right. When a list of tokens match a grammar rule, the parser groups these using leftmost derivation, corresponding to left associativity.
The number one in parentheses refers to the grammar rules defined in such a way that looking at most one token ahead, but without consuming it, the parser can unambiguously identify a single matching grammar rule.
Whether a grammar rule's self-reference is to the left or right of the operator determines associativity of that operator.
The order of grammar rules define the order of precedence among operators.

Grammar rule names come from the domain described by the grammar. Here algebra where a term is an operand involving addition or subtraction and factor an operand involving multiplication or division. Names might as well have been Addition and Multiplication or simply Rule0, Rule1, RuleN with N denoting the rule's precedence level.

A lexer operates at the character level, grouping characters into tokens, and would typically implement the Integer and Digit rules. Single token lookahead refers to the operation of the parser and not the lexer. The lexer may look many characters ahead. While inside a number, only looking at the next digit, the lexer has no way to tell whether the number is an integer or float. In fact, that's a common reason for separating lexing and parsing, simplifying the parser.

Adjusting grammar for off-by-one naming

Grammar rules appear off-by-one: Expression is about addition and subtraction, Term is about multiplication and division, and so on. If instead we expand the first rule into two, we end up with Term, Factor, and so on aligned. The extra rule we name Primary:

Expression = Term
Term = Factor | Term "+" Factor | Term "-" Factor
Factor = Power | Factor "*" Power | Factor "/" Power
Power = Primary | Power "^" Primary
Primary = Integer | "(" Expression ")"

Rewriting rules is valid because a grammar comprising of rules A = B and B = C | D is identical to A = B | C | D.

Adjusting grammar for right associativity

For right associative operators, we can apply another grammar rewrite rule. With Power for instance, it must be rewritten to Primary "^" Power. Power must go on the right, resulting in rightmost derivation and hence right associativity.

Even though the grammar with left and right self-references captures precedence and associativity, a left self-reference causes a recursive descent parser to enter an infinite loop. With Term for instance, the parser would keep calling itself without consuming tokens.

Eliminating left recursion through EBNF rule rewriting

This is where Extended BNF comes in. It introduces { x } rule syntax for x repeated zero or more times. A parser would implement such repetition with a loop instead of recursion. Rewriting the grammar to the new syntax and including unary minus and float support, we end up with the grammar below. Names such as Addition, Multiplication, and Power are short for operators with addition, multiplication, and power-like precedence:

Expression = Addition
Addition = Multiplication {("+" | "+") Multiplication}
Multiplication = Power {("*" | "/") Power}
Power = Unary {"^" Power}
Unary = '-' Unary | Primary
Primary = Integer | Float | "(" Expression ")"

Instead of substituting left recursion with iteration, we could rewrite the grammar as all right recursive. The parser could then apply a transformation upon evaluating left associative operators or to Abstract Syntax Tree (AST) nodes, depending on the parser's output. While possible, it's unnecessarily difficult.

From grammar rules to parser

A two-way mapping exists between grammar and recursive descent parser, turning each rule into a parse method and vice versa. The inside of each parse method follows from the rule's definition as per Wirth's recipe for translating EBNF into code (see Additional references below). Placing a breakpoint on a parse method, the call stack reveals the path from Expression to that parse method, showing the close relationship between grammar and parser.

To keep the implementation simple, evaluation happens during parsing. Instead of each parse method returning an AST node as with the Shunting Yard implementation, each parse method returns the numeric result of the computation so far.

Below is a trace from parsing and evaluating an expression. Enter marks an entry into a parse method (20 in total) and the value of the current token. Exit marks the return from a parse method and the value passed up the call chain:

// Equivalent to 2+(3^(4^0.5)*5)
> 2+3^4^0.5*5
Enter: Parse, Value: 2
    Enter: ParseExpression, Value: 2
        Enter: ParseAddition, Value: 2
            Enter: ParseMultiplication, Value: 2
                Enter: ParsePower, Value: 2
                    Enter: ParseUnary, Value: 2
                        Enter: ParsePrimary, Value: 2
                        Exit: Value: 2
                    Exit: Value: 2
                Exit: Value: 2
            Exit: Value: 2
            Enter: ParseMultiplication, Value: 3
                Enter: ParsePower, Value: 3
                    Enter: ParseUnary, Value: 3
                        Enter: ParsePrimary, Value: 3
                        Exit: Value: 3
                    Exit: Value: 3
                    Enter: ParsePower, Value: 4
                        Enter: ParseUnary, Value: 4
                            Enter: ParsePrimary, Value: 4
                            Exit: Value: 4
                        Exit: Value: 4
                        Enter: ParsePower, Value: 0.5
                            Enter: ParseUnary, Value: 0.5
                                Enter: ParsePrimary, Value: 0.5
                                Exit: Value: 0.5
                            Exit: Value: 0.5
                        Exit: Value: 0.5
                    Exit: Value: 2
                Exit: Value: 9
                Enter: ParsePower, Value: 5
                    Enter: ParseUnary, Value: 5
                        Enter: ParsePrimary, Value: 5
                        Exit: Value: 5
                    Exit: Value: 5
                Exit: Value: 5
            Exit: Value: 45
        Exit: Value: 47
    Exit: Value: 47
Exit: Value: 47

With grammar rules mapping closely to parse methods and correct handling of precedence and associativity, the parsing has become explicit.

Comparing recursive descent, Shunting Yard, and Pratt parsers

A recursive descent parser is the most intuitive and simplest to implement. It was conceived of in the early 1960s, where it was impractical to implement. The call stack was still a novel concept, function calls were expensive, and stack memory scarce. These limitations led Dijkstra to develop the Shunting Yard algorithm. It uses explicit operator and operand stacks over the implicit call stack. Today, unless parsing large programs with many precedence levels, recursive descent should be the default option.

In 1973, Pratt developed the Top Down Operator Precedence parser, later called Pratt parser. It outperforms recursive descent and Shunting Yard in that regardless of an operator's precedence level, a Pratt parser requires a single parse method call only. It's well suited for grammars with many levels of precedence, and languages with dynamic operator support, where a new operator may be added anywhere in the precedence hierarchy and with any associativity. A similar flexibility can be built into recursive descent by replacing or supplementing the fixed rules/parse methods with a lookup table.

As an example of a hybrid of a recursive descent parser for statements and a Pratt parser for expressions, have a look at the Monkey-CSharp parser.

Additional references

Niklaus Wirth's Compiler construction, Chapter 2 through to Section 4.1 (12 pages). With examples, these pages cover almost everything needed to write a recursive descent parser for any language.
Per Vognsen's Programming an x64 compiler from scratch - part 2", from 2h30m to 3h28m, implements a simple expression parser in C. Also, Bitwise, Day 2: C Programming & Parsing from 1h00m and Bitwise, Day 3: More Programming & Parsing from 1h10m to 1h40m are worth a look, though overlapping.
Bob Nystrom's Crafting Interpreters, Chapter 6 details how to modify an expression grammar to encode precedence levels.
Eli Bendersky's Some problems of recursive descent parsers details how to transform a right recursive grammar into repetitions and how to handle left and right associative operators.