Recursive descent parsing of mathematical expressions with C#

23 Dec 2018
Have a comment or question? Please drop me an email or tweet me @ronnieholm.

Part 1: Lexing mathematical expressions with F#
Part 2: Parsing mathematical expressions with Shunting Yard, F#, and C#
Part 3: Recursive descent parsing of mathematical expressions with C#

(Code for this post is available on GitHub.)

This post describes a backtracking lexer and combined recursive descent LL(1) parser and evaluator for mathematical expressions. We show how to stay with recursive descent, even for arbitrary precedence and associativity. No need to switch to Shunting Yard or Pratt parsing algorithms or to Yacc and family. All that's required is a small change to the grammar and parser for recursive descent to support left and right associative operators and precedence levels.

With recursive descent, grammar and parser follow from each other. The two are isomorphic, leading to easy to understand, easy to debug, and easy to extend parsers. Once we know how to parse an expression with precedence and associativity rules, the techniques apply to more involved grammars like those of programming languages.

Textbook expression grammar

Below is a starting Backus-Naur Form grammar for expressions. We'll evolve it into a grammar and parser with support for more precedence levels, for left and right associative operators, and for integers and floats:

Expression = Term | Expression "+" Term | Expression "-" Term
Term = Factor | Term "*" Factor | Term "/" Factor 
Factor = Power | Factor "^" Power
Power = Integer | "(" Expression ")"
Integer = Digit | Integer Digit    
Digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" 

The grammar is LL(1) meaning that a parser consumes tokens left to right. When a list of tokens match a grammar rule, the parser groups these tokens in a leftmost derivation corresponding to left associativity. For right associative operators, like power, we must construct the rightmost derivation. The number one in parentheses refers to grammar rules defined such that looking one token ahead, but without consuming it, the parser can unambiguously identify one matching grammar rule. Organizing these rules by order of precedence enables support for precedence and associativity.

Rule names come from the domain described by the grammar. Here algebra where a term is an operand involving addition or subtraction and factor an operand involving multiplication or division. Names might as well have been Addition and Multiplication or Rule0, Rule1, RuleN with N being the rule's precedence level.

A lexer operates at the character level, grouping characters into tokens, and would typically handle the Integer and Digit rules. Single token lookahead refers to the operation of the parser and not the lexer. The lexer may look many characters ahead. While inside a number, only looking at the next digit, the lexer has no way to tell whether the number is an integer or float. In fact, that's a common reason for separating lexing and parsing, simplifying the parser.

Adjusting grammar for off-by-one naming

Grammar rules appear off-by-one: Expression is about addition and subtraction, Term about multiplication and division, and so on. If instead we expand the first rule into two, we end up with Term, Factor, and so on aligned. The extra rule we name Primary:

Expression = Term
Term = Factor | Term "+" Factor | Term "-" Factor
Factor = Power | Factor "*" Power | Factor "/" Power
Power = Primary | Power "^" Primary
Primary = Integer | "(" Expression ")"

Rewriting rules is valid because a grammar comprising rules A = B and B = C | D is the same as A = B | C | D. Another needed rewrite is rules with right associative operators. For Power, the self-reference must on the right, as in Primary "^" Power.

Even though the grammar with left and right self-references captures precedence and associativity, a left self-reference causes a recursive descent parser to enter an infinite loop. With Term, for instance, the parser would keep calling itself without consuming tokens.

Eliminating left recursion through EBNF rule rewriting

That's where Extended BNF comes in. It introduces { x } rule syntax for x repeated zero or more times. A parser would implement such repetition with a loop instead of recursion. Rewriting the grammar to the new syntax and including unary minus and float support, we end up with the grammar below. In it names such as Addition, Multiplication, and Power are short for operators with addition, multiplication, and power-like precedence:

Expression = Addition
Addition = Multiplication (("+" | "+") Multiplication)*
Multiplication = Power (("*" | "/") Power)*
Power = Unary ("^" Power)*
Unary = '-' Unary | Primary
Primary = Integer | Float | "(" Expression ")"

Instead of substituting left recursion with iteration, we could rewrite the grammar as all right recursive. Then the parser could apply a transformation upon evaluating left associative operators or to Abstract Syntax Tree (AST) nodes, depending on the parser's output. While possible, it's unnecessarily difficult.

From grammar rules to parser

A two-way mapping exists between grammar and recursive descent parser, turning each rule into a parse method. The inside of each parse method follows from the rule's definition as per Wirth's recipe for translating EBNF into code (see Additional references below). Placing a breakpoint on a parse method, the call stack reveals the path from Expression to that parse method, illustrating the relationship between grammar and parser.

To keep the implementation simple, evaluation happens during parsing. Instead of each parse method returning an AST node as with the Shunting Yard implementation, each returns the numeric result of the computation thus far.

Below is a trace from parsing and evaluating an expression. Enter marks an entry into a parse method (20 in total) and the value of the current token. Exit marks the return of a parse method and the value passed up the call chain:

// Equivalent to 2+(3^(4^0.5)*5)
> 2+3^4^0.5*5
Enter: Parse, Value: 2
    Enter: ParseExpression, Value: 2
        Enter: ParseAddition, Value: 2
            Enter: ParseMultiplication, Value: 2
                Enter: ParsePower, Value: 2
                    Enter: ParseUnary, Value: 2
                        Enter: ParsePrimary, Value: 2
                        Exit: Value: 2
                    Exit: Value: 2
                Exit: Value: 2
            Exit: Value: 2
            Enter: ParseMultiplication, Value: 3
                Enter: ParsePower, Value: 3
                    Enter: ParseUnary, Value: 3
                        Enter: ParsePrimary, Value: 3
                        Exit: Value: 3
                    Exit: Value: 3
                    Enter: ParsePower, Value: 4
                        Enter: ParseUnary, Value: 4
                            Enter: ParsePrimary, Value: 4
                            Exit: Value: 4
                        Exit: Value: 4
                        Enter: ParsePower, Value: 0.5
                            Enter: ParseUnary, Value: 0.5
                                Enter: ParsePrimary, Value: 0.5
                                Exit: Value: 0.5
                            Exit: Value: 0.5
                        Exit: Value: 0.5
                    Exit: Value: 2
                Exit: Value: 9
                Enter: ParsePower, Value: 5
                    Enter: ParseUnary, Value: 5
                        Enter: ParsePrimary, Value: 5
                        Exit: Value: 5
                    Exit: Value: 5
                Exit: Value: 5
            Exit: Value: 45
        Exit: Value: 47
    Exit: Value: 47
Exit: Value: 47

With grammar rules mapping to parse methods, the relationship of rules to parse methods, and proper handling of precedence and associativity, is visible.

Comparing recursive descent, Shunting Yard, and Pratt parsers

A recursive descent parser is the most intuitive and simplest to implement. It dates to the early 1960s, but back then it was impractical to implement. The call stack was still a novel concept, function calls were expensive, and stack memory scarce. These limitations lead Dijkstra to develop the Shunting Yard algorithm with its explicit operator and operand stacks. Today, unless we're parsing large programs with many precedence levels, recursive descent should be the starting point. Their stack is the implicit call stack.

In 1973, Pratt developed the Top Down Operator Precedence parser, later called Pratt parser. It outperforms recursive descent and Shunting Yard in that regardless of precedence level, a Pratt parser requires a single parse method call only. It's well suited for grammars with many levels of precedence, and languages with support for dynamic operators, where a programmer may add new operators anywhere in the precedence hierarchy and with any associativity. A similar flexibility can be built into recursive descent by replacing or supplementing the fixed rules/parse methods with a lookup table.

As an example of a hybrid of a recursive descent parser for statements and a Pratt parser for expressions, have a look at the MonkeyLang parser.

Additional references