Parsing mathematical expressions with Shunting Yard, F#, and C#

14 Aug 2015

Part 1: Lexing mathematical expressions with F#
Part 2: Parsing mathematical expressions with Shunting Yard, F#, and C#
Part 3: Recursive descent parsing of mathematical expressions with C#

(Code for this post is available on GitHub. There's both an F# and a C# version. The C# version is more advanced and constructs an Abstract Syntax Tree (AST) during parsing. It then implements a tree-walking interpreter to output the numeric result, the AST, and the prefix and postfix form of the expression.)

In the previous post, focus was on how to construct an expression parser, turning a string such as "-12+45*56" into a list of tokens: UnaryMinOp, Integer 12, BinPlusOp, Integer 34, BinMulOp, and Integer 56. Building on the lexer, this post employs Dijkstra's Shunting Yard algorithm (see original '61 paper, page 22) to reduce the list of tokens to a single Integer, evaluating the infix expression. What Shunting Yard provides is an elegant solution to the problem of non-trivial, non-left to right processing of tokens due to operator associativity and precedence.

Associativity and precedence

Operator associativity and operator precedence exist to disambiguate the order of evaluation of sub-expressions in a expression with no parentheses. The rules of associativity come into play when an expression contains multiple operators at the same level of precedence. For instance, with the binary exponentiation operator being right associative, 2^3^4 and 2^(3^4) are equivalent. As for precedence, its rules are needed when an expression contains operators at different levels of precedence. Multiplication trumping addition is what makes 2+3*4 and 2+(3*4) equivalent.

In case the exact definition of which operators are left and right associative and their levels of precedence appear somewhat arbitrary, it's because it is. Luckily, we all adhere to the same arbitrary rules in order to reduce the number of parentheses required.

The Shunting Yard algorithm

Central to Shunting Yard is the table of associativity and precedence. During execution, the algorithm consults this table to determine if an expression is to be immediately evaluated or its evaluation deferred. The algorithm is naively processing tokens left to right and thus doesn't have the foresight of a human knowing what comes next. This apparent limitation is probably due to its 1961 heritage working with severely limited resources.

Following mathematical conventions, here's the table listing associativity and precedence for the operators supported by our language:

type Associativity =
    | Left
    | Right

let configuration =
    [(Token.BinExpOp, 4, Right)
     (Token.UnaryMinOp, 3, Left)
     (Token.BinMulOp, 2, Left)
     (Token.BinDivOp, 2, Left)
     (Token.BinPlusOp, 1, Left)
     (Token.BinMinOp, 1, Left)]

In order to defer evaluation and keep track of intermediate results, Shunting Yard makes use of two stacks: one for operators and one for operands. Rather than defining new data types to store on the stack, mimicking what's already stored in the Token type, we use the existing Token type:

let operators = Stack<Token>()
let operands = Stack<Token>()

With these introductory definitions out of the way, let's turn our attention to Shunting Yard in pseudo-code form (a modified version of the one on Wikipedia). The later implementation contains an inlined version of the pseudo-code for traceability:

while tokens to be read
  read token
  if token is operand then push onto operand stack
  if token is unary prefix operator then push onto operator stack
  if token is binary operator, o1, then
    while operator token, o2, at top of operator stack, and
        either o1 is left associative and its precedence is <= to that of o2
        or o1 is right associative and its precedence < that of o2
      reduce expression
    push o1 onto the operator stack
  if token is left paren then push it onto operator stack
  if token is right paren
    until token at top of operator stack is left paren
      reduce expression
    if operator stack runs out without finding left paren then mismatched parens
    pop left paren from stack
when no more tokens to read and
  while still tokens on operator stack
    if operator token on top of stack is paren then mismatched parens
    reduce expression
  pop result of operand stack

reduce expression
  pop operator off operator stack
  pop operands off operand stack
  process expression
  push result onto operand stack

At first glance, Shunting Yard may appear daunting. It's probably best understood through initial pen and paper evaluation of gradually more complex expressions.

Implementing the algorithm bottom up, we require a few functions to answer questions such as whether a token is an operand or an operator and whether the operator is unary or binary. We could've encoded these binary decisions in the configuration table, but encoding the mapping through functions seemed more natural:

// val lookupOperator : token:Token -> int * Associativity
let lookupOperator token = 
    configuration 
    |> List.find (fun (t, _, _) -> token = t) 
    |> fun (_, p, a) -> p, a

// val isOperand : _arg1:Token -> bool
let isOperand = function
    | Integer _ -> true
    | _ -> false

// val isUnaryOperator : _arg1:Token -> bool
let isUnaryOperator = function
    | UnaryMinOp | UnaryPlusOp -> true
    | _ -> false

// val isBinaryOperator : _arg1:Token -> bool
let isBinaryOperator = function
    // not (isUnaryOperator token) will not work since Integer is neither
    | BinPlusOp | BinMinOp | BinMulOp | BinDivOp | BinExpOp -> true
    | _ -> false

Next follows the reduce expression part of the algorithm. When the main part decides that it's time, reduce expression is where operator and operand tokens get reduced to a simpler Integer form. At this point Shunting Yard could equally well construct an abstract syntax tree from the tokens. The algorithm would remain the same, but reduceExpression would no longer push and pop Integers off the operand stack, but types of syntax tree nodes. The node which remains on the operand stack after all tokens have been processed would be the root node of the syntax tree as exemplified by this C# example.

// val reduceExpression : unit -> unit
let reduceExpression() = 
    let extractValue t =
        match t with
        | Integer i -> i
        | _ -> failwithf "Unsupported %A" t

    let operator = operators.Pop()
    if operator = UnaryMinOp then
        let operand = operands.Pop() |> extractValue
        operands.Push(Integer -operand)
    else if operator = UnaryPlusOp then operands.Pop() |> ignore
    else
        let left = operands.Pop() |> extractValue
        let right = operands.Pop() |> extractValue
        match operator with
        | BinPlusOp -> operands.Push(Integer (left + right))
        | BinMinOp -> operands.Push(Integer (left - right))
        | BinMulOp -> operands.Push(Integer (left * right))
        | BinDivOp -> operands.Push(Integer (left / right))
        | BinExpOp -> 
            operands.Push(Integer (Math.Pow(float left, float right) |> int))
        | _ -> failwithf "Unsupported operator %A" operator

And finally, the main part of Shunting Yard, tying together the pieces and including Unquote tests:

// val evaluateExpression : tokens:Token list -> Token
let evaluateExpression tokens =
    // while tokens to be read
    // val eval : _arg1:Token list -> Token
    let rec eval = function
        // read token
        | hd :: tl ->
            // if token is operand then push onto operand stack
            if isOperand hd then operands.Push(hd)

            // if token is unary prefix operator then push onto operator stack
            else if isUnaryOperator hd then operators.Push(hd)

            // if token is binary operator, o1, then
            else if isBinaryOperator hd then
                // while operator token, o2, at top of operator stack, and
                //     either o1 is left associative and its precedence <= o2
                //     or o1 is right associative and its precedence < o2
                //   reduce expression
                // push o1 onto the operator stack
                let pO1, aO1 = lookupOperator hd
                let isReduceRequired operator =                 
                    let o2 = 
                        if isBinaryOperator(operator) 
                        then Some (lookupOperator(operator)) 
                        else None
                    match o2 with
                    | Some(pO2, aO2) -> 
                        (aO1 = Left && pO1 <= pO2) || (aO1 = Right && pO1 < pO2)
                    | None -> false

                while operators.Count > 0 && isReduceRequired(operators.Peek()) do 
                    reduceExpression()
                operators.Push(hd)

            // if token is left paren then push it onto operator stack
            else if hd = LParen then operators.Push(hd)

            // if token is right paren
            else if hd = RParen then
                // until token at top of operator stack is left paren
                while operators.Count > 0 && operators.Peek() <> LParen do 
                    reduceExpression()

                // if operator stack runs out without finding left paren 
                // then mismatched parens
                if operators.Count = 0 then failwith "Unmatched parens"

                // pop left paren from stack
                if operators.Peek() = LParen then operators.Pop() |> ignore

            eval tl
        | [] ->
            // when no more tokens to read and
            // while still tokens on operator stack
            while operators.Count > 0 do
                // if operator token on top of stack is paren 
                // then mismatched parens
                if operators.Peek() = LParen || operators.Peek() = RParen then
                    failwith "Unmatched paren"

                // reduce expression
                reduceExpression()

            // pop result of operand stack
            operands.Pop()

    eval tokens

// val eval : (string -> Token)
let eval = parseExpression >> evaluateExpression 

test <@ "1" |> eval = Integer 1 @>
test <@ "-1" |> eval = Integer -1 @>
test <@ "1+2" |> eval = Integer 3 @>
test <@ "1+-2" |> eval = Integer -1 @>
test <@ "-(1+2)" |> eval = Integer -3 @>
test <@ "2^3^2" |> eval = Integer 512 @>
test <@ "1+2*3" |> eval = Integer 7 @>
test <@ "4^5/1+2*3" |> eval = Integer 1030 @>

Conclusion

Writing a lexer, parser, and evaluator by hand, even for this simple language of mathematical expressions, was no trivial task. Yet it proved a great learning experience and I've since taken the ideas to writing a parser and evaluator for a domain specific language. For this SPCalendarRecurrenceExpander, which turns each SharePoint calendar recurrence event into a series of individual events, taking into account recurrence exceptions, parsing entails generating an abstract syntax tree from the appointment. Evaluation is then consuming the abstract syntax tree, turning the recurrence into individual appointments.