Writing a Tag Algebra compiler with Coco/R

This week, i was digging back into the Coco/R implemented parser of DekiScript, tracking down a bug, which turned out to not be in the parsing bits at all. It did, however, get me familiarized with Coco/R again. So I thought i'd give myself an hour to implement the parser for my Tag related boolean algebra with Coco/R. If i could pull it off, forget about the regex/state-machine approach i was considering.

Took me about 15 minutes to set up the classes to represent the intermediate AST and another 30 minutes for the grammar in Coco/R ATG format. After that I wrote a couple of unit tests to check that the parsing was right, only to realize that while AND and OR are left-to-right associative, NOT is right-to-left associative. Figuring out how to adjust the grammar for that actually took me another 10-15 minutes. But overall, I hit the one hour goal.

The Syntax Tree

Before tackling the grammar to parse, I needed to define data structures to represent the parsed syntax tree, which I'd later convert to executable code. The syntax is fairly simple:

(foo+bar)|(^foo+baz)

This can be represented by just 4 tree node types (with common parent TagExpression:

AndExpression
OrExpression
NotExpression
Tag

The parentheses are just tree structure artifacts, so are not represented in the AST.

The Coco/R grammar

I've broken the grammar up to discuss the various parts, but the below sections represent a single ATG file, the Coco/R grammar format.

COMPILER TagAlgebra

  public TagExpression Result = TagExpression.Empty;

The COMPILER defines the entrypoint PRODUCTIONfor the parser. The following lines. until the next grammar definition, are inserted into the generated Parser, and can be used to inject class fields, extra methods, etc. into the Parser source. The only thing I inserted was a field to hold the root of the AST and initialize it to empty.

IGNORECASE

This tells Coco/R that our grammar is case insensitive.

CHARACTERS
  tab = '\\t'.
  eol = '\\n'.
  cr = '\\r'.
  nbsp = '\\u00a0'. // 0xA0 = 'unbreakable space'
  shy = '\\u00ad'.  // 0xAD = 'soft-hyphen'
  letter = 'a'..'z'.
  digit = '0'..'9'.

CHARACTERS defines characters that the parser should recognize for matches.

TOKENS
  tag =
    letter { letter | digit | "_" | ":" }
    .

The only token in the grammar are tag, which is composed from the characters defined above and extra quoted characters.

IGNORE eol + cr + tab + nbsp + shy

IGNORE tells Coco/R what characters have no meaning in parsing the input.

Next come the PRODUCTIONS, i.e. the meat of the grammar. These are the rules for matching input and converting it into code. Coco/R is an LL(1) parser generator, i.e. the grammar must be parsable from Left to Right with Left-canonical derivations and one look-ahead symbol. We also cannot have a loop in our grammar, i.e all possible branches have to lead to a terminal via a unique set of production matches.

PRODUCTIONS

  TagAlgebra                      (. Result = TagExpression.Empty; .)
    =
    [ BinaryExpr<out Result> ]
    .

The first production, is the entry point which, again, sets the result to an empty AST, since the same instance of the parser can parse multiple expressions. It then specifies 0 or 1 BinaryExpr productions.

  BinaryExpr<out TagExpression expr>  (.  expr = null; TagExpression right = null; .)
    =
    NotExpr<out expr> {               (. bool and = false; .)
      (
        "|"
        | "+"                         (. and = true; .)
      )
      NotExpr<out right>              (. expr = and ? (TagExpression)new AndExpression(expr,right) : new OrExpression(expr,right); .)
    }
    .

BinaryExpr is used for both AND and OR expressions, since both take two sub-expressions and combine them with a single operator. The production specifies a left side of a NotExpr followed by an optional operator and another NotExpr. I.e. should our first match happen to not be a BinaryExpr, the parser can fall through to NotExpr return its result instead, without matching the optional production body.

  NotExpr<out TagExpression expr> (. expr = null; int notCount = 0; .)
    =
    {
      "^"                         (. notCount++; .)
    }
    (
      "(" BinaryExpr<out expr> ")"
      | tag                       (. expr = new Tag(t.val); .)
    )                             (. for(var i=0;i<notCount;i++) expr = new NotExpression(expr); .)
    .

END TagAlgebra.

NotExpr, just like BinaryExpr optionally matches on its operator, requiring only that the end of the operation it matches either a BinaryExpr enclosed by parentheses (i.e. not a circular match back into BinaryExpr, since it requires additional surrounding matches), or it matches a tag, the ultimate terminal in the grammar.

There is one tricky bit in this production, i.e. the NOT operator can match multiple times, which means, we need to accumulate the number of operator matches and build the chain of NOT expressions wrapping the current expression once we know how many, if any, matched.

What to do with the AST

The nice thing with Coco/R is that it adds no runtime dependency at all, building a fully self-contained Scanner and Parser. With these built, it is now possible to take Tag Algebra expressions and turn them into an executable tree of Func calls, as described in "Boolean Algebra for Tag queries".

The grammar could have been written to accumulate the unique tags and construct the Func tree right away, but the two benefits of going to an AST first, is that a)the AST can easily be rendered back into text form (even placing parentheses properly for expressions that previously had not parentheses), and b) the AST can easily be programatically composed with other expressions, or decomposed into sub-expressions, which can be used for caching and other efficiency operations.

I'll probably play with Irony next, but the "no runtime dependency" and existing familiarity made Coco/R the winner this time.