Parsing in Rust

2020-12-21T00:00:00+00:00

In a world where everything is increasingly YAML, you might find yourself wondering: “why bother to write a parser?” For starters, I recommend reading the YAML specification if you haven’t already, but more importantly: there are many domains which are better modeled with domain-specific semantics and syntax. When I was younger, parsing was typically done with lex/yacc/bison/whatever and was complete drudgery, but there are a few great modern tools in the Rust ecosystem that make writing parsers fun.

I first dabbled in writing parsers with ANTLRv4, which is an absolutely fantastic toolset for the job. The primary author, Terence Parr, has written a number of good books such as “The Definitive ANTLR 4 Reference” and “Language Implementation Patterns”, both of which I recommend even if you’re not setting out to write that next great programming language.

In Rust our options are also pretty decent. When I first ventured into writing Rust I discovered antlr4rust, which I promptly bookmarked and then set aside until I had a parsing project. Once I finally had one, I revisited antlr4rust and found that I didn’t like the ANTLR-like semantics in Rust. It didn’t feel idiomatic enough for me to be comfortable with it.

More recently I discovered Pest, which I have now used within Otto and in my most recent experiment, Jenkins Declarative Parser.

The grammar is similar enough to ANTLR that I was able to get started and express my ideas quite quickly. Still, I haven’t become clever enough to use parser-level stack manipulations, so I think that means I remain a parser-simpleton.

Below is an example of the grammar necessary to parse the script { } step in Declarative Jenkins Pipelines, which allows arbitrary Groovy code inside it (I didn’t want to parse the Groovy too).

scriptStep = { "script" ~ opening_brace ~ groovy ~ closing_brace }
groovy = {
            (
            // Handle nested structures
            (opening_brace ~ groovy ~ closing_brace)
            | (!closing_brace ~ ANY)
            )*
         }

stagesDecl = { "stages" ~
                opening_brace ~
                stage+ ~
                closing_brace
              }
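To make the recursion in the groovy rule a bit more concrete, here is a rough hand-written approximation in plain Rust. It does not use pest at all, and the function names groovy_body_len and script_body are made up for this sketch. It also assumes whitespace between tokens is skipped and that an unmatched opening brace is an error, which is stricter than the grammar above, where the second alternative would swallow a stray opening brace as ANY. Like the grammar, it knows nothing about braces inside Groovy strings or comments.

```rust
/// Approximates the `groovy` rule: scan forward, tracking nested braces,
/// until reaching the closing brace that ends the enclosing block.
/// Returns the byte offset of that closing brace, i.e. the body length.
fn groovy_body_len(input: &str) -> Option<usize> {
    let mut depth = 0usize;
    for (idx, ch) in input.char_indices() {
        match ch {
            // A nested structure opens: (opening_brace ~ groovy ~ closing_brace)
            '{' => depth += 1,
            // A closing brace at depth zero belongs to the enclosing step
            '}' if depth == 0 => return Some(idx),
            // Otherwise it closes a nested structure
            '}' => depth -= 1,
            // Anything else is consumed as-is, like (!closing_brace ~ ANY)
            _ => {}
        }
    }
    None // ran out of input before the block was closed
}

/// Approximates `scriptStep`: the keyword, an opening brace, the opaque
/// Groovy body, and the matching closing brace. Returns the body text.
fn script_body(input: &str) -> Option<&str> {
    let rest = input.trim_start().strip_prefix("script")?;
    let rest = rest.trim_start().strip_prefix('{')?;
    let len = groovy_body_len(rest)?;
    Some(&rest[..len])
}
```

The point is only that the Groovy is treated as an opaque blob whose braces happen to balance. Writing the same thing as a two-line grammar rule is a good part of why the grammar approach is pleasant.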

The qualifiers and details on the grammar can be found in the pest_derive crate’s documentation.

Once compiled into the Rust program, using the generated parser is a little goofy but still very workable. A snippet:

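// The `Parser` trait must be in scope for `PipelineParser::parse` to resolve.
use pest::Parser;

// `buffer` holds the pipeline source text.
let parser = PipelineParser::parse(Rule::pipeline, buffer)?;

for parsed in parser {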
    match parsed.as_rule() {
        Rule::agentDecl => {
            // parse the agent {} declaration
        }
        Rule::stagesDecl => {
            parse_stages(&mut parsed.into_inner())?;
        }
        _ => {}
    }
}
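The loop above leaves open what the match arms actually produce. One possible shape, sketched here without pest so that it stands on its own: the Decl enum is a hypothetical stand-in for the pairs a generated parser yields, and Pipeline and build_pipeline are names invented for this example rather than anything from Otto or Jenkins Declarative Parser.

```rust
/// Hypothetical stand-in for parsed declarations; not a pest type.
#[derive(Debug, PartialEq)]
enum Decl {
    Agent(String),
    Stages(Vec<String>),
    Other,
}

/// The internal model the rest of the program would consume.
#[derive(Debug, Default, PartialEq)]
struct Pipeline {
    agent: Option<String>,
    stages: Vec<String>,
}

/// Fold the declarations into a `Pipeline`, ignoring anything unrecognized
/// and rejecting input that never declared any stages.
fn build_pipeline(decls: Vec<Decl>) -> Result<Pipeline, String> {
    let mut pipeline = Pipeline::default();
    for decl in decls {
        match decl {
            Decl::Agent(name) => pipeline.agent = Some(name),
            Decl::Stages(names) => pipeline.stages.extend(names),
            Decl::Other => {}
        }
    }
    if pipeline.stages.is_empty() {
        return Err("a pipeline needs at least one stage".to_string());
    }
    Ok(pipeline)
}
```

Keeping the model separate from the parse tree means the rest of the program never has to know which parser produced it.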

The parsers I am writing tend to be relatively simplistic, taking user-friendly models and turning them into internal data structures for further use. While basic, it reminds me of the domain-specific language (DSL) “fad” among Rubyists. I once joked “for loving Ruby so much, Rubyists sure do spend a lot of time building tools to avoid writing Ruby.” Once you have a simple and easy approach to creating syntax and tooling that better models the domain you’re working in, it’s hard to avoid!

YAML, XML, and JSON have their place as data serialization formats, but far too frequently they’re used for configuration or other descriptive purposes. Many developers will cite “everybody knows YAML” to justify using it, thereby overlooking that “syntax” and “semantics” are two very distinct pieces of the puzzle. Yes, most everybody grasps the basics of YAML syntax; however, the keys a program treats as semantically significant for its configuration (see: Kubernetes) are a very different story.

The next time you find yourself needing to describe or model complex concepts for your program, consider creating a language to describe it! Writing the parser will be easier than you might think!

rtyler
