Among the exciting new features coming with GQLite 1.6, there is a new parser based on logos for lexing and nom for parsing. This new parser, added to gqlparser library, offers increased maintainability and performance. While performance is important, the parser is not a bottleneck in GQLite. However, the increased maintainability will make it easier to fix some of the parsing bugs, and extend the missing functionalities.

pest

After trying to use flex/bison, some 20 years ago, for parsing a programming language, I quickly switched to hand writting lexer and parser. The primary motivation is that those tools tend to have more drawbacks than benefits. However, I believe that it is important to constantly reassess your views, and check if tooling has improved.

While there are dozens of available generators, I don’t have the time to evaluate all of them. When converting GQLite from C++ to Rust, I settled on pest, it appeared to be very popular with 195M downloads on crates.io, it has an extension for VSCode and it is maintained. I also liked that the syntax of the grammar makes is closed to the BNF Notation, which makes it easier to read.

While pest is popular it fell short for GQLite. The core issue: it doesn’t generate a proper parse tree. The resulting data structure is cumbersome, lacks compile-time guarantees, and loses critical parsing information. For a complex language like OpenCypher, this became a major obstacle. It might be suitable for smaller language, but it becomes a major pain with larger language like OpenCypher. The main problem is that the structure can be simplified to:

struct Node
{
  rule: Rule,
  children: Vec<Node>, 
}

Where Rule is an enum that specify the rule. So essentially, pest translate the text source input into an other structure, that then need to be parsed into your final structure. In theory, the children part of the structure is constant for each rule. However, if you want to write safe Rust code, you shouldn’t make that assumption. There has been a number of effort for pest to provide a parse tree (pest-ast, pest 3) but none have delivered yet.

logos/nom

So after deciding to drop pest, I was left with the choice of trying an other tool, or writting a parser from scratch. The original C++ parser was easy to write, as over the years, I have developed my own toolbox for parsing in C++. Since, it is the first parser I wrote in Rust, I had nothing to start with. I looked around to see what is available, and a popular choice in the Rust ecosystem is nom with 422M downloads on crates.io.

The nice thing about nom is that it gives you the primitives to work with and write your hand-crafted parser. By default, it works on stream of characters, while I was developing the parser this felt awkward. At least for a language like OpenCypher, it is more natural to look at it as a stream of token. So I needed a lexer, and after looking for what is available I settled for logos, it is well documented, well maintained, and fast.

parser mean median slowest fastest
pest 1.126 s 945.2 ms 2.923 s 907.3 ms
nom 1.194 s 1.175 s 1.733 s 901 ms
logos/nom 779.1 ms 767.7 ms 1.359 s 690.4 ms

The parsing was done on pokec_small_import.cypher, a 5MB big, 131716 lines long creation query. You could argue that a different choice of query would give different results. However, when it comes to OpenCypher the only long queries are the queries for creating nodes. And that is the only time parsing speed has an affect, as can be seen on the benchmark results.

Tags:

Categories:

Updated: