Building Performant Parsers in Rust with nom and pest
Lukas Schneider
DevOps Engineer · Leapcell

Introduction
In the realm of software development, the need to interpret and process structured data is ubiquitous. Whether it's configuration files, domain-specific languages, network protocols, or even complex user inputs, parsing sits at the heart of many applications. Manually writing parsers can be a tedious and error-prone endeavor, especially for non-trivial grammars, but Rust, with its focus on performance and safety, offers powerful tools to simplify the task. This blog post explores two prominent parsing libraries in the Rust ecosystem, `nom` and `pest`, demonstrating how they empower developers to build efficient and robust parsers with elegance and ease. We'll dive into their methodologies, compare their approaches, and equip you with the knowledge to choose the right tool for your next parsing challenge.
Core Concepts Before We Parse
Before we jump into the intricacies of `nom` and `pest`, let's define some fundamental concepts crucial to understanding their operation:

- Parser: A function or component that takes an input string or byte stream and transforms it into a structured representation, typically an Abstract Syntax Tree (AST) or a simpler data structure.
- Combinator: In the context of parsing, a combinator is a higher-order function that takes one or more parsers as input and returns a new parser. This allows for building complex parsers from simpler, reusable components, resembling functional programming paradigms.
- Grammar: A set of rules that define the valid structure of a language or data format. Grammars are often expressed using formal notations like Backus-Naur Form (BNF) or Extended Backus-Naur Form (EBNF).
- Abstract Syntax Tree (AST): A tree representation of the abstract syntactic structure of source code written in a programming language. Each node in the tree denotes a construct occurring in the source code.
- Lexer (or Tokenizer): The first phase of parsing, which breaks the input text into a sequence of tokens (meaningful units like keywords, identifiers, operators, etc.).
- Parser Generator: A tool that takes a grammar definition as input and automatically generates source code for a parser. `pest` is an example of a parser generator.
- Parser Combinator Library: A library that provides a set of functions (combinators) that can be used to manually construct a parser from smaller parsing functions. `nom` is an example of a parser combinator library.
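To make the combinator idea concrete before introducing either library, here is a minimal, dependency-free sketch (the names `letter`, `digit`, and `pair` are our own illustration, not part of `nom` or `pest`): a parser is just a function from input to an optional `(rest, output)` pair, and a combinator builds a new parser out of existing ones.

```rust
// Parse a single ASCII letter from the front of the input.
fn letter(input: &str) -> Option<(&str, char)> {
    let mut chars = input.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() => Some((chars.as_str(), c)),
        _ => None,
    }
}

// Parse a single ASCII digit from the front of the input.
fn digit(input: &str) -> Option<(&str, char)> {
    let mut chars = input.chars();
    match chars.next() {
        Some(c) if c.is_ascii_digit() => Some((chars.as_str(), c)),
        _ => None,
    }
}

// A combinator: take two parsers, return a new parser that runs them in sequence
// and pairs their outputs.
fn pair<'a, A, B>(
    p1: impl Fn(&'a str) -> Option<(&'a str, A)>,
    p2: impl Fn(&'a str) -> Option<(&'a str, B)>,
) -> impl Fn(&'a str) -> Option<(&'a str, (A, B))> {
    move |input| {
        let (rest, a) = p1(input)?;
        let (rest, b) = p2(rest)?;
        Some((rest, (a, b)))
    }
}

fn main() {
    // Compose "a letter followed by a digit" without writing that parser by hand.
    let letter_then_digit = pair(letter, digit);
    assert_eq!(letter_then_digit("a1rest"), Some(("rest", ('a', '1'))));
    assert_eq!(letter_then_digit("12"), None);
}
```

Both libraries below follow this same shape; they differ in whether you compose the pieces yourself (`nom`) or generate them from a grammar file (`pest`).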
Building Parsers with nom
`nom` is a powerful, zero-copy parser combinator library for Rust. Its design philosophy emphasizes a functional approach, where parsing rules are composed of smaller, easily testable functions. `nom` operates directly on byte slices or string slices, avoiding unnecessary memory allocations and copies, which contributes significantly to its efficiency.

Let's illustrate `nom` with a simple example: parsing a basic key-value pair format like `key:value`.

```rust
use nom::{
    bytes::complete::{tag, take_while1},
    character::complete::{alpha1, multispace0},
    sequence::{preceded, separated_pair},
    IResult,
};

// Define a parser for a key (alphabetic characters)
fn parse_key(input: &str) -> IResult<&str, &str> {
    alpha1(input)
}

// Define a parser for a value (printable characters, stopping at whitespace or end of input)
fn parse_value(input: &str) -> IResult<&str, &str> {
    take_while1(|c: char| c.is_ascii_graphic())(input)
}

// Combine key and value parsers with a separator
fn parse_key_value(input: &str) -> IResult<&str, (&str, &str)> {
    // `separated_pair` takes three parsers: the first element, the separator, and the second element.
    separated_pair(parse_key, tag(":"), parse_value)(input)
}

fn main() {
    let input = "name:Alice\nage:30";
    match parse_key_value(input) {
        Ok((remaining, (key, value))) => {
            println!("Parsed key: {}, value: {}", key, value);
            println!("Remaining input: '{}'", remaining);
        }
        Err(e) => println!("Error parsing: {:?}", e),
    }

    let input_with_whitespace = "  city:NewYork  ";
    let (remaining, (key, value)) = separated_pair(
        preceded(multispace0, parse_key), // Allows optional whitespace before the key
        tag(":"),
        parse_value,
    )(input_with_whitespace)
    .expect("Failed to parse with whitespace");
    println!("Parsed key: {}, value: {}", key, value);
    println!("Remaining input: '{}'", remaining);
}
```
In this example:

- We define `parse_key` and `parse_value` using `nom`'s built-in combinators like `alpha1` (matches one or more alphabetic characters) and `take_while1` (matches characters as long as a condition holds).
- `tag(":")` is a simple parser that matches the literal string `:`.
- `separated_pair` is a powerful combinator that applies three parsers in sequence: a parser for the first element, a parser for the separator, and a parser for the second element. It returns the results of the two element parsers as a tuple.
- `preceded(multispace0, parse_key)` runs the whitespace parser first, discards its result, and then applies the key parser, which is how the second call tolerates leading whitespace.
- The `IResult` type returned by `nom` parsers contains either the remaining input and the parsed value on success, or an error.
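The same building blocks compose further. As a sketch of how the example could be extended (not part of the original snippet), a parser for a whole sequence of `key:value` lines can reuse `parse_key_value` together with `nom`'s `separated_list1` combinator:

```rust
use nom::{character::complete::line_ending, multi::separated_list1, IResult};

// Parse one or more `key:value` pairs separated by newlines,
// reusing `parse_key_value` from the example above.
fn parse_pairs(input: &str) -> IResult<&str, Vec<(&str, &str)>> {
    separated_list1(line_ending, parse_key_value)(input)
}
```

Calling `parse_pairs("name:Alice\nage:30")` would yield a vector containing both pairs, since `parse_value` stops at the newline and `line_ending` consumes it.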
`nom` shines when you need fine-grained control over parsing, are dealing with binary formats, or when performance is absolutely critical due to its zero-copy nature. Its learning curve can be steeper for complete beginners, as it requires understanding how to compose many small parsing functions.
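To illustrate the binary-format point, here is a minimal sketch (our own example, not tied to any particular protocol) of parsing a length-prefixed payload: a big-endian `u16` length followed by that many bytes.

```rust
use nom::{bytes::complete::take, number::complete::be_u16, IResult};

// Parse a length-prefixed payload: a big-endian u16 length, then `len` bytes.
fn parse_length_prefixed(input: &[u8]) -> IResult<&[u8], &[u8]> {
    let (input, len) = be_u16(input)?;
    take(len)(input)
}

fn main() {
    // Length 0x0003, three payload bytes, one trailing byte.
    let data = [0x00, 0x03, 0xAA, 0xBB, 0xCC, 0xDD];
    let (rest, payload) = parse_length_prefixed(&data).expect("valid frame");
    assert_eq!(payload, &[0xAA, 0xBB, 0xCC]);
    assert_eq!(rest, &[0xDD]);
}
```

Because `nom` works on byte slices directly, `payload` is a borrowed view into `data`; nothing is copied.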
Crafting Parsers with pest
`pest` takes a different approach: it is a parser generator. Instead of writing parsing logic in Rust code, you define your grammar in a separate file using `pest`'s EBNF-like syntax. `pest` then generates the parsing code for you, making it very suitable for complex grammars and domain-specific languages (DSLs) where readability and maintainability of the grammar definition are paramount.

Let's parse the same key-value pair format using `pest`. First, define the grammar in a file named `key_value.pest`:
```pest
// key_value.pest
WHITESPACE = _{ " " | "\t" }

key   = @{ ASCII_ALPHA+ }
value = @{ (!NEWLINE ~ ANY)+ }
pair  =  { key ~ ":" ~ value }
```
Next, in your `main.rs`, integrate `pest`:

```rust
use pest::Parser;
use pest_derive::Parser;

// Derive the parser from the grammar file
#[derive(Parser)]
#[grammar = "key_value.pest"] // Path to our grammar file, relative to `src/`
pub struct KeyValueParser;

fn main() {
    // `Rule::pair` matches a single pair, so only the first line is consumed here;
    // parsing a whole file needs a repeating rule (see the extension sketched below).
    let input = "name:Alice\nage:30";

    let pairs = KeyValueParser::parse(Rule::pair, input)
        .expect("Failed to parse input");

    for pair in pairs {
        if pair.as_rule() == Rule::pair {
            let mut inner_rules = pair.into_inner();
            let key = inner_rules.next().unwrap().as_str();
            let value = inner_rules.next().unwrap().as_str();
            println!("Parsed key: {}, value: {}", key, value);
        }
    }

    // Whitespace around the separator is skipped implicitly by the WHITESPACE rule.
    let input_with_whitespace = "city : NewYork";
    let parsed_with_whitespace = KeyValueParser::parse(Rule::pair, input_with_whitespace)
        .expect("Failed to parse with whitespace");

    for pair in parsed_with_whitespace {
        if pair.as_rule() == Rule::pair {
            let mut inner_rules = pair.into_inner();
            let key = inner_rules.next().unwrap().as_str();
            let value = inner_rules.next().unwrap().as_str();
            println!("Parsed key: {}, value: {}", key, value);
        }
    }
}
```
In the `key_value.pest` grammar:

- `WHITESPACE = _{ " " | "\t" }` defines a rule for whitespace. The leading `_` makes it silent, and because a rule with this special name exists, `pest` automatically skips whitespace between the elements of non-atomic rules (here, around the `:` separator).
- `key = @{ ASCII_ALPHA+ }` defines a key as one or more alphabetic characters. The `@` marks the rule as atomic: no implicit whitespace is inserted inside it, and the whole match is captured as a single token.
- `value = @{ (!NEWLINE ~ ANY)+ }` defines a value as one or more of any character that is not a newline (`!NEWLINE` is a negative lookahead). This is a common pattern for "rest of the line" values.
- `pair = { key ~ ":" ~ value }` combines the `key`, literal `":"`, and `value` rules to form a `pair`. The `~` operator denotes sequential matching.
`pest` excels when:
- Dealing with complex, formally defined grammars.
- Grammar readability and maintainability are critical.
- You prefer a declarative way of defining parsing rules.
- The generated parser overhead is acceptable.
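That declarative style pays off as the grammar grows. As a sketch, handling a whole document of pairs needs only one extra grammar rule and no new parsing logic. The `file` rule below is a hypothetical addition of ours, not part of the grammar shown earlier:

```rust
// Hypothetical addition to key_value.pest:
//
//     file = { SOI ~ pair ~ (NEWLINE ~ pair)* ~ NEWLINE? ~ EOI }
//
// With that rule in place, every pair in a multi-line input can be collected in one call.
use pest::Parser;

// Assumes `KeyValueParser` and `Rule` from the example above are in scope.
fn parse_all_pairs(input: &str) -> Result<Vec<(&str, &str)>, pest::error::Error<Rule>> {
    let pairs = KeyValueParser::parse(Rule::file, input)?;
    Ok(pairs
        .flatten()
        .filter(|p| p.as_rule() == Rule::pair)
        .map(|p| {
            let mut inner = p.into_inner();
            (inner.next().unwrap().as_str(), inner.next().unwrap().as_str())
        })
        .collect())
}
```

Each `Pair` carries its matched span, so building a typed AST from this iterator is mostly a matter of pattern-matching on `Rule` variants.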
Choosing Between nom and pest
Both `nom` and `pest` are excellent tools, but they cater to slightly different use cases and preferences:
| Feature | nom | pest |
|---|---|---|
| Approach | Parser combinator library (imperative) | Parser generator (declarative, grammar-driven) |
| Grammar definition | Rust code (functions, macros) | Separate `.pest` file (EBNF-like syntax) |
| Performance | Generally very high (zero-copy parsing) | High, but with some overhead from generated code |
| Flexibility | High; ideal for binary formats, custom logic | Moderate; great for textual grammars |
| Learning curve | Steeper for complex scenarios | More approachable for grammar definition |
| Error handling | Explicit `IResult` handling | Built-in error reporting with span information |
| Use cases | Network protocols, binary data, simple line protocols | DSLs, config files, programming languages, markup |
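To make the error-handling row concrete: a `nom` parser hands back the `Err` variant of `IResult` for the caller to match on, while a `pest` parser returns a `pest::error::Error` whose `Display` output already includes line and column information. A small hypothetical illustration, reusing `KeyValueParser` from above:

```rust
use pest::Parser;

// Reuses `KeyValueParser` and `Rule` from the pest example above.
fn report(input: &str) {
    match KeyValueParser::parse(Rule::pair, input) {
        Ok(pairs) => println!("parsed: {:?}", pairs),
        // The error's Display impl renders a caret-style message with line/column info.
        Err(e) => eprintln!("{}", e),
    }
}

// Example: `report("123:oops")` fails because `123` cannot match `key` (ASCII_ALPHA+).
```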
For raw speed and low-level control, especially with binary input, `nom` is often the go-to choice. Its combinator approach can be incredibly powerful once mastered. For language parsing, DSLs, or any scenario where a clear separation between grammar definition and parsing logic is beneficial, `pest` offers a more declarative and often more readable solution.
Ultimately, the choice often comes down to the complexity of your grammar, your performance requirements, and your comfort level with each paradigm. In some advanced scenarios, developers even combine the two, for example using `nom` to slice apart a low-level binary envelope and a `pest` grammar to parse the textual sections it contains.
Conclusion
Rust provides exceptional capabilities for building efficient and robust parsers, and `nom` and `pest` stand out as the leading libraries in this domain. `nom`, with its functional parser combinator approach, offers excellent performance and fine-grained control, making it ideal for low-level and binary parsing tasks. `pest`, on the other hand, simplifies the creation of complex textual parsers through its powerful grammar definition language and code generation, allowing for clear and maintainable DSLs. By understanding their core principles and application scenarios, Rust developers can confidently select the right tool to tackle any parsing challenge, transforming unstructured data into meaningful insights with precision and speed.