
---
layout: developer-doc
title: Flexer
category: syntax
tags: [parser, flexer, lexer, dfa]
order: 3
---

# Flexer

The flexer is a finite-automata-based engine for generating lexers. Akin to flex and other lexer generators, it is given a definition as a series of rules from which it then generates code for a highly-optimised lexer.

## Pattern Description

The definition of a lexer using the flexer library consists of a set of rules for how to behave when matching portions of syntax. These rules behave as follows:

- A rule describes a regex-like pattern.
- It also describes the code to be executed when the pattern is matched.
```rust
pub fn lexer_definition() -> String {
    // ...

    let chr   = alphaNum | '_';
    let blank = Pattern::from('_');

    lexer.rule(lexer.root,blank,"self.on_ident(token::blank(self.start_location))");

    // ...
}
```

A pattern, such as `chr` or `blank` above, describes the sequence of characters that it matches. The flexer library provides a set of basic matchers for constructing such patterns.

A `lexer.rule(...)` definition consists of the following parts:

- A state, used for grouping rules and named for debugging (see the section on state management below).
- A pattern, as described above.
- The code that is executed when the pattern matches.
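
The pattern component above can be pictured with a small combinator sketch. The following is a minimal, hypothetical model of regex-like patterns (the names are illustrative, not the real flexer `Pattern` API), showing how a pattern such as `alphaNum | '_'` can be built from alternation, sequencing, and repetition, and then matched against input. Note that this toy matcher interprets the pattern directly, whereas the flexer compiles patterns into a DFA:

```rust
// A minimal, hypothetical model of regex-like patterns. The variant and
// method names are illustrative only, not the real flexer `Pattern` API.
#[derive(Clone)]
enum Pattern {
    Char(char),                      // match one specific character
    Range(char, char),               // match any character in an inclusive range
    Or(Box<Pattern>, Box<Pattern>),  // alternation, like `a | b`
    Seq(Box<Pattern>, Box<Pattern>), // sequencing, like `a >> b`
    Many(Box<Pattern>),              // zero or more repetitions
}

impl Pattern {
    fn or(self, other: Pattern) -> Pattern  { Pattern::Or(Box::new(self), Box::new(other)) }
    fn seq(self, other: Pattern) -> Pattern { Pattern::Seq(Box::new(self), Box::new(other)) }
    fn many(self) -> Pattern                { Pattern::Many(Box::new(self)) }

    /// Try to match a prefix of `input`, returning the number of characters matched.
    fn match_prefix(&self, input: &[char]) -> Option<usize> {
        match self {
            Pattern::Char(c)       => (input.first() == Some(c)).then(|| 1),
            Pattern::Range(lo, hi) => input.first().filter(|c| (*lo..=*hi).contains(*c)).map(|_| 1),
            Pattern::Or(a, b)      => a.match_prefix(input).or_else(|| b.match_prefix(input)),
            Pattern::Seq(a, b)     => {
                let n = a.match_prefix(input)?;
                b.match_prefix(&input[n..]).map(|m| n + m)
            }
            Pattern::Many(p)       => {
                let mut n = 0;
                while let Some(m) = p.match_prefix(&input[n..]) {
                    if m == 0 { break; }
                    n += m;
                }
                Some(n)
            }
        }
    }
}

/// Length of the prefix of `s` matching `var = lowerLetter >> (alphaNum | '_').many()`.
fn match_var(s: &str) -> usize {
    let lower = Pattern::Range('a', 'z');
    let num   = Pattern::Range('0', '9');
    let chr   = lower.clone().or(Pattern::Range('A', 'Z')).or(num).or(Pattern::Char('_'));
    let var   = lower.seq(chr.many());
    var.match_prefix(&s.chars().collect::<Vec<_>>()).unwrap_or(0)
}

fn main() {
    assert_eq!(match_var("foo_1 bar"), 5); // matches "foo_1", stops at the space
    assert_eq!(match_var("_foo"), 0);      // `var` must start with a lowercase letter
    println!("ok");
}
```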

## State Management

States in the flexer engine provide a mechanism for grouping sets of rules together. At any given time, only the rules from the currently active state are considered by the lexer.

- States are named for purposes of debugging.
- You can activate another state from within the flexer instance by using `state.push(new_state)`.
- You can deactivate the topmost state by using `state.pop()`.
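
The state mechanism can be pictured as a simple stack of state identifiers. The sketch below is a hypothetical model (not the real flexer API) of how `push` and `pop` change which state is active:

```rust
// A simplified, hypothetical model of the flexer's state stack. Rules are
// grouped by state, and only the rules of the state on top of the stack are
// consulted. Names here are illustrative, not the real flexer API.
#[derive(Debug, Clone, Copy, PartialEq)]
struct StateId(usize);

struct StateStack {
    stack: Vec<StateId>,
}

impl StateStack {
    /// Create the stack with the root state active.
    fn new(root: StateId) -> Self {
        Self { stack: vec![root] }
    }

    /// Activate `new_state`; its rules now take priority.
    fn push(&mut self, new_state: StateId) {
        self.stack.push(new_state);
    }

    /// Deactivate the topmost state, returning to the previous one.
    /// The root state is never popped.
    fn pop(&mut self) -> Option<StateId> {
        if self.stack.len() > 1 { self.stack.pop() } else { None }
    }

    /// The state whose rules are currently considered by the lexer.
    fn active(&self) -> StateId {
        *self.stack.last().expect("the stack always holds the root state")
    }
}

fn main() {
    let root            = StateId(0);
    let ident_sfx_check = StateId(1);
    let mut states      = StateStack::new(root);
    assert_eq!(states.active(), root);

    states.push(ident_sfx_check); // e.g. entered after matching an identifier
    assert_eq!(states.active(), ident_sfx_check);

    states.pop(); // suffix check done, back to the root state
    assert_eq!(states.active(), root);
    println!("ok");
}
```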

## Code Generation

The patterns in a lexer definition are used to generate a highly-efficient and specialised lexer. This translation process works as follows:

1. All rules are taken and used to generate an NFA.
2. A DFA is generated from the NFA using the standard subset construction algorithm, augmented with optimisations that ensure the following properties hold:
   - Patterns are matched in the order in which they are defined.
   - The associated code chunks are maintained properly.
   - Lexing is O(n), where n is the length of the input.
3. The DFA is used to generate the code for an `Engine` struct, containing the definition of the `Lexer`.

The `Engine` generated through this process contains a main loop that consumes the input stream character by character, evaluating a big switch statement generated from the DFA, using functions from the `Lexer`.

Lexing proceeds from the top of the rule list to the bottom, and the first pattern that matches fully is chosen. This differs from most other lexer generators, which instead choose the longest match. Once a pattern has matched, the associated code is executed, and the process starts over until the input stream has been consumed.
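
The difference between first-match and longest-match semantics can be illustrated with a toy rule dispatcher. All names here are hypothetical, not the generated engine's API:

```rust
// A toy dispatcher illustrating first-match semantics: rules are tried in
// definition order, and the first one that matches wins, rather than the
// longest match. All names are hypothetical, not the generated engine's API.
type Rule = (&'static str, fn(&str) -> Option<usize>);

/// Return the first rule (in definition order) that matches a prefix of
/// `input`, together with the length of that match.
fn first_match(rules: &[Rule], input: &str) -> Option<(&'static str, usize)> {
    for &(name, matcher) in rules {
        if let Some(len) = matcher(input) {
            return Some((name, len));
        }
    }
    None
}

fn match_blank(s: &str) -> Option<usize> {
    // blank: a single underscore
    if s.starts_with('_') { Some(1) } else { None }
}

fn match_var(s: &str) -> Option<usize> {
    // var: a non-empty run of lowercase letters and underscores
    let n = s.chars().take_while(|c| c.is_ascii_lowercase() || *c == '_').count();
    if n > 0 { Some(n) } else { None }
}

fn main() {
    let rules: [Rule; 2] = [("blank", match_blank), ("var", match_var)];
    // "var" would give the longer match "_foo", but "blank" is defined first
    // and matches, so it wins under first-match semantics. A longest-match
    // lexer would have chosen "var" here.
    assert_eq!(first_match(&rules, "_foo"), Some(("blank", 1)));
    assert_eq!(first_match(&rules, "foo!"), Some(("var", 3)));
    println!("ok");
}
```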

### Automated Code Generation

In order to avoid the lexer definition getting out of sync with its implementation (the generated engine), it is necessary to create a separate crate for the generated engine that has the lexer definition as one of its dependencies.

This separation enables a call to `flexer.generate_specialized_code()` in `build.rs` (or a macro) during compilation. The output can be stored in a new file, e.g. `lexer-engine.rs`, and exported from the library with `include!("lexer-engine.rs")`. The project structure would then be as follows:

```
- lib/rust/lexer/
  - definition/
    - src/
      - lexer.rs
    - cargo.toml
  - generation/
    - src/
      - lexer.rs   <-- contains `include!("lexer-engine.rs")`
    - build.rs     <-- calls `lexer_definition::lexer_source_code()`
                       and saves its output to `src/lexer-engine.rs`
    - cargo.toml   <-- lexer-definition is in both dependencies and build-dependencies
```

With this design, `flexer.generate_specialized_code()` will be executed on each rebuild of `lexer/generation`. Therefore, `generation` should contain only the minimum amount of logic (e.g. tests should live in a separate crate), and its dependencies should ideally include only code that directly influences the content of the generated code, in order to minimise unnecessary calls to the expensive flexer specialisation.
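
Under these assumptions, the `generation` crate's build script could look roughly like the sketch below. The `lexer_source_code` function here is a stub standing in for the real `lexer_definition::lexer_source_code()`, and the output is written to a temporary directory so that the sketch is self-contained:

```rust
// A sketch of the `generation` crate's build script, under the project
// structure shown above. `lexer_source_code` is a stub standing in for
// `lexer_definition::lexer_source_code()`.
use std::env;
use std::fs;
use std::path::{Path, PathBuf};

// Stand-in for `lexer_definition::lexer_source_code()` (hypothetical output).
fn lexer_source_code() -> String {
    "// generated lexer engine\n".to_string()
}

/// Write the generated engine source into `out_dir` and return its path.
fn generate_engine(out_dir: &Path) -> std::io::Result<PathBuf> {
    let path = out_dir.join("lexer-engine.rs");
    fs::write(&path, lexer_source_code())?;
    Ok(path)
}

fn main() -> std::io::Result<()> {
    // A real build.rs would write into the crate's `src/` directory (or into
    // `OUT_DIR`) so that `include!("lexer-engine.rs")` in `src/lexer.rs`
    // picks the file up; a temporary directory is used here for illustration.
    let path = generate_engine(&env::temp_dir())?;
    assert!(fs::read_to_string(&path)?.contains("generated lexer engine"));
    println!("wrote {}", path.display());
    Ok(())
}
```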

### Notes on Code Generation

The following properties are likely to hold for the code generation machinery.

- The vast majority of the code generated by the flexer will be the same for all lexers.
- The main generated logic lives in `consume_next_character`, which takes a `Lexer` as an argument.

## Structuring the Flexer Code

In order to unify the API between the definition and generated usages of the flexer, the API is separated into the following components:

- `Flexer`: The main flexer definition itself, providing functionality common to the definition and implementation of all lexers.
- `FlexerState`: The stateful components of a lexer definition. This trait is implemented for a particular lexer definition, allowing the user to store arbitrary data in their lexer, as needed.
- User-defined lexer: The user can then define a lexer that wraps the flexer, specialised to the particular `FlexerState` that the user has defined. It is recommended to implement `Deref` and `DerefMut` between the defined lexer and the `Flexer`, to allow for ease of use.
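
The recommended `Deref`/`DerefMut` wrapping can be sketched as follows. The shapes of `Flexer` and the state type are hypothetical simplifications of the components described above:

```rust
use std::ops::{Deref, DerefMut};

// A hypothetical, heavily simplified stand-in for the real `Flexer`.
#[derive(Default)]
struct Flexer {
    /// Offset of the lexer in the input stream, for illustration only.
    offset: usize,
}

/// Stateful components specific to one lexer definition (a stand-in for a
/// `FlexerState` implementation).
#[derive(Default)]
struct MyLexerState {
    tokens: Vec<String>,
}

/// The user-defined lexer wraps the flexer together with its own state.
#[derive(Default)]
struct MyLexer {
    flexer: Flexer,
    state:  MyLexerState,
}

// Deref/DerefMut let fields and methods of `Flexer` be used directly on the
// user-defined lexer, as recommended above.
impl Deref for MyLexer {
    type Target = Flexer;
    fn deref(&self) -> &Flexer { &self.flexer }
}
impl DerefMut for MyLexer {
    fn deref_mut(&mut self) -> &mut Flexer { &mut self.flexer }
}

fn main() {
    let mut lexer = MyLexer::default();
    lexer.offset += 3;                      // field of `Flexer`, via DerefMut
    lexer.state.tokens.push("var".into());  // lexer-specific state
    assert_eq!(lexer.offset, 3);
    assert_eq!(lexer.state.tokens.len(), 1);
    println!("ok");
}
```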

### Supporting the Definition of Lexers

The actionables for this section are:

- Fill it in as the generation design solidifies.

### Supporting Code Generation

This architecture separates out the generated code (which can be defined purely on the user-defined lexer), from the code that is defined as part of the lexer definition. This means that the same underlying structures can be used to both define the lexer, and be used by the generated code from that definition.

For an example of how these components are used in the generated lexer, please see `generated_api_test`.

## An Example

The following code provides a sketch of the intended API for flexer code generation, using the definition of a simple lexer.

```rust
use crate::prelude::*;

use flexer;
use flexer::Flexer;



// =============
// === Token ===
// =============

pub struct Token {
    location : flexer::Location,
    ast      : TokenAst,
}

enum TokenAst {
    Var(ImString),
    Cons(ImString),
    Blank,
    // ...
}

impl Token {
    pub fn new(location:Location, ast:TokenAst) -> Self {
        Self {location,ast}
    }

    pub fn var(location:Location, name:impl Into<ImString>) -> Self {
        let ast = TokenAst::Var(name.into());
        Self::new(location,ast)
    }

    // ...
}



// =============
// === Lexer ===
// =============

#[derive(Debug,Default)]
struct Lexer<T:flexer::State> {
    current : Option<Token>,
    tokens  : Vec<Token>,
    state   : T
}

impl<T:flexer::State> Lexer<T> {
    fn on_ident(&mut self, tok:Token) {
        self.current = Some(tok);
        self.state.push(self.ident_sfx_check);
    }

    fn on_ident_err_sfx(&mut self) {
        println!("OH NO!")
    }

    fn on_no_ident_err_sfx(&mut self) {
        let current = std::mem::take(&mut self.current).unwrap();
        self.tokens.push(current);
    }
}

impl<T:flexer::State> flexer::Definition for Lexer<T> {
    fn state     (&    self) -> &    flexer::State { &    self.state }
    fn state_mut (&mut self) -> &mut flexer::State { &mut self.state }
}

pub fn lexer_source_code() -> String {
    let lexer = Flexer::<Lexer<_>>::new();

    let chr     = alphaNum | '_';
    let blank   = Pattern::from('_');
    let body    = chr.many() >> '\''.many();
    let var     = lowerLetter >> body;
    let cons    = upperLetter >> body;
    let breaker = "^`!@#$%^&*()-=+[]{}|;:<>,./ \t\r\n\\";

    let sfx_check = lexer.add(State("Identifier Suffix Check"));

    lexer.rule(lexer.root,var,"self.on_ident(Token::var(self.start_location,self.current_match()))");
    lexer.rule(lexer.root,cons,"self.on_ident(Token::cons(self.start_location,self.current_match()))");
    lexer.rule(lexer.root,blank,"self.on_ident(Token::blank(self.start_location))");
    lexer.rule(sfx_check,err_sfx,"self.on_ident_err_sfx()");
    lexer.rule(sfx_check,Flexer::always,"self.on_no_ident_err_sfx()");
    // ...
    lexer.generate_specialized_code() // This code needs to become a source file, probably via build.rs
}
```

Some things to note:

- The function definitions in `Lexer` take `self` as their first argument because `Engine` implements `Deref` and `DerefMut` to `Lexer`.