enso/docs/flexer/flexer.md

200 lines
8.2 KiB
Markdown
Raw Normal View History

2021-10-30 15:50:06 +03:00
---
layout: developer-doc
title: Flexer
category: syntax
tags: [parser, flexer, lexer, dfa]
order: 1
---
# Flexer
The flexer is a finite-automata-based engine for the definition and generation
of lexers. Akin to `flex`, and other lexer generators, the user may use it to
define a series of rules for lexing their language, which are then used by the
flexer to generate a highly-efficient lexer implementation.
Where the flexer differs from other programs in this space, however, is the
power that it gives users. When matching a rule, the flexer allows its users to
execute _arbitrary_ Rust code, which may even manipulate the lexer's state and
position. This means that the languages that can be lexed by the flexer extend
from the simplest regular grammars right up to unrestricted grammars (but please
don't write a programming language whose syntax falls into this category). It
also differs in that it chooses the first complete match for a rule, rather than
the longest one, which makes lexers much easier to define and maintain.
For detailed library documentation, please see the
[crate documentation](../../lib/rust/flexer/src/lib.rs) itself. This includes a
comprehensive tutorial on how to define a lexer using the flexer.
<!-- MarkdownTOC levels="2,3" autolink="true" -->
- [The Lexing Process](#the-lexing-process)
- [Lexing Rules](#lexing-rules)
- [Groups](#groups)
- [Patterns](#patterns)
- [Transition Functions](#transition-functions)
- [Code Generation](#code-generation)
- [Automated Code Generation](#automated-code-generation)
- [Structuring the Flexer Code](#structuring-the-flexer-code)
- [Supporting Code Generation](#supporting-code-generation)
<!-- /MarkdownTOC -->
## The Lexing Process
In the flexer, the lexing process proceeds from the top to the bottom of the
user-defined rules, and selects the first expression that _matches fully_. Once
a pattern has been matched against the input, the associated code is executed
and the process starts again until the input stream has been consumed.
This point about _matching fully_ is particularly important to keep in mind, as
it differs from other lexer generators that tend to prefer the _longest_ match
instead.
## Lexing Rules
A lexing rule for the flexer is a combination of three things:
1. A group.
2. A pattern.
3. A transition function.
An example of defining a rule is as follows:
```rust
fn define() -> Self {
let mut lexer = TestLexer::new();
let a_word = Pattern::char('a').many1();
let root_group_id = lexer.initial_state;
let root_group = lexer.groups_mut().group_mut(root_group_id);
// Here is the rule definition.
root_group.create_rule(&a_word,"self.on_first_word(reader)");
lexer
}
```
### Groups
A group is a mechanism that the flexer provides to allow grouping of rules
together. The flexer has a concept of a "state stack", which records the
currently active state at the current time, that can be manipulated by the
user-defined [transition functions](#transition-functions).
A state can be made active by using `flexer::push_state(state)`, and can be
deactivated by using `flexer::pop_state(state)` or
`flexer::pop_states_until(state)`. In addition, states may also have _parents_,
from which they can inherit rules. This is fantastic for removing the need to
repeat yourself when defining the lexer.
When inheriting rules from a parent group, the rules from the parent group are
matched strictly _after_ the rules from the child group. This means that groups
are able to selectively "override" the rules of their parents. Rules are still
matched in order for each group's set of rules.
### Patterns
Rules are defined to match _patterns_. Patterns are regular-grammar-like
descriptions of the textual content (as characters) that should be matched. For
a description of the various patterns provided by the flexer, see
[pattern.rs](../../lib/rust/flexer/src/automata/pattern.rs).
When a pattern is matched, the associated
[transition function](#transition-functions) is executed.
### Transition Functions
The transition function is a piece of arbitrary rust code that is executed when
the pattern for a given rule is matched by the flexer. This code may perform
arbitrary manipulations of the lexer state, and is where the majority of the
power of the flexer stems from.
## Code Generation
While it would be possible to interpret the flexer definition directly at
runtime, this would involve far too much dynamicism and non-cache-local lookup
to be at all fast.
Instead, the flexer includes
[`generate.rs`](../../lib/rust/flexer/src/generate.rs), a library for generating
highly-specialized lexer implementations based on the definition provided by the
user. The transformation that it implements operates as follows for each group
of rules.
1. The set of rules in a group is used to generate a
[Nondeterministic Finite Automaton](https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton),
(NFA).
2. The NFA is ttransformed into a
[Deterministic Finite Automaton](https://en.wikipedia.org/wiki/Deterministic_finite_automaton)
(DFA), using a variant of the standard
[powerset construction](https://en.wikipedia.org/wiki/Powerset_construction)
algorithm. This variant has been modified to ensure that the following
additional properties hold:
- Patterns are matched in the order in which they are defined.
- The associated transition functions are maintained correctly through the
transformation.
- The lexing process is `O(n)`, where `n` is the size of the input.
3. The DFA is then used to generate the rust code that implements that lexer.
The generated lexer contains a main loop that consumes the input stream
character-by-character, evaluating what is effectively a big `match` expression
that processes the input to evaluate the user-provided transition functions as
appropriate.
### Automated Code Generation
In order to avoid the lexer definition getting out of sync with its
implementation (the generated engine), it is necessary to create a separate
crate for the generated engine that has the lexer definition as one of its
dependencies.
This separation enables a call to `flexer::State::specialize()` in the crate's
`build.rs` (or a macro) during compilation. The output can be stored in a new
file i.e. `engine.rs` and exported from the library as needed. The project
structure would therefore appear as follows.
```
- lib/rust/lexer/
- definition/
- src/
- lib.rs
- cargo.toml
- generation/
- src/
- engine.rs <-- the generated file
- lib.rs <-- `pub mod engine`
- build.rs <-- calls `flexer::State::specialize()` and saves its output to
`src/engine.rs`
- cargo.toml <-- lexer-definition is in dependencies and build-dependencies
```
With this design, `flexer.generate_specialized_code()` is going to be executed
on each rebuild of `lexer/generation`. Therefore, `generation` should contain
only the minimum amount of logic, and should endeavor to minimize any
unnecessary dependencies to avoid recompiling too often.
## Structuring the Flexer Code
In order to unify the API between the definition and generated usages of the
flexer, the API is separated into the following components:
- `Flexer`: The main flexer definition itself, providing functionality common to
the definition and implementation of all lexers.
- `flexer::State`: The stateful components of a lexer definition. This trait is
implemented for a particular lexer definition, allowing the user to store
arbitrary data in their lexer, as needed.
- **User-Defined Lexer:** The user can then define a lexer that _wraps_ the
flexer, specialised to the particular `flexer::State` that the user has
defined. It is recommended to implement `Deref` and `DerefMut` between the
defined lexer and the `Flexer`, to allow for ease of use.
### Supporting Code Generation
This architecture separates out the generated code (which can be defined purely
on the user-defined lexer), from the code that is defined as part of the lexer
definition. This means that the same underlying structures can be used to both
_define_ the lexer, and be used by the generated code from that definition.
For an example of how these components are used in the generated lexer, please
see [`generated_api_test`](../../lib/rust/flexer/tests/generated_api_test.rs).