---
layout: developer-doc
title: Lexer
category: syntax
tags: [parser, lexer]
order: 3
---
# Lexer
The lexer is the code generated by the [flexer](./flexer.md) that is actually
responsible for lexing Enso source code. It chunks the character stream into a
(structured) token stream in order to make later processing faster, and to
identify blocks.
<!-- MarkdownTOC levels="2,3" autolink="true" -->

- [Lexer Architecture](#lexer-architecture)
  - [Libraries in the Lexer Definition](#libraries-in-the-lexer-definition)
- [Lexer Functionality](#lexer-functionality)
- [The Lexer AST](#the-lexer-ast)
- [Benchmarking the Lexer](#benchmarking-the-lexer)
  - [Running a Subset of the Benchmarks](#running-a-subset-of-the-benchmarks)
  - [Changing the Lexer](#changing-the-lexer)

<!-- /MarkdownTOC -->
## Lexer Architecture
The structure of the flexer's code generation forces the lexer to be split into
two parts: the definition, and the generation. As the latter is the point from
which the lexer will be used, the second subproject is the one that is graced
with the name `lexer`.
### Libraries in the Lexer Definition
The lexer generation subproject needs to be able to make the assumption that all
imports will be in the same place (relative to the crate root). To this end, the
definition subproject exports public modules `library` and `prelude`. These are
re-imported and used in the generation subproject to ensure that all components
are found at the same paths relative to the crate root.

This does mean, however, that all imports from _within_ the current crate in the
definition subproject must be imported from the `library` module, not from their
paths directly from the crate root.
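A minimal sketch of this layout follows. Only the module names `library` and `prelude` come from the text above; the `token` and `rules` modules, the `Token` type, and `make_token` are hypothetical stand-ins, and the real crates may be organised differently.

```rust
// Hypothetical root of the definition crate.

/// A stand-in for some module that lives inside the definition crate.
pub mod token {
    #[derive(Debug, PartialEq)]
    pub struct Token {
        pub text: String,
    }
}

/// Re-exports crate-internal modules under one fixed path, so that code
/// reused by the generation crate finds them at the same location.
pub mod library {
    pub use crate::token;
}

/// Common imports shared by the definition and the generation.
pub mod prelude {
    pub use std::fmt::Debug;
}

/// A stand-in for code inside the definition crate that uses those modules.
pub mod rules {
    // Imported through `library`, not as `crate::token::Token`, so the same
    // path still resolves once `library` is re-imported elsewhere.
    use crate::library::token::Token;

    pub fn make_token(text: &str) -> Token {
        Token { text: text.to_string() }
    }
}
```

In this sketch, the generation crate would re-import `library` and `prelude` from the definition crate, which makes `crate::library::token::Token` resolve identically in both places.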
## Lexer Functionality
The lexer provides the following functionality as part of the parser.
- It consumes the source lazily, character by character, and produces a
  structured token stream consisting of the lexer [AST](#the-lexer-ast).
- It succeeds on _any_ input, even if there are invalid constructs in the token
  stream, which are represented by `Invalid` tokens.
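Both properties can be illustrated with a toy lexer. This is a sketch only and is not the flexer-generated lexer: `ToyToken` and `ToyLexer` are hypothetical names, and the token set is reduced to three cases. It pulls characters on demand and turns input it does not understand into a token rather than an error.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// A hypothetical, heavily reduced token type; not the real lexer AST.
#[derive(Debug, PartialEq)]
pub enum ToyToken {
    Variable(String),
    Blank,
    Unrecognized(String),
}

/// Consumes characters only when the next token is requested.
pub struct ToyLexer<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> ToyLexer<'a> {
    pub fn new(input: &'a str) -> Self {
        ToyLexer { chars: input.chars().peekable() }
    }
}

impl<'a> Iterator for ToyLexer<'a> {
    type Item = ToyToken;

    fn next(&mut self) -> Option<ToyToken> {
        // Skip whitespace between tokens.
        while self.chars.next_if(|c| c.is_whitespace()).is_some() {}
        // `None` here means the input is exhausted.
        let first = self.chars.next()?;
        if first.is_ascii_lowercase() {
            let mut text = String::from(first);
            while let Some(c) = self
                .chars
                .next_if(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || *c == '_')
            {
                text.push(c);
            }
            Some(ToyToken::Variable(text))
        } else if first == '_' {
            Some(ToyToken::Blank)
        } else {
            // Unknown input never aborts lexing; it becomes a token instead.
            Some(ToyToken::Unrecognized(first.to_string()))
        }
    }
}
```

Collecting `ToyLexer::new("foo _ § bar")` yields four tokens, with the `§` reported as `Unrecognized` while lexing continues to `bar`.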
## The Lexer AST
In contrast to the full parser [AST](./ast.md), the lexer operates on a
simplified AST that we call a 'structured token stream'. While most lexers
output a linear token stream, it is very important in Enso that we encode the
nature of _blocks_ into the token stream, hence giving it structure.

This encoding of blocks is _crucial_ to the functionality of Enso as it ensures
that no later stages of the parser can ignore blocks, and hence maintains them
for use by the GUI.

It contains the following constructs:
- `Referent`: Referent identifiers (e.g. `Some_Ref_Ident`).
- `Variable`: Variable identifiers (e.g. `some_var_ident`).
- `External`: External identifiers (e.g. `someJavaName`).
- `Blank`: The blank name `_`.
- `Operator`: Operator identifiers (e.g. `-->>`).
- `Modifier`: Modifier operators (e.g. `+=`).
- `Annotation`: An annotation (e.g. `@Tail_Call`).
- `Number`: Numbers (e.g. `16_FFFF`).
- `DanglingBase`: An explicit base without an associated number (e.g. `16_`).
- `TextLine`: A single-line text literal.
- `TextInlineBlock`: An inline block text literal.
- `TextBlock`: A text block literal.
- `InvalidQuote`: An invalid set of quotes for a text literal.
- `TextSegmentRaw`: A raw text segment in which the contents should be
  interpreted literally.
- `TextSegmentEscape`: A text segment containing an escape sequence.
- `TextSegmentInterpolate`: A text segment containing an arbitrary interpolated
  expression.
- `TextSegmentUnclosedInterpolate`: An unclosed interpolation text segment.
- `Line`: A line in a block that contains tokens.
- `BlankLine`: A line in a block that contains only whitespace.
- `Block`: Syntactic blocks in the language.
- `InvalidSuffix`: Invalid tokens when in a given state that would otherwise be
valid.
- `Unrecognized`: Tokens that the lexer doesn't recognise.
- `DisableComment`: A standard comment that disables interpretation of the
commented code (i.e. `#`).
- `DocComment`: A documentation comment (e.g. `##`). Documentation syntax is
  _not_ lexed by this lexer.

The distinction is made here between the various kinds of identifiers in order
to keep lexing fast, but also in order to allow macros to switch on the kinds of
identifiers.
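To make the idea of a structured token stream concrete, here is a minimal sketch that models a few of the constructs above as a Rust enum. The variant names are taken from the list, but their payloads, the `Node` type name, and the helper functions are assumptions made for illustration and do not reflect the real definitions.

```rust
/// A hypothetical subset of the constructs listed above.
#[derive(Debug, PartialEq)]
pub enum Node {
    Referent(String),
    Variable(String),
    Operator(String),
    /// A line in a block that contains tokens.
    Line(Vec<Node>),
    BlankLine,
    /// A block owns its lines, so nesting is part of the stream itself.
    Block(Vec<Node>),
}

/// Returns the deepest block nesting found in the stream.
pub fn block_depth(nodes: &[Node]) -> usize {
    nodes
        .iter()
        .map(|node| match node {
            Node::Block(lines) => 1 + block_depth(lines),
            Node::Line(tokens) => block_depth(tokens),
            _ => 0,
        })
        .max()
        .unwrap_or(0)
}

/// Switches on the kind of identifier, as a macro might.
pub fn identifier_kind(node: &Node) -> Option<&'static str> {
    match node {
        Node::Referent(_) => Some("referent"),
        Node::Variable(_) => Some("variable"),
        _ => None,
    }
}

/// Builds the stream for a definition `main =` followed by an indented
/// block holding `foo`, a blank line, and `Bar`.
pub fn sample() -> Vec<Node> {
    vec![Node::Line(vec![
        Node::Variable("main".to_string()),
        Node::Operator("=".to_string()),
        Node::Block(vec![
            Node::Line(vec![Node::Variable("foo".to_string())]),
            Node::BlankLine,
            Node::Line(vec![Node::Referent("Bar".to_string())]),
        ]),
    ])]
}
```

Because the block is a node in its own right, a later stage that walks this stream cannot skip over it without explicitly matching on `Block`.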
> The actionables for this section are:
>
> - Determine if we want to have separate ASTs for the lexer and the parser, or
> not.
## Benchmarking the Lexer
As the lexer is the first port of call when getting an Enso program to run, it
needs to be quick. To that end, we insist on comprehensive benchmarks for any
change made to the lexer. The lexer benchmarks are written using
[criterion.rs](https://github.com/bheisler/criterion.rs), and include both
examples of whole program definitions and more specific benchmark examples.

**Baseline Commit:** `e5695e6f5d44cba4094380545036a3a5cbbf6973`

The benchmarking process for the lexer is as follows:
1. Check out the current _baseline commit_, listed above.
2. In `lexer_bench_sources.rs` change the line that reads `.retain_baseline` to
instead read `.save_baseline`. This will save the current baseline (taken on
your machine).
3. Run the benchmarks using `cargo bench`. Please note that running these
benchmarks takes approximately two hours, so sufficient time should be
allotted.
4. Once the baseline run has completed, change the above-mentioned line back to
`.retain_baseline`. This will disable overwriting the saved baseline, and
will perform its regression reporting against it.
5. Make your changes.
6. Run the benchmark suite again. It will report any performance regressions in
   the benchmark report, measured against your saved baseline.

Unfortunately, the use of time-based benchmarks means that we can't commit the
baseline to the repository. There is far too much variance between machines for
this to be useful.
### Running a Subset of the Benchmarks
The benchmarks are very comprehensive, running a wide range of program text
through the lexer while replicating it out to various sizes (see
`lexer_bench_sources.rs` for the full list). However, in order to decrease
iteration time it can be useful to run a subset of these.

There are two main tuning points for this:
1. The _sizes_ of inputs being executed on.
2. The benchmarks being executed.

The sizes can be tuned by editing the `SIZES` array in the
`lexer_bench_sources.rs` file. The benchmarks themselves are best tuned by
changing the macro definitions in `lexer_time_bench.rs` to exclude certain
benchmarks or groups of benchmarks.

While it is _possible_ to tune the benchmarking config (`bench_config` in
`lexer_bench_sources.rs`) to decrease benchmarking time, this is not
recommended. The current settings are tuned to provide reliable results.
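As a rough illustration of the first tuning point, the sketch below shows how a size table can drive the replication of a source text. Only the name `SIZES` comes from the text above; its element type, the values shown, and the helpers `replicate_to_size` and `bench_inputs` are assumptions, so consult `lexer_bench_sources.rs` for the real definitions.

```rust
/// Hypothetical size table: (target size in bytes, label).
/// Removing entries from a table like this shortens the benchmark run.
pub const SIZES: [(usize, &str); 2] = [(1024, "1KB"), (1024 * 100, "100KB")];

/// Repeats `source` until the result is at least `target` bytes long.
pub fn replicate_to_size(source: &str, target: usize) -> String {
    assert!(!source.is_empty(), "source must not be empty");
    // Round up so that the output never falls short of the target.
    let copies = (target + source.len() - 1) / source.len();
    source.repeat(copies)
}

/// Produces one labelled input per configured size.
pub fn bench_inputs(source: &str) -> Vec<(&'static str, String)> {
    SIZES
        .iter()
        .map(|(size, label)| (*label, replicate_to_size(source, *size)))
        .collect()
}
```

With this shape, trimming the table to a single small entry gives a quick feedback loop during development, while the full table remains the one used for the complete run.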
### Changing the Lexer
When changing the lexer, the _full_ benchmark suite must be run against the
current baseline before the changes can be merged. This suite run must use the
provided settings for the benchmarking library, and should be performed using
the process described above.