---
layout: developer-doc
title: Lexer
category: syntax
tags: [parser, lexer]
order: 3
---

# Lexer

The lexer is the code generated by the [flexer](./flexer.md) that is actually
responsible for lexing Enso source code. It chunks the character stream into a
(structured) token stream in order to make later processing faster, and to
identify blocks.

<!-- MarkdownTOC levels="2,3" autolink="true" -->

- [Lexer Architecture](#lexer-architecture)
  - [Libraries in the Lexer Definition](#libraries-in-the-lexer-definition)
- [Lexer Functionality](#lexer-functionality)
- [The Lexer AST](#the-lexer-ast)
- [Benchmarking the Lexer](#benchmarking-the-lexer)
  - [Running a Subset of the Benchmarks](#running-a-subset-of-the-benchmarks)
  - [Changing the Lexer](#changing-the-lexer)

<!-- /MarkdownTOC -->

## Lexer Architecture

The structure of the flexer's code generation forces the lexer to be split into
two parts: the definition, and the generation. As the generation subproject is
the point from which the lexer will be used, it is the one that is graced with
the name `lexer`.

### Libraries in the Lexer Definition

The lexer generation subproject needs to be able to make the assumption that all
imports will be in the same place (relative to the crate root). To this end, the
definition subproject exports public modules `library` and `prelude`. These are
re-imported and used in the generation subproject to ensure that all components
are found at the same paths relative to the crate root.

This does mean, however, that all imports from _within_ the current crate in the
definition subproject must be imported from the `library` module, not from their
paths directly from the crate root.

## Lexer Functionality

The lexer provides the following functionality as part of the parser.

- It consumes the source lazily, character by character, and produces a
  structured token stream consisting of the lexer [AST](#the-lexer-ast).
- It succeeds on _any_ input, even if there are invalid constructs in the token
  stream; these are represented by `Invalid` tokens.

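The two properties above can be illustrated with a minimal sketch. It is not the
generated lexer: the `Token` variants are a tiny subset, and `LazyLexer`,
`consume_while` and `lex` are hypothetical names chosen for this example. It
only shows the shape of an iterator that pulls characters on demand and turns
unknown input into a token instead of failing.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// A deliberately tiny, hypothetical token type; the real lexer AST is richer.
#[derive(Debug, PartialEq)]
enum Token {
    Variable(String),
    Number(String),
    Unrecognized(String),
}

/// Wraps a character iterator so that tokens are only produced on demand.
struct LazyLexer<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> LazyLexer<'a> {
    fn new(input: &'a str) -> Self {
        Self { chars: input.chars().peekable() }
    }

    /// Consumes characters while `pred` holds and returns them as a string.
    fn consume_while(&mut self, pred: fn(char) -> bool) -> String {
        let mut out = String::new();
        while let Some(&c) = self.chars.peek() {
            if !pred(c) {
                break;
            }
            out.push(c);
            self.chars.next();
        }
        out
    }
}

impl<'a> Iterator for LazyLexer<'a> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        // Skip spaces between tokens.
        self.consume_while(|c: char| c == ' ');
        // End of input ends the stream; nothing else does.
        let c = *self.chars.peek()?;
        let token = if c.is_ascii_lowercase() {
            Token::Variable(self.consume_while(|c: char| c.is_ascii_lowercase() || c == '_'))
        } else if c.is_ascii_digit() {
            Token::Number(self.consume_while(|c: char| c.is_ascii_digit()))
        } else {
            // Unknown input never aborts lexing; it becomes a token.
            self.chars.next();
            Token::Unrecognized(c.to_string())
        };
        Some(token)
    }
}

/// Convenience wrapper that drains the lazy lexer into a vector.
fn lex(input: &str) -> Vec<Token> {
    LazyLexer::new(input).collect()
}
```
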
## The Lexer AST

In contrast to the full parser [AST](./ast.md), the lexer operates on a
simplified AST that we call a 'structured token stream'. While most lexers
output a linear token stream, it is very important in Enso that we encode the
nature of _blocks_ into the token stream, hence giving it structure.

This encoding of blocks is _crucial_ to the functionality of Enso as it ensures
that no later stages of the parser can ignore blocks, and hence maintains them
for use by the GUI.

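As a rough sketch of what "structured" means here, the example below nests
indented lines into blocks rather than emitting a flat list. The reduced `Token`
type, the indentation rule and the functions `lex_block` and `lex` are
assumptions made for illustration; the real lexer's block handling is more
involved.

```rust
/// Hypothetical, reduced model of a structured token stream.
#[derive(Debug, PartialEq)]
enum Token {
    Variable(String),
    Line(Vec<Token>),
    BlankLine,
    Block(Vec<Token>),
}

/// Lexes the lines starting at `pos` that belong to a block at `indent`.
fn lex_block(lines: &[&str], pos: &mut usize, indent: usize) -> Token {
    let mut items = Vec::new();
    while *pos < lines.len() {
        let line = lines[*pos];
        if line.trim().is_empty() {
            items.push(Token::BlankLine);
            *pos += 1;
            continue;
        }
        let current = line.len() - line.trim_start().len();
        if current < indent {
            // A dedent closes this block.
            break;
        }
        if current > indent {
            // Deeper indentation opens a nested block.
            items.push(lex_block(lines, pos, current));
            continue;
        }
        // Every word is treated as a variable identifier in this sketch.
        let words: Vec<Token> = line
            .split_whitespace()
            .map(|w| Token::Variable(w.to_string()))
            .collect();
        items.push(Token::Line(words));
        *pos += 1;
    }
    Token::Block(items)
}

/// Lexes a whole input into a single top-level block.
fn lex(input: &str) -> Token {
    let lines: Vec<&str> = input.lines().collect();
    lex_block(&lines, &mut 0, 0)
}
```
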
The lexer AST contains the following constructs:

- `Referent`: Referent identifiers (e.g. `Some_Ref_Ident`).
- `Variable`: Variable identifiers (e.g. `some_var_ident`).
- `External`: External identifiers (e.g. `someJavaName`).
- `Blank`: The blank name `_`.
- `Operator`: Operator identifiers (e.g. `-->>`).
- `Modifier`: Modifier operators (e.g. `+=`).
- `Annotation`: An annotation (e.g. `@Tail_Call`).
- `Number`: Numbers (e.g. `16_FFFF`).
- `DanglingBase`: An explicit base without an associated number (e.g. `16_`).
- `TextLine`: A single-line text literal.
- `TextInlineBlock`: An inline block text literal.
- `TextBlock`: A text block literal.
- `InvalidQuote`: An invalid set of quotes for a text literal.
- `TextSegmentRaw`: A raw text segment in which the contents should be
  interpreted literally.
- `TextSegmentEscape`: A text segment containing an escape sequence.
- `TextSegmentInterpolate`: A text segment containing an arbitrary interpolated
  expression.
- `TextSegmentUnclosedInterpolate`: An unclosed interpolation text segment.
- `Line`: A line in a block that contains tokens.
- `BlankLine`: A line in a block that contains only whitespace.
- `Block`: Syntactic blocks in the language.
- `InvalidSuffix`: Invalid tokens when in a given state that would otherwise be
  valid.
- `Unrecognized`: Tokens that the lexer doesn't recognise.
- `DisableComment`: A standard comment that disables interpretation of the
  commented code (i.e. `#`).
- `DocComment`: A documentation comment (e.g. `##`). Documentation syntax is
  _not_ lexed by this lexer.

The distinction is made here between the various kinds of identifiers in order
to keep lexing fast, but also in order to allow macros to switch on the kinds of
identifiers.

> The actionables for this section are:
>
> - Determine if we want to have separate ASTs for the lexer and the parser, or
>   not.

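To make the distinction between identifier kinds concrete, the sketch below
classifies the example identifiers from the list above and lets a consumer
branch on the result. The classification rules are simplified assumptions
inferred from those examples, and `IdentKind`, `classify` and `describe` are
hypothetical names, not part of the lexer's API.

```rust
/// The identifier kinds distinguished by the lexer AST.
#[derive(Debug, PartialEq, Clone, Copy)]
enum IdentKind {
    Referent,
    Variable,
    External,
    Blank,
}

/// Classifies an identifier using simplified, assumed rules:
/// `_` is blank, capitalised underscore-separated segments are referent,
/// all-lowercase names are variables, and anything else is external.
fn classify(ident: &str) -> IdentKind {
    if ident == "_" {
        return IdentKind::Blank;
    }
    let segments_capitalised = ident
        .split('_')
        .all(|seg| seg.chars().next().map_or(false, |c| c.is_ascii_uppercase()));
    let all_lower = ident
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_');
    if segments_capitalised {
        IdentKind::Referent
    } else if all_lower {
        IdentKind::Variable
    } else {
        IdentKind::External
    }
}

/// A consumer (for example a macro resolver) can switch on the kind directly,
/// without inspecting the identifier text again.
fn describe(kind: IdentKind) -> &'static str {
    match kind {
        IdentKind::Referent => "referent",
        IdentKind::Variable => "variable",
        IdentKind::External => "external",
        IdentKind::Blank => "blank",
    }
}
```
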
## Benchmarking the Lexer

As the lexer is the first port of call when getting an Enso program to run, it
needs to be quick. To that end, we insist on comprehensive benchmarks for any
change made to the lexer. The lexer benchmarks are written using
[criterion.rs](https://github.com/bheisler/criterion.rs), and include both
examples of whole program definitions and more specific benchmark examples.

**Baseline Commit:** `e5695e6f5d44cba4094380545036a3a5cbbf6973`

The benchmarking process for the lexer is as follows:

1. Check out the current _baseline commit_, listed above.
2. In `lexer_bench_sources.rs` change the line that reads `.retain_baseline` to
   instead read `.save_baseline`. This will save the current baseline (taken on
   your machine).
3. Run the benchmarks using `cargo bench`. Please note that running these
   benchmarks takes approximately two hours, so sufficient time should be
   allotted.
4. Once the baseline run has completed, change the above-mentioned line back to
   `.retain_baseline`. This will disable overwriting the saved baseline, and
   will perform its regression reporting against it.
5. Make your changes.
6. Run the benchmark suite again. It will report any performance regressions in
   the benchmark report, measured against your saved baseline.

Unfortunately, the use of time-based benchmarks means that we can't commit the
baseline to the repository. There is far too much variance between machines for
this to be useful.

### Running a Subset of the Benchmarks

The benchmarks are very comprehensive, running a wide range of program text
through the lexer while replicating it out to various sizes (see
`lexer_bench_sources.rs` for the full list). However, in order to decrease
iteration time it can be useful to run a subset of these.

There are two main tuning points for this:

1. The _sizes_ of inputs being executed on.
2. The benchmarks being executed.

The sizes can be tuned by editing the `SIZES` array in the
`lexer_bench_sources.rs` file. The benchmarks themselves are best tuned by
changing the macro definitions in `lexer_time_bench.rs` to exclude certain
benchmarks or groups of benchmarks.

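As a rough illustration of replicating program text out to various sizes, the
sketch below repeats an input until it reaches each size in a table and times a
stand-in workload on it. It uses a hand-rolled timing loop rather than
criterion.rs. The contents of `SIZES` here, and the functions
`replicate_to_size` and `time_all_sizes`, are assumptions for this example; the
real definitions live in `lexer_bench_sources.rs` and may differ.

```rust
use std::time::{Duration, Instant};

/// Hypothetical size table of (bytes, label) pairs. Removing entries from a
/// table like this is what shortens a benchmark run.
const SIZES: [(usize, &str); 2] = [(1024, "1KB"), (1024 * 1024, "1MB")];

/// Repeats `input` until the result is at least `size` bytes long.
fn replicate_to_size(input: &str, size: usize) -> String {
    assert!(!input.is_empty(), "input must not be empty");
    // Round up so that the result never falls short of `size`.
    let copies = (size + input.len() - 1) / input.len();
    input.repeat(copies)
}

/// Runs a stand-in "lexer" (counting whitespace-separated words) on the input
/// replicated to each size, returning the label, token count and elapsed time.
fn time_all_sizes(input: &str) -> Vec<(&'static str, usize, Duration)> {
    SIZES
        .iter()
        .map(|&(size, label)| {
            let text = replicate_to_size(input, size);
            let start = Instant::now();
            let tokens = text.split_whitespace().count();
            (label, tokens, start.elapsed())
        })
        .collect()
}
```
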
While it is _possible_ to tune the benchmarking config (`bench_config` in
`lexer_bench_sources.rs`) to decrease benchmarking time, this is not
recommended. The current settings are tuned to provide reliable results.

### Changing the Lexer

When changing the lexer, the _full_ benchmark suite must be run against the
current baseline before the changes can be merged. This suite run must use the
provided settings for the benchmarking library, and should be performed using
the process described above.