---
layout: developer-doc
title: Macro Resolution
category: parser
tags: [parser, macro, resolution]
order: 4
---

# Macro Resolution

The Enso macro system is responsible for translating the structured token stream
from the [lexer](./lexer.md) and resolving it into the output [AST](./ast.md)
that represents the Enso language.

The macro system is of _key_ importance to the operation of Enso, providing the
IDE with detailed information on how users can interact with each and every
portion of the syntax. This has myriad benefits over a traditional parser,
allowing the IDE to handle all portions of the language's syntax in a seamless
way; the [macro matchers](#macro-matchers) describe the Enso syntax in a way
that allows the IDE to extract information about extension points and other
crucial metadata.

Nevertheless, the macro system _also_ serves as the language's parser,
implementing a sophisticated resolution algorithm that is able to turn the token
stream into the language AST.

Furthermore, the macro system can be extended in the future to allow _users_ to
define their own hygienic macros in Enso itself. This will allow them to
manipulate their programs syntactically using _exactly_ the same mechanism as
used in the Enso parser itself.

<!-- MarkdownTOC levels="2,3" autolink="true" indent="  " -->
- [Resolving Macros](#resolving-macros)
  - [Macro Segments](#macro-segments)
  - [Matching](#matching)
- [Macro Matchers](#macro-matchers)
  - [Relative Precedence Matchers](#relative-precedence-matchers)
  - [Named Macros](#named-macros)
- [Macro Resolution Errors](#macro-resolution-errors)
  - [Macro Errors as Parser Errors](#macro-errors-as-parser-errors)
- [User-Defined Macros](#user-defined-macros)
- [Benchmarking](#benchmarking)
  - [Running a Subset of the Benchmarks](#running-a-subset-of-the-benchmarks)
  - [Changing the Macro Resolver](#changing-the-macro-resolver)

<!-- /MarkdownTOC -->

## Resolving Macros

Conceptually, the macro resolution process builds a tree out of the user-defined
macros, segmenting the token-space based on the types of tokens expected by the
various definitions. Resolution then proceeds to walk the tree, pruning the
possible cases until either:

1. An unambiguous resolution is found.
2. An ambiguous resolution is found and an error is emitted.

Resolution then recurses, continuing to apply the macro resolution process
against the non-matchable segments in the macro.

Consider, for example, the macros defined as follows:

```rust
let if_then = "if" >> Matcher::Expr >> "then" >> Matcher::Expr
let if_then_else = "if" >> Matcher::Expr >> "then" >> Matcher::Expr >> "else" >> Matcher::Expr
```

These generate a tree as follows:

```
                                      ┌──────┐
                                      │ "if" │
                                      └──────┘
                                          ├─────────────────┐
                                          ▼                 ▼
                                  ┌───────────────┐     ┌───────┐
                                  │ Matcher::Expr │     │ Other │
                                  └───────────────┘     └───────┘
                                          │                 │
                                          │                 │
                                          ▼                 ▼
                                      ┌──────┐          ┌──────┐
                                      │"then"│          │ Fail │
                                      └──────┘          └──────┘
                                          ├─────────────────┐
                                          ▼                 ▼
                                  ┌───────────────┐     ┌───────┐
                                  │ Matcher::Expr │     │ Other │
                                  └───────────────┘     └───────┘
                                          │                 │
                      ┌───────────────────┤                 │
                      ▼                   ▼                 ▼
                 ┌─────────┐          ┌──────┐          ┌──────┐
                 │ Succeed │          │"else"│          │ Fail │
                 └─────────┘          └──────┘          └──────┘
                                          ├─────────────────┐
                                          ▼                 ▼
                                  ┌───────────────┐     ┌───────┐
                                  │ Matcher::Expr │     │ Other │
                                  └───────────────┘     └───────┘
                                          │                 │
                                          │                 │
                                          ▼                 ▼
                                     ┌─────────┐        ┌──────┐
                                     │ Succeed │        │ Fail │
                                     └─────────┘        └──────┘
```

Within each of the patterns, the macro resolution continues recursively, thus
allowing the macros to match nested uses of themselves (e.g.
`if a then if b then c`).

In reality, however, this is not how resolution takes place. Applying one macro
to completion and then starting again on all remaining segments would have
abysmal complexity. As a result, the process instead operates on a _stack_ of
resolution contexts, performing its resolution in a linear scan over the token
stream.

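The following sketch illustrates the shape of this approach. All names here are
illustrative assumptions, not the resolver's actual API:

```rust
/// Illustrative token type for the sketch.
enum Token {
    Var(String),
    Opr(String),
}

/// The subtree of macro definitions still reachable given the segments
/// matched so far (children elided for brevity).
struct MacroTree {}

/// A macro that is partway through matching: the tokens consumed by
/// each completed segment, plus the remaining definition tree.
struct ResolutionContext {
    matched_segments: Vec<Vec<Token>>,
    remaining_tree: MacroTree,
}

/// The resolver holds a stack of contexts: the innermost
/// partially-matched macro sits on top.
struct Resolver {
    stack: Vec<ResolutionContext>,
}

impl Resolver {
    fn run(&mut self, tokens: Vec<Token>) {
        for _token in tokens {
            // For each token, in a single pass:
            // 1. If it opens a new macro, push a fresh context.
            // 2. If it continues the macro on top of the stack, extend
            //    that context and prune the definition tree.
            // 3. If it completes (or cannot continue) the top macro,
            //    pop the context and splice its result into the one
            //    below, so nested macros resolve from the inside out.
        }
    }
}
```
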
### Macro Segments

Macros are described in terms of _segments_. A segment consists of three main
components:

1. An optional _preceding section_ that uses [matchers](#macro-matchers) to
   describe what is matched before encountering the _literal_ in point 2. This
   can only occur on the first segment.
2. A _literal_ that must be matched for that segment to apply.
3. An optional associated [matcher](#macro-matchers) that consumes certain
   _types_ of token.

Segments determine how a given macro matches on the input token stream, and are
used to generate the tree for performing resolution. One possible encoding of a
segment-based definition is sketched below.

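The names in this sketch are illustrative assumptions rather than the parser's
actual types:

```rust
/// Illustrative matcher type; see the matchers described below.
enum Matcher {
    Expr,
}

/// One section of a macro: the literal that anchors the segment, plus
/// an optional matcher for the tokens that follow it.
struct Segment {
    literal: String,
    body: Option<Matcher>,
}

/// A full macro definition: an optional prefix matcher for tokens
/// before the first literal, then one or more segments.
struct Definition {
    prefix: Option<Matcher>,
    segments: Vec<Segment>,
}

/// The `if ... then ... else ...` macro in this encoding.
fn if_then_else() -> Definition {
    Definition {
        prefix: None,
        segments: vec![
            Segment { literal: "if".into(), body: Some(Matcher::Expr) },
            Segment { literal: "then".into(), body: Some(Matcher::Expr) },
            Segment { literal: "else".into(), body: Some(Matcher::Expr) },
        ],
    }
}
```
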
### Matching

When a match occurs, the resultant segments are passed to a function that can
perform some action based on those segments. This function is responsible for
manipulating the output of the macro resolver, and hence manipulates the AST.

As the function may assert _additional_ properties on the match that cannot be
ascertained from the segment definitions themselves, a macro may still fail to
resolve at this stage.

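A sketch of such a function follows; the callback signature is an assumption
for illustration:

```rust
/// Placeholder AST node type for the sketch.
struct Ast;

/// A literal together with the tokens its matcher consumed.
struct MatchedSegment;

/// The body function receives the matched segments and either builds
/// the resulting AST or rejects the match.
type Body = fn(&[MatchedSegment]) -> Result<Ast, String>;

fn if_then_else_body(segments: &[MatchedSegment]) -> Result<Ast, String> {
    // Properties that the segment matchers cannot express are checked
    // here; returning `Err` fails the macro at this stage.
    if segments.len() != 3 {
        return Err("`if ... then ... else ...` takes three sections".into());
    }
    Ok(Ast)
}
```
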
## Macro Matchers

The matchers describe the _kinds_ of tokens that can be reasoned about within a
macro definition. These fall across a set of categories, as follows; a sketch of
how they might be modeled appears after the lists.

The boundary matchers are:

- `Start`: Matches the start of the line.
- `End`: Matches the end of the line.

The structural matchers are:

- `Nothing`: Never matches on any token.
- `Any`: Matches on any token. This may become a family of matchers (e.g.
  `AnyExpr`, and so on) for performance.
- `Seq`: Matches on the contained matchers in sequence.
- `Or`: Matches on one of the contained matchers.
- `Many`: Matches on multiple repetitions of the given matcher.
- `Except`: Matches on everything but the given pattern.

The meta matchers are:

- `Build`: Performs AST resolution on the match, but this must become implicit.
- `Err`: Allows ascribing a user-defined error to the failure of the contained
  matcher.
- `Tag`: Does nothing, but attaches string metadata to the result. This is very
  useful for syntax highlighting.
- `Exact`: Matches on an explicit AST.

The identifier matchers are:

- `Referent`: Matches on a referent identifier.
- `Variable`: Matches on a variable identifier.
- `External`: Matches on an external identifier.
- `Blank`: Matches the blank identifier (`_`).
- `Operator`: Matches on an operator identifier.
- `Modifier`: Matches on a modifier identifier.
- `Annotation`: Matches on an annotation identifier.

The literal matchers are:

- `Number`: Matches on a literal number.
- `Text`: Matches on any text literal.
- `TextLine`: Matches a single-line text literal.
- `TextInlineBlock`: Matches an inline-block text literal.
- `TextBlock`: Matches a block text literal.

The comment matchers are:

- `DisableComment`: Matches a standard disable comment.
- `DocComment`: Matches an Enso doc comment.

The block matchers are:

- `Block`: Matches an Enso block.

The error matchers are:

- `DanglingBase`: Matches an erroneous base for a number literal.
- `TextError`: Matches tokens representing error states during textual lexing.
- `InvalidSuffix`: Matches an erroneous suffix token.
- `Unrecognized`: Matches an unrecognised token.

In addition, it is highly likely that some of these will be combined into
aggregate matchers that match one of many kinds of token.

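Taken together, the matchers could be modeled as a single enum, roughly as
follows (an illustrative subset, not the actual definition):

```rust
/// An illustrative subset of the matchers listed above.
enum Matcher {
    // Boundary matchers.
    Start,
    End,
    // Structural matchers.
    Nothing,
    Any,
    Seq(Vec<Matcher>),
    Or(Box<Matcher>, Box<Matcher>),
    Many(Box<Matcher>),
    Except(Box<Matcher>),
    // Meta matchers; `Err` and `Tag` carry user-provided strings.
    Err(Box<Matcher>, String),
    Tag(Box<Matcher>, String),
    // The identifier, literal, comment, block, and error matchers are
    // one variant per token kind, for example:
    Variable,
    Number,
    Block,
}
```
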
### Relative Precedence Matchers

Where relevant, the matchers above may take an optional _maximum precedence_ of
operator that they may match. This effectively serves to implement differing
precedences on either side of a literal operator that is implemented via a
macro, and is important for correctly resolving the function arrow `->`.

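As a sketch, this could be encoded as an optional precedence bound on the
expression matcher; the encoding and the concrete level shown are assumptions:

```rust
/// Operator precedence level for the sketch.
struct Precedence(u32);

/// Illustrative expression matcher with an optional precedence bound:
/// when a bound is given, only operators within the bound may appear
/// in the matched expression.
enum Matcher {
    Expr { max_precedence: Option<Precedence> },
}

/// The expression before `->` is matched with a bound so that it does
/// not consume the arrow itself, while the right-hand side is
/// unbounded.
fn arrow_sides() -> (Matcher, Matcher) {
    let lhs = Matcher::Expr { max_precedence: Some(Precedence(1)) }; // hypothetical level
    let rhs = Matcher::Expr { max_precedence: None };
    (lhs, rhs)
}
```
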
### Named Macros

In addition to the basic matchers described above, we also need a `Named`
matcher. The intent behind this matcher is that it allows you to refer to
existing macro definitions by name, and it also provides for better error
messages.

By convention, we aim for these names to be treated as qualified, based on the
package in which they're declared. This convention should be followed for the
built-in macros as well, even if they can't be overridden, as they may appear in
error messages.

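A sketch of the matcher itself; the qualified-name format shown is an
assumption based on the convention above:

```rust
/// Illustrative `Named` matcher: delegates to an existing macro
/// definition, referred to by its qualified name. The name can also be
/// used directly in error messages.
enum Matcher {
    Named(String),
}

fn example() -> Matcher {
    // Hypothetical qualified name, following the package-based convention.
    Matcher::Named("Standard.Base.if_then_else".to_string())
}
```
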
## Macro Resolution Errors

Due to the highly structured nature of the resolver, it is quite possible for
the macro system to produce very detailed errors when resolution fails. This is
because it knows the path it took to get to the failure, and the expected tokens
or kinds of tokens at the failure site.

In addition, the resolver allows users to define their own errors through the
use of the `Err` matcher mentioned [above](#macro-matchers).

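For example, a tailored error for a malformed `then` body might be attached as
follows; the encoding is illustrative, following the sketches above:

```rust
/// Illustrative matchers: `Err` wraps another matcher so that its
/// failure produces the given message instead of the generic one.
enum Matcher {
    Expr,
    Err(Box<Matcher>, String),
}

/// Matcher for the body of a `then` segment with a tailored error.
fn then_body() -> Matcher {
    Matcher::Err(
        Box::new(Matcher::Expr),
        "expected an expression after `then`".to_string(),
    )
}
```
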
### Macro Errors as Parser Errors

The wealth of information in a failed macro resolution provides the necessary
tools for the compiler to produce informative and detailed parsing errors.

> The actionables for this section are:
>
> - Determine how we ensure provision of failure information to the engine in a
>   useful format.

## User-Defined Macros

In the future we want to provide an Enso-level syntax that allows the user to
define syntactic macros for their programs, similar to how Rust exposes the
`macro_rules!` construct.

> The actionables for this section are:
>
> - Determine what this should look like in surface Enso.
> - Determine exactly how the round-trip mechanic between the runtime and parser
>   should function.

## Benchmarking

All components of the macro resolver are accompanied by comprehensive benchmarks
in order to protect the performance-crucial code against regressions. These
benchmarks are written using
[criterion.rs](https://github.com/bheisler/criterion.rs), and cover all of the
performance-critical functionality of the macro resolver.

**Baseline Commit:** TBC (use the latest for now)

The benchmarking process for the macro resolver is as follows:

1. Check out the current _baseline commit_, listed above.
2. In each of the benchmark files, change the configuration line reading
   `.retain_baseline` to instead read `.save_baseline`. This will save the
   current baseline (taken on your machine).
3. Run the benchmarks using `cargo bench`.
4. Once the baseline run has completed, change the above-mentioned lines back to
   `.retain_baseline`. This will disable overwriting the saved baseline, and
   subsequent runs will report regressions against it.
5. Make your changes.
6. Run the benchmark suite again. It will report any performance regressions in
   the benchmark report, measured against your saved baseline.

Unfortunately, the use of time-based benchmarks means that we can't commit the
baseline to the repository. There is far too much variance between machines for
this to be useful.

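As a concrete illustration of steps 2 to 4, a benchmark might be configured as
follows. The benchmark name and the `resolve_macros` entry point are stand-ins,
while `save_baseline` and `retain_baseline` are real criterion.rs options:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

/// Stub standing in for the real macro resolver entry point.
fn resolve_macros(input: &str) -> usize {
    input.len()
}

fn bench_if_then_else(c: &mut Criterion) {
    let input = "if a then b else c";
    c.bench_function("if_then_else", |b| {
        b.iter(|| resolve_macros(black_box(input)))
    });
}

fn config() -> Criterion {
    // Step 2 swaps `retain_baseline` for `save_baseline` here; step 4
    // swaps it back so later runs compare against the saved baseline.
    Criterion::default().retain_baseline("baseline".to_string())
}

criterion_group! {
    name = benches;
    config = config();
    targets = bench_if_then_else
}
criterion_main!(benches);
```
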
### Running a Subset of the Benchmarks

The benchmarks are very comprehensive, testing all performance-critical paths of
the macro resolver. However, it can often be useful to run a subset of these.
There are two main tuning points for this, both visible in the sketch following
this list:

1. The _sizes_ of inputs being executed on, where relevant.
2. The benchmarks being executed.

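The group name and input sizes below are illustrative. With criterion.rs, a
subset of benchmarks can be selected by passing a filter on the command line,
e.g. `cargo bench -- resolver_scaling`:

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

/// Stub standing in for the real macro resolver entry point.
fn resolve_macros(input: &str) -> usize {
    input.len()
}

/// Tuning point 1: the input sizes iterated over here can be reduced
/// while investigating a specific performance problem.
fn bench_scaling(c: &mut Criterion) {
    let mut group = c.benchmark_group("resolver_scaling");
    for size in &[1_024, 65_536, 1_048_576] {
        let input = "if a then b else c\n".repeat(*size);
        group.bench_with_input(BenchmarkId::from_parameter(size), &input, |b, i| {
            b.iter(|| resolve_macros(i))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_scaling);
criterion_main!(benches);
```
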
While it is _possible_ to tune the benchmarking config to decrease benchmarking
time, this is not recommended. The current settings are tuned to provide
reliable results.

### Changing the Macro Resolver

When changing the macro resolver, the _full_ benchmark suite must be run against
the current baseline before the changes can be merged. This suite run must use
the provided settings for the benchmarking library, and should be performed
using the process described above.