Do initial design for the lexer (#947)

Ara Adkins 2020-06-26 14:54:20 +01:00 committed by GitHub
parent 0e139ee42a
commit f0551f7693
12 changed files with 462 additions and 190 deletions

View File

@ -31,6 +31,8 @@ below:
stream of source code.
- [**Macro Resolution:**](./macro-resolution.md) The system for defining and
resolving macros on the token stream.
- [**Operator Resolution:**](./operator-resolution.md) The system for resolving
operator applications properly.
- [**Construct Resolution:**](./construct-resolution.md) The system for
resolving higher-level language constructs in the AST to produce a useful
output.

View File

@ -8,43 +8,72 @@ order: 2
# Parser Architecture Overview
The Enso parser is designed in a highly modular fashion, with separate crates
responsible for the component's various responsibilities. The overall
architecture for the parser is described in this document.
<!-- MarkdownTOC levels="2,3" autolink="true" -->
- [Overall Architecture](#overall-architecture)
- [Reader](#reader)
- [Flexer](#flexer)
- [Lexer](#lexer)
- [Macro Resolution](#macro-resolution)
- [Operator Resolution](#operator-resolution)
- [Construct Resolution](#construct-resolution)
- [Parser Driver](#parser-driver)
- [AST](#ast)
- [JVM Object Generation](#jvm-object-generation)
<!-- /MarkdownTOC -->
## Overall Architecture
The overall architecture of the parser subsystem can be visualised as follows.
```
┌───────────────┐
│ Source Code │
└───────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│ ┌──────────────┐ Parser │
│ │ UTF-X Reader │ │
│ └──────────────┘ │
│ │ │
│ │ Character │
│ │ Stream │
│ ▽ │
│ ┌────────┐ │
│ │ Lexer │ │
│ │┌──────┐│ │
│ ││Flexer││ │
│ │└──────┘│ │
│ └────────┘ │
│ │ │
│ │ Structured │
│ │ Token Stream │
│ ▽ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ │ │ │ │ │ │
│ │ Macro │ Rust AST │ Operator │ Rust AST │ Construct │ │
│ │ Resolution │─────────────▷│ Resolution │─────────────▷│ Resolution │ │
│ │ │ │ │ │ │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │ │
│ Rust AST │ │
│ ▽ │
│ ┌────────────┐ │
│ │ AST Output │ │
│ └────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
┌───────────────┤ Rust AST
▽ │
┌────────────┐ │
│ │ │
│ JVM Object │ └─────────────────┐
│ Generator │ │
│ │ │
└────────────┘ │
│ │
JVM AST │ │
▽ ▽
┌────────────┐ ┌────────────┐
│ │ │ │
│ Use in JVM │ │ Direct Use │
│ Code │ │in Rust Code│
│ │ │ │
└────────────┘ └────────────┘
```
## Reader
## Flexer
## Lexer
## Macro Resolution
## Operator Resolution
## Construct Resolution
## Parser Driver
### AST
## JVM Object Generation
- Should wrap the parser as a whole into a new module, built for the engine.

View File

@ -1,13 +1,40 @@
---
layout: developer-doc
title: AST
category: parser
tags: [parser, ast]
order: 9
---
# AST
The parser AST describes the high-level syntactic structure of Enso, and also
carries robust and descriptive parser errors directly in the tree.
<!-- MarkdownTOC levels="2,3" autolink="true" -->
- [Functionality](#functionality)
<!-- /MarkdownTOC -->
## Functionality
The parser AST needs to account for the following:
- A single `Name` type, removing the distinction between different names found
in the [lexer](./lexer.md). This should provide functions `is_var`, `is_opr`,
and `is_ref`.
- It should contain all of the language constructs that may appear in Enso's
source.
- It should contain `Invalid` nodes, but these should be given a descriptive
error as to _why_ the construct is invalid.
- It should also contain `Ambiguous` nodes, where a macro cannot be resolved in
an unambiguous fashion.
Each node should contain:
- An identifier, attributed to it from the ID map.
- The start source position of the node, and the length (span) of the node.
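As a rough illustration of these requirements, a node might look something like
the following sketch. All names here (`Ast`, `Shape`, `Id`) are assumptions for
illustration, not the final design.
```rust
/// An identifier attributed to a node from the ID map (type assumed).
pub type Id = u64;

/// A sketch of a node carrying the metadata listed above.
pub struct Ast {
    /// The identifier attributed to this node, if it has one.
    pub id     : Option<Id>,
    /// The start position of the node in the source.
    pub start  : usize,
    /// The length (span) of the node in the source.
    pub length : usize,
    /// The syntactic shape of the node.
    pub shape  : Shape,
}

/// A sketch of the possible node shapes, including rich error nodes.
pub enum Shape {
    /// A single name type; var/opr/ref are queried, not separate types.
    Name(String),
    /// An invalid construct, with a descriptive reason why.
    Invalid { error: String },
    /// A macro that could not be resolved unambiguously.
    Ambiguous { alternatives: Vec<Ast> },
    // ... all other language constructs
}

impl Shape {
    /// Whether this is a variable name (here: starts with lowercase).
    pub fn is_var(&self) -> bool {
        match self {
            Shape::Name(text) => text.chars().next().map_or(false, char::is_lowercase),
            _                 => false,
        }
    }
}
```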
> The actionables for this section are:
>
> - Flesh out the design for the AST based on the requirements of the various
> parser phases.

View File

@ -3,11 +3,37 @@ layout: developer-doc
title: Construct Resolution
category: parser
tags: [parser, construct, resolution]
order: 7
---
# Construct Resolution
Construct resolution is the process of turning the low-level AST format into the
full high-level AST format, one that represents all of Enso's language
constructs and contains rich error nodes.
<!-- MarkdownTOC levels="2,3" autolink="true" -->
- [Syntax Errors](#syntax-errors)
<!-- /MarkdownTOC -->
> The actionables for this section are:
>
> - Produce a detailed design for this resolution functionality, accounting for
> all known current use cases.
## Syntax Errors
It is very important that Enso is able to provide descriptive and useful syntax
errors to its users. Doing so requires that the parser has a full understanding
of the language's syntax, but also that it is designed in such a fashion that it
will always succeed, regardless of any errors. Errors must be:
- Highly descriptive, so that it is easy for the runtime to explain to the user
what went wrong.
- Highly localised, so that the error impacts as small a portion of the parse
as possible.
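As a hypothetical example of a node satisfying both properties, an error might
pair a precise span with a human-readable message, so that a single malformed
expression is reported without poisoning the enclosing parse (every name here
is an assumption):
```rust
/// A sketch of a rich syntax-error node.
pub struct SyntaxError {
    /// The precise span of the offending construct, keeping the error as
    /// localised as possible.
    pub start   : usize,
    pub length  : usize,
    /// A descriptive explanation that the runtime can show to the user.
    pub message : String,
}
```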
> The actionables for this section are:
>
> - Determine how to design this parsing phase to obtain very accurate syntax
> errors.

View File

@ -7,7 +7,191 @@ order: 3
---
# Flexer
The flexer is a finite-automata-based engine for generating lexers. Akin to
`flex` and other lexer generators, it is given a definition as a series of rules
from which it then generates code for a highly-optimised lexer.
<!-- MarkdownTOC levels="2,3" autolink="true" -->
- [Pattern Description](#pattern-description)
- [State Management](#state-management)
- [Code Generation](#code-generation)
- [Notes on Code Generation](#notes-on-code-generation)
- [An Example](#an-example)
<!-- /MarkdownTOC -->
## Pattern Description
The definition of a lexer using the flexer library consists of a set of rules
for how to behave when matching portions of syntax. These rules behave as
follows:
- A rule describes a regex-like pattern.
- It also describes the code to be executed when the pattern is matched.
```rust
pub fn lexer_definition() -> String {
    ...
    let chr   = alphaNum | '_';
    let blank = Pattern::from('_');
    lexer.rule(lexer.root, blank, "self.on_ident(Token::blank(self.start_location))");
}
```
A pattern, such as `chr` or `blank` above, is a description of the characters
that must be consumed for that pattern to match. The flexer library provides a
set of basic matchers for building such patterns.
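As a rough legend for the combinators used in such definitions, inferred from
the examples in this document (the exact API is not final):
```rust
// Assumed combinator semantics, inferred from the examples in this document.
let chr  = alphaNum | '_';      // alternation: an alphanumeric character or `_`
let body = chr.many();          // repetition:  zero or more `chr`
let var  = lowerLetter >> body; // sequencing:  a lowercase letter, then `body`
```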
A `lexer.rule(...)` definition consists of the following parts:
- A state, used for grouping rules and named for debugging (see the section on
[state management](#state-management) below).
- A pattern, as described above.
- The code that is executed when the pattern matches.
## State Management
The flexer engine groups rules into sets known as `State`s. At any given time,
only rules from the _active_ state are considered by the lexer.
- States are named for purposes of debugging.
- You can activate another state from within the flexer instance by using
`state.push(new_state)`.
- You can deactivate the topmost state by using `state.pop()`.
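A minimal, self-contained sketch of this mechanism, using a plain `Vec` as a
stand-in for the real flexer state stack (all names are assumptions):
```rust
/// A stand-in for a flexer state identifier.
#[derive(Clone,Copy)]
struct StateId(usize);

struct Lexer {
    /// The stack of active states; the rules of the topmost state apply.
    state      : Vec<StateId>,
    /// A state containing the rules for the inside of a text literal.
    text_state : StateId,
}

impl Lexer {
    /// On an opening quote, activate the text state.
    fn on_text_start(&mut self) {
        self.state.push(self.text_state);
    }

    /// On the closing quote, deactivate it again.
    fn on_text_end(&mut self) {
        self.state.pop();
    }
}
```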
## Code Generation
The patterns in a lexer definition are used to generate a highly-efficient and
specialised lexer. This translation process works as follows:
1. All rules are taken and used to generate an NFA.
2. A DFA is generated from the NFA using the standard
[subset construction](https://en.wikipedia.org/wiki/Powerset_construction)
algorithm, but with some additional optimisations that ensure the following
properties hold:
- Patterns are matched in the order that they are defined.
- The associated code chunks are maintained properly.
- Lexing is `O(n)`, where `n` is the size of the input.
3. The DFA is used to generate the code for a lexer `Engine` struct, containing
the `Lexer` definition.
The `Engine` generated through this process contains a main loop that consumes
the input stream character-by-character, evaluating a big switch generated from
the DFA and calling functions from the `Lexer` as rules match.
Lexing proceeds from top-to-bottom of the rules, and the first expression that
_matches fully_ is chosen. This differs from other common lexer generators, as
they mostly choose the _longest_ match instead. Once the pattern is matched, the
associated code is executed and the process starts over again until the input
stream has been consumed.
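The generated code might therefore look roughly like the following hand-written
sketch. The states and transitions here are purely illustrative; the real code
is produced by the flexer.
```rust
const ERROR_STATE: usize = usize::MAX;

struct Lexer;
impl Lexer {
    fn on_ident(&mut self) { /* push a token, etc. */ }
}

/// One step of the generated automaton: a big match over the DFA states,
/// inspecting each character exactly once, which keeps lexing `O(n)`.
fn consume_next_character(lexer: &mut Lexer, state: usize, input: char) -> usize {
    match state {
        // Root state: dispatch on the first character of a token.
        0 => match input {
            'a'..='z' => 1,           // the start of a variable identifier
            _         => ERROR_STATE, // other rules elided in this sketch
        },
        // Inside an identifier: extend it or finish the match.
        1 => match input {
            'a'..='z' | '0'..='9' | '_' => 1,
            _ => {
                // The rule matched fully: run its associated code, then
                // restart matching from the root state.
                lexer.on_ident();
                0
            }
        },
        _ => ERROR_STATE,
    }
}
```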
### Notes on Code Generation
The following properties are likely to hold for the code generation machinery.
- The vast majority of the code generated by the flexer is going to be the same
for all lexers.
- The primary generation is in `consume_next_character`, which takes a `Lexer`
as an argument.
## An Example
The following code provides a sketch of the intended API for the flexer code
generation, using the definition of a simple lexer.
```rust
use crate::prelude::*;
use flexer;
use flexer::Flexer;


// =============
// === Token ===
// =============

pub struct Token {
    location : flexer::Location,
    ast      : TokenAst,
}

enum TokenAst {
    Var(ImString),
    Cons(ImString),
    Blank,
    ...
}

impl Token {
    pub fn new(location:Location, ast:TokenAst) -> Self {
        Self {location,ast}
    }

    pub fn var(location:Location, name:impl Into<ImString>) -> Self {
        let ast = TokenAst::Var(name.into());
        Self::new(location,ast)
    }

    ...
}


// =============
// === Lexer ===
// =============

#[derive(Debug,Default)]
struct Lexer<T:flexer::State> {
    current : Option<Token>,
    tokens  : Vec<Token>,
    state   : T,
}

impl<T:flexer::State> Lexer<T> {
    fn on_ident(&mut self, tok:Token) {
        self.current = Some(tok);
        self.state.push(self.ident_sfx_check);
    }

    fn on_ident_err_sfx(&mut self) {
        println!("OH NO!")
    }

    fn on_no_ident_err_sfx(&mut self) {
        let current = std::mem::take(&mut self.current).unwrap();
        self.tokens.push(current);
    }
}

impl<T:flexer::State> flexer::Definition for Lexer<T> {
    fn state     (&    self) -> &    flexer::State { &    self.state }
    fn state_mut (&mut self) -> &mut flexer::State { &mut self.state }
}

pub fn lexer_source_code() -> String {
    let lexer = Flexer::<Lexer<_>>::new();

    let chr     = alphaNum | '_';
    let blank   = Pattern::from('_');
    let body    = chr.many() >> '\''.many();
    let var     = lowerLetter >> body;
    let cons    = upperLetter >> body;
    let breaker = "^`!@#$%^&*()-=+[]{}|;:<>,./ \t\r\n\\";
    // `err_sfx`, a pattern built from `breaker`, is elided here.

    let sfx_check = lexer.add(State("Identifier Suffix Check"));

    lexer.rule(lexer.root, var,   "self.on_ident(Token::var(self.start_location,self.current_match()))");
    lexer.rule(lexer.root, cons,  "self.on_ident(Token::cons(self.start_location,self.current_match()))");
    lexer.rule(lexer.root, blank, "self.on_ident(Token::blank(self.start_location))");
    lexer.rule(sfx_check,  err_sfx,        "self.on_ident_err_sfx()");
    lexer.rule(sfx_check,  Flexer::always, "self.on_no_ident_err_sfx()");
    ...

    lexer.generate_specialized_code() // This code needs to become a source file, probably via build.rs
}
```
Some things to note:
- The function definitions in `Lexer` take `self` as their first argument
because `Engine` implements `Deref` and `DerefMut` to `Lexer`.
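A sketch of that relationship, with `Engine` standing in for the generated
struct:
```rust
use std::ops::{Deref,DerefMut};

struct Lexer { /* the user-defined definition, as above */ }

struct Engine {
    lexer : Lexer,
    // ... the generated DFA tables and main loop live here
}

impl Deref for Engine {
    type Target = Lexer;
    fn deref(&self) -> &Lexer { &self.lexer }
}

impl DerefMut for Engine {
    fn deref_mut(&mut self) -> &mut Lexer { &mut self.lexer }
}

// With these impls, the code strings attached to rules (e.g.
// `self.on_ident(...)`) resolve to methods on the user's `Lexer` when
// executed inside the engine's main loop.
```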

View File

@ -3,11 +3,19 @@ layout: developer-doc
title: JVM Object Generation
category: parser
tags: [parser, jvm, object-generation]
order: 10
---
# JVM Object Generation
The JVM object generation phase is responsible for creating JVM-native objects
representing the parser AST from the rust-native AST. This is required to allow
the compiler and runtime to work with the AST.
<!-- MarkdownTOC levels="2,3" autolink="true" -->
<!-- /MarkdownTOC -->
> The actionables for this section are:
>
> - Work out how on earth this is going to work.
> - Produce a detailed design for this functionality.

View File

@ -7,7 +7,51 @@ order: 4
---
# Lexer
The lexer is the code generated by the [flexer](./flexer.md) that is actually
responsible for lexing Enso source code. It chunks the character stream into a
(structured) token stream in order to make later processing faster, and to
identify blocks.
<!-- MarkdownTOC levels="2,3" autolink="true" -->
- [Lexer Functionality](#lexer-functionality)
- [The Lexer AST](#the-lexer-ast)
<!-- /MarkdownTOC -->
## Lexer Functionality
The lexer needs to provide the following functionality as part of the parser.
- It consumes the source lazily, character by character, and produces a
structured token stream consisting of the lexer [ast](#the-lexer-ast).
- It must succeed on _any_ input; invalid constructs in the input are
represented by `Invalid` tokens in the token stream.
## The Lexer AST
In contrast to the full parser [ast](./ast.md), the lexer operates on a
simplified AST that we call a 'structured token stream'. While most lexers
output a linear token stream, it is very important in Enso that we encode the
nature of _blocks_ into the token stream, hence giving it structure.
This encoding of blocks is _crucial_ to the functionality of Enso as it ensures
that no later stages of the parser can ignore blocks, and hence maintains them
for use by the GUI.
It contains the following constructs:
- `Var`: Variable identifiers.
- `Ref`: Referent identifiers.
- `Opr`: Operator identifiers.
- `Number`: Numbers.
- `Text`: Text.
- `Invalid`: Invalid constructs that cannot be lexed.
- `Block`: Syntactic blocks in the language.
The distinction between the various kinds of identifiers is made in order to
keep lexing fast, but also in order to allow macros to switch on the kind of
identifier.
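A minimal sketch of such a structured token stream, where `Block` owns its child
tokens (the shapes here are assumptions based on the list above):
```rust
/// A sketch of the lexer's structured token stream.
pub enum Token {
    Var(String),     // variable identifiers
    Ref(String),     // referent identifiers
    Opr(String),     // operator identifiers
    Number(String),  // numbers, kept as text at this stage
    Text(String),    // text literals
    Invalid(String), // anything that could not be lexed
    /// A block is not flattened into begin/end markers: it owns its child
    /// tokens, which is what gives the stream its structure and preserves
    /// blocks for the GUI.
    Block {
        indent : usize,
        tokens : Vec<Token>,
    },
}
```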
> The actionables for this section are:
>
> - Determine if we want to have separate ASTs for the lexer and the parser, or
> not.

View File

@ -7,7 +7,39 @@ order: 5
---
# Macro Resolution
Macro resolution is the process of taking the structured token stream from the
[lexer](./lexer.md) and resolving the macros within it to produce the
[ast](./ast.md). This process produces a chunked AST stream that includes
spacing-unaware elements.
<!-- MarkdownTOC levels="2,3" autolink="true" -->
- [Functionality](#functionality)
- [Errors During Macro Resolution](#errors-during-macro-resolution)
<!-- /MarkdownTOC -->
## Functionality
The current functionality of the macro resolver is as follows:
- TBC
An overview of the current macro resolution process can be found in the Scala
[implementation](../../lib/syntax/specialization/shared/src/main/scala/org/enso/syntax/text/Parser.scala).
> The actionables for this section are:
>
> - Discuss how the space-unaware AST should be handled as it is produced by
> macros.
> - Handle precedence for operators properly within macro resolution (e.g.
> `x : a -> b : a -> c` should parse with the correct precedence).
> - Create a detailed design for how macro resolution should work.
## Errors During Macro Resolution
It is very important that, during macro resolution, the resolver produces
descriptive errors for the error conditions that it encounters.
> The actionables for this section are:
>
> - Determine how best to provide detailed and specific errors from within the
> macro resolution engine.

View File

@ -0,0 +1,28 @@
---
layout: developer-doc
title: Operator Resolution
category: parser
tags: [parser, operator, resolution]
order: 6
---
# Operator Resolution
Operator resolution is the process of resolving applications of operators into
specific nodes on the AST.
<!-- MarkdownTOC levels="2,3" autolink="true" -->
- [Resolution Algorithm](#resolution-algorithm)
<!-- /MarkdownTOC -->
## Resolution Algorithm
The operator resolution process uses a version of the classic
[shunting-yard algorithm](https://en.wikipedia.org/wiki/Shunting-yard_algorithm)
with modifications to support operator sections.
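For illustration, here is a compact sketch of the classic algorithm's core loop
over an already-chunked expression, without the section-handling modifications
(the `Ast` shape and the precedence table are assumptions):
```rust
enum Tok { Operand(String), Opr(String) }

enum Ast {
    Operand(String),
    App { opr: String, lhs: Box<Ast>, rhs: Box<Ast> },
}

/// An assumed precedence table; the real one is derived from Enso's rules.
fn prec(opr: &str) -> u8 {
    match opr {
        "*" | "/" => 2,
        "+" | "-" => 1,
        _         => 0,
    }
}

/// Pop one operator and build an application node from the top operands.
fn reduce(operands: &mut Vec<Ast>, oprs: &mut Vec<String>) {
    let opr = oprs.pop().unwrap();
    let rhs = Box::new(operands.pop().unwrap());
    let lhs = Box::new(operands.pop().unwrap());
    operands.push(Ast::App {opr,lhs,rhs});
}

fn resolve(tokens: Vec<Tok>) -> Ast {
    let mut operands : Vec<Ast>    = Vec::new();
    let mut oprs     : Vec<String> = Vec::new();
    for token in tokens {
        match token {
            Tok::Operand(name) => operands.push(Ast::Operand(name)),
            Tok::Opr(opr) => {
                // Left-associative: reduce while the stack top binds at
                // least as tightly as the incoming operator.
                while oprs.last().map_or(false, |top| prec(top) >= prec(&opr)) {
                    reduce(&mut operands, &mut oprs);
                }
                oprs.push(opr);
            }
        }
    }
    while !oprs.is_empty() {
        reduce(&mut operands, &mut oprs);
    }
    operands.pop().unwrap()
}
```
Supporting sections would then amount to tolerating a missing operand on either
side of an operator and emitting a section node rather than failing.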
> The actionables for this section are:
>
> - Work out how to formulate this functionality efficiently in rust. The scala
> implementation can be found
> [here](../../lib/syntax/definition/src/main/scala/org/enso/syntax/text/prec/Operator.scala).

View File

@ -3,11 +3,26 @@ layout: developer-doc
title: Parser Driver
category: parser
tags: [parser, driver]
order: 8
---
# Parser Driver
The parser driver component is responsible for orchestrating the entire action
of the parser. It handles the following duties:
1. Consuming input text using a provided [reader](./reader.md) in a lazy
fashion.
2. Lexing and then parsing the input text.
3. Writing the output AST to the client of the parser.
<!-- MarkdownTOC levels="2,3" autolink="true" -->
- [Driver Clients](#driver-clients)
<!-- /MarkdownTOC -->
## Driver Clients
The parser is going to be employed in two contexts, both running in-process:
1. In the IDE codebase as a rust dependency.
2. In the engine as a native code dependency used via JNI.
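In both contexts the orchestration is the same. A sketch of the driver's
surface, with every stage name assumed from the architecture overview rather
than taken from real code:
```rust
// Stub types standing in for the real components of each stage.
pub struct Reader;
pub struct TokenStream;
pub struct ChunkedStream;
pub struct Ast;

fn lex(_reader: Reader)                    -> TokenStream   { TokenStream }
fn resolve_macros(_tokens: TokenStream)    -> ChunkedStream { ChunkedStream }
fn resolve_operators(ast: ChunkedStream)   -> ChunkedStream { ast }
fn resolve_constructs(_ast: ChunkedStream) -> Ast           { Ast }

/// The driver threads the lazily-read input through each stage in order
/// and hands the resulting AST back to the client.
pub fn parse(reader: Reader) -> Ast {
    let tokens    = lex(reader);
    let chunked   = resolve_macros(tokens);
    let with_oprs = resolve_operators(chunked);
    resolve_constructs(with_oprs)
}
```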

View File

@ -1,153 +0,0 @@
# Parser Design
## 1. Lexer (Code -> Token Stream)
- Lexer needs to be generic over the input stream encoding to support UTF-16
coming from the JVM.
- Is there any use case that requires the lexer to read an actual file?
- The prelude needs to be released to crates.io, otherwise we're going to
rapidly get out of sync.
- I don't think it makes sense to have separate `Var` and `Cons` identifiers. We
should instead have `Name`, with functions `is_referent` and `is_variable`.
This better mirrors how the language actually treats names.
- What actually is the flexer?
- What should the AST look like?
Lexer reads the source file (lazily, line by line) or uses an in-memory `&str`,
and produces a token stream of `Var`, `Cons`, `Opr`, `Number`, `Text`,
`Invalid`, and `Block`. Please note that `Block` is part of the token stream on
purpose. It is important that the source code is easy to parse visually, so if
you see a block, it should be a block. Discovering blocks in the lexer allows us
to prevent all other parts of the parser, like macros, from breaking this
assumption. Moreover, it makes the design of the following stages a lot simpler.
The Enso lexer should always succeed on any input stream (the token stream may
contain `Invalid` tokens).
Lexer is defined using the Rust procedural macro system. We are using procedural
macros because the lexer definition produces Rust code (pasted "in-place" of the
macro usage). Let's consider a very simple lexer definition:
```rust
use crate::prelude::*; // Needs to be a released crate
use flexer;
use flexer::Flexer;


// =============
// === Token ===
// =============

pub struct Token {
    location : flexer::Location,
    ast      : TokenAst,
}

enum TokenAst {
    Var(ImString),
    Cons(ImString),
    Blank,
    ...
}

impl Token {
    pub fn new(location:Location, ast:TokenAst) -> Self {
        Self {location,ast}
    }

    pub fn var(location:Location, name:impl Into<ImString>) -> Self {
        let ast = TokenAst::Var(name.into());
        Self::new(location,ast)
    }

    ...
}


// =============
// === Lexer ===
// =============

#[derive(Debug,Default)]
struct Lexer {
    current : Option<Token>,
    tokens  : Vec<Token>,
    state   : flexer::State,
}

impl Lexer {
    fn on_ident(&mut self, tok:Token) {
        self.current = Some(tok);
        self.state.push(self.ident_sfx_check);
    }

    fn on_ident_err_sfx(&mut self) {
        println!("OH NO!")
    }

    fn on_no_ident_err_sfx(&mut self) {
        let current = std::mem::take(&mut self.current).unwrap();
        self.tokens.push(current);
    }
}

impl flexer::Definition for Lexer {
    fn state     (&    self) -> &    flexer::State { &    self.state }
    fn state_mut (&mut self) -> &mut flexer::State { &mut self.state }
}

pub fn lexer_source_code() -> String {
    let lexer = Flexer::<Lexer>::new();

    let chr     = alphaNum | '_';
    let blank   = Pattern::from('_');
    let body    = chr.many() >> '\''.many();
    let var     = lowerLetter >> body;
    let cons    = upperLetter >> body;
    let breaker = "^`!@#$%^&*()-=+[]{}|;:<>,./ \t\r\n\\";
    // `err_sfx`, a pattern built from `breaker`, is elided here.

    let sfx_check = lexer.add(State("Identifier Suffix Check"));

    lexer.rule(lexer.root, var,   "self.on_ident(Token::var(self.start_location,self.current_match()))");
    lexer.rule(lexer.root, cons,  "self.on_ident(Token::cons(self.start_location,self.current_match()))");
    lexer.rule(lexer.root, blank, "self.on_ident(Token::blank(self.start_location))");
    lexer.rule(sfx_check,  err_sfx,        "self.on_ident_err_sfx()");
    lexer.rule(sfx_check,  Flexer::always, "self.on_no_ident_err_sfx()");
    ...

    lexer.generate_specialized_code()
}
```
The idea here is that we are describing regexp-like patterns and saying what
should happen when a pattern is matched. For example, after matching the `var`
pattern, the code `self.on_ident(Token::var(...))` should be evaluated. The code
is passed as a string, because it will become part of the generated, highly
specialised, very fast lexer.
Technically, the patterns are first translated to a state machine, and then to a
set of if-then-else statements, in such a way that parsing is always `O(n)`,
where `n` is the input size. Logically, the regular expressions are matched top
to bottom, and the first fully-matched expression is chosen (unlike in the
popular lexer generator flex, which uses the longest match instead). After an
expression is chosen, the associated function is executed, and the process
starts over again until the end of the input stream. Only the rules from the
currently active state are considered. A state is just a named (for debug
purposes only) set of rules. The lexer always starts in the `lexer.root` state.
You can make another state active by running (from within the Flexer instance)
`state.push(new_state)`, and pop it using `state.pop()`.
The `lexer.generate_specialized_code` function works in a few steps:
1. It takes all rules and states and generates an NFA state machine.
2. It generates a DFA state machine, using some custom optimizations to make
sure that the regexps are matched in order and the associated code chunks are
not lost.
3. It generates a highly tailored lexer `Engine` struct. One of the fields of
the engine is the `Lexer` struct we defined above. The engine contains a main
"loop" which consumes the input char by char, evaluates a big if-then-else
machinery generated from the DFA, and evaluates functions from the `Lexer`.
Please note that the functions start with `self`; that's because `Engine`
implements `Deref` and `DerefMut` to `Lexer`.
The generation of the if-then-else code block is not defined in this document, but can be observed by:
1. Inspecting the current code in Scala.
2. Printing the Java code generated by current Scala Flexer implementation.
3. Talking with @wdanilo about it.
## 2. Macro Resolution (Token Stream -> Chunked AST Stream incl space-unaware AST)
To be described in detail taking into consideration all current use cases. For the current documentation of macro resolution, take a look here: https://github.com/luna/enso/blob/main/lib/syntax/specialization/shared/src/main/scala/org/enso/syntax/text/Parser.scala
Before implementing this step, we need to talk about handling of space-unaware AST (the AST produced by user-macros).
## 3. Operator Resolution (Chunked AST Stream -> Chunked AST Stream with Opr Apps)
Using a modified [Shunting-yard algorithm](https://en.wikipedia.org/wiki/Shunting-yard_algorithm). The algorithm is modified to support operator sections. The Scala implementation is here: https://github.com/luna/enso/blob/main/lib/syntax/definition/src/main/scala/org/enso/syntax/text/prec/Operator.scala . Unfortunately, the Scala implementation relies on recursion in a way that does not translate directly to Rust, so it needs to be re-worked.
## 4. Finalization and Special Rules Discovery (Chunked AST Stream with Opr Apps -> AST)
To be described in detail taking into consideration all current use cases.

View File

@ -3,11 +3,41 @@ layout: developer-doc
title: Reading Source Code
category: parser
tags: [parser, reader]
order: 11
---
# Reading Source Code
The reader is responsible for abstracting the interface for reading characters
from a stream. It hides the various encodings that the project is going to use,
as well as the backing formats for the stream.
<!-- MarkdownTOC levels="2,3" autolink="true" -->
- [Reader Functionality](#reader-functionality)
- [Provided Readers](#provided-readers)
- [UTF-8 Reader](#utf-8-reader)
- [UTF-16 Reader](#utf-16-reader)
<!-- /MarkdownTOC -->
## Reader Functionality
The reader trait needs to have the following functionality:
- It must read its input _lazily_, not requiring the entire input to be in
memory.
- It should provide the interface `next_character`, returning rust-native
characters, and hence abstract away the various underlying encodings.
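A sketch of such a trait (the exact signature is an assumption):
```rust
/// A placeholder error type for malformed input encodings.
pub struct DecodingError;

/// A sketch of the reader abstraction described above.
pub trait Reader {
    /// Returns the next character decoded from the underlying stream, or
    /// `None` once the input is exhausted. Decoding failures surface as
    /// errors so the lexer can emit `Invalid` tokens instead of aborting.
    fn next_character(&mut self) -> Result<Option<char>,DecodingError>;
}
```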
## Provided Readers
The parser implementation currently provides the following reader utilities to
clients.
### UTF-8 Reader
Rust natively uses UTF-8 encoding for its strings. In order for the IDE to make
use of the parser, the parser must provide a simple rust-native reader.
### UTF-16 Reader
As the JVM makes use of UTF-16 for encoding its strings, we need to provide a
reader that lets JVM clients of the parser provide the source code in a
streaming fashion, without needing to re-encode it prior to passing it to the
parser.
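A sketch of such a reader over lazily supplied UTF-16 code units, implementing
the assumed `Reader` trait from above via the standard library's
`std::char::decode_utf16`:
```rust
use std::char::{decode_utf16,DecodeUtf16};

/// A sketch of a UTF-16 reader; the code units might be streamed from the
/// JVM without any up-front re-encoding.
pub struct Utf16Reader<I:Iterator<Item=u16>> {
    decoder : DecodeUtf16<I>,
}

impl<I:Iterator<Item=u16>> Utf16Reader<I> {
    pub fn new(units:I) -> Self {
        Self { decoder: decode_utf16(units) }
    }
}

impl<I:Iterator<Item=u16>> Reader for Utf16Reader<I> {
    fn next_character(&mut self) -> Result<Option<char>,DecodingError> {
        match self.decoder.next() {
            None         => Ok(None),
            Some(Ok(c))  => Ok(Some(c)),
            // An unpaired surrogate becomes a decoding error rather than
            // crashing the parser.
            Some(Err(_)) => Err(DecodingError),
        }
    }
}
```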