Do initial design for the lexer (#947)
parent 0e139ee42a
commit f0551f7693

@@ -31,6 +31,8 @@ below:

  stream of source code.
- [**Macro Resolution:**](./macro-resolution.md) The system for defining and
  resolving macros on the token stream.
- [**Operator Resolution:**](./operator-resolution.md) The system for resolving
  operator applications properly.
- [**Construct Resolution:**](./construct-resolution.md) The system for
  resolving higher-level language constructs in the AST to produce a useful
  output.

@@ -8,43 +8,72 @@ order: 2

# Parser Architecture Overview
The Enso parser is designed in a highly modular fashion, with separate crates
responsible for each of its components. The overall architecture of the parser
is described in this document.

<!-- MarkdownTOC levels="2,3" autolink="true" -->

- [Overall Architecture](#overall-architecture)
- [Reader](#reader)
- [Flexer](#flexer)
- [Lexer](#lexer)
- [Macro Resolution](#macro-resolution)
- [Operator Resolution](#operator-resolution)
- [Construct Resolution](#construct-resolution)
- [Parser Driver](#parser-driver)
  - [AST](#ast)
- [JVM Object Generation](#jvm-object-generation)

<!-- /MarkdownTOC -->

## Overall Architecture
The overall architecture of the parser subsystem can be visualised as follows.

```
                             ┌───────────────┐
                             │  Source Code  │
                             └───────────────┘
                                     │
                                     │
                                     ▽
┌─────────────────────────────────────────────────────────────────────────┐
│ ┌──────────────┐                  Parser                                │
│ │ UTF-X Reader │                                                        │
│ └──────────────┘                                                        │
│        │                                                                │
│        │ Character                                                      │
│        │ Stream                                                         │
│        ▽                                                                │
│    ┌────────┐                                                           │
│    │ Lexer  │                                                           │
│    │┌──────┐│                                                           │
│    ││Flexer││                                                           │
│    │└──────┘│                                                           │
│    └────────┘                                                           │
│        │                                                                │
│        │ Structured                                                     │
│        │ Token Stream                                                   │
│        ▽                                                                │
│ ┌────────────┐              ┌────────────┐              ┌────────────┐  │
│ │            │              │            │              │            │  │
│ │   Macro    │   Rust AST   │  Operator  │   Rust AST   │ Construct  │  │
│ │ Resolution │─────────────▷│ Resolution │─────────────▷│ Resolution │  │
│ │            │              │            │              │            │  │
│ └────────────┘              └────────────┘              └────────────┘  │
│                                                               │         │
│                                                      Rust AST │         │
│                                                               ▽         │
│                                                         ┌────────────┐  │
│                                                         │ AST Output │  │
│                                                         └────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                    ┌────────────────┤  Rust AST
                    ▽                │
              ┌────────────┐         │
              │            │         │
              │ JVM Object │         └─────────────────┐
              │ Generator  │                           │
              │            │                           │
              └────────────┘                           │
                    │                                  │
            JVM AST │                                  │
                    ▽                                  ▽
              ┌────────────┐                    ┌────────────┐
              │            │                    │            │
              │ Use in JVM │                    │ Direct Use │
              │    Code    │                    │in Rust Code│
              │            │                    │            │
              └────────────┘                    └────────────┘
```

## Reader

## Flexer

## Lexer

## Macro Resolution

## Operator Resolution

## Construct Resolution

## Parser Driver

### AST

## JVM Object Generation

- Should wrap the parser as a whole into a new module, built for the engine

@@ -1,13 +1,40 @@

---
layout: developer-doc
title: AST
category: parser
tags: [parser, ast]
order: 9
---

# AST
The parser AST describes the high-level syntactic structure of Enso, and also
contains robust and descriptive parser errors directly in the AST.

<!-- MarkdownTOC levels="2,3" autolink="true" -->

- [Functionality](#functionality)

<!-- /MarkdownTOC -->

## Functionality
The parser AST needs to account for the following:

- A single `Name` type, removing the distinction between different names found
  in the [lexer](./lexer.md). This should provide the functions `is_var`,
  `is_opr`, and `is_ref`.
- It should contain all of the language constructs that may appear in Enso's
  source.
- It should contain `Invalid` nodes, but these should be given a descriptive
  error as to _why_ the construct is invalid.
- It should also contain `Ambiguous` nodes for places where a macro cannot be
  resolved in an unambiguous fashion.

Each node should contain (see the sketch below):

- An identifier, attributed to it from the ID map.
- The start source position of the node, and the length (span) of the node.
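
A minimal sketch of what such a node wrapper could look like is given below.
The `Shape` enum, the field names, and the use of byte offsets are all
illustrative assumptions rather than a settled design.

```rust
/// The source extent of a node: its start position and length, as required
/// above. Offsets are assumed to be byte offsets; that choice is still open.
#[derive(Clone, Copy, Debug)]
pub struct Span {
    pub start  : usize,
    pub length : usize,
}

/// A hypothetical node wrapper pairing an ID-map identifier and a span with
/// the node's actual shape.
#[derive(Clone, Debug)]
pub struct Ast {
    pub id    : u64,
    pub span  : Span,
    pub shape : Shape,
}

/// A stand-in for the full set of language constructs, including the
/// `Invalid` and `Ambiguous` cases described above.
#[derive(Clone, Debug)]
pub enum Shape {
    Name      { text: String },
    Invalid   { reason: String },
    Ambiguous { candidates: Vec<String> },
}

impl Ast {
    pub fn is_var(&self) -> bool {
        matches!(&self.shape, Shape::Name { text } if text.starts_with(char::is_lowercase))
    }
}

fn main() {
    let node = Ast {
        id    : 0,
        span  : Span { start: 0, length: 3 },
        shape : Shape::Name { text: "foo".into() },
    };
    assert!(node.is_var());
}
```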

> The actionables for this section are:
>
> - Flesh out the design for the AST based on the requirements of the various
>   parser phases.

@@ -3,11 +3,37 @@ layout: developer-doc

title: Construct Resolution
category: parser
tags: [parser, construct, resolution]
order: 7
---

# Construct Resolution
Construct resolution is the process of turning the low-level AST format into
the full high-level AST format that represents all of Enso's language
constructs and contains rich error nodes.

<!-- MarkdownTOC levels="2,3" autolink="true" -->

- [Syntax Errors](#syntax-errors)

<!-- /MarkdownTOC -->

> The actionables for this section are:
>
> - Produce a detailed design for this resolution functionality, accounting
>   for all known current use cases.

## Syntax Errors
It is very important that Enso is able to provide descriptive and useful
syntax errors to its users. Doing so requires not only that the parser has a
full understanding of the language's syntax, but also that it is designed in
such a fashion that it will always succeed, regardless of any errors. Errors
must be (see the sketch after this list):

- Highly descriptive, so that it is easy for the runtime to explain to the
  user what went wrong.
- Highly localised, so that the scope of the error has as minimal an impact
  on parsing as possible.
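
As a sketch of the shape this could take, an error node might pair a
structured reason with the span it is localised to. All of the names below are
illustrative assumptions rather than a settled design.

```rust
/// A hypothetical structured reason for a syntax error, designed so the
/// runtime can render a descriptive message.
#[derive(Clone, Debug)]
pub enum SyntaxError {
    UnexpectedToken { expected: String, found: String },
    UnclosedBlock,
    AmbiguousMacro  { candidates: Vec<String> },
}

/// A hypothetical `Invalid` node: the error is localised to the offending
/// span, so the surrounding tree still parses and later phases can proceed.
#[derive(Clone, Debug)]
pub struct Invalid {
    pub error  : SyntaxError,
    pub start  : usize,
    pub length : usize,
}

fn main() {
    let err = Invalid {
        error: SyntaxError::UnexpectedToken {
            expected : "an expression".into(),
            found    : "operator `*`".into(),
        },
        start  : 42,
        length : 1,
    };
    println!("{err:?}");
}
```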

> The actionables for this section are:
>
> - Determine how to design this parsing phase to obtain very accurate syntax
>   errors.

@@ -7,7 +7,191 @@ order: 3

---

# Flexer
The flexer is a finite-automata-based engine for generating lexers. Akin to
`flex` and other lexer generators, it is given a lexer definition as a series
of rules, from which it then generates code for a highly-optimised lexer.

<!-- MarkdownTOC levels="2,3" autolink="true" -->

- [Pattern Description](#pattern-description)
- [State Management](#state-management)
- [Code Generation](#code-generation)
  - [Notes on Code Generation](#notes-on-code-generation)
- [An Example](#an-example)

<!-- /MarkdownTOC -->

## Pattern Description
The definition of a lexer using the flexer library consists of a set of rules
that describe how to behave when matching portions of syntax. These rules
behave as follows:

- A rule describes a regex-like pattern.
- It also describes the code to be executed when the pattern is matched.

```rust
pub fn lexer_definition() -> String {
    ...

    let chr   = alphaNum | '_';
    let blank = Pattern::from('_');

    lexer.rule(lexer.root,blank,"self.on_ident(Token::blank(self.start_location))");
}
```

A pattern, such as `chr` or `blank`, is a description of the characters that
should be matched for that pattern to match. The flexer library provides a set
of basic matchers for doing this.

A `lexer.rule(...)` definition consists of the following parts (a sketch of a
possible pattern representation follows this list):

- A state, used for grouping rules and named for debugging (see the section on
  [state management](#state-management) below).
- A pattern, as described above.
- The code that is executed when the pattern matches.
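
Because the pattern API is not yet pinned down, here is a small sketch of how
such regex-like patterns could be represented as plain data, with alternative,
sequence, and repetition combinators built on basic matchers. All of the names
are assumptions, not the real flexer API.

```rust
/// A hypothetical pattern representation: each combinator builds a tree that
/// the code generator can later compile to an NFA.
#[derive(Clone, Debug)]
enum Pattern {
    Range(char, char),               // Matches one character in a range.
    Char(char),                      // Matches a single character.
    Or(Box<Pattern>, Box<Pattern>),  // Alternative, i.e. `a | b`.
    Seq(Box<Pattern>, Box<Pattern>), // Sequence, i.e. `a >> b`.
    Many(Box<Pattern>),              // Zero or more repetitions.
}

impl Pattern {
    fn or(self, other: Pattern) -> Pattern {
        Pattern::Or(Box::new(self), Box::new(other))
    }
    fn seq(self, other: Pattern) -> Pattern {
        Pattern::Seq(Box::new(self), Box::new(other))
    }
    fn many(self) -> Pattern {
        Pattern::Many(Box::new(self))
    }
}

fn main() {
    // chr = alphaNum | '_', mirroring the definition above.
    let alpha_num = Pattern::Range('a', 'z').or(Pattern::Range('0', '9'));
    let chr       = alpha_num.or(Pattern::Char('_'));
    // var = lowerLetter >> chr.many()
    let var       = Pattern::Range('a', 'z').seq(chr.many());
    println!("{var:?}");
}
```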

## State Management
The flexer provides a mechanism for grouping sets of rules together, known as
a `State`. At any given time, only rules from the _active_ state are
considered by the lexer.

- States are named for purposes of debugging.
- You can activate another state from within the flexer instance by using
  `state.push(new_state)`.
- You can deactivate the topmost state by using `state.pop()`, as sketched
  below.
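
A sketch of this stack discipline, under the assumption that states are
identified by plain indices (the real representation is still undecided):

```rust
/// A hypothetical stack of active lexer states; the bottom entry is the
/// root state, which is never popped.
struct StateStack {
    stack: Vec<usize>,
}

impl StateStack {
    fn new(root: usize) -> Self { Self { stack: vec![root] } }

    /// Only rules belonging to this state are considered by the lexer.
    fn active(&self) -> usize { *self.stack.last().expect("root state") }

    fn push(&mut self, state: usize) { self.stack.push(state); }

    fn pop(&mut self) {
        // Keep the root state at the bottom of the stack.
        if self.stack.len() > 1 { self.stack.pop(); }
    }
}

fn main() {
    let mut states = StateStack::new(0);
    states.push(1); // e.g. entering an identifier-suffix check
    assert_eq!(states.active(), 1);
    states.pop();   // back to the root state
    assert_eq!(states.active(), 0);
}
```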

## Code Generation
The patterns in a lexer definition are used to generate a highly-efficient and
specialised lexer. This translation process works as follows:

1. All rules are taken and used to generate an NFA.
2. A DFA is generated from the NFA using the standard
   [subset construction](https://en.wikipedia.org/wiki/Powerset_construction)
   algorithm, but with some additional optimisations that ensure the following
   properties hold:
   - Patterns are matched in the order that they are defined.
   - The associated code chunks are maintained properly.
   - Lexing is `O(n)`, where `n` is the size of the input.
3. The DFA is used to generate the code for a lexer `Engine` struct, containing
   the `Lexer` definition.

The `Engine` generated through this process contains a main loop that consumes
the input stream character-by-character, evaluating a big switch generated from
the DFA using functions from the `Lexer`.

Lexing proceeds from the top to the bottom of the rules, and the first
expression that _matches fully_ is chosen. This differs from other common lexer
generators, which mostly choose the _longest_ match instead. Once a pattern is
matched, the associated code is executed and the process starts over again
until the input stream has been consumed.
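
The following is a minimal, hand-written sketch of the kind of specialised
code this process could produce for two of the rules above (`blank` and
`var`). Everything here is illustrative; the real generated code is a large
switch inside `consume_next_character`, derived mechanically from the DFA.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    Root,  // No input consumed yet.
    Blank, // Matched '_': a full match for the `blank` rule.
    Var,   // Matched a variable identifier so far.
    Done,  // No rule can extend the match any further.
}

/// One step of the DFA: each character is inspected exactly once, which is
/// what keeps lexing O(n) in the size of the input.
fn step(state: State, chr: char) -> State {
    match (state, chr) {
        (State::Root, '_') => State::Blank,
        (State::Root, c) if c.is_ascii_lowercase() => State::Var,
        // A '_' followed by further identifier characters is not a blank.
        (State::Blank, c) | (State::Var, c) if c.is_alphanumeric() || c == '_' => State::Var,
        _ => State::Done,
    }
}

fn main() {
    let mut state = State::Root;
    for chr in "foo_bar".chars() {
        state = step(state, chr);
    }
    // `var` is the first rule that matches the whole input, so the code
    // associated with it would be executed at this point.
    assert_eq!(state, State::Var);
}
```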

### Notes on Code Generation
The following properties are likely to hold for the code generation machinery.

- The vast majority of the code generated by the flexer is going to be the same
  for all lexers.
- The primary generation is in `consume_next_character`, which takes a `Lexer`
  as an argument.

## An Example
The following code provides a sketch of the intended API for flexer code
generation, using the definition of a simple lexer.

```rust
use crate::prelude::*;

use flexer;
use flexer::Flexer;



// =============
// === Token ===
// =============

pub struct Token {
    location : flexer::Location,
    ast      : TokenAst,
}

enum TokenAst {
    Var(ImString),
    Cons(ImString),
    Blank,
    ...
}

impl Token {
    pub fn new(location:Location, ast:TokenAst) -> Self {
        Self {location,ast}
    }

    pub fn var(location:Location, name:impl Into<ImString>) -> Self {
        let ast = TokenAst::Var(name.into());
        Self::new(location,ast)
    }

    ...
}



// =============
// === Lexer ===
// =============

#[derive(Debug,Default)]
struct Lexer<T:flexer::State> {
    current : Option<Token>,
    tokens  : Vec<Token>,
    state   : T
}

impl<T:flexer::State> Lexer<T> {
    fn on_ident(&mut self, tok:Token) {
        self.current = Some(tok);
        self.state.push(self.ident_sfx_check);
    }

    fn on_ident_err_sfx(&mut self) {
        println!("OH NO!")
    }

    fn on_no_ident_err_sfx(&mut self) {
        let current = std::mem::take(&mut self.current).unwrap();
        self.tokens.push(current);
    }
}

impl<T:flexer::State> flexer::Definition for Lexer<T> {
    fn state     (&    self) -> &    T { &    self.state }
    fn state_mut (&mut self) -> &mut T { &mut self.state }
}

pub fn lexer_source_code() -> String {
    let lexer = Flexer::<Lexer<_>>::new();

    let chr     = alphaNum | '_';
    let blank   = Pattern::from('_');
    let body    = chr.many() >> '\''.many();
    let var     = lowerLetter >> body;
    let cons    = upperLetter >> body;
    let breaker = "^`!@#$%^&*()-=+[]{}|;:<>,./ \t\r\n\\";

    let sfx_check = lexer.add(State("Identifier Suffix Check"));

    lexer.rule(lexer.root,var,"self.on_ident(Token::var(self.start_location,self.current_match()))");
    lexer.rule(lexer.root,cons,"self.on_ident(Token::cons(self.start_location,self.current_match()))");
    lexer.rule(lexer.root,blank,"self.on_ident(Token::blank(self.start_location))");
    lexer.rule(sfx_check,err_sfx,"self.on_ident_err_sfx()");
    lexer.rule(sfx_check,Flexer::always,"self.on_no_ident_err_sfx()");
    ...
    // This code needs to become a source file, probably via build.rs.
    lexer.generate_specialized_code()
}
```

Some things to note:

- The function definitions in `Lexer` take `self` as their first argument
  because `Engine` implements `Deref` and `DerefMut` to `Lexer`.

@@ -3,11 +3,19 @@ layout: developer-doc

title: JVM Object Generation
category: parser
tags: [parser, jvm, object-generation]
order: 10
---

# JVM Object Generation
The JVM object generation phase is responsible for creating JVM-native objects
representing the parser AST from the rust-native AST. This is required to allow
the compiler and runtime to work with the AST.

<!-- MarkdownTOC levels="2,3" autolink="true" -->

<!-- /MarkdownTOC -->

> The actionables for this section are:
>
> - Work out how on earth this is going to work.
> - Produce a detailed design for this functionality.

@@ -7,7 +7,51 @@ order: 4

---

# Lexer
The lexer is the code generated by the [flexer](./flexer.md) that is actually
responsible for lexing Enso source code. It chunks the character stream into a
(structured) token stream in order to make later processing faster, and to
identify blocks.

<!-- MarkdownTOC levels="2,3" autolink="true" -->

- [Lexer Functionality](#lexer-functionality)
- [The Lexer AST](#the-lexer-ast)

<!-- /MarkdownTOC -->

## Lexer Functionality
The lexer needs to provide the following functionality as part of the parser:

- It consumes the source lazily, character by character, and produces a
  structured token stream consisting of the lexer [AST](#the-lexer-ast).
- It must succeed on _any_ input, even if there are invalid constructs in the
  token stream, represented by `Invalid` tokens.

## The Lexer AST
In contrast to the full parser [AST](./ast.md), the lexer operates on a
simplified AST that we call a 'structured token stream'. While most lexers
output a linear token stream, it is very important in Enso that we encode the
nature of _blocks_ into the token stream, hence giving it structure.

This encoding of blocks is _crucial_ to the functionality of Enso as it ensures
that no later stages of the parser can ignore blocks, and hence maintains them
for use by the GUI.

The lexer AST contains the following constructs (a sketch of a possible
encoding follows the list):

- `Var`: Variable identifiers.
- `Ref`: Referent identifiers.
- `Opr`: Operator identifiers.
- `Number`: Numbers.
- `Text`: Text.
- `Invalid`: Invalid constructs that cannot be lexed.
- `Block`: Syntactic blocks in the language.
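
A minimal sketch of how the structured token stream could be encoded, with
`Block` nesting its contents directly rather than acting as a flat begin/end
marker pair. All names and fields here are assumptions based on the list
above; spans and locations are omitted for brevity.

```rust
/// A hypothetical lexer token.
#[derive(Clone, Debug)]
enum Token {
    Var(String),
    Ref(String),
    Opr(String),
    Number(String),
    Text(String),
    Invalid(String),
    /// The key point: a block nests its lines, so no later phase can
    /// accidentally treat the stream as flat.
    Block { indent: usize, lines: Vec<Vec<Token>> },
}

fn main() {
    // foo =
    //     bar + 1
    let stream = vec![
        Token::Var("foo".into()),
        Token::Opr("=".into()),
        Token::Block {
            indent: 4,
            lines: vec![vec![
                Token::Var("bar".into()),
                Token::Opr("+".into()),
                Token::Number("1".into()),
            ]],
        },
    ];
    println!("{stream:?}");
}
```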

The distinction is made here between the various kinds of identifiers in order
to keep lexing fast, but also in order to allow macros to switch on the kinds
of identifiers.

> The actionables for this section are:
>
> - Determine if we want to have separate ASTs for the lexer and the parser, or
>   not.

@@ -7,7 +7,39 @@ order: 5

---

# Macro Resolution
Macro resolution is the process of taking the structured token stream from the
[lexer](./lexer.md) and resolving it into the [AST](./ast.md) through the
process of resolving macros. This process produces a chunked AST stream,
including spacing-unaware elements.

<!-- MarkdownTOC levels="2,3" autolink="true" -->

- [Functionality](#functionality)
- [Errors During Macro Resolution](#errors-during-macro-resolution)

<!-- /MarkdownTOC -->

## Functionality
The current functionality of the macro resolver is as follows:

- TBC

The current overview of the macro resolution process can be found in the scala
[implementation](../../lib/syntax/specialization/shared/src/main/scala/org/enso/syntax/text/Parser.scala).

> The actionables for this section are:
>
> - Discuss how the space-unaware AST should be handled as it is produced by
>   macros.
> - Handle precedence for operators properly within macro resolution (e.g.
>   `x : a -> b : a -> c` should parse with the correct precedence).
> - Create a detailed design for how macro resolution should work.

## Errors During Macro Resolution
It is very important that, during macro resolution, the resolver produces
descriptive errors for the error conditions it encounters.

> The actionables for this section are:
>
> - Determine how best to provide detailed and specific errors from within the
>   macro resolution engine.

docs/parser/operator-resolution.md (new file)
@@ -0,0 +1,28 @@

---
layout: developer-doc
title: Operator Resolution
category: parser
tags: [parser, operator, resolution]
order: 6
---

# Operator Resolution
Operator resolution is the process of resolving applications of operators into
specific nodes on the AST.

<!-- MarkdownTOC levels="2,3" autolink="true" -->

- [Resolution Algorithm](#resolution-algorithm)

<!-- /MarkdownTOC -->

## Resolution Algorithm
The operator resolution process uses a version of the classic
[shunting-yard algorithm](https://en.wikipedia.org/wiki/Shunting-yard_algorithm)
with modifications to support operator sections.
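
For reference, the following is a minimal, runnable sketch of the unmodified
algorithm, translating a digit-and-operator token stream into postfix form.
The precedence table is illustrative, and the section-handling modifications
mentioned above are deliberately not represented.

```rust
fn precedence(op: char) -> u8 {
    match op {
        '+' | '-' => 1,
        '*' | '/' => 2,
        _         => 0,
    }
}

/// Classic shunting-yard: operands go straight to the output, operators wait
/// on a stack until an operator of lower precedence arrives.
fn to_postfix(tokens: &[char]) -> Vec<char> {
    let mut output    = Vec::new();
    let mut operators = Vec::new();

    for &tok in tokens {
        if tok.is_ascii_digit() {
            output.push(tok);
        } else {
            // Pop operators of higher-or-equal precedence first, which makes
            // the operators left-associative.
            while let Some(&top) = operators.last() {
                if precedence(top) >= precedence(tok) {
                    output.push(operators.pop().unwrap());
                } else {
                    break;
                }
            }
            operators.push(tok);
        }
    }
    while let Some(op) = operators.pop() {
        output.push(op);
    }
    output
}

fn main() {
    // 1 + 2 * 3  ==>  1 2 3 * +
    assert_eq!(to_postfix(&['1', '+', '2', '*', '3']), vec!['1', '2', '3', '*', '+']);
}
```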

> The actionables for this section are:
>
> - Work out how to formulate this functionality efficiently in rust. The scala
>   implementation can be found
>   [here](../../lib/syntax/definition/src/main/scala/org/enso/syntax/text/prec/Operator.scala).

@@ -3,11 +3,26 @@ layout: developer-doc

title: Parser Driver
category: parser
tags: [parser, driver]
order: 8
---

# Parser Driver
The parser driver component is responsible for orchestrating the entire action
of the parser. It handles the following duties:

1. Consuming input text using a provided [reader](./reader.md) in a lazy
   fashion.
2. Lexing and then parsing the input text.
3. Writing the output AST to the client of the parser.

<!-- MarkdownTOC levels="2,3" autolink="true" -->

- [Driver Clients](#driver-clients)

<!-- /MarkdownTOC -->

## Driver Clients
The parser is going to be employed in two contexts, both running in-process (a
sketch of the driver's orchestration follows this list):

1. In the IDE codebase as a rust dependency.
2. In the engine as a native code dependency used via JNI.
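
Every type in the sketch below is a hypothetical stub standing in for the
components described in the other documents, not a real API.

```rust
struct Token;
struct Ast;

/// Stub lexer: the real one consumes characters lazily from a reader and
/// produces the structured token stream.
fn lex(input: impl Iterator<Item = char>) -> Vec<Token> {
    input.map(|_| Token).collect()
}

/// Stub parser: macro, operator, and construct resolution would happen here.
fn parse(tokens: Vec<Token>) -> Ast {
    let _ = tokens;
    Ast
}

/// The driver's duties in order: read lazily, lex, parse, hand back the AST.
fn run_driver(source: &str) -> Ast {
    // `chars()` stands in for a lazy reader over an arbitrary encoding.
    parse(lex(source.chars()))
}

fn main() {
    let _ast = run_driver("x = 1");
}
```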

@@ -1,153 +0,0 @@

# Parser Design

## 1. Lexer (Code -> Token Stream)

- The lexer needs to be generic over the input stream encoding to support
  UTF-16 coming from the JVM.
- Is there any use case that requires the lexer to read an actual file?
- The prelude needs to be released to crates.io, otherwise we're going to
  rapidly get out of sync.
- I don't think it makes sense to have separate `Var` and `Cons` identifiers.
  We should instead have `Name`, with functions `is_referrent` and
  `is_variable`. This better mirrors how the language actually treats names.
- What actually is the flexer?
- What should the AST look like?

The lexer reads the source file (lazily, line by line) or uses an in-memory
`&str`, and produces a token stream of `Var`, `Cons`, `Opr`, `Number`, `Text`,
`Invalid`, and `Block`. Please note that `Block` is part of the token stream on
purpose. It is important that the source code is easy to parse visually, so if
you see a block, it should be a block. Discovering blocks in the lexer allows
us to prevent all other parts of the parser, like macros, from breaking this
assumption. Moreover, it makes the design of the following stages a lot
simpler. The Enso lexer should always succeed on any input stream (the token
stream may contain `Invalid` tokens).

The lexer is defined using the Rust procedural macro system. We are using
procedural macros because the lexer definition produces Rust code (pasting it
"in-place" of the macro usage). Let's consider a very simple lexer definition:

```rust
use crate::prelude::*; // Needs to be a released crate

use flexer;
use flexer::Flexer;



// =============
// === Token ===
// =============

pub struct Token {
    location : flexer::Location,
    ast      : TokenAst,
}

enum TokenAst {
    Var(ImString),
    Cons(ImString),
    Blank,
    ...
}

impl Token {
    pub fn new(location:Location, ast:TokenAst) -> Self {
        Self {location,ast}
    }

    pub fn var(location:Location, name:impl Into<ImString>) -> Self {
        let ast = TokenAst::Var(name.into());
        Self::new(location,ast)
    }

    ...
}



// =============
// === Lexer ===
// =============

#[derive(Debug,Default)]
struct Lexer {
    current : Option<Token>,
    tokens  : Vec<Token>,
    state   : Flexer::State
}

impl Lexer {
    fn on_ident(&mut self, tok:Token) {
        self.current = Some(tok);
        self.state.push(self.ident_sfx_check);
    }

    fn on_ident_err_sfx(&mut self) {
        println!("OH NO!")
    }

    fn on_no_ident_err_sfx(&mut self) {
        let current = std::mem::take(&mut self.current).unwrap();
        self.tokens.push(current);
    }
}

impl Flexer::Definition for Lexer {
    fn state     (&    self) -> &    flexer::State { &    self.state }
    fn state_mut (&mut self) -> &mut flexer::State { &mut self.state }
}

pub fn lexer_source_code() -> String {
    let lexer = Flexer::<Lexer>::new();

    let chr     = alphaNum | '_';
    let blank   = Pattern::from('_');
    let body    = chr.many() >> '\''.many();
    let var     = lowerLetter >> body;
    let cons    = upperLetter >> body;
    let breaker = "^`!@#$%^&*()-=+[]{}|;:<>,./ \t\r\n\\";

    let sfx_check = lexer.add(State("Identifier Suffix Check"));

    lexer.rule(lexer.root,var,"self.on_ident(Token::var(self.start_location,self.current_match()))");
    lexer.rule(lexer.root,cons,"self.on_ident(Token::cons(self.start_location,self.current_match()))");
    lexer.rule(lexer.root,blank,"self.on_ident(Token::blank(self.start_location))");
    lexer.rule(sfx_check,err_sfx,"self.on_ident_err_sfx()");
    lexer.rule(sfx_check,Flexer::always,"self.on_no_ident_err_sfx()");
    ...
    lexer.generate_specialized_code()
}
```

The idea here is that we are describing regex-like patterns and telling the
lexer what should happen when a pattern is matched. For example, after matching
the `var` pattern, the code `self.on_ident(ast::Var)` should be evaluated. The
code is passed as a string, because it will become part of the generated,
highly specialised, very fast lexer.

Technically, the patterns are first translated to a state machine, and then to
a bunch of if-then-else statements, in such a way that parsing is always
`O(n)`, where `n` is the input size. Logically, the regular expressions are
matched top-to-bottom and the first fully-matched expression is chosen (unlike
in the popular lexer generator flex, which uses the longest match instead).
After the expression is chosen, the associated function is executed and the
process starts over again until the end of the input stream. Only the rules
from the currently active state are considered. A state is just a named (for
debug purposes only) set of rules. The lexer always starts with the
`lexer.root` state. You can make another state active by running (from within
the Flexer instance) `state.push(new_state)`, and pop it using `state.pop()`.

The `lexer.generate_specialized_code` function works in a few steps:

1. It takes all rules and states and generates an NFA state machine.
2. It generates a DFA state machine using some custom optimisations to make
   sure that the regexps are matched in order and the associated code chunks
   are not lost.
3. It generates a highly tailored lexer `Engine` struct. One of the fields of
   the engine is the `Lexer` struct we defined above. The engine contains a
   main "loop" which consumes the input char by char, evaluates a big
   if-then-else machinery generated from the DFA, and evaluates functions from
   the `Lexer`. Please note that the functions start with `self`; that's
   because `Engine` implements `Deref` and `DerefMut` to `Lexer`.

The generation of the if-then-else code block is not defined in this document,
but can be observed by:

1. Inspecting the current code in Scala.
2. Printing the Java code generated by the current Scala Flexer implementation.
3. Talking with @wdanilo about it.


## 2. Macro Resolution (Token Stream -> Chunked AST Stream incl. space-unaware AST)

To be described in detail, taking into consideration all current use cases. For
the current documentation of macro resolution, take a look here:
https://github.com/luna/enso/blob/main/lib/syntax/specialization/shared/src/main/scala/org/enso/syntax/text/Parser.scala

Before implementing this step, we need to talk about the handling of
space-unaware AST (the AST produced by user-macros).


## 3. Operator Resolution (Chunked AST Stream -> Chunked AST Stream with Opr Apps)

Using a modified
[shunting-yard algorithm](https://en.wikipedia.org/wiki/Shunting-yard_algorithm).
The algorithm is modified to support sections. The Scala implementation is
here:
https://github.com/luna/enso/blob/main/lib/syntax/definition/src/main/scala/org/enso/syntax/text/prec/Operator.scala
Unfortunately, we cannot use recursion in Rust, so it needs to be re-worked.


## 4. Finalization and Special Rules Discovery (Chunked AST Stream with Opr Apps -> AST)

To be described in detail, taking into consideration all current use cases.

@@ -3,11 +3,41 @@ layout: developer-doc

title: Reading Source Code
category: parser
tags: [parser, reader]
order: 11
---

# Reading Source Code
The reader is responsible for abstracting the interface to reading a character
from a stream. This handles abstracting away the various encodings that the
project is going to use, as well as the backing formats for the stream.

<!-- MarkdownTOC levels="2,3" autolink="true" -->

- [Reader Functionality](#reader-functionality)
- [Provided Readers](#provided-readers)
  - [UTF-8 Reader](#utf-8-reader)
  - [UTF-16 Reader](#utf-16-reader)

<!-- /MarkdownTOC -->

## Reader Functionality
The reader trait needs to have the following functionality (a sketch follows
the list):

- It must read its input _lazily_, not requiring the entire input to be in
  memory.
- It should provide the interface to `next_character`, returning rust-native
  UTF-8, and hence abstract away the various underlying encodings.
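
A sketch of that trait and two toy implementations is given below. The
`next_character` name comes from this document; everything else is an
illustrative assumption, and real readers would pull lazily from a stream
rather than from the in-memory buffers used here.

```rust
/// The reader abstraction: hand out one rust-native `char` at a time,
/// whatever the underlying encoding.
trait Reader {
    /// Returns the next character, or `None` once the input is exhausted.
    fn next_character(&mut self) -> Option<char>;
}

/// A trivial UTF-8 reader over an in-memory string slice.
struct Utf8Reader<'a> {
    chars: std::str::Chars<'a>,
}

impl<'a> Reader for Utf8Reader<'a> {
    fn next_character(&mut self) -> Option<char> {
        self.chars.next()
    }
}

/// A UTF-16 reader, decoding JVM-style code units on the fly.
struct Utf16Reader<I: Iterator<Item = u16>> {
    decoder: std::char::DecodeUtf16<I>,
}

impl<I: Iterator<Item = u16>> Reader for Utf16Reader<I> {
    fn next_character(&mut self) -> Option<char> {
        // Replace unpaired surrogates rather than failing outright.
        self.decoder.next().map(|r| r.unwrap_or('\u{FFFD}'))
    }
}

fn main() {
    let units: Vec<u16> = "fn main".encode_utf16().collect();
    let mut reader = Utf16Reader {
        decoder: std::char::decode_utf16(units.into_iter()),
    };
    while let Some(c) = reader.next_character() {
        print!("{c}");
    }
    println!();
}
```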

## Provided Readers
The parser implementation currently provides the following reader utilities to
clients.

### UTF-8 Reader
Rust natively uses UTF-8 encoding for its strings. In order for the IDE to make
use of the parser, it must provide a simple rust-native reader.

### UTF-16 Reader
As the JVM as a platform makes use of UTF-16 for encoding its strings, we need
to provide a reader that lets JVM clients of the parser provide the source code
in a streaming fashion, without needing to re-encode it prior to passing it to
the parser.