---
layout: developer-doc
title: Lexer
category: syntax
tags: [parser, lexer]
order: 4
---

# Lexer

The lexer is the code generated by the [flexer](./flexer.md) that is
responsible for lexing Enso source code. It chunks the character stream into a
(structured) token stream in order to make later processing faster, and to
identify blocks.

<!-- MarkdownTOC levels="2,3" autolink="true" -->

- [Lexer Architecture](#lexer-architecture)
  - [Libraries in the Lexer Definition](#libraries-in-the-lexer-definition)
- [Lexer Functionality](#lexer-functionality)
- [The Lexer AST](#the-lexer-ast)

<!-- /MarkdownTOC -->

## Lexer Architecture

The structure of the flexer's code generation forces the lexer to be split into
two parts: the definition, and the generation. As the generation subproject is
the point from which the lexer will be used, it is the one that carries the
name `lexer`.

### Libraries in the Lexer Definition

The lexer generation subproject must be able to assume that all imports will be
in the same place (relative to the crate root). To this end, the definition
subproject exports public modules `library` and `prelude`. These are
re-imported and used in the generation subproject to ensure that all components
are found at the same paths relative to the crate root.

This does mean, however, that all imports from _within_ the current crate in the
definition subproject must be imported from the `library` module, not from their
paths directly from the crate root.
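
The import convention described above can be sketched as follows. This is a
minimal illustration only: the `library` and `prelude` module names come from
this document, while `token`, `Token`, `definition`, and `make_token` are
hypothetical names introduced for the example and are not the actual contents
of the definition subproject.

```rust
// Hypothetical sketch of the definition subproject's crate root.
// Only the module names `library` and `prelude` come from the document.

/// An internal module holding an item that the lexer definition needs.
pub mod token {
    /// Placeholder token type, for illustration only.
    #[derive(Debug, PartialEq)]
    pub struct Token(pub String);
}

/// Re-exports every in-crate item that the definition refers to, so that
/// each one is reachable at a `crate::library::...` path.
pub mod library {
    pub use crate::token;
}

/// Re-exports commonly used external items in one place.
pub mod prelude {
    pub use std::collections::HashMap;
}

/// Stand-in for the lexer definition itself.
pub mod definition {
    // Imported through `library`, not through `crate::token` directly.
    use crate::library::token::Token;

    /// Builds a placeholder token from a string slice.
    pub fn make_token(text: &str) -> Token {
        Token(text.to_string())
    }
}
```

Under this layout, a crate that re-imports `library` and `prelude` at its own
root would resolve `crate::library::token::Token` at the same path, which is
the property the generation subproject relies on.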

## Lexer Functionality

The lexer needs to provide the following functionality as part of the parser.

- It consumes the source lazily, character by character, and produces a
  structured token stream consisting of the lexer [AST](#the-lexer-ast).
- It must succeed on _any_ input. Invalid constructs are represented in the
  token stream by dedicated tokens (`InvalidSuffix` and `Unrecognized`) rather
  than causing lexing to fail.
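
The two requirements above can be illustrated with a small hand-written lexer.
This is a sketch under stated assumptions, not the flexer-generated code: the
`Lexer` struct, its `consume_while` helper, and the reduced three-variant
`Token` enum are hypothetical and exist only for this example. It reads
characters on demand through an iterator and falls back to an `Unrecognized`
token instead of returning an error.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Reduced, hypothetical token set used only for this sketch.
#[derive(Debug, PartialEq)]
pub enum Token {
    Variable(String),
    Number(String),
    Unrecognized(String),
}

/// A lexer that pulls characters from the input only when a token is requested.
pub struct Lexer<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Lexer<'a> {
    pub fn new(input: &'a str) -> Self {
        Lexer { chars: input.chars().peekable() }
    }

    /// Consumes characters while `pred` holds and returns them as a string.
    fn consume_while(&mut self, pred: impl Fn(char) -> bool) -> String {
        let mut text = String::new();
        while let Some(&c) = self.chars.peek() {
            if !pred(c) {
                break;
            }
            text.push(c);
            self.chars.next();
        }
        text
    }
}

impl<'a> Iterator for Lexer<'a> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        // Skip whitespace between tokens (a simplification for this sketch).
        self.consume_while(char::is_whitespace);
        let first = *self.chars.peek()?;
        let token = if first.is_ascii_lowercase() {
            Token::Variable(self.consume_while(|c| c.is_ascii_lowercase() || c == '_'))
        } else if first.is_ascii_digit() {
            Token::Number(self.consume_while(|c| c.is_ascii_digit()))
        } else {
            // Fallback: never fail, but always consume one character so that
            // lexing makes progress on any input.
            self.chars.next();
            Token::Unrecognized(first.to_string())
        };
        Some(token)
    }
}
```

Because `Lexer` implements `Iterator`, a caller can take tokens one at a time
or collect them all; in neither case does an unexpected character abort
lexing.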

## The Lexer AST

In contrast to the full parser [AST](./ast.md), the lexer operates on a
simplified AST that we call a 'structured token stream'. While most lexers
output a linear token stream, it is very important in Enso that we encode the
nature of _blocks_ into the token stream, hence giving it structure.

This encoding of blocks is _crucial_ to the functionality of Enso as it ensures
that no later stages of the parser can ignore blocks, and hence maintains them
for use by the GUI.

It contains the following constructs:

- `Referent`: Referent identifiers (e.g. `Some_Ref_Ident`).
- `Variable`: Variable identifiers (e.g. `some_var_ident`).
- `External`: External identifiers (e.g. `someJavaName`).
- `Blank`: The blank name `_`.
- `Operator`: Operator identifiers (e.g. `-->>`).
- `Modifier`: Modifier operators (e.g. `+=`).
- `Number`: Numbers (e.g. `16_FFFF`).
- `DanglingBase`: An explicit base without an associated number (e.g. `16_`).
- `Text`: Text (e.g. `"Some text goes here."`).
- `Line`: A line in a block that contains tokens.
- `BlankLine`: A line in a block that contains only whitespace.
- `Block`: Syntactic blocks in the language.
- `InvalidSuffix`: Tokens that are invalid in a given lexer state but would
  otherwise be valid.
- `Unrecognized`: Tokens that the lexer doesn't recognise.

The distinction is made here between the various kinds of identifiers in order
to keep lexing fast, but also in order to allow macros to switch on the kinds of
identifiers.
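
One way to picture the constructs above is as a Rust enum in which `Block` and
`Line` hold nested tokens. This is a hedged sketch: the variant names follow
the list in this document, but the payload types, the `name_kind` helper, and
the `block_depth` helper are assumptions made for illustration and do not
describe the real lexer's data types.

```rust
/// Hypothetical token shapes mirroring the constructs listed in this section.
/// The payload types are assumptions made for this sketch.
#[derive(Debug, PartialEq)]
pub enum Token {
    Referent(String),
    Variable(String),
    External(String),
    Blank,
    Operator(String),
    Modifier(String),
    Number(String),
    DanglingBase(String),
    Text(String),
    Line(Vec<Token>),
    BlankLine,
    Block(Vec<Token>),
    InvalidSuffix(String),
    Unrecognized(String),
}

/// Switches on the kind of name, as a macro might; returns `None` for
/// tokens that are not names.
pub fn name_kind(token: &Token) -> Option<&'static str> {
    match token {
        Token::Referent(_) => Some("referent"),
        Token::Variable(_) => Some("variable"),
        Token::External(_) => Some("external"),
        Token::Blank => Some("blank"),
        _ => None,
    }
}

/// Returns how deeply blocks are nested inside `token`. Because blocks are
/// part of the token structure, later stages cannot overlook them.
pub fn block_depth(token: &Token) -> usize {
    match token {
        Token::Block(lines) => 1 + lines.iter().map(block_depth).max().unwrap_or(0),
        Token::Line(tokens) => tokens.iter().map(block_depth).max().unwrap_or(0),
        _ => 0,
    }
}
```

Keeping each identifier kind as its own variant means a consumer can dispatch
with a single `match` rather than re-inspecting the identifier's text.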

> The actionables for this section are:
>
> - Determine if we want to have separate ASTs for the lexer and the parser, or
>   not.