---
layout: developer-doc
title: Lexer
category: syntax
tags: [parser, lexer]
order: 4
---

# Lexer

The lexer is the code generated by the [flexer](./flexer.md) that is actually
responsible for lexing Enso source code. It chunks the character stream into a
(structured) token stream in order to make later processing faster, and to
identify blocks.

<!-- MarkdownTOC levels="2,3" autolink="true" -->

- [Lexer Architecture](#lexer-architecture)
  - [Libraries in the Lexer Definition](#libraries-in-the-lexer-definition)
- [Lexer Functionality](#lexer-functionality)
- [The Lexer AST](#the-lexer-ast)

<!-- /MarkdownTOC -->

## Lexer Architecture

The structure of the flexer's code generation forces the lexer to be split into
two parts: the definition and the generation. As the latter is the point from
which the lexer is used, the generation subproject is the one that is given the
name `lexer`.

### Libraries in the Lexer Definition

The lexer generation subproject must be able to assume that all imports are in
the same place (relative to the crate root). To this end, the definition
subproject exports the public modules `library` and `prelude`. These are
re-imported and used in the generation subproject to ensure that all components
are found at the same paths relative to the crate root.

This does mean, however, that any import from _within_ the definition
subproject's own crate must go through the `library` module, rather than using
the item's path directly from the crate root.
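
The import rule above can be sketched in a single file. This is only an
illustration under assumptions: the modules `definition` and `generation` stand
in for the two crate roots, and `token`, `rules`, `generated`, `make_token`, and
`token_len` are hypothetical names. Only `library` and `prelude` come from the
description above, and their contents here are made up.

```rust
// Stand-in for the crate root of the definition subproject.
pub mod definition {
    // A hypothetical module that lives in the definition crate.
    pub mod token {
        #[derive(Clone, Debug, PartialEq)]
        pub struct Token {
            pub text: String,
        }
    }

    /// Re-exports everything that generated code may need to refer to.
    pub mod library {
        pub use super::token;
    }

    /// Shared imports; the contents here are purely illustrative.
    pub mod prelude {
        pub use std::fmt::Debug;
    }

    pub mod rules {
        // Imported through `library`, not straight from the crate root, so
        // the same relative path also resolves in the generation crate.
        use super::library::token::Token;

        pub fn make_token(text: &str) -> Token {
            Token { text: text.to_string() }
        }
    }
}

// Stand-in for the crate root of the generation subproject.
pub mod generation {
    // Re-import the two public modules under the same names.
    pub use super::definition::library;
    pub use super::definition::prelude;

    pub mod generated {
        // Identical relative path to the one used in `definition::rules`.
        use super::library::token::Token;

        pub fn token_len(token: &Token) -> usize {
            token.text.len()
        }
    }
}
```

In this sketch, `rules` and `generated` both refer to `Token` as
`library::token::Token` relative to their own root, which is the property the
generated code relies on.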

## Lexer Functionality

The lexer needs to provide the following functionality as part of the parser.

- It consumes the source lazily, character by character, and produces a
  structured token stream consisting of the lexer [ast](#the-lexer-ast).
- It must succeed on _any_ input. Invalid constructs do not cause lexing to
  fail, and are instead represented in the token stream by `Invalid` tokens.
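
Both requirements can be illustrated with a small hand-written iterator. This is
a sketch and not the flexer-generated lexer: `SketchLexer`, `consume_while`, and
the three-variant `Token` type are hypothetical, and the token rules are far
simpler than Enso's.

```rust
use std::iter::Peekable;

/// Hypothetical, heavily reduced token type used only for this sketch.
#[derive(Clone, Debug, PartialEq)]
pub enum Token {
    Word(String),
    Digits(String),
    Invalid(String),
}

/// Pulls characters on demand from any iterator over `char`.
pub struct SketchLexer<I: Iterator<Item = char>> {
    chars: Peekable<I>,
}

impl<I: Iterator<Item = char>> SketchLexer<I> {
    pub fn new(source: I) -> Self {
        Self { chars: source.peekable() }
    }

    /// Consumes characters while `keep` holds and returns them.
    fn consume_while(&mut self, keep: impl Fn(char) -> bool) -> String {
        let mut out = String::new();
        while let Some(&c) = self.chars.peek() {
            if !keep(c) {
                break;
            }
            out.push(c);
            self.chars.next();
        }
        out
    }
}

impl<I: Iterator<Item = char>> Iterator for SketchLexer<I> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        // Skip whitespace, then stop if the source is exhausted.
        self.consume_while(char::is_whitespace);
        let first = *self.chars.peek()?;
        let token = if first.is_alphabetic() {
            Token::Word(self.consume_while(char::is_alphanumeric))
        } else if first.is_ascii_digit() {
            Token::Digits(self.consume_while(|c| c.is_ascii_digit()))
        } else {
            // Never fail: anything else becomes an `Invalid` token. The
            // predicate always accepts `first`, so progress is guaranteed.
            Token::Invalid(self.consume_while(|c| {
                !c.is_whitespace() && !c.is_alphabetic() && !c.is_ascii_digit()
            }))
        };
        Some(token)
    }
}
```

Because tokens are produced only when `next` is called, the sketch can be driven
by an unbounded source, for example
`SketchLexer::new("ab ".chars().cycle()).take(3)`, and input such as `@@` yields
an `Invalid` token instead of an error.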

## The Lexer AST

In contrast to the full parser [ast](./ast.md), the lexer operates on a
simplified AST that we call a 'structured token stream'. While most lexers
output a linear token stream, it is very important in Enso that we encode the
nature of _blocks_ into the token stream, hence giving it structure.

This encoding of blocks is _crucial_ to the functionality of Enso as it ensures
that no later stages of the parser can ignore blocks, and hence maintains them
for use by the GUI.

The lexer AST contains the following constructs:

- `Referent`: Referent identifiers (e.g. `Some_Ref_Ident`).
- `Variable`: Variable identifiers (e.g. `some_var_ident`).
- `External`: External identifiers (e.g. `someJavaName`).
- `Blank`: The blank name `_`.
- `Operator`: Operator identifiers (e.g. `-->>`).
- `Modifier`: Modifier operators (e.g. `+=`).
- `Number`: Numbers (e.g. `16_FFFF`).
- `DanglingBase`: An explicit base without an associated number (e.g. `16_`).
- `Text`: Text (e.g. `"Some text goes here."`).
- `Line`: A line in a block that contains tokens.
- `BlankLine`: A line in a block that contains only whitespace.
- `Block`: Syntactic blocks in the language.
- `InvalidSuffix`: Tokens that are invalid in a given lexer state, but that
  would otherwise be valid.
- `Unrecognized`: Tokens that the lexer doesn't recognise.

The distinction is made here between the various kinds of identifiers in order
to keep lexing fast, but also in order to allow macros to switch on the kinds of
identifiers.
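
As a rough illustration, the constructs listed above could be modelled as a
recursive enum. The variant names follow the list, but every payload type and
both helper functions (`count_blocks`, `identifier_kind`) are assumptions made
for this sketch, not the actual definitions used by the lexer.

```rust
/// Hypothetical shape of the structured token stream.
#[derive(Clone, Debug, PartialEq)]
pub enum Token {
    Referent(String),
    Variable(String),
    External(String),
    Blank,
    Operator(String),
    Modifier(String),
    Number { base: Option<String>, digits: String },
    DanglingBase(String),
    Text(String),
    Line(Vec<Token>),
    BlankLine,
    Block(Vec<Token>),
    InvalidSuffix(String),
    Unrecognized(String),
}

/// Counts blocks, including nested ones. Because blocks are part of the
/// stream's structure, a consumer can find them without re-deriving
/// indentation from the source text.
pub fn count_blocks(tokens: &[Token]) -> usize {
    tokens
        .iter()
        .map(|token| match token {
            Token::Block(lines) => 1 + count_blocks(lines),
            Token::Line(inner) => count_blocks(inner),
            _ => 0,
        })
        .sum()
}

/// A consumer (such as a macro) can switch on the identifier kind directly,
/// without inspecting the identifier's text again.
pub fn identifier_kind(token: &Token) -> Option<&'static str> {
    match token {
        Token::Referent(_) => Some("referent"),
        Token::Variable(_) => Some("variable"),
        Token::External(_) => Some("external"),
        Token::Blank => Some("blank"),
        _ => None,
    }
}
```

Keeping one variant per identifier kind means the classification is done once,
during lexing, and later stages only need a `match`.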

> The actionables for this section are:
>
> - Determine if we want to have separate ASTs for the lexer and the parser, or
>   not.