enso/docs/parser/lexer.md
2020-07-21 13:59:40 +01:00


---
layout: developer-doc
title: Lexer
category: syntax
tags: [parser, lexer]
order: 4
---
# Lexer
The lexer is the code generated by the [flexer](./flexer.md) that is
responsible for lexing Enso source code. It chunks the character stream into a
(structured) token stream, both to make later processing faster and to
identify blocks.
<!-- MarkdownTOC levels="2,3" autolink="true" -->
- [Lexer Functionality](#lexer-functionality)
- [The Lexer AST](#the-lexer-ast)
<!-- /MarkdownTOC -->
## Lexer Functionality
The lexer needs to provide the following functionality as part of the parser.
- It consumes the source lazily, character by character, and produces a
  structured token stream consisting of the lexer [AST](#the-lexer-ast).
- It must succeed on _any_ input. Invalid constructs in the source are
  represented in the token stream by `Invalid` tokens.
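
The two requirements above can be illustrated with a minimal sketch. It is not
the flexer-generated lexer: the `Token` type, the `Lexer` struct, and the
`consume_while` helper are hypothetical names invented for this example, and
the token set is reduced to words and numbers, with everything else (including
operators) lumped into `Invalid` for brevity. The sketch pulls characters only
when a token is requested, and it never returns an error.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Hypothetical, simplified token type; NOT the actual Enso lexer AST.
#[derive(Debug, PartialEq)]
enum Token {
    Word(String),
    Number(String),
    Invalid(String),
}

/// Pulls characters from the source only when the next token is requested.
struct Lexer<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Lexer<'a> {
    fn new(source: &'a str) -> Self {
        Lexer { chars: source.chars().peekable() }
    }

    /// Consumes characters while `pred` holds and returns them as a string.
    fn consume_while(&mut self, pred: impl Fn(char) -> bool) -> String {
        let mut out = String::new();
        while let Some(&c) = self.chars.peek() {
            if !pred(c) {
                break;
            }
            out.push(c);
            self.chars.next();
        }
        out
    }
}

impl<'a> Iterator for Lexer<'a> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        // Skip whitespace between tokens.
        self.consume_while(|c| c.is_whitespace());
        // End of input ends the stream; it is not an error.
        let first = *self.chars.peek()?;
        let token = if first.is_alphabetic() {
            Token::Word(self.consume_while(|c| c.is_alphanumeric()))
        } else if first.is_ascii_digit() {
            Token::Number(self.consume_while(|c| c.is_ascii_digit()))
        } else {
            // Anything unrecognised becomes an `Invalid` token, so lexing
            // succeeds on any input. `first` satisfies this predicate, so at
            // least one character is always consumed.
            Token::Invalid(self.consume_while(|c| {
                !(c.is_alphabetic() || c.is_ascii_digit() || c.is_whitespace())
            }))
        };
        Some(token)
    }
}
```

Because `Lexer` implements `Iterator`, a caller can take tokens one at a time
with `next()` or collect the whole stream with
`Lexer::new(source).collect::<Vec<_>>()`.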
## The Lexer AST
In contrast to the full parser [AST](./ast.md), the lexer operates on a
simplified AST that we call a 'structured token stream'. While most lexers
output a linear token stream, it is very important in Enso that we encode the
nature of _blocks_ into the token stream, hence giving it structure.

This encoding of blocks is _crucial_ to the functionality of Enso: it ensures
that no later stage of the parser can ignore blocks, and hence preserves them
for use by the GUI.
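
To illustrate what giving the token stream structure can mean, here is a
minimal sketch that nests lines into blocks by indentation. It rests on
assumptions that this document does not state: that blocks are delimited by
leading whitespace, that indentation can be measured as a count of leading
whitespace bytes, and that blank lines can be skipped. The `Token` type and the
`lex_blocks` and `parse_level` functions are hypothetical names, not part of
the Enso lexer.

```rust
/// Hypothetical structured token: either a line of source or a nested block.
#[derive(Debug, PartialEq)]
enum Token {
    Line(String),
    Block(Vec<Token>),
}

/// Groups the non-blank lines of `source` into nested blocks.
/// ASSUMPTION: indentation is the number of leading whitespace bytes.
fn lex_blocks(source: &str) -> Vec<Token> {
    let lines: Vec<(usize, &str)> = source
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| (l.len() - l.trim_start().len(), l.trim()))
        .collect();
    let mut pos = 0;
    // The top level has indentation 0, so it is never closed by a dedent.
    parse_level(&lines, &mut pos, 0)
}

/// Collects tokens at `indent`, recursing when the indentation increases.
fn parse_level(lines: &[(usize, &str)], pos: &mut usize, indent: usize) -> Vec<Token> {
    let mut tokens = Vec::new();
    while *pos < lines.len() {
        let (line_indent, text) = lines[*pos];
        if line_indent < indent {
            // Dedent: this level is finished; the caller handles the line.
            break;
        } else if line_indent > indent {
            // Deeper indentation opens a nested block.
            tokens.push(Token::Block(parse_level(lines, pos, line_indent)));
        } else {
            tokens.push(Token::Line(text.to_string()));
            *pos += 1;
        }
    }
    tokens
}
```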

The structured token stream contains the following constructs:
- `Var`: Variable identifiers.
- `Ref`: Referent identifiers.
- `Opr`: Operator identifiers.
- `Number`: Numbers.
- `Text`: Text.
- `Invalid`: Invalid constructs that cannot be lexed.
- `Block`: Syntactic blocks in the language.

The lexer distinguishes between the various kinds of identifiers in order to
keep lexing fast, and also to allow macros to switch on the kind of
identifier.
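
As a rough sketch of how these constructs and the identifier distinction might
look in code, the following renders the list as a Rust enum. The variant names
come from the list above, but the payload types and the `classify_identifier`
and `describe` functions are hypothetical. The classification rule is also an
assumption made for illustration: a leading lowercase letter or underscore
gives a `Var`, a leading uppercase letter gives a `Ref`, and a leading symbol
gives an `Opr`.

```rust
/// Hypothetical Rust rendering of the constructs listed above.
/// The payload types are assumptions, not the actual Enso definitions.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Token {
    Var(String),
    Ref(String),
    Opr(String),
    Number(String),
    Text(String),
    Invalid(String),
    Block(Vec<Token>),
}

/// Classifies an identifier once, at lexing time.
/// ASSUMPTION: the kind is decided by the first character only.
fn classify_identifier(name: &str) -> Token {
    match name.chars().next() {
        Some(c) if c.is_lowercase() || c == '_' => Token::Var(name.to_string()),
        Some(c) if c.is_uppercase() => Token::Ref(name.to_string()),
        Some(c) if !c.is_alphanumeric() && !c.is_whitespace() => {
            Token::Opr(name.to_string())
        }
        // Empty input or any other leading character cannot be classified.
        _ => Token::Invalid(name.to_string()),
    }
}

/// A later consumer, such as a macro, can switch on the identifier kind
/// without re-inspecting the text.
fn describe(token: &Token) -> &'static str {
    match token {
        Token::Var(_) => "variable identifier",
        Token::Ref(_) => "referent identifier",
        Token::Opr(_) => "operator identifier",
        _ => "not an identifier",
    }
}
```

Since the kind is recorded in the variant, `describe` needs only a single
`match` and never has to look at the identifier's characters again.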
> The actionables for this section are:
>
> - Determine if we want to have separate ASTs for the lexer and the parser, or
> not.