roc/crates/compiler/DESIGN.md
2023-03-13 14:07:42 +00:00

5.7 KiB

Compiler Design

The current Roc compiler is designed as a pipelining compiler parallelizable across Roc modules.

Roc's compilation pipeline consists of a few major components, which form the table of contents for this document.

Parsing

Roc's parsers are designed as combinators. A list of Roc's parse AST and combinators can be found in the root parse file.

Combinators enable parsing to compose as functions would - for example, the one_of combinator supports attempting multiple parsing strategies, and succeeding on the first one; the and_then combinator chains two parsers together, failing if either parser in the sequence fails.

Since Roc is an indentation-sensitive language, parsing must be cognizant and deligent about handling indentation and de-indentation levels. Most parsing functions take a min_indent parameter that specifies the minimum indentation of the scope an expression should be parsed in. Generally, failing to reach min_indent indicates that an expression has ended (but perhaps too early).

Canonicalization

After parsing a Roc program into an AST, the AST is transformed into a canonical form AST. This may seem a bit redundant - why build another tree, when we already have the AST? Canonicalization performs a few analyses to catch user errors, and sets up the state necessary to solve the types in a program. Among other things, canonicalization

  • Uniquely identifies names (think variable and function names). Along the way, canonicalization builds a graph of all variables' references, and catches unused definitions, undefined definitions, and shadowed definitions.
  • Resolves type signatures, including aliases, into a form suitable for type solving.
  • Determines the order definitions are used in, if they are defined out-of-order.
  • Eliminates syntax sugar (for example, renaming + to the function call add and converting backpassing to function calls).
  • Collects declared abilities, and ability implementations defined for opaque types. Derived abilities for opaque types are elaborated during canonicalization.

Symbol Resolution

Identifiers, like variable names, are resolved to Symbols.

Currently, a symbol is a 64-bit value with

  • the bottom 32 bits defining the ModuleId the symbol is defined in
  • the top 32 bits defining the IdentId of the symbol in the module

A symbol is unique per identifier name and the scope that the identifier has been declared in. Symbols are how the rest of the compiler refers to value definitions - since the unique scope and identifier name is disambiguated when symbols are created, referencing symbols requires no further name resolution.

As symbols are constructed, canonicalization also keeps track of all references to a given symbol. This simplifies catching unused definitions, undefined definitions, and shadowing, to an index into an array.

Type-alias normalization

Closure naming

Constraint Generation

(Mutually-)recursive definitions

Type Solving

Unification

Type Inference

Recursive Types

Lambda Sets

Ability Collection

Ability Specialization

Ability Derivation

Exhaustiveness Checking

Debugging

IR Generation

Memory Layouts

Compiling Calls

Decision Trees

Tail-call Optimization

Reference-count insertion

Reusing Memory Allocations

Debugging

LLVM Code Generator

Morphic Analysis

C ABI

Test Harness

Debugging

WASM Code Generator

WASM Interpreter

Debugging

Dev Code Generator

Debugging

Builtins

Compiler Driver

Caching types

Repl

test and dbg

Formatter

Glue

Active areas of research / help wanted