CodeGen Documentation
CodeGen is the process for auto-generating language-specific, strongly-typed ASTs to be used in Semantic.
Note: This project was recently moved from tree-sitter into Semantic. These docs are in the process of being updated to reflect changes.
Prerequisites
To get started, first make sure your language has:
- An existing tree-sitter parser;
- An existing Cabal package in this repository for said language. This will provide an interface into tree-sitter's C source. Here is an example of a library for Python, a supported language that the remaining documentation will refer to.
CodeGen Pipeline
During parser generation, tree-sitter produces a JSON file that captures the structure of a language's grammar. Based on this, we're able to derive datatypes representing surface languages, and then use those datatypes to generically build ASTs. This automates the engineering effort historically required for adding a new language.
The following steps provide a high-level outline of the process:
- Deserialize. First, we deserialize the node-types.json file for a given language into the desired shape of datatypes, using the parsing capabilities of the Aeson library. Entries in the node-types.json file take one of four distinct shapes: sums, products, named leaves and anonymous leaves.
- Generate Syntax. We then use Template Haskell to auto-generate language-specific, strongly-typed datatypes that represent various language constructs. This API exports the top-level function astDeclarationsForLanguage to auto-generate datatypes at compile-time, which is invoked by a given language AST module.
- Unmarshal. Unmarshaling is the process of iterating over tree-sitter's parse trees using its tree cursor API, and producing Haskell ASTs for the relevant nodes. We parse source code with tree-sitter and unmarshal the resulting data to build these ASTs generically. This file exports the top-level function parseByteString, which takes source code and a language as arguments, and produces an AST.
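The four shapes mentioned in the Deserialize step can be illustrated with a small sketch. It assumes the usual layout of tree-sitter's node-types.json, where an entry with "subtypes" is a sum, an entry with "fields" or "children" is a product, and everything else is a leaf that is either named or anonymous. The names NodeEntry, Shape and classify are hypothetical and are not the library's actual API; the real implementation decodes the JSON with Aeson rather than taking pre-computed flags.

```haskell
-- Hypothetical, simplified view of one entry in node-types.json.
-- These field names are illustrative only, not the library's own types.
data NodeEntry = NodeEntry
  { entryType   :: String -- value of the "type" key, e.g. "identifier"
  , entryNamed  :: Bool   -- value of the "named" key
  , hasSubtypes :: Bool   -- True when the entry lists "subtypes"
  , hasChildren :: Bool   -- True when the entry lists "fields" or "children"
  }

-- The four shapes named in the Deserialize step.
data Shape = Sum | Product | NamedLeaf | AnonymousLeaf
  deriving (Eq, Show)

-- Decide which shape an entry has (assumed rule, see the lead-in).
classify :: NodeEntry -> Shape
classify entry
  | hasSubtypes entry = Sum
  | hasChildren entry = Product
  | entryNamed entry  = NamedLeaf
  | otherwise         = AnonymousLeaf

-- A named leaf such as Python's identifier node.
identifierEntry :: NodeEntry
identifierEntry = NodeEntry "identifier" True False False

-- An anonymous leaf such as a punctuation token.
colonEntry :: NodeEntry
colonEntry = NodeEntry ":" False False False
```

Loading this in GHCi and evaluating classify identifierEntry shows how a named leaf is told apart from the other three shapes before any datatype is generated.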
Here is an example that describes the relationship between a Python identifier represented in the tree-sitter generated JSON file, and a datatype generated by Template Haskell based on the provided JSON:
Type | JSON | TH-generated code |
---|---|---|
Named leaf | { | data TreeSitter.Python.AST.Identifier a |
The remainder of this document provides more details on generating ASTs, inspecting datatypes, running tests, and the decisions behind the relevant APIs.
Generating ASTs
To parse source code and produce ASTs locally:
- Load the REPL for a given language:
cabal new-repl lib:tree-sitter-python
- Set the language extensions OverloadedStrings and TypeApplications, and import the relevant modules AST.Unmarshal, Source.Range and Source.Span:
:seti -XOverloadedStrings
:seti -XTypeApplications
import Source.Span
import Source.Range
import AST.Unmarshal
- You can now call parseByteString, passing in the language you wish to parse (in this case Python, represented by tree_sitter_python) and the source code (in this case the integer 1). Since the function is constrained by (Unmarshal t, UnmarshalAnn a), you can use type applications to provide a top-level node t, which is the entry point into the tree, along with a polymorphic annotation a, used to represent range and span:
parseByteString @TreeSitter.Python.AST.Module @(Source.Range.Range, Source.Span.Span) tree_sitter_python "1"
This generates the following AST:
Right
  ( Module
    { ann =
      ( Range
        { start = 0
        , end = 1
        }
      , Span
        { start = Pos
          { line = 0
          , column = 0
          }
        , end = Pos
          { line = 0
          , column = 1
          }
        }
      )
    , extraChildren =
      [ R1
        ( SimpleStatement
          ( L1
            ( R1
              ( R1
                ( L1
                  ( ExpressionStatement
                    { ann =
                      ( Range
                        { start = 0
                        , end = 1
                        }
                      , Span
                        { start = Pos
                          { line = 0
                          , column = 0
                          }
                        , end = Pos
                          { line = 0
                          , column = 1
                          }
                        }
                      )
                    , extraChildren = L1
                      ( L1
                        ( Expression
                          ( L1
                            ( L1
                              ( L1
                                ( PrimaryExpression
                                  ( R1
                                    ( L1
                                      ( L1
                                        ( L1
                                          ( Integer
                                            { ann =
                                              ( Range
                                                { start = 0
                                                , end = 1
                                                }
                                              , Span
                                                { start = Pos
                                                  { line = 0
                                                  , column = 0
                                                  }
                                                , end = Pos
                                                  { line = 0
                                                  , column = 1
                                                  }
                                                }
                                              )
                                            , text = "1"
                                            }
                                          )
                                        )
                                      )
                                    )
                                  )
                                )
                              )
                            )
                          )
                        )
                      ) :| []
                    }
                  )
                )
              )
            )
          )
        )
      ]
    }
  )
Inspecting auto-generated datatypes
Datatypes are derived from a language and its node-types.json file using the GenerateSyntax API. Definitions can be viewed in the REPL just as they would be for any other datatype, using :i:
:i TreeSitter.Python.AST.Module
This shows us the auto-generated Module datatype:
data TreeSitter.Python.AST.Module a
= TreeSitter.Python.AST.Module {TreeSitter.Python.AST.ann :: a,
TreeSitter.Python.AST.extraChildren :: [(GHC.Generics.:+:)
TreeSitter.Python.AST.CompoundStatement
TreeSitter.Python.AST.SimpleStatement
a]}
-- Defined at TreeSitter/Python/AST.hs:10:1
instance Show a => Show (TreeSitter.Python.AST.Module a)
-- Defined at TreeSitter/Python/AST.hs:10:1
instance Ord a => Ord (TreeSitter.Python.AST.Module a)
-- Defined at TreeSitter/Python/AST.hs:10:1
instance Eq a => Eq (TreeSitter.Python.AST.Module a)
-- Defined at TreeSitter/Python/AST.hs:10:1
instance Traversable TreeSitter.Python.AST.Module
-- Defined at TreeSitter/Python/AST.hs:10:1
instance Functor TreeSitter.Python.AST.Module
-- Defined at TreeSitter/Python/AST.hs:10:1
instance Foldable TreeSitter.Python.AST.Module
-- Defined at TreeSitter/Python/AST.hs:10:1
instance Unmarshal TreeSitter.Python.AST.Module
-- Defined at TreeSitter/Python/AST.hs:10:1
instance SymbolMatching TreeSitter.Python.AST.Module
-- Defined at TreeSitter/Python/AST.hs:10:1
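To make the shape of this output easier to read, here is a minimal hand-written sketch of a datatype with the same structure: an annotation field plus a list of children drawn from a sum built with GHC.Generics' :+:. The CompoundStatement and SimpleStatement stand-ins below are deliberately empty placeholders and are assumptions made for illustration; the real generated types carry their own fields, and the Unmarshal and SymbolMatching instances are omitted here.

```haskell
{-# LANGUAGE DeriveTraversable #-}
{-# LANGUAGE TypeOperators #-}

import GHC.Generics ((:+:) (..))

-- Placeholder stand-ins for the generated statement types (illustrative only).
data CompoundStatement a = CompoundStatement { compoundAnn :: a }
  deriving (Eq, Ord, Show, Functor, Foldable, Traversable)

data SimpleStatement a = SimpleStatement { simpleAnn :: a }
  deriving (Eq, Ord, Show, Functor, Foldable, Traversable)

-- Same structure as the Module shown by :i, written by hand.
data Module a = Module
  { ann           :: a
  , extraChildren :: [(CompoundStatement :+: SimpleStatement) a]
  }
  deriving (Eq, Ord, Show, Functor, Foldable, Traversable)

-- L1 selects the left alternative of the sum, R1 the right one.
exampleModule :: Module Int
exampleModule = Module 0 [R1 (SimpleStatement 1), L1 (CompoundStatement 2)]

-- Functor rewrites every annotation in the tree at once.
shifted :: Module Int
shifted = fmap (+ 10) exampleModule

-- Foldable collects every annotation in field order.
annotations :: Module a -> [a]
annotations = foldr (:) []

-- Pattern matching on L1/R1 recovers which alternative each child is.
countSimple :: Module a -> Int
countSimple m = length [() | R1 _ <- extraChildren m]
```

Because the annotation is the last type parameter, the derived Functor, Foldable and Traversable instances operate on annotations only, which is why the same tree can be annotated with ranges, spans, or both.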
Tests
As of right now, Hedgehog tests are minimal and only in place for the Python library.
To run tests:
cabal v2-test tree-sitter-python
Additional notes
- GenerateSyntax provides a way to pre-define certain datatypes for which Template Haskell is not used. Any datatypes among the node types which have already been defined in the module where the splice is run will be skipped, allowing customization of the representation of parts of the tree. While this gives us flexibility, we encourage using it sparingly, as it imposes an extra maintenance burden, particularly when the grammar changes. This may be used to, e.g., parse literals into Haskell equivalents (such as parsing the textual contents of integer literals into Integers), and may require defining TS.UnmarshalAnn or TS.SymbolMatching instances for (parts of) the custom datatypes, depending on where and how the datatype occurs in the generated tree, in addition to the usual Foldable, Functor, etc. instances provided for generated datatypes.
- Annotations are captured by a polymorphic parameter a.
- Unmarshal defines both generic and non-generic classes. This is because generic behaviors differ from what we get non-generically, and in the case of Maybe and [] we actually prefer doing things non-generically. Since [] is a sum, the generic behavior for :+: would be invoked and would expect repetitions to be represented in the parse tree as right-nested singly-linked lists (e.g., (a (b (c (d…))))) rather than as consecutive sibling nodes (e.g., (a b c ...d), which is what our trees have). We want to match the latter.
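The difference between the two tree shapes in the last note can be sketched with a toy tree type. Node, siblings and consChain below are hypothetical names invented for this illustration; they are not part of Unmarshal or of tree-sitter's cursor API, and the real code works over parse-tree cursors rather than an in-memory rose tree.

```haskell
-- Toy parse-tree node: a symbol name plus its children (illustrative only).
data Node = Node { symbol :: String, children :: [Node] }

-- Flat shape, as our trees have it: repetitions are consecutive siblings.
flatTree :: Node
flatTree = Node "list" [Node "a" [], Node "b" [], Node "c" []]

-- Right-nested shape that a purely generic treatment of [] would expect.
nestedTree :: Node
nestedTree = Node "a" [Node "b" [Node "c" []]]

-- Non-generic list handling: read every child of the parent in order.
siblings :: Node -> [String]
siblings = map symbol . children

-- Cons-style handling: follow a chain in which each node has exactly one child.
consChain :: Node -> [String]
consChain (Node s [next]) = s : consChain next
consChain (Node s _)      = [s]
```

Applying consChain to flatTree stops at the parent, because the parent has three children rather than a single nested tail, whereas siblings recovers all three elements. This is the mismatch that the non-generic instances for lists are meant to avoid.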