1
1
mirror of https://github.com/github/semantic.git synced 2024-12-18 04:11:48 +03:00
semantic/docs/adding-new-languages.md

34 lines
5.7 KiB
Markdown
Raw Normal View History

# Adding new languages to Semantic
2020-04-20 17:52:07 +03:00
This document outlines the process for adding a new language to Semantic. Though the Semantic authors have architected the library such that adding new languages and syntax [requires no changes to existing code](https://en.wikipedia.org/wiki/Expression_problem), adding support for a new language is a nontrivial amount of work. Those willing to take the plunge will probably need a degree of Haskell experience.
2020-04-20 18:32:09 +03:00
Note that we recently transitioned the system to auto-generate strongly-typed ASTs using [CodeGen](https://github.com/github/semantic/blob/master/docs/codegen.md), our new language support library. More information is provided below in the [FAQs](#FAQs).
2019-06-14 19:47:53 +03:00
## The procedure
2019-06-13 23:39:56 +03:00
1. **Find or write a [tree-sitter](https://tree-sitter.github.io) parser for your language.** The tree-sitter [organization page](https://github.com/tree-sitter) has a number of parsers beyond those we currently support in Semantic; look there first to make sure you're not duplicating work. The tree-sitter [documentation on creating parsers](http://tree-sitter.github.io/tree-sitter/creating-parsers) provides an exhaustive look at the process of developing and debugging tree-sitter parsers. Though we do not support grammars written with other toolkits such as [ANTLR](https://www.antlr.org), translating an ANTLR or other BNF-style grammar into a tree-sitter grammar is usually straightforward.
2020-04-18 00:48:32 +03:00
2. **Create a Haskell library providing an interface to that C source.** The [`haskell-tree-sitter`](https://github.com/tree-sitter/haskell-tree-sitter) repository provides a Cabal package for each supported language. You can find an example of a pull request to add such a package [here](https://github.com/tree-sitter/haskell-tree-sitter/pull/276/files), and a file providing:
- A bridged (via the FFI) reference to the toplevel parser in the generated file must be provided ([example](https://github.com/tree-sitter/haskell-tree-sitter/blob/master/tree-sitter-json/TreeSitter/JSON.hs#L11)).
- A way to retrieve [`tree-sitter` data](https://github.com/tree-sitter/haskell-tree-sitter/blob/master/tree-sitter-json/TreeSitter/JSON.hs#L13-L14) used to auto-generate syntax datatypes using the following steps. During parser generation, tree-sitter produces a `node-types.json` file that captures the structure of a language's grammar. The autogeneration described below in Step 4 derives datatypes based on this structural representation. The `node-types.json` is a data file in `haskell-tree-sitter` that gets installed with the package. The function `getNodeTypesPath :: IO FilePath` is defined to access in the contents of this file, using `getDataFileName :: FilePath -> IO FilePath`, which is defined in the autogenerated `Paths_` module.
2020-04-20 20:44:00 +03:00
3. **Create a Haskell library in Semantic to auto-generate precise ASTs.** Create a `semantic-[LANGUAGE]` package. This is an example of [`semantic-python`](https://github.com/github/semantic/tree/master/semantic-python)). Each package needs to provide the following API surfaces:
2020-04-20 18:35:45 +03:00
- `Language.[LANGUAGE].AST` - Derives Haskell datatypes from a language and its `node-types.json` file ([example](https://github.com/github/semantic/blob/master/semantic-python/src/Language/Python/AST.hs)).
2020-04-20 20:44:00 +03:00
- `Language.[LANGUAGE].Grammar` - Provides statically-known rules corresponding to symbols in the grammar for each syntax node, generated with the `mkStaticallyKnownRuleGrammarData` Template Haskell splice ([example](https://github.com/github/semantic/blob/master/semantic-python/src/Language/Python/Grammar.hs)).
2020-04-20 18:35:45 +03:00
- `Language.[LANGUAGE]` - Semantic functionality for programs in a language ([example](https://github.com/github/semantic/blob/master/semantic-python/src/Language/Python.hs)).
- `Language.[LANGUAGE].Tags` - Computes tags for code nav definitions and references found in source ([example](https://github.com/github/semantic/blob/master/semantic-python/src/Language/Python/Tags.hs)).
2020-04-20 20:44:00 +03:00
5. **Add tests for precise ASTs, tagging and graphing, and evaluating code written in that language.** Because tree-sitter grammars often change, we require extensive testing so as to avoid the unhappy situation of bitrotted languages that break as soon as a new grammar comes down the line. Here are examples of tests for [precise ASTs](https://github.com/github/semantic/blob/master/semantic-python/test/PreciseTest.hs), [tagging](https://github.com/github/semantic/blob/master/test/Tags/Spec.hs), and [graphing](https://github.com/github/semantic/blob/master/semantic-python/test-graphing/GraphTest.hs).
To summarize, each interaction made possible by the Semantic CLI corresponds to one (or more) of the above steps:
| Step | Interaction |
|------|-----------------|
| 1, 2 | `ts-parse` |
| 3, 4 | `parse`, `diff` |
2020-04-20 19:58:39 +03:00
| 5 | `graph` |
# FAQs
**This sounds hard.** You're right! It is currently a lot of work: just because the Semantic architecture is extensible in the expression-problem manner does not mean that adding new support is trivial.
**What recent changes have been made?** The Semantic authors have introduced a new architecture for language support and parsing, one that dispenses with the [assignment](https://github.com/github/semantic/blob/master/docs/assignment.md) step altogether. The `semantic-ast` package generates Haskell data types from tree-sitter grammars; these types are then translated into the [Semantic core language](https://github.com/github/semantic/blob/master/semantic-core/src/Data/Core.hs); all evaluators will then be written in terms of the Core language. As compared with the [historic process]() used to add new languages, these changes entire obviate the process of 1) assigning types into an open-union of syntax functors, and 2) implementing `Evaluatable` instances and adding value effects to describe the control flow of your language.