# Adding new languages to Semantic

This document outlines the process for adding a new language to Semantic. Though the Semantic authors have architected the library such that adding new languages and syntax [requires no changes to existing code](https://en.wikipedia.org/wiki/Expression_problem), adding support for a new language is a nontrivial amount of work. Those willing to take the plunge will probably need a degree of Haskell experience.

Note that we recently transitioned the system to auto-generate strongly-typed ASTs using [CodeGen](https://github.com/github/semantic/blob/master/docs/codegen.md), our new language support library. More information is provided below in the [FAQs](#FAQs).

## The procedure

1. **Find or write a [tree-sitter](https://tree-sitter.github.io) parser for your language.** The tree-sitter [organization page](https://github.com/tree-sitter) has a number of parsers beyond those we currently support in Semantic; look there first to make sure you're not duplicating work. The tree-sitter [documentation on creating parsers](http://tree-sitter.github.io/tree-sitter/creating-parsers) provides an exhaustive look at the process of developing and debugging tree-sitter parsers. Though we do not support grammars written with other toolkits such as [ANTLR](https://www.antlr.org), translating an ANTLR or other BNF-style grammar into a tree-sitter grammar is usually straightforward.
2. **Create a Haskell library providing an interface to that C source.** The [`haskell-tree-sitter`](https://github.com/tree-sitter/haskell-tree-sitter) repository provides a Cabal package for each supported language. You can find an example of a pull request that adds such a package [here](https://github.com/tree-sitter/haskell-tree-sitter/pull/276/files). Each package includes a file that provides:
    - A reference, bridged via the FFI, to the top-level parser in the generated C source ([example](https://github.com/tree-sitter/haskell-tree-sitter/blob/master/tree-sitter-json/TreeSitter/JSON.hs#L11)).
    - A way to retrieve the [`tree-sitter` data](https://github.com/tree-sitter/haskell-tree-sitter/blob/master/tree-sitter-json/TreeSitter/JSON.hs#L13-L14) used to auto-generate syntax datatypes. During parser generation, tree-sitter produces a `node-types.json` file that captures the structure of a language's grammar. The autogeneration described in Step 3 derives datatypes from this structural representation. `node-types.json` is a data file in `haskell-tree-sitter` that gets installed with the package. The function `getNodeTypesPath :: IO FilePath` returns the path to this file, using `getDataFileName :: FilePath -> IO FilePath`, which is defined in the autogenerated `Paths_` module.
3. **Create a Haskell library in Semantic to auto-generate precise ASTs.** Create a `semantic-[LANGUAGE]` package (see [`semantic-python`](https://github.com/github/semantic/tree/master/semantic-python) for an example). Each package needs to provide the following API surfaces:
    - `Language.[LANGUAGE].AST` - Derives Haskell datatypes from a language and its `node-types.json` file ([example](https://github.com/github/semantic/blob/master/semantic-python/src/Language/Python/AST.hs)).
    - `Language.[LANGUAGE].Grammar` - Provides statically-known rules corresponding to symbols in the grammar for each syntax node, generated with the `mkStaticallyKnownRuleGrammarData` Template Haskell splice ([example](https://github.com/github/semantic/blob/master/semantic-python/src/Language/Python/Grammar.hs)).
    - `Language.[LANGUAGE]` - Semantic functionality for programs in a language ([example](https://github.com/github/semantic/blob/master/semantic-python/src/Language/Python.hs)).
    - `Language.[LANGUAGE].Tags` - Computes tags for code nav definitions and references found in source ([example](https://github.com/github/semantic/blob/master/semantic-python/src/Language/Python/Tags.hs)).
5. **Add tests for precise ASTs, tagging and graphing, and evaluating code written in that language.** Because tree-sitter grammars often change, we require extensive testing so as to avoid the unhappy situation of bitrotted languages that break as soon as a new grammar comes down the line. Here are examples of tests for [precise ASTs](https://github.com/github/semantic/blob/master/semantic-python/test/PreciseTest.hs), [tagging](https://github.com/github/semantic/blob/master/test/Tags/Spec.hs), and [graphing](https://github.com/github/semantic/blob/master/semantic-python/test-graphing/GraphTest.hs). 
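
The `getNodeTypesPath` helper from step 2 can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the code from `haskell-tree-sitter`: the language name `mylang`, the install directory, and the `vendor/...` layout are all assumptions, and `getDataFileName` is defined locally as a stand-in for the function that Cabal normally generates in the `Paths_` module.

```haskell
-- Stand-in for the Cabal-generated Paths_ module. In a real package you would
-- import getDataFileName from Paths_<package> instead of defining it here.
-- The data directory below is a hypothetical install location.
getDataFileName :: FilePath -> IO FilePath
getDataFileName relative = pure (dataDir ++ "/" ++ relative)
  where
    dataDir = "/usr/local/share/tree-sitter-mylang"

-- Locate node-types.json relative to the package's data directory.
-- The relative path is an assumption about how the grammar is vendored.
getNodeTypesPath :: IO FilePath
getNodeTypesPath = getDataFileName "vendor/tree-sitter-mylang/src/node-types.json"

-- The FFI reference to the parser would sit next to this helper as a
-- `foreign import ccall` declaration. It is omitted here because it only
-- links when the generated C parser is part of the build.

main :: IO ()
main = getNodeTypesPath >>= putStrLn
```

Keeping the lookup behind `getDataFileName` means the path is resolved wherever the package happens to be installed, so code in Semantic that consumes `node-types.json` does not need to hard-code a location.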

To summarize, each interaction made possible by the Semantic CLI corresponds to one (or more) of the above steps:

| Step | Interaction     |
|------|-----------------|
| 1, 2 | `ts-parse`      |
| 3, 4 | `parse`, `diff` |
| 5    | `graph`         |


## FAQs

**This sounds hard.** You're right! It is currently a lot of work: just because the Semantic architecture is extensible in the expression-problem manner does not mean that adding new support is trivial.

**What recent changes have been made?** The Semantic authors have introduced a new architecture for language support and parsing, one that dispenses with the [assignment](https://github.com/github/semantic/blob/master/docs/assignment.md) step altogether. The `semantic-ast` package generates Haskell data types from tree-sitter grammars; these types are then translated into the [Semantic core language](https://github.com/github/semantic/blob/master/semantic-core/src/Data/Core.hs); all evaluators will then be written in terms of the Core language. Compared with the [historic process]() used to add new languages, these changes entirely obviate the process of 1) assigning types into an open-union of syntax functors, and 2) implementing `Evaluatable` instances and adding value effects to describe the control flow of your language.
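
To give a feel for what "Haskell data types generated from a grammar" means, here is a hand-written sketch of the general shape such types might take. It is not output from `semantic-ast`: the node names (`Identifier`, `Assignment`), the field names, and the `Span` annotation are hypothetical and chosen only for illustration.

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Hand-written illustration only. In Semantic these declarations would be
-- derived from node-types.json rather than typed out; all names are made up.

-- A leaf node: an annotation plus the source text it covers.
data Identifier a = Identifier
  { identAnn  :: a
  , identText :: String
  }
  deriving (Show, Functor)

-- A node with named children: one record field per field in the grammar.
data Assignment a = Assignment
  { assignAnn :: a
  , left      :: Identifier a
  , right     :: Identifier a
  }
  deriving (Show, Functor)

-- Assumed annotation type: a (start, end) byte range in the source.
type Span = (Int, Int)

-- The tree for the source text "x = y".
example :: Assignment Span
example = Assignment (0, 5) (Identifier (0, 1) "x") (Identifier (4, 5) "y")

main :: IO ()
main = do
  print (left example)
  -- Functor lets a later pass replace or drop every annotation uniformly.
  print (fmap (const ()) example)
```

Because every node carries its children in typed fields, a malformed tree (say, an assignment without a right-hand side) cannot be constructed, which is the property that makes a separate assignment step unnecessary.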