mirror of https://github.com/github/semantic.git synced 2024-11-23 16:37:50 +03:00
semantic/docs/adding-new-languages.md

Adding new languages to Semantic

This document outlines the process for adding a new language to Semantic. Though the Semantic authors have architected the library such that adding new languages and syntax requires no changes to existing code, adding support for a new language is a nontrivial amount of work. Those willing to take the plunge will probably need a degree of Haskell experience.

Note that we recently transitioned the system to auto-generate strongly-typed ASTs using CodeGen, our new language support library. More information is provided below in the FAQs.

The procedure

  1. Find or write a tree-sitter parser for your language. The tree-sitter organization page has a number of parsers beyond those we currently support in Semantic; look there first to make sure you're not duplicating work. The tree-sitter documentation on creating parsers provides an exhaustive look at the process of developing and debugging tree-sitter parsers. Though we do not support grammars written with other toolkits such as ANTLR, translating an ANTLR or other BNF-style grammar into a tree-sitter grammar is usually straightforward.
  2. Create a Haskell library providing an interface to that C source. The haskell-tree-sitter repository provides a Cabal package for each supported language. You can find an example of a pull request to add such a package here. Each package must provide:
    • A bridged (via the FFI) reference to the toplevel parser in the generated file (example).
    • A way to retrieve the tree-sitter data used to auto-generate syntax datatypes. During parser generation, tree-sitter produces a node-types.json file that captures the structure of a language's grammar. The autogeneration described below in Step 3 derives datatypes from this structural representation. The node-types.json file is a data file in haskell-tree-sitter that gets installed with the package. The function getNodeTypesPath :: IO FilePath is defined to locate this file; it uses getDataFileName :: FilePath -> IO FilePath, which is defined in the autogenerated Paths_ module.
  3. Create a Haskell library in Semantic to auto-generate precise ASTs. Create a semantic-[LANGUAGE] package (semantic-python is an example). Each package needs to provide the following API surfaces:
    • Language.[LANGUAGE].AST - Derives Haskell datatypes from a language and its node-types.json file (example).
    • Language.[LANGUAGE].Grammar - Provides statically-known rules corresponding to symbols in the grammar for each syntax node, generated with the mkStaticallyKnownRuleGrammarData Template Haskell splice (example).
    • Language.[LANGUAGE] - Semantic functionality for programs in a language (example).
    • Language.[LANGUAGE].Tags - Computes tags for code nav definitions and references found in source (example).
  4. Add tests for precise ASTs, tagging and graphing, and evaluating code written in that language. Because tree-sitter grammars often change, we require extensive testing so as to avoid the unhappy situation of bitrotted languages that break as soon as a new grammar comes down the line. Here are examples of tests for precise ASTs, tagging, and graphing.
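
The idea behind step 3, deriving datatypes from the structure recorded in node-types.json, can be pictured with a small hand-written sketch. It does not reproduce the actual generator: it models a single grammar entry with simplified records and renders a Haskell declaration as plain text. Every name in it (NodeType, Field, camel, renderDatatype) is hypothetical and chosen for illustration, and the `:+:` in the output is only illustrative notation for "one of several node kinds".

```haskell
import Data.Char (toUpper)
import Data.List (intercalate)

-- Hypothetical, simplified model of one entry in node-types.json.
-- The real file records more detail than is modelled here.
data NodeType = NodeType
  { nodeName   :: String   -- grammar symbol, e.g. "binary_operator"
  , nodeFields :: [Field]  -- named fields of the node
  } deriving (Eq, Show)

data Field = Field
  { fieldName  :: String   -- e.g. "left"
  , fieldTypes :: [String] -- node kinds allowed in this field
  , required   :: Bool     -- must the field be present?
  , multiple   :: Bool     -- may the field hold several nodes?
  } deriving (Eq, Show)

-- Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- Convert snake_case to CamelCase, e.g. "binary_operator" to "BinaryOperator".
camel :: String -> String
camel = concatMap capitalise . splitOn '_'
  where
    capitalise []       = []
    capitalise (c : cs) = toUpper c : cs

-- Choose a Haskell type for a field from its multiplicity flags.
fieldType :: Field -> String
fieldType f
  | multiple f = "[" ++ base ++ "]"
  | required f = base
  | otherwise  = "Maybe " ++ base
  where
    base = case map camel (fieldTypes f) of
      [n] -> n
      ns  -> "(" ++ intercalate " :+: " ns ++ ")"

-- Render a record declaration for one node type as text.
renderDatatype :: NodeType -> String
renderDatatype (NodeType name fs) = "data " ++ ty ++ " = " ++ ty ++ body
  where
    ty = camel name
    body
      | null fs   = ""
      | otherwise = " { " ++ intercalate ", " (map renderField fs) ++ " }"
    renderField f = fieldName f ++ " :: " ++ fieldType f

-- A made-up grammar entry used to exercise the sketch.
example :: NodeType
example = NodeType "binary_operator"
  [ Field "left"  ["expression"] True False
  , Field "right" ["expression"] True False
  ]
```

Loading this in GHCi and evaluating `putStrLn (renderDatatype example)` prints a record declaration for `BinaryOperator` with `left` and `right` fields of type `Expression`. Optional fields become `Maybe` and repeated fields become lists, which is one plausible way to reflect the flags a grammar entry carries.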

To summarize, each interaction made possible by the Semantic CLI corresponds to one (or more) of the above steps:

Step    Interaction
1, 2    ts-parse
3, 4    parse, diff
5       graph

FAQs

This sounds hard. You're right! It is currently a lot of work: just because the Semantic architecture is extensible in the expression-problem manner does not mean that adding new support is trivial.

What recent changes have been made? The Semantic authors have introduced a new architecture for language support and parsing, one that dispenses with the assignment step altogether. The semantic-ast package generates Haskell data types from tree-sitter grammars; these types are then translated into the Semantic Core language; all evaluators will then be written in terms of the Core language. Compared with the historic process for adding new languages, these changes entirely obviate the need for 1) assigning types into an open union of syntax functors, and 2) implementing Evaluatable instances and adding value effects to describe the control flow of your language.
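
The shape of that pipeline (language-specific types, a translation into a shared core, and an evaluator written once against the core) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not Semantic's actual Core language or API: the Surface, Core, and Value types and the toCore, eval, and run functions are all hypothetical stand-ins.

```haskell
-- Hypothetical stand-in for a generated, language-specific AST.
data Surface
  = SInt Integer
  | SName String
  | SAdd Surface Surface
  | SLet String Surface Surface  -- let x = e in body
  deriving (Eq, Show)

-- Hypothetical, tiny language-agnostic core calculus.
data Core
  = Lit Integer
  | Var String
  | Add Core Core
  | Lam String Core
  | App Core Core
  deriving (Eq, Show)

-- Translate surface syntax into core terms. Note that SLet has no
-- counterpart in Core: it is desugared into a lambda and an application.
toCore :: Surface -> Core
toCore (SInt n)     = Lit n
toCore (SName x)    = Var x
toCore (SAdd a b)   = Add (toCore a) (toCore b)
toCore (SLet x e b) = App (Lam x (toCore b)) (toCore e)

-- Runtime values: integers and closures over an environment.
data Value
  = IntV Integer
  | Closure String Core Env
  deriving (Eq, Show)

type Env = [(String, Value)]

-- One evaluator, written only against Core. Any language with a
-- translation into Core could reuse it unchanged.
eval :: Env -> Core -> Either String Value
eval _   (Lit n)   = Right (IntV n)
eval env (Var x)   = maybe (Left ("unbound variable: " ++ x)) Right (lookup x env)
eval env (Add a b) = do
  va <- eval env a
  vb <- eval env b
  case (va, vb) of
    (IntV m, IntV n) -> Right (IntV (m + n))
    _                -> Left "type error: expected integers"
eval env (Lam x body) = Right (Closure x body env)
eval env (App f a) = do
  vf <- eval env f
  va <- eval env a
  case vf of
    Closure x body cenv -> eval ((x, va) : cenv) body
    _                   -> Left "type error: expected a function"

-- Full pipeline: translate, then evaluate in an empty environment.
run :: Surface -> Either String Value
run = eval [] . toCore
```

In this sketch, supporting a second language would mean writing another surface type and another toCore, while eval stays untouched. That separation is the property the paragraph above attributes to the new architecture.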