Idris2/docs/source/cookbook/parsing.rst

Parsing
=======

Idris 2 comes with a Lexing and a Parsing library built into the ``contrib`` package.

For this cookbook, we will write a very simple parser for a lambda calculus parser
that will accept the following language:

.. code-block:: text

    let name = world in (\x.hello x) name

Once we write a lambda calculus parser, we will also see how we can take advantage of a
powerful built in expression parser in Idris 2 to write a small calculator that should be
smart enough to parse the following expression:

.. code-block:: text

    1 + 2 - 3 * 4 / 5

Lexer
-----

The main lexer module is under ``Text.Lexer``. This module contains ``toTokenMap`` which is a function
that converts a ``List (Lexer, k) -> TokenMap (Token k)`` where ``k`` is a token kind. This function
could be used for simple lexer to token mappings. The module also includes high level lexers for
specifying quantity and common programming primitives like ``alphas``, ``intLit``,
``lineComment`` and ``blockComment``.

The ``Text.Lexer`` module also reexports ``Text.Lexer.Core``, ``Text.Quantity`` and ``Text.Token``.

``Text.Lexer.Core`` provides the building blocks of the lexer, including a type called
``Recognise`` which is the underlying data type for the lexer. The other important function that this
module provides is a ``lex`` which takes in a lexer and returns the tokens.

``Text.Quantity`` provides a data type ``Quantity`` which can be used with certain lexers to specify
how many times something is expected to appear.

``Text.Token`` provides a data type ``Token`` that represents a parsed token, its kind and the text.
This module also provides an important interface called ``TokenKind`` which tells the lexer how to map
token kinds to Idris 2 types and how to convert each kind from a string to a value.

Parser
------

The main parser module is under ``Text.Parser``. This module contains different grammar parsers, the main
one being ``match`` which takes a ``TokenKind`` and returns the value as defined in the ``TokenKind``
interface. There are other grammar parsers as well, but for our example, we will only be using ``match``.

The ``Text.Parser`` module reexports ``Text.Parser.Core``, ``Text.Quantity`` and ``Text.Token``.

``Text.Parser.Core`` provides the building blocks of the parser, including a type called ``Grammar``
which is the underlying data type for the parser. The other important function that this module provides
is ``parse`` which takes in a ``Grammar`` and returns the parsed expression.

We covered ``Text.Quantity`` and ``Text.Token`` in the Lexer section so we're not going to
repeat what they do here.

Lambda Calculus Lexer & Parser
------------------------------

.. code-block:: idris
    :caption: LambdaCalculus.idr
    :linenos:

    import Data.List
    import Data.List1
    import Text.Lexer
    import Text.Parser

    %default total

    data Expr = App Expr Expr | Abs String Expr | Var String | Let String Expr Expr

    Show Expr where
      showPrec d (App e1 e2) = showParens (d == App) (showPrec (User 0) e1 ++ " " ++ showPrec App e2)
      showPrec d (Abs v e) = showParens (d > Open) ("\\" ++ v ++ "." ++ show e)
      showPrec d (Var v) = v
      showPrec d (Let v e1 e2) = showParens (d > Open) ("let " ++ v ++ " = " ++ show e1 ++ " in " ++ show e2)

    data LambdaTokenKind
      = LTLambda
      | LTIdentifier
      | LTDot
      | LTOParen
      | LTCParen
      | LTIgnore
      | LTLet
      | LTEqual
      | LTIn

    Eq LambdaTokenKind where
      (==) LTLambda LTLambda = True
      (==) LTDot LTDot = True
      (==) LTIdentifier LTIdentifier = True
      (==) LTOParen LTOParen = True
      (==) LTCParen LTCParen = True
      (==) LTLet LTLet = True
      (==) LTEqual LTEqual = True
      (==) LTIn LTIn = True
      (==) _ _ = False

    Show LambdaTokenKind where
      show LTLambda = "LTLambda"
      show LTDot = "LTDot"
      show LTIdentifier = "LTIdentifier"
      show LTOParen = "LTOParen"
      show LTCParen = "LTCParen"
      show LTIgnore = "LTIgnore"
      show LTLet = "LTLet"
      show LTEqual = "LTEqual"
      show LTIn = "LTIn"

    LambdaToken : Type
    LambdaToken = Token LambdaTokenKind

    Show LambdaToken where
      show (Tok kind text) = "Tok kind: " ++ show kind ++ " text: " ++ text

    TokenKind LambdaTokenKind where
      TokType LTIdentifier = String
      TokType _ = ()

      tokValue LTLambda _ = ()
      tokValue LTIdentifier s = s
      tokValue LTDot _ = ()
      tokValue LTOParen _ = ()
      tokValue LTCParen _ = ()
      tokValue LTIgnore _ = ()
      tokValue LTLet _ = ()
      tokValue LTEqual _ = ()
      tokValue LTIn _ = ()

    ignored : WithBounds LambdaToken -> Bool
    ignored (MkBounded (Tok LTIgnore _) _ _) = True
    ignored _ = False

    identifier : Lexer
    identifier = alpha <+> many alphaNum

    keywords : List (String, LambdaTokenKind)
    keywords = [
      ("let", LTLet),
      ("in", LTIn)
    ]

    lambdaTokenMap : TokenMap LambdaToken
    lambdaTokenMap = toTokenMap [(spaces, LTIgnore)] ++
      [(identifier, \s =>
          case lookup s keywords of
            (Just kind) => Tok kind s
            Nothing => Tok LTIdentifier s
        )
      ] ++ toTokenMap [
        (exact "\\", LTLambda),
        (exact ".", LTDot),
        (exact "(", LTOParen),
        (exact ")", LTCParen),
        (exact "=", LTEqual)
      ]

    lexLambda : String -> Maybe (List (WithBounds LambdaToken))
    lexLambda str =
      case lex lambdaTokenMap str of
        (tokens, _, _, "") => Just tokens
        _ => Nothing

    mutual
      expr : Grammar state LambdaToken True Expr
      expr = do
        t <- term
        app t <|> pure t

      term : Grammar state LambdaToken True Expr
      term = abs
        <|> var
        <|> paren
        <|> letE

      app : Expr -> Grammar state LambdaToken True Expr
      app e1 = do
        e2 <- term
        app1 $ App e1 e2

      app1 : Expr -> Grammar state LambdaToken False Expr
      app1 e = app e <|> pure e

      abs : Grammar state LambdaToken True Expr
      abs = do
        match LTLambda
        commit
        argument <- match LTIdentifier
        match LTDot
        e <- expr
        pure $ Abs argument e

      var : Grammar state LambdaToken True Expr
      var = map Var $ match LTIdentifier

      paren : Grammar state LambdaToken True Expr
      paren = do
        match LTOParen
        e <- expr
        match LTCParen
        pure e

      letE : Grammar state LambdaToken True Expr
      letE = do
        match LTLet
        commit
        argument <- match LTIdentifier
        match LTEqual
        e1 <- expr
        match LTIn
        e2 <- expr
        pure $ Let argument e1 e2

    parseLambda : List (WithBounds LambdaToken) -> Either String Expr
    parseLambda toks =
      case parse expr $ filter (not . ignored) toks of
        Right (l, []) => Right l
        Right e => Left "contains tokens that were not consumed"
        Left e => Left (show e)

    parse : String -> Either String Expr
    parse x =
      case lexLambda x of
        Just toks => parseLambda toks
        Nothing => Left "Failed to lex."

Testing out our parser gives us back the following output:

.. code-block:: text

    $ idris2 -p contrib LambdaCalculus.idr
    Main> :exec printLn $ parse "let name = world in (\\x.hello x) name"
    Right (let name = world in (\x.hello x) name)

Expression Parser
-----------------

Idris 2 also comes with a very convenient expression parser that is
aware of precedence and associativity in ``Text.Parser.Expression``.

The main function called ``buildExpressionParser`` takes in an ``OperatorTable`` and a
``Grammar`` that represents the terms, and returns a parsed expression. The magic comes from
the ``OperatorTable`` since this table defines all the operators, the grammars for those operators,
the precedence, and the associativity.

An ``OperatorTable`` is a list of lists containing the ``Op`` type. The ``Op`` type allows you to specify
``Prefix``, ``Postfix``, and ``Infix`` operators along with their grammars. ``Infix`` also contains the
associativity called ``Assoc`` which can specify left associativity or ``AssocLeft``, right
associativity assoc or ``AssocRight`` and as being non-associative or ``AssocNone``.

An example of an operator table we'll be using for the calculator is:

.. code-block:: idris

    [
      [ Infix (match CTMultiply >> pure (*)) AssocLeft
      , Infix (match CTDivide >> pure (/)) AssocLeft
      ],
      [ Infix (match CTPlus >> pure (+)) AssocLeft
      , Infix (match CTMinus >> pure (-)) AssocLeft
      ]
    ]

This table defines 4 operators for mulitiplication, division, addition and subtraction.
Mulitiplication and division show up in the first table because they have higher precedence than
addition and subtraction, which show up in the second table. We're also defining them as infix operators,
with a specific grammar and all being left associative via ``AssocLeft``.

Building a Calculator
---------------------

.. code-block:: idris
    :caption: Calculator.idr
    :linenos:

    import Data.List1
    import Text.Lexer
    import Text.Parser
    import Text.Parser.Expression

    %default total

    data CalculatorTokenKind
      = CTNum
      | CTPlus
      | CTMinus
      | CTMultiply
      | CTDivide
      | CTOParen
      | CTCParen
      | CTIgnore

    Eq CalculatorTokenKind where
      (==) CTNum CTNum = True
      (==) CTPlus CTPlus = True
      (==) CTMinus CTMinus = True
      (==) CTMultiply CTMultiply = True
      (==) CTDivide CTDivide = True
      (==) CTOParen CTOParen = True
      (==) CTCParen CTCParen = True
      (==) _ _ = False

    Show CalculatorTokenKind where
      show CTNum = "CTNum"
      show CTPlus = "CTPlus"
      show CTMinus = "CTMinus"
      show CTMultiply = "CTMultiply"
      show CTDivide = "CTDivide"
      show CTOParen = "CTOParen"
      show CTCParen = "CTCParen"
      show CTIgnore = "CTIgnore"

    CalculatorToken : Type
    CalculatorToken = Token CalculatorTokenKind

    Show CalculatorToken where
        show (Tok kind text) = "Tok kind: " ++ show kind ++ " text: " ++ text

    TokenKind CalculatorTokenKind where
      TokType CTNum = Double
      TokType _ = ()

      tokValue CTNum s = cast s
      tokValue CTPlus _ = ()
      tokValue CTMinus _ = ()
      tokValue CTMultiply _ = ()
      tokValue CTDivide _ = ()
      tokValue CTOParen _ = ()
      tokValue CTCParen _ = ()
      tokValue CTIgnore _ = ()

    ignored : WithBounds CalculatorToken -> Bool
    ignored (MkBounded (Tok CTIgnore _) _ _) = True
    ignored _ = False

    number : Lexer
    number = digits

    calculatorTokenMap : TokenMap CalculatorToken
    calculatorTokenMap = toTokenMap [
      (spaces, CTIgnore),
      (digits, CTNum),
      (exact "+", CTPlus),
      (exact "-", CTMinus),
      (exact "*", CTMultiply),
      (exact "/", CTDivide)
    ]

    lexCalculator : String -> Maybe (List (WithBounds CalculatorToken))
    lexCalculator str =
      case lex calculatorTokenMap str of
        (tokens, _, _, "") => Just tokens
        _ => Nothing

    mutual
      term : Grammar state CalculatorToken True Double
      term = do
        num <- match CTNum
        pure num

      expr : Grammar state CalculatorToken True Double
      expr = buildExpressionParser [
        [ Infix ((*) <$ match CTMultiply) AssocLeft
        , Infix ((/) <$ match CTDivide) AssocLeft
        ],
        [ Infix ((+) <$ match CTPlus) AssocLeft
        , Infix ((-) <$ match CTMinus) AssocLeft
        ]
      ] term

    parseCalculator : List (WithBounds CalculatorToken) -> Either String Double
    parseCalculator toks =
      case parse expr $ filter (not . ignored) toks of
        Right (l, []) => Right l
        Right e => Left "contains tokens that were not consumed"
        Left e => Left (show e)

    parse1 : String -> Either String Double
    parse1 x =
      case lexCalculator x of
        Just toks => parseCalculator toks
        Nothing => Left "Failed to lex."

Testing out our calculator gives us back the following output:

.. code-block:: text

    $ idris2 -p contrib Calculator.idr
    Main> :exec printLn $ parse1 "1 + 2 - 3 * 4 / 5"
    Right 0.6000000000000001