Idris2/docs/source/cookbook/parsing.rst
2022-05-19 11:50:48 +01:00

395 lines
12 KiB
ReStructuredText

Parsing
=======
Idris 2 comes with a Lexing and a Parsing library built into the ``contrib`` package.
For this cookbook, we will write a very simple parser for a lambda calculus parser
that will accept the following language:
.. code-block:: text
let name = world in (\x.hello x) name
Once we write a lambda calculus parser, we will also see how we can take advantage of a
powerful built in expression parser in Idris 2 to write a small calculator that should be
smart enough to parse the following expression:
.. code-block:: text
1 + 2 - 3 * 4 / 5
Lexer
-----
The main lexer module is under ``Text.Lexer``. This module contains ``toTokenMap`` which is a function
that converts a ``List (Lexer, k) -> TokenMap (Token k)`` where ``k`` is a token kind. This function
could be used for simple lexer to token mappings. The module also includes high level lexers for
specifying quantity and common programming primitives like ``alphas``, ``intLit``,
``lineComment`` and ``blockComment``.
The ``Text.Lexer`` module also reexports ``Text.Lexer.Core``, ``Text.Quantity`` and ``Text.Token``.
``Text.Lexer.Core`` provides the building blocks of the lexer, including a type called
``Recognise`` which is the underlying data type for the lexer. The other important function that this
module provides is a ``lex`` which takes in a lexer and returns the tokens.
``Text.Quantity`` provides a data type ``Quantity`` which can be used with certain lexers to specify
how many times something is expected to appear.
``Text.Token`` provides a data type ``Token`` that represents a parsed token, its kind and the text.
This module also provides an important interface called ``TokenKind`` which tells the lexer how to map
token kinds to Idris 2 types and how to convert each kind from a string to a value.
Parser
------
The main parser module is under ``Text.Parser``. This module contains different grammar parsers, the main
one being ``match`` which takes a ``TokenKind`` and returns the value as defined in the ``TokenKind``
interface. There are other grammar parsers as well, but for our example, we will only be using ``match``.
The ``Text.Parser`` module reexports ``Text.Parser.Core``, ``Text.Quantity`` and ``Text.Token``.
``Text.Parser.Core`` provides the building blocks of the parser, including a type called ``Grammar``
which is the underlying data type for the parser. The other important function that this module provides
is ``parse`` which takes in a ``Grammar`` and returns the parsed expression.
We covered ``Text.Quantity`` and ``Text.Token`` in the Lexer section so we're not going to
repeat what they do here.
Lambda Calculus Lexer & Parser
------------------------------
.. code-block:: idris
:caption: LambdaCalculus.idr
:linenos:
import Data.List
import Data.List1
import Text.Lexer
import Text.Parser
%default total
data Expr = App Expr Expr | Abs String Expr | Var String | Let String Expr Expr
Show Expr where
showPrec d (App e1 e2) = showParens (d == App) (showPrec (User 0) e1 ++ " " ++ showPrec App e2)
showPrec d (Abs v e) = showParens (d > Open) ("\\" ++ v ++ "." ++ show e)
showPrec d (Var v) = v
showPrec d (Let v e1 e2) = showParens (d > Open) ("let " ++ v ++ " = " ++ show e1 ++ " in " ++ show e2)
data LambdaTokenKind
= LTLambda
| LTIdentifier
| LTDot
| LTOParen
| LTCParen
| LTIgnore
| LTLet
| LTEqual
| LTIn
Eq LambdaTokenKind where
(==) LTLambda LTLambda = True
(==) LTDot LTDot = True
(==) LTIdentifier LTIdentifier = True
(==) LTOParen LTOParen = True
(==) LTCParen LTCParen = True
(==) LTLet LTLet = True
(==) LTEqual LTEqual = True
(==) LTIn LTIn = True
(==) _ _ = False
Show LambdaTokenKind where
show LTLambda = "LTLambda"
show LTDot = "LTDot"
show LTIdentifier = "LTIdentifier"
show LTOParen = "LTOParen"
show LTCParen = "LTCParen"
show LTIgnore = "LTIgnore"
show LTLet = "LTLet"
show LTEqual = "LTEqual"
show LTIn = "LTIn"
LambdaToken : Type
LambdaToken = Token LambdaTokenKind
Show LambdaToken where
show (Tok kind text) = "Tok kind: " ++ show kind ++ " text: " ++ text
TokenKind LambdaTokenKind where
TokType LTIdentifier = String
TokType _ = ()
tokValue LTLambda _ = ()
tokValue LTIdentifier s = s
tokValue LTDot _ = ()
tokValue LTOParen _ = ()
tokValue LTCParen _ = ()
tokValue LTIgnore _ = ()
tokValue LTLet _ = ()
tokValue LTEqual _ = ()
tokValue LTIn _ = ()
ignored : WithBounds LambdaToken -> Bool
ignored (MkBounded (Tok LTIgnore _) _ _) = True
ignored _ = False
identifier : Lexer
identifier = alpha <+> many alphaNum
keywords : List (String, LambdaTokenKind)
keywords = [
("let", LTLet),
("in", LTIn)
]
lambdaTokenMap : TokenMap LambdaToken
lambdaTokenMap = toTokenMap [(spaces, LTIgnore)] ++
[(identifier, \s =>
case lookup s keywords of
(Just kind) => Tok kind s
Nothing => Tok LTIdentifier s
)
] ++ toTokenMap [
(exact "\\", LTLambda),
(exact ".", LTDot),
(exact "(", LTOParen),
(exact ")", LTCParen),
(exact "=", LTEqual)
]
lexLambda : String -> Maybe (List (WithBounds LambdaToken))
lexLambda str =
case lex lambdaTokenMap str of
(tokens, _, _, "") => Just tokens
_ => Nothing
mutual
expr : Grammar state LambdaToken True Expr
expr = do
t <- term
app t <|> pure t
term : Grammar state LambdaToken True Expr
term = abs
<|> var
<|> paren
<|> letE
app : Expr -> Grammar state LambdaToken True Expr
app e1 = do
e2 <- term
app1 $ App e1 e2
app1 : Expr -> Grammar state LambdaToken False Expr
app1 e = app e <|> pure e
abs : Grammar state LambdaToken True Expr
abs = do
match LTLambda
commit
argument <- match LTIdentifier
match LTDot
e <- expr
pure $ Abs argument e
var : Grammar state LambdaToken True Expr
var = map Var $ match LTIdentifier
paren : Grammar state LambdaToken True Expr
paren = do
match LTOParen
e <- expr
match LTCParen
pure e
letE : Grammar state LambdaToken True Expr
letE = do
match LTLet
commit
argument <- match LTIdentifier
match LTEqual
e1 <- expr
match LTIn
e2 <- expr
pure $ Let argument e1 e2
parseLambda : List (WithBounds LambdaToken) -> Either String Expr
parseLambda toks =
case parse expr $ filter (not . ignored) toks of
Right (l, []) => Right l
Right e => Left "contains tokens that were not consumed"
Left e => Left (show e)
parse : String -> Either String Expr
parse x =
case lexLambda x of
Just toks => parseLambda toks
Nothing => Left "Failed to lex."
Testing out our parser gives us back the following output:
.. code-block:: text
$ idris2 -p contrib LambdaCalculus.idr
Main> :exec printLn $ parse "let name = world in (\\x.hello x) name"
Right (let name = world in (\x.hello x) name)
Expression Parser
-----------------
Idris 2 also comes with a very convenient expression parser that is
aware of precedence and associativity in ``Text.Parser.Expression``.
The main function called ``buildExpressionParser`` takes in an ``OperatorTable`` and a
``Grammar`` that represents the terms, and returns a parsed expression. The magic comes from
the ``OperatorTable`` since this table defines all the operators, the grammars for those operators,
the precedence, and the associativity.
An ``OperatorTable`` is a list of lists containing the ``Op`` type. The ``Op`` type allows you to specify
``Prefix``, ``Postfix``, and ``Infix`` operators along with their grammars. ``Infix`` also contains the
associativity called ``Assoc`` which can specify left associativity or ``AssocLeft``, right
associativity assoc or ``AssocRight`` and as being non-associative or ``AssocNone``.
An example of an operator table we'll be using for the calculator is:
.. code-block:: idris
[
[ Infix (match CTMultiply >> pure (*)) AssocLeft
, Infix (match CTDivide >> pure (/)) AssocLeft
],
[ Infix (match CTPlus >> pure (+)) AssocLeft
, Infix (match CTMinus >> pure (-)) AssocLeft
]
]
This table defines 4 operators for mulitiplication, division, addition and subtraction.
Mulitiplication and division show up in the first table because they have higher precedence than
addition and subtraction, which show up in the second table. We're also defining them as infix operators,
with a specific grammar and all being left associative via ``AssocLeft``.
Building a Calculator
---------------------
.. code-block:: idris
:caption: Calculator.idr
:linenos:
import Data.List1
import Text.Lexer
import Text.Parser
import Text.Parser.Expression
%default total
data CalculatorTokenKind
= CTNum
| CTPlus
| CTMinus
| CTMultiply
| CTDivide
| CTOParen
| CTCParen
| CTIgnore
Eq CalculatorTokenKind where
(==) CTNum CTNum = True
(==) CTPlus CTPlus = True
(==) CTMinus CTMinus = True
(==) CTMultiply CTMultiply = True
(==) CTDivide CTDivide = True
(==) CTOParen CTOParen = True
(==) CTCParen CTCParen = True
(==) _ _ = False
Show CalculatorTokenKind where
show CTNum = "CTNum"
show CTPlus = "CTPlus"
show CTMinus = "CTMinus"
show CTMultiply = "CTMultiply"
show CTDivide = "CTDivide"
show CTOParen = "CTOParen"
show CTCParen = "CTCParen"
show CTIgnore = "CTIgnore"
CalculatorToken : Type
CalculatorToken = Token CalculatorTokenKind
Show CalculatorToken where
show (Tok kind text) = "Tok kind: " ++ show kind ++ " text: " ++ text
TokenKind CalculatorTokenKind where
TokType CTNum = Double
TokType _ = ()
tokValue CTNum s = cast s
tokValue CTPlus _ = ()
tokValue CTMinus _ = ()
tokValue CTMultiply _ = ()
tokValue CTDivide _ = ()
tokValue CTOParen _ = ()
tokValue CTCParen _ = ()
tokValue CTIgnore _ = ()
ignored : WithBounds CalculatorToken -> Bool
ignored (MkBounded (Tok CTIgnore _) _ _) = True
ignored _ = False
number : Lexer
number = digits
calculatorTokenMap : TokenMap CalculatorToken
calculatorTokenMap = toTokenMap [
(spaces, CTIgnore),
(digits, CTNum),
(exact "+", CTPlus),
(exact "-", CTMinus),
(exact "*", CTMultiply),
(exact "/", CTDivide)
]
lexCalculator : String -> Maybe (List (WithBounds CalculatorToken))
lexCalculator str =
case lex calculatorTokenMap str of
(tokens, _, _, "") => Just tokens
_ => Nothing
mutual
term : Grammar state CalculatorToken True Double
term = do
num <- match CTNum
pure num
expr : Grammar state CalculatorToken True Double
expr = buildExpressionParser [
[ Infix ((*) <$ match CTMultiply) AssocLeft
, Infix ((/) <$ match CTDivide) AssocLeft
],
[ Infix ((+) <$ match CTPlus) AssocLeft
, Infix ((-) <$ match CTMinus) AssocLeft
]
] term
parseCalculator : List (WithBounds CalculatorToken) -> Either String Double
parseCalculator toks =
case parse expr $ filter (not . ignored) toks of
Right (l, []) => Right l
Right e => Left "contains tokens that were not consumed"
Left e => Left (show e)
parse1 : String -> Either String Double
parse1 x =
case lexCalculator x of
Just toks => parseCalculator toks
Nothing => Left "Failed to lex."
Testing out our calculator gives us back the following output:
.. code-block:: text
$ idris2 -p contrib Calculator.idr
Main> :exec printLn $ parse1 "1 + 2 - 3 * 4 / 5"
Right 0.6000000000000001