write-you-a-haskell/008_extended_parser.md

<div class="pagetitle">
![](img/titles/extended_parser.png)
</div>

<p class="halfbreak">
</p>

Extended Parser
===============

Up until now we've been using parser combinators to build our parsers. Parser
combinators build top-down parsers that formally belong to the $\mathtt{LL}(k)$
family of parsers. The parser proceeds top-down, with a sequence of $k$
characters used to dispatch on the leftmost production rule.  Combined with
backtracking (i.e. the ``try`` combinator) this is simultaneously both an
extremely powerful and simple model to implement as we saw before with our
simple 100 line parser library.

However there is a family of grammars that include left-recursion that
$\mathtt{LL}(k)$ can be inefficient and often incapable of parsing.
Left-recursive rules are such where the left-most symbol of the rule
recurses on itself. For example:

$$
\begin{aligned}
e ::=\ e\ \t{op}\ \t{atom}
\end{aligned}
$$

Now we demonstrated before that we could handle these cases using the
parser combinator ``chainl1``, and while this is possible sometimes it
can in many cases be an inefficient use of the parser stack and lead to ambiguous
cases.

The other major family of parsers, $\mathtt{LR}$, are not plagued with the same
concerns over left recursion. On the other hand $\mathtt{LR}$ parser are
exceedingly more complicated to implement, relying on a rather sophisticated
method known as Tomita's algorithm to do the heavy lifting. The tooling
around the construction of the *production rules* in a form that can be handled
by the algorithm is often handled by a DSL that generates the code for the parser.
While the tooling is fairly robust, there is a level of indirection between us
and the code that can often be a bit of brittle to extend with custom logic.

The most common form of this toolchain is the Lex/Yacc lexer and parser
generator which compile into efficient C parsers for $\mathtt{LR}$ grammars.
Haskell's **Happy** and **Alex** are roughly the Haskell equivalent of these
tools.

Toolchain
---------

Our parser and lexer logic will be spread across two different modules.

* Lexer.x
* Parser.y

The code in each of these modules is a hybrid of the specific Alex/Happy grammar
syntax and arbitrary Haskell logic that is spliced in. Code delineated by braces
(``{}``) is regular Haskell, while code outside is parsera  and lexer logic.

```haskell
-- **Begin Haskell Syntax**
{
{-# OPTIONS_GHC -w #-}

module Lexer (
  Token(..),
  scanTokens
) where

import Syntax
}
-- **End Haskell Syntax**

-- **Begin Alex Syntax**
%wrapper "basic"

$digit = 0-9
$alpha = [a-zA-Z]
$eol   = [\n]
-- **End Alex Syntax**
```

The files will be used during the code generation of the two modules ``Lexer``
and ``Parser``. The toolchain is accessible in several ways, first via the
command-line tools ``alex`` and ``happy`` which will generate the resulting
modules by passing the appropriate input file to the tool.

```haskell
$ alex Lexer.x    # Generates Lexer.hs
$ happy Parser.y  # Generates Parser.hs
```

Or inside of the cabal file using the ``build-tools`` command.

```haskell
  Build-depends:       base, array
  build-tools:         alex, happy
  other-modules:
    Parser,
    Lexer
```

So the resulting structure of our interpreter will have the following set of
files.

* **Lexer.hs**
* **Parser.hs**
* Eval.hs
* Main.hs
* Syntax.hs

Alex
----

Our lexer module will export our Token definition and a function for converting
an arbitrary String into a *token stream* or a list of Tokens.

```haskell
{
module Lexer (
  Token(..),
  scanTokens
) where

import Syntax
}
```

The tokens are simply an enumeration of the unique possible tokens in our
grammar.

```haskell
data Token
  = TokenLet
  | TokenTrue
  | TokenFalse
  | TokenIn
  | TokenLambda
  | TokenNum Int
  | TokenSym String
  | TokenArrow
  | TokenEq
  | TokenAdd
  | TokenSub
  | TokenMul
  | TokenLParen
  | TokenRParen
  | TokenEOF
  deriving (Eq,Show)

scanTokens :: String -> [Token]
scanTokens = alexScanTokens
```

The token definition is a list of function definitions mapping atomic characters
and alphabetical sequences to constructors for our ``Token`` datatype.


```haskell
%wrapper "basic"

$digit = 0-9
$alpha = [a-zA-Z]
$eol   = [\n]

tokens :-

  -- Whitespace insensitive
  $eol                          ;
  $white+                       ;

  -- Comments
  "#".*                         ;

  -- Syntax
  let                           { \s -> TokenLet }
  True                          { \s -> TokenTrue }
  False                         { \s -> TokenFalse }
  in                            { \s -> TokenIn }
  $digit+                       { \s -> TokenNum (read s) }
  "->"                          { \s -> TokenArrow }
  \=                            { \s -> TokenEq }
  \\                            { \s -> TokenLambda }
  [\+]                          { \s -> TokenAdd }
  [\-]                          { \s -> TokenSub }
  [\*]                          { \s -> TokenMul }
  \(                            { \s -> TokenLParen }
  \)                            { \s -> TokenRParen }
  $alpha [$alpha $digit \_ \']* { \s -> TokenSym s }
```

Happy
-----

Using Happy and our previosly defind lexer we'll write down the production rules
for our simple untyped lambda calculus.

We start by defining a ``Syntax`` module where we define the AST we'll generate
from running over the token stream to produce the program graph structure.

```haskell
module Syntax where

type Name = String

data Expr
  = Lam Name Expr
  | App Expr Expr
  | Var Name
  | Lit Lit
  | Op Binop Expr Expr
  deriving (Eq,Show)

data Lit
  = LInt Int
  | LBool Bool
  deriving (Show, Eq, Ord)

data Binop = Add | Sub | Mul | Eql
  deriving (Eq, Ord, Show)
```

The token constructors are each assigned to a name that will be used in our
production rules.

```haskell
-- Lexer structure
%tokentype { Token }

-- Token Names
%token
    let   { TokenLet }
    true  { TokenTrue }
    false { TokenFalse }
    in    { TokenIn }
    NUM   { TokenNum $$ }
    VAR   { TokenSym $$ }
    '\\'  { TokenLambda }
    '->'  { TokenArrow }
    '='   { TokenEq }
    '+'   { TokenAdd }
    '-'   { TokenSub }
    '*'   { TokenMul }
    '('   { TokenLParen }
    ')'   { TokenRParen }
```

The parser itself will live inside of a custom monad of our choosing. In this
case we'll add error handling with the ``Except`` monad that will break out of
the parsing process if an invalid production or token is found and return a
``Left`` value which we'll handle inside of our toplevel logic.

```haskell
-- Parser monad
%monad { Except String } { (>>=) } { return }
%error { parseError }
```

And finally our production rules, the toplevel entry point for our parser will
be the ``expr`` rule. The left hand side of the production is a Happy production
rule which can be mutually recursive, while the right hand side is a Haskell
expression with several metavariable indicated by the dollar sign variables that
map to the nth expression on the left hand side.

```
$1  $2  $3  $4   $5 $6
let VAR '=' Expr in Expr    { App (Lam $2 $6) $4 }
```

```haskell
-- Entry point
%name expr

-- Operators
%left '+' '-'
%left '*'
%%

Expr : let VAR '=' Expr in Expr    { App (Lam $2 $6) $4 }
     | '\\' VAR '->' Expr          { Lam $2 $4 }
     | Form                        { $1 }

Form : Form '+' Form               { Op Add $1 $3 }
     | Form '-' Form               { Op Sub $1 $3 }
     | Form '*' Form               { Op Mul $1 $3 }
     | Fact                        { $1 }

Fact : Fact Atom                   { App $1 $2 }
     | Atom                        { $1 }

Atom : '(' Expr ')'                { $2 }
     | NUM                         { Lit (LInt $1) }
     | VAR                         { Var $1 }
     | true                        { Lit (LBool True) }
     | false                       { Lit (LBool False) }
```

Notice how naturally we can write a left recursive grammar for our binary infix
operators.

Syntax Errors
-------------

Parsec's default error reporting leaves a bit to be desired, but does in fact
contain most of the information needed to deliver better messages packed inside
the ParseError structure.

```haskell
showSyntaxError :: L.Text -> ParseError -> String
showSyntaxError s err = L.unpack $ L.unlines [
      "  ",
      "  " <> lineContents,
      "  " <> ((L.replicate col " ") <> "^"),
      (L.pack $ show err)
    ]
  where
    lineContents = (L.lines s) !! line
    pos  = errorPos err
    line = sourceLine pos - 1
    col  = fromIntegral $ sourceColumn pos - 1
```

Now when we enter an invalid expression the error reporting will point us
directly to the adjacent lexeme that caused the problem as is common in many
languages.

```bash
λ> \x -> x +

  \x -> x +
            ^
"<interactive>" (line 1, column 11):
unexpected end of input
expecting "(", character, literal string, "[", integer, "if" or identifier
```

Type Error Provenance
---------------------

Before our type inference engine would generate somewhat typical type inference
error messages. If two terms couldn't be unified it simply told us this and some
information about the top-level declaration where it occurred, leaving us with a
bit of a riddle about how exactly this error came to be.

```haskell
Cannot unify types:
        Int
with
        Bool
in the definition of 'foo'
```

Effective error reporting in the presence of type inference is a difficult task,
effectively our typechecker takes our frontend AST and transforms it into a
large constraint problem, destroying position information in the process. Even
if the position information were tracked, the nature of unification is that
a cascade of several unifications can lead to unsolvability and the immediate
two syntactic constructs that gave rise to a unification failure are not
necessarily the two that map back to human intuition about how the type error
arose. Very little research has done on this topic and it remains an open
topic with very immediate and applicable results to programming.

To do simple provenance tracking we will use a technique of tracking the "flow"
of type information through our typechecker and associate position information
with the inferred types.

```haskell
type Name = String

data Expr
  = Var Loc Name
  | App Loc Expr Expr
  | Lam Loc Name Expr
  | Lit Loc Int

data Loc = NoLoc | Located Int
  deriving (Show, Eq, Ord)
```

So now inside of our parser we simply attach Parsec information on to each AST
node. For example for the variable term.

```haskell
variable :: Parser Expr
variable = do
  x <- identifier
  l <- sourceLine <$> getPosition
  return (Var (Located l) x)
```

Our type system will also include information, although by default it will use
the ``NoLoc`` value until explicit information is provided during inference. The
two functions ``getLoc`` and ``setLoc`` will be used to update and query the
position information from type terms.

```haskell
data Type
  = TVar Loc TVar
  | TCon Loc Name
  | TArr Loc Type Type
  deriving (Show, Eq, Ord)

newtype TVar = TV String
  deriving (Show, Eq, Ord)

typeInt :: Type
typeInt = TCon NoLoc "Int"

setLoc :: Loc -> Type -> Type
setLoc l (TVar _ a)   = TVar l a
setLoc l (TCon _ a)   = TCon l a
setLoc l (TArr _ a b) = TArr l a b

getLoc :: Type -> Loc
getLoc (TVar l _) = l
getLoc (TCon l _) = l
getLoc (TArr l _ _) = l
```

Our fresh variable supply now also takes a location field which is attached to
the resulting type variable.

```haskell
fresh :: Loc -> Check Type
fresh l = do
  s <- get
  put s{count = count s + 1}
  return $ TVar l (TV (letters !! count s))
```

```haskell
infer :: Expr -> Check Type
infer expr = case expr of
  Var l n -> do
    t <- lookupVar n
    return $ setLoc l t

  App l a b -> do
    ta <- infer a
    tb <- infer b
    tr <- fresh l
    unify ta (TArr l tb tr)
    return tr

  Lam l n a -> do
    tv <- fresh l
    ty <- inEnv (n, tv) (infer a)
    return (TArr l (setLoc l ty) tv)

  Lit l _ -> return (setLoc l typeInt)
```

Now specifically at the call site of our unification solver, if we encounter a
unification fail we simply pluck the location information off the two type terms
and plug it into the type error fields.

```haskell
unifies t1 t2 | t1 == t2 = return emptyUnifer
unifies (TVar _ v) t = v `bind` t
unifies t (TVar _ v) = v `bind` t
unifies (TArr _ t1 t2) (TArr _ t3 t4) = unifyMany [t1, t2] [t3, t4]
unifies (TCon _ a) (TCon _ b) | a == b = return emptyUnifer
unifies t1 t2 = throwError $ UnificationFail t1 (getLoc t1) t2 (getLoc t2)

bind ::  TVar -> Type -> Solve Unifier
bind a t
  | eqLoc t a        = return (emptySubst, [])
  | occursCheck a t  = throwError $ InfiniteType a (getLoc t) t
  | otherwise        = return $ (Subst $ Map.singleton a t, [])
```

So now we can explicitly trace the provenance of the specific constraints that
gave rise to a given type error all the way back to the source that generated
them.

```haskell
Cannot unify types:
        Int
        Introduced at line 27 column 5

          f 2 3

with
        Int -> c
        Introduced at line 5 column 9

          let f x y = x y
```

This is of course the simplest implementation of the tracking method and could
be further extended by giving a weighted ordering to the constraints based on
their likelihood of importance and proximity and then choosing which location to
report based on this information. This remains an open area of work.

Indentation
-----------

Haskell's syntax uses indentation blocks to delineated sections of code.  This
use of indentation sensitive layout to convey the structure of logic is
sometimes called the *offside rule* in parsing literature. At the beginning of a
"laidout" block the first declaration or definition can start in any column, and
the parser marks that indentation level. Every subsequent declaration at the
same logical level must have the same indentation.


```haskell
-- Start of layout ( Column: 0 )
fib :: Int -> Int
fib x = truncate $ ( 1 / sqrt 5 ) * ( phi ^ x - psi ^ x ) -- (Column: > 0)
  -- Start of new layout ( Column: 2 )
  where
      -- Indented block ( Column: > 2 )
      phi = ( 1 + sqrt 5 ) / 2
      psi = ( 1 - sqrt 5 ) / 2
```

The Parsec monad is parameterized over a type which stands
for the State layer baked into the monad allowing us to embed custom parser
state inside of our rules. To adopt our parser to handle sensitive whitespace we
will use:

```haskell
-- Indentation sensitive Parsec monad.
type IParsec a = Parsec Text ParseState a

data ParseState = ParseState
  { indents :: Column
  } deriving (Show)

initParseState :: ParseState
initParseState = ParseState 0
```

The parser stores the internal position state (SourcePos) during its
traversal, and makes it accessible inside of rule logic via the ``getPosition``
function.

```haskell
data SourcePos = SourcePos SourceName !Line !Column
getPosition :: Monad m => ParsecT s u m SourcePos
```

In terms of this function we can write down a set of logic that will allow us to
query the current column count and then either succeed or fail to match on a
pattern based on the current indentation level. The ``laidout`` combinator will
capture the current indentation state and push it into the ``indents`` field in
the State monad.

```haskell
laidout :: Parsec s ParseState a -> Parsec s ParseState a
laidout m = do
  cur <- indents <$> getState
  pos <- sourceColumn <$> getPosition
  modifyState $ \st -> st { indents = pos }
  res <- m
  modifyState $ \st -> st { indents = cur }
  return res
```

And then have specific logic which guard the parser match based on comparing the
current indentation level to the stored indentation level.

```haskell
indentCmp
  :: (Column -> Column -> Bool)
  -> Parsec s ParseState ()
indentCmp cmp = do
  col <- sourceColumn <$> getPosition
  current <- indents <$> getState
  guard (col `cmp` current)
```

We can then write two combinators in terms of this function which match on
either further or identical indentation.

```haskell
indented :: IParsec ()
indented = indentCmp (>) <?> "Block (indented)"

align :: IParsec ()
align = indentCmp (==) <?> "Block (same indentation)"
```

On top of these we write our two combinators for handling block syntax, which
match a sequence of vertically aligned patterns as a list.

```haskell
block, block1 :: Parser a -> Parser [a]
block  p = laidout (many (align >> p))
block1 p = laidout (many1 (align >> p))
```

Haskell uses an optional layout rule for several constructs, allowing us to
equivalently manually delimit indentation sensitive syntax with braces. The most
common use is for do-notation. So for example:

```haskell
example = do { a <- m; b }

example = do
  a <- m
  b
```

To support this in Parsec style we implement a ``maybeBraces`` function.

```haskell
maybeBraces :: Parser a -> Parser [a]
maybeBraces p = braces (endBy p semi) <|> block p

maybeBraces1 :: Parser a -> Parser [a]
maybeBraces1 p = braces (endBy1 p semi) <|> block p
```

Extensible Operators
--------------------

Haskell famously allows the definition of custom infix operators, an extremely
useful language feature although this poses a bit of a challenge to parse! There
are two ways to do this and both depend on two properties of the operators.

* Precedence
* Associativity

1. The first, the way that GHC does it, is to parse all operators as left associative
and of the same precedence, and then before desugaring go back and "fix" the
parse tree given all the information we collected after finishing parsing.

2. The second method is a bit of a hack, and involves simply storing the collected
operators inside of the Parsec state monad and then simply calling
``buildExpressionParser`` on the current state each time we want to parse an
infix operator expression.

To do the later method we set up the AST objects for our fixity definitions, which
associate precedence and associativity annotations with a custom symbol.

```haskell
data FixitySpec = FixitySpec
  { fixityFix :: Fixity
  , fixityName :: String
  } deriving (Eq, Show)

data Assoc
  = L
  | R
  | N
  deriving (Eq,Ord,Show)

data Fixity
  = Infix Assoc Int
  | Prefix Int
  | Postfix Int
  deriving (Eq,Ord,Show)
```

Our parser state monad will hold a list of the active fixity specifications and
whenever a definition is encountered we will append to this list.

```haskell
data ParseState = ParseState
  { indents :: Column
  , fixities :: [FixitySpec]
  } deriving (Show)

initParseState :: ParseState
initParseState = ParseState 0 defaultOps

addOperator ::  FixitySpec -> Parsec s ParseState ()
addOperator fixdecl = do
  modifyState $ \st -> st { fixities = fixdecl : (fixities st) }
```

The initial state will consist of the default arithmetic and list operators
defined with the same specification as the Haskell specification.

```haskell
defaultOps :: [FixitySpec]
defaultOps = [
    FixitySpec (Infix L 4) ">"
  , FixitySpec (Infix L 4) "<"
  , FixitySpec (Infix L 4) "/="
  , FixitySpec (Infix L 4) "=="

  , FixitySpec (Infix R 5) ":"

  , FixitySpec (Infix L 6) "+"
  , FixitySpec (Infix L 6) "-"

  , FixitySpec (Infix L 5) "*"
  , FixitySpec (Infix L 5) "/"
  ]
```

Now in our parser we need to be able to transform the fixity specifications into
Parsec operator definitions. This is a pretty straightforward sort and group
operation on the list.

```haskell
fixityPrec :: FixitySpec -> Int
fixityPrec (FixitySpec (Infix _ n) _) = n
fixityPrec (FixitySpec _ _) = 0

toParser (FixitySpec ass tok) = case ass of
    Infix L _ -> infixOp tok (op (Name tok)) Ex.AssocLeft
    Infix R _ -> infixOp tok (op (Name tok)) Ex.AssocRight
    Infix N _ -> infixOp tok (op (Name tok)) Ex.AssocNone

mkTable ops =
  map (map toParser) $
    groupBy ((==) `on` fixityPrec) $
      reverse $ sortBy (compare `on` fixityPrec) $ ops
```

Now when parsing an infix operator declaration we simply do a state operation
and add the operator to the parser state so that all subsequent definitions
can use it.
This differs from Haskell slightly in that operators must be defined before
their usage in a module.

```haskell
fixityspec :: Parser FixitySpec
fixityspec = do
  fix  <- fixity
  prec <- precedence
  op   <- operator
  semi
  let spec = FixitySpec (fix prec) op
  addOperator spec
  return spec
  where
    fixity = Infix L <$ reserved "infixl"
         <|> Infix R <$ reserved "infixr"
         <|> Infix N <$ reserved "infix"

precedence :: Parser Int
precedence = do
  n <- natural
  if n <= 10
  then return (fromInteger n)
  else empty
  <?> "Invalid operator precedence"

fixitydecl :: Parser Decl
fixitydecl = do
  spec <- fixityspec
  return $ FixityDecl spec
 <?> "operator fixity definition"
```

And now when we need to parse an infix expression term we simply pull our state
out and build the custom operator table, and feed this to the build Expression
Parser just as before.

```haskell
term :: Parser Expr -> Parser Expr
term a = do
  st <- getState
  let customOps = mkTable (fixities st)
  Ex.buildExpressionParser customOps a
```

Full Source
-----------

* [Happy Parser](https://github.com/sdiehl/write-you-a-haskell/tree/master/chapter9/happy)
* [Imperative Language (Happy)](https://github.com/sdiehl/write-you-a-haskell/tree/master/chapter9/assign)
* [Layout Combinators](https://github.com/sdiehl/write-you-a-haskell/tree/master/chapter9/layout)
* [Type Provenance Tracking](https://github.com/sdiehl/write-you-a-haskell/tree/master/chapter9/provenance)

Resources
---------

The tooling and documentation for Alex and Happy is well-developed as it is used
extensively inside of GHC:

* [Alex User Guide](https://www.haskell.org/alex/doc)
* [Happy User Guide](https://www.haskell.org/happy/doc/html/)
* [A Tool for Generalized LR Parsing In Haskell](http://www.benmedlock.co.uk/Functional_GLR_Parsing.pdf)
* [Haskell Syntax Definition](https://www.haskell.org/onlinereport/haskell2010/haskellch10.html#x17-17500010)

GHC itself uses Alex and Happy for its parser infastructure. The resulting
parser is rather sophisicated.

* [Lexer.x](https://github.com/ghc/ghc/blob/master/compiler/parser/Lexer.x)
* [Parser.y](https://github.com/ghc/ghc/blob/master/compiler/parser/Parser.y)

One of the few papers ever written in Type Error reporting gives some techniques
for presentation and tracing provenance:

* [Top Quality Type Error Messages](http://www.staff.science.uu.nl/~swier101/Papers/Theses/TopQuality.pdf)

\clearpage