changes to the Extended Parser chapter

Christian Sievers 2015-04-10 12:49:03 +02:00
parent 0b3825e7ee
commit 43f74ae7ff


@ -9,15 +9,16 @@ Extended Parser
===============
Up until now we've been using parser combinators to build our parsers. Parser
combinators build top-down parsers that formally belong to the $\mathtt{LL}(k)$
family of parsers. The parser proceeds top-down, with a sequence of $k$
characters used to dispatch on the leftmost production rule. Combined with
backtracking (i.e. the ``try`` combinator) this is simultaneously an extremely
powerful and simple model to implement, as we saw before with our simple 100
line parser library.

However there is a family of grammars, those that include left-recursion, that
$\mathtt{LL}(k)$ parsers handle inefficiently and often cannot parse at all.
Left-recursive rules are those where the left-most symbol of the rule recurses
on itself. For example:
$$
@ -26,17 +27,17 @@ e ::=\ e\ \t{op}\ \t{atom}
\end{aligned}
$$
Now we demonstrated before that we could handle these cases using the parser
combinator ``chainl1``, and while this is possible, it can in many cases be an
inefficient use of the parser stack and lead to ambiguous cases.
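As a reminder, ``chainl1`` parses a sequence ``atom (op atom)*`` and folds the
results left-associatively, sidestepping the left recursion in the grammar
itself. A minimal self-contained sketch (the ``Expr`` type here is a stand-in,
not the chapter's AST):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Expr = Lit Int | Add Expr Expr
  deriving Show

-- atom `chainl1` addop parses one or more atoms separated by '+'
-- and folds them into a left-nested tree of Add nodes.
expr :: Parser Expr
expr = atom `chainl1` addop
  where
    addop = Add <$ char '+'
    atom  = Lit . read <$> many1 digit
```

So ``parseTest expr "1+2+3"`` prints ``Add (Add (Lit 1) (Lit 2)) (Lit 3)``, the
left-associated parse we want.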
The other major family of parsers, $\mathtt{LR}$, is not plagued with the same
concerns over left recursion. On the other hand, $\mathtt{LR}$ parsers are
exceedingly more complicated to implement, relying on a rather sophisticated
method known as Tomita's algorithm to do the heavy lifting. The tooling around
the construction of the *production rules* in a form that can be handled by the
algorithm is often delegated to a DSL that generates the code for the parser.
While the tooling is fairly robust, there is a level of indirection between us
and the code that can often be a bit brittle to extend with custom logic.
@ -83,7 +84,7 @@ $eol = [\n]
The files will be used during the code generation of the two modules ``Lexer``
and ``Parser``. The toolchain is accessible in several ways, first via the
command-line tools ``alex`` and ``happy``, which will generate the resulting
modules when passed the appropriate input file.
```haskell
@ -153,7 +154,7 @@ scanTokens :: String -> [Token]
scanTokens = alexScanTokens
```
The token definition is a list of function definitions mapping atomic characters
and alphabetical sequences to constructors for our ``Token`` datatype.
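For orientation, the target ``Token`` datatype might look something like the
following sketch (the constructor names are assumptions, not the chapter's
exact definitions):

```haskell
data Token
  = TokenLet
  | TokenIn
  | TokenLambda
  | TokenNum Int
  | TokenSym String
  | TokenTrue
  | TokenFalse
  | TokenEq
  | TokenLParen
  | TokenRParen
  deriving (Eq, Show)
```

Each Alex rule then pairs a regular expression with a function producing one of
these constructors, e.g. mapping a digit sequence to ``TokenNum``.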
@ -252,7 +253,7 @@ simple case we'll just add error handling with the ``Except`` monad.
```
And finally our production rules; the toplevel entry point for our parser will
be the ``expr`` rule. Notice how naturally we can write a left-recursive grammar
for our infix operators.
```haskell
@ -280,7 +281,7 @@ Atom : '(' Expr ')' { $2 }
     | NUM { Lit (LInt $1) }
     | VAR { Var $1 }
     | true { Lit (LBool True) }
     | false { Lit (LBool False) }
```
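For comparison, a directly left-recursive production in the same Happy style
might read as follows (a sketch; ``App`` is assumed to be the AST's application
constructor):

```haskell
-- Left recursion: Expr is the leftmost symbol of its own rule. An LALR
-- generator like Happy handles this directly, with constant stack usage.
Expr : Expr Atom { App $1 $2 }
     | Atom      { $1 }
```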
Syntax Errors
@ -324,8 +325,8 @@ Type Error Provenance
Before, our type inference engine would generate somewhat typical type inference
error messages. If two terms couldn't be unified it simply told us this and some
information about the toplevel declaration where it occurred, leaving us with a
bit of a riddle about how exactly this error came to be.
```haskell
Cannot unify types:
@ -337,19 +338,18 @@ in the definition of 'foo'
Effective error reporting in the presence of type inference is a difficult task:
our typechecker takes our frontend AST and transforms it into a large constraint
problem, destroying position information in the process. Even if the position
information were tracked, the nature of unification is that a cascade of several
unifications can lead to unsolvability, and the two syntactic constructs that
immediately gave rise to a unification failure are not necessarily the two that
map back to human intuition about how the type error arose. Very little research
has been done on this topic and it remains an open problem with very immediate
and applicable results to programming.
To do simple provenance tracking we will use a technique of tracking the "flow"
of type information through our typechecker and associating position information
with the inferred types.
```haskell
type Name = String
@ -376,7 +376,7 @@ variable = do
```

Our type system will also include position information, although by default it
will use the ``NoLoc`` value until explicit information is provided during
inference. The two functions ``getLoc`` and ``setLoc`` will be used to query and
update the position information on type terms.
@ -405,7 +405,7 @@ getLoc (TArr l _ _) = l
```
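Concretely, the shape of these definitions is roughly the following sketch (the
``TVar`` and ``TCon`` cases are assumptions extrapolated from the ``TArr``
clause shown above):

```haskell
data Loc = NoLoc | Located Int
  deriving (Eq, Ord, Show)

-- Every type constructor carries its location as its first field, so
-- querying and updating are simple pattern matches.
data Type
  = TVar Loc Name
  | TCon Loc Name
  | TArr Loc Type Type
  deriving (Show)

getLoc :: Type -> Loc
getLoc (TVar l _)   = l
getLoc (TCon l _)   = l
getLoc (TArr l _ _) = l

setLoc :: Loc -> Type -> Type
setLoc l (TVar _ a)   = TVar l a
setLoc l (TCon _ a)   = TCon l a
setLoc l (TArr _ a b) = TArr l a b
```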
Our fresh variable supply now also takes a location field which is attached to
the resulting type variable.
```haskell
fresh :: Loc -> Check Type
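-- One plausible body (a sketch: the concrete Check state, its count field,
-- and the letters name supply are assumptions, not the chapter's code):
fresh l = do
  s <- get
  put s { count = count s + 1 }
  return (TVar l (letters !! count s))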
@ -474,8 +474,8 @@ with
let f x y = x y
```
This is of course the simplest implementation of this tracking method and could
be further extended by giving a weighted ordering to the constraints based on
their likely importance and proximity, and then choosing which location to
report based on this information. This remains an open area of work.
@ -484,10 +484,10 @@ Indentation
Haskell's syntax uses indentation blocks to delineate sections of code. This use
of indentation-sensitive layout to convey the structure of logic is sometimes
called the *offside rule* in parsing literature. At the beginning of a "laidout"
block the first declaration or definition can start in any column, and the
parser marks that indentation level. Every subsequent declaration at the same
logical level must have the same indentation.
```haskell
@ -501,10 +501,10 @@ fib x = truncate $ ( 1 / sqrt 5 ) * ( phi ^ x - psi ^ x ) -- (Column: > 0)
psi = ( 1 - sqrt 5 ) / 2
```
The Parsec monad is parameterized over a type which stands for the State layer
baked into the monad, allowing us to embed custom parser state inside of our
rules. To adapt our parser to handle sensitive whitespace we will use:
```haskell
-- Indentation sensitive Parsec monad.
@ -518,8 +518,8 @@ initParseState :: ParseState
initParseState = ParseState 0
```
The parser stores the internal position state (SourcePos) during its traversal,
and makes it accessible inside of rule logic via the ``getPosition`` function.
```haskell
@ -558,7 +558,7 @@ indentCmp cmp = do
```
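The ``indentCmp`` helper can be written roughly as follows (a sketch; the
``indent`` state accessor is an assumption about the ``ParseState`` field name,
and ``guard`` comes from ``Control.Monad``):

```haskell
-- Succeed only when the column of the current position stands in the
-- supplied relation to the stored indentation level.
indentCmp :: (Column -> Column -> Bool) -> IParsec ()
indentCmp cmp = do
  col <- sourceColumn <$> getPosition
  current <- indent <$> getState
  guard (col `cmp` current)
```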
We can then write two combinators in terms of this function which match on
either further or identical indentation.
```haskell
indented :: IParsec ()
@ -577,9 +577,9 @@ block p = laidout (many (align >> p))
block1 p = laidout (many1 (align >> p))
```
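The bodies of the two combinators are one-liners over ``indentCmp``; roughly (a
sketch, with assumed error labels):

```haskell
indented :: IParsec ()
indented = indentCmp (>) <?> "Block (indented)"

align :: IParsec ()
align = indentCmp (==) <?> "Block (same indentation)"
```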
Haskell uses an optional layout rule for several constructs, allowing us to
equivalently manually delimit indentation-sensitive syntax with braces. The most
common use is for do-notation. So for example:
```haskell
example = do { a <- m; b }
@ -589,7 +589,7 @@ example = do
  b
```
To support this in Parsec style we implement a ``maybeBraces`` function.
```haskell
maybeBraces :: Parser a -> Parser [a]
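-- A plausible body (a sketch; braces and semi are assumed to come from the
-- chapter's token parser): use explicit braces with semicolon-separated
-- entries when present, otherwise fall back to the layout rule.
maybeBraces p = braces (endBy p semi) <|> block p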
@ -604,21 +604,21 @@ Extensible Operators
Haskell famously allows the definition of custom infix operators, an extremely
useful language feature, although this poses a bit of a challenge to parse!
There are two ways to do this, and both depend on two properties of the
operators:
* Precedence
* Associativity
1. The first, the way that GHC does it, is to parse all operators as left
   associative and of the same precedence, and then before desugaring go back
   and "fix" the parse tree given all the information we collected after
   finishing parsing.

2. The second method is a bit of a hack, and involves simply storing the
   collected operators inside of the Parsec state monad and then calling
   ``buildExpressionParser`` on the current state each time we want to parse an
   infix operator expression.
To do the latter method we set up the AST objects for our fixity definitions,
which associate precedence and associativity annotations with a custom symbol.
```haskell
@ -640,8 +640,8 @@ data Fixity
  deriving (Eq,Ord,Show)
```
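Filled out, those definitions might look roughly like this (a sketch; the field
and constructor names are assumptions consistent with the fragments shown):

```haskell
data FixitySpec = FixitySpec
  { fixityFix  :: Fixity
  , fixityName :: String
  } deriving (Eq, Show)

data Assoc
  = L   -- left associative
  | R   -- right associative
  | N   -- non-associative
  deriving (Eq,Ord,Show)

data Fixity
  = Infix Assoc Int   -- associativity and precedence level
  | Prefix Int
  | Postfix Int
  deriving (Eq,Ord,Show)
```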
Our parser state monad will hold a list of the active fixity specifications and
whenever a definition is encountered we will append to this list.
```haskell
data ParseState = ParseState
@ -678,7 +678,7 @@ defaultOps = [
]
```
Now in our parser we need to be able to transform the fixity specifications into
Parsec operator definitions. This is a pretty straightforward sort and group
operation on the list.
@ -699,7 +699,8 @@ mkTable ops =
```
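The body of ``mkTable`` is essentially that sort-then-group on precedence. A
sketch under assumed helper names (``fixityPrec`` extracting the precedence
from a ``FixitySpec``, ``toParser`` building a single Parsec ``Operator``):

```haskell
import Data.Function (on)
import Data.List (groupBy, sortBy)

-- Sort the operator specs by descending precedence, group specs of equal
-- precedence into one row, and turn each spec into a Parsec Operator;
-- buildExpressionParser expects rows from highest to lowest precedence.
mkTable ops =
  map (map toParser) $
    groupBy ((==) `on` fixityPrec) $
      sortBy (flip compare `on` fixityPrec) ops
```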
Now when parsing an infix operator declaration we simply do a state operation
and add the operator to the parser state so that all subsequent definitions can
use it.
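That state operation is a one-liner with Parsec's ``modifyState``; a sketch
(the ``fixityEnv`` field name is an assumption about ``ParseState``):

```haskell
-- Prepend a newly declared operator to the fixity environment carried in
-- the Parsec user state.
addOperator :: FixitySpec -> Parsec String ParseState ()
addOperator fixdecl =
  modifyState (\st -> st { fixityEnv = fixdecl : fixityEnv st })
```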
This differs from Haskell slightly in that operators must be defined before
their usage in a module.
@ -764,7 +765,7 @@ extensively inside of GHC:
* [A Tool for Generalized LR Parsing In Haskell](http://www.benmedlock.co.uk/Functional_GLR_Parsing.pdf)
* [Haskell Syntax Definition](https://www.haskell.org/onlinereport/haskell2010/haskellch10.html#x17-17500010)
GHC itself uses Alex and Happy for its parser infrastructure. The resulting
parser is rather sophisticated.
* [Lexer.x](https://github.com/ghc/ghc/blob/master/compiler/parser/Lexer.x)