changes to the Extended Parser chapter

Christian Sievers 2015-04-10 12:49:03 +02:00
parent 0b3825e7ee
commit 43f74ae7ff


Extended Parser
===============
Up until now we've been using parser combinators to build our parsers. Parser
combinators build top-down parsers that formally belong to the $\mathtt{LL}(k)$
family of parsers. The parser proceeds top-down, with a sequence of $k$
characters used to dispatch on the leftmost production rule.
Combined with backtracking (i.e. the ``try`` combinator) this is both an
extremely powerful and simple model to implement, as we saw before with our
simple 100 line parser library.
However there is a family of grammars, those that include left-recursion, that
$\mathtt{LL}(k)$ parsers can parse only inefficiently, and often cannot parse at
all. Left-recursive rules are those in which the left-most symbol of the rule
recurses on itself. For example:
$$
\begin{aligned}
e ::=\ e\ \t{op}\ \t{atom}
\end{aligned}
$$
We demonstrated a way before to handle these cases using the parser combinator
``chainl1``, and while this is possible it can often be an inefficient use of
the parser stack and lead to ambiguous cases.
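As a reminder, that approach looks roughly like the following sketch (the names
``Expr``, ``term`` and ``addop`` here are illustrative, not the chapter's own
definitions):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Expr = Lit Int | Add Expr Expr | Sub Expr Expr
  deriving Show

-- chainl1 parses "term (op term)*" and folds the result left-associatively,
-- sidestepping the left-recursive rule entirely.
expr :: Parser Expr
expr = term `chainl1` addop

term :: Parser Expr
term = Lit . read <$> many1 digit

addop :: Parser (Expr -> Expr -> Expr)
addop = (Add <$ char '+') <|> (Sub <$ char '-')
```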
The other major family of parsers, $\mathtt{LR}$, is not plagued with the same
concerns over left recursion. On the other hand $\mathtt{LR}$ parsers are
exceedingly more complicated to implement, relying on a rather sophisticated
method known as Tomita's algorithm to do the heavy lifting. The construction
of the *production rules* in a form that can be handled by the algorithm is
often done through a DSL that generates the code for the parser.
While the tooling is fairly robust, there is a level of indirection between us
and the code that can often be a bit brittle to extend with custom logic.
The files will be used during the code generation of the two modules ``Lexer``
and ``Parser``. The toolchain is accessible in several ways, first via the
command-line tools ``alex`` and ``happy`` which will generate the resulting
modules by passing the appropriate input file to the tool.
```haskell
scanTokens :: String -> [Token]
scanTokens = alexScanTokens
```
The token definition is a list of function definitions mapping atomic characters
and alphabetical sequences to constructors for our ``Token`` datatype.
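To give a flavour of the ``Lexer.x`` rules, a small sketch (the constructor
names ``TokenLet``, ``TokenNum`` and ``TokenSym`` are assumptions about the
elided ``Token`` type):

```haskell
tokens :-
  $eol                          ;
  $white+                       ;
  "#".*                         ;
  let                           { \s -> TokenLet }
  $digit+                       { \s -> TokenNum (read s) }
  $alpha [$alpha $digit \_ \']* { \s -> TokenSym s }
```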
In this simple case we'll just add error handling with the ``Except`` monad.
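For reference, such a handler might look roughly like this (a sketch; it assumes
the grammar file declares ``%monad { Except String }`` and ``%error { parseError }``):

```haskell
-- Called by the generated parser when no production matches.
-- (uses Control.Monad.Except)
parseError :: [Token] -> Except String a
parseError (l:_) = throwError ("Syntax error at token: " ++ show l)
parseError []    = throwError "Unexpected end of input"
```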
And finally our production rules, the toplevel entry point for our parser will
be the ``expr`` rule. Notice how naturally we can write a left recursive grammar
for our infix operators.
```haskell
Atom : '(' Expr ')' { $2 }
| NUM { Lit (LInt $1) }
| VAR { Var $1 }
| true { Lit (LBool True) }
| false { Lit (LBool False) }
```
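For instance, the additive operators can be expressed with the recursion on the
left, which Happy handles directly (the rule and constructor names ``Form``,
``Fact``, ``Op``, ``Add`` and ``Sub`` are assumptions about the elided parts of
the grammar):

```haskell
Form : Form '+' Fact { Op Add $1 $3 }
     | Form '-' Fact { Op Sub $1 $3 }
     | Fact          { $1 }
```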

Syntax Errors
-------------

Type Error Provenance
---------------------
Before, our type inference engine would generate somewhat typical type inference
error messages. If two terms couldn't be unified it simply told us this and some
information about the toplevel declaration where it occurred, leaving us with a
bit of a riddle about how exactly this error came to be.
```haskell
Cannot unify types:
        ...
in the definition of 'foo'
```
Effective error reporting in the presence of type inference is a difficult task,
effectively our typechecker takes our frontend AST and transforms it into a
large constraint problem, destroying position information in the process. Even
if the position information were tracked, the nature of unification is that a
cascade of several unifications can lead to unsolvability, and the immediate two
syntactic constructs that gave rise to a unification failure are not necessarily
the two that map back to human intuition about how the type error arose. Very
little research has been done on this topic and it remains an open topic with
very immediate and applicable results to programming.
To do simple provenance tracking we will use a technique of tracking the "flow"
of type information through our typechecker and associate position information
with the inferred types.
```haskell
type Name = String
```
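The surrounding definitions are elided in this excerpt; a sketch of how the
frontend AST and parser might carry source positions (the ``Located``
constructor, the ``Loc``-carrying ``Expr`` shape, and ``identifier`` are
assumptions):

```haskell
data Loc = NoLoc | Located Int
  deriving (Show, Eq, Ord)

data Expr
  = Var Loc Name
  | App Loc Expr Expr
  | Lam Loc Name Expr
  | Lit Loc Int
  deriving (Show, Eq, Ord)

-- Attach the current source line to every parsed variable.
variable :: Parser Expr
variable = do
  x <- identifier
  l <- sourceLine <$> getPosition
  return (Var (Located l) x)
```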
Our type system will also include position information, although by default it
will use the ``NoLoc`` value until explicit information is provided during
inference. The
two functions ``getLoc`` and ``setLoc`` will be used to update and query the
position information from type terms.
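The ``Type`` representation and these accessors are elided in this excerpt; a
sketch of what they might look like, reusing the ``Loc`` and ``Name`` types
above (the ``TVar``/``TCon``/``TArr`` constructor shapes are assumptions):

```haskell
data Type
  = TVar Loc Name
  | TCon Loc Name
  | TArr Loc Type Type
  deriving (Show, Eq, Ord)

-- Query the position attached to the outermost type constructor.
getLoc :: Type -> Loc
getLoc (TVar l _)   = l
getLoc (TCon l _)   = l
getLoc (TArr l _ _) = l

-- Replace the position on the outermost type constructor.
setLoc :: Loc -> Type -> Type
setLoc l (TVar _ a)   = TVar l a
setLoc l (TCon _ a)   = TCon l a
setLoc l (TArr _ a b) = TArr l a b
```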
Our fresh variable supply now also takes a location field which is attached to
the resulting type variable.
```haskell
fresh :: Loc -> Check Type
```
This is of course the simplest implementation of the tracking method and
could be further extended by giving a weighted ordering to the constraints
based on their likelihood of importance and proximity and then choosing which
location to report based on this information. This remains an open area of work.

Indentation
-----------
Haskell's syntax uses indentation blocks to delineate sections of code. This
use of indentation sensitive layout to convey the structure of logic is
sometimes called the *offside rule* in parsing literature. At the beginning of a
"laidout" block the first declaration or definition can start in any column, and
the parser marks that indentation level. Every subsequent declaration at the
same logical level must have the same indentation.
```haskell
fib x = truncate $ ( 1 / sqrt 5 ) * ( phi ^ x - psi ^ x ) -- (Column: > 0)
  where
    phi = ( 1 + sqrt 5 ) / 2
    psi = ( 1 - sqrt 5 ) / 2
```
The Parsec monad is parameterized over a type which stands for the State layer
baked into the monad, allowing us to embed custom parser state inside of our
rules. To adapt our parser to handle sensitive whitespace we will use:
```haskell
-- Indentation sensitive Parsec monad.
type IParsec a = Parsec Text ParseState a

data ParseState = ParseState
  { indents :: Column
  } deriving (Show)

initParseState :: ParseState
initParseState = ParseState 0
```
The parser stores the internal position state (``SourcePos``) during its
traversal, and makes it accessible inside of rule logic via the ``getPosition``
function.
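For reference, the relevant Parsec accessors have the following types:

```haskell
getPosition  :: Monad m => ParsecT s u m SourcePos
sourceLine   :: SourcePos -> Line    -- Line and Column are Int synonyms
sourceColumn :: SourcePos -> Column
```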
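The chapter's indentation-comparison combinator is elided in this excerpt; a
sketch of how it might be written against the ``ParseState`` above:

```haskell
import Control.Monad (guard)

-- Succeed only when the current column stands in the given relation
-- to the indentation level recorded in the parser state.
indentCmp :: (Column -> Column -> Bool) -> IParsec ()
indentCmp cmp = do
  col <- sourceColumn <$> getPosition
  cur <- indents <$> getState
  guard (col `cmp` cur)
```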
We can then write two combinators in terms of this function which match on
either further or identical indentation.
```haskell
indented :: IParsec ()
indented = indentCmp (>)

align :: IParsec ()
align = indentCmp (==)

block p = laidout (many (align >> p))
block1 p = laidout (many1 (align >> p))
```
Haskell uses an optional layout rule for several constructs, allowing us to
equivalently manually delimit indentation sensitive syntax with braces. The most
common use is for do-notation. So for example:
```haskell
example = do { a <- m; b }

example = do
  a <- m
  b
```
To support this in Parsec style we implement a ``maybeBraces`` function.
```haskell
maybeBraces :: Parser a -> Parser [a]
maybeBraces p = braces (p `sepEndBy` semi) <|> many p  -- assuming lexer helpers braces/semi
```

Extensible Operators
--------------------

Haskell famously allows the definition of custom infix operators, an extremely
useful language feature, although this poses a bit of a challenge to parse! There
are two ways to do this and both depend on two properties of the operators:
* Precedence
* Associativity
1. The first, the way that GHC does it, is to parse all operators as left associative
and of the same precedence, and then before desugaring go back and "fix" the
parse tree given all the information we collected after finishing parsing.
2. The second method is a bit of a hack, and involves storing the collected
operators inside of the Parsec state monad and then calling
``buildExpressionParser`` on the current state each time we want to parse an
infix operator expression.
To do the latter method we set up the AST objects for our fixity definitions, which
associate precedence and associativity annotations with a custom symbol.
```haskell
data Fixity
  = Infix Assoc Int
  | Prefix Int
  | Postfix Int
  deriving (Eq,Ord,Show)
```
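The companion definitions are elided in this excerpt; a sketch of the
associativity tag and the record tying a fixity to its symbol (the constructor
and field names here are assumptions):

```haskell
-- Associativity of an infix operator.
data Assoc
  = AssocL
  | AssocR
  | AssocN
  deriving (Eq,Ord,Show)

-- A fixity declaration: how a given operator symbol binds.
data FixitySpec = FixitySpec
  { fixityFix  :: Fixity
  , fixityName :: String
  } deriving (Eq, Show)
```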
Our parser state monad will hold a list of the active fixity specifications and
whenever a definition is encountered we will append to this list.
```haskell
data ParseState = ParseState
  { indents  :: Column         -- indentation column for layout
  , fixities :: [FixitySpec]   -- operators currently in scope
  } deriving (Show)
```
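The accompanying ``defaultOps`` list of built-in operators is elided in this
excerpt; under the definitions sketched above it would look something like the
following (the precedences simply mirror Haskell's defaults and are an
assumption):

```haskell
defaultOps :: [FixitySpec]
defaultOps =
  [ FixitySpec (Infix AssocL 6) "+"
  , FixitySpec (Infix AssocL 6) "-"
  , FixitySpec (Infix AssocL 7) "*"
  , FixitySpec (Infix AssocN 4) "=="
  ]
```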
Now in our parser we need to be able to transform the fixity specifications into
Parsec operator definitions. This is a pretty straightforward sort and group
operation on the list.
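The ``mkTable`` function itself is elided in this excerpt; a sketch of the
sort-and-group step, assuming a helper ``toParser`` that converts a single
``FixitySpec`` into a ``Text.Parsec.Expr`` operator:

```haskell
import Data.Function (on)
import Data.List (groupBy, sortBy)

-- Extract the precedence of a fixity specification.
fixityPrec :: FixitySpec -> Int
fixityPrec (FixitySpec (Infix _ n) _) = n
fixityPrec (FixitySpec (Prefix n) _)  = n
fixityPrec (FixitySpec (Postfix n) _) = n

-- Sort the in-scope operators by precedence (highest first) and group equal
-- precedences together, yielding the table buildExpressionParser expects.
mkTable ops =
  map (map toParser) $
    groupBy ((==) `on` fixityPrec) $
      reverse (sortBy (compare `on` fixityPrec) ops)
```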
Now when parsing an infix operator declaration we simply do a state operation
and add the operator to the parser state so that all subsequent definitions
can use it.
This differs from Haskell slightly in that operators must be defined before
their usage in a module.
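A sketch of that state operation (the helper name ``addOperator`` is an
assumption; it simply prepends the new declaration to the ``fixities`` field of
the state shown earlier):

```haskell
-- Make a newly declared operator visible to all later rules.
addOperator :: FixitySpec -> IParsec ()
addOperator fixdecl =
  modifyState $ \st -> st { fixities = fixdecl : fixities st }
```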
Alex and Happy are used extensively inside of GHC:
* [A Tool for Generalized LR Parsing In Haskell](http://www.benmedlock.co.uk/Functional_GLR_Parsing.pdf)
* [Haskell Syntax Definition](https://www.haskell.org/onlinereport/haskell2010/haskellch10.html#x17-17500010)
GHC itself uses Alex and Happy for its parser infrastructure. The resulting
parser is rather sophisticated.
* [Lexer.x](https://github.com/ghc/ghc/blob/master/compiler/parser/Lexer.x)