<div class="pagetitle">
![](img/titles/parsing.png)
</div>
<!--
> The tools we use have a profound (and devious!) influence on our thinking habits, and, therefore, on our thinking abilities.
> <cite>— Edsger Dijkstra</cite>
-->
<p class="halfbreak">
</p>

Parsing
=======

Parser Combinators
------------------
For parsing in Haskell it is quite common to use a family of libraries known as
*parser combinators*, which let us compose higher order functions to generate
parsers. Parser combinators are a particularly expressive pattern that allows us
to quickly prototype language grammars in a small embedded domain-specific
language inside of Haskell itself. Most notably, we can embed custom Haskell
logic inside of the parser.

NanoParsec
----------
So now let's build our own toy parser combinator library which we'll call
**NanoParsec** just to get the feel of how these things are built.
~~~~ {.haskell slice="chapter3/parsec.hs" lower=0 upper=7}
~~~~
Structurally a parser is a function which takes an input stream of characters
and yields a parse tree by applying the parser logic over sections of the
character stream (called *lexemes*) to build up a composite data structure for
the AST.
~~~~ {.haskell slice="chapter3/parsec.hs" lower=8 upper=8}
~~~~
Running the function traverses the stream of characters and either yields a
value of type ``a``, which usually represents the AST for the parsed
expression, fails with a parse error on malformed input, or fails because it
did not consume the entire stream of input. A more robust implementation would
track the position information of failures for error reporting.
~~~~ {.haskell slice="chapter3/parsec.hs" lower=10 upper=16}
~~~~
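As a rough, self-contained sketch of the two pieces described so far (not necessarily the exact contents of `chapter3/parsec.hs`), the parser type and its runner might look like the following. The names `runParser`, `anyChar` and `alwaysFail` are assumptions made for this example, and the sketch reports errors as `Left` values, which is a choice of the sketch rather than something the text prescribes.

```haskell
-- A parser maps an input string to a list of (result, remaining input)
-- pairs; the empty list signals failure.
newtype Parser a = Parser { parse :: String -> [(a, String)] }

-- Run a parser and insist on a single result that consumes all input.
-- Assumption of this sketch: errors are returned as Left values.
runParser :: Parser a -> String -> Either String a
runParser p s =
  case parse p s of
    [(res, [])] -> Right res
    [(_, rest)] -> Left ("Parser did not consume entire stream: " ++ show rest)
    _           -> Left "Parser error."

-- Two hand-written parsers, for illustration only.
anyChar :: Parser Char
anyChar = Parser $ \s ->
  case s of
    []       -> []
    (c : cs) -> [(c, cs)]

alwaysFail :: Parser Char
alwaysFail = Parser (const [])
```

With these definitions, `runParser anyChar "a"` gives `Right 'a'`, while `runParser anyChar "ab"` is rejected because the trailing `"b"` is left unconsumed.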
Recall that in Haskell the String type is defined to be a list of
``Char`` values, so the following are equivalent forms of the same data.
```haskell
"1+2*3"
['1', '+', '2', '*', '3']
```
We advance the parser by extracting a single character from the parser stream
and returning it in a tuple together with the rest of the stream. The parser
logic will then scrutinize the character and either transform it into some
portion of the output or advance the stream and proceed.
~~~~ {.haskell slice="chapter3/parsec.hs" lower=17 upper=22}
~~~~
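A minimal sketch of such a single-character parser, assuming the list-of-results `Parser` type described above, could read as follows. The name `item` is borrowed from common usage in parser combinator tutorials and is an assumption here.

```haskell
newtype Parser a = Parser { parse :: String -> [(a, String)] }

-- Consume one character and return it with the rest of the stream,
-- or fail (empty list) when the input is exhausted.
item :: Parser Char
item = Parser $ \s ->
  case s of
    []       -> []
    (c : cs) -> [(c, cs)]
```

For instance, `parse item "1+2"` evaluates to `[('1', "+2")]`, and `parse item ""` evaluates to `[]`.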
A bind operation for our parser type takes one parse operation and composes it
with a second parse function that depends on the result of the first. Since a
parser yields a list of tuples, bind maps the second parser function over that
list and concatenates the resulting nested list of lists into a single flat
list, in the usual list monad fashion. The unit operation injects a single
pure value as the result, without reading from the parse stream.
~~~~ {.haskell slice="chapter3/parsec.hs" lower=23 upper=28}
~~~~
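The bind and unit operations described above can be sketched as plain functions before any typeclass instances are involved. This is an illustrative version under the same list-of-results assumption; `twoItems` is a hypothetical helper added only to show the sequencing.

```haskell
newtype Parser a = Parser { parse :: String -> [(a, String)] }

-- Run p, feed each result to f, run the parser f returns on the leftover
-- input, and flatten the nested list of results.
bind :: Parser a -> (a -> Parser b) -> Parser b
bind p f = Parser $ \s -> concatMap (\(a, s') -> parse (f a) s') (parse p s)

-- Succeed with a value without touching the input.
unit :: a -> Parser a
unit a = Parser (\s -> [(a, s)])

item :: Parser Char
item = Parser $ \s ->
  case s of
    []       -> []
    (c : cs) -> [(c, cs)]

-- Read two characters and return them as a pair.
twoItems :: Parser (Char, Char)
twoItems = item `bind` \a -> item `bind` \b -> unit (a, b)
```

Here `parse twoItems "abc"` yields `[(('a','b'), "c")]`, and the whole chain fails with `[]` on the one-character input `"a"` because the second `item` has nothing to read.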
As the terminology might have indicated this is indeed a Monad (also Functor and
Applicative).
~~~~ {.haskell slice="chapter3/parsec.hs" lower=29 upper=39}
~~~~
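One plausible way to spell out these three instances is sketched below. It is written from the description above rather than copied from the chapter source, and `pair` is a hypothetical example showing that `do`-notation becomes available once the `Monad` instance exists.

```haskell
newtype Parser a = Parser { parse :: String -> [(a, String)] }

instance Functor Parser where
  -- Apply f to every result, leaving the remaining input untouched.
  fmap f (Parser p) = Parser $ \s -> [(f a, rest) | (a, rest) <- p s]

instance Applicative Parser where
  pure a = Parser $ \s -> [(a, s)]
  -- Parse a function, then an argument from the leftover input.
  Parser pf <*> Parser pa =
    Parser $ \s -> [(f a, s2) | (f, s1) <- pf s, (a, s2) <- pa s1]

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> concat [parse (f a) rest | (a, rest) <- p s]

item :: Parser Char
item = Parser $ \s ->
  case s of
    []       -> []
    (c : cs) -> [(c, cs)]

-- Sequencing two parsers with do-notation.
pair :: Parser (Char, Char)
pair = do
  a <- item
  b <- item
  return (a, b)
```

The applicative form `(,) <$> item <*> item` behaves the same as `pair`; both return `[(('x','y'), "z")]` on the input `"xyz"`.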
Of particular importance is that this monad has a zero value (``failure``),
namely the parser which halts reading the stream and returns an empty list of
results. Together with a secondary operation (``combine``), which applies two
parser functions over the same stream and concatenates their results, this
forms a monoidal structure. These give rise to both the Alternative and
MonadPlus class instances, which encode the logic for trying multiple parse
functions over the same stream and handling failure and fallback.

The core operator introduced here is the (``<|>``) operator for combining two
optional paths of parser logic, switching to the second path if the first fails
with the zero value.
~~~~ {.haskell slice="chapter3/parsec.hs" lower=40 upper=59}
~~~~
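The sketch below shows one way the zero value, the two combining operations and the resulting instances could fit together. The helper name `option` and the small `char` parser are assumptions of this example; the choice to back `<|>` with the "first success wins" operation and `mplus` with the concatenating one is likewise an illustrative design decision.

```haskell
import Control.Applicative (Alternative (..))
import Control.Monad (MonadPlus (..))

newtype Parser a = Parser { parse :: String -> [(a, String)] }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> [(f a, rest) | (a, rest) <- p s]

instance Applicative Parser where
  pure a = Parser $ \s -> [(a, s)]
  Parser pf <*> Parser pa =
    Parser $ \s -> [(f a, s2) | (f, s1) <- pf s, (a, s2) <- pa s1]

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> concat [parse (f a) rest | (a, rest) <- p s]

-- The zero parser: always fails, producing no results.
failure :: Parser a
failure = Parser (const [])

-- Run both parsers on the same input and keep every result.
combine :: Parser a -> Parser a -> Parser a
combine p q = Parser $ \s -> parse p s ++ parse q s

-- Try p; fall back to q only if p produced no result.
option :: Parser a -> Parser a -> Parser a
option p q = Parser $ \s ->
  case parse p s of
    []  -> parse q s
    res -> res

instance Alternative Parser where
  empty = failure
  (<|>) = option

instance MonadPlus Parser where
  mzero = failure
  mplus = combine

item :: Parser Char
item = Parser $ \s ->
  case s of
    []       -> []
    (c : cs) -> [(c, cs)]

-- Match one specific character (hypothetical helper for the demo).
char :: Char -> Parser Char
char c = item >>= \x -> if x == c then pure x else failure
```

With this in place, `parse (char 'a' <|> char 'b') "bc"` falls through to the second branch and yields `[('b', "c")]`, whereas `item` combined with itself via `mplus` reports the same result twice on `"x"`.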
Derived automatically from the Alternative typeclass definition are the ``many``
and ``some`` functions. ``many`` takes a single parser argument and applies it
repeatedly until it fails, then yields the results collected up to that point.
The ``some`` function behaves similarly, except that it fails if there is not
at least a single match.
```haskell
-- | One or more.
some :: f a -> f [a]
some v = some_v
  where
    many_v = some_v <|> pure []
    some_v = (:) <$> v <*> many_v

-- | Zero or more.
many :: f a -> f [a]
many v = many_v
  where
    many_v = some_v <|> pure []
    some_v = (:) <$> v <*> many_v
```
On top of this we can add functionality for checking whether the current
character in the stream matches a given predicate (i.e. is it a digit, is it a
letter, a specific word, etc.).
~~~~ {.haskell slice="chapter3/parsec.hs" lower=60 upper=65}
~~~~
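To see how a predicate-based parser interacts with the ``many`` and ``some`` defaults from ``Alternative``, here is a small self-contained sketch. The instances are a compact reconstruction from the prose above, and `digits` and `maybeDigits` are hypothetical names used only for the demonstration.

```haskell
import Control.Applicative (Alternative (..))
import Data.Char (isDigit)

newtype Parser a = Parser { parse :: String -> [(a, String)] }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> [(f a, rest) | (a, rest) <- p s]

instance Applicative Parser where
  pure a = Parser $ \s -> [(a, s)]
  Parser pf <*> Parser pa =
    Parser $ \s -> [(f a, s2) | (f, s1) <- pf s, (a, s2) <- pa s1]

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> concat [parse (f a) rest | (a, rest) <- p s]

instance Alternative Parser where
  empty = Parser (const [])
  Parser p <|> Parser q = Parser $ \s ->
    case p s of
      []  -> q s
      res -> res

item :: Parser Char
item = Parser $ \s ->
  case s of
    []       -> []
    (c : cs) -> [(c, cs)]

-- Accept the next character only if it matches the predicate.
satisfy :: (Char -> Bool) -> Parser Char
satisfy p = item >>= \c -> if p c then pure c else empty

-- One or more digits (fails when there are none).
digits :: Parser String
digits = some (satisfy isDigit)

-- Zero or more digits (never fails).
maybeDigits :: Parser String
maybeDigits = many (satisfy isDigit)
```

On the input `"123abc"` the `digits` parser returns `[("123", "abc")]`. On `"abc"` it fails, while `maybeDigits` succeeds with the empty string and leaves the input untouched.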
Essentially these 50 lines of code encode the entire core of the parser
combinator machinery. All higher order behavior can be written on top of just
this logic. Now we can write down several higher level functions which operate
over sections of the stream.

``chainl1`` parses one or more occurrences of ``p``, separated by ``op``, and
returns a value obtained by recursing until failure on the left hand side of
the stream, combining the results left-associatively with the functions
returned by ``op``. This can be used to parse left-recursive grammars.
~~~~ {.haskell slice="chapter3/parsec.hs" lower=70 upper=82}
~~~~
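A possible definition of ``chainl1`` on top of the core machinery is sketched below, together with a tiny subtraction parser that makes the left-associativity visible. The supporting instances are reconstructed from the prose, and `digit`, `minus` and `subtraction` are hypothetical names for this example.

```haskell
import Control.Applicative (Alternative (..))
import Data.Char (digitToInt, isDigit)

newtype Parser a = Parser { parse :: String -> [(a, String)] }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> [(f a, rest) | (a, rest) <- p s]

instance Applicative Parser where
  pure a = Parser $ \s -> [(a, s)]
  Parser pf <*> Parser pa =
    Parser $ \s -> [(f a, s2) | (f, s1) <- pf s, (a, s2) <- pa s1]

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> concat [parse (f a) rest | (a, rest) <- p s]

instance Alternative Parser where
  empty = Parser (const [])
  Parser p <|> Parser q = Parser $ \s ->
    case p s of
      []  -> q s
      res -> res

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser $ \s ->
  case s of
    (c : cs) | p c -> [(c, cs)]
    _              -> []

-- Parse p (op p)* and fold the results to the left.
chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainl1 p op = p >>= rest
  where
    -- Keep extending the accumulated value while "op p" still parses.
    rest a = step a <|> pure a
    step a = do
      f <- op
      b <- p
      rest (f a b)

digit :: Parser Int
digit = digitToInt <$> satisfy isDigit

minus :: Parser (Int -> Int -> Int)
minus = (-) <$ satisfy (== '-')

subtraction :: Parser Int
subtraction = digit `chainl1` minus
```

Parsing `"9-3-2"` with `subtraction` gives `[(4, "")]`, that is `(9 - 3) - 2`; a right-associative reading would have produced `8`.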
Using ``satisfy`` we can write down several combinators for detecting the
presence of specific common patterns of characters (numbers, parenthesized
expressions, whitespace, etc.).
~~~~ {.haskell slice="chapter3/parsec.hs" lower=83 upper=117}
~~~~
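The following sketch gives one plausible set of such combinators. The names (`char`, `string`, `spaces`, `token`, `reserved`, `number`, `parens`) follow common parser combinator conventions and are assumptions here, as are the reconstructed instances they rely on.

```haskell
import Control.Applicative (Alternative (..))
import Data.Char (isDigit, isSpace)

newtype Parser a = Parser { parse :: String -> [(a, String)] }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> [(f a, rest) | (a, rest) <- p s]

instance Applicative Parser where
  pure a = Parser $ \s -> [(a, s)]
  Parser pf <*> Parser pa =
    Parser $ \s -> [(f a, s2) | (f, s1) <- pf s, (a, s2) <- pa s1]

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> concat [parse (f a) rest | (a, rest) <- p s]

instance Alternative Parser where
  empty = Parser (const [])
  Parser p <|> Parser q = Parser $ \s ->
    case p s of
      []  -> q s
      res -> res

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser $ \s ->
  case s of
    (c : cs) | p c -> [(c, cs)]
    _              -> []

-- A single specific character.
char :: Char -> Parser Char
char c = satisfy (== c)

-- A specific string, matched character by character.
string :: String -> Parser String
string = traverse char

-- Zero or more whitespace characters.
spaces :: Parser String
spaces = many (satisfy isSpace)

-- Run a parser, then skip any trailing whitespace.
token :: Parser a -> Parser a
token p = p <* spaces

-- A fixed piece of syntax followed by optional whitespace.
reserved :: String -> Parser String
reserved = token . string

-- An optionally negative integer literal.
number :: Parser Int
number = do
  sign <- string "-" <|> pure ""
  ds   <- some (satisfy isDigit)
  pure (read (sign ++ ds))

-- A parser wrapped in parentheses.
parens :: Parser a -> Parser a
parens p = reserved "(" *> p <* reserved ")"
```

As a quick check, `parse (parens (token number)) "( -42 ) x"` returns `[(-42, "x")]`: the parentheses and the whitespace around the number are consumed, and the trailing `x` is left in the stream.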
**And that's about it!** In a few hundred lines we have enough of a parser
library to write down a simple parser for a calculator grammar. In the formal
Backus-Naur Form our grammar would be written as:
```haskell
number = [ "-" ] digit { digit }.
digit = "0" | "1" | ... | "8" | "9".
expr = term { addop term }.
term = factor { mulop factor }.
factor = "(" expr ")" | number.
addop = "+" | "-".
mulop = "*".
```
The direct translation to Haskell in terms of our newly constructed parser
combinator has the following form:
~~~~ {.haskell slice="chapter3/parsec.hs" lower=130 upper=183}
~~~~
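To keep a runnable illustration short, the sketch below translates the same grammar using `Text.ParserCombinators.ReadP` from `base` as a stand-in for NanoParsec. This is a swapped-in library, not the chapter's own code, and it evaluates directly to an `Int` instead of building an AST; `calc` is a hypothetical entry point. Each production of the grammar maps to one definition.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- number = [ "-" ] digit { digit }.
number :: ReadP Int
number = do
  sign <- option "" (string "-")
  ds   <- munch1 isDigit
  pure (read (sign ++ ds))

-- factor = "(" expr ")" | number.
factor :: ReadP Int
factor = between (char '(') (char ')') expr +++ number

-- term = factor { mulop factor }.
term :: ReadP Int
term = factor `chainl1` mulop

-- expr = term { addop term }.
expr :: ReadP Int
expr = term `chainl1` addop

-- addop = "+" | "-".   mulop = "*".
addop, mulop :: ReadP (Int -> Int -> Int)
addop = ((+) <$ char '+') +++ ((-) <$ char '-')
mulop = (*) <$ char '*'

-- Accept only parses that consume the whole input.
calc :: String -> Maybe Int
calc s =
  case [v | (v, "") <- readP_to_S (expr <* eof) s] of
    [v] -> Just v
    _   -> Nothing
```

Because `term` sits below `expr` in the grammar, multiplication binds tighter than addition: `calc "1+2*3"` is `Just 7` and `calc "(1+2)*3"` is `Just 9`. Malformed input such as `"1+"` gives `Nothing`.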
Now we can try out our little parser.
```bash
$ runhaskell parsec.hs
> 1+2
3
> 1+2*3
7
```
**Generalizing String**

The limitations of the String type are well-known, but what is particularly nice
about this approach is that it adapts to different stream types simply by adding
an additional parameter to the Parser type which holds the stream type. In its
place a more efficient string data structure (``Text``, ``ByteString``) can be
used.
```haskell
newtype Parser s a = Parser { parse :: s -> [(a,s)] }
```
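As a small illustration of what the extra parameter buys us, the sketch below writes the single-character parser against an abstract "uncons" function instead of pattern matching on a list. The names `itemWith` and `stringItem` are hypothetical. The example sticks to `String` so that it only needs `base`, but `Data.Text.uncons` from the `text` package has the same shape (`Text -> Maybe (Char, Text)`) and could be passed in the same way.

```haskell
import qualified Data.List as List

newtype Parser s a = Parser { parse :: s -> [(a, s)] }

-- Build a single-character parser from any function that splits a stream
-- into its first character and the remainder.
itemWith :: (s -> Maybe (Char, s)) -> Parser s Char
itemWith next = Parser $ \s ->
  case next s of
    Nothing        -> []
    Just (c, rest) -> [(c, rest)]

-- The String instance of the idea, using uncons for lists.
stringItem :: Parser String Char
stringItem = itemWith List.uncons
```

Only the primitive that touches the stream changes; combinators built on top of it are unaffected by the choice of stream type.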
For the first couple of simple parsers we will use the String type for
simplicity's sake, but later we will generalize our parsers to use the ``Text``
type. The combinators and parsing logic will not change, only the lexer and
language definition types will change slightly to a generalized form.

Parsec
------
Now that we have a feel for how parser combinators work, we can graduate to the
full Parsec library. We'll effectively ignore the gritty details of parsing and
lexing from now on. Although an interesting subject, parsing is effectively a
solved problem and the details are not terribly important for our purposes.
The *Parsec* library defines a set of common combinators much like the operators
we defined in our toy library.

Combinator       Description
---------------  ------------
``char``         Match the given character.
``string``       Match the given string.
``<|>``          The choice operator tries to parse the first argument before proceeding to the second. Can be chained sequentially to generate a sequence of options.
``many``         Consumes an arbitrary number of patterns matching the given pattern and returns them as a list.
``many1``        Like many but requires at least one match.
``sepBy``        Match an arbitrary-length sequence of patterns, delimited by a given pattern.
``optionMaybe``  Optionally parses a given pattern, returning its value as a Maybe.
``try``          Backtracking operator will let us parse ambiguous matching expressions and restart with a different pattern.
``parens``       Parses the given pattern surrounded by parentheses.
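A few of these combinators in action, as a small sketch that assumes the `parsec` package is available. The parsers `number`, `numberList` and `keyword` are hypothetical examples and not part of the calculator developed in this chapter.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- One or more digits, read as an Int.
number :: Parser Int
number = read <$> many1 digit

-- A comma-separated list of numbers in square brackets, e.g. "[1,2,3]".
numberList :: Parser [Int]
numberList = char '[' *> (number `sepBy` char ',') <* char ']'

-- "let" and "letrec" share a prefix, so the longer keyword is wrapped in
-- try: if it fails part-way, the input is rewound before trying "let".
keyword :: Parser String
keyword = try (string "letrec") <|> string "let"
```

Running `parse numberList "" "[1,2,3]"` succeeds with `[1,2,3]`, and `parse keyword "" "let x"` succeeds with `"let"` thanks to the backtracking provided by ``try``.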

**Tokens**

To create a Parsec lexer we must first specify several parameters about how
individual characters are handled and converted into tokens. For example some
tokens will be handled as comments and simply omitted from the parse stream.
Other parameters include indicating what characters are to be handled as keyword
identifiers or operators.
```haskell
langDef :: Tok.LanguageDef ()
langDef = Tok.LanguageDef
  { Tok.commentStart    = "{-"
  , Tok.commentEnd      = "-}"
  , Tok.commentLine     = "--"
  , Tok.nestedComments  = True
  , Tok.identStart      = letter
  , Tok.identLetter     = alphaNum <|> oneOf "_'"
  , Tok.opStart         = oneOf ":!#$%&*+./<=>?@\\^|-~"
  , Tok.opLetter        = oneOf ":!#$%&*+./<=>?@\\^|-~"
  , Tok.reservedNames   = reservedNames
  , Tok.reservedOpNames = reservedOps
  , Tok.caseSensitive   = True
  }
```
**Lexer**

Given the token definition we can create the lexer functions.
~~~~ {.haskell slice="chapter3/calc/Parser.hs" lower=30 upper=47}
~~~~
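As a hedged sketch of what such lexer functions can look like, the example below builds a token parser with `Tok.makeTokenParser` and wraps a few of its fields. It assumes the `parsec` package and, to stay self-contained, starts from `emptyDef` with a made-up list of reserved words instead of the `langDef` shown earlier. The helper `lexString` is likewise hypothetical.

```haskell
import Text.Parsec (ParseError, parse)
import Text.Parsec.Language (emptyDef)
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as Tok

-- Token parser built from a language definition (assumed settings).
lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser style
  where
    style = emptyDef
      { Tok.reservedNames   = ["if", "then", "else"]
      , Tok.reservedOpNames = ["+", "*"]
      }

-- Each helper is a lexeme parser: it also skips trailing whitespace.
parens :: Parser a -> Parser a
parens = Tok.parens lexer

reserved :: String -> Parser ()
reserved = Tok.reserved lexer

identifier :: Parser String
identifier = Tok.identifier lexer

integer :: Parser Integer
integer = Tok.integer lexer

-- Convenience runner for trying the lexer functions out.
lexString :: Parser a -> String -> Either ParseError a
lexString p = parse p "<stdin>"
```

For example, `lexString (parens identifier) "( foo )"` succeeds with `"foo"`, while `lexString identifier "if"` is rejected because `if` is listed as a reserved name.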
**Abstract Syntax Tree**

In a separate module we'll now define the abstract syntax for our language as a
datatype.
~~~~ {.haskell include="chapter3/calc/Syntax.hs"}
~~~~
**Parser**

Much like before, our parser is simply written in monadic blocks, each mapping a
set of patterns to a construct in our ``Expr`` type. The toplevel entry point
to our parser is the ``expr`` function, which we can run using the Parsec
function ``parse``.
~~~~ {.haskell slice="chapter3/calc/Parser.hs" lower=46 upper=94}
~~~~
The toplevel function we'll expose from our ``Parser`` module is ``parseExpr``,
which will be called as the entry point in our REPL.
~~~~ {.haskell slice="chapter3/calc/Parser.hs" lower=99 upper=100}
~~~~
Evaluation
----------
Our small language gives rise to two syntactic classes, values and expressions.
Values are in *normal form* and cannot be reduced further. They consist of
``True`` and ``False`` values and literal numbers.
~~~~ {.haskell slice="chapter3/calc/Eval.hs" lower=8 upper=17}
~~~~
The evaluation of our language uses the ``Maybe`` applicative to accommodate
the fact that our reduction may halt at any level with a ``Nothing`` if the
expression being reduced has reached a normal form or cannot proceed because the
reduction simply isn't well-defined. Each evaluation rule describes a single
small step that takes an expression from one form to another.
~~~~ {.haskell slice="chapter3/calc/Eval.hs" lower=19 upper=31}
~~~~
At the toplevel we simply apply ``eval'`` repeatedly until either a value is
reached or we're left with an expression that has no well-defined way to
proceed. The term is "stuck" and the program is in an undefined state.
~~~~ {.haskell slice="chapter3/calc/Eval.hs" lower=33 upper=39}
~~~~
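Putting the pieces of this section together, a small-step evaluator along these lines might look like the sketch below. The constructor names of `Expr` and the helpers `isNum`, `isVal` and `nf` are assumptions chosen to match the prose and the sample session in this chapter, not a copy of `Eval.hs`.

```haskell
import Data.Maybe (fromMaybe)

-- Assumed abstract syntax for the arithmetic language.
data Expr
  = Tr
  | Fl
  | Zero
  | IsZero Expr
  | Succ Expr
  | Pred Expr
  | If Expr Expr Expr
  deriving (Eq, Show)

-- Numeric values are zero and successors of numeric values.
isNum :: Expr -> Bool
isNum Zero     = True
isNum (Succ t) = isNum t
isNum _        = False

-- Values are booleans and numeric values.
isVal :: Expr -> Bool
isVal Tr = True
isVal Fl = True
isVal t  = isNum t

-- A single small step; Nothing means no rule applies.
eval' :: Expr -> Maybe Expr
eval' x = case x of
  IsZero Zero               -> Just Tr
  IsZero (Succ t) | isNum t -> Just Fl
  IsZero t                  -> IsZero <$> eval' t
  Succ t                    -> Succ <$> eval' t
  Pred Zero                 -> Just Zero
  Pred (Succ t) | isNum t   -> Just t
  Pred t                    -> Pred <$> eval' t
  If Tr c _                 -> Just c
  If Fl _ a                 -> Just a
  If t c a                  -> (\t' -> If t' c a) <$> eval' t
  _                         -> Nothing

-- Step repeatedly until no rule applies.
nf :: Expr -> Expr
nf x = fromMaybe x (nf <$> eval' x)

-- Succeed only if the fully reduced term is a value; otherwise it is stuck.
eval :: Expr -> Maybe Expr
eval t =
  let t' = nf t
  in if isVal t' then Just t' else Nothing
```

For instance, `eval (IsZero (Pred (Succ (Succ Zero))))` reduces to `Just Fl`, while `eval (IsZero Fl)` returns `Nothing` because the term is stuck.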
REPL
----
The driver for our simple language simply invokes all of the parser and
evaluation logic in a loop feeding the resulting state to the next iteration. We
will use the [haskeline](http://hackage.haskell.org/package/haskeline) library
to give us readline interactions for the small REPL. Behind the scenes haskeline
handles line editing and the platform-specific details of managing the
terminal input. To start out we just create the simplest loop, which only parses
and evaluates expressions and prints them to the screen. We'll build on this
pattern in each chapter, eventually ending up with a more full-featured REPL.

The two functions of note are the operations for the ``InputT`` monad
transformer.
```haskell
runInputT :: Settings IO -> InputT IO a -> IO a
getInputLine :: String -> InputT IO (Maybe String)
```
When the user enters an ``EOF`` or sends a ``SIGQUIT`` to input, ``getInputLine``
will yield ``Nothing``, and we can handle the exit logic.
```haskell
process :: String -> IO ()
process line = do
  let res = parseExpr line
  case res of
    Left err -> print err
    Right ex -> print $ runEval ex

main :: IO ()
main = runInputT defaultSettings loop
  where
  loop = do
    minput <- getInputLine "Repl> "
    case minput of
      Nothing -> outputStrLn "Goodbye."
      Just input -> (liftIO $ process input) >> loop
```
Soundness
---------
Great, now let's test our little interpreter and indeed we see that it behaves
as expected.
```bash
Arith> succ 0
succ 0
Arith> succ (succ 0)
succ (succ 0)
Arith> iszero 0
true
Arith> if false then true else false
false
Arith> iszero (pred (succ (succ 0)))
false
Arith> pred (succ 0)
0
Arith> iszero false
Cannot evaluate
Arith> if 0 then true else false
Cannot evaluate
```
Oh no, our calculator language allows us to evaluate terms which are
syntactically valid but semantically meaningless. We'd like to restrict the
existence of such terms, since when we start compiling our languages later into
native CPU instructions these kinds of errors will correspond to all sorts of
nastiness (segfaults, out of bounds errors, etc). How can we make these illegal
states unrepresentable to begin with?

Full Source
-----------
* [NanoParsec](https://github.com/sdiehl/write-you-a-haskell/blob/master/chapter3/parsec.hs)
* [Calculator](https://github.com/sdiehl/write-you-a-haskell/tree/master/chapter3/calc)

\pagebreak