This is an industrial-strength monadic parser combinator library. Megaparsec is a fork of the Parsec library, originally written by Daan Leijen.

This project provides flexible solutions to satisfy common parsing needs. This section describes them briefly. If you’re looking for comprehensive documentation, see the section about documentation.
The package is built around MonadParsec, an MTL-style monad transformer. All tools and features work with all instances of MonadParsec. You can achieve various effects by combining monad transformers, i.e. by building a monad stack. Since the standard monad transformers like WriterT, StateT, ReaderT, and others are instances of the MonadParsec type class, you can wrap ParsecT in these monads, achieving, for example, backtracking state.

On the other hand, ParsecT is an instance of many type classes as well. The most useful ones are Monad, Applicative, Alternative, and MonadParsec.
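Here is a minimal sketch of that idea (not from the library’s docs; the names are made up for illustration): putting StateT on top of the stock Parser type gives user state that backtracks together with the parser, and all combinators keep working because the library provides a MonadParsec instance for StateT.

import Control.Monad.State.Strict
import Text.Megaparsec
import Text.Megaparsec.String (Parser)

-- User state on top of the parser: when a branch backtracks, its state
-- updates are discarded along with the input it consumed.
type CountingParser = StateT Int Parser

-- Count the 'a' characters we consume.
countAs :: CountingParser ()
countAs = skipMany (char 'a' *> modify (+ 1))

-- parse (runStateT countAs 0) "" "aaab" should yield Right ((), 3).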
The module Text.Megaparsec.Combinator (its functions are included in Text.Megaparsec) contains traditional, general combinators that work with instances of Applicative and Alternative.
Let’s enumerate the methods of the MonadParsec type class. The class abstracts the primitive functions of Megaparsec parsing; the rest of the library is built by combining these primitives:

failure allows you to fail, reporting an arbitrary parse error.
label adds a “label” to a parser, so that if it fails the user will see the label instead of an automatically deduced expected token.
hidden hides a parser from error messages altogether. This is the recommended way to hide things; prefer it to the label "" approach.
try enables backtracking in parsing.
lookAhead allows you to parse input without consuming it.
notFollowedBy succeeds when its argument fails, and does not consume input.
withRecovery allows you to recover from parse errors “on-the-fly” and continue parsing. Once parsing is finished, several parse errors may be reported, or ignored altogether.
observing allows you to “observe” parse errors without ending parsing (they are returned in Left, while normal results are wrapped in Right).
eof only succeeds at the end of input.
token is used to parse a single token.
tokens makes it easy to parse several tokens in a row.
getParserState returns the full parser state.
updateParserState applies a given function to the parser state.

This list of core functions is longer than in some other libraries. Our goal is efficient and readable implementations and rich functionality, not a minimal number of primitive combinators. You can read a comprehensive description of every primitive function in the Megaparsec documentation.
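As a small taste of how these primitives compose, here is a sketch (not from the library itself; it assumes the Parser type synonym from Text.Megaparsec.String): a keyword parser built from try, notFollowedBy, and label, so that a keyword is rejected when it is merely a prefix of an identifier.

-- A hypothetical helper: parse the word w, but only if it is not
-- followed by an identifier character; report failures under the label w.
keyword :: String -> Parser ()
keyword w = label w . try $ string w *> notFollowedBy alphaNumChar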
Megaparsec can currently work with the following types of input stream out-of-the-box:

String = [Char]
ByteString (strict and lazy)
Text (strict and lazy)

It’s also simple to make it work with custom token streams, and Megaparsec users have done so many times with great success.
Megaparsec 5 introduces well-typed error messages and the ability to use custom data types to adapt the library to your specific domain of interest. There is no need to use a shapeless bunch of strings anymore.

The default error component (Dec) has constructors corresponding to the fail function and to indentation-related error messages. It is a decent option that should work out-of-the-box for most parsing needs, while you are free to use your own custom error component when necessary.

This new design allows Megaparsec 5 to produce much more helpful error messages for indentation-sensitive parsing instead of the plain “incorrect indentation” phrase.
Megaparsec works well with streams of tokens produced by tools like Alex/Happy. Megaparsec 5 adds an updatePos method to the Stream type class that gives you full control over the textual positions used to report token positions in error messages. You can update the current position on a per-character basis or extract it from the token itself.
Megaparsec has decent support for Unicode-aware character parsing. Functions for character parsing live in the Text.Megaparsec.Char module (they are all included in Text.Megaparsec). The functions can be divided into several categories:

Simple parsers that parse a certain character or several characters of the same kind. This includes newline, crlf, eol, tab, and space.
Parsers corresponding to categories of characters. These parse a single character that belongs to a certain category, for example: controlChar, spaceChar, upperChar, lowerChar, printChar, digitChar, and others.
General parsers that let you parse a single character you specify, one of the given characters, any character except the given ones, or a character satisfying a given predicate. Case-insensitive versions of these parsers are available.
Parsers for sequences of characters, which parse strings. The case-sensitive string parser is available, as well as the case-insensitive string'.
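For instance, here is a one-liner touching several of these categories (a sketch, again assuming the String-based Parser type): a hexadecimal literal parsed with a specific character (char), a case-insensitive character (char'), and a character-category parser (hexDigitChar).

-- Parses "0xff2A", "0Xff2A", and so on.
hexLiteral :: Parser String
hexLiteral = char '0' *> char' 'x' *> some hexDigitChar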
For those who are interested in parsing permutation phrases, there is Text.Megaparsec.Perm. You have to import the module explicitly; it’s not included in the Text.Megaparsec module.

Megaparsec has a solution for parsing expressions. Take a look at Text.Megaparsec.Expr. You have to import the module explicitly; it’s not included in Text.Megaparsec.

Given a table of operators that describes their fixity and precedence, you can construct a parser that will parse any expression involving those operators. See the documentation for a comprehensive description of how it works.
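A minimal sketch of the idea (complete, tutorial-sized examples appear later on this page; symbol and integer are assumed to be lexeme helpers like the ones defined in the tutorials below):

-- Earlier levels in the table bind tighter, so * is parsed before +.
arith :: Parser Integer
arith = makeExprParser term table
  where
    term  = between (symbol "(") (symbol ")") arith <|> integer
    table = [ [InfixL ((*) <$ symbol "*")]
            , [InfixL ((+) <$ symbol "+")] ]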
Text.Megaparsec.Lexer is a module that should help you write your lexer. If you have used Parsec in the past, this module “fixes” its particularly inflexible Text.Parsec.Token.

Text.Megaparsec.Lexer is intended to be imported qualified; it’s not included in Text.Megaparsec. The module doesn’t dictate how you should write your parser, but certain approaches may be more elegant than others. An especially important theme is parsing of white space, comments, and indentation.

The design of the module lets you solve simple tasks quickly and doesn’t get in your way when you want to implement something less standard.

Since Megaparsec 5, all tools for indentation-sensitive parsing are available in the Text.Megaparsec.Lexer module—no third-party packages required.
Megaparsec is well-documented. All functions and data types are thoroughly described. We take care to avoid outdated information and unclear phrasing in our documentation. See the current version of the Megaparsec documentation on Hackage for yourself.

You can visit the project site, which has several tutorials that should help you get started with your parsing tasks. The site also has instructions and tips for Parsec users who decide to migrate to Megaparsec. If you want to improve an existing tutorial or add your own, open a PR against this repo.

Despite being quite flexible, Megaparsec is also faster than Parsec. The repository includes benchmarks that can easily be used to compare Megaparsec and Parsec. In most cases Megaparsec is faster, sometimes dramatically so. If you happen to have some other benchmarks, I would appreciate it if you added Megaparsec to them and let me know how it performs.

If you think your Megaparsec parser is not efficient enough, take a look at these instructions.
There are quite a few libraries that can be used for parsing in Haskell; let’s compare Megaparsec with some of them.

Attoparsec is another prominent Haskell parsing library. Although both libraries deal with parsing, it’s usually easy to decide which one a particular project needs:

Attoparsec is much faster but not as feature-rich. It should be used when you want to process large amounts of data where performance matters more than quality of error messages.
Megaparsec is good for parsing source code or other human-readable text. It has better error messages and is implemented as a monad transformer.

So, if you work with something human-readable where the size of the input data is usually not huge, go with Megaparsec; otherwise Attoparsec may be a better choice.
Since Megaparsec is a fork of Parsec, we are bound to list the main differences between the two libraries:

Better error messages. We test our error messages with dense QuickCheck tests. Good error messages are just as important to us as correct return values of our parsers. Megaparsec will be especially useful if you write a compiler or an interpreter for some language.
Some quirks and “buggy features” (as well as plain bugs) of the original Parsec are fixed. There is no undocumented surprising stuff in Megaparsec.
Better support for Unicode parsing in Text.Megaparsec.Char.
Megaparsec has more powerful combinators and can parse languages where indentation matters.
A comprehensive QuickCheck test suite covering nearly 100% of our code.
We have benchmarks to detect performance regressions.
Better documentation, with 100% of functions covered, without typos or obsolete information, and with working examples. Megaparsec’s documentation is well-structured and doesn’t contain things useless to end users.
Megaparsec’s code is clearer and doesn’t contain the “magic” found in the original Parsec.
Megaparsec has well-typed error messages and custom error messages.
Megaparsec can recover from parse errors “on the fly” and continue parsing.
Megaparsec allows you to conditionally process parse errors inside your parser before parsing is finished. In particular, it’s possible to define regions in which parse errors, should they happen, will get a “context tag”; e.g. we could build a context stack like “in function definition foo”, “in expression x”, etc. This is not possible with Parsec.
Megaparsec is faster.
Megaparsec is better supported.

If you want to see a detailed change log, CHANGELOG.md may be helpful. Also see the original announcement for another comparison.
To be honest, Parsec’s development seems to have stagnated. It has no test suite (only three per-bug tests), and all its releases beginning with version 3.1.2 (according to its change log) have been about introducing and fixing regressions. Parsec is old and somewhat famous in the Haskell community, so we understand there will be some inertia, but we advise you to use Megaparsec from now on because it solves many problems of the original Parsec project. If you think you still have a reason to use the original Parsec, open an issue.
Trifecta is another Haskell library featuring good error messages. Like some other projects of Edward Kmett, it’s probably good, but it is also under-documented and has unfixed bugs and flaws that Edward is too busy to fix (simply a fact, no offense intended). Other reasons one may question the choice of Trifecta as their parsing library:

It is complicated, doesn’t have any tutorials available, and its documentation doesn’t help at all.
Trifecta can parse String and ByteString natively, but not Text.
Trifecta’s error messages may have their own distinctive features, but they are certainly not as flexible as Megaparsec’s error messages in the latest versions.
It depends on lens. This means you’ll pull in half of Hackage as transitive dependencies. Also, if you’re not into lens and would like to keep your code “vanilla”, you may not like the API.
Earley is a newer library that allows you to safely (if your code compiles, it probably works) parse context-free grammars (CFGs). Megaparsec is a lower-level library than Earley, but there are still enough reasons to choose it over Earley:

Megaparsec is faster.
Your grammar may not be context-free, or you may want to introduce some sort of state into the parsing process. Almost all non-trivial parsers require something of this sort. Even if your grammar is context-free, state may allow you to add some additional niceties. Earley does not support that.
Megaparsec’s error messages are more flexible, allowing you to include arbitrary data in them, return multiple error messages, mark regions that affect any error happening in those regions, etc.
The approach Earley uses differs from conventional monadic parsing. If you don’t work alone, chances are the people you work with, especially beginners, will be much more productive with libraries taking the more traditional path to parsing, like Megaparsec.

In other words, Megaparsec is less safe but also more powerful.
There is the Parsers package, which is great. You can use it with Megaparsec or Parsec, but consider the following:

It depends on Attoparsec, Parsec, and Trifecta, which means you always grab half of Hackage as transitive dependencies by using it. This is ridiculous, by the way, because this package is supposed to be useful for parser builders, so they can write basic core functionality and get the rest “for free”.
It currently has a debatable feature (arguably a bug) in the definitions of lookAhead for various monad transformers like StateT, etc., which shows up when you create backtracking state via a monad stack rather than via built-in features. The feature makes lookAhead backtrack your parser state but not your custom state added via StateT. Kmett thinks this behavior is better.

We intended to use the Parsers library in Megaparsec at some point, but aside from the already mentioned flaws, the library has different naming conventions, a different set of “core” functions, and a different approach to lexing. So it didn’t happen; Megaparsec has minimal dependencies, and it is feature-rich and self-contained.
The following packages are designed to be used with Megaparsec:

hspec-megaparsec—utilities for testing Megaparsec parsers with Hspec.
cassava-megaparsec—a Megaparsec parser of CSV files that plays nicely with Cassava.
tagsoup-megaparsec—a library for easily using TagSoup as a token type in Megaparsec.
Here are some blog posts, mainly announcing new features of the project and describing what sorts of things are now possible:

The project was started and is currently maintained by Mark Karpov. You can find the complete list of contributors in the AUTHORS.md file in the official repository of the project. Thanks to all the people who propose features and ideas; although they are not in AUTHORS.md, without them Megaparsec would not be as good.

Issues (bugs, feature requests, or other feedback) may be reported in the GitHub issue tracker for this project.

Pull requests are also welcome (and yes, they will get attention and will be merged quickly if they are good).

If you want to write a tutorial to be hosted on Megaparsec’s site, open an issue or pull request here.

Copyright © 2015–2017 Megaparsec contributors
Copyright © 2007 Paolo Martini
Copyright © 1999–2000 Daan Leijen

Distributed under the FreeBSD license.
One of the advantages of Megaparsec 5 is the ability to use your own data types as part of the data that is returned on parse failure. This opens up the possibility of tailoring error messages to your domain of interest in a way that is quite unique to this library. Needless to say, all data that constitutes an error message is typed in Megaparsec 5, so it’s easy to inspect and manipulate it.

In this tutorial we will walk through the creation of a parser found in the existing library cassava-megaparsec, an alternative parser for the popular cassava library that allows you to parse CSV data. The default parser’s error messages are not very user-friendly, so I was asked to design a better one using Megaparsec 5.

In addition to the standard error messages (“expected” and “unexpected” tokens), the library can report problems that have to do with using methods from the FromRecord and FromNamedRecord type classes, which describe how to transform a collection of ByteStrings into a particular instance of those type classes. While performing the conversion, things may go wrong, and we would like to use a special data constructor in those cases.

The complete source code can be found in this GitHub repository.
We will need some language extensions and imports; here is the top of Data.Csv.Parser.Megaparsec almost verbatim:
{-# LANGUAGE BangPatterns       #-}
{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE RecordWildCards    #-}

module Data.Csv.Parser.Megaparsec
  ( Cec (..)
  , decode
  , decodeWith
  , decodeByName
  , decodeByNameWith )
where

import Control.Monad
import Data.ByteString (ByteString)
import Data.Char (chr)
import Data.Csv hiding
  ( Parser
  , record
  , namedRecord
  , header
  , toNamedRecord
  , decode
  , decodeWith
  , decodeByName
  , decodeByNameWith )
import Data.Data
import Data.Vector (Vector)
import Data.Word (Word8)
import Text.Megaparsec
import qualified Data.ByteString.Char8 as BC8
import qualified Data.ByteString.Lazy  as BL
import qualified Data.Csv              as C
import qualified Data.HashMap.Strict   as H
import qualified Data.Set              as S
import qualified Data.Vector           as V
Note that there are two imports for Data.Csv: one for some common things, like the names of type classes, that I want to keep unprefixed, and a second one for the rest of the stuff (qualified as C).
What is ParseError, actually?

To start with custom error messages, we should take a look at how parse errors are represented in Megaparsec 5.

The main type for error messages is ParseError, which is defined like this:
-- | 'ParseError' represents… parse errors. It provides the stack of source
-- positions, a set of expected and unexpected tokens as well as a set of
-- custom associated data. The data type is parametrized over the token type
-- @t@ and the custom data @e@.
--
-- Note that the stack of source positions contains current position as its
-- head, and the rest of positions allows to track full sequence of include
-- files with topmost source file at the end of the list.
--
-- 'Semigroup' (and 'Monoid') instance of the data type allows to merge
-- parse errors from different branches of parsing. When merging two
-- 'ParseError's, the longest match is preferred; if positions are the same,
-- custom data sets and collections of message items are combined.

data ParseError t e = ParseError
  { errorPos        :: NonEmpty SourcePos -- ^ Stack of source positions
  , errorUnexpected :: Set (ErrorItem t)  -- ^ Unexpected items
  , errorExpected   :: Set (ErrorItem t)  -- ^ Expected items
  , errorCustom     :: Set e              -- ^ Associated data, if any
  } deriving (Show, Read, Eq, Data, Typeable, Generic)
Conceptually, we have four components in a parse error:

the stack of source positions;
the set of unexpected items;
the set of expected items (they have the type ErrorItem, if you are curious);
the set of custom data of the e type. e is the type we will be defining and using in this tutorial.
type, so here it is:
-- | “Default error component”. This is our instance of 'ErrorComponent'
-- provided out-of-the-box.
--
-- @since 5.0.0

data Dec
  = DecFail String -- ^ 'fail' has been used in parser monad
  | DecIndentation Ordering Pos Pos
    -- ^ Incorrect indentation error: desired ordering between reference
    -- level and actual level, reference indentation level, actual
    -- indentation level
  deriving (Show, Read, Eq, Ord, Data, Typeable)
As you can see, it is just a sum type that accounts for all the kinds of failures that we need to think about in vanilla Megaparsec:

use of the fail method;
incorrect indentation (reported by functions in Text.Megaparsec.Lexer).

What this means is that our new custom type should somehow provide a way to represent those things too. The requirement that a type should be capable of representing the above-mentioned exceptional situations is captured by the ErrorComponent type class:
-- | The type class defines how to represent information about various
-- exceptional situations. Data types that are used as custom data component
-- in 'ParseError' must be instances of this type class.
--
-- @since 5.0.0

class Ord e => ErrorComponent e where

  -- | Represent message passed to 'fail' in parser monad.
  --
  -- @since 5.0.0

  representFail :: String -> e

  -- | Represent information about incorrect indentation.
  --
  -- @since 5.0.0

  representIndentation
    :: Ordering -- ^ Desired ordering between reference level and actual level
    -> Pos      -- ^ Reference indentation level
    -> Pos      -- ^ Actual indentation level
    -> e
Every type that is going to be used as part of ParseError must be an instance of the ErrorComponent type class.

Another thing we would like to do with a custom error component is to format it somehow, so it can be inserted into the pretty-printed representation of ParseError. This behavior is defined by the ShowErrorComponent type class:
-- | The type class defines how to print custom data component of
-- 'ParseError'.
--
-- @since 5.0.0

class Ord a => ShowErrorComponent a where

  -- | Pretty-print custom data component of 'ParseError'.

  showErrorComponent :: a -> String
We will need to make our new data type an instance of that class as well.

So, let’s start. We can grab the existing definitions and instances of the Dec data type and change them as necessary. The special case we want to support concerns failed conversion from a vector of ByteStrings to some particular type; let’s capture this:
-- | Custom error component for CSV parsing. It allows typed reporting of
-- conversion errors.

data Cec
  = CecFail String
  | CecIndentation Ordering Pos Pos
  | CecConversionError String
  deriving (Eq, Data, Typeable, Ord, Read, Show)

instance ShowErrorComponent Cec where
  showErrorComponent (CecFail msg) = msg
  showErrorComponent (CecIndentation ord ref actual) =
    "incorrect indentation (got " ++ show (unPos actual) ++
    ", should be " ++ p ++ show (unPos ref) ++ ")"
    where p = case ord of
                LT -> "less than "
                EQ -> "equal to "
                GT -> "greater than "
  showErrorComponent (CecConversionError msg) =
    "conversion error: " ++ msg

instance ErrorComponent Cec where
  representFail        = CecFail
  representIndentation = CecIndentation
We have reused the definitions from Megaparsec’s source code for Dec here and added a special case represented by CecConversionError. It contains the String that Cassava’s conversion functions return. We could do better if Cassava provided typed error values, but a String is all we have, so let’s work with it.

Another handy definition we need is the Parser type synonym. We cannot use one of the default Parser definitions because those assume Dec, so we define it ourselves rather trivially:
-- | Parser type that uses “custom error component” 'Cec'.

type Parser = Parsec Cec BL.ByteString
Let’s start from the top and take a look at the top-level, public API:

-- | Deserialize CSV records from a lazy 'BL.ByteString'. If this fails due
-- to incomplete or invalid input, 'Left' is returned. Equivalent to
-- 'decodeWith' 'defaultDecodeOptions'.

decode :: FromRecord a
  => HasHeader
     -- ^ Whether the data contains header that should be skipped
  -> FilePath
     -- ^ File name (use empty string if you have none)
  -> BL.ByteString
     -- ^ CSV data
  -> Either (ParseError Char Cec) (Vector a)
decode = decodeWith defaultDecodeOptions

-- | Like 'decode', but lets you customize how the CSV data is parsed.

decodeWith :: FromRecord a
  => DecodeOptions
     -- ^ Decoding options
  -> HasHeader
     -- ^ Whether the data contains header that should be skipped
  -> FilePath
     -- ^ File name (use empty string if you have none)
  -> BL.ByteString
     -- ^ CSV data
  -> Either (ParseError Char Cec) (Vector a)
decodeWith = decodeWithC csv

-- | Deserialize CSV records from a lazy 'BL.ByteString'. If this fails due
-- to incomplete or invalid input, 'Left' is returned. The data is assumed
-- to be preceded by a header. Equivalent to 'decodeByNameWith'
-- 'defaultDecodeOptions'.

decodeByName :: FromNamedRecord a
  => FilePath      -- ^ File name (use empty string if you have none)
  -> BL.ByteString -- ^ CSV data
  -> Either (ParseError Char Cec) (Header, Vector a)
decodeByName = decodeByNameWith defaultDecodeOptions

-- | Like 'decodeByName', but lets you customize how the CSV data is parsed.

decodeByNameWith :: FromNamedRecord a
  => DecodeOptions -- ^ Decoding options
  -> FilePath      -- ^ File name (use empty string if you have none)
  -> BL.ByteString -- ^ CSV data
  -> Either (ParseError Char Cec) (Header, Vector a)
decodeByNameWith opts = parse (csvWithHeader opts)

-- | Decode CSV data using the provided parser, skipping a leading header if
-- necessary.

decodeWithC
  :: (DecodeOptions -> Parser a)
     -- ^ Parsing function parametrized by 'DecodeOptions'
  -> DecodeOptions
     -- ^ Decoding options
  -> HasHeader
     -- ^ Whether to expect a header in the input
  -> FilePath
     -- ^ File name (use empty string if you have none)
  -> BL.ByteString
     -- ^ CSV data
  -> Either (ParseError Char Cec) a
decodeWithC p opts@DecodeOptions {..} hasHeader = parse parser
  where
    parser = case hasHeader of
      HasHeader -> header decDelimiter *> p opts
      NoHeader  -> p opts
Really nothing interesting here, just a bunch of wrappers that boil down to running the parser, either skipping the CSV header or not.

What I would really like to show you is the helpers, because one of them is going to be very handy when you decide to write your own parser after reading this manual. Here are the helpers:

-- | End parsing signaling a “conversion error”.

conversionError :: String -> Parser a
conversionError msg = failure S.empty S.empty (S.singleton err)
  where
    err = CecConversionError msg

-- | Convert a 'Record' to a 'NamedRecord' by attaching column names. The
-- 'Header' and 'Record' must be of the same length.

toNamedRecord :: Header -> Record -> NamedRecord
toNamedRecord hdr v = H.fromList . V.toList $ V.zip hdr v

-- | Parse a byte of specified value and return unit.

blindByte :: Word8 -> Parser ()
blindByte = void . char . chr . fromIntegral
conversionError is a handy thing to have, as you can quickly fail with your custom error message without writing all the failure-related boilerplate. toNamedRecord just converts a Record to a NamedRecord, while blindByte reads a character (passed to it as a Word8 value) and returns unit ().
Let’s start with parsing a field. A field in a CSV file can be either escaped or unescaped:
-- | Parse a field. The field may be in either the escaped or non-escaped
-- format. The returned value is unescaped.

field :: Word8 -> Parser Field
field del = label "field" (escapedField <|> unescapedField del)
An escaped field is written inside straight quotes " and can contain any characters at all, but the quote sign " itself must be escaped by repeating it twice:
-- | Parse an escaped field.

escapedField :: Parser ByteString
escapedField =
  BC8.pack <$!> between (char '"') (char '"') (many $ normalChar <|> escapedDq)
  where
    normalChar = noneOf "\"" <?> "unescaped character"
    escapedDq  = label "escaped double-quote" ('"' <$ string "\"\"")
Simple so far. unescapedField is even simpler; it can contain any character except the quote sign ", the delimiter, and newline characters:
-- | Parse an unescaped field.

unescapedField :: Word8 -> Parser ByteString
unescapedField del = BC8.pack <$!> many (noneOf es)
  where
    es = chr (fromIntegral del) : "\"\n\r"
To parse a record, we have to parse a non-empty collection of fields separated by delimiter characters (supplied via DecodeOptions). Then we convert it to a Vector ByteString, because that’s what Cassava’s conversion functions expect:
-- | Parse a record, not including the terminating line separator. The
-- terminating line separator is not included because the last record in a
-- CSV file is allowed to not have a terminating line separator.

record
  :: Word8 -- ^ Field delimiter
  -> (Record -> C.Parser a)
     -- ^ How to “parse” record to get the data of interest
  -> Parser a
record del f = do
  notFollowedBy eof -- to prevent reading empty line at the end of file
  r <- V.fromList <$!> (sepBy1 (field del) (blindByte del) <?> "record")
  case C.runParser (f r) of
    Left msg -> conversionError msg
    Right x  -> return x
The (<$!>) operator works just like the familiar (<$>) operator, but applies V.fromList strictly. Now that we have the vector of ByteStrings, we can try to convert it: on success we just return the result, on failure we fail using the conversionError helper.
The library should also handle CSV files with headers:
-- | Parse a CSV file that includes a header.

csvWithHeader :: FromNamedRecord a
  => DecodeOptions -- ^ Decoding options
  -> Parser (Header, Vector a)
     -- ^ The parser that parses a collection of named records
csvWithHeader !DecodeOptions {..} = do
  !hdr <- header decDelimiter
  let f = parseNamedRecord . toNamedRecord hdr
  xs <- sepEndBy1 (record decDelimiter f) eol
  eof
  return $ let !v = V.fromList xs in (hdr, v)

-- | Parse a header, including the terminating line separator.

header :: Word8 -> Parser Header
header del = V.fromList <$!> p <* eol
  where
    p = sepBy1 (name del) (blindByte del) <?> "file header"

-- | Parse a header name. Header names have the same format as regular
-- 'field's.

name :: Word8 -> Parser Name
name del = field del <?> "name in header"
The code should be self-explanatory by now. The only thing that remains is to parse a collection of records:
-- | Parse a CSV file that does not include a header.

csv :: FromRecord a
  => DecodeOptions     -- ^ Decoding options
  -> Parser (Vector a) -- ^ The parser that parses a collection of records
csv !DecodeOptions {..} = do
  xs <- sepEndBy1 (record decDelimiter parseRecord) eol
  eof
  return $! V.fromList xs
Too simple!
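As a quick usage sketch (the input is hypothetical; OverloadedStrings is needed for the lazy ByteString literal, and the tuple FromRecord instance comes with Cassava):

example :: Either (ParseError Char Cec) (Vector (String, Maybe Int, Double))
example = decode NoHeader "my-file.csv" "foo,12,1.5\n"
-- should evaluate to Right [("foo",Just 12,1.5)]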
The custom error messages play seamlessly with the rest of the parser. Let’s parse a CSV file into a collection of (String, Maybe Int, Double) items. If I try to parse "foo (note the opening quote), I get the usual Megaparsec error message with its “unexpected” and “expected” parts:
my-file.csv:1:5:
unexpected end of input
expecting '"', escaped double-quote, or unescaped character
However, when that phase of parsing passes successfully, as with the input foo,12,boo, the conversion is attempted and its result is reported:
my-file.csv:1:11:
conversion error: expected Double, got "boo" (Failed reading: takeWhile1)
(I wouldn’t mind if the (Failed reading: takeWhile1) part were omitted, but that’s what Cassava’s conversion methods produce.)
I hope this walkthrough has demonstrated that it’s quite trivial to insert your own data into Megaparsec error messages. This way it’s also possible to pump some data out of a failing parser, or just keep track of things in a type-safe way, which is something we should always care about when writing Haskell programs.
Megaparsec 4.4.0 is a major improvement of the library. Among other things, it provides the new primitive combinator withRecovery, which allows you to recover from parse errors “on-the-fly”, report several errors after parsing is finished, or ignore them altogether. In this tutorial, we will learn how to use this incredible tool.
For the purposes of this tutorial, we will write a parser for a simplistic functional language that consists only of equations, with a symbol on the left-hand side and an arithmetic expression on the right-hand side:

y = 10
x = 3 * (1 + y)

result = x - 1 # answer is 32

Here it can only calculate arithmetic expressions, but if we were to design something more powerful, we could introduce more interesting operators to grab input from the console, etc. Since our aim is to explore a new parsing feature, this language will do.
First, we will write a parser that can parse an entire program in this language as a list of ASTs representing equations. Then we will make it failure-tolerant, so that when it cannot parse a particular equation, it does not stop but continues its work until all input is analyzed.
The parser is very easy to write. We will need the following imports:

{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE TypeFamilies     #-}

module Main where

import Control.Applicative (empty)
import Control.Monad (void)
import Data.Scientific (toRealFloat)
import Text.Megaparsec
import Text.Megaparsec.String
import Text.Megaparsec.Expr
import qualified Text.Megaparsec.Lexer as L

To represent the AST of our language we will use these definitions:

type Program = [Equation]

data Equation = Equation String Expr
  deriving (Eq, Show)

data Expr
  = Value Double
  | Reference String
  | Negation Expr
  | Sum Expr Expr
  | Subtraction Expr Expr
  | Multiplication Expr Expr
  | Division Expr Expr
  deriving (Eq, Show)
It’s obvious that a program in our language is a collection of equations, where every equation gives a name to an expression, which in turn can be simply a number, a reference to another equation, or some math involving those concepts.

As usual, the first thing that we need to handle when starting a parser is white space. We will have two space-consuming parsers:

scn—consumes newlines and white space in general. We will use it for white space between equations, which will start with a newline (since equations are newline-delimited).
sc—does not consume newlines and is used to define lexemes, i.e. things that automatically eat white space after them.

Here is what I’ve got:

lineComment :: Parser ()
lineComment = L.skipLineComment "#"

scn :: Parser ()
scn = L.space (void spaceChar) lineComment empty

sc :: Parser ()
sc = L.space (void $ oneOf " \t") lineComment empty

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

symbol :: String -> Parser String
symbol = L.symbol sc
Consult the Haddocks for descriptions of L.space, L.lexeme, and L.symbol. In short, L.space is a helper for quickly putting together a general-purpose space-consuming parser. We will follow this strategy: assume no white space before lexemes and consume all white space after lexemes. White space before the very first lexeme is the one case this doesn’t cover, but it will be dealt with specially, see below.
We also need a parser for equation names (x, y, and result in the first example). Like in many other programming languages, we will accept alphanumeric sequences that do not start with a number:

name :: Parser String
name = lexeme ((:) <$> letterChar <*> many alphaNumChar) <?> "name"
All too easy. Parsing of expressions could slow us down, but there is an out-of-the-box solution in the Text.Megaparsec.Expr module:

expr :: Parser Expr
expr = makeExprParser term table <?> "expression"

term :: Parser Expr
term = parens expr
  <|> (Reference <$> name)
  <|> (Value <$> number)

table :: [[Operator Parser Expr]]
table =
  [ [Prefix (Negation <$ symbol "-") ]
  , [ InfixL (Multiplication <$ symbol "*")
    , InfixL (Division <$ symbol "/") ]
  , [ InfixL (Sum <$ symbol "+")
    , InfixL (Subtraction <$ symbol "-") ]
  ]

number :: Parser Double
number = toRealFloat <$> lexeme L.number

parens :: Parser a -> Parser a
parens = between (symbol "(") (symbol ")")
We just wrote a fairly complete parser for expressions in our language! If you’re new to all this stuff, I suggest you load the code into GHCi and play with it a bit. Use the parseTest function to feed input into the parser:

λ> parseTest expr "5"
Value 5.0
λ> parseTest expr "5 + foo"
Sum (Value 5.0) (Reference "foo")
λ> parseTest expr "(x + y) * 5 + 7 * z"
Sum
  (Multiplication (Sum (Reference "x") (Reference "y")) (Value 5.0))
  (Multiplication (Value 7.0) (Reference "z"))

Power! The only things that remain are a parser for equations and a parser for the entire program:

equation :: Parser Equation
equation = Equation <$> (name <* symbol "=") <*> expr

prog :: Parser Program
prog = between scn eof (sepEndBy equation scn)
Note that we need to consume leading white space in prog manually, as described above. Try the prog parser—it’s a complete solution that can parse the language we described in the beginning. Parsing “end of file” (eof) explicitly makes the parser consume all input and fail loudly if it cannot; otherwise it would just stop on the first problematic token and return what it had parsed so far.
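If you run prog on the sample program from the introduction, you should see something like this (a sketch of a GHCi session; the exact Show formatting may differ slightly):

λ> parseTest prog "y = 10\nx = 3 * (1 + y)\n"
[Equation "y" (Value 10.0),Equation "x" (Multiplication (Value 3.0) (Sum (Value 1.0) (Reference "y")))]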
Our parser is really dandy: it has nice error messages and does its job well. However, every expression is clearly separated from the others by a newline. This separation makes it possible to analyze the expressions independently; even if one of them is malformed, we have no reason to stop and not check the others. In fact, that’s how some “serious” parsers work (C++ parsers in some compilers, for example). Reporting multiple parse errors at once may be a more efficient way to communicate with the programmer who needs to fix them than making them recompile the program every time just to reach the next error. In this section we will make our parser failure-tolerant and able to report multiple error messages at once.
Let’s add one more type synonym—RawData:

type RawData t e = [Either (ParseError t e) Equation]
This represents a collection of equations, just like Program, but every one of them may be malformed: in that case we get the original error message in Left; otherwise we have a properly parsed equation in Right.

You will be amazed at just how easy it is to add recovery to an existing parser:
rawData :: Parser (RawData Char Dec)
rawData = between scn eof (sepEndBy e scn)
  where e = withRecovery recover (Right <$> equation)
        recover err = Left err <$ manyTill anyChar eol

Let’s try it. Here is the input:

foo = (x $ y) * 5 + 7.2 * z
bar = 15
Result:

[ Left
    (ParseError
      { errorPos = SourcePos
          { sourceName = "", sourceLine = Pos 1
          , sourceColumn = Pos 10} :| []
      , errorUnexpected = fromList [Tokens ('$' :| "")]
      , errorExpected = fromList
          [ Tokens (')' :| "")
          , Label ('o' :| "perator")
          , Label ('r' :| "est of expression") ]
      , errorCustom = fromList [] })
, Right (Equation "bar" (Value 15.0)) ]
How does it work? The withRecovery r p primitive runs the parser p as usual, but if it fails, it takes its ParseError and provides it as an argument to r. In r you start right where p failed—no backtracking happens, because that would make it harder to find the position from which to resume normal parsing. Here you have a chance to consume some input to advance the parser’s textual position. In our case it’s as simple as eating all input up to the next newline, but it might be trickier.
You probably want to know what happens when the recovering parser r fails as well. The answer is: your parser fails as usual, as if no withRecovery primitive were used. It’s by design that the recovering parser cannot influence error messages in any way; otherwise it could lead to quite confusing error messages in some cases, depending on the logic of the recovering parser.
Now it’s up to you what to do with RawData. You can either take all the error messages and print them one by one, or ignore the errors altogether and filter out only the valid equations to work with.
When you want to use withRecovery, the main thing to remember is that the parts of text that you allow to fail should be clearly separated from each other, so the recovering parser can reliably skip to the next part if the current one cannot be parsed. In a language like Python, you could use indentation levels to tell high-level definitions apart, for example. In every case you should use your judgment and creativity to decide how to make use of withRecovery. In some cases it may not be worth it, but more often than not you will be able to improve the experience of people who work with your product by using this new Megaparsec feature.
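For example, here is one way to consume RawData (a sketch, not part of the tutorial’s source): split it into errors and equations with partitionEithers, report the errors, and keep the good equations.

import Data.Either (partitionEithers)

-- Print every parse error, then return the equations that did parse.
reportAndKeep :: RawData Char Dec -> IO Program
reportAndKeep raw = do
  let (errs, eqs) = partitionEithers raw
  mapM_ print errs
  return eqs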
(Psst! Looking for the source code for this tutorial? It’s here.)
Megaparsec 4.3.0 introduces new combinators that should be of use when you want to parse indentation-sensitive input. Megaparsec 5.0.0 adds support for line folds, completing the support for indentation-sensitive parsing. This tutorial shows how these new tools work and compose, and, hopefully, that they feel natural—something we cannot say about the ad-hoc solutions to this problem that exist as separate packages on top of Parsec, for example.
Since the first release of Megaparsec, there has been the indentGuard function, which is a great shortcut but a bit of a pain to use for complex tasks. So we won’t cover it here; instead we will talk about the new combinators built on top of it, available beginning with Megaparsec 4.3.0.

First, we have indentLevel, which is defined simply as:
indentLevel :: MonadParsec e s m => m Pos
indentLevel = sourceColumn <$> getPosition

That’s right, it’s just a shortcut, but I found myself using this idiom so often that I included it in the public lexer API.
Second, we have nonIndented. This allows you to make sure that some input is not indented. Just wrap a parser in nonIndented and you’re done.

nonIndented is trivial to write as well:
nonIndented :: MonadParsec e s m
  => m () -- ^ How to consume indentation (white space)
  -> m a  -- ^ How to parse actual data
  -> m a
nonIndented sc p = indentGuard sc EQ (unsafePos 1) *> p
However, it’s part of the logical model behind high-level parsing of indentation-sensitive input. We state that there are top-level items that are not indented (nonIndented helps to define parsers for them), and that all indented tokens are directly or indirectly “children” of those top-level definitions. In Megaparsec, we don’t need any additional state to express this. Since indentation is always relative, our idea is to explicitly tie parsers for “reference” tokens and indented tokens, thus defining an indentation-sensitive grammar via pure combination of parsers, just like all the other tools in Megaparsec work. This is different from the old solutions built on top of Parsec, where you had to deal with ad-hoc state. It’s also more robust and safer, because the less state you have, the better.
So, how do you define an indented block? Let’s take a look at the signature of the indentBlock helper:

indentBlock :: (MonadParsec e s m, Token s ~ Char)
  => m ()                -- ^ How to consume indentation (white space)
  -> m (IndentOpt m a b) -- ^ How to parse “reference” token
  -> m a
First, we specify how to consume indentation. An important thing to note here is that this space-consuming parser must consume newlines as well, while the tokens themselves (the “reference” token and the indented tokens) should not normally consume newlines after them.

As you can see, the second argument allows us to parse the “reference” token and return a data structure that tells indentBlock what to do next. There are several options:
data IndentOpt m a b
  = IndentNone a
    -- ^ Parse no indented tokens, just return the value
  | IndentMany (Maybe Int) ([b] -> m a) (m b)
    -- ^ Parse many indented tokens (possibly zero), use given indentation
    -- level (if 'Nothing', use level of the first indented token); the
    -- second argument tells how to get final result, and third argument
    -- describes how to parse an indented token
  | IndentSome (Maybe Int) ([b] -> m a) (m b)
    -- ^ Just like 'IndentMany', but requires at least one indented token to
    -- be present
We can change our mind and parse no indented tokens, we can parse many (that is, possibly zero) indented tokens, or we can require at least one such token. We can either let indentBlock detect the indentation level from the first indented token, or specify the indentation level manually. This should be flexible enough.

Now it’s time to put our new tools into practice. In this section, we will parse a simple indented list of items. Let’s begin with the import section:

{-# LANGUAGE TupleSections #-}

module Main (main) where

import Control.Applicative (empty)
import Control.Monad (void)
import Text.Megaparsec
import Text.Megaparsec.String
import qualified Text.Megaparsec.Lexer as L
We will need two kinds of space consumers: one that consumes newlines, scn, and one that doesn’t, sc (which here actually only parses spaces and tabs):
lineComment :: Parser ()
lineComment = L.skipLineComment "#"

scn :: Parser ()
scn = L.space (void spaceChar) lineComment empty

sc :: Parser ()
sc = L.space (void $ oneOf " \t") lineComment empty

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
Just for fun, we will also allow line comments that start with #.

Assuming pItemList parses the entire list, we can define the top-level parser as:

parser :: Parser (String, [String])
parser = pItemList <* eof

This will make it consume all input.
pItemList is a top-level form that is itself a combination of a “reference” token (the list header) and indented tokens (the list items), so:

pItemList :: Parser (String, [String]) -- header and list items
pItemList = L.nonIndented scn (L.indentBlock scn p)
  where
    p = do
      header <- pItem
      return (L.IndentMany Nothing (return . (header, )) pItem)
For our purposes, an item is a sequence of alphanumeric characters and dashes:

pItem :: Parser String
pItem = lexeme $ some (alphaNumChar <|> char '-')
Now load the code into GHCi and try it with the help of the parseTest built-in:

λ> parseTest parser ""
1:1:
unexpected end of input
expecting '-' or alphanumeric character
λ> parseTest parser "something"
("something",[])
λ> parseTest parser "  something"
1:3:
incorrect indentation (got 3, should be equal to 1)
λ> parseTest parser "something\none\ntwo\nthree"
2:1:
unexpected 'o'
expecting end of input
Remember that we’re using the IndentMany option, so empty lists are OK. On the other hand, the built-in combinator space has hidden the phrase “expecting more space” from error messages (usually you don’t want it because it adds noise to all messages), so this error message is perfectly reasonable.
Let’s continue:

λ> parseTest parser "something\n  one\n    two\n  three"
3:5:
incorrect indentation (got 5, should be equal to 3)
λ> parseTest parser "something\n  one\n  two\n three"
4:2:
incorrect indentation (got 2, should be equal to 3)
λ> parseTest parser "something\n  one\n  two\n  three"
("something",["one","two","three"])
This definitely seems to work. Let’s replace IndentMany with IndentSome and Nothing with Just 5 (indentation levels are counted from 1, so this will require 4 spaces before the indented items):
pItemList :: Parser (String, [String])
pItemList = L.nonIndented scn (L.indentBlock scn p)
  where
    p = do
      header <- pItem
      return (L.IndentSome (Just (unsafePos 5)) (return . (header, )) pItem)

Now:
λ> parseTest parser "something\n"
2:1:
incorrect indentation (got 1, should be greater than 1)
λ> parseTest parser "something\n  one"
2:3:
incorrect indentation (got 3, should be equal to 5)
λ> parseTest parser "something\n    one"
("something",["one"])
The first message may be a bit surprising, but Megaparsec knows that there must be at least one item in the list, so it checks the indentation level anyway; it’s 1, which is incorrect, so it reports it.

What I like about indentBlock is that another indentBlock can be put inside of it and the whole thing will work smoothly, parsing more complex input with several levels of indentation. No additional effort is required.

Let’s allow list items to have sub-items. For this we will need a new parser, pComplexItem (looks familiar…):
pComplexItem :: Parser (String, [String])
pComplexItem = L.indentBlock scn p
  where
    p = do
      header <- pItem
      return (L.IndentMany Nothing (return . (header, )) pItem)
A couple of edits to pItemList are needed (we’re now parsing more complex stuff, so we need to reflect this in the type signatures):
parser :: Parser (String, [(String, [String])])
parser = pItemList <* eof

pItemList :: Parser (String, [(String, [String])])
pItemList = L.nonIndented scn (L.indentBlock scn p)
  where
    p = do
      header <- pItem
      return (L.IndentSome Nothing (return . (header, )) pComplexItem)

If I feed something like this:
first-chapter
  paragraph-one
      note-A # an important note here!
      note-B
  paragraph-two
    note-1
    note-2
  paragraph-three

…into our parser, I get:

Right
  ( "first-chapter"
  , [ ("paragraph-one",   ["note-A","note-B"])
    , ("paragraph-two",   ["note-1","note-2"])
    , ("paragraph-three", []) ] )
-lineFold
helper is introduced in Megaparsec 5.0.0. A line fold consists of several elements that can be put on one line or on several lines as long as indentation level of subsequent items is greater than indentation level of the first item.
Let’s make use of lineFold
and add line folds to our program.
pComplexItem :: Parser (String, [String])
pComplexItem = L.indentBlock scn p
  where
    p = do
      header <- pItem
      return (L.IndentMany Nothing (return . (header, )) pLineFold)

pLineFold :: Parser String
pLineFold = L.lineFold scn $ \sc' ->
  let ps = some (alphaNumChar <|> char '-') `sepBy1` try sc'
  in unwords <$> ps <* sc

lineFold works like this: you give it a space consumer that accepts newlines, and it gives you a special space consumer that you can use in the callback to consume space between the elements of a line fold. An important detail is that you should use the normal space consumer at the end of the line fold, or your fold will have no end.
Playing with the final version of our parser is left as an exercise for the reader. You can create “items” that consist of multiple words, and as long as they are “line-folded” they will be parsed and concatenated with a single space between them.
Note that every sub-list behaves independently; you will see this if you try feeding the parser various kinds of malformed data. And this is no surprise, since no state is shared between different parts of the structure—it’s assembled purely from simpler parts, a sufficiently elegant solution in the spirit of the rest of the library.
(Psst! Looking for the source code for this tutorial? It’s here.)
This tutorial will present how to parse a subset of a simple imperative programming language called WHILE (introduced in the book “Principles of Program Analysis” by Nielson, Nielson and Hankin). It includes only a few statements and basic boolean/arithmetic expressions, which makes it nice material for a tutorial.

First let’s import the necessary modules:
module Main (main) where

import Control.Monad (void)
import Text.Megaparsec
import Text.Megaparsec.Expr
import Text.Megaparsec.String -- input stream is of the type ‘String’
import qualified Text.Megaparsec.Lexer as L
The grammar for expressions is defined as follows:

a   ::= x | n | - a | a opa a
b   ::= true | false | not b | b opb b | a opr a
opa ::= + | - | * | /
opb ::= and | or
opr ::= > | <

Note that we have three groups of operators: arithmetic, boolean, and relational.

And now the definition of statements:

S ::= x := a | skip | S1; S2 | ( S ) | if b then S1 else S2 | while b do S

We probably want to parse that into some internal representation of the language (an abstract syntax tree). Therefore we need to define the data structures for the expressions and statements.
We need to take care of boolean and arithmetic expressions and the appropriate operators. First let’s look at the boolean expressions:

data BExpr
  = BoolConst Bool
  | Not BExpr
  | BBinary BBinOp BExpr BExpr
  | RBinary RBinOp AExpr AExpr
  deriving (Show)

Binary boolean operators:

data BBinOp
  = And
  | Or
  deriving (Show)

Relational operators:

data RBinOp
  = Greater
  | Less
  deriving (Show)

Now we define the types for arithmetic expressions:

data AExpr
  = Var String
  | IntConst Integer
  | Neg AExpr
  | ABinary ABinOp AExpr AExpr
  deriving (Show)

And the arithmetic operators:

data ABinOp
  = Add
  | Subtract
  | Multiply
  | Divide
  deriving (Show)

Finally, let’s take care of the statements:

data Stmt
  = Seq [Stmt]
  | Assign String AExpr
  | If BExpr Stmt Stmt
  | While BExpr Stmt
  | Skip
  deriving (Show)
Having all the data structures, we can go on with writing the code to do the actual parsing. Here we will define the lexemes of our language. When writing a lexer for a language, it’s always important to define what counts as whitespace and how it should be consumed. The space combinator from the Text.Megaparsec.Lexer module can be helpful here:
sc :: Parser ()
sc = L.space (void spaceChar) lineCmnt blockCmnt
  where lineCmnt  = L.skipLineComment "//"
        blockCmnt = L.skipBlockComment "/*" "*/"
sc stands for “space consumer”. space takes three arguments: a parser that parses a single whitespace character, a parser for line comments, and a parser for block (multi-line) comments. skipLineComment and skipBlockComment help with quickly creating parsers to consume the comments. (If our language didn’t have block comments, we could pass empty from Control.Applicative as the third argument of space.)

Next, we will follow the strategy where whitespace is consumed after every lexeme automatically, but not before it. Let’s define a wrapper to achieve this:
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
Perfect. Now we can wrap any parser in lexeme and it will consume any trailing whitespace with sc.

Since we often want to parse some “fixed” string, let’s define one more parser called symbol. It takes a string as an argument, and parses this string plus the whitespace after it.
symbol :: String -> Parser String
symbol = L.symbol sc

With these tools we can create other useful parsers:
-- | 'parens' parses something between parenthesis.

parens :: Parser a -> Parser a
parens = between (symbol "(") (symbol ")")

-- | 'integer' parses an integer.

integer :: Parser Integer
integer = lexeme L.integer

-- | 'semi' parses a semicolon.

semi :: Parser String
semi = symbol ";"
Great. To parse various operators we can just use symbol, but reserved words and identifiers are a bit trickier. There are two things to note:

Parsers for reserved words should check that the parsed reserved word is not a prefix of an identifier.
Parsers for identifiers should check that the parsed identifier is not a reserved word.

Let’s express this in code:

rword :: String -> Parser ()
rword w = string w *> notFollowedBy alphaNumChar *> sc

rws :: [String] -- list of reserved words
rws = ["if","then","else","while","do","skip","true","false","not","and","or"]

identifier :: Parser String
identifier = (lexeme . try) (p >>= check)
  where
    p       = (:) <$> letterChar <*> many alphaNumChar
    check x = if x `elem` rws
                then fail $ "keyword " ++ show x ++ " cannot be an identifier"
                else return x
identifier may seem complex, but it’s actually simple. We just parse a sequence of characters where the first character is a letter and each of the rest can be either a letter or a digit. Once we have parsed such a string, we check whether it’s in the list of reserved words: we fail with an informative message if it is, and return the result otherwise.

Note the use of try in identifier. It is necessary in order to backtrack to the beginning of the identifier when fail is evaluated. Otherwise things like many identifier would fail on such identifiers instead of just stopping.

And that’s it: we have just written the lexer for our language, and now we can start writing the parser.
As already mentioned, a program in this language is simply a statement, so the main parser should basically only parse a statement. But remember to take care of initial whitespace—our parsers only get rid of whitespace after tokens!

whileParser :: Parser Stmt
whileParser = between sc eof stmt
Now, because any statement might actually be a sequence of statements separated by semicolons, we use sepBy1 to parse at least one statement. The result is a list of statements. We also allow grouping statements with parentheses, which is useful, for instance, in the while loop.

stmt :: Parser Stmt
stmt = parens stmt <|> stmtSeq

stmtSeq :: Parser Stmt
stmtSeq = f <$> sepBy1 stmt' semi
  -- if there's only one stmt return it without using ‘Seq’
  where f l = if length l == 1 then head l else Seq l
Now, a single statement is quite simple: it’s either an if conditional, a while loop, an assignment, or a skip statement. We use <|> to express choice. So a <|> b will first try parser a, and if it fails (without actually consuming any input), parser b will be used. Note: this means that the order is important.

stmt' :: Parser Stmt
stmt' = ifStmt <|> whileStmt <|> skipStmt <|> assignStmt
If you have a parser that might fail after consuming some input, and you still want to try the next parser, take a look at the try combinator. For instance, try p <|> q will attempt to parse with p, and if p fails, even after consuming input, q will be used as if nothing had been consumed by p.
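To make this concrete, here is a tiny hypothetical illustration (not part of the WHILE grammar): both alternatives share the prefix “let”, so the first one needs try; otherwise it would consume “let” before failing, and the second alternative would never match.

letOrLetrec :: Parser String
letOrLetrec = try (symbol "letrec") <|> symbol "let"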
Now let’s define the parsers for all the possible statements. This is quite straightforward: we just use the parsers from the lexer and then use all the necessary information to create the appropriate data structures.

ifStmt :: Parser Stmt
ifStmt = do
  rword "if"
  cond  <- bExpr
  rword "then"
  stmt1 <- stmt
  rword "else"
  stmt2 <- stmt
  return (If cond stmt1 stmt2)

whileStmt :: Parser Stmt
whileStmt = do
  rword "while"
  cond  <- bExpr
  rword "do"
  stmt1 <- stmt
  return (While cond stmt1)

assignStmt :: Parser Stmt
assignStmt = do
  var  <- identifier
  void (symbol ":=")
  expr <- aExpr
  return (Assign var expr)

skipStmt :: Parser Stmt
skipStmt = Skip <$ rword "skip"
What’s left is to parse the expressions. Fortunately Megaparsec provides an easy way to do that. Let’s define the arithmetic and boolean expressions:
aExpr :: Parser AExpr
aExpr = makeExprParser aTerm aOperators

bExpr :: Parser BExpr
bExpr = makeExprParser bTerm bOperators
Now we have to define lists of operators with their precedence, associativity, and the constructor to use in each case.
aOperators :: [[Operator Parser AExpr]]
aOperators =
  [ [ Prefix (Neg <$ symbol "-") ]
  , [ InfixL (ABinary Multiply <$ symbol "*")
    , InfixL (ABinary Divide   <$ symbol "/") ]
  , [ InfixL (ABinary Add      <$ symbol "+")
    , InfixL (ABinary Subtract <$ symbol "-") ]
  ]

bOperators :: [[Operator Parser BExpr]]
bOperators =
  [ [ Prefix (Not <$ rword "not") ]
  , [ InfixL (BBinary And <$ rword "and")
    , InfixL (BBinary Or  <$ rword "or") ]
  ]
In the case of prefix operators it is enough to specify which symbol should be parsed and the associated data constructor. Infix operators are defined similarly, but there are several infix constructors for the various associativity options. Note that operator precedence depends only on the order of the elements in the list.
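For intuition, here is roughly how the tables above shape a parse: prefix negation binds tightest, then multiplication, then addition (output shown assuming the AST derives Show):

λ> parseTest aExpr "-2 * 3 + 4"
ABinary Add (ABinary Multiply (Neg (IntConst 2)) (IntConst 3)) (IntConst 4)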
Finally we have to define the terms. In the case of arithmetic expressions, it is quite simple:
aTerm :: Parser AExpr
aTerm = parens aExpr
  <|> Var      <$> identifier
  <|> IntConst <$> integer
However, a term in a boolean expression is a bit trickier. Here a term can also be a relational expression that compares two arithmetic expressions.
bTerm :: Parser BExpr
bTerm = parens bExpr
  <|> (rword "true"  *> pure (BoolConst True))
  <|> (rword "false" *> pure (BoolConst False))
  <|> rExpr
Therefore we have to define a parser for relational expressions:
rExpr :: Parser BExpr
rExpr = do
  a1 <- aExpr
  op <- relation
  a2 <- aExpr
  return (RBinary op a1 a2)

relation :: Parser RBinOp
relation = (symbol ">" *> pure Greater)
  <|> (symbol "<" *> pure Less)
And that's it. We have a fairly simple parser that can parse statements and arithmetic/boolean expressions.
If you want to experiment with the parser inside GHCi, the parseTest function might be handy: parseTest p input applies parser p to input and prints the result. A usage sketch follows the link below.
Original Parsec tutorial in Haskell Wiki:
https://wiki.haskell.org/Parsing_a_simple_imperative_language
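As promised, a usage sketch (output shown assuming the AST derives Show):

λ> parseTest whileParser "while true do skip"
While (BoolConst True) Skip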
Some progressive Haskell hackers may wish to switch from Parsec to Megaparsec. This tutorial explains the practical differences between the two libraries that you will need to address if you choose to undertake the switch. Remember, all the functionality available in Parsec is available in Megaparsec and often in a better form.
You'll mainly need to replace the "Parsec" part in your imports with "Megaparsec". That's pretty simple. A typical import section of a module that uses Megaparsec looks like this:
-- this module contains commonly useful tools:
import Text.Megaparsec
-- this module depends on the type of data you want to parse, you only
-- need to import one of these:
import Text.Megaparsec.String          -- if you parse ‘String’
import Text.Megaparsec.ByteString      -- if you parse strict ‘ByteString’
import Text.Megaparsec.ByteString.Lazy -- if you parse lazy ‘ByteString’
import Text.Megaparsec.Text            -- if you parse strict ‘Text’
import Text.Megaparsec.Text.Lazy       -- if you parse lazy ‘Text’
-- if you need to parse permutation phrases:
import Text.Megaparsec.Perm
-- if you need to parse expressions:
import Text.Megaparsec.Expr
-- if you need to parse languages:
import qualified Text.Megaparsec.Lexer as L
So, the only noticeable difference is that Megaparsec has no Text.Megaparsec.Token module; it is replaced by Text.Megaparsec.Lexer, see the section "What happened to Text.Parsec.Token?".
Megaparsec introduces a more consistent naming scheme, so some things are called differently, but renaming functions is an easy, mechanical task. Here are the renamed items:
many1 → some (re-exported from Control.Applicative)
skipMany1 → skipSome
tokenPrim → token
optionMaybe → optional (re-exported from Control.Applicative)
permute → makePermParser
buildExpressionParser → makeExprParser
Character parsing:
alphaNum → alphaNumChar
digit → digitChar
endOfLine → eol
hexDigit → hexDigitChar
letter → letterChar
lower → lowerChar
octDigit → octDigitChar
space → spaceChar †
spaces → space †
upper → upperChar
† marks entries to pay attention to: the new space parses many spaceChars, including zero, so if you write something like many space, your parser will hang. Be careful to replace many space with either many spaceChar or space.
Parsec also has many names for the same or similar things. Megaparsec usually has one function per task that does its job well. Here are the items that were removed in Megaparsec and the reasons for their removal:
parseFromFile: reading a file and then parsing its contents is trivial for every instance of Stream, and this function provided no way to use newer methods for running a parser, such as runParser'.
getState, putState, modifyState: ad-hoc backtracking user state has been eliminated.
unexpected, token, and tokens: slightly different versions of these functions now exist under the same names.
Reply and Consumed are not public data types anymore, because they are low-level implementation details.
runPT and runP were essentially synonyms for runParserT and runParser respectively.
chainl, chainl1, chainr, and chainr1: use Text.Megaparsec.Expr instead.
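A hedged sketch of that migration (Expr, Add, and term are hypothetical stand-ins for your own AST and term parser):

-- Parsec:     expr = term `chainl1` (Add <$ symbol "+")
-- Megaparsec:
expr :: Parser Expr
expr = makeExprParser term [[InfixL (Add <$ symbol "+")]]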
In Megaparsec 5 the modules Text.Megaparsec.Pos and Text.Megaparsec.Error are completely different from those found in Parsec and Megaparsec 4. Take some time to look at their documentation if your use case requires operations on error messages or positions. You may like the fact that error messages are now well-typed and extensible.
The Stream type class now has an updatePos method that gives precise control over how textual positions advance during parsing.
Note that the argument order of label has been flipped (the label itself goes first now), so you can now write: myParser = label "my parser" $ ….
Don't use the label "" (or the … <?> "") idiom to "hide" some "expected" tokens from error messages; use hidden instead.
The new token parser is more powerful: its first argument provides full control over the reported error message, while its second argument specifies how to report a missing token when the input stream is empty.
The tokens parser now lets you control how tokens are compared (yes, there is a case-insensitive version of string called string').
The unexpected parser lets you specify precisely what is unexpected, in a well-typed manner.
Tab width is not hard-coded anymore; use getTabWidth and setTabWidth to change it. The default tab width is defaultTabWidth.
You can now reliably test error messages: equality for them is defined properly (in Parsec, Expect "foo" is equal to Expect "bar"), and error messages are well-typed and customizable.
To render an error message, apply parseErrorPretty to it.
count' m n p allows you to parse from m to n occurrences of p.
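For instance, a parser for one to three digits might look like this (a small sketch; the name is illustrative):

shortNumber :: Parser String
shortNumber = count' 1 3 digitChar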
You now have someTill and eitherP out of the box.
tokens-based combinators like string and string' backtrack by default, so it's not necessary to use try with them (beginning with version 4.4.0). This feature does not affect performance.
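So the following works without try even though the alternatives share a prefix (a small sketch; the names are illustrative):

letOrLetter :: Parser String
letOrLetter = string "letter" <|> string "let"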
The new failure combinator allows you to fail with an arbitrary error message; it even lets you use your own data types.
New character parsers in Text.Megaparsec.Char may be useful if you work with Unicode:
asciiChar
charCategory
controlChar
latin1Char
markChar
numberChar
printChar
punctuationChar
separatorChar
symbolChar
Ever wanted to have case-insensitive character parsers? Here you go:
char'
oneOf'
noneOf'
string'
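For example (a small sketch), a keyword parser that accepts any capitalization:

kwSelect :: Parser String
kwSelect = string' "select" -- matches "select", "SELECT", "Select", …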
makeExprParser has a flipped order of arguments: the term parser goes first, the operator table second. To specify the associativity of infix operators you use one of three Operator constructors:
InfixN: non-associative infix
InfixL: left-associative infix
InfixR: right-associative infix
What happened to Text.Parsec.Token?
That module was extremely inflexible and thus it has been eliminated. In Megaparsec you have Text.Megaparsec.Lexer instead, which doesn't impose anything on the user but provides useful helpers. The module can also parse indentation-sensitive languages.
Let's quickly describe how to go about writing your lexer with Text.Megaparsec.Lexer. First, import the module qualified; we will use L as its synonym here.
Start writing your lexer by defining what counts as white space in your language. The space, skipLineComment, and skipBlockComment helpers can be useful:
sc :: Parser () -- ‘sc’ stands for “space consumer”
sc = L.space (void spaceChar) lineComment blockComment
  where lineComment  = L.skipLineComment "//"
        blockComment = L.skipBlockComment "/*" "*/"
This is generally called the space consumer. Often you'll need only one space consumer, but you can define as many as you want. Note that this new module allows you to avoid consuming newline characters automatically: just pass something other than void spaceChar as the first argument of space. Even better, you can control what white space is on a per-lexeme basis:
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

symbol :: String -> Parser String
symbol = L.symbol sc
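For example, here is a hedged sketch of a second space consumer that leaves newlines alone, which is handy when newlines are significant (scInline is a hypothetical name):

scInline :: Parser ()
scInline = L.space (void (oneOf " \t")) lineComment blockComment
  where lineComment  = L.skipLineComment "//"
        blockComment = L.skipBlockComment "/*" "*/"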
Note that all tools in Megaparsec work with any instance of MonadParsec. All commonly useful monad transformers like StateT and WriterT are instances of MonadParsec out of the box. For example, suppose you want to collect the contents of comments (say, they are documentation strings of a sort): you may want backtracking user state where you put the last encountered comment satisfying some criteria, and then, when you parse a function definition, you can check the state and attach the doc-string to the parsed function. It's all possible and easy with Megaparsec:
import Control.Monad.State.Lazy

…

type MyParser = StateT String Parser

skipLineComment' :: MyParser ()
skipLineComment' = …

skipBlockComment' :: MyParser ()
skipBlockComment' = …

sc :: MyParser ()
sc = L.space (void spaceChar) skipLineComment' skipBlockComment'
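To run such a stack, unwrap the transformers from the outside in; a minimal sketch (p, the file name, and the input are placeholders):

-- runStateT turns ‘StateT String Parser a’ back into ‘Parser (a, String)’,
-- which runParser can execute as usual:
runMyParser p input = runParser (runStateT p "") "<input>" input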
Parsing an indentation-sensitive language deserves its own tutorial, but let's take a look at the basic tools upon which you can build. First of all, you should work with a space consumer that doesn't eat newlines automatically, which means you'll need to pick them up manually.
The main helper is called indentGuard. It takes a parser that will be used to consume white space (indentation) and a predicate of type Int -> Bool. If, after running the given parser, the column number does not satisfy the given predicate, the parser fails with the message "incorrect indentation"; otherwise it returns the current column number.
In simple cases you can explicitly pass around the value returned by indentGuard, i.e. the current level of indentation. If you prefer to preserve some sort of state, you can get backtracking state by combining StateT and ParsecT, like this:
StateT Int Parser a
Here we have state of type Int. You can use get and put as usual, although it may be better to write a modified version of indentGuard that gets the current indentation level (the indentation level on the previous line), then consumes the indentation of the current line, performs the necessary checks, and puts the new level of indentation.
Later update: we now have full support for indentation-sensitive parsing; see nonIndented, indentBlock, and lineFold in the Text.Megaparsec.Lexer module.
Parsing of string and character literals is done a bit differently than in Parsec. You have a single helper, charLiteral, which parses a character literal. It does not parse the surrounding quotes, because different languages may quote character literals differently. The purpose of this parser is to help with parsing conventional escape sequences (the literal character is parsed according to the rules defined in the Haskell report).
charLiteral :: Parser Char
charLiteral = char '\'' *> L.charLiteral <* char '\''
Use charLiteral to parse string literals as well. Here is a simplified version that will accept plain (unescaped) newlines in string literals (making it conform to Haskell syntax is easy and left as an exercise for the reader):
stringLiteral :: Parser String
stringLiteral = char '"' >> manyTill L.charLiteral (char '"')
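A quick check (the input contains a literal backslash-n, which charLiteral decodes into a real newline):

λ> parseTest stringLiteral "\"one\\ntwo\""
"one\ntwo"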
Note that in charLiteral we use the built-in support for parsing all the tricky combinations of characters. Parsec, on the other hand, re-implements the whole thing. Given that it mostly has no tests at all, I cannot tell for sure that it works.
Parsing of numbers is easy:
integer :: Parser Integer
integer = lexeme L.integer

float :: Parser Double
float = lexeme L.float

number :: Parser Scientific
number = lexeme L.number -- similar to ‘naturalOrFloat’ in Parsec
Note that Megaparsec internally uses the standard Haskell functions to parse floating point numbers, thus no precision loss is possible (and it's tested). Parsec, on the other hand, again re-implements the whole thing. The approach taken by the Parsec authors is to parse the components one by one and then re-create the floating point number by means of floating point arithmetic. This cannot produce exact results; correct parsing of floating point numbers requires bit-level manipulation (it's usually done at the OS level, in C libraries). Of course the results produced by Parsec's built-in parser for floating point numbers are incorrect. This is a known bug now, but it took a long time until we "discovered" it, because, again, Parsec has no test suite. (Update: it took one year, but Parsec's maintainer has recently merged a pull request that seems to fix this, and released Parsec 3.1.11.)
Hexadecimal and octal numbers do not parse the "0x" or "0o" prefixes, because different languages may have other prefixes for this sort of number. We parse the prefixes manually:
hexadecimal :: Parser Integer
hexadecimal = lexeme $ char '0' >> char' 'x' >> L.hexadecimal

octal :: Parser Integer
octal = lexeme $ char '0' >> char' 'o' >> L.octal
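A quick check (char' makes the "x" case-insensitive):

λ> parseTest hexadecimal "0xFF"
255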
Since the Haskell report says nothing about a sign in numeric literals, basic parsers like integer do not parse a sign. You can easily create parsers for signed numbers with the help of signed:
signedInteger :: Parser Integer
signedInteger = L.signed sc integer

signedFloat :: Parser Double
signedFloat = L.signed sc float

signedNumber :: Parser Scientific
signedNumber = L.signed sc number
And that's it: shiny and new, Text.Megaparsec.Lexer is at your service. Now you can implement anything you want without having to copy and edit the entire Text.Parsec.Token module (people had to do that sometimes, you know).
The changes you may want to perform may be more fundamental than those described here. For example, you may previously have had to use a workaround because Text.Parsec.Token was not sufficiently flexible. Now you can replace it with a proper solution. If you want to use the full potential of Megaparsec, take the time to read about its features; they can help you improve your parsers.
If the performance of your Megaparsec parser is worse than you hoped, there may be ways to improve it. This short guide describes what to attempt, but you should always check whether you're actually getting better results by profiling and benchmarking your parsers; that's the only way to know whether a tuning change is the right thing.
If your parser uses a monad stack instead of the plain Parsec monad (which is a monad transformer over Identity too, but a much more lightweight one), make sure you use at least version 0.5 of the transformers library and at least version 5.0 of megaparsec. Both libraries have critical performance improvements in those versions, so you can get better performance for free.
The Parsec monad will always be faster than ParsecT-based monad transformers. Avoid using StateT, WriterT, and other monad transformers unless absolutely necessary. With a relatively simple monad stack, for example just StateT and nothing more, the performance of a Megaparsec parser will be on par with Parsec. The more you add to the stack, the slower it will be.
The most expensive operation is backtracking (you enable it with try, and it happens automatically with tokens-based parsers). Avoid building long chains of alternatives where every alternative can go deep into the input before failing.
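One common remedy is left-factoring: parse the shared prefix once and branch afterwards, so no alternative has to re-scan it. A hedged sketch (forStmt, forEach, and forRange are hypothetical):

-- instead of: try (rword "for" *> forEach) <|> (rword "for" *> forRange)
forStmt :: Parser Stmt
forStmt = rword "for" *> (forEach <|> forRange)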
Inline generously (when it makes sense, of course). You may not believe your eyes when you see how much of a difference inlining can make, especially for short functions. This is especially true for parsers that are defined in one module and used in another, because the INLINE and INLINEABLE pragmas make GHC dump function definitions into an interface file, which facilitates specialization (I've written a tutorial about this, available here).
The same parser can be written in many ways. Think about your grammar and how parsing happens; once you get some experience with this process, it will be much easier to see how to make your parser faster. Sometimes, however, making a parser faster will also make your code less readable. If the performance of your parser is not a bottleneck in the system you are building, consider preferring readability over performance.