mirror of
https://github.com/mrkkrp/megaparsec.git
synced 2024-12-23 00:01:45 +03:00
Now tutorials are hosted on my personal site
parent
1cd4e2f9e6
commit
df7a3e35f8
67
404.html
@ -1,67 +0,0 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<meta name="description" content />
|
||||
<meta name="author" content />
|
||||
<title>Megaparsec | Page not found</title>
|
||||
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
|
||||
<link rel="stylesheet" type="text/css" href="./css/megaparsec.css" />
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="navbar navbar-default navbar-static-top" role="navigation">
|
||||
<div class="container-fluid">
|
||||
<div class="navbar-header">
|
||||
<a class="navbar-brand" href="./">
|
||||
Megaparsec
|
||||
</a>
|
||||
</div>
|
||||
<div class="navbar-right">
|
||||
<ul class="nav navbar-nav">
|
||||
|
||||
<li>
|
||||
<a href="./tutorials.html">Tutorials</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://hackage.haskell.org/package/megaparsec">Hackage</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec">GitHub</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec-site">Edit the site</a>
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="container">
|
||||
<div class="row-fluid">
|
||||
<div class="col-sm-10 col-sm-offset-1 col-md-8 col-md-offset-2 main">
|
||||
<div class="page-header">
|
||||
<h1>Page not found
|
||||
|
||||
</h1>
|
||||
<hr />
|
||||
<div class="content">
|
||||
<p>The page you were looking for does not exist.</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.min.js"></script>
|
||||
<script src="./js/put-anchors.js"></script>
|
||||
</body>
|
||||
</html>
|
@ -1 +0,0 @@
|
||||
body{padding-bottom:60px}blockquote{font-size:16px}.content{font-size:16px;text-align:justify;word-wrap:break-word}code{color:#4070a0}table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode, table.sourceCode pre{margin:0;padding:0;border:0;vertical-align:baseline;border:none}td.lineNumbers{border-right:1px solid #AAAAAA;text-align:right;color:#AAAAAA;padding-right:5px;padding-left:5px}td.sourceCode{padding-left:5px}.sourceCode span.kw{color:#007020;font-weight:bold}.sourceCode span.dt{color:#902000}.sourceCode span.dv{color:#40a070}.sourceCode span.bn{color:#40a070}.sourceCode span.fl{color:#40a070}.sourceCode span.ch{color:#4070a0}.sourceCode span.st{color:#4070a0}.sourceCode span.co{color:#60a0b0;font-style:italic}.sourceCode span.ot{color:#007020}.sourceCode span.al{color:red;font-weight:bold}.sourceCode span.fu{color:#06287e}.sourceCode span.re{}.sourceCode span.er{color:red;font-weight:bold}
|
234
index.html
@ -2,237 +2,7 @@
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<meta name="description" content />
|
||||
<meta name="author" content />
|
||||
<title>Megaparsec | Megaparsec</title>
|
||||
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
|
||||
<link rel="stylesheet" type="text/css" href="./css/megaparsec.css" />
|
||||
<meta http-equiv="refresh" content="0; url=https://markkarpov.com/learn-haskell.html#megaparsec-tutorials">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="navbar navbar-default navbar-static-top" role="navigation">
|
||||
<div class="container-fluid">
|
||||
<div class="navbar-header">
|
||||
<a class="navbar-brand" href="./">
|
||||
Megaparsec
|
||||
</a>
|
||||
</div>
|
||||
<div class="navbar-right">
|
||||
<ul class="nav navbar-nav">
|
||||
|
||||
<li>
|
||||
<a href="./tutorials.html">Tutorials</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://hackage.haskell.org/package/megaparsec">Hackage</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec">GitHub</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec-site">Edit the site</a>
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="container">
|
||||
<div class="row-fluid">
|
||||
<div class="col-sm-10 col-sm-offset-1 col-md-8 col-md-offset-2 main">
|
||||
<div class="page-header">
|
||||
<h1>Megaparsec
|
||||
|
||||
<br />
|
||||
<small>
|
||||
Industrial-strength monadic parser combinator library in Haskell
|
||||
</small>
|
||||
|
||||
</h1>
|
||||
<hr />
|
||||
<div class="content">
|
||||
<p><a href="http://opensource.org/licenses/BSD-2-Clause"><img src="https://img.shields.io/badge/license-FreeBSD-brightgreen.svg" alt="License FreeBSD" /></a> <a href="https://hackage.haskell.org/package/megaparsec"><img src="https://img.shields.io/hackage/v/megaparsec.svg?style=flat" alt="Hackage" /></a> <a href="http://stackage.org/nightly/package/megaparsec"><img src="http://stackage.org/package/megaparsec/badge/nightly" alt="Stackage Nightly" /></a> <a href="http://stackage.org/lts/package/megaparsec"><img src="http://stackage.org/package/megaparsec/badge/lts" alt="Stackage LTS" /></a> <a href="https://travis-ci.org/mrkkrp/megaparsec"><img src="https://travis-ci.org/mrkkrp/megaparsec.svg?branch=master" alt="Build Status" /></a> <a href="https://coveralls.io/github/mrkkrp/megaparsec?branch=master"><img src="https://coveralls.io/repos/mrkkrp/megaparsec/badge.svg?branch=master&service=github" alt="Coverage Status" /></a></p>
|
||||
<ul>
|
||||
<li><a href="#features">Features</a>
|
||||
<ul>
|
||||
<li><a href="#core-features">Core features</a></li>
|
||||
<li><a href="#error-messages">Error messages</a></li>
|
||||
<li><a href="#alex-and-happy-support">Alex and Happy support</a></li>
|
||||
<li><a href="#character-parsing">Character parsing</a></li>
|
||||
<li><a href="#permutation-parsing">Permutation parsing</a></li>
|
||||
<li><a href="#expression-parsing">Expression parsing</a></li>
|
||||
<li><a href="#lexer">Lexer</a></li>
|
||||
</ul></li>
|
||||
<li><a href="#documentation">Documentation</a></li>
|
||||
<li><a href="#tutorials">Tutorials</a></li>
|
||||
<li><a href="#performance">Performance</a></li>
|
||||
<li><a href="#comparison-with-other-solutions">Comparison with other solutions</a>
|
||||
<ul>
|
||||
<li><a href="#megaparsec-vs-attoparsec">Megaparsec vs Attoparsec</a></li>
|
||||
<li><a href="#megaparsec-vs-parsec">Megaparsec vs Parsec</a></li>
|
||||
<li><a href="#megaparsec-vs-trifecta">Megaparsec vs Trifecta</a></li>
|
||||
<li><a href="#megaparsec-vs-earley">Megaparsec vs Earley</a></li>
|
||||
<li><a href="#megaparsec-vs-parsers">Megaparsec vs Parsers</a></li>
|
||||
</ul></li>
|
||||
<li><a href="#related-packages">Related packages</a></li>
|
||||
<li><a href="#links-to-announcements">Links to announcements</a></li>
|
||||
<li><a href="#authors">Authors</a></li>
|
||||
<li><a href="#contribution">Contribution</a></li>
|
||||
<li><a href="#license">License</a></li>
|
||||
</ul>
|
||||
<p>This is an industrial-strength monadic parser combinator library. Megaparsec is a fork of <a href="https://github.com/aslatter/parsec">Parsec</a> library originally written by Daan Leijen.</p>
|
||||
<h2 id="features">Features</h2>
|
||||
<p>This project provides flexible solutions to satisfy common parsing needs. This section describes them briefly. If you’re looking for comprehensive documentation, see the <a href="#documentation">section about documentation</a>.</p>
|
||||
<h3 id="core-features">Core features</h3>
|
||||
<p>The package is built around <code>MonadParsec</code>, an MTL-style type class. All tools and features work with all instances of <code>MonadParsec</code>. You can achieve various effects by combining monad transformers, i.e. by building a monad stack. Since common monad transformers like <code>WriterT</code>, <code>StateT</code>, <code>ReaderT</code> and others are instances of the <code>MonadParsec</code> type class, you can wrap <code>ParsecT</code> <em>in</em> these monads, achieving, for example, backtracking state.</p>
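<p>As a rough illustration of the backtracking state mentioned above, the sketch below wraps a parser in <code>StateT</code>. It assumes a recent Megaparsec (version 7 or later, where <code>Void</code> is the usual error component rather than <code>Dec</code>) and the <code>mtl</code> package; the names <code>Parser</code>, <code>counted</code>, and <code>run</code> are hypothetical and not part of the library.</p>

```haskell
import Control.Applicative ((<|>))
import Control.Monad.State.Strict (StateT, evalStateT, get, modify)
import Data.Void (Void)
import Text.Megaparsec (Parsec, parseMaybe, try)
import Text.Megaparsec.Char (char)

-- Hypothetical stack: StateT sits on top of the parser, so the state
-- is restored whenever the parser backtracks.
type Parser = StateT Int (Parsec Void String)

-- The first branch bumps the counter and then expects "ab".
-- If it fails, the second branch runs with the original counter.
counted :: Parser Int
counted =
  try (modify (+ 1) *> char 'a' *> char 'b' *> get)
    <|> (char 'a' *> get)

-- Run the parser with the counter starting at 0.
run :: String -> Maybe Int
run = parseMaybe (evalStateT counted 0)
```

<p>Under these assumptions, <code>run "ab"</code> should yield <code>Just 1</code>, while <code>run "a"</code> should yield <code>Just 0</code>, because the increment made in the failed first branch is discarded.</p>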
|
||||
<p>On the other hand, <code>ParsecT</code> is an instance of many type classes as well. The most useful ones are <code>Monad</code>, <code>Applicative</code>, <code>Alternative</code>, and <code>MonadParsec</code>.</p>
|
||||
<p>The module <a href="https://hackage.haskell.org/package/megaparsec/docs/Text-Megaparsec-Combinator.html"><code>Text.Megaparsec.Combinator</code></a> (its functions are included in <code>Text.Megaparsec</code>) contains traditional, general combinators that work with instances of <code>Applicative</code> and <code>Alternative</code>.</p>
|
||||
<p>Let’s enumerate the methods of the <code>MonadParsec</code> type class. The class abstracts the primitive functions of Megaparsec parsing. The rest of the library is built by combining these primitives:</p>
|
||||
<ul>
|
||||
<li><p><code>failure</code> lets a parser fail while reporting an arbitrary parse error.</p></li>
|
||||
<li><p><code>label</code> adds a “label” to a parser, so that if it fails the user sees the label instead of an automatically deduced expected token.</p></li>
|
||||
<li><p><code>hidden</code> hides a parser from error messages altogether. This is the recommended way to hide things; prefer it to the <code>label ""</code> approach.</p></li>
|
||||
<li><p><code>try</code> enables backtracking in parsing.</p></li>
|
||||
<li><p><code>lookAhead</code> parses input without consuming it.</p></li>
|
||||
<li><p><code>notFollowedBy</code> succeeds when its argument fails and does not consume input.</p></li>
|
||||
<li><p><code>withRecovery</code> lets a parser recover from parse errors “on-the-fly” and continue parsing. Once parsing is finished, several parse errors may be reported or ignored altogether.</p></li>
|
||||
<li><p><code>observing</code> lets you “observe” parse errors without ending parsing (they are returned in <code>Left</code>, while normal results are wrapped in <code>Right</code>).</p></li>
|
||||
<li><p><code>eof</code> only succeeds at the end of input.</p></li>
|
||||
<li><p><code>token</code> is used to parse a single token.</p></li>
|
||||
<li><p><code>tokens</code> makes it easy to parse several tokens in a row.</p></li>
|
||||
<li><p><code>getParserState</code> returns the full parser state.</p></li>
|
||||
<li><p><code>updateParserState</code> applies a given function to the parser state.</p></li>
|
||||
</ul>
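<p>A few of the primitives listed above can be combined as in the following sketch. It is written against a recent Megaparsec (version 7 or later is assumed, so the error component is <code>Void</code> rather than <code>Dec</code>); <code>Parser</code>, <code>keyword</code>, <code>identifier</code>, and <code>peekThenParse</code> are hypothetical names used only for illustration.</p>

```haskell
import Control.Applicative (some)
import Data.Void (Void)
import Text.Megaparsec
  (Parsec, eof, label, lookAhead, notFollowedBy, parseMaybe, try)
import Text.Megaparsec.Char (alphaNumChar, letterChar, string)

-- Hypothetical parser type: no custom errors, String input.
type Parser = Parsec Void String

-- Match a whole keyword: `try` makes the match atomic (backtracking on
-- failure), `notFollowedBy` rejects inputs such as "letter" for "let",
-- and `label` names the parser in error messages.
keyword :: String -> Parser String
keyword w =
  label ("keyword " ++ w) . try $ string w <* notFollowedBy alphaNumChar

-- One or more letters, reported as "identifier" on failure.
identifier :: Parser String
identifier = label "identifier" (some letterChar)

-- `lookAhead` inspects the first letter without consuming it, so
-- `identifier` still sees the whole word; `eof` demands end of input.
peekThenParse :: Parser (Char, String)
peekThenParse = do
  c <- lookAhead letterChar
  s <- identifier
  eof
  pure (c, s)
```

<p>With these definitions, <code>parseMaybe (keyword "let") "letter"</code> is expected to fail because a letter follows the keyword, while <code>parseMaybe peekThenParse "abc"</code> should return both the peeked character and the full identifier.</p>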
|
||||
<p>This list of core functions is longer than in some other libraries. Our goal is efficient, readable implementations and rich functionality, not a minimal number of primitive combinators. You can read a comprehensive description of every primitive function in the <a href="https://hackage.haskell.org/package/megaparsec/docs/Text-Megaparsec-Prim.html">Megaparsec documentation</a>.</p>
|
||||
<p>Megaparsec can currently work with the following types of input stream out-of-the-box:</p>
|
||||
<ul>
|
||||
<li><code>String</code> = <code>[Char]</code></li>
|
||||
<li><code>ByteString</code> (strict and lazy)</li>
|
||||
<li><code>Text</code> (strict and lazy)</li>
|
||||
</ul>
|
||||
<p>It’s also simple to make it work with custom token streams, and Megaparsec users have done so many times with great success.</p>
|
||||
<h3 id="error-messages">Error messages</h3>
|
||||
<p>Megaparsec 5 introduces well-typed error messages and the ability to use custom data types to adjust the library to a specific domain of interest. There is no need to use a shapeless bunch of strings anymore.</p>
|
||||
<p>The default error component (<code>Dec</code>) has constructors corresponding to the <code>fail</code> function and indentation-related error messages. It is a decent option that should work out of the box for most parsing needs, and you are free to use your own custom error component when necessary.</p>
|
||||
<p>This new design allowed Megaparsec 5 to have much more helpful error messages for indentation-sensitive parsing instead of the plain “incorrect indentation” phrase.</p>
|
||||
<h3 id="alex-and-happy-support">Alex and Happy support</h3>
|
||||
<p>Megaparsec works well with streams of tokens produced by tools like Alex/Happy. Megaparsec 5 adds the <code>updatePos</code> method to the <code>Stream</code> type class, which gives you full control over the textual positions used to report token positions in error messages. You can update the current position on a per-character basis or extract it from a token.</p>
|
||||
<h3 id="character-parsing">Character parsing</h3>
|
||||
<p>Megaparsec has decent support for Unicode-aware character parsing. Functions for character parsing live in the <a href="https://hackage.haskell.org/package/megaparsec/docs/Text-Megaparsec-Char.html"><code>Text.Megaparsec.Char</code></a> module (they all are included in <code>Text.Megaparsec</code>). The functions can be divided into several categories:</p>
|
||||
<ul>
|
||||
<li><p><em>Simple parsers</em> parse a certain character or several characters of the same kind. This includes <code>newline</code>, <code>crlf</code>, <code>eol</code>, <code>tab</code>, and <code>space</code>.</p></li>
|
||||
<li><p><em>Parsers corresponding to categories of characters</em> parse a single character that belongs to a certain category of characters, for example: <code>controlChar</code>, <code>spaceChar</code>, <code>upperChar</code>, <code>lowerChar</code>, <code>printChar</code>, <code>digitChar</code>, and others.</p></li>
|
||||
<li><p><em>General parsers</em> allow you to parse a single character you specify, one of the given characters, any character except for the given ones, or a character satisfying a given predicate. Case-insensitive versions of the parsers are available.</p></li>
|
||||
<li><p><em>Parsers for sequences of characters</em> parse strings. A case-sensitive <code>string</code> parser is available, as well as a case-insensitive <code>string'</code>.</p></li>
|
||||
</ul>
|
||||
<h3 id="permutation-parsing">Permutation parsing</h3>
|
||||
<p>For those who are interested in parsing permutation phrases, there is <a href="https://hackage.haskell.org/package/megaparsec/docs/Text-Megaparsec-Perm.html"><code>Text.Megaparsec.Perm</code></a>. You have to import the module explicitly; it’s not included in the <code>Text.Megaparsec</code> module.</p>
|
||||
<h3 id="expression-parsing">Expression parsing</h3>
|
||||
<p>Megaparsec has a solution for parsing expressions. Take a look at <a href="https://hackage.haskell.org/package/megaparsec/docs/Text-Megaparsec-Expr.html"><code>Text.Megaparsec.Expr</code></a>. You have to import the module explicitly; it’s not included in the <code>Text.Megaparsec</code> module.</p>
|
||||
<p>Given a table of operators that describes their fixity and precedence, you can construct a parser that will parse any expression involving the operators. See the documentation for a comprehensive description of how it works.</p>
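<p>The operator-table approach can be sketched as follows. This example assumes a recent setup (Megaparsec 7 or later), where <code>makeExprParser</code> and <code>Operator</code> are imported from <code>Control.Monad.Combinators.Expr</code> in the <code>parser-combinators</code> package rather than from <code>Text.Megaparsec.Expr</code>; the helper names <code>lexeme</code>, <code>symbol</code>, <code>term</code>, <code>table</code>, <code>expr</code>, and <code>evalExpr</code> are hypothetical.</p>

```haskell
import Control.Applicative ((<|>))
import Control.Monad.Combinators.Expr (Operator (..), makeExprParser)
import Data.Void (Void)
import Text.Megaparsec (Parsec, between, parseMaybe)
import Text.Megaparsec.Char (char, space)
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- Run a parser, then skip any trailing white space.
lexeme :: Parser a -> Parser a
lexeme = L.lexeme space

-- A single-character symbol followed by optional white space.
symbol :: Char -> Parser Char
symbol = lexeme . char

-- A term is an integer literal or a parenthesized expression.
term :: Parser Int
term = lexeme L.decimal <|> between (symbol '(') (symbol ')') expr

-- Operator table: rows are listed from highest to lowest precedence.
table :: [[Operator Parser Int]]
table =
  [ [Prefix (negate <$ symbol '-')]
  , [InfixL ((*) <$ symbol '*')]
  , [InfixL ((+) <$ symbol '+'), InfixL ((-) <$ symbol '-')]
  ]

-- The expression parser is derived from the term parser and the table.
expr :: Parser Int
expr = makeExprParser term table

-- Evaluate a whole input string, allowing leading white space.
evalExpr :: String -> Maybe Int
evalExpr = parseMaybe (space *> expr)
```

<p>Because the multiplication row comes before the addition row, <code>evalExpr "1 + 2 * 3"</code> should give <code>Just 7</code>. Evaluating directly to <code>Int</code> keeps the sketch short; a real parser would more likely build a syntax tree.</p>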
|
||||
<h3 id="lexer">Lexer</h3>
|
||||
<p><a href="https://hackage.haskell.org/package/megaparsec/docs/Text-Megaparsec-Lexer.html"><code>Text.Megaparsec.Lexer</code></a> is a module that should help you write your lexer. If you have used <code>Parsec</code> in the past, this module “fixes” its particularly inflexible <code>Text.Parsec.Token</code>.</p>
|
||||
<p><code>Text.Megaparsec.Lexer</code> is intended to be imported via a qualified import; it’s not included in <code>Text.Megaparsec</code>. The module doesn’t impose how you should write your parser, but certain approaches may be more elegant than others. An especially important theme is parsing of white space, comments, and indentation.</p>
|
||||
<p>The design of the module lets you quickly solve simple tasks and doesn’t get in your way when you want to implement something less standard.</p>
|
||||
<p>Since Megaparsec 5, all tools for indentation-sensitive parsing are available in the <code>Text.Megaparsec.Lexer</code> module, so no third-party packages are required.</p>
|
||||
<h2 id="documentation">Documentation</h2>
|
||||
<p>Megaparsec is well-documented. All functions and data types are thoroughly described. We take care to avoid outdated information and unclear phrases in our documentation. See the <a href="https://hackage.haskell.org/package/megaparsec">current version of Megaparsec documentation on Hackage</a> for yourself.</p>
|
||||
<h2 id="tutorials">Tutorials</h2>
|
||||
<p>You can visit the <a href="https://mrkkrp.github.io/megaparsec/">site of the project</a> which has <a href="https://mrkkrp.github.io/megaparsec/tutorials.html">several tutorials</a> that should help you to start with your parsing tasks. The site also has instructions and tips for Parsec users who decide to migrate to Megaparsec. If you want to improve an existing tutorial or add your own, open a PR against <a href="https://github.com/mrkkrp/megaparsec-site">this repo</a>.</p>
|
||||
<h2 id="performance">Performance</h2>
|
||||
<p>Despite being quite flexible, Megaparsec is also faster than Parsec. The repository includes benchmarks that can be easily used to compare Megaparsec and Parsec. In most cases Megaparsec is faster, sometimes dramatically faster. If you happen to have some other benchmarks, I would appreciate it if you added Megaparsec to them and let me know how it performs.</p>
|
||||
<p>If you think your Megaparsec parser is not efficient enough, take a look at <a href="https://mrkkrp.github.io/megaparsec/tutorials/writing-a-fast-parser.html">these instructions</a>.</p>
|
||||
<h2 id="comparison-with-other-solutions">Comparison with other solutions</h2>
|
||||
<p>There are quite a few libraries that can be used for parsing in Haskell. Let’s compare Megaparsec with some of them.</p>
|
||||
<h3 id="megaparsec-vs-attoparsec">Megaparsec vs Attoparsec</h3>
|
||||
<p><a href="https://github.com/bos/attoparsec">Attoparsec</a> is another prominent Haskell library for parsing. Although both libraries deal with parsing, it’s usually easy to decide which one you need in a particular project:</p>
|
||||
<ul>
|
||||
<li><p><em>Attoparsec</em> is much faster but not that feature-rich. It should be used when you want to process large amounts of data where performance matters more than quality of error messages.</p></li>
|
||||
<li><p><em>Megaparsec</em> is good for parsing source code or other human-readable texts. It has better error messages and it’s implemented as a monad transformer.</p></li>
|
||||
</ul>
|
||||
<p>So, if you work with something human-readable where the size of input data is usually not huge, just go with Megaparsec; otherwise Attoparsec may be a better choice.</p>
|
||||
<h3 id="megaparsec-vs-parsec">Megaparsec vs Parsec</h3>
|
||||
<p>Since Megaparsec is a fork of Parsec, we are bound to list the main differences between the two libraries:</p>
|
||||
<ul>
|
||||
<li><p>Better error messages. We test our error messages using dense QuickCheck tests. Good error messages are just as important for us as correct return values of our parsers. Megaparsec will be especially useful if you write a compiler or an interpreter for some language.</p></li>
|
||||
<li><p>Some quirks and “buggy features” (as well as plain bugs) of original Parsec are fixed. There is no undocumented surprising stuff in Megaparsec.</p></li>
|
||||
<li><p>Better support for Unicode parsing in <code>Text.Megaparsec.Char</code>.</p></li>
|
||||
<li><p>Megaparsec has more powerful combinators and can parse languages where indentation matters.</p></li>
|
||||
<li><p>Comprehensive QuickCheck test suite covering nearly 100% of our code.</p></li>
|
||||
<li><p>We have benchmarks to detect performance regressions.</p></li>
|
||||
<li><p>Better documentation, with 100% of functions covered, without typos and obsolete information, with working examples. Megaparsec’s documentation is well-structured and doesn’t contain things useless to end users.</p></li>
|
||||
<li><p>Megaparsec’s code is clearer and doesn’t contain “magic” found in original Parsec.</p></li>
|
||||
<li><p>Megaparsec has well-typed error messages and custom error messages.</p></li>
|
||||
<li><p>Megaparsec can recover from parse errors “on the fly” and continue parsing.</p></li>
|
||||
<li><p>Megaparsec allows you to conditionally process parse errors <em>inside your parser</em> before parsing is finished. In particular, it’s possible to define regions in which parse errors, should they happen, will get a “context tag”, e.g. we could build a context stack like “in function definition foo”, “in expression x”, etc. This is not possible with Parsec.</p></li>
|
||||
<li><p>Megaparsec is faster.</p></li>
|
||||
<li><p>Megaparsec is <del>better</del> supported.</p></li>
|
||||
</ul>
|
||||
<p>If you want to see a detailed change log, <code>CHANGELOG.md</code> may be helpful. Also see <a href="https://notehub.org/w7037">this original announcement</a> for another comparison.</p>
|
||||
<p>To be honest, Parsec’s development has seemingly stagnated. It has no test suite (only three per-bug tests), and all its releases beginning from version 3.1.2 (according to its change log) were about introducing and fixing regressions. Parsec is old and somewhat famous in the Haskell community, so we understand there will be some kind of inertia, but we advise you to use Megaparsec from now on because it solves many problems of the original Parsec project. If you think you still have a reason to use original Parsec, open an issue.</p>
|
||||
<h3 id="megaparsec-vs-trifecta">Megaparsec vs Trifecta</h3>
|
||||
<p><a href="https://hackage.haskell.org/package/trifecta">Trifecta</a> is another Haskell library featuring good error messages. Like some other projects of Edward Kmett, it’s probably good, but also under-documented, and has unfixed <a href="https://github.com/ekmett/trifecta/issues">bugs and flaws</a> that Edward is too busy to fix (simply a fact, no offense intended). Other reasons one may question the choice of Trifecta as a parsing library:</p>
|
||||
<ul>
|
||||
<li><p>It is complicated, doesn’t have any tutorials available, and its documentation doesn’t help at all.</p></li>
|
||||
<li><p>Trifecta can parse <code>String</code> and <code>ByteString</code> natively, but not <code>Text</code>.</p></li>
|
||||
<li><p>Trifecta’s error messages have features of their own, but they are certainly not as flexible as Megaparsec’s error messages in the latest versions.</p></li>
|
||||
<li><p>Depends on <code>lens</code>. This means you’ll pull in half of Hackage as transitive dependencies. Also if you’re not into <code>lens</code> and would like to keep your code “vanilla”, you may not like the API.</p></li>
|
||||
</ul>
|
||||
<h3 id="megaparsec-vs-earley">Megaparsec vs Earley</h3>
|
||||
<p><a href="https://hackage.haskell.org/package/Earley">Earley</a> is a newer library that allows you to safely (if your code compiles, then it probably works) parse context-free grammars (CFG). Megaparsec is a lower-level library compared to Earley, but there are still enough reasons to choose it over Earley:</p>
|
||||
<ul>
|
||||
<li><p>Megaparsec is faster.</p></li>
|
||||
<li><p>Your grammar may not be context-free, or you may want to introduce some sort of state into the parsing process. Almost all non-trivial parsers require something of this sort. Even if your grammar is context-free, state may allow you to add some additional niceties. Earley does not support that.</p></li>
|
||||
<li><p>Megaparsec’s error messages are more flexible: you can include arbitrary data in them, return multiple error messages, mark regions that affect any error that happens in those regions, etc.</p></li>
|
||||
<li><p>The approach Earley uses differs from conventional monadic parsing. If you do not work alone, chances are that the people you work with, especially beginners, will be much more productive with libraries that take a more traditional path to parsing, like Megaparsec.</p></li>
|
||||
</ul>
|
||||
<p>In other words, Megaparsec is less safe but also more powerful.</p>
|
||||
<h3 id="megaparsec-vs-parsers">Megaparsec vs Parsers</h3>
|
||||
<p>There is the <a href="https://hackage.haskell.org/package/parsers">Parsers</a> package, which is great. You can use it with Megaparsec or Parsec, but consider the following:</p>
|
||||
<ul>
|
||||
<li><p>It depends on Attoparsec, Parsec, and Trifecta, which means you always grab half of Hackage as transitive dependencies by using it. This is ridiculous, by the way, because this package is supposed to be useful for parser builders, so they can write basic core functionality and get the rest “for free”.</p></li>
|
||||
<li><p>It currently has a <del>bug</del> feature in the definition of <code>lookAhead</code> for various monad transformers like <code>StateT</code>, etc., which is visible when you create backtracking state via a monad stack, not via built-in features. The feature makes it so <code>lookAhead</code> will backtrack your parser state but not your custom state added via <code>StateT</code>. Kmett thinks this behavior is better.</p></li>
|
||||
</ul>
|
||||
<p>We intended to use the Parsers library in Megaparsec at some point, but aside from the flaws already mentioned, the library has different naming conventions, a different set of “core” functions, and a different approach to lexing. So it didn’t happen. Megaparsec has minimal dependencies, and it is feature-rich and self-contained.</p>
|
||||
<h2 id="related-packages">Related packages</h2>
|
||||
<p>The following packages are designed to be used with Megaparsec:</p>
|
||||
<ul>
|
||||
<li><p><a href="https://hackage.haskell.org/package/hspec-megaparsec"><code>hspec-megaparsec</code></a>—utilities for testing Megaparsec parsers with <a href="https://hackage.haskell.org/package/hspec">Hspec</a>.</p></li>
|
||||
<li><p><a href="https://hackage.haskell.org/package/cassava-megaparsec"><code>cassava-megaparsec</code></a>—Megaparsec parser of CSV files that plays nicely with <a href="https://hackage.haskell.org/package/cassava">Cassava</a>.</p></li>
|
||||
<li><p><a href="https://hackage.haskell.org/package/tagsoup-megaparsec"><code>tagsoup-megaparsec</code></a>—a library for easily using <a href="https://hackage.haskell.org/package/tagsoup">TagSoup</a> as a token type in Megaparsec.</p></li>
|
||||
</ul>
|
||||
<h2 id="links-to-announcements">Links to announcements</h2>
|
||||
<p>Here are some blog posts mainly announcing new features of the project and describing what sort of things are now possible:</p>
|
||||
<ul>
|
||||
<li><a href="https://mrkkrp.github.io/posts/latest-additions-to-megaparsec.html">Latest additions to Megaparsec</a></li>
|
||||
<li><a href="https://mrkkrp.github.io/posts/announcing-megaparsec-5.html">Announcing Megaparsec 5</a></li>
|
||||
<li><a href="https://mrkkrp.github.io/posts/megaparsec-4-and-5.html">Megaparsec 4 and 5</a></li>
|
||||
<li><a href="https://notehub.org/w7037">The original Megaparsec 4.0.0 announcement</a></li>
|
||||
</ul>
|
||||
<h2 id="authors">Authors</h2>
|
||||
<p>The project was started and is currently maintained by Mark Karpov. You can find the complete list of contributors in the <code>AUTHORS.md</code> file in the official repository of the project. Thanks also to all the people who propose features and ideas: although they are not in <code>AUTHORS.md</code>, Megaparsec would not be that good without them.</p>
|
||||
<h2 id="contribution">Contribution</h2>
|
||||
<p>Issues (bugs, feature requests or other feedback) may be reported in <a href="https://github.com/mrkkrp/megaparsec/issues">the GitHub issue tracker for this project</a>.</p>
|
||||
<p>Pull requests are also welcome (and yes, they will get attention and will be merged quickly if they are good).</p>
|
||||
<p>If you want to write a tutorial to be hosted on Megaparsec’s site, open an issue or pull request <a href="https://github.com/mrkkrp/megaparsec-site">here</a>.</p>
|
||||
<h2 id="license">License</h2>
|
||||
<p>Copyright © 2015–2017 Megaparsec contributors<br> Copyright © 2007 Paolo Martini<br> Copyright © 1999–2000 Daan Leijen</p>
|
||||
<p>Distributed under the FreeBSD license.</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.min.js"></script>
|
||||
<script src="./js/put-anchors.js"></script>
|
||||
</body>
|
||||
<body></body>
|
||||
</html>
|
||||
|
@ -1,7 +0,0 @@
|
||||
// put anchors
|
||||
anchors.options = {
|
||||
placement: 'left',
|
||||
visible: 'always'
|
||||
};
|
||||
// anchors.options.visible = 'always';
|
||||
anchors.add('h2, h3, h4');
|
@ -2,93 +2,7 @@
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<meta name="description" content />
|
||||
<meta name="author" content />
|
||||
<title>Megaparsec | Tutorials</title>
|
||||
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
|
||||
<link rel="stylesheet" type="text/css" href="./css/megaparsec.css" />
|
||||
<meta http-equiv="refresh" content="0; url=https://markkarpov.com/learn-haskell.html#megaparsec-tutorials">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="navbar navbar-default navbar-static-top" role="navigation">
|
||||
<div class="container-fluid">
|
||||
<div class="navbar-header">
|
||||
<a class="navbar-brand" href="./">
|
||||
Megaparsec
|
||||
</a>
|
||||
</div>
|
||||
<div class="navbar-right">
|
||||
<ul class="nav navbar-nav">
|
||||
|
||||
<li>
|
||||
<a href="./tutorials.html">Tutorials</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://hackage.haskell.org/package/megaparsec">Hackage</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec">GitHub</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec-site">Edit the site</a>
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="container">
|
||||
<div class="row-fluid">
|
||||
<div class="col-sm-10 col-sm-offset-1 col-md-8 col-md-offset-2 main">
|
||||
<div class="page-header">
|
||||
<h1>Tutorials
|
||||
|
||||
</h1>
|
||||
<hr />
|
||||
<div class="content">
|
||||
<ul>
|
||||
|
||||
<li>
|
||||
<a href="./tutorials/parsing-simple-imperative-language.html">Parsing a simple imperative language</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="./tutorials/custom-error-messages.html">How to introduce custom error messages</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="./tutorials/fun-with-the-recovery-feature.html">Fun with the recovery feature</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="./tutorials/indentation-sensitive-parsing.html">Indentation-sensitive parsing</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="./tutorials/switch-from-parsec-to-megaparsec.html">Switch from Parsec to Megaparsec</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="./tutorials/writing-a-fast-parser.html">Writing a fast parser</a>
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.min.js"></script>
|
||||
<script src="./js/put-anchors.js"></script>
|
||||
</body>
|
||||
<body></body>
|
||||
</html>
|
||||
|
@ -2,415 +2,7 @@
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<meta name="description" content />
|
||||
<meta name="author" content />
|
||||
<title>Megaparsec | How to introduce custom error messages</title>
|
||||
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
|
||||
<link rel="stylesheet" type="text/css" href="../css/megaparsec.css" />
|
||||
<meta http-equiv="refresh" content="0; url=https://markkarpov.com/megaparsec/custom-error-messages.html">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="navbar navbar-default navbar-static-top" role="navigation">
|
||||
<div class="container-fluid">
|
||||
<div class="navbar-header">
|
||||
<a class="navbar-brand" href="../">
|
||||
Megaparsec
|
||||
</a>
|
||||
</div>
|
||||
<div class="navbar-right">
|
||||
<ul class="nav navbar-nav">
|
||||
|
||||
<li>
|
||||
<a href="../tutorials.html">Tutorials</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://hackage.haskell.org/package/megaparsec">Hackage</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec">GitHub</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec-site">Edit the site</a>
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="container">
|
||||
<div class="row-fluid">
|
||||
<div class="col-sm-10 col-sm-offset-1 col-md-8 col-md-offset-2 main">
|
||||
<div class="page-header">
|
||||
<h1>How to introduce custom error messages
|
||||
|
||||
<br />
|
||||
<small>
|
||||
It's possible to use user-defined data types as part of parse errors
|
||||
</small>
|
||||
|
||||
</h1>
|
||||
<hr />
|
||||
<div class="content">
|
||||
|
||||
<em>Last updated on May 25, 2017</em>
|
||||
<hr />
|
||||
|
||||
|
||||
<p>One of the advantages of Megaparsec 5 is the ability to use your own data types as part of the data that is returned on parse failure. This makes it possible to tailor error messages to your domain of interest in a way that is unique to this library. All data that constitutes an error message is typed in Megaparsec 5, so it’s easy to inspect and manipulate.</p>
|
||||
<h2 id="the-goal">The goal</h2>
|
||||
<p>In this tutorial we will walk through the creation of the parser found in an existing library called <a href="https://hackage.haskell.org/package/cassava-megaparsec"><code>cassava-megaparsec</code></a>, which is an alternative parser for the popular <a href="https://hackage.haskell.org/package/cassava"><code>cassava</code></a> library for parsing CSV data. The error messages of the default parser are not very user-friendly, so I was asked to design a better one using Megaparsec 5.</p>
|
||||
<p>In addition to the standard error messages (“expected” and “unexpected” tokens), the library can report problems that arise when using the methods of the <code>FromRecord</code> and <code>FromNamedRecord</code> type classes, which describe how to transform a collection of <code>ByteString</code>s into a value of a type that is an instance of those classes. While performing the conversion, things may go wrong, and we would like to use a special data constructor in these cases.</p>
|
||||
<p>The complete source code can be found in <a href="https://github.com/stackbuilders/cassava-megaparsec">this GitHub repository</a>.</p>
|
||||
<h2 id="language-extensions-and-imports">Language extensions and imports</h2>
|
||||
<p>We will need some language extensions and imports. Here is the top of <code>Data.Csv.Parser.Megaparsec</code>, almost verbatim:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">{-# LANGUAGE BangPatterns #-}</span>
|
||||
<span class="ot">{-# LANGUAGE DeriveDataTypeable #-}</span>
|
||||
<span class="ot">{-# LANGUAGE RecordWildCards #-}</span>
|
||||
|
||||
<span class="kw">module</span> <span class="dt">Data.Csv.Parser.Megaparsec</span>
|
||||
( <span class="dt">Cec</span> (<span class="fu">..</span>)
|
||||
, decode
|
||||
, decodeWith
|
||||
, decodeByName
|
||||
, decodeByNameWith )
|
||||
<span class="kw">where</span>
|
||||
|
||||
<span class="kw">import </span><span class="dt">Control.Monad</span>
|
||||
<span class="kw">import </span><span class="dt">Data.ByteString</span> (<span class="dt">ByteString</span>)
|
||||
<span class="kw">import </span><span class="dt">Data.Char</span> (chr)
|
||||
<span class="kw">import </span><span class="dt">Data.Csv</span> <span class="kw">hiding</span>
|
||||
( <span class="dt">Parser</span>
|
||||
, record
|
||||
, namedRecord
|
||||
, header
|
||||
, toNamedRecord
|
||||
, decode
|
||||
, decodeWith
|
||||
, decodeByName
|
||||
, decodeByNameWith )
|
||||
<span class="kw">import </span><span class="dt">Data.Data</span>
|
||||
<span class="kw">import </span><span class="dt">Data.Vector</span> (<span class="dt">Vector</span>)
|
||||
<span class="kw">import </span><span class="dt">Data.Word</span> (<span class="dt">Word8</span>)
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec</span>
|
||||
<span class="kw">import qualified</span> <span class="dt">Data.ByteString.Char8</span> <span class="kw">as</span> <span class="dt">BC8</span>
|
||||
<span class="kw">import qualified</span> <span class="dt">Data.ByteString.Lazy</span> <span class="kw">as</span> <span class="dt">BL</span>
|
||||
<span class="kw">import qualified</span> <span class="dt">Data.Csv</span> <span class="kw">as</span> <span class="dt">C</span>
|
||||
<span class="kw">import qualified</span> <span class="dt">Data.HashMap.Strict</span> <span class="kw">as</span> <span class="dt">H</span>
|
||||
<span class="kw">import qualified</span> <span class="dt">Data.Set</span> <span class="kw">as</span> <span class="dt">S</span>
|
||||
<span class="kw">import qualified</span> <span class="dt">Data.Vector</span> <span class="kw">as</span> <span class="dt">V</span></code></pre></div>
|
||||
<p>Note that there are two imports of <code>Data.Csv</code>: one for common things, such as the names of type classes, that I want to keep unprefixed, and a second one for everything else (qualified as <code>C</code>).</p>
|
||||
<h2 id="what-is-parseerror-actually">What is <code>ParseError</code> actually?</h2>
|
||||
<p>To start with custom error messages we should take a look at how parse errors are represented in Megaparsec 5.</p>
|
||||
<p>The main type for error messages is <code>ParseError</code>, which is defined like this:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | 'ParseError' represents… parse errors. It provides the stack of source</span>
|
||||
<span class="co">-- positions, a set of expected and unexpected tokens as well as a set of</span>
|
||||
<span class="co">-- custom associated data. The data type is parametrized over the token type</span>
|
||||
<span class="co">-- @t@ and the custom data @e@.</span>
|
||||
<span class="co">--</span>
|
||||
<span class="co">-- Note that the stack of source positions contains current position as its</span>
|
||||
<span class="co">-- head, and the rest of positions allows to track full sequence of include</span>
|
||||
<span class="co">-- files with topmost source file at the end of the list.</span>
|
||||
<span class="co">--</span>
|
||||
<span class="co">-- 'Semigroup' (and 'Monoid') instance of the data type allows to merge</span>
|
||||
<span class="co">-- parse errors from different branches of parsing. When merging two</span>
|
||||
<span class="co">-- 'ParseError's, the longest match is preferred; if positions are the same,</span>
|
||||
<span class="co">-- custom data sets and collections of message items are combined.</span>
|
||||
|
||||
<span class="kw">data</span> <span class="dt">ParseError</span> t e <span class="fu">=</span> <span class="dt">ParseError</span>
|
||||
{<span class="ot"> errorPos ::</span> <span class="dt">NonEmpty</span> <span class="dt">SourcePos</span> <span class="co">-- ^ Stack of source positions</span>
|
||||
,<span class="ot"> errorUnexpected ::</span> <span class="dt">Set</span> (<span class="dt">ErrorItem</span> t) <span class="co">-- ^ Unexpected items</span>
|
||||
,<span class="ot"> errorExpected ::</span> <span class="dt">Set</span> (<span class="dt">ErrorItem</span> t) <span class="co">-- ^ Expected items</span>
|
||||
,<span class="ot"> errorCustom ::</span> <span class="dt">Set</span> e <span class="co">-- ^ Associated data, if any</span>
|
||||
} <span class="kw">deriving</span> (<span class="dt">Show</span>, <span class="dt">Read</span>, <span class="dt">Eq</span>, <span class="dt">Data</span>, <span class="dt">Typeable</span>, <span class="dt">Generic</span>)</code></pre></div>
|
||||
<p>Conceptually, we have four components in a parse error:</p>
|
||||
<ul>
|
||||
<li>Position (may be multi-dimensional to support include files).</li>
|
||||
<li>Unexpected “items” (see <a href="https://hackage.haskell.org/package/megaparsec/docs/Text-Megaparsec-Error.html#t:ErrorItem"><code>ErrorItem</code></a> if you are curious).</li>
|
||||
<li>Expected “items”.</li>
|
||||
<li>Everything else: here we have a set of values of type <code>e</code>. <code>e</code> is the type we will be defining and using in this tutorial.</li>
|
||||
</ul>
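<p>The merging rule described in the comment on <code>ParseError</code> can be sketched without Megaparsec at all. The following is a simplified model, not the library’s implementation: <code>SimpleError</code>, its fields, and <code>merge</code> are hypothetical names, an <code>Int</code> offset stands in for the stack of source positions, and plain <code>String</code>s stand in for error items.</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">module Main (main) where

import qualified Data.Set as S

-- | Simplified stand-in for 'ParseError' (an assumption for illustration):
-- an 'Int' offset replaces the stack of source positions and 'String's
-- replace error items.
data SimpleError e = SimpleError
  { sePos        :: Int
  , seUnexpected :: S.Set String
  , seExpected   :: S.Set String
  , seCustom     :: S.Set e
  } deriving (Show, Eq)

-- | Prefer the error that occurred further into the input; when the
-- positions are equal, take the union of every component.
merge :: Ord e => SimpleError e -> SimpleError e -> SimpleError e
merge a b = case compare (sePos a) (sePos b) of
  LT -> b
  GT -> a
  EQ -> SimpleError
    { sePos        = sePos a
    , seUnexpected = S.union (seUnexpected a) (seUnexpected b)
    , seExpected   = S.union (seExpected a) (seExpected b)
    , seCustom     = S.union (seCustom a) (seCustom b)
    }

main :: IO ()
main = do
  let e1 = SimpleError 3 (S.singleton "'x'") (S.singleton "digit") (S.empty :: S.Set String)
      e2 = SimpleError 3 S.empty (S.singleton "letter") S.empty
      e3 = SimpleError 5 S.empty (S.singleton "comma") S.empty
  print (seExpected (merge e1 e2)) -- fromList ["digit","letter"]
  print (sePos (merge e1 e3))      -- 5</code></pre></div>
<p>Because the custom data lives in a <code>Set</code>, combining it requires an <code>Ord</code> instance, which is consistent with the <code>Ord</code> superclass on the type classes shown in the next section.</p>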
|
||||
<h2 id="defining-a-custom-error-component">Defining a custom error component</h2>
|
||||
<p>We cannot ship the library without some default candidate to take the place of the type <code>e</code>, so here it is:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | “Default error component”. This is our instance of 'ErrorComponent'</span>
|
||||
<span class="co">-- provided out-of-box.</span>
|
||||
<span class="co">--</span>
|
||||
<span class="co">-- @since 5.0.0</span>
|
||||
|
||||
<span class="kw">data</span> <span class="dt">Dec</span>
|
||||
<span class="fu">=</span> <span class="dt">DecFail</span> <span class="dt">String</span> <span class="co">-- ^ 'fail' has been used in parser monad</span>
|
||||
<span class="fu">|</span> <span class="dt">DecIndentation</span> <span class="dt">Ordering</span> <span class="dt">Pos</span> <span class="dt">Pos</span>
|
||||
<span class="co">-- ^ Incorrect indentation error: desired ordering between reference</span>
|
||||
<span class="co">-- level and actual level, reference indentation level, actual</span>
|
||||
<span class="co">-- indentation level</span>
|
||||
<span class="kw">deriving</span> (<span class="dt">Show</span>, <span class="dt">Read</span>, <span class="dt">Eq</span>, <span class="dt">Ord</span>, <span class="dt">Data</span>, <span class="dt">Typeable</span>)</code></pre></div>
|
||||
<p>As you can see, it is just a sum type that accounts for every kind of failure that vanilla Megaparsec needs to represent:</p>
|
||||
<ul>
|
||||
<li><code>fail</code> method</li>
|
||||
<li>…and incorrect indentation related to the machinery in <a href="https://hackage.haskell.org/package/megaparsec/docs/Text-Megaparsec-Lexer.html"><code>Text.Megaparsec.Lexer</code></a>.</li>
|
||||
</ul>
|
||||
<p>What this means is that our new custom type should somehow provide a way to represent those things too. The requirement that a type should be capable of representing the above-mentioned exceptional situations is captured by the <code>ErrorComponent</code> type class:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | The type class defines how to represent information about various</span>
|
||||
<span class="co">-- exceptional situations. Data types that are used as custom data component</span>
|
||||
<span class="co">-- in 'ParseError' must be instances of this type class.</span>
|
||||
<span class="co">--</span>
|
||||
<span class="co">-- @since 5.0.0</span>
|
||||
|
||||
<span class="kw">class</span> <span class="dt">Ord</span> e <span class="ot">=></span> <span class="dt">ErrorComponent</span> e <span class="kw">where</span>
|
||||
|
||||
<span class="co">-- | Represent message passed to 'fail' in parser monad.</span>
|
||||
<span class="co">--</span>
|
||||
<span class="co">-- @since 5.0.0</span>
|
||||
|
||||
<span class="ot"> representFail ::</span> <span class="dt">String</span> <span class="ot">-></span> e
|
||||
|
||||
<span class="co">-- | Represent information about incorrect indentation.</span>
|
||||
<span class="co">--</span>
|
||||
<span class="co">-- @since 5.0.0</span>
|
||||
|
||||
representIndentation
|
||||
<span class="ot"> ::</span> <span class="dt">Ordering</span> <span class="co">-- ^ Desired ordering between reference level and actual level</span>
|
||||
<span class="ot">-></span> <span class="dt">Pos</span> <span class="co">-- ^ Reference indentation level</span>
|
||||
<span class="ot">-></span> <span class="dt">Pos</span> <span class="co">-- ^ Actual indentation level</span>
|
||||
<span class="ot">-></span> e</code></pre></div>
|
||||
<p>Every type that is going to be used as part of <code>ParseError</code> must be an instance of the <code>ErrorComponent</code> type class.</p>
|
||||
<p>Another thing we would like to do with a custom error component is to format it, so that it can be inserted into the pretty-printed representation of a <code>ParseError</code>. This behavior is defined by the <code>ShowErrorComponent</code> type class:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | The type class defines how to print custom data component of</span>
|
||||
<span class="co">-- 'ParseError'.</span>
|
||||
<span class="co">--</span>
|
||||
<span class="co">-- @since 5.0.0</span>
|
||||
|
||||
<span class="kw">class</span> <span class="dt">Ord</span> a <span class="ot">=></span> <span class="dt">ShowErrorComponent</span> a <span class="kw">where</span>
|
||||
|
||||
<span class="co">-- | Pretty-print custom data component of 'ParseError'.</span>
|
||||
|
||||
<span class="ot"> showErrorComponent ::</span> a <span class="ot">-></span> <span class="dt">String</span></code></pre></div>
|
||||
<p>We will need to make our new data type an instance of that class as well.</p>
|
||||
<p>So, let’s start. We can grab the existing definition and instances of the <code>Dec</code> data type and change them as necessary. The special case we want to support is a failed conversion from a vector of <code>ByteString</code>s to some particular type. Let’s capture this:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | Custom error component for CSV parsing. It allows typed reporting of</span>
|
||||
<span class="co">-- conversion errors.</span>
|
||||
|
||||
<span class="kw">data</span> <span class="dt">Cec</span>
|
||||
<span class="fu">=</span> <span class="dt">CecFail</span> <span class="dt">String</span>
|
||||
<span class="fu">|</span> <span class="dt">CecIndentation</span> <span class="dt">Ordering</span> <span class="dt">Pos</span> <span class="dt">Pos</span>
|
||||
<span class="fu">|</span> <span class="dt">CecConversionError</span> <span class="dt">String</span>
|
||||
<span class="kw">deriving</span> (<span class="dt">Eq</span>, <span class="dt">Data</span>, <span class="dt">Typeable</span>, <span class="dt">Ord</span>, <span class="dt">Read</span>, <span class="dt">Show</span>)
|
||||
|
||||
<span class="kw">instance</span> <span class="dt">ShowErrorComponent</span> <span class="dt">Cec</span> <span class="kw">where</span>
|
||||
showErrorComponent (<span class="dt">CecFail</span> msg) <span class="fu">=</span> msg
|
||||
showErrorComponent (<span class="dt">CecIndentation</span> ord ref actual) <span class="fu">=</span>
|
||||
<span class="st">"incorrect indentation (got "</span> <span class="fu">++</span> show (unPos actual) <span class="fu">++</span>
|
||||
<span class="st">", should be "</span> <span class="fu">++</span> p <span class="fu">++</span> show (unPos ref) <span class="fu">++</span> <span class="st">")"</span>
|
||||
<span class="kw">where</span> p <span class="fu">=</span> <span class="kw">case</span> ord <span class="kw">of</span>
|
||||
<span class="dt">LT</span> <span class="ot">-></span> <span class="st">"less than "</span>
|
||||
<span class="dt">EQ</span> <span class="ot">-></span> <span class="st">"equal to "</span>
|
||||
<span class="dt">GT</span> <span class="ot">-></span> <span class="st">"greater than "</span>
|
||||
showErrorComponent (<span class="dt">CecConversionError</span> msg) <span class="fu">=</span>
|
||||
<span class="st">"conversion error: "</span> <span class="fu">++</span> msg
|
||||
|
||||
<span class="kw">instance</span> <span class="dt">ErrorComponent</span> <span class="dt">Cec</span> <span class="kw">where</span>
|
||||
representFail <span class="fu">=</span> <span class="dt">CecFail</span>
|
||||
representIndentation <span class="fu">=</span> <span class="dt">CecIndentation</span></code></pre></div>
|
||||
<p>We have re-used definitions from Megaparsec’s source code for <code>Dec</code> here and added a special case represented by <code>CecConversionError</code>. It contains a <code>String</code> that conversion functions of Cassava return. We could do better if Cassava provided typed error values, but <code>String</code> is all we have, so let’s work with it.</p>
|
||||
<p>Another handy definition we need is the <code>Parser</code> type synonym. We cannot use one of the default <code>Parser</code> definitions because those assume <code>Dec</code>, so we define it ourselves rather trivially:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | Parser type that uses “custom error component” 'Cec'.</span>
|
||||
|
||||
<span class="kw">type</span> <span class="dt">Parser</span> <span class="fu">=</span> <span class="dt">Parsec</span> <span class="dt">Cec</span> <span class="dt">BL.ByteString</span></code></pre></div>
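<p>To see how such a component travels from a failing parser to a printed message, here is a minimal, self-contained sketch. It is written against the Megaparsec 5 API as it is presented in this tutorial (<code>failure</code> taking three sets, <code>parseErrorPretty</code> for rendering) and assumes that <code>Text.Megaparsec</code> re-exports the character parsers. The names <code>Oec</code> and <code>evenNumber</code> are hypothetical and are not part of <code>cassava-megaparsec</code>.</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">module Main (main) where

import qualified Data.Set as S
import Text.Megaparsec

-- | Hypothetical error component, shaped like 'Cec' but with a different
-- domain-specific constructor.
data Oec
  = OecFail String
  | OecIndentation Ordering Pos Pos
  | OecOdd Integer
  deriving (Eq, Ord, Show)

instance ErrorComponent Oec where
  representFail        = OecFail
  representIndentation = OecIndentation

instance ShowErrorComponent Oec where
  showErrorComponent (OecFail msg)       = msg
  showErrorComponent (OecIndentation {}) = "incorrect indentation"
  showErrorComponent (OecOdd n)          = "odd number: " ++ show n

-- | Parser over 'String' input that reports errors using 'Oec'.
type Parser = Parsec Oec String

-- | Parse a decimal number and reject it with a typed error if it is odd.
evenNumber :: Parser Integer
evenNumber = do
  n <- read <$> some digitChar
  if even n
    then return n
    else failure S.empty S.empty (S.singleton (OecOdd n))

main :: IO ()
main = case parse (evenNumber <* eof) "" "7" of
  Left err -> putStr (parseErrorPretty err) -- rendered via 'showErrorComponent'
  Right n  -> print n</code></pre></div>
<p>The caller receives the <code>OecOdd</code> value inside <code>errorCustom</code>, so it can be pattern matched on instead of being recovered from a formatted string.</p>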
|
||||
<h2 id="top-level-api-and-helpers">Top level API and helpers</h2>
|
||||
<p>Let’s start from the top and take a look at the top-level, public API:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | Deserialize CSV records from a lazy 'BL.ByteString'. If this fails due</span>
|
||||
<span class="co">-- to incomplete or invalid input, 'Left' is returned. Equivalent to</span>
|
||||
<span class="co">-- 'decodeWith' 'defaultDecodeOptions'.</span>
|
||||
|
||||
<span class="ot">decode ::</span> <span class="dt">FromRecord</span> a
|
||||
<span class="ot">=></span> <span class="dt">HasHeader</span>
|
||||
<span class="co">-- ^ Whether the data contains header that should be skipped</span>
|
||||
<span class="ot">-></span> FilePath
|
||||
<span class="co">-- ^ File name (use empty string if you have none)</span>
|
||||
<span class="ot">-></span> <span class="dt">BL.ByteString</span>
|
||||
<span class="co">-- ^ CSV data</span>
|
||||
<span class="ot">-></span> <span class="dt">Either</span> (<span class="dt">ParseError</span> <span class="dt">Char</span> <span class="dt">Cec</span>) (<span class="dt">Vector</span> a)
|
||||
decode <span class="fu">=</span> decodeWith defaultDecodeOptions
|
||||
|
||||
<span class="co">-- | Like 'decode', but lets you customize how the CSV data is parsed.</span>
|
||||
|
||||
<span class="ot">decodeWith ::</span> <span class="dt">FromRecord</span> a
|
||||
<span class="ot">=></span> <span class="dt">DecodeOptions</span>
|
||||
<span class="co">-- ^ Decoding options</span>
|
||||
<span class="ot">-></span> <span class="dt">HasHeader</span>
|
||||
<span class="co">-- ^ Whether the data contains header that should be skipped</span>
|
||||
<span class="ot">-></span> FilePath
|
||||
<span class="co">-- ^ File name (use empty string if you have none)</span>
|
||||
<span class="ot">-></span> <span class="dt">BL.ByteString</span>
|
||||
<span class="co">-- ^ CSV data</span>
|
||||
<span class="ot">-></span> <span class="dt">Either</span> (<span class="dt">ParseError</span> <span class="dt">Char</span> <span class="dt">Cec</span>) (<span class="dt">Vector</span> a)
|
||||
decodeWith <span class="fu">=</span> decodeWithC csv
|
||||
|
||||
<span class="co">-- | Deserialize CSV records from a lazy 'BL.ByteString'. If this fails due</span>
|
||||
<span class="co">-- to incomplete or invalid input, 'Left' is returned. The data is assumed</span>
|
||||
<span class="co">-- to be preceded by a header. Equivalent to 'decodeByNameWith'</span>
|
||||
<span class="co">-- 'defaultDecodeOptions'.</span>
|
||||
|
||||
<span class="ot">decodeByName ::</span> <span class="dt">FromNamedRecord</span> a
|
||||
<span class="ot">=></span> FilePath <span class="co">-- ^ File name (use empty string if you have none)</span>
|
||||
<span class="ot">-></span> <span class="dt">BL.ByteString</span> <span class="co">-- ^ CSV data</span>
|
||||
<span class="ot">-></span> <span class="dt">Either</span> (<span class="dt">ParseError</span> <span class="dt">Char</span> <span class="dt">Cec</span>) (<span class="dt">Header</span>, <span class="dt">Vector</span> a)
|
||||
decodeByName <span class="fu">=</span> decodeByNameWith defaultDecodeOptions
|
||||
|
||||
<span class="co">-- | Like 'decodeByName', but lets you customize how the CSV data is parsed.</span>
|
||||
|
||||
<span class="ot">decodeByNameWith ::</span> <span class="dt">FromNamedRecord</span> a
|
||||
<span class="ot">=></span> <span class="dt">DecodeOptions</span> <span class="co">-- ^ Decoding options</span>
|
||||
<span class="ot">-></span> FilePath <span class="co">-- ^ File name (use empty string if you have none)</span>
|
||||
<span class="ot">-></span> <span class="dt">BL.ByteString</span> <span class="co">-- ^ CSV data</span>
|
||||
<span class="ot">-></span> <span class="dt">Either</span> (<span class="dt">ParseError</span> <span class="dt">Char</span> <span class="dt">Cec</span>) (<span class="dt">Header</span>, <span class="dt">Vector</span> a)
|
||||
decodeByNameWith opts <span class="fu">=</span> parse (csvWithHeader opts)
|
||||
|
||||
<span class="co">-- | Decode CSV data using the provided parser, skipping a leading header if</span>
|
||||
<span class="co">-- necessary.</span>
|
||||
|
||||
decodeWithC
|
||||
<span class="ot"> ::</span> (<span class="dt">DecodeOptions</span> <span class="ot">-></span> <span class="dt">Parser</span> a)
|
||||
<span class="co">-- ^ Parsing function parametrized by 'DecodeOptions'</span>
|
||||
<span class="ot">-></span> <span class="dt">DecodeOptions</span>
|
||||
<span class="co">-- ^ Decoding options</span>
|
||||
<span class="ot">-></span> <span class="dt">HasHeader</span>
|
||||
<span class="co">-- ^ Whether to expect a header in the input</span>
|
||||
<span class="ot">-></span> FilePath
|
||||
<span class="co">-- ^ File name (use empty string if you have none)</span>
|
||||
<span class="ot">-></span> <span class="dt">BL.ByteString</span>
|
||||
<span class="co">-- ^ CSV data</span>
|
||||
<span class="ot">-></span> <span class="dt">Either</span> (<span class="dt">ParseError</span> <span class="dt">Char</span> <span class="dt">Cec</span>) a
|
||||
decodeWithC p opts<span class="fu">@</span><span class="dt">DecodeOptions</span> {<span class="fu">..</span>} hasHeader <span class="fu">=</span> parse parser
|
||||
<span class="kw">where</span>
|
||||
parser <span class="fu">=</span> <span class="kw">case</span> hasHeader <span class="kw">of</span>
|
||||
<span class="dt">HasHeader</span> <span class="ot">-></span> header decDelimiter <span class="fu">*></span> p opts
|
||||
<span class="dt">NoHeader</span> <span class="ot">-></span> p opts</code></pre></div>
|
||||
<p>There is nothing particularly interesting here, just a bunch of wrappers that boil down to running the <code>parser</code>, which either skips the CSV header first or does not.</p>
|
||||
<p>What I would really like to show you are the helpers, because one of them will be very handy when you decide to write your own parser after reading this tutorial. Here they are:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | End parsing signaling a “conversion error”.</span>
|
||||
|
||||
<span class="ot">conversionError ::</span> <span class="dt">String</span> <span class="ot">-></span> <span class="dt">Parser</span> a
|
||||
conversionError msg <span class="fu">=</span> failure S.empty S.empty (S.singleton err)
|
||||
<span class="kw">where</span>
|
||||
err <span class="fu">=</span> <span class="dt">CecConversionError</span> msg
|
||||
|
||||
<span class="co">-- | Convert a 'Record' to a 'NamedRecord' by attaching column names. The</span>
|
||||
<span class="co">-- 'Header' and 'Record' must be of the same length.</span>
|
||||
|
||||
<span class="ot">toNamedRecord ::</span> <span class="dt">Header</span> <span class="ot">-></span> <span class="dt">Record</span> <span class="ot">-></span> <span class="dt">NamedRecord</span>
|
||||
toNamedRecord hdr v <span class="fu">=</span> H.fromList <span class="fu">.</span> V.toList <span class="fu">$</span> V.zip hdr v
|
||||
|
||||
<span class="co">-- | Parse a byte of specified value and return unit.</span>
|
||||
|
||||
<span class="ot">blindByte ::</span> <span class="dt">Word8</span> <span class="ot">-></span> <span class="dt">Parser</span> ()
|
||||
blindByte <span class="fu">=</span> void <span class="fu">.</span> char <span class="fu">.</span> chr <span class="fu">.</span> fromIntegral</code></pre></div>
|
||||
<p><code>conversionError</code> is handy to have because it lets you fail quickly with your custom error message without writing all the <code>failure</code>-related boilerplate. <code>toNamedRecord</code> just converts a <code>Record</code> to a <code>NamedRecord</code>, while <code>blindByte</code> reads a character (passed to it as a <code>Word8</code> value) and returns unit <code>()</code>.</p>
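<p>The comment on <code>toNamedRecord</code> says that the header and the record must be of the same length. The sketch below shows what the same expression does when they are not. It avoids the Cassava dependency by spelling out the underlying types, so <code>pairUp</code> is a hypothetical stand-in for <code>toNamedRecord</code>.</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">module Main (main) where

import qualified Data.ByteString.Char8 as BC8
import qualified Data.HashMap.Strict as H
import qualified Data.Vector as V

-- | Same body as 'toNamedRecord', with Cassava's type synonyms expanded.
pairUp
  :: V.Vector BC8.ByteString                  -- ^ Column names
  -> V.Vector BC8.ByteString                  -- ^ Fields of one record
  -> H.HashMap BC8.ByteString BC8.ByteString
pairUp hdr v = H.fromList . V.toList $ V.zip hdr v

main :: IO ()
main = do
  let hdr   = V.fromList (map BC8.pack ["name", "age"])
      short = V.fromList (map BC8.pack ["Ann"])
      m     = pairUp hdr short
  -- 'V.zip' stops at the shorter vector, so the "age" column is dropped
  -- silently instead of causing an error.
  print (H.size m, H.lookup (BC8.pack "age") m) -- (1,Nothing)</code></pre></div>
<p>A record that is too short therefore does not fail at this point. Any complaint about the missing column has to come from the conversion step that runs afterwards.</p>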
|
||||
<h2 id="the-parser">The parser</h2>
|
||||
<p>Let’s start with parsing a field. A field in a CSV file can be either escaped or unescaped:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | Parse a field. The field may be in either the escaped or non-escaped</span>
|
||||
<span class="co">-- format. The returned value is unescaped.</span>
|
||||
|
||||
<span class="ot">field ::</span> <span class="dt">Word8</span> <span class="ot">-></span> <span class="dt">Parser</span> <span class="dt">Field</span>
|
||||
field del <span class="fu">=</span> label <span class="st">"field"</span> (escapedField <span class="fu"><|></span> unescapedField del)</code></pre></div>
|
||||
<p>An escaped field is written inside straight quotes <code>""</code> and can contain any characters at all, but the quote sign itself <code>"</code> must be escaped by doubling it:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | Parse an escaped field.</span>
|
||||
|
||||
<span class="ot">escapedField ::</span> <span class="dt">Parser</span> <span class="dt">ByteString</span>
|
||||
escapedField <span class="fu">=</span>
|
||||
BC8.pack <span class="fu"><$!></span> between (char <span class="ch">'"'</span>) (char <span class="ch">'"'</span>) (many <span class="fu">$</span> normalChar <span class="fu"><|></span> escapedDq)
|
||||
<span class="kw">where</span>
|
||||
normalChar <span class="fu">=</span> noneOf <span class="st">"\""</span> <span class="fu"><?></span> <span class="st">"unescaped character"</span>
|
||||
escapedDq <span class="fu">=</span> label <span class="st">"escaped double-quote"</span> (<span class="ch">'"'</span> <span class="fu"><$</span> string <span class="st">"\"\""</span>)</code></pre></div>
|
||||
<p>Simple so far. <code>unescapedField</code> is even simpler: it can contain any character except for the quote sign <code>"</code>, the delimiter sign, and newline characters:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | Parse an unescaped field.</span>
|
||||
|
||||
<span class="ot">unescapedField ::</span> <span class="dt">Word8</span> <span class="ot">-></span> <span class="dt">Parser</span> <span class="dt">ByteString</span>
|
||||
unescapedField del <span class="fu">=</span> BC8.pack <span class="fu"><$!></span> many (noneOf es)
|
||||
<span class="kw">where</span>
|
||||
es <span class="fu">=</span> chr (fromIntegral del) <span class="fu">:</span> <span class="st">"\"\n\r"</span></code></pre></div>
|
||||
<p>To parse a record we have to parse a non-empty collection of fields separated by delimiter characters (the delimiter is supplied via <code>DecodeOptions</code>). Then we convert it to <code>Vector ByteString</code>, because that’s what Cassava’s conversion functions expect:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | Parse a record, not including the terminating line separator. The</span>
|
||||
<span class="co">-- terminating line separator is not included as the last record in a CSV</span>
|
||||
<span class="co">-- file is allowed to not have a terminating line separator.</span>
|
||||
|
||||
record
|
||||
<span class="ot"> ::</span> <span class="dt">Word8</span> <span class="co">-- ^ Field delimiter</span>
|
||||
<span class="ot">-></span> (<span class="dt">Record</span> <span class="ot">-></span> <span class="dt">C.Parser</span> a)
|
||||
<span class="co">-- ^ How to “parse” record to get the data of interest</span>
|
||||
<span class="ot">-></span> <span class="dt">Parser</span> a
|
||||
record del f <span class="fu">=</span> <span class="kw">do</span>
|
||||
notFollowedBy eof <span class="co">-- to prevent reading empty line at the end of file</span>
|
||||
r <span class="ot"><-</span> V.fromList <span class="fu"><$!></span> (sepBy1 (field del) (blindByte del) <span class="fu"><?></span> <span class="st">"record"</span>)
|
||||
<span class="kw">case</span> C.runParser (f r) <span class="kw">of</span>
|
||||
<span class="dt">Left</span> msg <span class="ot">-></span> conversionError msg
|
||||
<span class="dt">Right</span> x <span class="ot">-></span> return x</code></pre></div>
|
||||
<p>The <code>(<$!>)</code> operator works just like the familiar <code>(<$>)</code> operator, but applies <code>V.fromList</code> strictly. Now that we have the vector of <code>ByteString</code>s, we can try to convert it: on success we return the result, on failure we fail using the <code>conversionError</code> helper.</p>
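<p>The difference between the two operators is easiest to see with a function whose result cannot be evaluated. The following sketch uses only <code>base</code>. The helpers <code>explode</code> and <code>threw</code> are hypothetical and exist only for this demonstration.</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">module Main (main) where

import Control.Exception (SomeException, try)
import Control.Monad ((<$!>))

-- | A function whose result blows up as soon as it is evaluated.
explode :: () -> Int
explode _ = error "boom"

-- | Run an action and report whether it threw an exception.
threw :: IO a -> IO Bool
threw act = do
  r <- try act
  return $ case r of
    Left e  -> const True (e :: SomeException)
    Right _ -> False

main :: IO ()
main = do
  -- Lazy fmap: 'explode ()' stays an unevaluated thunk, nothing is thrown.
  lazy   <- threw (explode <$>  return ())
  -- Strict fmap: the result is forced to weak head normal form before it
  -- is returned, so the error surfaces inside the action.
  strict <- threw (explode <$!> return ())
  print (lazy, strict) -- (False,True)</code></pre></div>
<p>In <code>record</code> this means that the vector is built as soon as the fields have been parsed, rather than being left as a thunk that holds on to the intermediate list.</p>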
|
||||
<p>The library should also handle CSV files with headers:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | Parse a CSV file that includes a header.</span>
|
||||
|
||||
<span class="ot">csvWithHeader ::</span> <span class="dt">FromNamedRecord</span> a
|
||||
<span class="ot">=></span> <span class="dt">DecodeOptions</span> <span class="co">-- ^ Decoding options</span>
|
||||
<span class="ot">-></span> <span class="dt">Parser</span> (<span class="dt">Header</span>, <span class="dt">Vector</span> a)
|
||||
<span class="co">-- ^ The parser that parses a collection of named records</span>
|
||||
csvWithHeader <span class="fu">!</span><span class="dt">DecodeOptions</span> {<span class="fu">..</span>} <span class="fu">=</span> <span class="kw">do</span>
|
||||
<span class="fu">!</span>hdr <span class="ot"><-</span> header decDelimiter
|
||||
<span class="kw">let</span> f <span class="fu">=</span> parseNamedRecord <span class="fu">.</span> toNamedRecord hdr
|
||||
xs <span class="ot"><-</span> sepEndBy1 (record decDelimiter f) eol
|
||||
eof
|
||||
return <span class="fu">$</span> <span class="kw">let</span> <span class="fu">!</span>v <span class="fu">=</span> V.fromList xs <span class="kw">in</span> (hdr, v)
|
||||
|
||||
<span class="co">-- | Parse a header, including the terminating line separator.</span>
|
||||
|
||||
<span class="ot">header ::</span> <span class="dt">Word8</span> <span class="ot">-></span> <span class="dt">Parser</span> <span class="dt">Header</span>
|
||||
header del <span class="fu">=</span> V.fromList <span class="fu"><$!></span> p <span class="fu"><*</span> eol
|
||||
<span class="kw">where</span>
|
||||
p <span class="fu">=</span> sepBy1 (name del) (blindByte del) <span class="fu"><?></span> <span class="st">"file header"</span>
|
||||
|
||||
<span class="co">-- | Parse a header name. Header names have the same format as regular</span>
|
||||
<span class="co">-- 'field's.</span>
|
||||
|
||||
<span class="ot">name ::</span> <span class="dt">Word8</span> <span class="ot">-></span> <span class="dt">Parser</span> <span class="dt">Name</span>
|
||||
name del <span class="fu">=</span> field del <span class="fu"><?></span> <span class="st">"name in header"</span></code></pre></div>
|
||||
<p>The code should be self-explanatory by now. The only thing that remains is to parse a collection of records:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | Parse a CSV file that does not include a header.</span>
|
||||
|
||||
<span class="ot">csv ::</span> <span class="dt">FromRecord</span> a
|
||||
<span class="ot">=></span> <span class="dt">DecodeOptions</span> <span class="co">-- ^ Decoding options</span>
|
||||
<span class="ot">-></span> <span class="dt">Parser</span> (<span class="dt">Vector</span> a) <span class="co">-- ^ The parser that parses a collection of records</span>
|
||||
csv <span class="fu">!</span><span class="dt">DecodeOptions</span> {<span class="fu">..</span>} <span class="fu">=</span> <span class="kw">do</span>
|
||||
xs <span class="ot"><-</span> sepEndBy1 (record decDelimiter parseRecord) eol
|
||||
eof
|
||||
return <span class="fu">$!</span> V.fromList xs</code></pre></div>
|
||||
<p>Too simple!</p>
|
||||
<h2 id="trying-it-out">Trying it out</h2>
|
||||
<p>The custom error messages work seamlessly with the rest of the parser. Let’s parse a CSV file into a collection of <code>(String, Maybe Int, Double)</code> items. If I try to parse <code>"foo</code>, I get the usual Megaparsec error message with “unexpected” and “expected” parts:</p>
|
||||
<pre><code>my-file.csv:1:5:
|
||||
unexpected end of input
|
||||
expecting '"', escaped double-quote, or unescaped character</code></pre>
|
||||
<p>However, when that phase of parsing succeeds, as with the input <code>foo,12,boo</code>, the conversion is attempted and its result is reported:</p>
|
||||
<pre><code>my-file.csv:1:11:
|
||||
conversion error: expected Double, got "boo" (Failed reading: takeWhile1)</code></pre>
|
||||
<p>(I wouldn’t mind if the <code>(Failed reading: takeWhile1)</code> part were omitted, but that’s what Cassava’s conversion methods produce.)</p>
|
||||
<h2 id="conclusion">Conclusion</h2>
|
||||
<p>I hope this walk-through has demonstrated that it’s quite trivial to insert your own data into Megaparsec error messages. This also makes it possible to extract some data from a failing parser, or just to keep track of things in a type-safe way, which is one thing we should always care about when writing Haskell programs.</p>
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.min.js"></script>
|
||||
<script src="../js/put-anchors.js"></script>
|
||||
</body>
|
||||
<body></body>
|
||||
</html>
|
||||
|
@ -2,223 +2,7 @@
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<meta name="description" content />
|
||||
<meta name="author" content />
|
||||
<title>Megaparsec | Fun with the recovery feature</title>
|
||||
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
|
||||
<link rel="stylesheet" type="text/css" href="../css/megaparsec.css" />
|
||||
<meta http-equiv="refresh" content="0; url=https://markkarpov.com/megaparsec/indentation-sensitive-parsing.html">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="navbar navbar-default navbar-static-top" role="navigation">
|
||||
<div class="container-fluid">
|
||||
<div class="navbar-header">
|
||||
<a class="navbar-brand" href="../">
|
||||
Megaparsec
|
||||
</a>
|
||||
</div>
|
||||
<div class="navbar-right">
|
||||
<ul class="nav navbar-nav">
|
||||
|
||||
<li>
|
||||
<a href="../tutorials.html">Tutorials</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://hackage.haskell.org/package/megaparsec">Hackage</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec">GitHub</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec-site">Edit the site</a>
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="container">
|
||||
<div class="row-fluid">
|
||||
<div class="col-sm-10 col-sm-offset-1 col-md-8 col-md-offset-2 main">
|
||||
<div class="page-header">
|
||||
<h1>Fun with the recovery feature
|
||||
|
||||
<br />
|
||||
<small>
|
||||
Skip errors and report multiple errors at once
|
||||
</small>
|
||||
|
||||
</h1>
|
||||
<hr />
|
||||
<div class="content">
|
||||
|
||||
<em>Last updated on May 25, 2017</em>
|
||||
<hr />
|
||||
|
||||
|
||||
<p>Megaparsec 4.4.0 is a major improvement of the library. Among other things, it provides a new primitive combinator, <code>withRecovery</code>, that lets you recover from parse errors “on the fly” and either report several errors after parsing is finished or ignore them altogether. In this tutorial, we will learn how to use this incredible tool.</p>
|
||||
<ol style="list-style-type: decimal">
|
||||
<li><a href="#language-that-we-will-parse">Language that we will parse</a></li>
|
||||
<li><a href="#parser-without-recovery">Parser without recovery</a></li>
|
||||
<li><a href="#making-use-of-the-recovery-feature">Making use of the recovery feature</a></li>
|
||||
<li><a href="#conclusion">Conclusion</a></li>
|
||||
</ol>
|
||||
<h2 id="language-that-we-will-parse">Language that we will parse</h2>
|
||||
<p>For the purposes of this tutorial, we will write a parser for a simplistic functional language that consists only of equations with a symbol on the left-hand side and an arithmetic expression on the right-hand side:</p>
|
||||
<pre><code>y = 10
|
||||
x = 3 * (1 + y)
|
||||
|
||||
result = x - 1 # answer is 32</code></pre>
|
||||
<p>As it stands, the language can only calculate arithmetic expressions. If we were to design something more powerful, we could introduce more interesting operators, for example to grab input from the console, but since our aim is to explore a new parsing feature, this language will do.</p>
|
||||
<p>First, we will write a parser that can parse an entire program in this language as a list of ASTs representing equations. Then we will make it failure-tolerant, so that when it cannot parse a particular equation, it does not stop but continues its work until all input is analyzed.</p>
|
||||
<h2 id="parser-without-recovery">Parser without recovery</h2>
|
||||
<p>The parser is very easy to write. We will need the following imports:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">{-# LANGUAGE FlexibleContexts #-}</span>
|
||||
<span class="ot">{-# LANGUAGE TypeFamilies #-}</span>
|
||||
|
||||
<span class="kw">module</span> <span class="dt">Main</span> <span class="kw">where</span>
|
||||
|
||||
<span class="kw">import </span><span class="dt">Control.Applicative</span> (empty)
|
||||
<span class="kw">import </span><span class="dt">Control.Monad</span> (void)
|
||||
<span class="kw">import </span><span class="dt">Data.Scientific</span> (toRealFloat)
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec</span>
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec.String</span>
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec.Expr</span>
|
||||
<span class="kw">import qualified</span> <span class="dt">Text.Megaparsec.Lexer</span> <span class="kw">as</span> <span class="dt">L</span></code></pre></div>
|
||||
<p>To represent the AST of our language we will use these definitions:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">type</span> <span class="dt">Program</span> <span class="fu">=</span> [<span class="dt">Equation</span>]
|
||||
|
||||
<span class="kw">data</span> <span class="dt">Equation</span> <span class="fu">=</span> <span class="dt">Equation</span> <span class="dt">String</span> <span class="dt">Expr</span>
|
||||
<span class="kw">deriving</span> (<span class="dt">Eq</span>, <span class="dt">Show</span>)
|
||||
|
||||
<span class="kw">data</span> <span class="dt">Expr</span>
|
||||
<span class="fu">=</span> <span class="dt">Value</span> <span class="dt">Double</span>
|
||||
<span class="fu">|</span> <span class="dt">Reference</span> <span class="dt">String</span>
|
||||
<span class="fu">|</span> <span class="dt">Negation</span> <span class="dt">Expr</span>
|
||||
<span class="fu">|</span> <span class="dt">Sum</span> <span class="dt">Expr</span> <span class="dt">Expr</span>
|
||||
<span class="fu">|</span> <span class="dt">Subtraction</span> <span class="dt">Expr</span> <span class="dt">Expr</span>
|
||||
<span class="fu">|</span> <span class="dt">Multiplication</span> <span class="dt">Expr</span> <span class="dt">Expr</span>
|
||||
<span class="fu">|</span> <span class="dt">Division</span> <span class="dt">Expr</span> <span class="dt">Expr</span>
|
||||
<span class="kw">deriving</span> (<span class="dt">Eq</span>, <span class="dt">Show</span>)</code></pre></div>
|
||||
<p>A program in our language is a collection of equations, where every equation gives a name to an expression, which in turn can be a number, a reference to another equation, or some math involving those concepts.</p>
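<p>As an illustration, here is how the equation <code>x = 3 * (1 + y)</code> from the sample program could be written down by hand using these types. The definitions are repeated only so that the snippet compiles on its own, and <code>example</code> is a name chosen for this sketch.</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">module Main where

data Equation = Equation String Expr
  deriving (Eq, Show)

data Expr
  = Value          Double
  | Reference      String
  | Negation       Expr
  | Sum            Expr Expr
  | Subtraction    Expr Expr
  | Multiplication Expr Expr
  | Division       Expr Expr
  deriving (Eq, Show)

-- Hand-written AST for the line: x = 3 * (1 + y)
example :: Equation
example =
  Equation "x"
    (Multiplication (Value 3) (Sum (Value 1) (Reference "y")))

main :: IO ()
main = print example</code></pre></div>
<p>Parentheses do not appear in the AST at all: grouping is expressed by the nesting of constructors.</p>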
|
||||
<p>As usual, the first thing that we need to handle when starting a parser is white space. We will have two space-consuming parsers:</p>
|
||||
<ul>
|
||||
<li><p><code>scn</code>—consumes newlines and white space in general. We will use it for white space between equations, which will start with a newline (since equations are newline-delimited).</p></li>
|
||||
<li><p><code>sc</code>—this does not consume newlines and is used to define lexemes, i.e. things that automatically eat white space after them.</p></li>
|
||||
</ul>
|
||||
<p>Here is what I’ve got:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">lineComment ::</span> <span class="dt">Parser</span> ()
|
||||
lineComment <span class="fu">=</span> L.skipLineComment <span class="st">"#"</span>
|
||||
|
||||
<span class="ot">scn ::</span> <span class="dt">Parser</span> ()
|
||||
scn <span class="fu">=</span> L.space (void spaceChar) lineComment empty
|
||||
|
||||
<span class="ot">sc ::</span> <span class="dt">Parser</span> ()
|
||||
sc <span class="fu">=</span> L.space (void <span class="fu">$</span> oneOf <span class="st">" \t"</span>) lineComment empty
|
||||
|
||||
<span class="ot">lexeme ::</span> <span class="dt">Parser</span> a <span class="ot">-></span> <span class="dt">Parser</span> a
|
||||
lexeme <span class="fu">=</span> L.lexeme sc
|
||||
|
||||
<span class="ot">symbol ::</span> <span class="dt">String</span> <span class="ot">-></span> <span class="dt">Parser</span> <span class="dt">String</span>
|
||||
symbol <span class="fu">=</span> L.symbol sc</code></pre></div>
|
||||
<p>Consult the Haddocks for descriptions of <code>L.space</code>, <code>L.lexeme</code>, and <code>L.symbol</code>. In short, <code>L.space</code> is a helper to quickly put together a general-purpose space-consuming parser. We will follow this strategy: <em>assume no white space before lexemes and consume all white space after lexemes</em>. The one exception is white space that precedes the very first lexeme; it is dealt with specially, see below.</p>
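<p>The strategy can be observed with a small experiment. This is only a sketch against the Megaparsec 5 API used in this tutorial; <code>word</code> and <code>twoWords</code> are hypothetical helpers introduced for the demonstration.</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">module Main where

import Control.Applicative (empty, some)
import Control.Monad (void)
import Text.Megaparsec
import Text.Megaparsec.String
import qualified Text.Megaparsec.Lexer as L

-- Same space consumer as in the tutorial: spaces, tabs, and # comments.
sc :: Parser ()
sc = L.space (void $ oneOf " \t") (L.skipLineComment "#") empty

-- Hypothetical lexeme: a run of letters followed by optional white space.
word :: Parser String
word = L.lexeme sc (some letterChar)

twoWords :: Parser (String, String)
twoWords = (,) <$> word <*> word

main :: IO ()
main = do
  -- Succeeds: the white space after "foo" is eaten by the lexeme itself.
  parseTest twoWords "foo   bar"
  -- Fails at the first column: nothing consumes leading white space.
  parseTest twoWords "  foo bar"</code></pre></div>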
|
||||
<p>We also need a parser for equation names (<code>x</code>, <code>y</code>, and <code>result</code> in the first example). Like in many other programming languages, we will accept alpha-numeric sequences that do not start with a digit:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">name ::</span> <span class="dt">Parser</span> <span class="dt">String</span>
|
||||
name <span class="fu">=</span> lexeme ((<span class="fu">:</span>) <span class="fu"><$></span> letterChar <span class="fu"><*></span> many alphaNumChar) <span class="fu"><?></span> <span class="st">"name"</span></code></pre></div>
|
||||
<p>All too easy. Parsing of expressions could slow us down, but there is an out-of-the-box solution in the <code>Text.Megaparsec.Expr</code> module:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">expr ::</span> <span class="dt">Parser</span> <span class="dt">Expr</span>
|
||||
expr <span class="fu">=</span> makeExprParser term table <span class="fu"><?></span> <span class="st">"expression"</span>
|
||||
|
||||
<span class="ot">term ::</span> <span class="dt">Parser</span> <span class="dt">Expr</span>
|
||||
term <span class="fu">=</span> parens expr
|
||||
<span class="fu"><|></span> (<span class="dt">Reference</span> <span class="fu"><$></span> name)
|
||||
<span class="fu"><|></span> (<span class="dt">Value</span> <span class="fu"><$></span> number)
|
||||
|
||||
<span class="ot">table ::</span> [[<span class="dt">Operator</span> <span class="dt">Parser</span> <span class="dt">Expr</span>]]
|
||||
table <span class="fu">=</span>
|
||||
[ [<span class="dt">Prefix</span> (<span class="dt">Negation</span> <span class="fu"><$</span> symbol <span class="st">"-"</span>) ]
|
||||
, [ <span class="dt">InfixL</span> (<span class="dt">Multiplication</span> <span class="fu"><$</span> symbol <span class="st">"*"</span>)
|
||||
, <span class="dt">InfixL</span> (<span class="dt">Division</span> <span class="fu"><$</span> symbol <span class="st">"/"</span>) ]
|
||||
, [ <span class="dt">InfixL</span> (<span class="dt">Sum</span> <span class="fu"><$</span> symbol <span class="st">"+"</span>)
|
||||
, <span class="dt">InfixL</span> (<span class="dt">Subtraction</span> <span class="fu"><$</span> symbol <span class="st">"-"</span>) ]
|
||||
]
|
||||
|
||||
<span class="ot">number ::</span> <span class="dt">Parser</span> <span class="dt">Double</span>
|
||||
number <span class="fu">=</span> toRealFloat <span class="fu"><$></span> lexeme L.number
|
||||
|
||||
<span class="ot">parens ::</span> <span class="dt">Parser</span> a <span class="ot">-></span> <span class="dt">Parser</span> a
|
||||
parens <span class="fu">=</span> between (symbol <span class="st">"("</span>) (symbol <span class="st">")"</span>)</code></pre></div>
|
||||
<p>We just wrote a fairly complete parser for expressions in our language! If you’re new to all this, I suggest you load the code into GHCi and play with it a bit. Use the <code>parseTest</code> function to feed input into the parser:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">λ<span class="fu">></span> parseTest expr <span class="st">"5"</span>
|
||||
<span class="dt">Value</span> <span class="fl">5.0</span>
|
||||
λ<span class="fu">></span> parseTest expr <span class="st">"5 + foo"</span>
|
||||
<span class="dt">Sum</span> (<span class="dt">Value</span> <span class="fl">5.0</span>) (<span class="dt">Reference</span> <span class="st">"foo"</span>)
|
||||
λ<span class="fu">></span> parseTest expr <span class="st">"(x + y) * 5 + 7 * z"</span>
|
||||
<span class="dt">Sum</span>
|
||||
(<span class="dt">Multiplication</span> (<span class="dt">Sum</span> (<span class="dt">Reference</span> <span class="st">"x"</span>) (<span class="dt">Reference</span> <span class="st">"y"</span>)) (<span class="dt">Value</span> <span class="fl">5.0</span>))
|
||||
(<span class="dt">Multiplication</span> (<span class="dt">Value</span> <span class="fl">7.0</span>) (<span class="dt">Reference</span> <span class="st">"z"</span>))</code></pre></div>
|
||||
<p>Power! The only things that remain are a parser for equations and a parser for the entire program:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">equation ::</span> <span class="dt">Parser</span> <span class="dt">Equation</span>
|
||||
equation <span class="fu">=</span> <span class="dt">Equation</span> <span class="fu"><$></span> (name <span class="fu"><*</span> symbol <span class="st">"="</span>) <span class="fu"><*></span> expr
|
||||
|
||||
<span class="ot">prog ::</span> <span class="dt">Parser</span> <span class="dt">Program</span>
|
||||
prog <span class="fu">=</span> between scn eof (sepEndBy equation scn)</code></pre></div>
|
||||
<p>Note that we need to consume leading white space in <code>prog</code> manually, as described above. Try the <code>prog</code> parser: it’s a complete solution that can parse the language we described in the beginning. Parsing “end of file” explicitly with <code>eof</code> makes the parser consume all input and fail loudly if it cannot. Without it, the parser would just stop at the first problematic token and return what it has parsed so far.</p>
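<p>The effect of <code>eof</code> is easy to see in isolation. The sketch below assumes the same Megaparsec 5 modules as the rest of the tutorial; <code>digits</code> is a throwaway parser used only here.</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">module Main where

import Control.Applicative (some)
import Text.Megaparsec
import Text.Megaparsec.String

-- Throwaway parser: one or more decimal digits.
digits :: Parser String
digits = some digitChar

main :: IO ()
main = do
  -- Succeeds with "123"; the trailing "abc" is silently left unconsumed.
  parseTest digits "123abc"
  -- Fails: with eof in place the unexpected 'a' is reported.
  parseTest (digits <* eof) "123abc"</code></pre></div>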
|
||||
<h2 id="making-use-of-the-recovery-feature">Making use of the recovery feature</h2>
|
||||
<p>Our parser is really dandy: it has nice error messages and does its job well. However, every equation is clearly separated from the others by a newline. This separation makes it possible to analyze equations independently: even if one of them is malformed, we have no reason to stop and not check the others. In fact, that’s how some “serious” parsers work (parsers for C++, for example, although it depends on the compiler, I guess). Reporting multiple parse errors at once can be a more efficient way to communicate with the programmer who needs to fix them than forcing them to recompile the program every time to get to the next error. In this section we will make our parser failure-tolerant and able to report multiple error messages at once.</p>
|
||||
<p>Let’s add one more type synonym—<code>RawData</code>:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">type</span> <span class="dt">RawData</span> t e <span class="fu">=</span> [<span class="dt">Either</span> (<span class="dt">ParseError</span> t e) <span class="dt">Equation</span>]</code></pre></div>
|
||||
<p>This represents a collection of equations, just like <code>Program</code>, but every one of them may be malformed: in that case we get the original error message in <code>Left</code>, otherwise we have a properly parsed equation in <code>Right</code>.</p>
|
||||
<p>You will be amazed just how easy it is to add recovery to an existing parser:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">rawData ::</span> <span class="dt">Parser</span> (<span class="dt">RawData</span> <span class="dt">Char</span> <span class="dt">Dec</span>)
|
||||
rawData <span class="fu">=</span> between scn eof (sepEndBy e scn)
|
||||
<span class="kw">where</span> e <span class="fu">=</span> withRecovery recover (<span class="dt">Right</span> <span class="fu"><$></span> equation)
|
||||
recover err <span class="fu">=</span> <span class="dt">Left</span> err <span class="fu"><$</span> manyTill anyChar eol</code></pre></div>
|
||||
<p>Let’s try it. Here is the input:</p>
|
||||
<pre><code>foo = (x $ y) * 5 + 7.2 * z
|
||||
bar = 15</code></pre>
|
||||
<p>Result:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">[ <span class="dt">Left</span>
|
||||
(<span class="dt">ParseError</span>
|
||||
{ errorPos <span class="fu">=</span> <span class="dt">SourcePos</span>
|
||||
{ sourceName <span class="fu">=</span> <span class="st">""</span>, sourceLine <span class="fu">=</span> <span class="dt">Pos</span> <span class="dv">1</span>
|
||||
, sourceColumn <span class="fu">=</span> <span class="dt">Pos</span> <span class="dv">10</span>} <span class="fu">:|</span> []
|
||||
, errorUnexpected <span class="fu">=</span> fromList [<span class="dt">Tokens</span> (<span class="ch">'$'</span> <span class="fu">:|</span> <span class="st">""</span>)]
|
||||
, errorExpected <span class="fu">=</span> fromList
|
||||
[ <span class="dt">Tokens</span> (<span class="ch">')'</span> <span class="fu">:|</span> <span class="st">""</span>)
|
||||
, <span class="dt">Label</span> (<span class="ch">'o'</span> <span class="fu">:|</span> <span class="st">"perator"</span>)
|
||||
, <span class="dt">Label</span> (<span class="ch">'r'</span> <span class="fu">:|</span> <span class="st">"est of expression"</span>) ]
|
||||
, errorCustom <span class="fu">=</span> fromList [] })
|
||||
, <span class="dt">Right</span> (<span class="dt">Equation</span> <span class="st">"bar"</span> (<span class="dt">Value</span> <span class="fl">15.0</span>)) ]</code></pre></div>
|
||||
<p>How does it work? The <code>withRecovery r p</code> primitive runs the parser <code>p</code> as usual, but if it fails, it takes its <code>ParseError</code> and passes it as an argument to <code>r</code>. In <code>r</code> you start right where <code>p</code> failed. No backtracking happens, because it would make it harder to find the position from which to resume normal parsing. Here you have a chance to consume some input to advance the parser’s textual position. In our case it’s as simple as eating all input up to the next newline, but it might be trickier.</p>
|
||||
<p>You probably want to know what happens when the recovering parser <code>r</code> fails as well. The answer is: your parser fails as usual, as if no <code>withRecovery</code> primitive was used. It’s by design that the recovering parser cannot influence error messages in any way; otherwise it would lead to quite confusing error messages in some cases, depending on the logic of the recovering parser.</p>
|
||||
<p>Now it’s up to you what to do with <code>RawData</code>. You can either take all error messages and print them one by one, or ignore errors altogether and filter only valid equations to work with.</p>
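<p>One possible way to do both at once is sketched below: print every collected error and hand back the items that parsed. The helper <code>reportErrors</code> is an assumption of this sketch, not something the library provides, and it is kept polymorphic in the item type so that it compiles without the rest of the tutorial code; with the definitions above, the item type would be <code>Equation</code>.</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">module Main where

import Data.Either (partitionEithers)
import Text.Megaparsec (Dec, ParseError, parseErrorPretty)

-- | Hypothetical helper: print all parse errors, return the good items.
reportErrors :: [Either (ParseError Char Dec) a] -> IO [a]
reportErrors raw = do
  -- Split the mixed list into errors and successfully parsed items.
  let (errs, good) = partitionEithers raw
  -- parseErrorPretty renders an error the same way parseTest does.
  mapM_ (putStr . parseErrorPretty) errs
  return good

main :: IO ()
main = do
  -- Stand-in data without errors; real input would come from rawData.
  good <- reportErrors [Right 'a', Right 'b']
  print good</code></pre></div>
<p>If you only care about the valid equations, <code>Data.Either.rights</code> does the filtering in one step.</p>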
|
||||
<h2 id="conclusion">Conclusion</h2>
|
||||
<p>When you want to use <code>withRecovery</code>, the main thing to remember is that the parts of the text that you want to allow to fail should be clearly separated from each other, so that the recovering parser can reliably skip to the next part if the current part cannot be parsed. In a language like Python, you could use indentation levels to tell apart high-level definitions, for example. In every case you should use your judgment and creativity to decide how to make use of <code>withRecovery</code>. In some cases it may not be worth it, but more often than not you will be able to improve the experience of people who work with your product by using this new Megaparsec feature.</p>
|
||||
|
||||
|
||||
<hr />
|
||||
<p>
|
||||
(Psst! Looking for the source code for this tutorial?
|
||||
It's <a href="https://github.com/mrkkrp/megaparsec-site/tree/master/tutorial-code/RecoveryFeature.hs">here</a>.)
|
||||
</p>
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.min.js"></script>
|
||||
<script src="../js/put-anchors.js"></script>
|
||||
</body>
|
||||
<body></body>
|
||||
</html>
|
||||
|
@ -2,257 +2,7 @@
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<meta name="description" content />
|
||||
<meta name="author" content />
|
||||
<title>Megaparsec | Indentation-sensitive parsing</title>
|
||||
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
|
||||
<link rel="stylesheet" type="text/css" href="../css/megaparsec.css" />
|
||||
<meta http-equiv="refresh" content="0; url=https://markkarpov.com/megaparsec/indentation-sensitive-parsing.html">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="navbar navbar-default navbar-static-top" role="navigation">
|
||||
<div class="container-fluid">
|
||||
<div class="navbar-header">
|
||||
<a class="navbar-brand" href="../">
|
||||
Megaparsec
|
||||
</a>
|
||||
</div>
|
||||
<div class="navbar-right">
|
||||
<ul class="nav navbar-nav">
|
||||
|
||||
<li>
|
||||
<a href="../tutorials.html">Tutorials</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://hackage.haskell.org/package/megaparsec">Hackage</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec">GitHub</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec-site">Edit the site</a>
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="container">
|
||||
<div class="row-fluid">
|
||||
<div class="col-sm-10 col-sm-offset-1 col-md-8 col-md-offset-2 main">
|
||||
<div class="page-header">
|
||||
<h1>Indentation-sensitive parsing
|
||||
|
||||
<br />
|
||||
<small>
|
||||
Native, composable solution
|
||||
</small>
|
||||
|
||||
</h1>
|
||||
<hr />
|
||||
<div class="content">
|
||||
|
||||
<em>Last updated on May 25, 2017</em>
|
||||
<hr />
|
||||
|
||||
|
||||
<p>Megaparsec 4.3.0 introduces new combinators that should be of some use when you want to parse indentation-sensitive input. Megaparsec 5.0.0 adds support for line-folds, completing support for indentation-sensitive parsing. This tutorial shows how these new tools work, compose, and hopefully, <em>feel natural</em>—something we cannot say about ad-hoc solutions to this problem that exist as separate packages to work on top of Parsec, for example.</p>
|
||||
<ol style="list-style-type: decimal">
|
||||
<li><a href="#combinator-overview">Combinator overview</a></li>
|
||||
<li><a href="#parsing-a-simple-indented-list">Parsing a simple indented list</a></li>
|
||||
<li><a href="#nested-indented-list">Nested indented list</a></li>
|
||||
<li><a href="#adding-line-folds">Adding line folds</a></li>
|
||||
<li><a href="#conclusion">Conclusion</a></li>
|
||||
</ol>
|
||||
<h2 id="combinator-overview">Combinator overview</h2>
|
||||
<p>From the first release of Megaparsec, there has been the <code>indentGuard</code> function, which is a great shortcut, but something of a pain to use for complex tasks. So we won’t cover it here; instead we will talk about the new combinators built upon it and available since Megaparsec 4.3.0.</p>
|
||||
<p>First, we have <code>indentLevel</code>, which is defined just as:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">indentLevel ::</span> <span class="dt">MonadParsec</span> e s m <span class="ot">=></span> m <span class="dt">Pos</span>
|
||||
indentLevel <span class="fu">=</span> sourceColumn <span class="fu"><$></span> getPosition</code></pre></div>
|
||||
<p>That’s right, it’s just a shortcut, but I found myself using this idiom so often that I included it in the public lexer API.</p>
|
||||
<p>Second, we have <code>nonIndented</code>. It lets you make sure that some input is not indented. Just wrap a parser in <code>nonIndented</code> and you’re done.</p>
|
||||
<p><code>nonIndented</code> is trivial to write as well:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">nonIndented ::</span> <span class="dt">MonadParsec</span> e s m
|
||||
<span class="ot">=></span> m () <span class="co">-- ^ How to consume indentation (white space)</span>
|
||||
<span class="ot">-></span> m a <span class="co">-- ^ How to parse actual data</span>
|
||||
<span class="ot">-></span> m a
|
||||
nonIndented sc p <span class="fu">=</span> indentGuard sc <span class="dt">EQ</span> (unsafePos <span class="dv">1</span>) <span class="fu">*></span> p</code></pre></div>
|
||||
<p>However, it’s a part of a logical model behind high-level parsing of indentation-sensitive input. We state that there are top-level items that are not indented (<code>nonIndented</code> helps to define parsers for them), and that all indented tokens are directly or indirectly “children” of those top-level definitions. In Megaparsec, we don’t need any additional state to express this. Since indentation is always relative, our idea is to explicitly tie together parsers for “reference” tokens and indented tokens, thus defining an indentation-sensitive grammar via pure combination of parsers, just like all the other tools in Megaparsec work. This is different from old solutions built on top of Parsec, where you had to deal with ad-hoc state. It’s also more robust and safer, because the less state you have, the better.</p>
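<p>Here is a minimal sketch of <code>nonIndented</code> in action, assuming the Megaparsec 5 lexer module. The parser <code>topLevel</code> is a made-up example of a top-level item, and the space consumer is deliberately simple (no comments).</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">module Main where

import Control.Applicative (empty, some)
import Control.Monad (void)
import Text.Megaparsec
import Text.Megaparsec.String
import qualified Text.Megaparsec.Lexer as L

-- Space consumer that also eats newlines; no comment syntax here.
scn :: Parser ()
scn = L.space (void spaceChar) empty empty

-- Hypothetical top-level item: a word that must start in column 1.
topLevel :: Parser String
topLevel = L.nonIndented scn (some letterChar)

main :: IO ()
main = do
  -- Succeeds: "foo" starts in the first column.
  parseTest topLevel "foo"
  -- Fails with an indentation error: "foo" starts in column 3.
  parseTest topLevel "  foo"</code></pre></div>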
|
||||
<p>So, how do you define an indented block? Let’s take a look at the signature of the <code>indentBlock</code> helper:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">indentBlock ::</span> (<span class="dt">MonadParsec</span> e s m, <span class="dt">Token</span> s <span class="fu">~</span> <span class="dt">Char</span>)
|
||||
<span class="ot">=></span> m () <span class="co">-- ^ How to consume indentation (white space)</span>
|
||||
<span class="ot">-></span> m (<span class="dt">IndentOpt</span> m a b) <span class="co">-- ^ How to parse “reference” token</span>
|
||||
<span class="ot">-></span> m a</code></pre></div>
|
||||
<p>First, we specify how to consume indentation. An important thing to note here is that this space-consuming parser <em>must</em> consume newlines as well, while tokens (“reference” token and indented tokens) should not normally consume newlines after them.</p>
|
||||
<p>As you can see, the second argument allows us to parse a “reference” token and return a data structure that tells <code>indentBlock</code> what to do next. There are several options:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">data</span> <span class="dt">IndentOpt</span> m a b
|
||||
  <span class="fu">=</span> <span class="dt">IndentNone</span> a
|
||||
    <span class="co">-- ^ Parse no indented tokens, just return the value</span>
|
||||
  <span class="fu">|</span> <span class="dt">IndentMany</span> (<span class="dt">Maybe</span> <span class="dt">Pos</span>) ([b] <span class="ot">-></span> m a) (m b)
|
||||
    <span class="co">-- ^ Parse many indented tokens (possibly zero), use given indentation</span>
|
||||
    <span class="co">-- level (if 'Nothing', use level of the first indented token); the</span>
|
||||
    <span class="co">-- second argument tells how to get final result, and third argument</span>
|
||||
    <span class="co">-- describes how to parse an indented token</span>
|
||||
  <span class="fu">|</span> <span class="dt">IndentSome</span> (<span class="dt">Maybe</span> <span class="dt">Pos</span>) ([b] <span class="ot">-></span> m a) (m b)
|
||||
    <span class="co">-- ^ Just like 'IndentMany', but requires at least one indented token to</span>
|
||||
    <span class="co">-- be present</span></code></pre></div>
|
||||
<p>We can change our mind and parse no indented tokens, we can parse <em>many</em> (that is, possibly zero) indented tokens, or we can require <em>at least one</em> such token. We can either let <code>indentBlock</code> detect the indentation level of the first indented token and use that, or specify the indentation level manually. This should be flexible enough.</p>
|
||||
<h2 id="parsing-a-simple-indented-list">Parsing a simple indented list</h2>
|
||||
<p>Now it’s time to put our new tools into practice. In this section, we will parse a simple indented list of some items. Let’s begin with the import section:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">{-# LANGUAGE TupleSections #-}</span>
|
||||
|
||||
<span class="kw">module</span> <span class="dt">Main</span> (main) <span class="kw">where</span>
|
||||
|
||||
<span class="kw">import </span><span class="dt">Control.Applicative</span> (empty)
|
||||
<span class="kw">import </span><span class="dt">Control.Monad</span> (void)
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec</span>
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec.String</span>
|
||||
<span class="kw">import qualified</span> <span class="dt">Text.Megaparsec.Lexer</span> <span class="kw">as</span> <span class="dt">L</span></code></pre></div>
|
||||
<p>We will need two kinds of space consumers: one that consumes newlines, <code>scn</code>, and one that doesn’t, <code>sc</code> (here it only parses spaces and tabs):</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">lineComment ::</span> <span class="dt">Parser</span> ()
|
||||
lineComment <span class="fu">=</span> L.skipLineComment <span class="st">"#"</span>
|
||||
|
||||
<span class="ot">scn ::</span> <span class="dt">Parser</span> ()
|
||||
scn <span class="fu">=</span> L.space (void spaceChar) lineComment empty
|
||||
|
||||
<span class="ot">sc ::</span> <span class="dt">Parser</span> ()
|
||||
sc <span class="fu">=</span> L.space (void <span class="fu">$</span> oneOf <span class="st">" \t"</span>) lineComment empty
|
||||
|
||||
<span class="ot">lexeme ::</span> <span class="dt">Parser</span> a <span class="ot">-></span> <span class="dt">Parser</span> a
|
||||
lexeme <span class="fu">=</span> L.lexeme sc</code></pre></div>
|
||||
<p>Just for fun, we will allow line comments that start with <code>#</code> as well.</p>
|
||||
<p>Assuming <code>pItemList</code> parses the entire list, we can define the high-level parser as:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">parser ::</span> <span class="dt">Parser</span> (<span class="dt">String</span>, [<span class="dt">String</span>])
|
||||
parser <span class="fu">=</span> pItemList <span class="fu"><*</span> eof</code></pre></div>
|
||||
<p>This will make it consume all input.</p>
|
||||
<p><code>pItemList</code> is a top-level form that itself is a combination of “reference” token (header of list) and indented tokens (list items), so:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">pItemList ::</span> <span class="dt">Parser</span> (<span class="dt">String</span>, [<span class="dt">String</span>]) <span class="co">-- header and list items</span>
|
||||
pItemList <span class="fu">=</span> L.nonIndented scn (L.indentBlock scn p)
|
||||
  <span class="kw">where</span>
|
||||
    p <span class="fu">=</span> <span class="kw">do</span>
|
||||
      header <span class="ot"><-</span> pItem
|
||||
      return (<span class="dt">L.IndentMany</span> <span class="dt">Nothing</span> (return <span class="fu">.</span> (header, )) pItem)</code></pre></div>
|
||||
<p>For our purposes, an item is a sequence of alpha-numeric characters and dashes:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">pItem ::</span> <span class="dt">Parser</span> <span class="dt">String</span>
|
||||
pItem <span class="fu">=</span> lexeme <span class="fu">$</span> some (alphaNumChar <span class="fu"><|></span> char <span class="ch">'-'</span>)</code></pre></div>
|
||||
<p>Now, load the code into GHCi and try it with the help of the built-in <code>parseTest</code>:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">λ<span class="fu">></span> parseTest parser <span class="st">""</span>
|
||||
<span class="dv">1</span><span class="fu">:</span><span class="dv">1</span><span class="fu">:</span>
|
||||
unexpected end <span class="kw">of</span> input
|
||||
expecting <span class="ch">'-'</span> or alphanumeric character
|
||||
λ<span class="fu">></span> parseTest parser <span class="st">"something"</span>
|
||||
(<span class="st">"something"</span>,[])
|
||||
λ<span class="fu">></span> parseTest parser <span class="st">"  something"</span>
|
||||
<span class="dv">1</span><span class="fu">:</span><span class="dv">3</span><span class="fu">:</span>
|
||||
incorrect indentation (got <span class="dv">3</span>, should be equal to <span class="dv">1</span>)
|
||||
λ<span class="fu">></span> parseTest parser <span class="st">"something\none\ntwo\nthree"</span>
|
||||
<span class="dv">2</span><span class="fu">:</span><span class="dv">1</span><span class="fu">:</span>
|
||||
unexpected <span class="ch">'o'</span>
|
||||
expecting end <span class="kw">of</span> input</code></pre></div>
|
||||
<p>Remember that we’re using the <code>IndentMany</code> option, so empty lists are OK. On the other hand, the built-in combinator <code>space</code> hides the phrase “expecting more space” from error messages (usually you don’t want it because it adds noise to all messages), so this error message is perfectly reasonable.</p>
|
||||
<p>Let’s continue:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">λ<span class="fu">></span> parseTest parser <span class="st">"something\n  one\n    two\n  three"</span>
|
||||
<span class="dv">3</span><span class="fu">:</span><span class="dv">5</span><span class="fu">:</span>
|
||||
incorrect indentation (got <span class="dv">5</span>, should be equal to <span class="dv">3</span>)
|
||||
λ<span class="fu">></span> parseTest parser <span class="st">"something\n  one\n  two\n three"</span>
|
||||
<span class="dv">4</span><span class="fu">:</span><span class="dv">2</span><span class="fu">:</span>
|
||||
incorrect indentation (got <span class="dv">2</span>, should be equal to <span class="dv">3</span>)
|
||||
λ<span class="fu">></span> parseTest parser <span class="st">"something\n  one\n  two\n  three"</span>
|
||||
(<span class="st">"something"</span>,[<span class="st">"one"</span>,<span class="st">"two"</span>,<span class="st">"three"</span>])</code></pre></div>
|
||||
<p>This definitely seems to work. Let’s replace <code>IndentMany</code> with <code>IndentSome</code> and <code>Nothing</code> with <code>Just (unsafePos 5)</code> (indentation levels are counted from 1, so it will require 4 spaces before indented items):</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">pItemList ::</span> <span class="dt">Parser</span> (<span class="dt">String</span>, [<span class="dt">String</span>])
|
||||
pItemList <span class="fu">=</span> L.nonIndented scn (L.indentBlock scn p)
|
||||
  <span class="kw">where</span>
|
||||
    p <span class="fu">=</span> <span class="kw">do</span>
|
||||
      header <span class="ot"><-</span> pItem
|
||||
      return (<span class="dt">L.IndentSome</span> (<span class="dt">Just</span> (unsafePos <span class="dv">5</span>)) (return <span class="fu">.</span> (header, )) pItem)</code></pre></div>
|
||||
<p>Now:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">λ<span class="fu">></span> parseTest parser <span class="st">"something\n"</span>
|
||||
<span class="dv">2</span><span class="fu">:</span><span class="dv">1</span><span class="fu">:</span>
|
||||
incorrect indentation (got <span class="dv">1</span>, should be greater than <span class="dv">1</span>)
|
||||
λ<span class="fu">></span> parseTest parser <span class="st">"something\n  one"</span>
|
||||
<span class="dv">2</span><span class="fu">:</span><span class="dv">3</span><span class="fu">:</span>
|
||||
incorrect indentation (got <span class="dv">3</span>, should be equal to <span class="dv">5</span>)
|
||||
λ<span class="fu">></span> parseTest parser <span class="st">"something\n    one"</span>
|
||||
(<span class="st">"something"</span>,[<span class="st">"one"</span>])</code></pre></div>
|
||||
<p>The first message may be a bit surprising, but Megaparsec knows that there must be at least one item in the list, so it checks the indentation level, finds that it is 1, which is incorrect, and reports it.</p>
|
||||
<h2 id="nested-indented-list">Nested indented list</h2>
|
||||
<p>What I like about <code>indentBlock</code> is that another <code>indentBlock</code> can be put inside of it and the whole thing will work smoothly, parsing more complex input with several levels of indentation. No additional effort is required.</p>
|
||||
<p>Let’s allow list items to have sub-items. For this we will need a new parser, <code>pComplexItem</code> (looks familiar…):</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">pComplexItem ::</span> <span class="dt">Parser</span> (<span class="dt">String</span>, [<span class="dt">String</span>])
|
||||
pComplexItem <span class="fu">=</span> L.indentBlock scn p
|
||||
  <span class="kw">where</span>
|
||||
    p <span class="fu">=</span> <span class="kw">do</span>
|
||||
      header <span class="ot"><-</span> pItem
|
||||
      return (<span class="dt">L.IndentMany</span> <span class="dt">Nothing</span> (return <span class="fu">.</span> (header, )) pItem)</code></pre></div>
|
||||
<p>A couple of edits to <code>pItemList</code> (we’re now parsing more complex stuff, so we need to reflect this in the type signatures):</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">parser ::</span> <span class="dt">Parser</span> (<span class="dt">String</span>, [(<span class="dt">String</span>, [<span class="dt">String</span>])])
|
||||
parser <span class="fu">=</span> pItemList <span class="fu"><*</span> eof
|
||||
|
||||
<span class="ot">pItemList ::</span> <span class="dt">Parser</span> (<span class="dt">String</span>, [(<span class="dt">String</span>, [<span class="dt">String</span>])])
|
||||
pItemList <span class="fu">=</span> L.nonIndented scn (L.indentBlock scn p)
|
||||
  <span class="kw">where</span>
|
||||
    p <span class="fu">=</span> <span class="kw">do</span>
|
||||
      header <span class="ot"><-</span> pItem
|
||||
      return (<span class="dt">L.IndentSome</span> <span class="dt">Nothing</span> (return <span class="fu">.</span> (header, )) pComplexItem)</code></pre></div>
|
||||
<p>If I feed something like this:</p>
|
||||
<pre><code>first-chapter
|
||||
  paragraph-one
|
||||
    note-A # an important note here!
|
||||
    note-B
|
||||
  paragraph-two
|
||||
    note-1
|
||||
    note-2
|
||||
  paragraph-three</code></pre>
|
||||
<p>…into our parser, I get:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="dt">Right</span>
|
||||
  ( <span class="st">"first-chapter"</span>
|
||||
  , [ (<span class="st">"paragraph-one"</span>,   [<span class="st">"note-A"</span>,<span class="st">"note-B"</span>])
|
||||
    , (<span class="st">"paragraph-two"</span>,   [<span class="st">"note-1"</span>,<span class="st">"note-2"</span>])
|
||||
    , (<span class="st">"paragraph-three"</span>, []) ] )</code></pre></div>
|
||||
<p>And this looks like it works!</p>
|
||||
<h2 id="adding-line-folds">Adding line folds</h2>
|
||||
<p>The <code>lineFold</code> helper was introduced in Megaparsec 5.0.0. A line fold consists of several elements that can be put on one line or on several lines as long as the indentation level of subsequent items is greater than the indentation level of the first item.</p>
|
||||
<p>Let’s make use of <code>lineFold</code> and add line folds to our program.</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">pComplexItem ::</span> <span class="dt">Parser</span> (<span class="dt">String</span>, [<span class="dt">String</span>])
|
||||
pComplexItem <span class="fu">=</span> L.indentBlock scn p
|
||||
  <span class="kw">where</span>
|
||||
    p <span class="fu">=</span> <span class="kw">do</span>
|
||||
      header <span class="ot"><-</span> pItem
|
||||
      return (<span class="dt">L.IndentMany</span> <span class="dt">Nothing</span> (return <span class="fu">.</span> (header, )) pLineFold)
|
||||
|
||||
<span class="ot">pLineFold ::</span> <span class="dt">Parser</span> <span class="dt">String</span>
|
||||
pLineFold <span class="fu">=</span> L.lineFold scn <span class="fu">$</span> \sc' <span class="ot">-></span>
|
||||
  <span class="kw">let</span> ps <span class="fu">=</span> some (alphaNumChar <span class="fu"><|></span> char <span class="ch">'-'</span>) <span class="ot">`sepBy1`</span> try sc'
|
||||
  <span class="kw">in</span> unwords <span class="fu"><$></span> ps <span class="fu"><*</span> sc</code></pre></div>
|
||||
<p><code>lineFold</code> works like this: you give it a space consumer that accepts newlines and it gives you a special space consumer that you can use in the callback to consume space between elements of the line fold. An important thing here is that you should use a normal space consumer at the end of the line fold, or your fold will have no end.</p>
|
||||
<p>Playing with the final version of our parser is left as an exercise for the reader. You can create “items” that consist of multiple words, and as long as they are “line-folded” they will be parsed and concatenated with a single space between them.</p>
|
||||
<h2 id="conclusion">Conclusion</h2>
|
||||
<p>Note that every sub-list behaves independently; you will see that if you try to feed the parser various kinds of malformed data. This is no surprise, since no state is shared between different parts of the structure. It is simply assembled from simpler parts, which makes for an elegant solution in the spirit of the rest of the library.</p>
|
||||
|
||||
|
||||
<hr />
|
||||
<p>
|
||||
(Psst! Looking for the source code for this tutorial?
|
||||
It's <a href="https://github.com/mrkkrp/megaparsec-site/tree/master/tutorial-code/IndentationSensitiveParsing.hs">here</a>.)
|
||||
</p>
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.min.js"></script>
|
||||
<script src="../js/put-anchors.js"></script>
|
||||
</body>
|
||||
<body></body>
|
||||
</html>
|
||||
|
@ -2,304 +2,7 @@
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<meta name="description" content />
|
||||
<meta name="author" content />
|
||||
<title>Megaparsec | Parsing a simple imperative language</title>
|
||||
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
|
||||
<link rel="stylesheet" type="text/css" href="../css/megaparsec.css" />
|
||||
<meta http-equiv="refresh" content="0; url=https://markkarpov.com/megaparsec/parsing-simple-imperative-language.html">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="navbar navbar-default navbar-static-top" role="navigation">
|
||||
<div class="container-fluid">
|
||||
<div class="navbar-header">
|
||||
<a class="navbar-brand" href="../">
|
||||
Megaparsec
|
||||
</a>
|
||||
</div>
|
||||
<div class="navbar-right">
|
||||
<ul class="nav navbar-nav">
|
||||
|
||||
<li>
|
||||
<a href="../tutorials.html">Tutorials</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://hackage.haskell.org/package/megaparsec">Hackage</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec">GitHub</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec-site">Edit the site</a>
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="container">
|
||||
<div class="row-fluid">
|
||||
<div class="col-sm-10 col-sm-offset-1 col-md-8 col-md-offset-2 main">
|
||||
<div class="page-header">
|
||||
<h1>Parsing a simple imperative language
|
||||
|
||||
<br />
|
||||
<small>
|
||||
Based on original Parsec tutorial
|
||||
</small>
|
||||
|
||||
</h1>
|
||||
<hr />
|
||||
<div class="content">
|
||||
|
||||
<em>Last updated on May 25, 2017</em>
|
||||
<hr />
|
||||
|
||||
|
||||
<p>This tutorial shows how to parse a subset of a simple imperative programming language called <em>WHILE</em> (introduced in the book “Principles of Program Analysis” by Nielson, Nielson and Hankin). It includes only a few statements and basic boolean/arithmetic expressions, which makes it nice material for a tutorial.</p>
|
||||
<ol style="list-style-type: decimal">
|
||||
<li><a href="#imports">Imports</a></li>
|
||||
<li><a href="#the-language">The language</a></li>
|
||||
<li><a href="#data-structures">Data structures</a></li>
|
||||
<li><a href="#lexer">Lexer</a></li>
|
||||
<li><a href="#parser">Parser</a></li>
|
||||
<li><a href="#expressions">Expressions</a></li>
|
||||
<li><a href="#notes">Notes</a></li>
|
||||
</ol>
|
||||
<h2 id="imports">Imports</h2>
|
||||
<p>First let’s import the necessary modules:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">module</span> <span class="dt">Main</span> (main) <span class="kw">where</span>
|
||||
|
||||
<span class="kw">import </span><span class="dt">Control.Monad</span> (void)
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec</span>
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec.Expr</span>
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec.String</span> <span class="co">-- input stream is of the type ‘String’</span>
|
||||
<span class="kw">import qualified</span> <span class="dt">Text.Megaparsec.Lexer</span> <span class="kw">as</span> <span class="dt">L</span></code></pre></div>
|
||||
<h2 id="the-language">The language</h2>
|
||||
<p>The grammar for expressions is defined as follows:</p>
|
||||
<pre><code>a ::= x | n | - a | a opa a
|
||||
b ::= true | false | not b | b opb b | a opr a
|
||||
opa ::= + | - | * | /
|
||||
opb ::= and | or
|
||||
opr ::= > | <</code></pre>
|
||||
<p>Note that we have three groups of operators—arithmetic, boolean and relational ones.</p>
|
||||
<p>And now the definition of statements:</p>
|
||||
<pre><code>S ::= x := a | skip | S1; S2 | ( S ) | if b then S1 else S2 | while b do S</code></pre>
|
||||
<p>We probably want to parse that into some internal representation of the language (an <a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree">abstract syntax tree</a>). Therefore we need to define the data structures for the expressions and statements.</p>
|
||||
<h2 id="data-structures">Data structures</h2>
|
||||
<p>We need to take care of boolean and arithmetic expressions and the appropriate operators. First let’s look at the boolean expressions:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">data</span> <span class="dt">BExpr</span>
|
||||
  <span class="fu">=</span> <span class="dt">BoolConst</span> <span class="dt">Bool</span>
|
||||
  <span class="fu">|</span> <span class="dt">Not</span> <span class="dt">BExpr</span>
|
||||
  <span class="fu">|</span> <span class="dt">BBinary</span> <span class="dt">BBinOp</span> <span class="dt">BExpr</span> <span class="dt">BExpr</span>
|
||||
  <span class="fu">|</span> <span class="dt">RBinary</span> <span class="dt">RBinOp</span> <span class="dt">AExpr</span> <span class="dt">AExpr</span>
|
||||
  <span class="kw">deriving</span> (<span class="dt">Show</span>)</code></pre></div>
|
||||
<p>Binary boolean operators:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">data</span> <span class="dt">BBinOp</span>
|
||||
  <span class="fu">=</span> <span class="dt">And</span>
|
||||
  <span class="fu">|</span> <span class="dt">Or</span>
|
||||
  <span class="kw">deriving</span> (<span class="dt">Show</span>)</code></pre></div>
|
||||
<p>Relational operators:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">data</span> <span class="dt">RBinOp</span>
|
||||
  <span class="fu">=</span> <span class="dt">Greater</span>
|
||||
  <span class="fu">|</span> <span class="dt">Less</span>
|
||||
  <span class="kw">deriving</span> (<span class="dt">Show</span>)</code></pre></div>
|
||||
<p>Now we define the types for arithmetic expressions:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">data</span> <span class="dt">AExpr</span>
|
||||
  <span class="fu">=</span> <span class="dt">Var</span> <span class="dt">String</span>
|
||||
  <span class="fu">|</span> <span class="dt">IntConst</span> <span class="dt">Integer</span>
|
||||
  <span class="fu">|</span> <span class="dt">Neg</span> <span class="dt">AExpr</span>
|
||||
  <span class="fu">|</span> <span class="dt">ABinary</span> <span class="dt">ABinOp</span> <span class="dt">AExpr</span> <span class="dt">AExpr</span>
|
||||
  <span class="kw">deriving</span> (<span class="dt">Show</span>)</code></pre></div>
|
||||
<p>And arithmetic operators:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">data</span> <span class="dt">ABinOp</span>
|
||||
  <span class="fu">=</span> <span class="dt">Add</span>
|
||||
  <span class="fu">|</span> <span class="dt">Subtract</span>
|
||||
  <span class="fu">|</span> <span class="dt">Multiply</span>
|
||||
  <span class="fu">|</span> <span class="dt">Divide</span>
|
||||
  <span class="kw">deriving</span> (<span class="dt">Show</span>)</code></pre></div>
|
||||
<p>Finally let’s take care of the statements:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">data</span> <span class="dt">Stmt</span>
|
||||
  <span class="fu">=</span> <span class="dt">Seq</span> [<span class="dt">Stmt</span>]
|
||||
  <span class="fu">|</span> <span class="dt">Assign</span> <span class="dt">String</span> <span class="dt">AExpr</span>
|
||||
  <span class="fu">|</span> <span class="dt">If</span> <span class="dt">BExpr</span> <span class="dt">Stmt</span> <span class="dt">Stmt</span>
|
||||
  <span class="fu">|</span> <span class="dt">While</span> <span class="dt">BExpr</span> <span class="dt">Stmt</span>
|
||||
  <span class="fu">|</span> <span class="dt">Skip</span>
|
||||
  <span class="kw">deriving</span> (<span class="dt">Show</span>)</code></pre></div>
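<p>To see how these types fit together, here is a small sketch that builds, by hand, the tree we would expect for the program <code>x := 1; while x < 10 do x := x * 2</code>. The types are repeated so the snippet compiles on its own; the name <code>example</code> is made up for illustration and is not part of the tutorial code.</p>

```haskell
-- The tutorial's AST types, repeated so this file is self-contained.
data BExpr
  = BoolConst Bool
  | Not BExpr
  | BBinary BBinOp BExpr BExpr
  | RBinary RBinOp AExpr AExpr
  deriving (Show)

data BBinOp = And | Or deriving (Show)

data RBinOp = Greater | Less deriving (Show)

data AExpr
  = Var String
  | IntConst Integer
  | Neg AExpr
  | ABinary ABinOp AExpr AExpr
  deriving (Show)

data ABinOp = Add | Subtract | Multiply | Divide deriving (Show)

data Stmt
  = Seq [Stmt]
  | Assign String AExpr
  | If BExpr Stmt Stmt
  | While BExpr Stmt
  | Skip
  deriving (Show)

-- | Hypothetical example value: the tree for
--   x := 1; while x < 10 do x := x * 2
example :: Stmt
example = Seq
  [ Assign "x" (IntConst 1)
  , While (RBinary Less (Var "x") (IntConst 10))
          (Assign "x" (ABinary Multiply (Var "x") (IntConst 2)))
  ]

main :: IO ()
main = print example
```

<p>Relational operators take arithmetic operands but produce a boolean expression, which is why <code>RBinary</code> lives in <code>BExpr</code> while its arguments are of type <code>AExpr</code>.</p>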
|
||||
<h2 id="lexer">Lexer</h2>
|
||||
<p>Having all the data structures, we can go on to write the code that does the actual parsing. Here we will define the <em>lexemes</em> of our language. When writing a lexer for a language it’s always important to define what counts as whitespace and how it should be consumed. <code>space</code> from the <code>Text.Megaparsec.Lexer</code> module can be helpful here:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">sc ::</span> <span class="dt">Parser</span> ()
|
||||
sc <span class="fu">=</span> L.space (void spaceChar) lineCmnt blockCmnt
|
||||
  <span class="kw">where</span> lineCmnt  <span class="fu">=</span> L.skipLineComment <span class="st">"//"</span>
|
||||
        blockCmnt <span class="fu">=</span> L.skipBlockComment <span class="st">"/*"</span> <span class="st">"*/"</span></code></pre></div>
|
||||
<p><code>sc</code> stands for “space consumer”. <code>space</code> takes three arguments: a parser that parses a single whitespace character, a parser for line comments, and a parser for block (multi-line) comments. <code>skipLineComment</code> and <code>skipBlockComment</code> help with quickly creating parsers to consume the comments. (If our language didn’t have block comments, we could pass <code>empty</code> from <code>Control.Applicative</code> as the third argument of <code>space</code>.)</p>
|
||||
<p>Next, we will follow the strategy where whitespace will be consumed <em>after</em> every lexeme automatically, but not before it. Let’s define a wrapper to achieve this:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">lexeme ::</span> <span class="dt">Parser</span> a <span class="ot">-></span> <span class="dt">Parser</span> a
|
||||
lexeme <span class="fu">=</span> L.lexeme sc</code></pre></div>
|
||||
<p>Perfect. Now we can wrap any parser in <code>lexeme</code> and it will consume any trailing whitespace with <code>sc</code>.</p>
|
||||
<p>Since we often want to parse some “fixed” string, let’s define one more parser called <code>symbol</code>. It will take a string as argument and parse this string and whitespace after it.</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">symbol ::</span> <span class="dt">String</span> <span class="ot">-></span> <span class="dt">Parser</span> <span class="dt">String</span>
|
||||
symbol <span class="fu">=</span> L.symbol sc</code></pre></div>
|
||||
<p>With these tools we can create other useful parsers:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- | 'parens' parses something between parentheses.</span>
|
||||
|
||||
<span class="ot">parens ::</span> <span class="dt">Parser</span> a <span class="ot">-></span> <span class="dt">Parser</span> a
|
||||
parens <span class="fu">=</span> between (symbol <span class="st">"("</span>) (symbol <span class="st">")"</span>)
|
||||
|
||||
<span class="co">-- | 'integer' parses an integer.</span>
|
||||
|
||||
<span class="ot">integer ::</span> <span class="dt">Parser</span> <span class="dt">Integer</span>
|
||||
integer <span class="fu">=</span> lexeme L.integer
|
||||
|
||||
<span class="co">-- | 'semi' parses a semicolon.</span>
|
||||
|
||||
<span class="ot">semi ::</span> <span class="dt">Parser</span> <span class="dt">String</span>
|
||||
semi <span class="fu">=</span> symbol <span class="st">";"</span></code></pre></div>
|
||||
<p>Great. To parse various operators we can just use <code>symbol</code>, but reserved words and identifiers are a bit trickier. There are two things to note:</p>
|
||||
<ul>
|
||||
<li><p>Parsers for reserved words should check that the parsed reserved word is not a prefix of an identifier.</p></li>
|
||||
<li><p>Parsers for identifiers should check that the parsed identifier is not a reserved word.</p></li>
|
||||
</ul>
|
||||
<p>Let’s express it in code:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">rword ::</span> <span class="dt">String</span> <span class="ot">-></span> <span class="dt">Parser</span> ()
|
||||
rword w <span class="fu">=</span> string w <span class="fu">*></span> notFollowedBy alphaNumChar <span class="fu">*></span> sc
|
||||
|
||||
<span class="ot">rws ::</span> [<span class="dt">String</span>] <span class="co">-- list of reserved words</span>
|
||||
rws <span class="fu">=</span> [<span class="st">"if"</span>,<span class="st">"then"</span>,<span class="st">"else"</span>,<span class="st">"while"</span>,<span class="st">"do"</span>,<span class="st">"skip"</span>,<span class="st">"true"</span>,<span class="st">"false"</span>,<span class="st">"not"</span>,<span class="st">"and"</span>,<span class="st">"or"</span>]
|
||||
|
||||
<span class="ot">identifier ::</span> <span class="dt">Parser</span> <span class="dt">String</span>
|
||||
identifier <span class="fu">=</span> (lexeme <span class="fu">.</span> try) (p <span class="fu">>>=</span> check)
|
||||
  <span class="kw">where</span>
|
||||
    p       <span class="fu">=</span> (<span class="fu">:</span>) <span class="fu"><$></span> letterChar <span class="fu"><*></span> many alphaNumChar
|
||||
    check x <span class="fu">=</span> <span class="kw">if</span> x <span class="ot">`elem`</span> rws
|
||||
                <span class="kw">then</span> fail <span class="fu">$</span> <span class="st">"keyword "</span> <span class="fu">++</span> show x <span class="fu">++</span> <span class="st">" cannot be an identifier"</span>
|
||||
                <span class="kw">else</span> return x</code></pre></div>
|
||||
<p><code>identifier</code> may seem complex, but it’s actually simple. We parse a sequence of characters where the first character is a letter and the rest are zero or more characters, each of which is either a letter or a digit. Once we have parsed such a string, we check whether it’s in the list of reserved words, fail with an informative message if it is, and return the result otherwise.</p>
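<p>The same rule can be written as a plain function over a complete word, which may make the logic easier to see. This is only a sketch that uses nothing beyond <code>base</code>: <code>checkIdentifier</code> is a hypothetical helper, not part of the tutorial or of Megaparsec, and unlike the real parser it validates a whole string instead of consuming a prefix of the input.</p>

```haskell
import Data.Char (isAlpha, isAlphaNum)

-- | Reserved words, copied from the tutorial's 'rws'.
rws :: [String]
rws = ["if","then","else","while","do","skip","true","false","not","and","or"]

-- | Hypothetical helper: accept a word as an identifier only if it starts
-- with a letter, continues with letters or digits, and is not reserved.
checkIdentifier :: String -> Either String String
checkIdentifier "" = Left "empty input"
checkIdentifier w@(c:cs)
  | not (isAlpha c)         = Left "an identifier must start with a letter"
  | not (all isAlphaNum cs) = Left "only letters and digits may follow"
  | w `elem` rws            = Left ("keyword " ++ show w ++ " cannot be an identifier")
  | otherwise               = Right w

main :: IO ()
main = mapM_ (print . checkIdentifier) ["x1", "while", "whilst", "1x"]
```

<p>Note that <code>whilst</code> is accepted even though it begins like <code>while</code>: only an exact match with a reserved word is rejected.</p>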
|
||||
<p>Note the use of <code>try</code> in <code>identifier</code>. This is necessary to backtrack to the beginning of the identifier when <code>fail</code> is evaluated. Otherwise things like <code>many identifier</code> would fail on such identifiers instead of just stopping.</p>
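<p>Why consumed input matters can be illustrated with a toy parser written from scratch. The sketch below is <em>not</em> how Megaparsec is implemented; the types <code>Reply</code> and <code>P</code> and the functions <code>char</code>, <code>try</code> and <code>parse</code> are simplified stand-ins defined here, with no error messages or positions. It only models the rule that a failure after consuming input commits to the current branch, and that <code>try</code> makes such a failure look as if nothing was consumed.</p>

```haskell
import Control.Applicative (Alternative (..))

-- | Result of running a toy parser: a flag telling whether any input was
-- consumed, and on success the parsed value with the remaining input.
data Reply a = Reply Bool (Maybe (a, String))

newtype P a = P (String -> Reply a)

instance Functor P where
  fmap f (P p) = P $ \s -> case p s of
    Reply c r -> Reply c (fmap (\(a, rest) -> (f a, rest)) r)

instance Applicative P where
  pure a = P $ \s -> Reply False (Just (a, s))
  P pf <*> P pa = P $ \s -> case pf s of
    Reply c Nothing          -> Reply c Nothing
    Reply c (Just (f, rest)) -> case pa rest of
      Reply c' r -> Reply (c || c') (fmap (\(a, rest') -> (f a, rest')) r)

instance Alternative P where
  empty = P $ \_ -> Reply False Nothing
  P p <|> P q = P $ \s -> case p s of
    Reply False Nothing -> q s    -- failed without consuming: try the alternative
    reply               -> reply  -- success, or failure after consuming: commit

-- | Match one specific character, consuming it on success.
char :: Char -> P Char
char c = P $ \s -> case s of
  (x:xs) | x == c -> Reply True (Just (x, xs))
  _               -> Reply False Nothing

-- | Make a failing parser look as if it had consumed nothing.
try :: P a -> P a
try (P p) = P $ \s -> case p s of
  Reply _ Nothing -> Reply False Nothing
  reply           -> reply

-- | Run a toy parser and keep only the result.
parse :: P a -> String -> Maybe a
parse (P p) s = case p s of
  Reply _ r -> fmap fst r

-- Two parsers that share the first character.
ab, ac :: P String
ab = sequenceA [char 'a', char 'b']
ac = sequenceA [char 'a', char 'c']

main :: IO ()
main = do
  print (parse (ab <|> ac) "ac")      -- Nothing: 'ab' consumed 'a' before failing
  print (parse (try ab <|> ac) "ac")  -- Just "ac": 'try' lets 'ac' start over
```

<p>In <code>identifier</code> the situation is the same: by the time <code>check</code> calls <code>fail</code>, the letters of the keyword have already been consumed, so without <code>try</code> no alternative would get a chance to run.</p>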
|
||||
<p>And that’s it, we have just written the lexer for our language, now we can start writing the parser.</p>
|
||||
<h2 id="parser">Parser</h2>
|
||||
<p>As the grammar shows, a program in this language is simply a statement, so the main parser should basically only parse a statement. But remember to take care of initial whitespace: our parsers only get rid of whitespace <em>after</em> the tokens!</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">whileParser ::</span> <span class="dt">Parser</span> <span class="dt">Stmt</span>
|
||||
whileParser <span class="fu">=</span> between sc eof stmt</code></pre></div>
|
||||
<p>Because any statement might actually be a sequence of statements separated by semicolons, we use <code>sepBy1</code> to parse at least one statement. The result is a list of statements. We also allow grouping statements with parentheses, which is useful, for instance, in the <code>while</code> loop.</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">stmt ::</span> <span class="dt">Parser</span> <span class="dt">Stmt</span>
|
||||
stmt <span class="fu">=</span> parens stmt <span class="fu"><|></span> stmtSeq
|
||||
|
||||
<span class="ot">stmtSeq ::</span> <span class="dt">Parser</span> <span class="dt">Stmt</span>
|
||||
stmtSeq <span class="fu">=</span> f <span class="fu"><$></span> sepBy1 stmt' semi
|
||||
<span class="co">-- if there's only one stmt return it without using ‘Seq’</span>
|
||||
<span class="kw">where</span> f l <span class="fu">=</span> <span class="kw">if</span> length l <span class="fu">==</span> <span class="dv">1</span> <span class="kw">then</span> head l <span class="kw">else</span> <span class="dt">Seq</span> l</code></pre></div>
|
||||
<p>A single statement is quite simple: it’s either an <code>if</code> conditional, a <code>while</code> loop, an assignment, or simply a <code>skip</code> statement. We use <code><|></code> to express choice: <code>a <|> b</code> first tries parser <code>a</code>, and if it fails without consuming any input, parser <code>b</code> is used. <em>Note: this means that the order is important.</em></p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">stmt' ::</span> <span class="dt">Parser</span> <span class="dt">Stmt</span>
|
||||
stmt' <span class="fu">=</span> ifStmt <span class="fu"><|></span> whileStmt <span class="fu"><|></span> skipStmt <span class="fu"><|></span> assignStmt</code></pre></div>
|
||||
<p>If you have a parser that might fail after consuming some input, and you still want to try the next parser, you should take a look at the <code>try</code> combinator. For instance <code>try p <|> q</code> will try parsing with <code>p</code> and if it fails, even after consuming the input, the <code>q</code> parser will be used as if nothing has been consumed by <code>p</code>.</p>
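<p>A small sketch of the difference, assuming Megaparsec 5.x and two hypothetical parsers <code>ab</code> and <code>ac</code> that share a common first character:</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">import Text.Megaparsec
import Text.Megaparsec.String

-- Both alternatives start with 'a', so the first one can fail
-- after it has already consumed input.
ab, ac :: Parser Char
ab = char 'a' *> char 'b'
ac = char 'a' *> char 'c'

withoutTry, withTry :: Maybe Char
withoutTry = parseMaybe (ab <|> ac) "ac"      -- Nothing: 'ab' consumed 'a' before failing
withTry    = parseMaybe (try ab <|> ac) "ac"  -- Just 'c': 'try' rewinds, then 'ac' runs</code></pre></div>
<p>In <code>stmt'</code> no <code>try</code> is needed between the alternatives, because <code>identifier</code> already wraps its own <code>try</code> and the keyword alternatives start with distinct reserved words.</p>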
|
||||
<p>Now let’s define the parsers for all the possible statements. This is quite straightforward as we just use the parsers from the lexer and then use all the necessary information to create appropriate data structures.</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">ifStmt ::</span> <span class="dt">Parser</span> <span class="dt">Stmt</span>
|
||||
ifStmt <span class="fu">=</span> <span class="kw">do</span>
|
||||
rword <span class="st">"if"</span>
|
||||
cond <span class="ot"><-</span> bExpr
|
||||
rword <span class="st">"then"</span>
|
||||
stmt1 <span class="ot"><-</span> stmt
|
||||
rword <span class="st">"else"</span>
|
||||
stmt2 <span class="ot"><-</span> stmt
|
||||
return (<span class="dt">If</span> cond stmt1 stmt2)
|
||||
|
||||
<span class="ot">whileStmt ::</span> <span class="dt">Parser</span> <span class="dt">Stmt</span>
|
||||
whileStmt <span class="fu">=</span> <span class="kw">do</span>
|
||||
rword <span class="st">"while"</span>
|
||||
cond <span class="ot"><-</span> bExpr
|
||||
rword <span class="st">"do"</span>
|
||||
stmt1 <span class="ot"><-</span> stmt
|
||||
return (<span class="dt">While</span> cond stmt1)
|
||||
|
||||
<span class="ot">assignStmt ::</span> <span class="dt">Parser</span> <span class="dt">Stmt</span>
|
||||
assignStmt <span class="fu">=</span> <span class="kw">do</span>
|
||||
var <span class="ot"><-</span> identifier
|
||||
void (symbol <span class="st">":="</span>)
|
||||
expr <span class="ot"><-</span> aExpr
|
||||
return (<span class="dt">Assign</span> var expr)
|
||||
|
||||
<span class="ot">skipStmt ::</span> <span class="dt">Parser</span> <span class="dt">Stmt</span>
|
||||
skipStmt <span class="fu">=</span> <span class="dt">Skip</span> <span class="fu"><$</span> rword <span class="st">"skip"</span></code></pre></div>
|
||||
<h2 id="expressions">Expressions</h2>
|
||||
<p>What’s left is to parse the expressions. Fortunately Megaparsec provides an easy way to do that. Let’s define the arithmetic and boolean expressions:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">aExpr ::</span> <span class="dt">Parser</span> <span class="dt">AExpr</span>
|
||||
aExpr <span class="fu">=</span> makeExprParser aTerm aOperators
|
||||
|
||||
<span class="ot">bExpr ::</span> <span class="dt">Parser</span> <span class="dt">BExpr</span>
|
||||
bExpr <span class="fu">=</span> makeExprParser bTerm bOperators</code></pre></div>
|
||||
<p>Now we have to define the lists with operator precedence, associativity and what constructors to use in each case.</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">aOperators ::</span> [[<span class="dt">Operator</span> <span class="dt">Parser</span> <span class="dt">AExpr</span>]]
|
||||
aOperators <span class="fu">=</span>
|
||||
[ [<span class="dt">Prefix</span> (<span class="dt">Neg</span> <span class="fu"><$</span> symbol <span class="st">"-"</span>) ]
|
||||
, [ <span class="dt">InfixL</span> (<span class="dt">ABinary</span> <span class="dt">Multiply</span> <span class="fu"><$</span> symbol <span class="st">"*"</span>)
|
||||
, <span class="dt">InfixL</span> (<span class="dt">ABinary</span> <span class="dt">Divide</span> <span class="fu"><$</span> symbol <span class="st">"/"</span>) ]
|
||||
, [ <span class="dt">InfixL</span> (<span class="dt">ABinary</span> <span class="dt">Add</span> <span class="fu"><$</span> symbol <span class="st">"+"</span>)
|
||||
, <span class="dt">InfixL</span> (<span class="dt">ABinary</span> <span class="dt">Subtract</span> <span class="fu"><$</span> symbol <span class="st">"-"</span>) ]
|
||||
]
|
||||
|
||||
<span class="ot">bOperators ::</span> [[<span class="dt">Operator</span> <span class="dt">Parser</span> <span class="dt">BExpr</span>]]
|
||||
bOperators <span class="fu">=</span>
|
||||
[ [<span class="dt">Prefix</span> (<span class="dt">Not</span> <span class="fu"><$</span> rword <span class="st">"not"</span>) ]
|
||||
, [<span class="dt">InfixL</span> (<span class="dt">BBinary</span> <span class="dt">And</span> <span class="fu"><$</span> rword <span class="st">"and"</span>)
|
||||
, <span class="dt">InfixL</span> (<span class="dt">BBinary</span> <span class="dt">Or</span> <span class="fu"><$</span> rword <span class="st">"or"</span>) ]
|
||||
]</code></pre></div>
|
||||
<p>In the case of prefix operators it is enough to specify which one should be parsed and what is the associated data constructor. Infix operators are defined similarly, but there are several variants of infix constructors for various associativity options. Note that the operator precedence depends only on the order of the elements in the list.</p>
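<p>To illustrate how the position in the table determines precedence, here is a minimal sketch that evaluates expressions directly to <code>Integer</code> instead of building <code>AExpr</code> values. That simplification, the comment-free space consumer, and the names <code>calc</code>, <code>calcTerm</code>, and <code>calcTable</code> are assumptions made for this example (Megaparsec 5.x):</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">import Control.Applicative (empty)
import Control.Monad (void)
import Text.Megaparsec
import Text.Megaparsec.Expr
import Text.Megaparsec.String
import qualified Text.Megaparsec.Lexer as L

sc :: Parser ()
sc = L.space (void spaceChar) empty empty

symbol :: String -> Parser String
symbol = L.symbol sc

calcTerm :: Parser Integer
calcTerm = between (symbol "(") (symbol ")") calc
       <|> L.lexeme sc L.integer

-- Earlier rows bind tighter: negation, then * and /, then + and -.
calcTable :: [[Operator Parser Integer]]
calcTable =
  [ [ Prefix (negate <$ symbol "-") ]
  , [ InfixL ((*) <$ symbol "*")
    , InfixL (div <$ symbol "/") ]
  , [ InfixL ((+) <$ symbol "+")
    , InfixL ((-) <$ symbol "-") ]
  ]

calc :: Parser Integer
calc = makeExprParser calcTerm calcTable</code></pre></div>
<p>For example, <code>parseMaybe calc "1 + 2 * 3"</code> should give <code>Just 7</code>, while <code>parseMaybe calc "(1 + 2) * 3"</code> should give <code>Just 9</code>.</p>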
|
||||
<p>Finally we have to define the terms. In the case of arithmetic expressions, it is quite simple:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">aTerm ::</span> <span class="dt">Parser</span> <span class="dt">AExpr</span>
|
||||
aTerm <span class="fu">=</span> parens aExpr
|
||||
<span class="fu"><|></span> <span class="dt">Var</span> <span class="fu"><$></span> identifier
|
||||
<span class="fu"><|></span> <span class="dt">IntConst</span> <span class="fu"><$></span> integer</code></pre></div>
|
||||
<p>However, a term in a boolean expression is a bit more tricky. In this case, a term can also be a relational expression, that is, two arithmetic expressions joined by a relational operator.</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">bTerm ::</span> <span class="dt">Parser</span> <span class="dt">BExpr</span>
|
||||
bTerm <span class="fu">=</span> parens bExpr
|
||||
<span class="fu"><|></span> (rword <span class="st">"true"</span> <span class="fu">*></span> pure (<span class="dt">BoolConst</span> <span class="dt">True</span>))
|
||||
<span class="fu"><|></span> (rword <span class="st">"false"</span> <span class="fu">*></span> pure (<span class="dt">BoolConst</span> <span class="dt">False</span>))
|
||||
<span class="fu"><|></span> rExpr</code></pre></div>
|
||||
<p>Therefore we have to define a parser for relational expressions:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">rExpr ::</span> <span class="dt">Parser</span> <span class="dt">BExpr</span>
|
||||
rExpr <span class="fu">=</span> <span class="kw">do</span>
|
||||
a1 <span class="ot"><-</span> aExpr
|
||||
op <span class="ot"><-</span> relation
|
||||
a2 <span class="ot"><-</span> aExpr
|
||||
return (<span class="dt">RBinary</span> op a1 a2)
|
||||
|
||||
<span class="ot">relation ::</span> <span class="dt">Parser</span> <span class="dt">RBinOp</span>
|
||||
relation <span class="fu">=</span> (symbol <span class="st">">"</span> <span class="fu">*></span> pure <span class="dt">Greater</span>)
|
||||
<span class="fu"><|></span> (symbol <span class="st">"<"</span> <span class="fu">*></span> pure <span class="dt">Less</span>)</code></pre></div>
|
||||
<p>And that’s it. We have a quite simple parser which is able to parse a few statements and arithmetic/boolean expressions.</p>
|
||||
<h2 id="notes">Notes</h2>
|
||||
<p>If you want to experiment with the parser inside GHCi, these functions might be handy:</p>
|
||||
<ul>
|
||||
<li><code>parseTest p input</code> applies parser <code>p</code> on input <code>input</code> and prints results.</li>
|
||||
</ul>
|
||||
<hr />
|
||||
<p>Original Parsec tutorial in Haskell Wiki:</p>
|
||||
<p><a href="https://wiki.haskell.org/Parsing_a_simple_imperative_language" class="uri">https://wiki.haskell.org/Parsing_a_simple_imperative_language</a></p>
|
||||
|
||||
|
||||
<hr />
|
||||
<p>
|
||||
(Psst! Looking for the source code for this tutorial?
|
||||
It's <a href="https://github.com/mrkkrp/megaparsec-site/tree/master/tutorial-code/ParsingWhile.hs">here</a>.)
|
||||
</p>
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.min.js"></script>
|
||||
<script src="../js/put-anchors.js"></script>
|
||||
</body>
|
||||
<body></body>
|
||||
</html>
|
||||
|
@ -2,263 +2,7 @@
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<meta name="description" content />
|
||||
<meta name="author" content />
|
||||
<title>Megaparsec | Switch from Parsec to Megaparsec</title>
|
||||
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
|
||||
<link rel="stylesheet" type="text/css" href="../css/megaparsec.css" />
|
||||
<meta http-equiv="refresh" content="0; url=https://markkarpov.com/megaparsec/switch-from-parsec-to-megaparsec.html">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="navbar navbar-default navbar-static-top" role="navigation">
|
||||
<div class="container-fluid">
|
||||
<div class="navbar-header">
|
||||
<a class="navbar-brand" href="../">
|
||||
Megaparsec
|
||||
</a>
|
||||
</div>
|
||||
<div class="navbar-right">
|
||||
<ul class="nav navbar-nav">
|
||||
|
||||
<li>
|
||||
<a href="../tutorials.html">Tutorials</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://hackage.haskell.org/package/megaparsec">Hackage</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec">GitHub</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec-site">Edit the site</a>
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="container">
|
||||
<div class="row-fluid">
|
||||
<div class="col-sm-10 col-sm-offset-1 col-md-8 col-md-offset-2 main">
|
||||
<div class="page-header">
|
||||
<h1>Switch from Parsec to Megaparsec
|
||||
|
||||
<br />
|
||||
<small>
|
||||
Practical recommendations
|
||||
</small>
|
||||
|
||||
</h1>
|
||||
<hr />
|
||||
<div class="content">
|
||||
|
||||
<em>Last updated on May 25, 2017</em>
|
||||
<hr />
|
||||
|
||||
|
||||
<p>Some progressive Haskell hackers may wish to switch from Parsec to Megaparsec. This tutorial explains the practical differences between the two libraries that you will need to address if you choose to undertake the switch. Remember, all the functionality available in Parsec is available in Megaparsec and often in a better form.</p>
|
||||
<ol style="list-style-type: decimal">
|
||||
<li><a href="#imports">Imports</a></li>
|
||||
<li><a href="#renamed-things">Renamed things</a></li>
|
||||
<li><a href="#removed-things">Removed things</a></li>
|
||||
<li><a href="#completely-changed-things">Completely changed things</a></li>
|
||||
<li><a href="#other">Other</a></li>
|
||||
<li><a href="#character-parsing">Character parsing</a></li>
|
||||
<li><a href="#expression-parsing">Expression parsing</a></li>
|
||||
<li><a href="#what-happened-to-text.parsec.token">What happened to <code>Text.Parsec.Token</code>?</a></li>
|
||||
<li><a href="#whats-next">What’s next?</a></li>
|
||||
</ol>
|
||||
<h2 id="imports">Imports</h2>
|
||||
<p>You’ll mainly need to replace the “Parsec” part in your imports with “Megaparsec”. That’s pretty simple. A typical import section of a module that uses Megaparsec looks like this:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="co">-- this module contains commonly useful tools:</span>
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec</span>
|
||||
<span class="co">-- this module depends on type of data you want to parse, you only need to</span>
|
||||
<span class="co">-- import one of these:</span>
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec.String</span> <span class="co">-- if you parse ‘String’</span>
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec.ByteString</span> <span class="co">-- if you parse strict ‘ByteString’</span>
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec.ByteString.Lazy</span> <span class="co">-- if you parse lazy ‘ByteString’</span>
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec.Text</span> <span class="co">-- if you parse strict ‘Text’</span>
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec.Text.Lazy</span> <span class="co">-- if you parse lazy ‘Text’</span>
|
||||
<span class="co">-- if you need to parse permutation phrases:</span>
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec.Perm</span>
|
||||
<span class="co">-- if you need to parse expressions:</span>
|
||||
<span class="kw">import </span><span class="dt">Text.Megaparsec.Expr</span>
|
||||
<span class="co">-- if you need to parse languages:</span>
|
||||
<span class="kw">import qualified</span> <span class="dt">Text.Megaparsec.Lexer</span> <span class="kw">as</span> <span class="dt">L</span></code></pre></div>
|
||||
<p>So, the only noticeable difference is that Megaparsec has no <code>Text.Megaparsec.Token</code> module, which is replaced with <code>Text.Megaparsec.Lexer</code>. See the section <a href="#what-happened-to-text.parsec.token">“What happened to <code>Text.Parsec.Token</code>”</a> for details.</p>
|
||||
<h2 id="renamed-things">Renamed things</h2>
|
||||
<p>Megaparsec introduces a more consistent naming scheme, so some things are called differently. Renaming functions is a purely mechanical task, though. Here are the renamed items:</p>
|
||||
<ul>
|
||||
<li><code>many1</code> → <code>some</code> (re-exported from <code>Control.Applicative</code>)</li>
|
||||
<li><code>skipMany1</code> → <code>skipSome</code></li>
|
||||
<li><code>tokenPrim</code> → <code>token</code></li>
|
||||
<li><code>optionMaybe</code> → <code>optional</code> (re-exported from <code>Control.Applicative</code>)</li>
|
||||
<li><code>permute</code> → <code>makePermParser</code></li>
|
||||
<li><code>buildExpressionParser</code> → <code>makeExprParser</code></li>
|
||||
</ul>
|
||||
<p>Character parsing:</p>
|
||||
<ul>
|
||||
<li><code>alphaNum</code> → <code>alphaNumChar</code></li>
|
||||
<li><code>digit</code> → <code>digitChar</code></li>
|
||||
<li><code>endOfLine</code> → <code>eol</code></li>
|
||||
<li><code>hexDigit</code> → <code>hexDigitChar</code></li>
|
||||
<li><code>letter</code> → <code>letterChar</code></li>
|
||||
<li><code>lower</code> → <code>lowerChar</code></li>
|
||||
<li><code>octDigit</code> → <code>octDigitChar</code></li>
|
||||
<li><code>space</code> → <code>spaceChar</code> †</li>
|
||||
<li><code>spaces</code> → <code>space</code> †</li>
|
||||
<li><code>upper</code> → <code>upperChar</code></li>
|
||||
</ul>
|
||||
<p>† Pay attention to these: since <code>space</code> parses <em>many</em> <code>spaceChar</code>s, including zero, writing something like <code>many space</code> will make your parser hang. So be careful to replace <code>many space</code> with either <code>many spaceChar</code> or plain <code>space</code>.</p>
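<p>As a quick illustration of the renames, here is a sketch with the Parsec spelling shown in comments. It assumes Megaparsec 5.x; the parser names <code>digits</code> and <code>sign</code> are made up for this example:</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">import Text.Megaparsec
import Text.Megaparsec.String

-- Parsec:      many1 digit <* spaces
-- Megaparsec:  some digitChar <* space
digits :: Parser String
digits = some digitChar <* space

-- Parsec:      optionMaybe (char '-')
-- Megaparsec:  optional (char '-')
sign :: Parser (Maybe Char)
sign = optional (char '-')</code></pre></div>
<p>Here <code>parseMaybe digits "123  "</code> should return <code>Just "123"</code>: the trailing <code>space</code> consumes zero or more white space characters by itself, with no <code>many</code> around it.</p>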
|
||||
<h2 id="removed-things">Removed things</h2>
|
||||
<p>Parsec also has many names for the same or similar things. Megaparsec usually has one function per task that does its job well. Here are the items that were removed in Megaparsec and the reasons for their removal:</p>
|
||||
<ul>
|
||||
<li><p><code>parseFromFile</code>: reading from a file and then parsing its contents is trivial for every instance of <code>Stream</code>, and this function provides no way to use newer methods for running a parser, such as <code>runParser'</code>.</p></li>
|
||||
<li><p><code>getState</code>, <code>putState</code>, <code>modifyState</code>—ad-hoc backtracking user state has been eliminated.</p></li>
|
||||
<li><p><code>unexpected</code>, <code>token</code>, and <code>tokens</code>: there are now slightly different versions of these functions under the same names.</p></li>
|
||||
<li><p><code>Reply</code> and <code>Consumed</code> are not public data types anymore, because they are low-level implementation details.</p></li>
|
||||
<li><p><code>runPT</code> and <code>runP</code> were essentially synonyms for <code>runParserT</code> and <code>runParser</code> respectively.</p></li>
|
||||
<li><p><code>chainl</code>, <code>chainl1</code>, <code>chainr</code>, and <code>chainr1</code>—use <a href="https://hackage.haskell.org/package/megaparsec/docs/Text-Megaparsec-Expr.html"><code>Text.Megaparsec.Expr</code></a> instead.</p></li>
|
||||
</ul>
|
||||
<h2 id="completely-changed-things">Completely changed things</h2>
|
||||
<p>In Megaparsec 5 the modules <code>Text.Megaparsec.Pos</code> and <code>Text.Megaparsec.Error</code> are completely different from those found in Parsec and Megaparsec 4. Take some time to look at documentation of the modules if your use-case requires operations on error messages or positions. You may like the fact that we have well-typed and extensible error messages now.</p>
|
||||
<h2 id="other">Other</h2>
|
||||
<ul>
|
||||
<li><p>The <code>Stream</code> type class now has an <code>updatePos</code> method that gives precise control over advancing of textual positions during parsing.</p></li>
|
||||
<li><p>Note that argument order of <code>label</code> has been flipped (the label itself goes first now), so you can write now: <code>myParser = label "my parser" $ …</code>.</p></li>
|
||||
<li><p>Don’t use the <code>label ""</code> (or the <code>… <?> ""</code>) idiom to “hide” some “expected” tokens from error messages, use <code>hidden</code>.</p></li>
|
||||
<li><p>The new <code>token</code> parser is more powerful: its first argument provides full control over the reported error message, while its second argument lets you specify how to report a missing token in case of an empty input stream.</p></li>
|
||||
<li><p>The <code>tokens</code> parser now lets you control how tokens are compared (yes, we have a case-insensitive <code>string</code> called <code>string'</code>).</p></li>
|
||||
<li><p>The <code>unexpected</code> parser lets you specify precisely what is unexpected in a well-typed manner.</p></li>
|
||||
<li><p>Tab width is not hard-coded anymore, use <code>getTabWidth</code> and <code>setTabWidth</code> to change it. Default tab width is <code>defaultTabWidth</code>.</p></li>
|
||||
<li><p>Now you can reliably test error messages: equality for them is now defined properly (in Parsec <code>Expect "foo"</code> is equal to <code>Expect "bar"</code>), and error messages are also well-typed and customizable.</p></li>
|
||||
<li><p>To render an error message, apply <code>parseErrorPretty</code> to it.</p></li>
|
||||
<li><p><code>count' m n p</code> allows you to parse from <code>m</code> to <code>n</code> occurrences of <code>p</code>.</p></li>
|
||||
<li><p>Now you have <code>someTill</code> and <code>eitherP</code> out of the box.</p></li>
|
||||
<li><p><code>tokens</code>-based combinators like <code>string</code> and <code>string'</code> backtrack by default, so it’s not necessary to use <code>try</code> with them (starting from <code>4.4.0</code>). This feature does not affect performance.</p></li>
|
||||
<li><p>The new <code>failure</code> combinator lets you fail with an arbitrary error message, and it even lets you use your own data types.</p></li>
|
||||
</ul>
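<p>A few of the combinators mentioned in this list, shown in a short sketch. It assumes Megaparsec 5.x; the parser names are hypothetical and chosen only for illustration:</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">import Text.Megaparsec
import Text.Megaparsec.String

-- Between two and four digits, e.g. a short numeric code.
shortCode :: Parser String
shortCode = count' 2 4 digitChar

-- One or more characters up to (and consuming) a semicolon.
untilSemi :: Parser String
untilSemi = someTill anyChar (char ';')

-- Tags which of the two alternatives matched.
letterOrDigit :: Parser (Either Char Char)
letterOrDigit = eitherP letterChar digitChar

-- 'label' takes the label first, the parser second.
hexByte :: Parser String
hexByte = label "hex byte" (count 2 hexDigitChar)</code></pre></div>
<p>For instance, <code>parseMaybe shortCode "123"</code> should give <code>Just "123"</code>, and <code>parseMaybe untilSemi "abc;"</code> should give <code>Just "abc"</code>.</p>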
|
||||
<h2 id="character-parsing">Character parsing</h2>
|
||||
<p>New character parsers in <a href="https://hackage.haskell.org/package/megaparsec/docs/Text-Megaparsec-Char.html"><code>Text.Megaparsec.Char</code></a> may be useful if you work with Unicode:</p>
|
||||
<ul>
|
||||
<li><code>asciiChar</code></li>
|
||||
<li><code>charCategory</code></li>
|
||||
<li><code>controlChar</code></li>
|
||||
<li><code>latin1Char</code></li>
|
||||
<li><code>markChar</code></li>
|
||||
<li><code>numberChar</code></li>
|
||||
<li><code>printChar</code></li>
|
||||
<li><code>punctuationChar</code></li>
|
||||
<li><code>separatorChar</code></li>
|
||||
<li><code>symbolChar</code></li>
|
||||
</ul>
|
||||
<p>Ever wanted to have case-insensitive character parsers? Here you go:</p>
|
||||
<ul>
|
||||
<li><code>char'</code></li>
|
||||
<li><code>oneOf'</code></li>
|
||||
<li><code>noneOf'</code></li>
|
||||
<li><code>string'</code></li>
|
||||
</ul>
|
||||
<h2 id="expression-parsing">Expression parsing</h2>
|
||||
<p><code>makeExprParser</code> has flipped order of arguments: term parser first, operator table second. To specify associativity of infix operators you use one of the three <code>Operator</code> constructors:</p>
|
||||
<ul>
|
||||
<li><code>InfixN</code>—non-associative infix</li>
|
||||
<li><code>InfixL</code>—left-associative infix</li>
|
||||
<li><code>InfixR</code>—right-associative infix</li>
|
||||
</ul>
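<p>The effect of the associativity constructors can be sketched as follows. The example assumes Megaparsec 5.x, skips white space handling for brevity, and uses made-up names (<code>powExpr</code>, <code>powTable</code>); it is an illustration, not code from either library’s documentation:</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">import Text.Megaparsec
import Text.Megaparsec.Expr
import Text.Megaparsec.String
import qualified Text.Megaparsec.Lexer as L

powTable :: [[Operator Parser Integer]]
powTable =
  [ [ InfixR ((^) <$ char '^') ]  -- right-associative: 2^3^2 = 2^(3^2)
  , [ InfixL ((-) <$ char '-') ]  -- left-associative:  10-4-3 = (10-4)-3
  ]

-- Note the argument order: term parser first, operator table second.
powExpr :: Parser Integer
powExpr = makeExprParser L.integer powTable</code></pre></div>
<p>With this table, <code>parseMaybe powExpr "2^3^2"</code> should give <code>Just 512</code> and <code>parseMaybe powExpr "10-4-3"</code> should give <code>Just 3</code>.</p>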
|
||||
<h2 id="what-happened-to-text.parsec.token">What happened to <code>Text.Parsec.Token</code>?</h2>
|
||||
<p>That module was extremely inflexible and thus it has been eliminated. In Megaparsec you have <a href="https://hackage.haskell.org/package/megaparsec/docs/Text-Megaparsec-Lexer.html"><code>Text.Megaparsec.Lexer</code></a> instead, which doesn’t impose anything on the user but provides useful helpers. The module can also parse indentation-sensitive languages.</p>
|
||||
<p>Let’s quickly describe how you go about writing your lexer with <code>Text.Megaparsec.Lexer</code>. First, you should import the module qualified; we will use <code>L</code> as its synonym here.</p>
|
||||
<h3 id="white-space">White space</h3>
|
||||
<p>Start writing your lexer by defining what counts as <em>white space</em> in your language. <code>space</code>, <code>skipLineComment</code>, and <code>skipBlockComment</code> can be helpful:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">sc ::</span> <span class="dt">Parser</span> () <span class="co">-- ‘sc’ stands for “space consumer”</span>
|
||||
sc <span class="fu">=</span> L.space (void spaceChar) lineComment blockComment
|
||||
<span class="kw">where</span> lineComment <span class="fu">=</span> L.skipLineComment <span class="st">"//"</span>
|
||||
blockComment <span class="fu">=</span> L.skipBlockComment <span class="st">"/*"</span> <span class="st">"*/"</span></code></pre></div>
|
||||
<p>This is generally called a <em>space consumer</em>. Often you’ll need only one space consumer, but you can define as many of them as you want. Note that this new module allows you to avoid consuming newline characters automatically: just use something other than <code>void spaceChar</code> as the first argument of <code>space</code>. Even better, you can control what white space is on a per-lexeme basis:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">lexeme ::</span> <span class="dt">Parser</span> a <span class="ot">-></span> <span class="dt">Parser</span> a
|
||||
lexeme <span class="fu">=</span> L.lexeme sc
|
||||
|
||||
<span class="ot">symbol ::</span> <span class="dt">String</span> <span class="ot">-></span> <span class="dt">Parser</span> <span class="dt">String</span>
|
||||
symbol <span class="fu">=</span> L.symbol sc</code></pre></div>
|
||||
<h3 id="monad-transformers">Monad transformers</h3>
|
||||
<p>Note that all tools in Megaparsec work with any instance of <code>MonadParsec</code>. All commonly useful monad transformers like <code>StateT</code> and <code>WriterT</code> are instances of <code>MonadParsec</code> out of the box. For example, suppose you want to collect the contents of comments (say, they are documentation strings of a sort). You may want to have backtracking user state where you put the last encountered comment satisfying some criteria, and then, when you parse a function definition, check the state and attach the doc-string to your parsed function. It’s all possible and easy with Megaparsec:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="kw">import </span><span class="dt">Control.Monad.State.Lazy</span>
|
||||
|
||||
…
|
||||
|
||||
<span class="kw">type</span> <span class="dt">MyParser</span> <span class="fu">=</span> <span class="dt">StateT</span> <span class="dt">String</span> <span class="dt">Parser</span>
|
||||
|
||||
<span class="ot">skipLineComment' ::</span> <span class="dt">MyParser</span> ()
|
||||
skipLineComment' <span class="fu">=</span> …
|
||||
|
||||
<span class="ot">skipBlockComment' ::</span> <span class="dt">MyParser</span> ()
|
||||
skipBlockComment' <span class="fu">=</span> …
|
||||
|
||||
<span class="ot">sc ::</span> <span class="dt">MyParser</span> ()
|
||||
sc <span class="fu">=</span> L.space (void spaceChar) skipLineComment' skipBlockComment'</code></pre></div>
|
||||
<h3 id="indentation-sensitive-languages">Indentation-sensitive languages</h3>
|
||||
<p>Parsing of indentation-sensitive languages deserves its own tutorial, but let’s take a look at the basic tools upon which you can build. First of all, you should work with a space consumer that doesn’t eat newlines automatically. This means you’ll need to pick them up manually.</p>
|
||||
<p>The main helper is called <code>indentGuard</code>. It takes a parser that will be used to consume white space (indentation) and a predicate of type <code>Int -> Bool</code>. If, after running the given parser, the column number does not satisfy the given predicate, the parser fails with the message “incorrect indentation”; otherwise it returns the current column number.</p>
|
||||
<p>In simple cases you can explicitly pass around the value returned by <code>indentGuard</code>, i.e. the current level of indentation. If you prefer to preserve some sort of state, you can achieve backtracking state by combining <code>StateT</code> and <code>ParsecT</code>, like this:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="dt">StateT</span> <span class="dt">Int</span> <span class="dt">Parser</span> a</code></pre></div>
|
||||
<p>Here we have state of the type <code>Int</code>. You can use <code>get</code> and <code>put</code> as usual, although it may be better to write a modified version of <code>indentGuard</code> that could get current indentation level (indentation level on previous line), then consume indentation of current line, perform necessary checks, and put new level of indentation.</p>
|
||||
<p><em>Later update</em>: now we have full support for indentation-sensitive parsing, see <code>nonIndented</code>, <code>indentBlock</code>, and <code>lineFold</code> in the <code>Text.Megaparsec.Lexer</code> module.</p>
|
||||
<h3 id="character-and-string-literals">Character and string literals</h3>
|
||||
<p>Parsing of string and character literals is done a bit differently than in Parsec. You have a single helper, <code>charLiteral</code>, which parses a character literal. It <em>does not</em> parse surrounding quotes, because different languages may quote character literals differently. The purpose of this parser is to help with parsing of conventional escape sequences (the literal character is parsed according to the rules defined in the Haskell report).</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">charLiteral ::</span> <span class="dt">Parser</span> <span class="dt">Char</span>
|
||||
charLiteral <span class="fu">=</span> char <span class="ch">'\''</span> <span class="fu">*></span> L.charLiteral <span class="fu"><*</span> char <span class="ch">'\''</span></code></pre></div>
|
||||
<p>Use <code>L.charLiteral</code> to parse string literals as well. This is a simplified version that will accept plain (not escaped) newlines in string literals (it’s easy to make it conform to Haskell syntax; this is left as an exercise for the reader):</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">stringLiteral ::</span> <span class="dt">Parser</span> <span class="dt">String</span>
|
||||
stringLiteral <span class="fu">=</span> char <span class="ch">'"'</span> <span class="fu">>></span> manyTill L.charLiteral (char <span class="ch">'"'</span>)</code></pre></div>
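<p>A self-contained sketch of the same parser, showing how escape sequences are resolved. It assumes Megaparsec 5.x; <code>sample</code> is a hypothetical name used only to hold an example result:</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">import Text.Megaparsec
import Text.Megaparsec.String
import qualified Text.Megaparsec.Lexer as L

-- 'manyTill' checks for the closing quote before each character, while an
-- escaped quote (backslash followed by '"') is consumed by 'L.charLiteral'.
stringLiteral :: Parser String
stringLiteral = char '"' >> manyTill L.charLiteral (char '"')

-- The input text is: "say \"hi\"\tnow" (quotes and backslashes included).
sample :: Maybe String
sample = parseMaybe stringLiteral "\"say \\\"hi\\\"\\tnow\""</code></pre></div>
<p>Here <code>sample</code> should be <code>Just "say \"hi\"\tnow"</code>, that is, the escaped quotes and the tab are turned into the actual characters.</p>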
|
||||
<p>I should note that in <code>charLiteral</code> we use built-in support for parsing of all the tricky combinations of characters. On the other hand Parsec re-implements the whole thing. Given that it mostly has no tests at all, I cannot tell for sure that it works.</p>
|
||||
<h3 id="numbers">Numbers</h3>
|
||||
<p>Parsing of numbers is easy:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">integer ::</span> <span class="dt">Parser</span> <span class="dt">Integer</span>
|
||||
integer <span class="fu">=</span> lexeme L.integer
|
||||
|
||||
<span class="ot">float ::</span> <span class="dt">Parser</span> <span class="dt">Double</span>
|
||||
float <span class="fu">=</span> lexeme L.float
|
||||
|
||||
<span class="ot">number ::</span> <span class="dt">Parser</span> <span class="dt">Scientific</span>
|
||||
number <span class="fu">=</span> lexeme L.number <span class="co">-- similar to ‘naturalOrFloat’ in Parsec</span></code></pre></div>
|
||||
<p>Note that Megaparsec internally uses the standard Haskell functions to parse floating point numbers, so no precision is lost (and this is tested). Parsec, on the other hand, again re-implements the whole thing. The approach taken by Parsec’s authors is to parse the digits one by one and then re-create the floating point number by means of floating point arithmetic. This cannot work reliably, because every intermediate operation may round; a correct conversion has to work with the exact decimal value and round only once (it’s usually done at a lower level, in C libraries). As a result, the numbers produced by Parsec’s built-in parser for floating point numbers can be incorrect. This is a known bug now, but it took a long time until we “discovered” it because, again, Parsec has no test suite. (<em>Update</em>: it took one year, but Parsec’s maintainer has merged a pull request that seems to fix this and released Parsec 3.1.11.)</p>
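<p>The effect is easy to observe without any parsing library. The sketch below is <em>not</em> Parsec’s actual code; <code>naiveFloat</code> is a made-up function that only illustrates the digit-by-digit idea, and its results are printed next to those of the standard <code>read</code> so you can compare them:</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">import Data.Char (digitToInt)
import Data.List (foldl')

-- | Naive conversion for inputs of the form "digits.digits". Every step
-- is a floating point operation, so rounding errors can accumulate.
naiveFloat :: String -> Double
naiveFloat s = intPart + fracPart
  where
    (whole, rest) = break (== '.') s
    frac     = drop 1 rest
    digit    = fromIntegral . digitToInt
    -- Integer part: left to right, multiply by 10 and add the digit.
    intPart  = foldl' (\acc d -> acc * 10 + digit d) 0 whole
    -- Fractional part: right to left, add the digit and divide by 10.
    fracPart = foldr  (\d acc -> (acc + digit d) / 10) 0 frac

main :: IO ()
main = mapM_ compareOn
  ["0.1", "3.14", "0.123456789012345678", "123456789.123456789"]
  where
    compareOn s =
      putStrLn (s ++ ": naive = " ++ show (naiveFloat s)
                  ++ ", read = " ++ show (read s :: Double))</code></pre></div>
<p>Any line where the two values differ is a case where the accumulated rounding changed the result. Which inputs are affected depends on the digits involved, so it is worth trying a few of your own.</p>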
|
||||
<p>The parsers for hexadecimal and octal numbers do not consume the “0x” or “0o” prefixes, because different languages may use other prefixes for this sort of number. We should parse the prefixes manually:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">hexadecimal ::</span> <span class="dt">Parser</span> <span class="dt">Integer</span>
|
||||
hexadecimal <span class="fu">=</span> lexeme <span class="fu">$</span> char <span class="ch">'0'</span> <span class="fu">>></span> char' <span class="ch">'x'</span> <span class="fu">>></span> L.hexadecimal
|
||||
|
||||
<span class="ot">octal ::</span> <span class="dt">Parser</span> <span class="dt">Integer</span>
|
||||
octal <span class="fu">=</span> lexeme <span class="fu">$</span> char <span class="ch">'0'</span> <span class="fu">>></span> char' <span class="ch">'o'</span> <span class="fu">>></span> L.octal</code></pre></div>
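<p>Because the prefix is parsed separately, adapting the parser to another notation is a one-line change. Here is a small sketch, assuming the Megaparsec 5 API; the <code>lexeme</code> definition is simplified, and <code>dollarHex</code> is a hypothetical parser for a language that writes hexadecimal numbers with a <code>$</code> prefix:</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">import Text.Megaparsec
import Text.Megaparsec.String (Parser)
import qualified Text.Megaparsec.Lexer as L

-- Assumed lexeme wrapper: skip trailing white space only.
lexeme :: Parser a -> Parser a
lexeme = L.lexeme space

-- Haskell style: "0x" or "0X" prefix ('char'' is case-insensitive).
hexadecimal :: Parser Integer
hexadecimal = lexeme $ char '0' >> char' 'x' >> L.hexadecimal

-- Hypothetical language that uses a "$" prefix instead.
dollarHex :: Parser Integer
dollarHex = lexeme $ char '$' >> L.hexadecimal

main :: IO ()
main = do
  print (parseMaybe hexadecimal "0xFF")  -- Just 255
  print (parseMaybe hexadecimal "0Xff")  -- Just 255
  print (parseMaybe dollarHex   "$ff")   -- Just 255
  print (parseMaybe hexadecimal "FF")    -- Nothing: the prefix is required</code></pre></div>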
|
||||
<p>Since the Haskell report says nothing about signs in numeric literals, basic parsers like <code>integer</code> do not parse a sign. You can easily create parsers for signed numbers with the help of <code>signed</code>:</p>
|
||||
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span class="ot">signedInteger ::</span> <span class="dt">Parser</span> <span class="dt">Integer</span>
|
||||
signedInteger <span class="fu">=</span> L.signed sc integer
|
||||
|
||||
<span class="ot">signedFloat ::</span> <span class="dt">Parser</span> <span class="dt">Double</span>
|
||||
signedFloat <span class="fu">=</span> L.signed sc float
|
||||
|
||||
<span class="ot">signedNumber ::</span> <span class="dt">Parser</span> <span class="dt">Scientific</span>
|
||||
signedNumber <span class="fu">=</span> L.signed sc number</code></pre></div>
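<p>The first argument of <code>L.signed</code> controls what may appear between the sign and the number. The sketch below contrasts a space consumer with <code>return ()</code>, which allows nothing in between; <code>sc</code> and <code>integer</code> are simplified stand-ins for the definitions used in this tutorial, and <code>tightSignedInteger</code> is a name made up for the example:</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">import Text.Megaparsec
import Text.Megaparsec.String (Parser)
import qualified Text.Megaparsec.Lexer as L

-- Simplified space consumer (assumption): white space only, no comments.
sc :: Parser ()
sc = space

integer :: Parser Integer
integer = L.lexeme sc L.integer

-- White space is allowed between the sign and the digits.
signedInteger :: Parser Integer
signedInteger = L.signed sc integer

-- Nothing is allowed between the sign and the digits.
tightSignedInteger :: Parser Integer
tightSignedInteger = L.signed (return ()) integer

main :: IO ()
main = do
  print (parseMaybe signedInteger      "- 42")  -- Just (-42)
  print (parseMaybe signedInteger      "+7")    -- Just 7
  print (parseMaybe tightSignedInteger "-42")   -- Just (-42)
  print (parseMaybe tightSignedInteger "- 42")  -- Nothing</code></pre></div>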
|
||||
<p>And that’s it: shiny and new, <code>Text.Megaparsec.Lexer</code> is at your service. Now you can implement anything you want without having to copy and edit the entire <code>Text.Parsec.Token</code> module (people had to do that sometimes, you know).</p>
|
||||
<h2 id="whats-next">What’s next?</h2>
|
||||
<p>The changes you may want to make may be more fundamental than those described here. For example, you may previously have had to use a workaround because <code>Text.Parsec.Token</code> was not sufficiently flexible; now you can replace it with a proper solution. If you want to use the full potential of Megaparsec, take time to read about its features, because they can help you improve your parsers.</p>
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.min.js"></script>
|
||||
<script src="../js/put-anchors.js"></script>
|
||||
</body>
|
||||
<body></body>
|
||||
</html>
|
||||
|
@ -2,86 +2,7 @@
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<meta name="description" content />
|
||||
<meta name="author" content />
|
||||
<title>Megaparsec | Writing a fast parser</title>
|
||||
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
|
||||
<link rel="stylesheet" type="text/css" href="../css/megaparsec.css" />
|
||||
<meta http-equiv="refresh" content="0; url=https://markkarpov.com/megaparsec/writing-a-fast-parser.html">
|
||||
</head>
|
||||
|
||||
<body>
|
||||
|
||||
<div class="navbar navbar-default navbar-static-top" role="navigation">
|
||||
<div class="container-fluid">
|
||||
<div class="navbar-header">
|
||||
<a class="navbar-brand" href="../">
|
||||
Megaparsec
|
||||
</a>
|
||||
</div>
|
||||
<div class="navbar-right">
|
||||
<ul class="nav navbar-nav">
|
||||
|
||||
<li>
|
||||
<a href="../tutorials.html">Tutorials</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://hackage.haskell.org/package/megaparsec">Hackage</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec">GitHub</a>
|
||||
</li>
|
||||
|
||||
<li>
|
||||
<a href="https://github.com/mrkkrp/megaparsec-site">Edit the site</a>
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="container">
|
||||
<div class="row-fluid">
|
||||
<div class="col-sm-10 col-sm-offset-1 col-md-8 col-md-offset-2 main">
|
||||
<div class="page-header">
|
||||
<h1>Writing a fast parser
|
||||
|
||||
<br />
|
||||
<small>
|
||||
Practical recommendations
|
||||
</small>
|
||||
|
||||
</h1>
|
||||
<hr />
|
||||
<div class="content">
|
||||
|
||||
<em>Last updated on May 25, 2017</em>
|
||||
<hr />
|
||||
|
||||
|
||||
<p>If the performance of your Megaparsec parser is worse than you hoped, there may be ways to improve it. This short guide tells you what to try, but you should always check whether you’re getting better results by profiling and benchmarking your parsers (that’s the only way to know if you are doing the right thing when tuning performance).</p>
|
||||
<ul>
|
||||
<li><p>If your parser uses a monad stack instead of the plain <code>Parsec</code> monad (which is a monad transformer over <code>Identity</code> too, but it’s much more lightweight), make sure you use at least version 0.5 of the <code>transformers</code> library and at least version 5.0 of <code>megaparsec</code>. Both libraries have critical performance improvements in those versions, so you get better performance for free.</p></li>
|
||||
<li><p>The <code>Parsec</code> monad will always be faster than <code>ParsecT</code>-based monad transformers. Avoid using <code>StateT</code>, <code>WriterT</code>, and other monad transformers unless absolutely necessary. When you have a relatively simple monad stack, for example with <code>StateT</code> and nothing more, the performance of a Megaparsec parser will be on par with Parsec. The more you add to the stack, the slower it will be.</p></li>
|
||||
<li><p>The most expensive operation is backtracking (you enable it with <code>try</code> and it happens automatically with <code>tokens</code>-based parsers). Avoid building long chains of alternatives where every alternative can go deep into input before failing.</p></li>
|
||||
<li><p>Inline generously (when it makes sense, of course). You may not believe your eyes when you see how much of a difference inlining can make, especially for short functions. This is especially true for parsers that are defined in one module and used in another, because <code>INLINE</code> and <code>INLINABLE</code> pragmas make GHC dump function definitions into the interface file, which facilitates specialization (I’ve written a tutorial about this, <a href="https://www.stackbuilders.com/tutorials/haskell/ghc-optimization-and-fusion/">available here</a>).</p></li>
|
||||
</ul>
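<p>As an illustration of the points about the plain <code>Parsec</code> monad and inlining, here is a minimal sketch of a lexer module, assuming the Megaparsec 5 API. The module name <code>Lexer</code> and the choice of helpers are hypothetical, and whether the pragmas actually help in your case is something only a benchmark can tell:</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">module Lexer (Parser, sc, lexeme, symbol) where

import Control.Monad (void)
import Text.Megaparsec
import qualified Text.Megaparsec.Lexer as L

-- Plain 'Parsec' with the default error component: no 'StateT',
-- 'WriterT', or other transformer layers in the stack.
type Parser = Parsec Dec String

-- Space consumer: white space plus Haskell-style comments.
sc :: Parser ()
sc = L.space (void spaceChar) (L.skipLineComment "--") (L.skipBlockComment "{-" "-}")
{-# INLINE sc #-}

-- Short helpers that other modules call all the time are good
-- candidates for INLINE, because their definitions then end up in
-- the interface file and can be optimized at the use site.
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
{-# INLINE lexeme #-}

symbol :: String -> Parser String
symbol = L.symbol sc
{-# INLINE symbol #-}</code></pre></div>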
|
||||
<p>The same parser can be written in many ways. Think about your grammar and how parsing happens; once you get some experience with this process, it will be much easier to see how to make your parser faster. Sometimes, however, making a parser faster will also make your code less readable. If the performance of your parser is not a bottleneck in the system you are building, consider preferring readability over performance.</p>
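<p>Here is a small sketch of what “the same parser written in two ways” can look like, again assuming the Megaparsec 5 API. The <code>Expr</code> type and all parser names are invented for the example. The first version relies on <code>try</code> and parses every plain variable twice; the second one is left-factored so that the common prefix is parsed only once:</p>
<div class="sourceCode"><pre class="sourceCode haskell"><code class="sourceCode haskell">import Text.Megaparsec
import Text.Megaparsec.String (Parser)

data Expr = Var String | Call String [Expr]
  deriving (Eq, Show)

ident :: Parser String
ident = some letterChar

-- Argument list, parameterised by the expression parser it should use.
args :: Parser Expr -> Parser [Expr]
args expr = between (char '(') (char ')') (expr `sepBy` char ',')

-- Readable, but backtracks: for a plain variable, 'call' consumes the
-- identifier, fails at the missing '(', and 'try' rewinds the input so
-- that 'var' can parse the same identifier again.
exprBacktracking :: Parser Expr
exprBacktracking = try call <|> var
  where
    call = Call <$> ident <*> args exprBacktracking
    var  = Var <$> ident

-- Left-factored: the shared identifier is parsed once, and only then
-- do we decide between a call and a variable.
exprFactored :: Parser Expr
exprFactored = do
  name  <- ident
  margs <- optional (args exprFactored)
  return (maybe (Var name) (Call name) margs)

main :: IO ()
main = do
  print (parseMaybe exprBacktracking "f(x,g(y))")
  print (parseMaybe exprFactored     "f(x,g(y))")</code></pre></div>
<p>Both calls should produce the same syntax tree. In a grammar this small the difference in speed is negligible, but when alternatives share long prefixes the factored form avoids a lot of repeated work, at the cost of being a bit less declarative.</p>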
|
||||
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/3.1.0/anchor.min.js"></script>
|
||||
<script src="../js/put-anchors.js"></script>
|
||||
</body>
|
||||
<body></body>
|
||||
</html>
|
||||
|