Even though the haddock for ‘Text.Megaparsec.Lexer.space’ says that
Parsing of white space is an important part of any parser. We
propose a convention where every lexeme parser assumes no spaces
before the lexeme and consumes all spaces after the lexeme;
all the indentation-sensitive parsing combinators assume/consume whitespace
_before_ the thing to be parsed. This would normally mean they can't be used
with combinators like ‘many’ and ‘some’ without using ‘try’ (and sacrificing
performance). Fortunately ‘indentBlock’ also consumes whitespace _after_,
but unfortunately it didn't do that in the ‘IndentNone’ case. Now it does
and it works with many and some without try!
This improves position reporting/advancing for ‘eof’, ‘token’, and
‘tokens’ combinators. Previously it was slightly incorrect for streams
consisting of custom token as new tests demonstrated.
Evaluation of ‘npos’ is not necessary when we are going to report an
error. Similarly, evaluation of current position is not necessary when
we just need to get incremented position. This seemingly minor change
has profound impact on performance because call to ‘token’ function is
at the base of significant part of parsing process.
This is rather a sketch, we need to work on documentation, tests, and
perhaps on performance, but it should show the direction Megaparsec
5.0.0 is taking.
Close#95.
Here we introduce ‘scientific’ parser that can parse arbitrary big
numbers without error or memory overflow. ‘float’ still returns
‘Double’, but it's defined in terms of ‘scientific’ now. Since
‘Scientific’ type can reliably represent integer values as well as
floating point values, ‘number’ now returns ‘Scientific’ instead of
‘Either Integer Double’ (‘Integer’ or ‘Double’ can be extracted from
‘Scientific’ value anyway). This in turn makes ‘signed’ parser more
natural and general, because we do not need ad-hoc ‘Signed’ type class
anymore.
This should improve experience of users who use Megaparsec with Alex and
Happy. The commit also introduces some minor changes in
‘Text.Megaparsec.Pos’ module (improving argument order).
Removed ‘parseFromFile’ and ‘StorableStream’ type-class that was
necessary for it. The reason for removal is that reading from file and
then parsing its contents is trivial for every instance of ‘Stream’ and
this function provides no way to use newer methods for running a parser,
such as ‘runParser'’. So, simply put, it adds little value and was
included in 4.x versions for compatibility purposes.
Collection of constraints changed from ‘Alternative m, Monad m, Stream s
t’ to ‘MonadPlus m, Stream s t’. This is done to make it easier to write
more abstract code with older GHC where such primitives as ‘guard’ are
defined for instances of ‘MonadPlus’, not ‘Alternative’.
This was Parsec's legacy that we should eliminate now. ‘Message’ does
not constitute enumeration, ‘toEnum’ was never properly defined for
it. The idea to use ‘fromEnum’ to determine type of ‘Message’ is also
ugly, for this purpose new functions ‘isUnexpected’, ‘isExpected’, and
‘isMessage’ are defined in ‘Text.Megaparsec.Error’.
This also enables the respective warnings flags in dev mode
to help megaparsec remain forward compatible.
The dependencies on `semigroup` and `fail` are conditional on
`impl(ghc >= 8)` and avoid CPP and conditionally defined instances
(which would result in an conditional API).
Close#81.
This solution is mostly OK as it passes tests and almost all benchmarks
show that there is no performance degradation.
The only function that bothers me is ‘pPlus’ (or ‘mplus’, or
‘<|>’). Benchmarks ‘choice/match’, ‘choice/nomatch’, and ‘manyTill’ show
about 44 % worse performance with current implementation of the feature
— this is not acceptable. All these functions are defined via ‘mplus’,
so it's necessary to find a way to improve that function.
Also, ‘mplus’ is tricky in that it combines different branches of
parsing. Previously, all logic describing how to combine failing
branches into one ‘ParseError’ were in ‘mergeError’ function. Now we
have to have ‘longestMatch’ function to choose right state as well,
because it's natural to expect that state on failure would correspond to
‘ParseError’. This should be done elegantly.
In particular, if input has no newline at the end, we need to treat it
specially, because otherwise we will get confusing “incorrect
indentation” message.
Close#75.
Now accumulated hints are not used with ‘ParseError’ records that have
only custom messages in them (created with ‘Message’ constructor, as
opposed to ‘Unexpected’ or ‘Expected’). This strips “expected” line from
custom error messages where it's unlikely to be relevant anyway.
Arbitrary messages created with ‘Message’ constructor should not be
rendered as “or”-separated list. This commit makes every such message be
displayed on new line.
After some thinking I decided that this may be not desirable in some
cases, so we should not enable it by default. I've edited documentation
of ‘makeExprParser’ to explain why this doesn't work by default and how
to make it work.
Close#64.
‘makeExprParser’ now generates parser that can handle several
occurrences of the same prefix or postfix operator in a row. This allows
to parse something like C pointers (for example ‘**i’) without resorting
to hacks.
The feature is experimental, I'm not entirely sure it's not
buggy. Upcoming additional tests for ‘Text.Megaparsec.Expr’ will show
whether it behaves correctly in all cases and doesn't have adverse
effects. For now, I've edited existing test to generate data with
repeating prefix negations and postfix factorials. Current code-base
passes the test.
Close#69.
Although previously used syntax is correct Haskell syntax for multi-line
string literals, CPP extension that we need to use for compatibility
reasons obviously makes ‘\’ symbol escape following newline character
that leads to ‘\t’ being interpreted as tab character.
The proposed solution just concatenates result error message from list
of strings — the most lightweight and reliable solution in our case.
What Parsec used is called “FreeBSD” or “BSD 2 clause”. Addition of the
third clause may require contacting all the authors. To hell with it,
let it be “FreeBSD” (which is anyway better than “BSD-like”), I'm a
hacker, not a lawyer (tm).
This commit clarifies license of the software replacing “BSD3” with more
conventional “BSD 3 clause”.
Another change is addition of the third clause originally missing in
license of Parsec (which is licensed under BSD 2 clause license). The
addition of the third clause in form:
* Neither the names of the copyright holders nor the names of
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
does not violate original BSD 2 clause license effectively making it BSD
3 clause license (which I find preferable).
Close#43.
The method allows to fail with arbitrary collection of
messages. ‘unexpected’ is not defined in terms of ‘failure’. One
consequence of this design decision is that ‘failure’ is now method of
‘MonadParsec’, while ‘unexpected’ is not.
Close#47, close#57.
This commit introduces ‘runParser'’ and ‘runParserT'’ functions that
take and return parser state. This makes it possible to partially parse
input, resume parsing, specify non-standard initial textual position,
etc.
Internal changes involve some refactoring to make ‘Reply’ more
readable and facilitate extraction of complete parser state on failure
as well as success.
The commit adds basic tests for the new functionality as well.