2015-11-12 17:33:09 +03:00
|
|
|
|
## About ##
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2015-11-12 17:33:09 +03:00
|
|
|
|
| | |
|
|
|
|
|
| ---------- | ------------------------- |
|
2017-01-19 06:43:15 +03:00
|
|
|
|
| Title: | MultiMarkdown 6 |
|
2016-08-25 04:44:09 +03:00
|
|
|
|
| Author: | Fletcher T. Penney |
|
2017-01-29 03:57:51 +03:00
|
|
|
|
| Date: | 2017-01-28 |
|
2017-01-19 06:43:15 +03:00
|
|
|
|
| Copyright: | Copyright © 2016 - 2017 Fletcher T. Penney. |
|
2017-01-29 03:57:51 +03:00
|
|
|
|
| Version: | 0.1.1a |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Updates ##
|
|
|
|
|
|
|
|
|
|
* 2017-01-28 -- v 0.1.1a includes a few updates:
|
|
|
|
|
|
|
|
|
|
* Metadata support
|
|
|
|
|
* Metadata variables support
|
|
|
|
|
* Extended ASCII range character checking
|
|
|
|
|
* Rudimentary language translations, including German
|
|
|
|
|
* Improved performance
|
|
|
|
|
* Additional testing:
|
|
|
|
|
* CriticMarkup
|
|
|
|
|
* HTML Blokcs
|
|
|
|
|
* Metadata/Variables
|
|
|
|
|
* "pathologic" test cases from CommonMark
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-02-04 16:48:56 +03:00
|
|
|
|
* 2017-?? -- v 0.1.2a:
|
|
|
|
|
|
|
|
|
|
* "pathologic" test suite -- fix handling of nested brackets, e.g.
|
|
|
|
|
`[[[[foo]]]]` to avoid bogging down checking for reference links that
|
|
|
|
|
don't exist.
|
2017-02-05 20:00:30 +03:00
|
|
|
|
* Table support -- a single blank line separates sections of tables, so
|
|
|
|
|
at least two blank lines are needed between adjacent tables.
|
2017-02-05 19:57:11 +03:00
|
|
|
|
* Definition list support
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
## An Announcement! ##
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
I would like to officially announce that MultiMarkdown version 6 is in public
|
|
|
|
|
alpha. It's finally at a point where it is usable, but there are quite a few
|
|
|
|
|
caveats.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
This post is a way for me to organize some of my thoughts, provide some
|
|
|
|
|
history for those who are interested, and to provide some tips and tricks from
|
|
|
|
|
my experiences for those who are working on their own products.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
But first, some background...
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
### Why a New Version? ###
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
MultiMarkdown version 5 was released in November of 2015, but the codebase was
|
|
|
|
|
essentially the same as that of v4 -- and that was released in beta in April
|
|
|
|
|
of 2013. A few key things prompted work on a new version:
|
2015-06-16 17:28:29 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
* Accuracy -- MMD v4 and v5 were the most accurate versions yet, and a lot of
|
|
|
|
|
effort went into finding and resolving various edge cases. However, it began
|
|
|
|
|
to feel like a game of whack-a-mole where new bugs would creep in every time I
|
|
|
|
|
fixed an old one. The PEG began to feel rather convoluted in spots, even
|
|
|
|
|
though it did allow for a precise (if not always accurate) specification of
|
|
|
|
|
the grammar.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
* Performance -- "Back in the day" [peg-markdown] was one of the fastest
|
|
|
|
|
Markdown parsers around. MMD v3 was based on peg-markdown, and would leap-
|
|
|
|
|
frog with it in terms of performance. Then [CommonMark] was released, which
|
|
|
|
|
was a bit faster. Then a couple of years went by and CommonMark became *much*
|
|
|
|
|
faster -- in one of my test suites, MMD v 5.4.0 takes about 25 times longer to
|
|
|
|
|
process a long document than CommonMark 0.27.0.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
[peg-markdown]: https://github.com/jgm/peg-markdown
|
|
|
|
|
[CommonMark]: http://commonmark.org/
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
Last spring, I decided I wanted to rewrite MultiMarkdown from scratch,
|
|
|
|
|
building the parser myself rather than relying on a pre-rolled solution. (I
|
|
|
|
|
had been using [greg](https://github.com/ooc-lang/greg) to compile the PEG
|
|
|
|
|
into parser code. It worked well overall, but lacked some features I needed,
|
|
|
|
|
requiring a lot of workarounds.)
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
## First Attempt ##
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
My first attempt started by hand-crafting a parser that scanned through the
|
|
|
|
|
document a line at a time, deciding what to do with each line as it found
|
|
|
|
|
them. I used regex parsers made with [re2c](http://re2c.org/index.html) to
|
|
|
|
|
help classify each line, and then a separate parser layer to process groups of
|
|
|
|
|
lines into blocks. Initially this approach worked well, and was really
|
|
|
|
|
efficient. But I quickly began to code my way into a dead-end -- the strategy
|
|
|
|
|
was not elegant enough to handle things like nested lists, etc.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
One thing that did turn out well from the first attempt, however, was an
|
|
|
|
|
approach for handling `<emph>` and `<strong>` parsing. I've learned over the
|
|
|
|
|
years that this can be one of the hardest parts of coding accurately for
|
|
|
|
|
Markdown. There are many examples that are obvious to a person, but difficult
|
|
|
|
|
to properly "explain" how to parse to a computer.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
No solution is perfect, but I developed an approach that seems to accurately
|
|
|
|
|
handle a wide range of situations without a great deal of complexity:
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
1. Scan the documents for asterisks (`*`). Each one will be handled one at a
|
|
|
|
|
time.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
2. Unlike brackets (`[` and `]`), an asterisk is "ambidextrous", in that it
|
|
|
|
|
may be able to open a matched pair of asterisks, close a pair, or both. For
|
|
|
|
|
example, in `foo *bar* foo`:
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
1. The first asterisk can open a pair, but not close one.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
2. The second asterisk can close a pair, but not open one.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
3. So, once the asterisks have been identified, each has to be examined to
|
|
|
|
|
determine whether it can open/close/both. The algorithm is not that complex,
|
|
|
|
|
but I'll describe it in general terms. Check the code for more specifics.
|
|
|
|
|
This approach seems to work, but might still need some slight tweaking. In
|
|
|
|
|
the future, I'll codify this better in language rather than just in code.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
1. If there is whitespace to the left of an asterisk, it can't close.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
2. If there is whitespace or punctuation to the right it can't open.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
3. "Runs" of asterisks, e.g. `**bar` are treated as a unit in terms of
|
|
|
|
|
looking left/right.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
4. Asterisks inside a word are a bit trickier -- we look at the number of
|
|
|
|
|
asterisks before the word, the number in the current run, and the number
|
|
|
|
|
of asterisks after the word to determine which combinations, if any, are
|
|
|
|
|
permitted.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
4. Once all asterisks have been tagged as able to open/close/both, we proceed
|
|
|
|
|
through them in order:
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
1. When we encounter a tag that can close, we look to see if there is a
|
|
|
|
|
previous opener that has not been paired off. If so, pair the two and
|
|
|
|
|
remove the opener from the list of available asterisks.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
2. When we encounter an opener, add it to the stack of available openers.
|
|
|
|
|
|
|
|
|
|
3. When encounter an asterisk that can do both, see if it can close an
|
|
|
|
|
existing opener. If not, then add it to the stack.
|
|
|
|
|
|
|
|
|
|
5. After all tokens in the block have been paired, then we look for nesting
|
|
|
|
|
pairs of asterisks in order to create `<emph>` and `<strong>` sets. For
|
|
|
|
|
example, assume we have six asterisks wrapped around a word, three in front,
|
|
|
|
|
and three after. The asterisks are indicated with numbers: `123foo456`. We
|
|
|
|
|
proceed in the following manner:
|
2015-06-13 22:09:24 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
1. Based on the pairing algorithm above, these asterisks would be paired as
|
|
|
|
|
follows, with matching asterisks sharing numbers -- `123foo321`.
|
|
|
|
|
|
|
|
|
|
2. Moving forwards, we come to asterisk "1". It is followed by an
|
|
|
|
|
asterisk, so we check to see if they should be grouped as a `<strong>`.
|
|
|
|
|
Since the "1" asterisks are wrapped immediately outside the "2" asterisks,
|
|
|
|
|
they are joined together. More than two pairs can't be joined, so we now
|
|
|
|
|
get the following -- `112foo211`, where the "11" represents the opening
|
|
|
|
|
and closing of a `<strong>`, and the "2" represents a `<emph>`.
|
|
|
|
|
|
|
|
|
|
6. When matching a pair, any unclosed openers that are on the stack are
|
|
|
|
|
removed, preventing pairs from "crossing" or "intersecting". Pairs can wrap
|
|
|
|
|
around each other, e.g. `[(foo)]`, but not intersect like `[(foo])`. In the
|
|
|
|
|
second case, the brackets would close, removing the `(` from the stack.
|
|
|
|
|
|
|
|
|
|
7. This same approach is used in all tokens that are matched in pairs--
|
|
|
|
|
`[foo]`, `(foo)`, `_foo_`, etc. There's slightly more to it, but once you
|
|
|
|
|
figure out how to assign opening/closing ability, the rest is easy. By using
|
|
|
|
|
a stack to track available openers, it can be performed efficiently.
|
|
|
|
|
|
|
|
|
|
In my testing, this approach has worked quite well. It handles all the basic
|
|
|
|
|
scenarios I've thrown at it, and all of the "basic" and "devious" edge cases I
|
|
|
|
|
have thought of (some of these don't necessarily have a "right" answer -- but
|
|
|
|
|
v6 gives consistency answers that seem as reasonable as any others to me).
|
|
|
|
|
There are also three more edge cases I've come up can still stump it, and
|
|
|
|
|
ironically they are handled correctly by most implementations. They just
|
|
|
|
|
don't follow the rules above. I'll continue to work on this.
|
|
|
|
|
|
|
|
|
|
In the end, I scrapped this effort, but kept the lessons learned in the token
|
|
|
|
|
pairing algorithm.
|
2015-06-13 22:09:24 +03:00
|
|
|
|
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
## Second Attempt ##
|
|
|
|
|
|
|
|
|
|
I tried again this past Fall. This time, I approached the problem with lots
|
|
|
|
|
of reading. *Lots and lots* of reading -- tons of websites, computer science
|
|
|
|
|
journal articles, PhD theses, etc. Learned a lot about lexers, and a lot
|
|
|
|
|
about parsers, including hand-crafting vs using parser generators. In brief:
|
|
|
|
|
|
|
|
|
|
1. I learned about the [Aho–Corasick algorithm], which is a great way to
|
|
|
|
|
efficiently search a string for multiple target strings at once. I used this
|
|
|
|
|
to create a custom lexer to identify tokens in a MultiMarkdown text document
|
|
|
|
|
(e.g. `*`, `[ `, `{++`, etc.). I learned a lot, and had a good time working
|
|
|
|
|
out the implementation. This code efficiently allowed me to break a string of
|
|
|
|
|
text into the tokens that mattered for Markdown parsing.
|
|
|
|
|
|
|
|
|
|
2. However, in a few instances I really needed some features of regular
|
|
|
|
|
expressions to simplify more complex structures. After a quick bit of testing,
|
|
|
|
|
using re2c to create a tokenizer was just as efficient, and allowed me to
|
|
|
|
|
incorporate some regex functionality that simplified later parsing. I'll keep
|
|
|
|
|
the Aho-Corasick stuff around, and will probably experiment more with it
|
|
|
|
|
later. But I didn't need it for MMD now. `lexer.re` contains the source for
|
|
|
|
|
the tokenizer.
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
[Aho–Corasick algorithm]: https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
I looked long and hard for a way to simplify the parsing algorithm to try and
|
|
|
|
|
"touch" each token only once. Ideally, the program could step through each
|
|
|
|
|
token, and decide when to create a new block, when to pair things together,
|
|
|
|
|
etc. But I'm not convinced it's possible. Since Markdown's grammar varies
|
|
|
|
|
based on context, it seems to work best when handled in distinct phases:
|
|
|
|
|
|
|
|
|
|
1. Tokenize the string to identify key sections of text. This includes line
|
|
|
|
|
breaks, allowing the text to be examined one line at time.
|
|
|
|
|
|
|
|
|
|
2. Join series of lines together into blocks, such as paragraphs, code blocks,
|
|
|
|
|
lists, etc.
|
|
|
|
|
|
|
|
|
|
3. The tokens inside each block can then be paired together to create more
|
|
|
|
|
complex syntax such as links, strong, emphasis, etc.
|
|
|
|
|
|
|
|
|
|
To handle the block parsing, I started off using the [Aho-Corasick] code to
|
|
|
|
|
handle my first attempt. I had actually implemented some basic regex
|
|
|
|
|
functionality, and used that to group lines together to create blocks. But
|
|
|
|
|
this quickly fell apart in the face of more complex structures such as
|
|
|
|
|
recursive lists. After a lot of searching, and *tons* more reading, I
|
|
|
|
|
ultimately decided to use a parser generator to handle the task of group lines
|
|
|
|
|
into blocks. `parser.y` has the source for this, and it is processed by the
|
|
|
|
|
[lemon](http://www.hwaci.com/sw/lemon/) parser generator to create the actual
|
|
|
|
|
code.
|
|
|
|
|
|
|
|
|
|
I chose to do this because hand-crafting the block parser would be complex.
|
|
|
|
|
The end result would likely be difficult to read and understand, which would
|
|
|
|
|
make it difficult to update later on. Using the parser generator allows me to
|
|
|
|
|
write things out in a way that can more easily be understood by a person. In
|
|
|
|
|
all likelihood, the performance is probably as good as anything I could do
|
|
|
|
|
anyway, if not better.
|
|
|
|
|
|
|
|
|
|
Because lemon is a LALR(1) parser, it does require a bit of thinking ahead
|
|
|
|
|
about how to create the grammar used. But so far, it has been able to handle
|
|
|
|
|
everything I have thrown at it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Optimization ##
|
|
|
|
|
|
|
|
|
|
One of my goals for MMD 6 was performance. So I've paid attention to speed
|
|
|
|
|
along the way, and have tried to use a few tricks to keep things fast. Here
|
|
|
|
|
are some things I've learned along the way. In no particular order:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
### Memory Allocation ###
|
|
|
|
|
|
|
|
|
|
When parsing a long document, a *lot* of token structures are created. Each
|
|
|
|
|
one requires a small bit of memory to be allocated. In aggregate, that time
|
|
|
|
|
added up and slowed down performance.
|
|
|
|
|
|
|
|
|
|
After reading for a bit, I ended up coming up with an approach that uses
|
|
|
|
|
larger chunks of memory. I allocate pools of of memory in large slabs for
|
|
|
|
|
smaller "objects"". For example, I allocate memory for 1024 tokens at a
|
|
|
|
|
single time, and then dole that memory out as needed. When the slab is empty,
|
|
|
|
|
a new one is allocated. This dramatically improved performance.
|
|
|
|
|
|
|
|
|
|
When pairing tokens, I created a new stack for each block. I realized that an
|
|
|
|
|
empty stack didn't have any "leftover" cruft to interfere with re-use, so I
|
|
|
|
|
just used one for the entire document. Again a sizeable improvement in
|
|
|
|
|
performance from only allocating one object instead of many. When recursing
|
|
|
|
|
to a deeper level, the stack just gets deeper, but earlier levels aren't
|
|
|
|
|
modified.
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
Speaking of tokens, I realized that the average document contains a lot of
|
|
|
|
|
single spaces (there's one between every two words I have written, for
|
|
|
|
|
example.) The vast majority of the time, these single spaces have no effect
|
|
|
|
|
on the output of Markdown documents. I changed my whitespace token search to
|
|
|
|
|
only flag runs of 2 or more spaces, dramatically reducing the number of
|
|
|
|
|
tokens. This gives the benefit of needing fewer memory allocations, and also
|
|
|
|
|
reduces the number of tokens that need to be processed later on. The only
|
|
|
|
|
downside is remember to check for a single space character in a few instances
|
|
|
|
|
where it matters.
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
### Proper input buffering ###
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
When I first began last spring, I was amazed to see how much time was being
|
|
|
|
|
spent by MultiMarkdown simply reading the input file. Then I discovered it
|
|
|
|
|
was because I was reading it one character at a time. I switched to using a
|
|
|
|
|
buffered read approach and the time to read the file went to almost nothing. I
|
|
|
|
|
experimented with different buffer sizes, but they did not seem to make a
|
|
|
|
|
measurable difference.
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
### Output Buffering ###
|
|
|
|
|
|
|
|
|
|
I experimented with different approaches to creating the output after parsing.
|
|
|
|
|
I tried printing directly to `stdout`, and even played with different
|
|
|
|
|
buffering settings. None of those seemed to work well, and all were slower
|
|
|
|
|
than using the `d_string` approach (formerly call `GString` in MMD 5).
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
### Fast Searches ###
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
After getting basic Markdown functionality complete, I discovered during
|
|
|
|
|
testing that the time required to parse a document grew exponentially as the
|
|
|
|
|
document grew longer. Performance was on par with CommonMark for shorter
|
|
|
|
|
documents, but fell increasingly behind in larger tests. Time profiling found
|
|
|
|
|
that the culprit was searching for link definitions when they didn't exist.
|
|
|
|
|
My first approach was to keep a stack of used link definitions, and to iterate
|
|
|
|
|
through them when necessary. In long documents, this performs very poorly.
|
|
|
|
|
More research and I ended up using
|
|
|
|
|
[uthash](http://troydhanson.github.io/uthash/). This allows me to search for
|
|
|
|
|
a link (or footnote, etc.) by "name" rather than searching through an array.
|
|
|
|
|
This allowed me to get MMD's performance back to O(n), taking roughly twice as
|
|
|
|
|
much time to process a document that is twice as long.
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
### Efficient Utility Functions ###
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
It is frequently necessary when parsing Markdown to check what sort of
|
|
|
|
|
character we are dealing with at a certain position -- a letter, whitespace,
|
|
|
|
|
punctuation, etc. I created a lookup table for this via `char_lookup.c` and
|
|
|
|
|
hard-coded it in `char.c`. These routines allow me to quickly, and
|
|
|
|
|
consistently, classify any byte within a document. This saved a lot of
|
|
|
|
|
programming time, and saved time tracking down bugs from handling things
|
|
|
|
|
slightly differently under different circumstances. I also suspect it
|
|
|
|
|
improved performance, but don't have the data to back it up.
|
2016-08-25 04:44:09 +03:00
|
|
|
|
|
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
### Testing While Writing ###
|
2016-08-25 04:44:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
I developed several chunks of code in parallel while creating MMD 6. The vast
|
|
|
|
|
majority of it was developed largely in a [test-driven development] approach.
|
|
|
|
|
The other code was largely created with extensive unit testing to accomplish
|
|
|
|
|
this.
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
[test-driven development]: https://en.wikipedia.org/wiki/Test-driven_development
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
MMD isn't particularly amenable to this approach at the small level, but
|
|
|
|
|
instead I relied more on integration testing with an ever-growing collection
|
|
|
|
|
of text files and the corresponding HTML files in the MMD 6 test suite. This
|
|
|
|
|
allowed me to ensure new features work properly and that old features aren't
|
|
|
|
|
broken. At this time, there are 29 text files in the test suite, and many
|
|
|
|
|
more to come.
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
### Other Lessons ###
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
Some things that didn't do me any good....
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
I considered differences between using `malloc` and `calloc` when initializing
|
|
|
|
|
tokens. The time saved by using `malloc` was basically exactly offset by the
|
|
|
|
|
initial time required to initialize the token to default null values as
|
|
|
|
|
compared to using `calloc`. When trying `calloc` failed to help me out
|
|
|
|
|
(thinking that clearing a single slab in the object pool would be faster), I
|
|
|
|
|
stuck with `malloc` as it makes more sense to me in my workflow.
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
I read a bit about [struct padding] and reordered some of my structs. It was
|
|
|
|
|
until later that I discovered the `-Wpadded` option, and it's not clear
|
|
|
|
|
whether my changes modified anything. Since the structs were being padded
|
|
|
|
|
automatically, there was no noticeable performance change, and I didn't have
|
|
|
|
|
the tools to measure whether I could have improved memory usage at all. Not
|
|
|
|
|
sure this would be worth the effort -- much lower hanging fruit available.
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
[struct padding]: http://www.catb.org/esr/structure-packing/
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
## Differences in MultiMarkdown Itself ##
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
MultiMarkdown v6 is mostly about making a better MMD parser, but it will
|
|
|
|
|
likely involve a few changes to the MultiMarkdown language itself.
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
1. I am thinking about removing Setext headers from the language. I almost
|
|
|
|
|
never use them, much preferring to use ATX style headers (`# foo #`).
|
|
|
|
|
Additionally, I have never liked the fact that Setext headers allow the
|
|
|
|
|
meaning of a line to be completely changed by the following line. It makes
|
|
|
|
|
the parsing slightly more difficult on a technical level (requiring some
|
|
|
|
|
backtracking at times). I'm not 100% certain on this, but right now I believe
|
|
|
|
|
it's the only Markdown feature that doesn't exist in MMD 6 yet.
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
2. Whitespace is not allowed between the text brackets and label brackets in
|
|
|
|
|
reference links, images, footnotes, etc. For example `[foo] [bar]` will no
|
|
|
|
|
longer be the same as `[foo][bar]`.
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
3. Link and image titles can be quoted with `'foo'`, `"foo"`, or `(foo)`.
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
4. HTML elements are handled slightly differently. There is no longer a
|
|
|
|
|
`markdown="1"` feature. Instead, HTML elements that are on a line by
|
|
|
|
|
themselves will open an HTML block that will cause the rest of the "paragraph"
|
|
|
|
|
to be treated as HTML such that Markdown will not be parsed in side of it.
|
|
|
|
|
HTML block-level tags are even "stronger" at starting an HTML block. It is
|
|
|
|
|
not quite as complex as the approach used in CommonMark, but is similar under
|
|
|
|
|
most circumstances.
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
For example, this would not be parsed:
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
<div>
|
|
|
|
|
*foo*
|
|
|
|
|
</div>
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
But this would be:
|
2016-08-25 05:01:59 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
<div>
|
2016-08-25 05:01:59 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
*foo*
|
2016-08-25 05:01:59 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
</div>
|
2016-08-25 05:01:59 +03:00
|
|
|
|
|
2017-02-08 06:38:32 +03:00
|
|
|
|
5. "Malformed" reference link definitions are handled slightly differently.
|
|
|
|
|
For example, `Reference Footnotes.text` is parsed differently in compatibility
|
|
|
|
|
mode than MMD-5. This started as a side-effect of the parsing algorithm, but
|
|
|
|
|
I actually think it makes sense. This may or may not change in the future.
|
2016-08-25 05:01:59 +03:00
|
|
|
|
|
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
## Where Does MultiMarkdown 6 Stand? ##
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
### Features ###
|
2015-11-12 17:33:09 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
I *think* that all basic Markdown features have been implemented, except for
|
|
|
|
|
Setext headers, as mentioned above. Additionally, the following MultiMarkdown
|
|
|
|
|
features have been implemented:
|
|
|
|
|
|
|
|
|
|
* Automatic cross-reference targets
|
|
|
|
|
* Basic Citation support
|
|
|
|
|
* CriticMarkup support
|
2017-02-08 06:38:32 +03:00
|
|
|
|
* Definition lists
|
|
|
|
|
* Footnotes
|
2017-01-19 06:43:15 +03:00
|
|
|
|
* Inline and reference footnotes
|
|
|
|
|
* Image and Link attributes (attributes can now be used with inline links as
|
|
|
|
|
well as reference links)
|
|
|
|
|
* Math support
|
|
|
|
|
* Smart quotes (support for languages other than english is not fully
|
|
|
|
|
implemented yet)
|
|
|
|
|
* Superscripts/subscripts
|
2017-01-29 15:58:57 +03:00
|
|
|
|
* Table of Contents
|
2017-02-05 20:01:12 +03:00
|
|
|
|
* Tables
|
2017-01-19 06:43:15 +03:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Things that are partially completed:
|
|
|
|
|
|
|
|
|
|
* Citations -- still need:
|
|
|
|
|
* Syntax for "not cited" entries
|
|
|
|
|
* Output format
|
|
|
|
|
* HTML --> separate footnotes and citations?
|
|
|
|
|
* Locators required?
|
|
|
|
|
* CriticMarkup -- need to decide:
|
|
|
|
|
* How to handle CM stretches that include blank lines
|
|
|
|
|
* Fenced code blocks
|
2017-02-07 17:56:52 +03:00
|
|
|
|
* Headers -- need support for manual labels
|
2017-01-29 15:58:57 +03:00
|
|
|
|
* Metadata
|
|
|
|
|
* Full/Snippet modes
|
2017-01-19 06:43:15 +03:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Things yet to be completed:
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
* Abbreviations
|
|
|
|
|
* Glossaries
|
|
|
|
|
* File Transclusion
|
2015-06-14 17:32:16 +03:00
|
|
|
|
|
|
|
|
|
|
2017-01-29 15:58:57 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
### Accuracy ###
|
2015-06-14 17:32:16 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
MultiMarkdown v6 successfully parses the Markdown [syntax page], except for
|
|
|
|
|
the Setext header at the top. It passes the 29 test files currently in place.
|
2017-02-04 16:48:56 +03:00
|
|
|
|
There are a few at
|
2015-06-14 17:32:16 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
[syntax page]: https://daringfireball.net/projects/markdown/syntax
|
2015-06-14 17:32:16 +03:00
|
|
|
|
|
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
### Performance ###
|
2015-06-14 17:32:16 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
Basic tests show that currently MMD 6 takes about 20-25% longer the CommonMark
|
|
|
|
|
0.27.0 to process long files (e.g. 0.2 MB). However, it is around 5% *faster*
|
|
|
|
|
than CommonMark when parsing a shorter file (27 kB) (measured by parsing the
|
|
|
|
|
same file 200 times over). This test suite is performed by using the Markdown
|
|
|
|
|
[syntax page], modified to avoid the use of the Setext header at the top. The
|
|
|
|
|
longer files tested are created by copying the same syntax page onto itself,
|
|
|
|
|
thereby doubling the length of the file with each iteration.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
The largest file I test is approximately 108 MB (4096 copies of the syntax
|
|
|
|
|
page). On my machine (2012 Mac mini with 2.3 GHz Intel Core i7, 16 GB RAM),
|
|
|
|
|
it takes approximately 4.4 seconds to parse with MMD 6 and 3.7 seconds with
|
|
|
|
|
CommonMark. MMD 6 processes approximately 25 MB/s on this test file.
|
|
|
|
|
CommonMark 0.27.0 gets about 29 MB/s on the same machine.
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
There are some slight variations with the smaller test files (8-32 copies),
|
|
|
|
|
but overall the performance of both programs (MMD 6 and CommonMark) are
|
|
|
|
|
roughly linear as the test file gets bigger (double the file size and it takes
|
|
|
|
|
twice as long to parse, aka O(n)).
|
|
|
|
|
|
|
|
|
|
Out of curiosity, I ran the same tests on the original Markdown.pl by Gruber
|
|
|
|
|
(v 1.0.2b8). It took approximately 178 seconds to parse 128 copies of the
|
|
|
|
|
file (3.4 MB) and was demonstrating quadratic performance characteristics
|
|
|
|
|
(double the file size and it takes 2^2 or 4 times longer to process, aka
|
|
|
|
|
O(n^2)). I didn't bother running it on larger versions of the test file. For
|
|
|
|
|
comparison, MMD 6 can process 128 copies in approximately 140 msec.
|
|
|
|
|
|
|
|
|
|
Of note, the throughput speed drops when testing more complicated files
|
|
|
|
|
containing more advanced MultiMarkdown features, though it still seems to
|
|
|
|
|
maintain linear performance characteristics. A second test file is created by
|
|
|
|
|
concatenating all of the test suite files (including the Markdown syntax
|
|
|
|
|
file). In this case, MMD gets about 13 MB/s. CommonMark doesn't support
|
|
|
|
|
these additional features, so testing it with that file is not relevant. I
|
|
|
|
|
will work to see whether there are certain features in particular that are
|
|
|
|
|
more challenging and see whether they can be reworked to improve performance.
|
|
|
|
|
|
|
|
|
|
As above, I have done some high level optimization of the parse strategy, but
|
|
|
|
|
I'm sure there's still a lot of room for further improvement to be made.
|
|
|
|
|
Suggestions welcome!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## License ##
|
2015-06-13 21:59:00 +03:00
|
|
|
|
|
2017-01-19 06:43:15 +03:00
|
|
|
|
The `MultiMarkdown 6` project is released under the MIT License..
|
|
|
|
|
|
|
|
|
|
GLibFacade.c and GLibFacade.h are from the MultiMarkdown v4 project:
|
|
|
|
|
|
|
|
|
|
https://github.com/fletcher/MultiMarkdown-4/
|
|
|
|
|
|
|
|
|
|
MMD 4 is released under both the MIT License and GPL.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
CuTest is released under the zlib/libpng license. See CuTest.c for the text
|
|
|
|
|
of the license.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## The MIT License ##
|
|
|
|
|
|
|
|
|
|
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
|
|
|
of this software and associated documentation files (the "Software"), to deal
|
|
|
|
|
in the Software without restriction, including without limitation the rights
|
|
|
|
|
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
|
|
|
copies of the Software, and to permit persons to whom the Software is
|
|
|
|
|
furnished to do so, subject to the following conditions:
|
|
|
|
|
|
|
|
|
|
The above copyright notice and this permission notice shall be included in
|
|
|
|
|
all copies or substantial portions of the Software.
|
|
|
|
|
|
|
|
|
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
|
|
|
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
|
|
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
|
|
|
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
|
|
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
|
|
|
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
|
|
|
|
|
THE SOFTWARE.
|