Lightweight markup processor to produce HTML, LaTeX, and more.

About

Title: MultiMarkdown 6
Author: Fletcher T. Penney
Date: 2017-01-18
Copyright: Copyright © 2016 - 2017 Fletcher T. Penney.
Version: 0.1.0a

An Announcement!

I would like to officially announce that MultiMarkdown version 6 is in public alpha. It's finally at a point where it is usable, but there are quite a few caveats.

This post is a way for me to organize some of my thoughts, provide some history for those who are interested, and to provide some tips and tricks from my experiences for those who are working on their own products.

But first, some background...

Why a New Version?

MultiMarkdown version 5 was released in November of 2015, but the codebase was essentially the same as that of v4 -- and that was released in beta in April of 2013. A few key things prompted work on a new version:

  • Accuracy -- MMD v4 and v5 were the most accurate versions yet, and a lot of effort went into finding and resolving various edge cases. However, it began to feel like a game of whack-a-mole where new bugs would creep in every time I fixed an old one. The PEG began to feel rather convoluted in spots, even though it did allow for a precise (if not always accurate) specification of the grammar.

  • Performance -- "Back in the day" peg-markdown was one of the fastest Markdown parsers around. MMD v3 was based on peg-markdown, and would leapfrog with it in terms of performance. Then CommonMark was released, which was a bit faster. Then a couple of years went by and CommonMark became much faster -- in one of my test suites, MMD v5.4.0 takes about 25 times longer to process a long document than CommonMark 0.27.0.

Last spring, I decided I wanted to rewrite MultiMarkdown from scratch, building the parser myself rather than relying on a pre-rolled solution. (I had been using greg to compile the PEG into parser code. It worked well overall, but lacked some features I needed, requiring a lot of workarounds.)

First Attempt

My first attempt started by hand-crafting a parser that scanned through the document a line at a time, deciding what to do with each line as it was encountered. I used regex-based scanners built with re2c to help classify each line, and then a separate parser layer to process groups of lines into blocks. Initially this approach worked well, and was really efficient. But I quickly began to code my way into a dead end -- the strategy was not elegant enough to handle things like nested lists.

One thing that did turn out well from the first attempt, however, was an approach for handling <emph> and <strong> parsing. I've learned over the years that this can be one of the hardest parts of Markdown to code accurately. There are many examples that are obvious to a person, but difficult to properly "explain" to a computer.

No solution is perfect, but I developed an approach that seems to accurately handle a wide range of situations without a great deal of complexity:

  1. Scan the document for asterisks (*). Each one will be handled one at a time.

  2. Unlike brackets ([ and ]), an asterisk is "ambidextrous", in that it may be able to open a matched pair of asterisks, close a pair, or both. For example, in foo *bar* foo:

    1. The first asterisk can open a pair, but not close one.

    2. The second asterisk can close a pair, but not open one.

  3. So, once the asterisks have been identified, each has to be examined to determine whether it can open/close/both. The algorithm is not that complex, but I'll describe it in general terms. Check the code for more specifics. This approach seems to work, but might still need some slight tweaking. In the future, I'll codify this better in language rather than just in code.

    1. If there is whitespace to the left of an asterisk, it can't close.

    2. If there is whitespace or punctuation to the right it can't open.

    3. "Runs" of asterisks, e.g. **bar are treated as a unit in terms of looking left/right.

    4. Asterisks inside a word are a bit trickier -- we look at the number of asterisks before the word, the number in the current run, and the number of asterisks after the word to determine which combinations, if any, are permitted.

  4. Once all asterisks have been tagged as able to open/close/both, we proceed through them in order:

    1. When we encounter a tag that can close, we look to see if there is a previous opener that has not been paired off. If so, pair the two and remove the opener from the list of available asterisks.

    2. When we encounter an opener, add it to the stack of available openers.

    3. When we encounter an asterisk that can do both, see if it can close an existing opener. If not, add it to the stack.

  5. After all tokens in the block have been paired, we look for nested pairs of asterisks in order to create <emph> and <strong> sets. For example, assume we have six asterisks wrapped around a word, three in front and three after. The asterisks are indicated with numbers: 123foo456. We proceed in the following manner:

    1. Based on the pairing algorithm above, these asterisks would be paired as follows, with matching asterisks sharing numbers -- 123foo321.

    2. Moving forward, we come to asterisk "1". It is followed by an asterisk, so we check to see if they should be grouped as a <strong>. Since the "1" asterisks are wrapped immediately outside the "2" asterisks, they are joined together. More than two pairs can't be joined, so we now get the following -- 112foo211, where the "11" represents the opening and closing of a <strong>, and the "2" represents an <emph>.

  6. When matching a pair, any unclosed openers that are on the stack are removed, preventing pairs from "crossing" or "intersecting". Pairs can wrap around each other, e.g. [(foo)], but not intersect like [(foo]). In the second case, the brackets would close, removing the ( from the stack.

  7. This same approach is used for all tokens that are matched in pairs -- [foo], (foo), _foo_, etc. There's slightly more to it, but once you figure out how to assign opening/closing ability, the rest is easy (see the sketch below). By using a stack to track available openers, it can be performed efficiently.
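
To make the pairing step concrete, here is a minimal C sketch of the stack-based idea described above. It is an illustration only -- the token structure, field names, and the single pair_tokens() function are simplified stand-ins for the real MMD 6 code, and it assumes the open/close flags have already been assigned as in step 3.

    #include <stdlib.h>

    /* Simplified stand-in for the real MMD 6 token structure. */
    typedef enum { TOK_STAR, TOK_BRACKET, TOK_PAREN } tok_type;

    typedef struct {
        tok_type type;
        int      can_open;      /* assigned earlier, per step 3 above */
        int      can_close;
        int      mate;          /* index of the paired token, or -1  */
    } token;

    void pair_tokens(token * t, int count) {
        int * openers = malloc(count * sizeof(int));    /* stack of unpaired openers */
        int   top = 0;

        for (int i = 0; i < count; ++i) {
            t[i].mate = -1;

            if (t[i].can_close) {
                /* Look down the stack for the nearest unpaired opener of the same type. */
                int found = -1;
                for (int s = top - 1; s >= 0; --s) {
                    if (t[openers[s]].type == t[i].type) {
                        found = s;
                        break;
                    }
                }

                if (found >= 0) {
                    t[openers[found]].mate = i;
                    t[i].mate = openers[found];
                    /* Drop openers pushed after the match, so pairs may nest
                       ([(foo)]) but never cross ([(foo]).                    */
                    top = found;
                    continue;   /* a token that closed does not also open */
                }
            }

            if (t[i].can_open) {
                openers[top++] = i;     /* wait for a matching closer */
            }
        }

        free(openers);
    }

In this sketch each asterisk in a run like ***foo*** is its own token; the grouping of nested pairs into <strong> and <emph>, as described in step 5, would happen in a later pass.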

In my testing, this approach has worked quite well. It handles all the basic scenarios I've thrown at it, and all of the "basic" and "devious" edge cases I have thought of (some of these don't necessarily have a "right" answer -- but v6 gives consistent answers that seem as reasonable as any others to me). There are also three more edge cases I've come up with that can still stump it, and ironically they are handled correctly by most implementations. They just don't follow the rules above. I'll continue to work on this.

In the end, I scrapped this effort, but kept the lessons learned in the token pairing algorithm.

Second Attempt

I tried again this past fall. This time, I approached the problem with lots of reading. Lots and lots of reading -- tons of websites, computer science journal articles, PhD theses, etc. I learned a lot about lexers, and a lot about parsers, including hand-crafting them versus using parser generators. In brief:

  1. I learned about the Aho-Corasick algorithm, which is a great way to efficiently search a string for multiple target strings at once. I used this to create a custom lexer to identify tokens in a MultiMarkdown text document (e.g. *, [, {++, etc.). I learned a lot, and had a good time working out the implementation. This code allowed me to efficiently break a string of text into the tokens that mattered for Markdown parsing.

  2. However, in a few instances I really needed some features of regular expressions to simplify more complex structures. After a quick bit of testing, using re2c to create a tokenizer turned out to be just as efficient, and it allowed me to incorporate some regex functionality that simplified later parsing. I'll keep the Aho-Corasick code around, and will probably experiment more with it later, but I didn't need it for MMD right now. lexer.re contains the source for the tokenizer (a rough sketch of what it produces follows this list).
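
As a rough illustration of what the tokenizer does -- the real rules live in lexer.re -- the sketch below classifies the next piece of input into a token type. The token names and the next_token() helper are hypothetical examples, not the actual MMD 6 identifiers.

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical token kinds -- the real set is defined alongside lexer.re. */
    enum { TOK_TEXT, TOK_STAR, TOK_BRACKET_LEFT, TOK_BRACKET_RIGHT, TOK_CRITIC_ADD_OPEN };

    /* Classify the token starting at s (assumed NUL-terminated) and report its length. */
    static int next_token(const char * s, size_t * len) {
        *len = 1;

        switch (*s) {
            case '*':
                return TOK_STAR;
            case '[':
                return TOK_BRACKET_LEFT;
            case ']':
                return TOK_BRACKET_RIGHT;
            case '{':
                if (strncmp(s, "{++", 3) == 0) {    /* CriticMarkup addition */
                    *len = 3;
                    return TOK_CRITIC_ADD_OPEN;
                }
                break;
            default:
                break;
        }

        return TOK_TEXT;    /* anything else is plain text */
    }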

I looked long and hard for a way to simplify the parsing algorithm to try and "touch" each token only once. Ideally, the program could step through each token, and decide when to create a new block, when to pair things together, etc. But I'm not convinced it's possible. Since Markdown's grammar varies based on context, it seems to work best when handled in distinct phases:

  1. Tokenize the string to identify key sections of text. This includes line breaks, allowing the text to be examined one line at a time.

  2. Join series of lines together into blocks, such as paragraphs, code blocks, lists, etc.

  3. The tokens inside each block can then be paired together to create more complex syntax such as links, strong, emphasis, etc.

To handle the block parsing, I started off using the Aho-Corasick code from my first attempt. I had actually implemented some basic regex functionality, and used that to group lines together to create blocks. But this quickly fell apart in the face of more complex structures such as recursive lists. After a lot of searching, and tons more reading, I ultimately decided to use a parser generator to handle the task of grouping lines into blocks. parser.y has the source for this, and it is processed by the lemon parser generator to create the actual code.

I chose to do this because hand-crafting the block parser would be complex. The end result would likely be difficult to read and understand, which would make it difficult to update later on. Using the parser generator allows me to write things out in a way that can more easily be understood by a person. In all likelihood, the performance is as good as anything I could have written by hand, if not better.

Because lemon is a LALR(1) parser, it does require a bit of thinking ahead about how to create the grammar used. But so far, it has been able to handle everything I have thrown at it.

Optimization

One of my goals for MMD 6 was performance. So I've paid attention to speed along the way, and have tried to use a few tricks to keep things fast. Here are some of the things I've learned, in no particular order:

Memory Allocation

When parsing a long document, a lot of token structures are created. Each one requires a small bit of memory to be allocated. In aggregate, that time added up and slowed down performance.

After reading for a bit, I ended up with an approach that uses larger chunks of memory. I allocate pools of memory in large slabs for smaller "objects". For example, I allocate memory for 1024 tokens at a single time, and then dole that memory out as needed. When the slab is used up, a new one is allocated. This dramatically improved performance.
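
A simplified sketch of that slab idea in C follows. It is only meant to show the shape of the approach -- the structure names, helper functions, and the fixed 1024-object slab size are illustrative, not the actual MMD 6 pool code.

    #include <stdlib.h>

    #define SLAB_OBJECTS 1024           /* objects handed out per slab */

    typedef struct slab {
        struct slab * next;             /* slabs are chained so they can all be freed at once */
        size_t        used;             /* objects already handed out from this slab */
        char          data[];           /* SLAB_OBJECTS * object_size bytes */
    } slab;

    typedef struct {
        size_t object_size;
        slab * head;                    /* current slab */
    } pool;

    /* Hand out one object; allocate a fresh slab only when the current one is used up. */
    void * pool_allocate(pool * p) {
        if ((p->head == NULL) || (p->head->used == SLAB_OBJECTS)) {
            slab * s = malloc(sizeof(slab) + SLAB_OBJECTS * p->object_size);
            if (s == NULL) {
                return NULL;
            }
            s->next = p->head;
            s->used = 0;
            p->head = s;
        }

        return p->head->data + (p->head->used++ * p->object_size);
    }

    /* Free every slab -- and therefore every object -- in one pass. */
    void pool_drain(pool * p) {
        while (p->head != NULL) {
            slab * next = p->head->next;
            free(p->head);
            p->head = next;
        }
    }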

When pairing tokens, I originally created a new stack for each block. Then I realized that an emptied stack doesn't have any "leftover" cruft to interfere with re-use, so I now use just one for the entire document. Again, a sizeable improvement in performance from allocating one object instead of many. When recursing to a deeper level, the stack just gets deeper, but earlier levels aren't modified.

Speaking of tokens, I realized that the average document contains a lot of single spaces (there's one between every two words I have written, for example). The vast majority of the time, these single spaces have no effect on the output of Markdown documents. I changed my whitespace token search to only flag runs of 2 or more spaces, dramatically reducing the number of tokens. This gives the benefit of needing fewer memory allocations, and also reduces the number of tokens that need to be processed later on. The only downside is remembering to check for a single space character in the few instances where it matters.

Proper input buffering

When I first began last spring, I was amazed to see how much time was being spent by MultiMarkdown simply reading the input file. Then I discovered it was because I was reading it one character at a time. I switched to using a buffered read approach and the time to read the file went to almost nothing. I experimented with different buffer sizes, but they did not seem to make a measurable difference.
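
For illustration, here is a minimal buffered-read sketch in C. The 4 KB buffer size and the read_whole_file() helper are arbitrary choices for the example, not MMD 6's actual I/O code, and most error handling is omitted.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Read an entire file in fixed-size chunks instead of one character at a time. */
    char * read_whole_file(FILE * in, size_t * out_length) {
        char     chunk[4096];           /* read 4 KB at a time; the exact size matters little */
        char *   result = NULL;
        size_t   length = 0;
        size_t   capacity = 0;
        size_t   n;

        while ((n = fread(chunk, 1, sizeof(chunk), in)) > 0) {
            if (length + n + 1 > capacity) {
                capacity = (capacity == 0) ? 2 * sizeof(chunk) : 2 * capacity;
                if (capacity < length + n + 1) {
                    capacity = length + n + 1;
                }
                result = realloc(result, capacity);
            }
            memcpy(result + length, chunk, n);
            length += n;
        }

        if (result != NULL) {
            result[length] = '\0';      /* NUL-terminate so the parser can treat it as a string */
        }

        *out_length = length;
        return result;
    }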

Output Buffering

I experimented with different approaches to creating the output after parsing. I tried printing directly to stdout, and even played with different buffering settings. None of those seemed to work well, and all were slower than using the d_string approach (formerly called GString in MMD 5).

Fast Searches

After getting basic Markdown functionality complete, I discovered during testing that the time required to parse a document grew exponentially as the document grew longer. Performance was on par with CommonMark for shorter documents, but fell increasingly behind in larger tests. Time profiling found that the culprit was searching for link definitions when they didn't exist. My first approach was to keep a stack of used link definitions and iterate through them when necessary; in long documents, this performs very poorly. After more research, I ended up using uthash. It allows me to search for a link (or footnote, etc.) by "name" rather than searching through an array. This got MMD's performance back to O(n), taking roughly twice as much time to process a document that is twice as long.
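
As a sketch of how uthash can be used for this kind of lookup, the code below keys hypothetical link definitions by label. The link_def structure and function names are examples for illustration, not the actual MMD 6 types.

    #include <string.h>
    #include "uthash.h"

    /* Hypothetical link definition -- the real MMD 6 structures differ. */
    typedef struct {
        char *          label;      /* normalized link label, used as the hash key */
        char *          url;
        UT_hash_handle  hh;         /* makes this structure hashable by uthash */
    } link_def;

    static link_def * definitions = NULL;

    /* Store a definition so it can later be found by name. */
    void store_definition(link_def * d) {
        HASH_ADD_KEYPTR(hh, definitions, d->label, strlen(d->label), d);
    }

    /* Average O(1) lookup by label, instead of walking an array of definitions. */
    link_def * find_definition(const char * label) {
        link_def * found = NULL;
        HASH_FIND_STR(definitions, label, found);
        return found;
    }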

Efficient Utility Functions

It is frequently necessary when parsing Markdown to check what sort of character we are dealing with at a certain position -- a letter, whitespace, punctuation, etc. I created a lookup table for this via char_lookup.c and hard-coded it in char.c. These routines allow me to quickly, and consistently, classify any byte within a document. This saved a lot of programming time, and saved time tracking down bugs from handling things slightly differently under different circumstances. I also suspect it improved performance, but don't have the data to back it up.
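
The sketch below shows the general shape of such a lookup table. The class names and the handful of entries shown are illustrative only; the real table lives in char.c (generated via char_lookup.c).

    /* Illustrative character classes -- see char.c for the real definitions. */
    enum {
        CHAR_WHITESPACE  = 1 << 0,
        CHAR_PUNCTUATION = 1 << 1,
        CHAR_ALPHA       = 1 << 2,
        CHAR_DIGIT       = 1 << 3,
    };

    /* 256-entry table indexed by byte value, so every classification is a
       single array lookup; only a few entries are shown here.             */
    static const unsigned char char_class[256] = {
        [' ']  = CHAR_WHITESPACE,
        ['\t'] = CHAR_WHITESPACE,
        ['*']  = CHAR_PUNCTUATION,
        ['_']  = CHAR_PUNCTUATION,
        ['a']  = CHAR_ALPHA,
        ['0']  = CHAR_DIGIT,
        /* ... remaining bytes filled in the same way ... */
    };

    int char_is_whitespace(unsigned char c) {
        return (char_class[c] & CHAR_WHITESPACE) != 0;
    }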

Testing While Writing

I developed several chunks of code in parallel while creating MMD 6. The vast majority of it was developed with a test-driven approach, and the rest was backed by extensive unit testing.

MMD itself isn't particularly amenable to this approach at the unit level, so instead I relied more on integration testing with an ever-growing collection of text files and the corresponding HTML files in the MMD 6 test suite. This allowed me to ensure that new features work properly and that old features aren't broken. At this time, there are 29 text files in the test suite, with many more to come.

Other Lessons

Some things that didn't do me any good....

I considered the differences between using malloc and calloc when initializing tokens. The time saved by using malloc was almost exactly offset by the time then required to initialize each token to default null values, compared with calloc. I had hoped that calloc would win out by clearing an entire slab in the object pool at once, but when that failed to help, I stuck with malloc, which makes more sense to me in my workflow.

I read a bit about struct padding and reordered some of my structs. It wasn't until later that I discovered the -Wpadded option, so it's not clear whether my changes accomplished anything. Since the structs were being padded automatically, there was no noticeable performance change, and I didn't have the tools to measure whether memory usage improved at all. I'm not sure this would be worth the effort anyway -- there is much lower hanging fruit available.

Differences in MultiMarkdown Itself

MultiMarkdown v6 is mostly about making a better MMD parser, but it will likely involve a few changes to the MultiMarkdown language itself.

  1. I am thinking about removing Setext headers from the language. I almost never use them, much preferring to use ATX style headers (# foo #). Additionally, I have never liked the fact that Setext headers allow the meaning of a line to be completely changed by the following line. It makes the parsing slightly more difficult on a technical level (requiring some backtracking at times). I'm not 100% certain on this, but right now I believe it's the only Markdown feature that doesn't exist in MMD 6 yet.

  2. Whitespace is not allowed between the text brackets and label brackets in reference links, images, footnotes, etc. For example [foo] [bar] will no longer be the same as [foo][bar].

  3. Link and image titles can be quoted with 'foo', "foo", or (foo).

  4. HTML elements are handled slightly differently. There is no longer a markdown="1" feature. Instead, an HTML element that is on a line by itself will open an HTML block, causing the rest of the "paragraph" to be treated as HTML so that Markdown will not be parsed inside of it. HTML block-level tags are even "stronger" at starting an HTML block. It is not quite as complex as the approach used in CommonMark, but is similar under most circumstances.

    For example, this would not be parsed:

     <div>
     *foo*
     </div>
    

    But this would be:

     <div>
    
     *foo*
    
     </div>
    
  5. I haven't worked a lot yet on the MMD-specific features, so there may be more changes to come. One thing I do anticipate is that if fenced code blocks stay, they will work slightly differently. Currently, an opening fence doesn't mean anything unless there is a closing fence that follows it. Again, this requires backtracking in the parser. I suspect that an opening fence will definitely open a code block. If there is no closing fence, then the rest of the document will remain inside the code block. This is the approach used by CommonMark and it's a reasonable one, IMO.

Where Does MultiMarkdown 6 Stand?

Features

I think that all basic Markdown features have been implemented, except for Setext headers, as mentioned above. Additionally, the following MultiMarkdown features have been implemented:

  • Automatic cross-reference targets
  • Basic Citation support
  • CriticMarkup support
  • Inline and reference footnotes
  • Image and Link attributes (attributes can now be used with inline links as well as reference links)
  • Math support
  • Smart quotes (support for languages other than English is not fully implemented yet)
  • Superscripts/subscripts

Things that are partially completed:

  • Citations -- still need:
    • Syntax for "not cited" entries
    • Output format
    • HTML --> separate footnotes and citations?
    • Locators required?
  • CriticMarkup -- need to decide:
    • How to handle CM stretches that include blank lines
  • Fenced code blocks

Things yet to be completed:

  • Multiple blocks inside of reference footnotes
  • Manually specified labels for headers
  • Definition lists
  • Abbreviations
  • Metadata
  • Glossaries
  • Tables
  • Table of Contents
  • File Transclusion

Accuracy

MultiMarkdown v6 successfully parses the Markdown syntax page, except for the Setext header at the top. It passes the 29 test files currently in place. There are a few ad

Performance

Basic tests show that MMD 6 currently takes about 20-25% longer than CommonMark 0.27.0 to process long files (e.g. 0.2 MB). However, it is around 5% faster than CommonMark when parsing a shorter file (27 kB), measured by parsing the same file 200 times over. These tests use the Markdown syntax page, modified to avoid the Setext header at the top. The longer files tested are created by copying the syntax page onto itself, thereby doubling the length of the file with each iteration.

The largest file I test is approximately 108 MB (4096 copies of the syntax page). On my machine (2012 Mac mini with 2.3 GHz Intel Core i7, 16 GB RAM), it takes approximately 4.4 seconds to parse with MMD 6 and 3.7 seconds with CommonMark. MMD 6 processes approximately 25 MB/s on this test file. CommonMark 0.27.0 gets about 29 MB/s on the same machine.

There are some slight variations with the smaller test files (8-32 copies), but overall the performance of both programs (MMD 6 and CommonMark) are roughly linear as the test file gets bigger (double the file size and it takes twice as long to parse, aka O(n)).

Out of curiosity, I ran the same tests on the original Markdown.pl by Gruber (v1.0.2b8). It took approximately 178 seconds to parse 128 copies of the file (3.4 MB) and demonstrated quadratic performance characteristics (double the file size and it takes 2^2, or 4 times, longer to process, aka O(n^2)). I didn't bother running it on larger versions of the test file. For comparison, MMD 6 can process 128 copies in approximately 140 msec.

Of note, the throughput speed drops when testing more complicated files containing more advanced MultiMarkdown features, though it still seems to maintain linear performance characteristics. A second test file is created by concatenating all of the test suite files (including the Markdown syntax file). In this case, MMD gets about 13 MB/s. CommonMark doesn't support these additional features, so testing it with that file is not relevant. I will work to see whether there are certain features in particular that are more challenging and see whether they can be reworked to improve performance.

As above, I have done some high-level optimization of the parse strategy, but I'm sure there's still a lot of room for further improvement. Suggestions welcome!

License

The `MultiMarkdown 6` project is released under the MIT License.

GLibFacade.c and GLibFacade.h are from the MultiMarkdown v4 project:

	https://github.com/fletcher/MultiMarkdown-4/

MMD 4 is released under both the MIT License and GPL.


CuTest is released under the zlib/libpng license. See CuTest.c for the text
of the license.


## The MIT License ##

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.