
html-parse

html-parse is an efficient, reasonably robust HTML tokenizer based on the HTML5 tokenization specification. The parser is written using the fast attoparsec parsing library and exposes both a native attoparsec Parser and convenience functions for lazily parsing token streams out of strict and lazy Text values.

For instance,

>>> parseTokens "<div><h1>Hello World</h1><br/><p class=widget>Example!</p></div>"
[TagOpen "div" [],TagOpen "h1" [],ContentText "Hello World",TagClose "h1",TagSelfClose "br" [],TagOpen "p" [Attr "class" "widget"],ContentText "Example!",TagClose "p",TagClose "div"]
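Since the result is an ordinary list of tokens, it can be processed with plain list functions. The following is a minimal sketch, not part of the library: it assumes that `parseTokens`, `Token (..)` and `Attr (..)` are exported from the `Text.HTML.Parser` module and that the constructors have the shapes shown in the output above. The helpers `textContent` and `classNames` are hypothetical names introduced only for this illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.Text as T
import qualified Data.Text.IO as TIO
-- Assumption: the tokenizer API lives in Text.HTML.Parser.
import Text.HTML.Parser (Attr (..), Token (..), parseTokens)

-- | Hypothetical helper: the text of every ContentText token, in document order.
textContent :: [Token] -> [T.Text]
textContent tokens = [t | ContentText t <- tokens]

-- | Hypothetical helper: the values of all "class" attributes found on
-- opening and self-closing tags.
classNames :: [Token] -> [T.Text]
classNames tokens =
  [ value
  | tok <- tokens
  , Attr name value <- attrsOf tok
  , name == "class"
  ]
  where
    attrsOf (TagOpen _ attrs)      = attrs
    attrsOf (TagSelfClose _ attrs) = attrs
    attrsOf _                      = []

main :: IO ()
main = do
  let tokens = parseTokens "<div><h1>Hello World</h1><br/><p class=widget>Example!</p></div>"
  mapM_ TIO.putStrLn (textContent tokens)  -- one line per text node
  mapM_ TIO.putStrLn (classNames tokens)   -- one line per class attribute value
```

Given the token list shown above, this should print the two text nodes followed by the single class value. Pattern matching in the list comprehensions silently skips tokens of other kinds, which suits a tokenizer that makes no attempt to build a tree.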

Performance

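Here are some typical performance numbers from parsing a fairly long Wikipedia article. In this run html-parse takes roughly 21 ms against roughly 171 to 177 ms for tagsoup, a speedup of about 8x: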

benchmarking Forced/tagsoup fast Text
time                 171.2 ms   (166.4 ms .. 177.3 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 171.9 ms   (169.4 ms .. 173.2 ms)
std dev              2.516 ms   (1.104 ms .. 3.558 ms)
variance introduced by outliers: 12% (moderately inflated)

benchmarking Forced/tagsoup normal Text
time                 176.9 ms   (167.3 ms .. 188.5 ms)
                     0.998 R²   (0.994 R² .. 1.000 R²)
mean                 180.7 ms   (177.5 ms .. 183.7 ms)
std dev              4.246 ms   (2.316 ms .. 5.803 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking Forced/html-parser
time                 20.88 ms   (20.60 ms .. 21.25 ms)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 20.99 ms   (20.81 ms .. 21.20 ms)
std dev              446.1 μs   (336.4 μs .. 596.2 μs)