
# taggy

An attoparsec-based HTML parser.

Currently very much a work in progress, but it already supports a fairly decent range of common websites. I haven't yet managed to find a website on which the current parser chokes, and performance is quite promising.

## Using taggy

taggy has a taggyWith function to work on HTML à la tagsoup.

``` haskell
taggyWith :: Bool -> LT.Text -> [Tag]
```

The Bool lets you specify whether you want the special HTML entities converted to their corresponding unicode characters; True means "yes, convert them please". The function takes lazy Text as input.
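
For instance, here's a minimal sketch of calling it both ways (I'm importing taggyWith from Text.Taggy here, alongside run; if your version exports it from Text.Taggy.Parser instead, adjust the import):

``` haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.Taggy (taggyWith)

-- Parse a small HTML snippet with and without entity conversion.
-- Tag has a Show instance, so we can print the results directly.
main :: IO ()
main = do
  let html = "<p class=\"intro\">Fish &amp; chips</p>"
  mapM_ print (taggyWith True  html) -- &amp; converted to its unicode character
  mapM_ print (taggyWith False html) -- &amp; left as-is
```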

Or you can use the raw run function, which returns a good old Result from attoparsec.

``` haskell
run :: Bool -> LT.Text -> AttoLT.Result [Tag]
```

For example, if you want to read the HTML from a file and print one tag per line, you could do:

``` haskell
import Data.Attoparsec.Text.Lazy (eitherResult)
import qualified Data.Text.Lazy.IO as T
import Text.Taggy (run)

taggy :: FilePath -> IO ()
taggy fp = do
  content <- T.readFile fp
  either (\s -> putStrLn $ "couldn't parse: " ++ s)
         (mapM_ print)
         (eitherResult $ run True content)
```
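
If you want to run that as a standalone program, a minimal sketch (the argument handling is mine, not part of taggy) that reuses the taggy function above could be:

``` haskell
import System.Environment (getArgs)

-- Run the `taggy` function from the snippet above on every file
-- passed on the command line.
main :: IO ()
main = getArgs >>= mapM_ taggy
```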

But taggy also provides support for DOM-style documents, computed from the list of tags you get from taggyWith.

If you fire up ghci with taggy loaded:

``` sh
$ cabal repl # if working with a copy of this repo
```

you can see domify in action:

```
λ> :set -XOverloadedStrings
λ> head . domify . taggyWith False $ "<html><head></head><body>yo</body></html>"
NodeElement (Element {eltName = "html", eltAttrs = fromList [], eltChildren = [NodeElement (Element {eltName = "head", eltAttrs = fromList [], eltChildren = []}),NodeElement (Element {eltName = "body", eltAttrs = fromList [], eltChildren = [NodeContent "yo"]})]})
```

Note that the Text.Taggy.DOM module contains a function that composes domify and taggyWith for you: parseDOM.
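
As a small, hedged illustration of working with the resulting tree (I'm assuming here that parseDOM takes the same Bool and lazy Text arguments as taggyWith, that Node and Element are exported from Text.Taggy.DOM, and that node contents are strict Text), you could collect all the text content of a document like this:

``` haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import Text.Taggy.DOM (Element(..), Node(..), parseDOM)

-- Collect the text content of a tree of Nodes, using only the
-- constructors visible in the Show output above.
nodeTexts :: Node -> [T.Text]
nodeTexts (NodeContent t)   = [t]
nodeTexts (NodeElement elt) = concatMap nodeTexts (eltChildren elt)

main :: IO ()
main =
  -- parseDOM True is the entity-converting equivalent of domify . taggyWith True
  mapM_ (mapM_ print . nodeTexts)
        (parseDOM True "<html><head></head><body>yo</body></html>")
```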

## Lenses for taggy

We (well, mostly Vikram Virma to be honest) have put up a companion taggy-lens library.

## Haddocks

I try to keep an up-to-date copy of the docs on my server: