a simple, tolerant & efficient HTML/XML parser (with HTML in mind though)

taggy

An attoparsec based html parser.

Still very much a work in progress, but it already handles a fairly decent range of common websites. So far I haven't found a website that the current parser chokes on.

The performance is quite promising for now, but we don't do a lot of the things that tagsoup does, like converting &amp; to &, etc.

Using taggy

taggy has a tagsIn function to work on HTML à la tagsoup.

tagsIn :: LT.Text -> [Tag]

Or you can use the raw run function, which returns a good old Result from attoparsec.

run :: LT.Text -> AttoLT.Result [Tag]

For example, if you want to read the html code from a file, and print one tag per line, you could do:

import Data.Attoparsec.Text.Lazy (eitherResult)
import qualified Data.Text.Lazy.IO as T
import Text.Taggy (run)

taggy :: FilePath -> IO ()
taggy fp = do
  content <- T.readFile fp
  either (\s -> putStrLn $ "couldn't parse: " ++ s) 
         (mapM_ print) 
         (eitherResult $ run content)

But taggy has also started providing support for DOM-style documents. The DOM is computed from the list of tags returned by tagsIn.

If you fire up ghci with taggy loaded:

$ cabal repl # if working with a copy of this repo

You can then see domify in action.

λ> :set -XOverloadedStrings
λ> head . domify . tagsIn $ "<html><head></head><body>yo</body></html>"
NodeElement (Element {eltName = "html", eltAttrs = fromList [], eltChildren = [NodeElement (Element {eltName = "head", eltAttrs = fromList [], eltChildren = []}),NodeElement (Element {eltName = "body", eltAttrs = fromList [], eltChildren = [NodeContent "yo"]})]})
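
To give an idea of how such a tree can be consumed, here is a minimal sketch of two traversals over a node shaped like the output above. The Node and Element types below are local stand-ins written for this example (with String and an association list in place of whatever text and map types taggy really uses), and nodeText, elementNames and sample are hypothetical names, not part of taggy.

```haskell
-- Local stand-ins that mirror the shape of the ghci output above.
-- ASSUMPTION: these are simplified copies for illustration only,
-- not taggy's own definitions.
data Node
  = NodeElement Element
  | NodeContent String
  deriving (Show, Eq)

data Element = Element
  { eltName     :: String
  , eltAttrs    :: [(String, String)]
  , eltChildren :: [Node]
  } deriving (Show, Eq)

-- | Concatenate all the text found below a node, in document order.
nodeText :: Node -> String
nodeText (NodeContent t) = t
nodeText (NodeElement e) = concatMap nodeText (eltChildren e)

-- | Names of every element in the tree, parents before children.
elementNames :: Node -> [String]
elementNames (NodeContent _) = []
elementNames (NodeElement e) =
  eltName e : concatMap elementNames (eltChildren e)

-- | The same document as in the ghci session above, built by hand.
sample :: Node
sample =
  NodeElement (Element "html" []
    [ NodeElement (Element "head" [] [])
    , NodeElement (Element "body" [] [NodeContent "yo"])
    ])
```

With these definitions, nodeText sample gives "yo" and elementNames sample gives ["html","head","body"]. The same recursion should carry over to the real types by swapping in the constructors that taggy exports.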

I'm already working on a companion taggy-lens library.