taggy

An attoparsec-based HTML parser.

Still very much a work in progress, but it already handles a fairly decent range of common websites. With the current parser, I haven't yet found a website it chokes on.

Performance looks quite promising so far, but we don't yet do a lot of the things that tagsoup does, like converting &amp; to &, etc.
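Until entity conversion is handled by the parser, text can be post-processed by hand. The following is only a minimal sketch that depends on the text package alone: decodeBasicEntities is a hypothetical helper, not part of taggy, and it covers just four named entities.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.Text as T
import qualified Data.Text.IO as T

-- | Decode a handful of named entities (hypothetical helper, not part of taggy).
-- The replacements run right to left, so "&amp;" is handled last; that way
-- "&amp;lt;" becomes "&lt;" instead of being decoded twice into "<".
decodeBasicEntities :: T.Text -> T.Text
decodeBasicEntities =
    T.replace "&amp;" "&"
  . T.replace "&quot;" "\""
  . T.replace "&gt;" ">"
  . T.replace "&lt;" "<"

main :: IO ()
main = T.putStrLn (decodeBasicEntities "a &lt; b &amp;&amp; c &gt; d")
-- prints: a < b && c > d
```

Numeric references and the full named-entity table are left out here on purpose; a real implementation would need both.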

Using `taggy`

taggy has a tagsIn function to work on HTML à la tagsoup.

tagsIn :: LT.Text -> [Tag]
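As a rough illustration, here is a small program that feeds an inline document to tagsIn. It assumes tagsIn is exported from Text.Taggy with the signature above and that Tag has a Show instance; the sample markup is made up for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.Text.Lazy as LT
import Text.Taggy (tagsIn) -- assumed export, matching the signature above

-- A small inline document; OverloadedStrings turns the literal into lazy Text.
sample :: LT.Text
sample = "<ul><li>one</li><li>two</li></ul>"

main :: IO ()
main = do
  let tags = tagsIn sample
  -- One tag per line, relying on Tag's Show instance (an assumption).
  mapM_ print tags
  putStrLn ("tags found: " ++ show (length tags))
```

Unlike run, tagsIn gives no access to attoparsec's Result, so this form suits cases where a parse failure does not need to be reported separately.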

Or you can use the raw run function, which returns a good old Result from attoparsec.

run :: LT.Text -> AttoLT.Result [Tag]

For example, if you want to read the html code from a file, and print one tag per line, you could do:

import Data.Attoparsec.Text.Lazy (eitherResult)
import qualified Data.Text.Lazy.IO as T
import Text.Taggy (run)

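-- Parse the file at the given path and print one tag per line,
-- or report attoparsec's error message if parsing fails.
taggy :: FilePath -> IO ()
taggy fp = do
  content <- T.readFile fp
  either (\s -> putStrLn $ "couldn't parse: " ++ s)
         (mapM_ print)
         (eitherResult $ run content)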

taggy has also started to support DOM-style documents. The DOM is computed from the list of tags produced by tagsIn.

If you fire up ghci with taggy loaded:

$ cabal repl # if working with a copy of this repo

You can see domify in action:

λ> :set -XOverloadedStrings
λ> head . domify . tagsIn $ "<html><head></head><body>yo</body></html>"
NodeElement (Element {eltName = "html", eltAttrs = fromList [], eltChildren = [NodeElement (Element {eltName = "head", eltAttrs = fromList [], eltChildren = []}),NodeElement (Element {eltName = "body", eltAttrs = fromList [], eltChildren = [NodeContent "yo"]})]})
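The tree shown above can be walked with ordinary pattern matching. The sketch below collects the text nodes of a document read from standard input. It assumes that Node, Element, domify and tagsIn are all exported from Text.Taggy, and that NodeContent carries a strict Text; nodeText is a hypothetical helper written for this example.

```haskell
module Main (main) where

import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Lazy.IO as LT
import Text.Taggy (Element(..), Node(..), domify, tagsIn) -- assumed exports

-- | Collect every text node beneath a node, in document order.
-- Assumes NodeContent wraps a strict Text (hypothetical helper, not in taggy).
nodeText :: Node -> [T.Text]
nodeText (NodeContent t) = [t]
nodeText (NodeElement e) = concatMap nodeText (eltChildren e)

main :: IO ()
main = do
  html <- LT.getContents
  -- domify yields the top-level nodes; flatten the text found under each one.
  mapM_ T.putStrLn (concatMap nodeText (domify (tagsIn html)))
```

The same recursion works for other queries, for instance matching on eltName to keep only the elements with a given tag name.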

I'm already working on a companion taggy-lens library.
