taggy
An attoparsec-based HTML parser.
Currently very much a work in progress, but it already handles a fairly decent range of common websites. So far I haven't found a website that the current parser chokes on.
The performance is quite promising for now, but we don't do a lot of things that tagsoup does, like converting &amp;amp; to &amp;, etc.
Using taggy
taggy has a tagsIn function to work on HTML à la tagsoup.
tagsIn :: LT.Text -> [Tag]
Or you can use the raw run function, which returns a good old Result from attoparsec.
run :: LT.Text -> AttoLT.Result [Tag]
For example, if you want to read the HTML from a file and print one tag per line, you could do:

import Data.Attoparsec.Text.Lazy (eitherResult)
import qualified Data.Text.Lazy.IO as T
import Text.Taggy (run)

taggy :: FilePath -> IO ()
taggy fp = do
  content <- T.readFile fp
  either (\s -> putStrLn $ "couldn't parse: " ++ s)
         (mapM_ print)
         (eitherResult $ run content)
taggy has also started to support DOM-style documents. The DOM is computed by the domify function from the list of tags returned by tagsIn.
If you fire up ghci with taggy loaded:
$ cabal repl # if working with a copy of this repo
You can see domify in action:
λ> :set -XOverloadedStrings
λ> head . domify . tagsIn $ "<html><head></head><body>yo</body></html>"
NodeElement (Element {eltName = "html", eltAttrs = fromList [], eltChildren = [NodeElement (Element {eltName = "head", eltAttrs = fromList [], eltChildren = []}),NodeElement (Element {eltName = "body", eltAttrs = fromList [], eltChildren = [NodeContent "yo"]})]})
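Once you have a tree like this, you can walk it with ordinary pattern matching. The following is only a rough sketch: to keep it loadable without taggy installed, it declares stand-in Element and Node types that mirror the shape printed above (with String and an association list as assumed simplifications of the library's text and map types). textOf, elementNames and example are hypothetical names, not part of taggy.

```haskell
-- Stand-in types mirroring the shape printed by domify above.
-- Assumption: the real taggy types use a text type and a map for
-- attributes; String and an association list keep this within base.
data Element = Element
  { eltName     :: String
  , eltAttrs    :: [(String, String)]
  , eltChildren :: [Node]
  } deriving (Show, Eq)

data Node
  = NodeElement Element
  | NodeContent String
  deriving (Show, Eq)

-- | Collect every text node under a node, in document order.
-- Hypothetical helper, not provided by taggy.
textOf :: Node -> [String]
textOf (NodeContent t) = [t]
textOf (NodeElement e) = concatMap textOf (eltChildren e)

-- | Names of all elements in the subtree, parents before children.
-- Hypothetical helper, not provided by taggy.
elementNames :: Node -> [String]
elementNames (NodeContent _) = []
elementNames (NodeElement e) =
  eltName e : concatMap elementNames (eltChildren e)

-- The tree from the ghci session above, written out by hand.
example :: Node
example =
  NodeElement (Element "html" []
    [ NodeElement (Element "head" [] [])
    , NodeElement (Element "body" [] [NodeContent "yo"])
    ])
```

Loaded in ghci, textOf example evaluates to ["yo"] and elementNames example to ["html","head","body"]. With the real library you would presumably drop the stand-in types and apply the same recursion to the nodes returned by domify.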
I'm already working on a companion taggy-lens library.