taggy
An attoparsec-based HTML parser.
Currently very much a work in progress, but it already handles a fairly decent range of common websites. I haven't yet found a website that the current parser chokes on. The performance is quite promising.
Using taggy
taggy has a taggyWith function to work on HTML à la tagsoup.
taggyWith :: Bool -> LT.Text -> [Tag]
The Bool argument lets you specify whether you want the special HTML entities converted to their corresponding Unicode characters. True means "yes, convert them please". This function takes lazy Text as input.
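As a small illustration of the Bool flag, the sketch below runs taggyWith twice on the same snippet, once with entity conversion and once without, and prints the resulting tags. It assumes taggyWith is importable from Text.Taggy (as run is) and relies only on Tag having a Show instance; the snippet text is made up for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text.Lazy as LT
import Text.Taggy (taggyWith)  -- assumption: re-exported by Text.Taggy, like run

-- A made-up snippet containing one HTML entity.
snippet :: LT.Text
snippet = "<p>fish &amp; chips</p>"

main :: IO ()
main = do
  putStrLn "entities converted:"
  mapM_ print (taggyWith True snippet)   -- one tag per line
  putStrLn "entities left as-is:"
  mapM_ print (taggyWith False snippet)
```

Comparing the two listings should make it easy to see what the flag changes in the text tags.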
Or you can use the raw run function, which returns a good old Result from attoparsec.
run :: Bool -> LT.Text -> AttoLT.Result [Tag]
For example, if you want to read HTML from a file and print one tag per line, you could do:
import Data.Attoparsec.Text.Lazy (eitherResult)
import qualified Data.Text.Lazy.IO as T
import Text.Taggy (run)
taggy :: FilePath -> IO ()
taggy fp = do
  content <- T.readFile fp
  either (\s -> putStrLn $ "couldn't parse: " ++ s)
         (mapM_ print)
         (eitherResult $ run True content)
But taggy has also started providing support for DOM-style documents. The DOM is computed by the domify function from the list of tags returned by taggyWith.
If you fire up ghci with taggy loaded:
$ cabal repl # if working with a copy of this repo
You can see domify in action:
λ> :set -XOverloadedStrings
λ> head . domify . taggyWith False $ "<html><head></head><body>yo</body></html>"
NodeElement (Element {eltName = "html", eltAttrs = fromList [], eltChildren = [NodeElement (Element {eltName = "head", eltAttrs = fromList [], eltChildren = []}),NodeElement (Element {eltName = "body", eltAttrs = fromList [], eltChildren = [NodeContent "yo"]})]})
Note that the Text.Taggy.DOM module contains a function that composes domify and taggyWith for you: parseDOM.
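To show what working with the resulting tree can look like, here is a minimal sketch that uses parseDOM and then walks the nodes to collect all text content. The helpers nodeText and pageText are hypothetical names introduced for this example, not part of taggy, and the sketch assumes that NodeContent carries strict Text and that parseDOM takes the same Bool and lazy Text arguments as taggyWith.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.Lazy as LT
import Text.Taggy.DOM (Element(..), Node(..), parseDOM)

-- Hypothetical helper: concatenate the text found under a node.
-- Assumes NodeContent holds strict Text.
nodeText :: Node -> T.Text
nodeText (NodeContent t) = t
nodeText (NodeElement e) = T.concat (map nodeText (eltChildren e))

-- Hypothetical helper: parse a document (converting entities)
-- and return all of its text content.
pageText :: LT.Text -> T.Text
pageText = T.concat . map nodeText . parseDOM True

main :: IO ()
main = print (pageText "<html><head></head><body>yo</body></html>")
```

The recursion mirrors the shape of the Node type shown in the ghci session above: content nodes are leaves, and element nodes only contribute through their eltChildren.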
Lenses for taggy
We (well, mostly Vikram Virma to be honest) have put up a companion taggy-lens library.
Haddocks
I try to keep an up-to-date copy of the docs on my server: