mirror of https://github.com/alpmestan/taggy.git synced 2024-08-16 18:30:30 +03:00

a simple, tolerant & efficient HTML/XML parser (with HTML in mind though)


Alp Mestanogullari 22bec0911a more forgiving auto-closing tag detection		2014-06-16 16:37:20 +02:00
bench	add ability to convert special html entities to unicode	2014-06-12 20:44:44 +02:00
example	add ability to convert special html entities to unicode	2014-06-12 20:44:44 +02:00
html_files	initial commit	2014-06-02 21:22:35 +02:00
src/Text	more forgiving auto-closing tag detection	2014-06-16 16:37:20 +02:00
.travis.yml	hopefully last travis fix	2014-06-12 15:33:56 +02:00
LICENSE	initial commit	2014-06-02 21:22:35 +02:00
README.md	readme update	2014-06-12 14:13:14 +02:00
report-entities.html	add bench result	2014-06-13 10:14:03 +02:00
Setup.hs	initial commit	2014-06-02 21:22:35 +02:00
taggy.cabal	add ability to convert special html entities to unicode	2014-06-12 20:44:44 +02:00

README.md

taggy

An attoparsec based html parser.

Currently very much a work in progress, but it already handles a fairly decent range of common websites. I haven't yet found a website that the current parser chokes on.

The performance is quite promising for now, but we don't do a lot of things that tagsoup does, like converting &amp; to &, etc.

Using `taggy`

taggy has a tagsIn function to work on HTML à la tagsoup.

tagsIn :: LT.Text -> [Tag]

Or you can use the raw run function, which returns a good old Result from attoparsec.

run :: LT.Text -> AttoLT.Result [Tag]

For example, if you want to read the html code from a file, and print one tag per line, you could do:

import Data.Attoparsec.Text.Lazy (eitherResult)
import qualified Data.Text.Lazy.IO as T
import Text.Taggy (run)

-- Parse the file at the given path and print one tag per line.
taggy :: FilePath -> IO ()
taggy fp = do
  content <- T.readFile fp
  -- eitherResult turns attoparsec's Result into Either String [Tag]
  either (\s -> putStrLn $ "couldn't parse: " ++ s)
         (mapM_ print)
         (eitherResult $ run content)

taggy has also started providing support for DOM-style documents. The DOM is computed from the list of tags returned by tagsIn.

If you fire up ghci with taggy loaded:

$ cabal repl # if working with a copy of this repo

You can see domify in action:

λ> :set -XOverloadedStrings
λ> head . domify . tagsIn $ "<html><head></head><body>yo</body></html>"
NodeElement (Element {eltName = "html", eltAttrs = fromList [], eltChildren = [NodeElement (Element {eltName = "head", eltAttrs = fromList [], eltChildren = []}),NodeElement (Element {eltName = "body", eltAttrs = fromList [], eltChildren = [NodeContent "yo"]})]})
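
To give an idea of how such a tree can be consumed, here is a minimal sketch of two traversals. It does not import taggy: the Node and Element types below are local stand-ins that only mirror the constructor and field names printed above, with String and an association list assumed in place of whatever text and map types the library really uses. The nodeText and elementNames helpers are hypothetical names, not part of taggy.

```haskell
-- Local stand-ins mirroring the shape printed by ghci above.
-- Assumption: String and an association list replace the library's
-- actual text and map types, so this compiles with base alone.
data Node
  = NodeElement Element
  | NodeContent String
  deriving (Show, Eq)

data Element = Element
  { eltName     :: String
  , eltAttrs    :: [(String, String)]
  , eltChildren :: [Node]
  } deriving (Show, Eq)

-- Concatenate every text node below a node, in document order.
-- (Hypothetical helper, not a taggy function.)
nodeText :: Node -> String
nodeText (NodeContent t) = t
nodeText (NodeElement e) = concatMap nodeText (eltChildren e)

-- Names of all elements in the subtree, parent before children.
-- (Hypothetical helper, not a taggy function.)
elementNames :: Node -> [String]
elementNames (NodeContent _) = []
elementNames (NodeElement e) =
  eltName e : concatMap elementNames (eltChildren e)

-- The tree from the ghci session above, rebuilt by hand.
example :: Node
example =
  NodeElement (Element "html" []
    [ NodeElement (Element "head" [] [])
    , NodeElement (Element "body" [] [NodeContent "yo"])
    ])

main :: IO ()
main = do
  putStrLn (nodeText example)      -- prints: yo
  print (elementNames example)     -- prints: ["html","head","body"]
```

With the real library, the same recursion should carry over once the stand-in types are swapped for the ones taggy exports, adjusting for its text and attribute map types.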

I'm already working on a companion taggy-lens library.
