*a simple, tolerant & efficient HTML/XML parser (with HTML in mind, though)*

# taggy

An attoparsec-based HTML parser.

Currently very much a work in progress, but it already handles a fairly wide range of common websites. I haven't yet found a website that the current parser chokes on, and performance is quite promising.

## Using taggy

taggy provides a `taggyWith` function for working on HTML à la tagsoup.

``` haskell
taggyWith :: Bool -> LT.Text -> [Tag]
```

The `Bool` specifies whether special HTML entities should be converted to their corresponding unicode characters; `True` means "yes, please convert them". This function takes lazy `Text` as input.
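As a quick sketch of the difference (this assumes taggy is installed and that `Text.Taggy` exports `taggyWith`; the printed representation of the tags is not shown here):

``` haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.Taggy (taggyWith)

main :: IO ()
main = do
  -- With True, entities such as &amp; are converted to their
  -- unicode characters in the resulting tags' text content.
  mapM_ print (taggyWith True "<p>Fish &amp; chips</p>")
  -- With False, the entity text is kept verbatim.
  mapM_ print (taggyWith False "<p>Fish &amp; chips</p>")
```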

Or you can use the raw `run` function, which returns a good old `Result` from attoparsec.

``` haskell
run :: Bool -> LT.Text -> AttoLT.Result [Tag]
```

For example, if you want to read the HTML from a file and print one tag per line, you could do:

``` haskell
import Data.Attoparsec.Text.Lazy (eitherResult)
import qualified Data.Text.Lazy.IO as T
import Text.Taggy (run)

taggy :: FilePath -> IO ()
taggy fp = do
  content <- T.readFile fp
  either (\s -> putStrLn $ "couldn't parse: " ++ s)
         (mapM_ print)
         (eitherResult $ run True content)
```

But taggy has also started providing support for DOM-style documents. These are computed from the list of tags produced by `taggyWith`.

If you fire up ghci with taggy loaded:

``` shell
$ cabal repl # if working with a copy of this repo
```

you can see `domify` in action:

```
λ> :set -XOverloadedStrings
λ> head . domify . taggyWith False $ "<html><head></head><body>yo</body></html>"
NodeElement (Element {eltName = "html", eltAttrs = fromList [], eltChildren = [NodeElement (Element {eltName = "head", eltAttrs = fromList [], eltChildren = []}),NodeElement (Element {eltName = "body", eltAttrs = fromList [], eltChildren = [NodeContent "yo"]})]})
```
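Once you have these `Node` values, plain recursion over the tree works as you'd expect. A minimal sketch (assuming `domify` and the `Node`/`Element` types shown above are exported from `Text.Taggy`; `nodeText` is a hypothetical helper, not part of taggy):

``` haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Text.Taggy (Element(..), Node(..), domify, taggyWith)

-- Hypothetical helper: collect all text content from a DOM tree,
-- depth-first, left to right.
nodeText :: Node -> [Text]
nodeText (NodeContent t) = [t]
nodeText (NodeElement e) = concatMap nodeText (eltChildren e)

main :: IO ()
main =
  mapM_ (mapM_ print . nodeText)
        (domify (taggyWith False "<html><head></head><body>yo</body></html>"))
```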

I'm already working on a companion `taggy-lens` library.