mirror of https://github.com/danneu/html-parser.git synced 2024-11-25 08:19:17 +03:00

a lenient html5 parser written in Elm

Dan Neumann 09bf0f643e Update readme		2022-05-16 20:51:51 -05:00
src/Html	Initial commit	2022-05-16 20:39:48 -05:00
tests	Initial commit	2022-05-16 20:39:48 -05:00
.prettierrc.json	Initial commit	2022-05-16 20:39:48 -05:00
elm.json	Initial commit	2022-05-16 20:39:48 -05:00
package-lock.json	Initial commit	2022-05-16 20:39:48 -05:00
package.json	Initial commit	2022-05-16 20:39:48 -05:00
README.md	Update readme	2022-05-16 20:51:51 -05:00

elm-html-parser

Note: Not currently published to Elm packages.

A lenient html5 parser implemented with Elm.

A lenient alternative to hecrj/elm-html-parser.

Usage

- Use run to parse an HTML string into a list of HTML nodes.
- Use runDocument to parse <!doctype html>[...] into a root node.

import Html.Parser 

Html.Parser.run "<p class=greeting>hello <strong>world</strong></p>"
-- Ok 
--     [ Element "p" [ ("class", "greeting") ] 
--          [ Text "hello "
--          , Element "strong" [] [ Text "world" ] 
--          ] 
--     ]
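
Once parsing succeeds, the result is an ordinary tree that can be walked with a case expression. The sketch below is a minimal illustration, not part of the package: it declares a local stand-in Node type that mirrors the shape of the output above (an assumption about the real definition), and the names textContent and example are hypothetical.

```elm
module NodeText exposing (Node(..), example, textContent)

-- Local stand-in for the parser's node type, mirroring the shape of the
-- parsed output shown in the usage example. This is an assumption, not the
-- package's own definition.


type Node
    = Element String (List ( String, String )) (List Node)
    | Text String
    | Comment String



-- Concatenate every text node under the given node, skipping comments.


textContent : Node -> String
textContent node =
    case node of
        Text value ->
            value

        Element _ _ children ->
            String.concat (List.map textContent children)

        Comment _ ->
            ""



-- The tree from the usage example.


example : Node
example =
    Element "p"
        [ ( "class", "greeting" ) ]
        [ Text "hello "
        , Element "strong" [] [ Text "world" ]
        ]
```

Here textContent example evaluates to "hello world".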

Rendering:

- Use nodeToHtml or nodesToHtml to render parsed nodes into virtual DOM nodes that Elm can render.
- Use nodeToString or nodesToString to render parsed nodes into a string.
- Use nodeToPrettyString or nodesToPrettyString to render parsed nodes into indented strings.
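
As a rough sketch of how parsing and rendering might be combined in a view, the example below parses user-supplied markup and falls back to plain text if parsing fails. It assumes nodesToHtml has the type List Node -> List (Html msg), which the README does not state, so check the module source for the exact signature. The module name Preview and the function view are hypothetical.

```elm
module Preview exposing (view)

import Html exposing (Html)
import Html.Parser



-- Render user-supplied markup inside a div.
-- Assumption: nodesToHtml : List Node -> List (Html msg).
-- On a parse error, show the raw input as plain text instead.


view : String -> Html msg
view input =
    case Html.Parser.run input of
        Ok nodes ->
            Html.div [] (Html.Parser.nodesToHtml nodes)

        Err _ ->
            Html.text input
```

Matching on Err _ avoids depending on the parser's error type, which is not documented here.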

Goals

Leniency
- Avoids validating while parsing.
- Prefers to imitate browser parsing behavior rather than follow the HTML5 spec.
- Uses the HTML5 spec only to resolve ambiguous cases, not to prohibit invalid HTML5.
- Prefers falling back to text nodes over short-circuiting with parse errors.

Handle user-written HTML
- Users don't write character entities like &amp; and &lt;. This parser should strive to handle cases like <p><:</p> -> Element "p" [] [ Text "<:" ].

Features / Quirks

- Characters don't need to be escaped into entities.
  e.g. <div><:</div> parses correctly and doesn't need to be rewritten into <div>&lt;:</div>.
- Tags that should not nest are autoclosed.
  e.g. <p>a<p>b -> <p>a</p><p>b</p>.
- Closing tags that have no matching open tag are ignored.
  e.g. </a><div></div></div></b> -> <div></div>
- Comments in whitespace positions are ignored.
  e.g. <div <!--comment-->/> -> <div/>
- Comments in text node positions are parsed.
  e.g. <div><!--comment--></div> -> Element "div" [] [ Comment "comment" ]
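
To make the "closing tags with no matching open tag are ignored" rule concrete, here is a small self-contained sketch over a simplified token list. It is not the package's implementation: the Token type and the functions dropUnmatchedClosers and dropThrough are hypothetical, and the real parser works on strings rather than pre-built tokens.

```elm
module Unmatched exposing (Token(..), dropThrough, dropUnmatchedClosers)

-- Simplified token stream, for illustration only.


type Token
    = Open String
    | Close String
    | Chars String



-- Keep a closing tag only if a tag with the same name is currently open.
-- The fold carries (stack of open tag names, kept tokens in reverse order).


dropUnmatchedClosers : List Token -> List Token
dropUnmatchedClosers tokens =
    let
        step token ( open, kept ) =
            case token of
                Open name ->
                    ( name :: open, token :: kept )

                Close name ->
                    if List.member name open then
                        ( dropThrough name open, token :: kept )

                    else
                        ( open, kept )

                Chars _ ->
                    ( open, token :: kept )
    in
    List.foldl step ( [], [] ) tokens
        |> Tuple.second
        |> List.reverse



-- Pop the stack up to and including the first entry equal to name.


dropThrough : String -> List String -> List String
dropThrough name open =
    case open of
        [] ->
            []

        top :: rest ->
            if top == name then
                rest

            else
                dropThrough name rest
```

With the input [ Close "a", Open "div", Close "div", Close "div", Close "b" ], which corresponds to </a><div></div></div></b>, the function returns [ Open "div", Close "div" ].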

Differences from existing packages

Currently, there is only one html parser published to Elm packages: hecrj/elm-html-parser.

@hecrj has said that following the html5 spec is a goal of their parser, so their parser is stricter by design and rejects invalid html5.

Development

git clone and npm install.

npm test to run tests
npm docs to preview docs locally

Special thanks

@hecrj and their contributors.
@ymtszw for their work on the JavaScript <script> parser.