Accept XHTML-style self-closing void tags (#305)

Allow the self-closing `/>` ending on void tags. On non-void tags it
was already "allowed" because of how the HTML parser works, but on the
elements where it actually occurs, like `<br/>`, it caused a parse
error. Support was never implemented because we only expected valid
HTML5, e.g. the output of Firefox's `Element.innerHTML`.

Use case: TranslateLocally uses Qt's HTML representation of rich text.
That HTML uses self-closing tags like `<meta .../>` and `<br/>`.
Implementing a string replace operation that would only match these
elements without parsing HTML is tricky. Fixing it in
bergamot-translator is not.

Implementation: Currently `<img>` is marked as a void tag (an element
that cannot have children or text) and is therefore treated differently.
Since void tags normally have no closing tag, they are treated as
immediately closed. The HTML parser we use reads `<img/>` as
`<img></img>`, which causes a problem: we then try to close an element
that was never opened.

This fix ignores the `TT_TAG_END` token from the parser when the tag
name is that of a void tag.
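
The idea can be sketched in isolation as follows. This is a minimal illustration, not the code in this commit: the `Token` and `TokenType` types and the `extractText` function are hypothetical stand-ins for the scanner's token stream, and the sketch leaves out the space that the real code inserts around void elements.

```cpp
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token model; the scanner used by bergamot-translator differs.
enum class TokenType { TagStart, TagEnd, Text };

struct Token {
  TokenType type;
  std::string value;  // tag name for TagStart/TagEnd, characters for Text
};

// Collects the text content and throws on unbalanced closing tags.
std::string extractText(const std::vector<Token> &tokens, const std::set<std::string> &voidTags) {
  std::vector<std::string> stack;  // names of the currently open elements
  std::string text;
  for (const Token &token : tokens) {
    switch (token.type) {
      case TokenType::TagStart:
        // Void tags count as immediately closed, so they never go on the stack.
        if (voidTags.count(token.value) == 0) stack.push_back(token.value);
        break;
      case TokenType::TagEnd:
        // The "/>" of "<img/>" arrives as a close token for a void tag: skip it.
        if (voidTags.count(token.value) != 0) break;
        if (stack.empty() || stack.back() != token.value)
          throw std::runtime_error("Unbalanced closing tag: " + token.value);
        stack.pop_back();
        break;
      case TokenType::Text:
        text += token.value;
        break;
    }
  }
  return text;
}
```

Under these assumptions, a token stream for `<p>hello<img/>world</p>` (start `p`, text, start `img`, end `img`, text, end `p`) passes through without an exception, because the end token for `img` is dropped before the stack is consulted.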
Jelmer 2022-01-19 09:22:46 +00:00 committed by GitHub
parent 6a4f409cda
commit acbc46d816
2 changed files with 10 additions and 2 deletions

@@ -482,6 +482,12 @@ TEST_CASE("Test self-closing tag (HTML5)") {
  CHECK(input == "hello  world and other creatures");  // Note double space between "hello" and "world"
}

TEST_CASE("Test self-closing tag (XHTML)") {
  std::string input("<p>hello<img/>world</p>");
  HTML html(std::move(input), true);
  CHECK(input == "hello world");  // <img/> introduced space
}

TEST_CASE("Test empty void tag at end of input") {
  std::string input("hello <br>");
  HTML html(std::move(input), true);

@@ -350,8 +350,10 @@ HTML::HTML(std::string &&source, bool process_markup, Options &&options) : optio
      } break;
      case markup::Scanner::TT_TAG_END:
        // Note: self-closing tags emit TT_TAG_END immediately after TT_TAG_START
        // but since we're parsing HTML5, a sole <img> will never emit a TT_TAG_END
        // If this is the closing bit of a void tag, i.e. triggered by the "/>"
        // bit of "<img/>", then completely ignore it.
        if (contains(options_.voidTags, std::string(scanner.tag()))) break;
        if (stack.empty()) throw BadHTML(format("Encountered more closing tags ({}) than opening tags", scanner.tag()));
        if (stack.back()->name != scanner.tag())