Description
Bug report
Originally, the HTML format was not strictly defined. It was similar to SGML and XML, but much looser, and `HTMLParser` tried its best to parse anything that looked like HTML. Since HTML5, however, the specification defines the parsing rules for all HTML documents, whether or not they are syntactically correct. Following these rules is important for security reasons.
The current `HTMLParser` mainly follows the HTML5 specification, but there are a number of differences:
- `--!>` should end the comment. (gh-102555: Fix comment parsing in HTMLParser #135664)
- `-- >` should not end the comment. (gh-102555: Fix comment parsing in HTMLParser #135664)
- `<!-->` and `<!--->` should be abnormally ended empty comments. (gh-102555: Fix comment parsing in HTMLParser #135664)
- `] ]>` and `]] >` should not end the CDATA section.
- CDATA handling should depend on the current node. This is important because the ending conditions are different for the CDATA section and the bogus comment (`]]>` and `>`, respectively).
- Whitespace should not be accepted between `</` and the tag name. E.g. `</ script>` should not end the script section. (gh-135661: Fix parsing start and end tags in HTMLParser #135930)
- Vertical tabulation (`\v`) and non-ASCII whitespace should not be recognized as whitespace. The only whitespace characters are `\t`, `\n`, `\r`, `\f` and the space. (gh-135661: Fix parsing start and end tags in HTMLParser #135930)
- The null character (U+0000) should not end the tag name. (gh-135661: Fix parsing start and end tags in HTMLParser #135930)
- The null character (U+0000), surrogate characters and many other special characters should be replaced by `\ufffd` (U+FFFD). I think we can leave this, because it is easy to do in pre-processing or post-processing, and these characters usually do not cause issues in Python.
- An end tag can have attributes and slashes after the tag name, so it does not necessarily end at the first `>`. E.g. `</script/foo=">"/>`. (gh-135661: Fix parsing start and end tags in HTMLParser #135930)
- Case-insensitive matching should only transform ASCII letters. E.g. `</script>` does not match `</ſcript>`, and `LINK` does not match `LINK` (the last letter is U+212A). (gh-135661: Fix parsing start and end tags in HTMLParser #135930)
- There may be multiple slashes and whitespace characters between the last attribute and the closing `>` in both start and end tags. E.g. `<a foo=bar/ //>`. (gh-135661: Fix parsing start and end tags in HTMLParser #135930)
- There should only be one `=` separator between the attribute name and value. E.g. `<a foo==bar>` should have attribute "foo" with value "=bar". (gh-135661: Fix parsing start and end tags in HTMLParser #135930)
- No whitespace should be accepted between the `=` separator and the attribute name or value. E.g. `<a foo =bar>` should have two attributes "foo" and "=bar", both with value None; `<a foo= bar>` should have two attributes: "foo" with value "" and "bar" with value None. (gh-135661: Fix parsing start and end tags in HTMLParser #135930)
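To check how a given Python version behaves on some of these inputs, a small probe that records the parser callbacks may help. This is only a sketch: `EventRecorder`, `record` and the sample list are hypothetical names chosen for illustration, and the events printed for the malformed samples depend on whether the linked fixes are present in the interpreter, so no expected output is shown for them.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records every callback as a tuple."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_data(self, data):
        self.events.append(("data", data))


def record(html):
    """Feed one document to a fresh parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # Samples taken from the list above; results vary between Python versions.
    samples = [
        "<!--a--!>b",
        "<!--a-- >b-->",
        "<script></ script></script>",
        "<a foo==bar>",
        "<a foo=bar/ //>",
    ]
    for sample in samples:
        print(repr(sample), "->", record(sample))
```

Comparing the printed events against what a browser produces for the same input shows which of the differences still apply.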
This can cause security issues for some programs. If a program uses `HTMLParser` to check HTML input for dangerous code, it can miss some of that code. For example, `<!----!><script>...</script><!---->` is parsed by browsers as a script block surrounded by two comments, but the current `HTMLParser` parses it as a single comment.
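The kind of check that is affected can be sketched as follows. This is an illustrative assumption about how such a program might look, not code from any real project: `ScriptDetector` and `has_script` are hypothetical names. With a parser that treats the example above as one comment, `has_script` would not see the `script` start tag; the actual result depends on the Python version in use.

```python
from html.parser import HTMLParser


class ScriptDetector(HTMLParser):
    """Hypothetical filter: flags input that contains a <script> start tag."""

    def __init__(self):
        super().__init__()
        self.found_script = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes the tag name lowercased.
        if tag == "script":
            self.found_script = True


def has_script(html):
    """Return True if the parser reports a <script> start tag in html."""
    detector = ScriptDetector()
    detector.feed(html)
    detector.close()
    return detector.found_script


if __name__ == "__main__":
    # The input from the paragraph above; browsers run the script block.
    tricky = "<!----!><script>...</script><!---->"
    print("script detected:", has_script(tricky))
```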