gh-69426: only unescape properly terminated character entities in attribute values by sissbruecker · Pull Request #95215 · python/cpython · GitHub

gh-69426: only unescape properly terminated character entities in attribute values #95215

Address review comments in parser.py
sissbruecker committed Jan 14, 2023
commit a7af75064a0262994177d9c3800707c3921c2a59
12 changes: 6 additions & 6 deletions Lib/html/parser.py
@@ -62,22 +62,22 @@
 # See: https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state
 attr_charref = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*)[;=]?')
Contributor Author

This partially duplicates an existing regex, but I was not able to reuse that one for this purpose.

Member

Since the issue only seems to affect named character references, is there a reason to include numeric charrefs too in this regex?

Contributor Author

Yes, the new _unescape_attrvalue is effectively a wrapper for html.unescape that only delegates to html.unescape if the attribute-specific conditions are met. Since we still want to unescape numeric and hex char refs in attributes, we need to include them in the regex.
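
To make the reply above concrete, here is a minimal sketch of what the new pattern matches. The pattern is copied from the diff; the sample string and the names `sample` and `matches` are made up for illustration and are not part of the PR.

```python
import re

# Pattern copied from the diff: decimal, hex, and named references,
# each with an optional trailing ';' or '='.
attr_charref = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*)[;=]?')

# Hypothetical attribute value mixing all reference styles.
sample = 'a=&#38;&#x26;&amp;&amp=&amp'

# Collect the full text of every match, including the terminator if present.
matches = [m.group(0) for m in attr_charref.finditer(sample)]
print(matches)
```

Because numeric and hex references are matched by the same pattern, the replacement callback sees them too and can pass them straight to the unescape function, while named references get the stricter attribute-specific checks.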

Member

It would be better to move this immediately after the definition of entityref and charref. That way, if we change one regexp, we will not forget to change the other.

Contributor Author

Done


-def replace_attr_charref(match):
+def _replace_attr_charref(match):
     ref = match.group(0)
     # Numeric / hex char refs must always be unescaped
-    if ref[1] == '#':
+    if ref.startswith('&#'):
         return unescape(ref)
     # Named character / entity references must only be unescaped
     # if they are an exact match, and they are not followed by an equals sign
-    terminates_with_equals = ref[-1:] == '='
+    terminates_with_equals = ref.endswith('=')
     exact_match = ref.lstrip('&').rstrip('=') in html5_entities
     if exact_match and not terminates_with_equals:
         return unescape(ref)
     # Otherwise do not unescape
     return ref

-def unescape_attrvalue(s):
-    return attr_charref.sub(replace_attr_charref, s)
+def _unescape_attrvalue(s):
+    return attr_charref.sub(_replace_attr_charref, s)


 class HTMLParser(_markupbase.ParserBase):
@@ -343,7 +343,7 @@ def parse_starttag(self, i):
                  attrvalue[:1] == '"' == attrvalue[-1:]:
                 attrvalue = attrvalue[1:-1]
             if attrvalue:
-                attrvalue = unescape_attrvalue(attrvalue)
+                attrvalue = _unescape_attrvalue(attrvalue)
             attrs.append((attrname.lower(), attrvalue))
             k = m.end()
