html.parser: fix ‘<![CDATA[ ... ]]>’ handling not capturing ‘]’ · python/cpython@6deda18 · GitHub

Commit 6deda18

html.parser: fix ‘<![CDATA[ ... ]]>’ handling not capturing ‘]’
Per the documentation, the unknown_decl method is called with ‘the entire contents of the declaration inside the <![...]> markup.’ This is not quite the case for ‘<![CDATA[...]]>’, where the first of the two closing square brackets should be included but isn’t. In other words, for such a declaration unknown_decl is called with the string ‘CDATA[...’ (note the unmatched bracket). Omitting the closing bracket contradicts the documentation and also makes it impossible to output the declaration unchanged, since ‘"<![" + data + "]>"’ loses one of the brackets at the end. Fix this by including the first of the closing brackets when calling unknown_decl.
1 parent 94894dd commit 6deda18

2 files changed: 11 additions and 3 deletions

Lib/_markupbase.py

Lines changed: 3 additions & 3 deletions
@@ -10,12 +10,12 @@
 _declname_match = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9]*\s*').match
 _declstringlit_match = re.compile(r'(\'[^\']*\'|"[^"]*")\s*').match
 _commentclose = re.compile(r'--\s*>')
-_markedsectionclose = re.compile(r']\s*]\s*>')
+_markedsectionclose = re.compile(r'](\s*]\s*>)')
 
 # An analysis of the MS-Word extensions is available at
 # http://www.planetpublish.com/xmlarena/xap/Thursday/WordtoXML.pdf
 
-_msmarkedsectionclose = re.compile(r']\s*>')
+_msmarkedsectionclose = re.compile(r'(]\s*>)')
 
 del re
 
@@ -157,7 +157,7 @@ def parse_marked_section(self, i, report=1):
         if not match:
             return -1
         if report:
-            j = match.start(0)
+            j = match.start(1)
             self.unknown_decl(rawdata[i+3: j])
         return match.end(0)
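
The effect of the Lib/_markupbase.py change can be checked in isolation. The following is a minimal sketch, not part of the commit: it copies the old and new patterns from the diff and applies the same slicing as parse_marked_section to a sample string. The variable names (old_close, new_close, rawdata and so on) and the sample inputs are chosen here for illustration only.

```python
import re

# Patterns copied from the Lib/_markupbase.py diff: before and after the change.
old_close = re.compile(r']\s*]\s*>')
new_close = re.compile(r'](\s*]\s*>)')

rawdata = '<![CDATA[x<y]]>'  # illustrative input
i = 0                        # index of the '<' that opens the declaration

# Before: the slice ends where the whole match starts, so both ']' are dropped.
old_match = old_close.search(rawdata, i + 3)
old_decl = rawdata[i + 3: old_match.start(0)]

# After: the slice ends where group 1 starts, i.e. just past the first ']'.
new_match = new_close.search(rawdata, i + 3)
new_decl = rawdata[i + 3: new_match.start(1)]

print(old_decl)  # CDATA[x<y
print(new_decl)  # CDATA[x<y]

# With the new slice the declaration can be reassembled unchanged.
roundtrip = '<![' + new_decl + ']>'
print(roundtrip == rawdata)  # True

# The MS-Word pattern only gains a group around the whole expression,
# so start(1) equals start(0) and the reported text stays the same.
ms_old = re.compile(r']\s*>')
ms_new = re.compile(r'(]\s*>)')
ms_raw = '<![if !supportLists]>'  # illustrative input
ms_same = ms_old.search(ms_raw, 3).start(0) == ms_new.search(ms_raw, 3).start(1)
print(ms_same)  # True
```

In both patterns match.end(0) is unaffected by the added group, so the position at which parsing resumes does not move; only the text passed to unknown_decl changes, and only for the ‘]]>’ form.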

Lib/test/test_htmlparser.py

Lines changed: 8 additions & 0 deletions
@@ -315,6 +315,14 @@ def get_events(self):
                             ("endtag", element_lower)],
                             collector=Collector(convert_charrefs=False))
 
+    def test_cdata_decl(self):
+        self._run_check('<math><ms><![CDATA[x<y]]></ms></math>',
+                        [('starttag', 'math', []),
+                         ('starttag', 'ms', []),
+                         ('unknown decl', 'CDATA[x<y]'),
+                         ('endtag', 'ms'),
+                         ('endtag', 'math')])
+
     def test_comments(self):
         html = ("<!-- I'm a valid comment -->"
                 '<!--me too!-->'
