From de70ae4ca71b15bc8c77f54c4e60ed345214c4df Mon Sep 17 00:00:00 2001 From: inikulin Date: Fri, 23 Jun 2017 21:50:04 +0300 Subject: [PATCH 01/68] Fix malformed JSON from previous commit --- tokenizer/entities.test | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tokenizer/entities.test b/tokenizer/entities.test index 1daff254..7c514563 100644 --- a/tokenizer/entities.test +++ b/tokenizer/entities.test @@ -17,14 +17,14 @@ {"description": "Semicolonless named entity 'not' followed by 'i;' in body", "input":"¬i;", -"output": [["Character", "\u00ACi;"]]}, +"output": [["Character", "\u00ACi;"]], "errors":[ { "code": "missing-semicolon-after-character-reference", "line": 1, "col": 5 } ]}, {"description": "Very long undefined named entity in body", "input":"&ammmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmp;", -"output": [["Character", 
"&ammmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmp;"]]}, +"output": [["Character", "&ammmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmp;"]], "errors":[ { "code": "unknown-named-character-reference", 
"line": 1, "col": 950 } ]}, From a5c88a483e4f643a5446ecca579ce344e6bd6d8a Mon Sep 17 00:00:00 2001 From: Ingvar Stepanyan Date: Wed, 12 Jul 2017 16:53:16 +0100 Subject: [PATCH 02/68] Remove `ignoreErrorOrder` option from docs It's not used anymore with changes in #92. --- tokenizer/README.md | 8 -------- 1 file changed, 8 deletions(-) diff --git a/tokenizer/README.md b/tokenizer/README.md index 56956369..50ba680f 100644 --- a/tokenizer/README.md +++ b/tokenizer/README.md @@ -84,14 +84,6 @@ If `test.doubleEscaped` is present and `true`, then every string within `test.output` must be further unescaped (as described above) before comparing with the tokenizer's output. -`test.ignoreErrorOrder` is a boolean value indicating that the order of -`ParseError` tokens relative to other tokens in the output stream is -unimportant, and implementations should ignore such differences between -their output and `expected_output_tokens`. (This is used for errors -emitted by the input stream preprocessing stage, since it is useful to -test that code but it is undefined when the errors occur). If it is -omitted, it defaults to `false`. - xmlViolation tests ------------------ From 8e19e7ad29473842154977d7624aee0097a6def2 Mon Sep 17 00:00:00 2001 From: Ingvar Stepanyan Date: Mon, 17 Jul 2017 15:56:04 +0100 Subject: [PATCH 03/68] Concatenate character tokens Looks like these few places were missed when ParseError token type was removed. This PR fixes them to restore the state promised in the README: > All adjacent character tokens are coalesced into a single ["Character", data] token. 
--- tokenizer/test1.test | 4 ++-- tokenizer/test2.test | 6 +++--- tokenizer/test3.test | 4 ++-- tokenizer/test4.test | 6 +++--- tokenizer/unicodeCharsProblematic.test | 4 ++-- 5 files changed, 12 insertions(+), 12 deletions(-) diff --git a/tokenizer/test1.test b/tokenizer/test1.test index 09d15024..8b85050f 100644 --- a/tokenizer/test1.test +++ b/tokenizer/test1.test @@ -182,14 +182,14 @@ {"description":"Entity without trailing semicolon (1)", "input":"I'm &notit", -"output":[["Character","I'm "], ["Character", "\u00ACit"]], +"output":[["Character","I'm \u00ACit"]], "errors": [ {"code" : "missing-semicolon-after-character-reference", "line": 1, "col": 9 } ]}, {"description":"Entity without trailing semicolon (2)", "input":"I'm &notin", -"output":[["Character","I'm "], ["Character", "\u00ACin"]], +"output":[["Character","I'm \u00ACin"]], "errors": [ {"code" : "missing-semicolon-after-character-reference", "line": 1, "col": 9 } ]}, diff --git a/tokenizer/test2.test b/tokenizer/test2.test index 73f0421d..521694ca 100644 --- a/tokenizer/test2.test +++ b/tokenizer/test2.test @@ -119,7 +119,7 @@ {"description":"Hexadecimal entity pair representing a surrogate pair", "input":"&#xD869;&#xDED6;", -"output":[["Character", "\uFFFD"], ["Character", "\uFFFD"]], +"output":[["Character", "\uFFFD\uFFFD"]], "errors":[ { "code": "surrogate-character-reference", "line": 1, "col": 9 }, { "code": "surrogate-character-reference", "line": 1, "col": 17 } @@ -195,7 +195,7 @@ {"description":"Unescaped <", "input":"foo < bar", -"output":[["Character", "foo "], ["Character", "< bar"]], +"output":[["Character", "foo < bar"]], "errors":[ { "code": "invalid-first-character-of-tag-name", "line": 1, "col": 6 } ]}, @@ -242,7 +242,7 @@ {"description":"Empty end tag with following characters", "input":"a</>bc", -"output":[["Character", "a"], ["Character", "bc"]], +"output":[["Character", "abc"]], "errors":[ { "code": "missing-end-tag-name", "line": 1, "col": 4 } ]}, diff --git a/tokenizer/test3.test b/tokenizer/test3.test 
index ba3c15b3..85139d4d 100644 --- a/tokenizer/test3.test +++ b/tokenizer/test3.test @@ -88,7 +88,7 @@ {"description":"<\\u0000", "input":"<\u0000", -"output":[["Character", "<"], ["Character", "\u0000"]], +"output":[["Character", "<\u0000"]], "errors":[ { "code": "invalid-first-character-of-tag-name", "line": 1, "col": 2 }, { "code": "unexpected-null-character", "line": 1, "col": 2 } @@ -8415,7 +8415,7 @@ {"description":"<<", "input":"<<", -"output":[["Character", "<"], ["Character", "<"]], +"output":[["Character", "<<"]], "errors":[ { "code": "invalid-first-character-of-tag-name", "line": 1, "col": 2 }, { "code": "eof-before-tag-name", "line": 1, "col": 3 } diff --git a/tokenizer/test4.test b/tokenizer/test4.test index 8e55e767..dd247d54 100644 --- a/tokenizer/test4.test +++ b/tokenizer/test4.test @@ -190,7 +190,7 @@ {"description":"Empty hex numeric entities", "input":"&#x &#X ", -"output":[["Character", "&#x "], ["Character", "&#X "]], +"output":[["Character", "&#x &#X "]], "errors":[ { "code": "absence-of-digits-in-numeric-character-reference", "line": 1, "col": 4 }, { "code": "absence-of-digits-in-numeric-character-reference", "line": 1, "col": 8 } @@ -205,7 +205,7 @@ {"description":"Empty decimal numeric entities", "input":"&# &#; ", -"output":[["Character", "&# "], ["Character", "&#; "]], +"output":[["Character", "&# &#; "]], "errors":[ { "code": "absence-of-digits-in-numeric-character-reference", "line": 1, "col": 3 }, { "code": "absence-of-digits-in-numeric-character-reference", "line": 1, "col": 6 } @@ -274,7 +274,7 @@ {"description":"Surrogate code point edge cases", "input":"&#xD7FF;&#xD800;&#xD801;&#xDFFE;&#xDFFF;&#xE000;", -"output":[["Character", "\uD7FF"], ["Character", "\uFFFD"], ["Character", "\uFFFD"], ["Character", "\uFFFD"], ["Character", "\uFFFD\uE000"]], +"output":[["Character", "\uD7FF\uFFFD\uFFFD\uFFFD\uFFFD\uE000"]], "errors":[ { "code": "surrogate-character-reference", "line": 1, "col": 17 }, { "code": "surrogate-character-reference", "line": 1, "col": 25 }, diff --git 
a/tokenizer/unicodeCharsProblematic.test b/tokenizer/unicodeCharsProblematic.test index 346cad17..3ddb96c0 100644 --- a/tokenizer/unicodeCharsProblematic.test +++ b/tokenizer/unicodeCharsProblematic.test @@ -18,7 +18,7 @@ {"description": "Invalid Unicode character U+DFFF with valid preceding character", "doubleEscaped":true, "input": "a\\uDFFF", -"output":[["Character", "a"], ["Character", "\\uDFFF"]], +"output":[["Character", "a\\uDFFF"]], "errors":[ { "code": "surrogate-in-input-stream", "line": 1, "col": 2 } ]}, @@ -33,7 +33,7 @@ {"description":"CR followed by U+0000", "input":"\r\u0000", -"output":[["Character", "\n"], ["Character", "\u0000"]], +"output":[["Character", "\n\u0000"]], "errors":[ { "code": "unexpected-null-character", "line": 2, "col": 1 } ]} From 9314ef76ec48af7fe89aba23e754d47df6bb8a4b Mon Sep 17 00:00:00 2001 From: Ingvar Stepanyan Date: Tue, 25 Jul 2017 22:36:23 +0100 Subject: [PATCH 04/68] Add a list of currently allowed initial states (#101) Fixes #99 --- tokenizer/README.md | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/tokenizer/README.md b/tokenizer/README.md index 50ba680f..66b81e8f 100644 --- a/tokenizer/README.md +++ b/tokenizer/README.md @@ -45,9 +45,18 @@ into the corresponding Unicode code point. (Note that this option also affects the interpretation of `test.output`.) `test.initialStates` is a list of strings, each being the name of a -tokenizer state. The test should be run once for each string, using it +tokenizer state which can be one of the following: + +- `Data state` +- `PLAINTEXT state` +- `RCDATA state` +- `RAWTEXT state` +- `Script data state` +- `CDATA section state` + + The test should be run once for each string, using it to set the tokenizer's initial state for that run. If -`test.initialStates` is omitted, it defaults to `["data state"]`. +`test.initialStates` is omitted, it defaults to `["Data state"]`. 
`test.lastStartTag` is a lowercase string that should be used as "the tag name of the last start tag to have been emitted from this From cbafeba94586a1ade00d55e600fc52da8f849986 Mon Sep 17 00:00:00 2001 From: Simon Pieters Date: Tue, 22 Aug 2017 11:34:03 +0200 Subject: [PATCH 05/68] Test U+0000 in bogus comment and bogus doctype states Follows https://github.com/whatwg/html/pull/2939 --- tokenizer/test3.test | 161 +++++++++++++++++++++--- tokenizer/test4.test | 3 +- tree-construction/plain-text-unsafe.dat | Bin 9291 -> 9388 bytes 3 files changed, 148 insertions(+), 16 deletions(-) diff --git a/tokenizer/test3.test b/tokenizer/test3.test index 85139d4d..cb04d037 100644 --- a/tokenizer/test3.test +++ b/tokenizer/test3.test @@ -141,7 +141,8 @@ "input":"$`G|yxwc7w^Qc~*hw&Bu73i2(q7H3_Z& From be9fb2431d679e4e0c4a9db5f350cf0686a729b1 Mon Sep 17 00:00:00 2001 From: Henri Sivonen Date: Tue, 23 Jan 2018 17:33:37 +0200 Subject: [PATCH 06/68] Move `#script-off` to the usual place relative to the other sections of a test --- tree-construction/tests18.dat | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tree-construction/tests18.dat b/tree-construction/tests18.dat index 3ce39fc6..05363b39 100644 --- a/tree-construction/tests18.dat +++ b/tree-construction/tests18.dat @@ -51,11 +51,11 @@ #data