8000 [DomCrawler] HTML5 not recognized when document starts with a comment · Issue #37681 · symfony/symfony · GitHub
[go: up one dir, main page]

Skip to content

[DomCrawler] HTML5 not recognized when document starts with a comment #37681

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bert-w opened this issue Jul 27, 2020 · 1 comment
Closed

[DomCrawler] HTML5 not recognized when document starts with a comment #37681

bert-w opened this issue Jul 27, 2020 · 1 comment
Labels
Bug DomCrawler Good first issue Ideal for your first contribution! (some Symfony experience may be required) Help wanted Issues and PRs which are looking for volunteers to complete them.

Comments

@bert-w
Copy link
bert-w commented Jul 27, 2020

Symfony version(s) affected: 4.4.11 (symfony/dom-crawler)

Description
Symfony DOM crawler has an option to use a HTML5 parser when you install the respective package ( masterminds/html5). However, this parser is specifically checking for a HTML5 doc-type as the first content in the HTML. The following situation therefore does not work (see reproduction):

How to reproduce
Consider the following file sample.html:

<!--
    This is a comment
-->
<!DOCTYPE html>
<html lang="en">
<body>
    <h1>Hello</h1>
</body>
</html>

Next, we create a crawler with this content:

$crawler = new \Symfony\Component\DomCrawler\Crawler(file_get_contents('sample.html'), 'https://example.com');

The file above is now parsed using the regular non-html5 parser.

As seen on this line https://github.com/symfony/symfony/blob/master/src/Symfony/Component/DomCrawler/Crawler.php#L186,
it evaluates to parseXhtml instead of the expected parseHtml5:

$dom = null !== $this->html5Parser && strspn($content, " \t\r\n") === stripos($content, '<!doctype html>') ? $this->parseHtml5($content, $charset) : $this->parseXhtml($content, $charset);

This creates trivial issues since it is actually a HTML5 document.

P.S. I dont know if the html sample above is according to spec.

Possible Solution
1)
A dirty fix I'm using is simply discarding any HTML comments using a regex:

$content = preg_replace('/<!--.*?-->/s', '', file_get_contents('sample.html'));
$crawler = new Crawler($content, ...);

This is unlikely to be a closing solution. I can imagine there being websites that have <script> tags or even other html elements before the <!DOCTYPE html> definition. Again, I do not know if this is against html5 spec.

2)
Add a feature so the HTML5 parser can be forced for any content you pass. I have no clue what implications this has because this causes non-html5 content to be parsed by the HTML5 parser.

@stof
Copy link
Member
stof commented Jul 28, 2020

According to https://html.spec.whatwg.org/multipage/syntax.html#writing, we should indeed account for an optional BOM char and optional comments and whitespaces before the doctype.

@stof stof added Good first issue Ideal for your first contribution! (some Symfony experience may be required) Help wanted Issues and PRs which are looking for volunteers to complete them. labels Aug 6, 2020
RevZer0 pushed a commit to RevZer0/symfony that referenced this issue Aug 11, 2020
@fabpot fabpot closed this as completed Aug 12, 2020
fabpot pushed a commit to RevZer0/symfony that referenced this issue Aug 12, 2020
fabpot added a commit that referenced this issue Aug 12, 2020
This PR was squashed before being merged into the 4.4 branch.

Discussion
----------

Fix for issue #37681

| Q             | A
| ------------- | ---
| Branch?       | 4.4
| Bug fix?      | yes
| New feature?  | no
| Deprecations? | no
| Tickets       | Fix #37681
| License       | MIT
| Doc PR        |

Allow BOM character and comments before `<!DOCTYPE html>` declaration in DomCrawler while choosing a parser implementation

Commits
-------

9bc249e Fix for issue #37681
fabpot added a commit that referenced this issue Aug 12, 2020
* 4.4:
  Fix for issue #37681
  Revert changes to Table->fillCells()
  [Console] Table: support cells with newlines after a cell with colspan >= 2
  Fix redis connect with empty password
  [Validator] Add BC layer for notInRangeMessage when min and max are set
fabpot added a commit that referenced this issue Aug 12, 2020
* 5.1:
  Fix for issue #37681
  Revert changes to Table->fillCells()
  [Console] Table: support cells with newlines after a cell with colspan >= 2
  Fix redis connect with empty password
  [Validator] Add BC layer for notInRangeMessage when min and max are set
This was referenced Aug 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug DomCrawler Good first issue Ideal for your first contribution! (some Symfony experience may be required) Help wanted Issues and PRs which are looking for volunteers to complete them.
Projects
None yet
Development

No branches or pull requests

4 participants
0