[DomCrawler] HTML5 not recognized when document starts with a comment #37681
Labels
Bug
DomCrawler
Good first issue
Help wanted
Symfony version(s) affected: 4.4.11 (symfony/dom-crawler)
Description
The Symfony DomCrawler component can use an HTML5 parser when the respective package (masterminds/html5) is installed. However, the crawler only selects this parser when an HTML5 doctype is the first content in the HTML. The following situation therefore does not work (see reproduction).
How to reproduce
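Consider a file sample.html whose first content is an HTML comment, followed by the <!DOCTYPE html> declaration and the rest of the document.
Next, we create a crawler with this content.
The file above is now parsed using the regular non-HTML5 parser.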
As seen on this line, https://github.com/symfony/symfony/blob/master/src/Symfony/Component/DomCrawler/Crawler.php#L186, the check evaluates to parseXhtml instead of the expected parseHtml5.
This causes issues, since the file is actually an HTML5 document.
P.S. I don't know if the HTML sample above is according to spec.
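To illustrate why the leading comment matters, here is a minimal sketch of the kind of check the linked line appears to perform. It assumes the condition compares the length of the leading whitespace with the position of the first <!doctype html>. The function startsWithHtml5Doctype() and the sample markup are hypothetical stand-ins, not code from symfony/dom-crawler and not the original sample.html.

```php
<?php
// Hypothetical stand-in for the condition on the linked line of Crawler.php.
// Assumption: the HTML5 parser is chosen only when nothing but whitespace
// precedes a case-insensitive "<!doctype html>".
function startsWithHtml5Doctype(string $content): bool
{
    // Length of the leading run of whitespace characters.
    $leadingWhitespace = strspn($content, " \t\r\n");

    // Offset of the first doctype, or false when there is none.
    $doctypeOffset = stripos($content, '<!doctype html>');

    // Strict comparison: false (no doctype) never equals an integer offset.
    return $leadingWhitespace === $doctypeOffset;
}

// Illustrative stand-in for sample.html: a comment precedes the doctype.
$sample = <<<'HTML'
<!-- Some comment -->
<!DOCTYPE html>
<html>
    <head><title>Sample</title></head>
    <body><p>Hello</p></body>
</html>
HTML;

$parser = startsWithHtml5Doctype($sample) ? 'parseHtml5' : 'parseXhtml';
echo $parser, PHP_EOL; // prints "parseXhtml"
```

Under that assumption the comment pushes the doctype away from offset 0, so the comparison fails and the legacy parser is picked. Note that the HTML Standard's syntax section appears to permit comments and whitespace before the DOCTYPE, so a document shaped like this one should be valid.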
Possible Solution
1) A dirty fix I'm using is simply discarding any HTML comments using a regex.
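The original regex is not preserved here, so the following is only a sketch of what such a workaround might look like. The helper name stripHtmlComments() and the pattern are assumptions, not the reporter's exact code.

```php
<?php
// Hypothetical helper: removes every HTML comment from the markup.
function stripHtmlComments(string $html): string
{
    // Non-greedy match; the "s" modifier lets "." span line breaks.
    $stripped = preg_replace('/<!--.*?-->/s', '', $html);

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return $stripped ?? $html;
}

$html = "<!-- build 42 -->\n<!DOCTYPE html>\n<html><body><!-- note --><p>Hi</p></body></html>";
$clean = stripHtmlComments($html);

// Only whitespace now precedes the doctype.
var_dump(stripos(ltrim($clean), '<!doctype html>') === 0); // bool(true)
```

The cleaned string would then be passed to the crawler instead of the raw file contents. One trade-off of this pattern is that it removes comments everywhere in the document, including any inside script or style content, not only those ahead of the doctype.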
This is unlikely to be a complete solution. I can imagine there being websites that have <script> tags or even other HTML elements before the <!DOCTYPE html> definition. Again, I do not know if this is against the HTML5 spec.
2) Add a feature so the HTML5 parser can be forced for any content you pass. I have no clue what implications this has, because this causes non-HTML5 content to be parsed by the HTML5 parser.
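As a rough illustration of option 2, the sketch below shows how an opt-in flag could bypass doctype sniffing. Everything here is hypothetical: selectParser(), its parameters, and the $forceHtml5 flag do not exist in symfony/dom-crawler, and the default branch reuses the assumed whitespace-then-doctype check.

```php
<?php
// Hypothetical selector: none of these names exist in symfony/dom-crawler.
function selectParser(string $content, bool $html5ParserAvailable, bool $forceHtml5 = false): string
{
    if (!$html5ParserAvailable) {
        // masterminds/html5 is not installed: only the legacy parser is possible.
        return 'parseXhtml';
    }

    if ($forceHtml5) {
        // Caller opted in: skip content sniffing entirely.
        return 'parseHtml5';
    }

    // Default: assumed doctype sniffing (only whitespace before the doctype).
    $sniffed = strspn($content, " \t\r\n") === stripos($content, '<!doctype html>');

    return $sniffed ? 'parseHtml5' : 'parseXhtml';
}

$html = "<!-- generated -->\n<!DOCTYPE html>\n<html></html>";

echo selectParser($html, true), PHP_EOL;       // prints "parseXhtml"
echo selectParser($html, true, true), PHP_EOL; // prints "parseHtml5"
```

Keeping the flag off by default would leave existing behaviour unchanged, so only callers who opt in take on the risk of feeding non-HTML5 content to the HTML5 parser. In this sketch a forced request silently falls back to the legacy parser when the HTML5 package is missing; throwing an exception instead would be an equally defensible choice.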