Closed
Symfony version(s) affected
6.1.0
Description
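The text() method mangles some UTF-8 content when the normalizeWhitespace option is used. This happens because the preg_replace call in text() does not set the u (UTF-8) pattern modifier, so the subject is matched byte by byte instead of character by character. In the example below, the second byte (0x85) of ą (UTF-8: C4 85) is lost.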
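To illustrate the byte-versus-character difference, here is a minimal sketch that is independent of DomCrawler. It only counts how many times `.` matches with and without the u modifier; the variable names are illustrative and not taken from Symfony.

```php
<?php
// U+0105 (ą) is encoded in UTF-8 as the two bytes C4 85.
$text = "m\u{0105}";

// Without the u modifier, PCRE treats the subject as a sequence of bytes.
$bytes = preg_match_all('/./s', $text);

// With the u modifier, the subject is treated as UTF-8 code points.
$chars = preg_match_all('/./su', $text);

echo "matches without u: $bytes\n"; // 3 (6d, c4, 85)
echo "matches with u:    $chars\n"; // 2 (m, ą)
```

Because a pattern without u sees 0x85 as a standalone byte, a character class can match it in isolation and split a multi-byte character.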
How to reproduce
Example XML:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<node>mą</node>
Example PHP Code:
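// $sourcePath is the path to the example XML file above.
$document = new \Symfony\Component\DomCrawler\Crawler(file_get_contents($sourcePath));
// text() normalizes whitespace by default (normalizeWhitespace = true).
$text = $document->filter('node')->text();
// Raw node value, without whitespace normalization, for comparison.
$correctText = $document->filter('node')->getNode(0)->nodeValue;
// Dump each byte of both strings as hex.
print_r(array_map('dechex', array_map('ord', preg_split('//', $text))));
print_r(array_map('dechex', array_map('ord', preg_split('//', $correctText))));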
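Output (hex code of each byte; the first array is the result of text(), the second is nodeValue). The leading and trailing 0 entries come from the empty strings that preg_split('//', ...) returns at both ends: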
Array
(
[0] => 0
[1] => 6d
[2] => c4
[3] => 0
)
Array
(
[0] => 0
[1] => 6d
[2] => c4
[3] => 85
[4] => 0
)
Possible Solution
Set the u (UTF-8) modifier on the pattern that text() passes to preg_replace:
https://github.com/symfony/dom-crawler/blob/6.1/Crawler.php#L558
return trim(preg_replace('/(?:\s{2,}+|[^\S ])/u', ' ', $text));
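The proposed pattern can be checked in isolation. The sketch below wraps it in a standalone function; `normalizeWhitespace()` here is a hypothetical helper written for this example, not the actual Crawler method.

```php
<?php
// Hypothetical standalone helper using the pattern proposed above.
function normalizeWhitespace(string $text): string
{
    // The u modifier makes PCRE match UTF-8 code points instead of bytes,
    // so the 0x85 byte inside ą can no longer be matched on its own.
    return trim(preg_replace('/(?:\s{2,}+|[^\S ])/u', ' ', $text));
}

// "mą" keeps both bytes of ą (C4 85).
echo bin2hex(normalizeWhitespace("m\u{0105}")), "\n"; // 6dc485

// Runs of whitespace still collapse to a single space.
echo normalizeWhitespace("a \n\t b"), "\n"; // a b
```

Note that with the u modifier preg_replace returns null if the subject is not valid UTF-8, so a real fix may need to handle that case.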