[DomCrawler] `text()` method mangles UTF8 text · Issue #46822 · symfony/symfony · GitHub
[DomCrawler] text() method mangles UTF8 text #46822
Closed
@rvock

Symfony version(s) affected

6.1.0

Description

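The text() method mangles some UTF-8 content when the normalizeWhitespace option is used. This happens because the preg_replace() call does not set the u (UTF-8) modifier on its pattern. Without that modifier the subject is matched byte by byte, so a single byte of a multi-byte character (here the trailing 0x85 of "ą", which is encoded as C4 85) can be treated as whitespace, replaced by a space, and then trimmed away.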

How to reproduce

Example XML:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<node>mą</node>

Example PHP Code:

$document = new \Symfony\Component\DomCrawler\Crawler(file_get_contents($sourcePath));
$text = $document->filter('node')->text();
$correctText = $document->filter('node')->getNode(0)->nodeValue;
print_r(array_map('dechex', array_map('ord', preg_split('//', $text))));
print_r(array_map('dechex', array_map('ord', preg_split('//', $correctText))));

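Output (hex codes of the bytes in $text, then in $correctText; the leading and trailing 0 entries come from the empty strings that preg_split('//') returns at both ends):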

Array
(
    [0] => 0
    [1] => 6d
    [2] => c4
    [3] => 0
)
Array
(
    [0] => 0
    [1] => 6d
    [2] => c4
    [3] => 85
    [4] => 0
)
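
For anyone reproducing this, a more compact way to compare the bytes is bin2hex(). The following is only a sketch: hexBytes() is a hypothetical helper, not part of DomCrawler, and the string literal stands in for the node value from the XML above.

```php
<?php
// "mą" in UTF-8: 'm' is 0x6d, 'ą' (U+0105) is the two bytes 0xc4 0x85.
$correctText = "m\u{0105}";

// Hypothetical helper (not part of DomCrawler): space-separated hex bytes.
function hexBytes(string $s): string
{
    // bin2hex() yields two hex digits per byte; split them into pairs.
    return implode(' ', str_split(bin2hex($s), 2));
}

echo hexBytes($correctText), PHP_EOL; // 6d c4 85
```

Unlike the preg_split('//') approach, this prints no spurious 0 entries, so a missing byte such as 85 is easier to spot.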

Possible Solution

The solution is simple: set the u (UTF-8) modifier on the pattern used in the text() method:
https://github.com/symfony/dom-crawler/blob/6.1/Crawler.php#L558

return trim(preg_replace('/(?:\s{2,}+|[^\S ])/u', ' ', $text));
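
A minimal standalone sketch of the proposed line, for trying it outside of DomCrawler. The function name normalizeWhitespace() and the null check are assumptions made for illustration; they are not the actual Crawler code.

```php
<?php
// Hypothetical helper wrapping the proposed pattern; not the actual Crawler::text() code.
function normalizeWhitespace(string $text): string
{
    // The trailing "u" makes PCRE treat pattern and subject as UTF-8,
    // so multi-byte characters are matched as whole characters.
    $normalized = preg_replace('/(?:\s{2,}+|[^\S ])/u', ' ', $text);

    // With the "u" modifier, preg_replace() returns null if $text is not valid UTF-8.
    if ($normalized === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    return trim($normalized);
}

// "ą" (bytes c4 85) survives; runs of whitespace collapse to one space.
echo bin2hex(normalizeWhitespace("  m\u{0105}\n\tx ")), PHP_EOL; // 6dc4852078
```

One trade-off to be aware of: with the u modifier, input that is not valid UTF-8 makes preg_replace() return null instead of a string, so a real fix would need to decide how to handle that case.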

Additional Context

No response
