[9.x] Handle unicode characters on TrimStrings middleware #40600

rodrigopedra · 2022-01-25T02:49:14Z

PR #38117 introduced the ability to trim other Unicode space characters in the TrimStrings middleware.

As issue #40577 stated, the current implementation conflicts with some other Unicode characters whose UTF-8 encoding ends with the same byte as the NBSP character.

In particular, these Japanese characters:

だ (U+3060, UTF-8 bytes E3 81 A0)
ム (U+30E0, UTF-8 bytes E3 83 A0)

The implementation from PR #38117 causes the trailing A0 byte to be trimmed, which leaves an invalid UTF-8 sequence.
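To illustrate the failure mode, here is a minimal sketch. It assumes a byte-oriented `trim()` call whose character mask adds the NBSP bytes (C2 A0) to PHP's defaults; the exact mask used in PR #38117 is not quoted in this thread, so `$mask` is an assumption. Because `trim()` treats every byte in the mask as a separate character, the lone A0 byte at the end of だ and ム is stripped.

```php
<?php
// Assumption: a mask that appends the two NBSP bytes to trim()'s default characters.
$mask = " \t\n\r\0\x0B\xC2\xA0";

// "\u{3060}" is だ and "\u{30E0}" is ム; both end in the byte A0 when encoded as UTF-8.
foreach (["\u{3060}", "\u{30E0}"] as $char) {
    $trimmed = trim($char, $mask);

    // preg_match with the u modifier returns false when the subject is not valid UTF-8.
    $valid = preg_match('//u', $trimmed) === 1;

    printf("%s -> %s (valid UTF-8: %s)\n", bin2hex($char), bin2hex($trimmed), $valid ? 'yes' : 'no');
}
```

Both characters lose their final byte, and the remaining two bytes no longer form a valid UTF-8 sequence.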

This PR:

Changes the implementation to use preg_replace to trim the spaces, using the proper modifiers
Adds test cases that failed with the previous implementation and pass with the proposed one
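As a rough sketch of the approach described above, the helper below trims with `preg_replace` and the `u` modifier, so the pattern works on code points rather than bytes. The function name `unicode_trim` is hypothetical, and the pattern is along the lines of the one tested later in this thread, not necessarily the exact code that was merged.

```php
<?php
// Hypothetical helper; not part of Laravel.
function unicode_trim(string $value): string
{
    // With the u modifier, \s matches Unicode whitespace such as NBSP (U+00A0),
    // and multi-byte characters are never split.
    $trimmed = preg_replace('~^\s+|\s+$~u', '', $value);

    // preg_replace returns null on error (for example, invalid UTF-8 input),
    // so fall back to the untouched value in that case.
    return $trimmed ?? $value;
}

echo bin2hex(unicode_trim("\u{00A0}a\u{00A0}")), PHP_EOL; // NBSP on both sides is removed
echo bin2hex(unicode_trim("\u{3060}")), PHP_EOL;          // だ keeps all three bytes
```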

Closes #40577

foremtehan · 2022-01-25T14:07:02Z

‌‌‌‌‌‌‌‌

foremtehan · 2022-01-25T14:07:33Z

Can this also remove the Word Joiner character? See my comment above.

nshiro · 2022-01-25T15:03:58Z

@foremtehan
Well, I'm afraid not.
This is how I tested.

<?php
// This file is written in UTF-8.

$words[] = 'a';
$words[] = pack('C*', 0xC2, 0xA0) . 'a'; // NBSP (No-Break Space)
$words[] = pack('C*', 0xE2, 0x81, 0xA0) . 'a'; // Word Joiner

echo '<meta charset="utf-8">', PHP_EOL;
echo '<table border="1">', PHP_EOL;

foreach ($words as $word) {
    echo '<tr>', PHP_EOL;

    $value = preg_replace('~^\s+|\s+$~iu', '', $word);

    echo '<td>', $value, '</td>', PHP_EOL;
    echo '<td>', bin2hex($value), '</td>', PHP_EOL;
    echo '</tr>', PHP_EOL;
}

echo '</table>', PHP_EOL;

The result is below.

NBSP (No-Break Space)
https://unicode-table.com/jp/00A0/

Word Joiner
https://unicode-table.com/jp/2060/

rodrigopedra · 2022-01-25T18:18:44Z

Well, technically the Word Joiner character is not a space.

It could be removed by using this:

<?php
// This file is written in UTF-8.

$words[] = 'a';
$words[] = pack('C*', 0xC2, 0xA0) . 'a'; // NBSP (No-Break Space)
$words[] = pack('C*', 0xE2, 0x81, 0xA0) . 'a'; // Word Joiner

echo '<meta charset="utf-8">', PHP_EOL;
echo '<table border="1">', PHP_EOL;

foreach ($words as $word) {
    echo '<tr>', PHP_EOL;

    $value = preg_replace('~^[\s\x{2060}]+|[\s\x{2060}]+$~iu', '', $word);

    echo '<td>', $word, '</td>', PHP_EOL;
    echo '<td>', $value, '</td>', PHP_EOL;
    echo '<td>', bin2hex($value), '</td>', PHP_EOL;
    echo '</tr>', PHP_EOL;
}

echo '</table>', PHP_EOL;

But I am afraid that if we keep sending a PR for each occurrence of an undesired character, it might be overwhelming for the maintainers.

Is there a list of non-visible characters that we could use?

In the meantime, you can override the transform method from the TrimStrings middleware and use this version.
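A minimal, framework-free sketch of what such an override might look like follows. The class name `WordJoinerAwareTrimmer` and the `$except` defaults are illustrative assumptions; in a real application the method body would presumably go into the app's own TrimStrings middleware class, whose exact signature should be checked against the installed Laravel version.

```php
<?php
// Stand-in class so the sketch runs without Laravel; the name is hypothetical.
final class WordJoinerAwareTrimmer
{
    /** @var string[] Keys whose values are left untouched (assumed defaults). */
    private array $except = ['password', 'password_confirmation'];

    public function transform(string $key, $value)
    {
        if (in_array($key, $this->except, true) || ! is_string($value)) {
            return $value;
        }

        // \s covers Unicode whitespace under the u modifier;
        // U+2060 (Word Joiner) is added explicitly because it is not a space.
        $trimmed = preg_replace('~^[\s\x{2060}]+|[\s\x{2060}]+$~u', '', $value);

        // preg_replace returns null on invalid UTF-8; keep the original value then.
        return $trimmed ?? $value;
    }
}

$trimmer = new WordJoinerAwareTrimmer();
echo bin2hex($trimmer->transform('name', "\u{2060}a")), PHP_EOL; // the Word Joiner is removed
```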

Handle unicode characters on TrimStrings middleware

cdfd49b

rodrigopedra mentioned this pull request Jan 25, 2022

Multi-byte word problem with TrimStrings Middleware. #40577

Closed

driesvints approved these changes Jan 25, 2022

driesvints linked an issue Jan 25, 2022 that may be closed by this pull request

Multi-byte word problem with TrimStrings Middleware. #40577

Closed

taylorotwell merged commit 4491530 into laravel:9.x Jan 25, 2022

rodrigopedra deleted the 9.x branch January 25, 2022 17:53

rodrigopedra mentioned this pull request Apr 7, 2022

[9.x] Improve Unicode support on Str::squish() #41877

Merged

MrMicky-FR mentioned this pull request Apr 20, 2022

[9.x] Fix TrimStrings middleware with non-UTF8 characters #42065

Merged
