[9.x] Handle unicode characters on TrimStrings middleware by rodrigopedra · Pull Request #40600 · laravel/framework

Merged
taylorotwell merged 1 commit into laravel:9.x on Jan 25, 2022

Conversation

rodrigopedra
Contributor

PR #38117 introduced the ability to trim other Unicode space characters in the TrimStrings middleware.

As issue #40577 stated, the current implementation conflicts with some other Unicode characters whose UTF-8 encoding ends with the same byte (A0) as the NBSP character (C2 A0).

In particular, these Japanese characters:

  • だ (U+3060, UTF-8 bytes E3 81 A0)
  • ム (U+30E0, UTF-8 bytes E3 83 A0)

The implementation from PR #38117 causes the trailing A0 byte to be trimmed, which leaves an invalid UTF-8 sequence.
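
The failure can be reproduced outside the framework. The sketch below is not the code from PR #38117. It assumes a byte-wise trim() whose character list includes the two NBSP bytes (C2 A0), which is enough to show the effect. The helper names byteTrim and isValidUtf8 are hypothetical.

```php
<?php
// Assumption: a byte-wise trim() with the NBSP bytes (C2 A0) in its
// character list. This illustrates the bug; it is not the framework's code.
function byteTrim(string $value): string
{
    return trim($value, " \t\n\r\0\x0B\xC2\xA0");
}

// preg_match() with the u modifier returns false on malformed UTF-8.
function isValidUtf8(string $value): bool
{
    return preg_match('//u', $value) === 1;
}

$da = "\xE3\x81\xA0"; // だ (U+3060)
$trimmed = byteTrim($da);

echo bin2hex($trimmed), PHP_EOL;   // e381: the trailing A0 byte is gone
var_dump(isValidUtf8($da));        // bool(true)
var_dump(isValidUtf8($trimmed));   // bool(false)
```

Because trim() compares single bytes against its character list, any multi-byte character that ends in C2 or A0 is at risk, not only the two listed above.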

This PR:

  • Changes the implementation to use preg_replace with the u (Unicode) modifier, so whitespace is trimmed per character rather than per byte
  • Adds test cases that failed with the previous implementation and pass with the proposed one
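
A quick way to check the regex-based approach, using the pattern quoted later in this thread. unicodeTrim is a hypothetical helper name, and the cases are illustrative rather than the PR's actual test suite.

```php
<?php
// Hypothetical helper wrapping the pattern quoted in this thread.
// The u modifier makes PCRE treat subject and pattern as UTF-8, so \s
// matches whole characters (including NBSP) instead of single bytes.
// The i modifier is kept only to mirror the thread; it has no effect on \s.
function unicodeTrim(string $value): string
{
    return preg_replace('~^\s+|\s+$~iu', '', $value);
}

$cases = [
    "\xC2\xA0abc\xC2\xA0",  // NBSP on both sides
    "\xE3\x81\xA0",         // だ, last byte is A0
    "\xE3\x83\xA0",         // ム, last byte is A0
    "  \xE3\x81\xA0\t",     // だ with ASCII whitespace around it
];

foreach ($cases as $case) {
    echo bin2hex($case), ' => ', bin2hex(unicodeTrim($case)), PHP_EOL;
}
```

The NBSP and ASCII whitespace are removed, while the Japanese characters keep all three of their bytes.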

Closes #40577

@driesvints driesvints linked an issue Jan 25, 2022 that may be closed by this pull request
@foremtehan
Contributor

‌‌‌‌‌‌‌‌

@foremtehan
Contributor

Can this also remove the Word Joiner character? See my comment above.

@nshiro
Contributor
nshiro commented Jan 25, 2022

@foremtehan
Well, I'm afraid not.
This is how I tested.

<?php
// This file is written in UTF-8.

$words = [];
$words[] = 'a';
$words[] = pack('C*', 0xC2, 0xA0) . 'a'; // NBSP (No-Break Space), U+00A0
$words[] = pack('C*', 0xE2, 0x81, 0xA0) . 'a'; // Word Joiner, U+2060

echo '<meta charset="utf-8">', PHP_EOL;
echo '<table border="1">', PHP_EOL;

foreach ($words as $word) {
    echo '<tr>', PHP_EOL;

    $value = preg_replace('~^\s+|\s+$~iu', '', $word);

    echo '<td>', $value, '</td>', PHP_EOL;
    echo '<td>', bin2hex($value), '</td>', PHP_EOL;
    echo '</tr>', PHP_EOL;
}

echo '</table>', PHP_EOL;

The result is below.
[screenshot: 2022-01-26_00h02_04]

NBSP (No-Break Space)
https://unicode-table.com/jp/00A0/

Word Joiner
https://unicode-table.com/jp/2060/

@taylorotwell taylorotwell merged commit 4491530 into laravel:9.x Jan 25, 2022
@rodrigopedra rodrigopedra deleted the 9.x branch January 25, 2022 17:53
@rodrigopedra
Contributor Author

Well, technically the word joiner character is not a space.

It could be removed by using this:

<?php
// This file is written in UTF-8.

$words = [];
$words[] = 'a';
$words[] = pack('C*', 0xC2, 0xA0) . 'a'; // NBSP (No-Break Space), U+00A0
$words[] = pack('C*', 0xE2, 0x81, 0xA0) . 'a'; // Word Joiner, U+2060

echo '<meta charset="utf-8">', PHP_EOL;
echo '<table border="1">', PHP_EOL;

foreach ($words as $word) {
    echo '<tr>', PHP_EOL;

    $value = preg_replace('~^[\s\x{2060}]+|[\s\x{2060}]+$~iu', '', $word);

    echo '<td>', $word, '</td>', PHP_EOL;
    echo '<td>', $value, '</td>', PHP_EOL;
    echo '<td>', bin2hex($value), '</td>', PHP_EOL;
    echo '</tr>', PHP_EOL;
}

echo '</table>', PHP_EOL;

But I am afraid that if we keep sending a PR for each undesired character, it might become overwhelming for the maintainers.

Is there a list of non-visible characters that we could use?

In the meantime, you can override the transform method of the TrimStrings middleware and use this version.
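
A minimal sketch of such an override. To stay runnable without Laravel, BaseTrimStrings below is a hypothetical stand-in for the framework middleware, and its transform($key, $value) signature and $except property are assumptions about the real class. In an application, the subclass would extend the framework's TrimStrings instead, and the clean() wrapper would not be needed.

```php
<?php
// Hypothetical stand-in for the framework's TrimStrings middleware.
// Assumption: the real class exposes transform($key, $value) and $except.
class BaseTrimStrings
{
    protected $except = ['password', 'password_confirmation'];

    protected function transform($key, $value)
    {
        return $value;
    }

    // Convenience wrapper for this sketch only, so transform() can be called.
    public function clean(array $input): array
    {
        $out = [];
        foreach ($input as $key => $value) {
            $out[$key] = $this->transform($key, $value);
        }
        return $out;
    }
}

class TrimStrings extends BaseTrimStrings
{
    protected function transform($key, $value)
    {
        // Leave excluded keys and non-string values untouched.
        if (in_array($key, $this->except, true) || ! is_string($value)) {
            return $value;
        }

        // Pattern from the comment above: whitespace plus Word Joiner (U+2060).
        return preg_replace('~^[\s\x{2060}]+|[\s\x{2060}]+$~iu', '', $value);
    }
}

$clean = (new TrimStrings)->clean([
    'name'     => "\xE2\x81\xA0 a\xC2\xA0", // Word Joiner, space, "a", NBSP
    'password' => ' secret ',
]);

echo bin2hex($clean['name']), PHP_EOL; // 61
echo $clean['password'], PHP_EOL;      // excluded key, left as-is
```

Further invisible characters could be added to the bracketed class in the same way, at the cost of maintaining that list in the application.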

Successfully merging this pull request may close these issues.

Multi-byte word problem with TrimStrings Middleware.