8000 Make Unicode conversions more robust against invalid data by binary1248 · Pull Request #3628 · SFML/SFML · GitHub
[go: up one dir, main page]

Skip to content

Conversation

@binary1248
Copy link
Member

Title.

These changes are intended to make Unicode (UTF-8/16) conversions more robust against invalid sequences. Previously, feeding the conversions invalid sequences could result in those bytes being copied 1:1 to the UTF-32 output without proper validation. This resulted in confusion when attempting to decode non-UTF-8-encoded data as UTF-8 because the resulting sf::String would be filled with codepoints which did not correspond to the source data in its original encoding. An example of this can be seen in #2985.

The decoding and encoding functions already provided a parameter to specify a replacement codepoint to be inserted into the resulting data stream if decoding encountered an invalid sequence. Because validation was not performed correctly the replacement codepoint would not always be inserted when it had to be. This change also exposes the replacement parameter to all functions that call the decoding functions so the user has the possibility of specifying their own replacement codepoint when e.g. decoding UTF-8 directly into an sf::String. Not being able to do so would mean they are forced to use the default behaviour which would be to discard invalid sequences without replacing them.

Because silent conversion errors can be hard to detect and lead to many other unwanted effects, warning messages are output in debug builds to notify the user that conversion errors occurred so they can investigate.

Long term we should consider adding conversion functions that return the result of a conversion so the user can handle conversion errors from within the code itself.

We should also consider a mechanism to allow '\0' to be a replacement character. Currently this value is interpreted as an indication that invalid sequences should be discarded.

@binary1248 binary1248 self-assigned this Dec 20, 2025
@eXpl0it3r eXpl0it3r added this to the 3.1 milestone Dec 20, 2025
@github-project-automation github-project-automation bot moved this to Planned in SFML 3.1.0 Dec 20, 2025
@eXpl0it3r eXpl0it3r moved this from Planned to In Review in SFML 3.1.0 Dec 20, 2025
@binary1248 binary1248 force-pushed the bugfix/unicode_robustness branch from c28593c to f20b640 Compare December 20, 2025 18:54
@ZXShady
Copy link
Contributor
ZXShady commented Dec 20, 2025

Would using std::optional<char32_t> replacement be a good fit?

@github-actions
Copy link

Pull Request Test Coverage Report for Build 20398652844

Details

  • 50 of 50 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.07%) to 59.762%

Totals Coverage Status
Change from base Build 20375801732: 0.07%
Covered Lines: 13315
Relevant Lines: 21494

💛 - Coveralls

@binary1248 binary1248 force-pushed the bugfix/unicode_robustness branch from f20b640 to 3086b1f Compare December 21, 2025 02:17
@binary1248
Copy link
Member Author

@ZXShady That could work. From what I can tell, because std::optional<char32_t> allows implicit construction from char32_t it shouldn't even require an API break.

@eXpl0it3r
Copy link
Member

I like the idea of std::optional instead of another surrogate value, if we can implement it without breaking the API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

4 participants

0