Make Unicode conversions more robust against invalid data #3628

binary1248 · 2025-12-20T17:21:43Z

Title.

These changes are intended to make Unicode (UTF-8/16) conversions more robust against invalid sequences. Previously, feeding the conversions invalid sequences could result in those bytes being copied 1:1 to the UTF-32 output without proper validation. This resulted in confusion when attempting to decode non-UTF-8-encoded data as UTF-8 because the resulting sf::String would be filled with codepoints which did not correspond to the source data in its original encoding. An example of this can be seen in #2985.

The decoding and encoding functions already provided a parameter to specify a replacement codepoint to be inserted into the resulting data stream if decoding encountered an invalid sequence. Because validation was not performed correctly, the replacement codepoint would not always be inserted when it should have been. This change also exposes the replacement parameter to all functions that call the decoding functions, so the user can specify their own replacement codepoint when e.g. decoding UTF-8 directly into an sf::String. Without this, they would be forced to use the default behaviour, which is to discard invalid sequences without replacing them.
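To illustrate the behaviour described above, here is a minimal, self-contained sketch of a validating UTF-8 decoder. It is not the code from this PR: `decodeUtf8` is a hypothetical free function, and the choice to skip a single byte after each invalid sequence is an assumption made to keep the example short. A replacement of `0` discards invalid input, matching the convention described above.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not SFML's implementation): decode UTF-8 to UTF-32.
// Each invalid byte is replaced by `replacement`, or dropped if it is 0.
std::u32string decodeUtf8(std::string_view input, char32_t replacement = 0)
{
    // Smallest codepoint that legitimately needs a sequence of the given length
    static constexpr char32_t minimum[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u32string output;
    std::size_t    i = 0;
    while (i < input.size())
    {
        const auto  lead      = static_cast<unsigned char>(input[i]);
        std::size_t length    = 0; // stays 0 if the lead byte is invalid
        char32_t    codepoint = 0;

        if (lead < 0x80)                { length = 1; codepoint = lead; }
        else if ((lead & 0xE0) == 0xC0) { length = 2; codepoint = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; codepoint = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; codepoint = lead & 0x07; }

        // The sequence must fit in the input and every trailing byte must be 10xxxxxx
        bool valid = length != 0 && i + length <= input.size();
        for (std::size_t k = 1; valid && k < length; ++k)
        {
            const auto trail = static_cast<unsigned char>(input[i + k]);
            if ((trail & 0xC0) != 0x80)
                valid = false;
            else
                codepoint = (codepoint << 6) | (trail & 0x3F);
        }

        // Reject overlong encodings, UTF-16 surrogates and values past U+10FFFF
        if (valid && (codepoint < minimum[length] || codepoint > 0x10FFFF ||
                      (codepoint >= 0xD800 && codepoint <= 0xDFFF)))
            valid = false;

        if (valid)
            output.push_back(codepoint);
        else if (replacement != 0)
            output.push_back(replacement);

        // Assumption of this sketch: resynchronise by skipping one byte on error
        i += valid ? length : 1;
    }
    return output;
}
```

With this sketch, a lone `0xA0` byte (a non-breaking space in Latin-1, but not valid UTF-8 on its own) is no longer copied through: `decodeUtf8("x\xA0y", U'\uFFFD')` yields `U"x\uFFFDy"`, while `decodeUtf8("x\xA0y")` yields `U"xy"`.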

Because silent conversion errors can be hard to detect and lead to many other unwanted effects, warning messages are output in debug builds to notify the user that conversion errors occurred so they can investigate.

Long term we should consider adding conversion functions that return the result of a conversion so the user can handle conversion errors from within the code itself.
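One possible shape for such a function is sketched below. It is only an illustration: `DecodeResult` and `decodeUtf16Checked` are hypothetical names that do not exist in SFML, and reporting a count of invalid sequences is just one of several ways the result could be surfaced.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical result type: the decoded text plus how many errors were hit
struct DecodeResult
{
    std::u32string text;
    std::size_t    invalidSequences = 0;
};

// Hypothetical function: decode UTF-16 to UTF-32 and report unpaired surrogates.
// A replacement of 0 discards invalid units, as in the existing convention.
DecodeResult decodeUtf16Checked(std::u16string_view input, char32_t replacement = U'\uFFFD')
{
    DecodeResult result;
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        const char32_t unit   = input[i];
        const bool     isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        const bool     isLow  = unit >= 0xDC00 && unit <= 0xDFFF;

        if (!isHigh && !isLow)
        {
            result.text.push_back(unit); // plain BMP codepoint
            continue;
        }

        if (isHigh && i + 1 < input.size())
        {
            const char32_t next = input[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF)
            {
                // Combine a valid surrogate pair into one codepoint
                result.text.push_back(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
                ++i;
                continue;
            }
        }

        // Unpaired surrogate: count it and optionally substitute
        ++result.invalidSequences;
        if (replacement != 0)
            result.text.push_back(replacement);
    }
    return result;
}
```

A caller could then check `result.invalidSequences != 0` and decide whether to reject the input, log it, or carry on with the substituted text.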

We should also consider a mechanism to allow '\0' to be a replacement character. Currently this value is interpreted as an indication that invalid sequences should be discarded.

ZXShady · 2025-12-20T19:03:35Z

Would using std::optional<char32_t> replacement be a good fit?

github-actions · 2025-12-20T19:42:57Z

Pull Request Test Coverage Report for Build 20398652844

Details

50 of 50 (100.0%) changed or added relevant lines in 2 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.07%) to 59.762%

Totals
Change from base Build 20375801732:	0.07%
Covered Lines:	13315
Relevant Lines:	21494

💛 - Coveralls

binary1248 · 2025-12-21T02:26:18Z

@ZXShady That could work. From what I can tell, because std::optional<char32_t> allows implicit construction from char32_t it shouldn't even require an API break.
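A rough sketch of how that might look at a call site, assuming a hypothetical `sanitize` function (not part of SFML) that filters invalid UTF-32 codepoints. `std::nullopt` stands for "discard", which leaves `U'\0'` free to act as an ordinary replacement:

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical function: copy `input`, handling codepoints that are not valid
// Unicode scalar values. std::nullopt discards them; any other value,
// including U'\0', is inserted in their place.
std::u32string sanitize(std::u32string_view input, std::optional<char32_t> replacement = std::nullopt)
{
    std::u32string output;
    for (const char32_t codepoint : input)
    {
        const bool valid = codepoint <= 0x10FFFF && !(codepoint >= 0xD800 && codepoint <= 0xDFFF);
        if (valid)
            output.push_back(codepoint);
        else if (replacement.has_value())
            output.push_back(*replacement);
    }
    return output;
}
```

Existing calls that pass a plain `char32_t`, such as `sanitize(text, U'?')`, still compile because `std::optional<char32_t>` is implicitly constructible from `char32_t`. One caveat under this sketch: a caller that explicitly passes `U'\0'` would now get NUL characters inserted instead of having invalid input discarded.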

eXpl0it3r · 2026-01-08T09:53:34Z

I like the idea of std::optional instead of another surrogate value, if we can implement it without breaking the API.

binary1248 requested review from ChrisThrasher, MarioLiebisch and eXpl0it3r December 20, 2025 17:21

binary1248 self-assigned this Dec 20, 2025

binary1248 added bug m:sfml-system labels Dec 20, 2025

eXpl0it3r added this to SFML 3.1.0 Dec 20, 2025

eXpl0it3r added this to the 3.1 milestone Dec 20, 2025

github-project-automation bot moved this to Planned in SFML 3.1.0 Dec 20, 2025

eXpl0it3r moved this from Planned to In Review in SFML 3.1.0 Dec 20, 2025

binary1248 force-pushed the bugfix/unicode_robustness branch from c28593c to f20b640 Compare December 20, 2025 18:54

binary1248 mentioned this pull request Dec 21, 2025

Graphic glitch when displaying text containing non-breaking spaces #2985

Open

3 tasks

Make Unicode conversions more robust against invalid data.

3086b1f

binary1248 force-pushed the bugfix/unicode_robustness branch from f20b640 to 3086b1f Compare December 21, 2025 02:17
