Make Unicode conversions more robust against invalid data #3628
+351
−90
Title.
These changes are intended to make Unicode (UTF-8/16) conversions more robust against invalid sequences. Previously, feeding the conversions invalid sequences could result in those bytes being copied 1:1 to the UTF-32 output without proper validation. This caused confusion when attempting to decode non-UTF-8-encoded data as UTF-8, because the resulting `sf::String` would be filled with codepoints that did not correspond to the source data in its original encoding. An example of this can be seen in #2985.

The decoding and encoding functions already provided a parameter to specify a replacement codepoint to be inserted into the resulting data stream if decoding encountered an invalid sequence. Because validation was not performed correctly, the replacement codepoint was not always inserted when it should have been. This change also exposes the replacement parameter to all functions that call the decoding functions, so the user can specify their own replacement codepoint when e.g. decoding UTF-8 directly into an `sf::String`. Without this, they would be forced to use the default behaviour, which is to discard invalid sequences without replacing them.

Because silent conversion errors can be hard to detect and lead to many other unwanted effects, warning messages are output in debug builds to notify the user that conversion errors occurred so they can investigate.
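
To illustrate the behaviour described above, here is a minimal standalone sketch of a validating UTF-8 decoder. It is not the code from this PR: `decodeUtf8Strict` is a hypothetical name, and the choice to skip exactly one byte after each rejected sequence is an assumption made to keep the example short. It does follow the convention described here, where a replacement of `U'\0'` means "discard", and it prints a warning only in debug builds.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <string_view>

// Hypothetical helper for illustration only, not SFML's implementation.
// Decodes UTF-8 into UTF-32. Every rejected byte yields `replacement`,
// or is dropped when `replacement` is U'\0'.
std::u32string decodeUtf8Strict(std::string_view input, char32_t replacement = U'\0')
{
    std::u32string output;
    [[maybe_unused]] std::size_t invalidCount = 0;
    std::size_t i = 0;

    while (i < input.size())
    {
        const auto lead = static_cast<unsigned char>(input[i]);

        // Work out the expected sequence length from the lead byte.
        // `minimum` is the smallest codepoint that may use this length;
        // anything below it is an overlong encoding.
        std::size_t length = 0;
        char32_t codepoint = 0;
        char32_t minimum = 0;
        if (lead < 0x80)                { length = 1; codepoint = lead; }
        else if ((lead & 0xE0) == 0xC0) { length = 2; codepoint = lead & 0x1F; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; codepoint = lead & 0x0F; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; codepoint = lead & 0x07; minimum = 0x10000; }

        // Reject unknown lead bytes and truncated sequences.
        bool valid = length != 0 && length <= input.size() - i;

        // Every trailing byte must look like 10xxxxxx.
        for (std::size_t k = 1; valid && k < length; ++k)
        {
            const auto trail = static_cast<unsigned char>(input[i + k]);
            valid = (trail & 0xC0) == 0x80;
            codepoint = (codepoint << 6) | (trail & 0x3F);
        }

        // Reject overlong forms, surrogates and values past U+10FFFF.
        valid = valid && codepoint >= minimum && codepoint <= 0x10FFFF
                && (codepoint < 0xD800 || codepoint > 0xDFFF);

        if (valid)
        {
            output.push_back(codepoint);
            i += length;
        }
        else
        {
            ++invalidCount;
            if (replacement != U'\0')
                output.push_back(replacement);
            ++i; // assumption: resynchronise one byte at a time
        }
    }

#ifndef NDEBUG
    if (invalidCount != 0)
        std::cerr << "UTF-8 decoding rejected " << invalidCount << " byte(s)\n";
#endif

    return output;
}
```

With this sketch, Latin-1 input such as `"caf\xE9"` decodes to `caf` followed by the replacement codepoint when one is given, or to just `caf` with the default, instead of leaking the raw byte into the output.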
Long term we should consider adding conversion functions that return the result of a conversion so the user can handle conversion errors from within the code itself.
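
One possible shape for such a function, sketched here for UTF-16 input. Everything in it is hypothetical: `ConversionResult`, `decodeUtf16Checked` and the idea of reporting a count of invalid units are illustrative assumptions, not an API proposed in this PR.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical result type: the decoded text plus how many code units
// had to be rejected along the way.
struct ConversionResult
{
    std::u32string text;
    std::size_t    invalidCount = 0;

    bool ok() const { return invalidCount == 0; }
};

// Hypothetical checked decoder. Unpaired surrogates are counted and
// either replaced or, when `replacement` is U'\0', discarded.
ConversionResult decodeUtf16Checked(std::u16string_view input, char32_t replacement = U'\0')
{
    ConversionResult result;

    for (std::size_t i = 0; i < input.size(); ++i)
    {
        const char32_t unit = input[i];

        // Anything outside the surrogate range maps directly to UTF-32.
        if (unit < 0xD800 || unit > 0xDFFF)
        {
            result.text.push_back(unit);
            continue;
        }

        // A high surrogate must be followed by a low surrogate.
        if (unit <= 0xDBFF && i + 1 < input.size())
        {
            const char32_t next = input[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF)
            {
                result.text.push_back(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
                ++i; // the low surrogate has been consumed as well
                continue;
            }
        }

        // Unpaired surrogate: record the error so the caller can react.
        ++result.invalidCount;
        if (replacement != U'\0')
            result.text.push_back(replacement);
    }

    return result;
}
```

A caller could then check `result.ok()` and decide whether to retry with a different source encoding, rather than relying on debug-build warnings.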
We should also consider a mechanism to allow `'\0'` to be a replacement character. Currently this value is interpreted as an indication that invalid sequences should be discarded.
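
One way this could work, assuming C++17 is available, is to pass the replacement as a `std::optional<char32_t>` so that "discard" and "replace with NUL" become distinct states. The sketch below uses a hypothetical `decodeAscii` function with deliberately trivial validation (any byte above 0x7F is invalid) purely to show the parameter; it is not a proposed signature.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical example: std::nullopt means "discard invalid bytes",
// while any engaged value, including U'\0', is inserted as-is.
std::u32string decodeAscii(std::string_view input,
                           std::optional<char32_t> replacement = std::nullopt)
{
    std::u32string output;

    for (const char c : input)
    {
        const auto byte = static_cast<unsigned char>(c);

        if (byte < 0x80)
            output.push_back(byte);         // valid 7-bit ASCII
        else if (replacement)
            output.push_back(*replacement); // replace, possibly with NUL
        // otherwise: discard the invalid byte
    }

    return output;
}
```

The trade-off is a change to every signature that currently takes a plain `char32_t` replacement, so this would be a breaking change for existing callers.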