The function unicodedata.normalize() should always return an instance of the built-in str type. #129569
Comments
Is it really an issue that we return the object as is? Can't we document that behaviour instead? We also need to document that change. I'm not against making a true string for the normalized value, but this needs to be communicated properly to the end user.
I consider this a bug. An optimization should not affect the type of the result, and returning the original object is merely an optimization.
I guess that makes sense too. OK to categorize it as a bug, then.
Note that @corona10 removed the backport labels on the PR, so maybe we could discuss here whether to backport this as well?
If @serhiy-storchaka considers this a bug, then we can backport it. I was unsure whether this was a behavior change or a bug :)
…in str (#129570) Co-authored-by: Victor Stinner <vstinner@python.org>
…built-in str (pythonGH-129570) (cherry picked from commit c359fcd) Co-authored-by: Hizuru <106918920+Hizuru3@users.noreply.github.com> Co-authored-by: Victor Stinner <vstinner@python.org>
Since this is a breaking change, shouldn't a deprecation period apply? Also note that this behavior has existed for 20+ years. See https://discuss.python.org/t/the-function-unicodedata-normalize-should-always-return-an-instance-of-the-built-in-str-type/79090/10 for the discussion.
Oh, I didn't notice that there was a discussion on discuss.python.org about this change.
Bug report
Bug description:
The current implementation of unicodedata.normalize() returns a new reference to the input string when the data is already normalized. It is fine for instances of the built-in str type. However, if the function receives an instance of a subclass of str, the return type becomes inconsistent.
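A minimal sketch of the inconsistency, assuming a hypothetical subclass named `MyStr` (not part of the original report). The type printed for the already-normalized input depends on the interpreter: versions with the early return described above give back the subclass instance, while versions that include the fix give back a plain `str`.

```python
import unicodedata


class MyStr(str):
    """Hypothetical str subclass, used only for illustration."""


already_nfc = MyStr("abc")    # ASCII text is already in NFC form
needs_nfc = MyStr("e\u0301")  # 'e' + combining acute accent, composes to U+00E9

same = unicodedata.normalize("NFC", already_nfc)
changed = unicodedata.normalize("NFC", needs_nfc)

# On affected versions this prints "MyStr"; on fixed versions, "str".
print(type(same).__name__)
# A newly built result is always a plain str.
print(type(changed).__name__)
```

The point is that two calls with the same argument type can yield results of different types, depending only on whether the input happened to be normalized already.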
In addition, passing instances of user-defined str subclasses can lead to unexpected sharing of modifiable attributes.
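The attribute sharing can be illustrated roughly as follows. `TaggedStr` and its `tags` attribute are hypothetical names chosen for this sketch. The code branches on whether the interpreter returned the input object, so it runs on both affected and fixed versions.

```python
import unicodedata


class TaggedStr(str):
    """Hypothetical str subclass carrying a mutable attribute."""

    def __new__(cls, value, tags=None):
        obj = super().__new__(cls, value)
        obj.tags = list(tags or [])
        return obj


original = TaggedStr("abc", tags=["raw"])
result = unicodedata.normalize("NFC", original)

if result is original:
    # Affected interpreter: the "normalized" value is the very same object,
    # so mutating its attribute also changes the caller's input.
    result.tags.append("normalized")
    print(original.tags)
else:
    # Fixed interpreter: the result is an independent plain str
    # with no `tags` attribute to share.
    print(type(result).__name__)
```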
The solution would be to use the PyUnicode_FromObject() API for early returns in the normalize() function implementation instead of Py_NewRef() to make sure that the function always returns an instance of the built-in str type.
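The effect of the proposed change can be modelled in pure Python. This is only an approximation under stated assumptions, not the C implementation: the helper names are hypothetical, `unicodedata.is_normalized()` stands in for the internal quick check, returning the argument stands in for `Py_NewRef()`, and `str.__str__()` stands in for `PyUnicode_FromObject()`, which copies a subclass instance into an exact `str`.

```python
import unicodedata


class MyStr(str):
    """Hypothetical str subclass, used only for illustration."""


def normalize_returning_input(form, text):
    # Models the early return via Py_NewRef(): hand back the input object.
    if unicodedata.is_normalized(form, text):
        return text
    return unicodedata.normalize(form, text)


def normalize_returning_exact_str(form, text):
    # Models the early return via PyUnicode_FromObject(): an exact str is
    # returned unchanged, a subclass instance is copied into a plain str.
    if unicodedata.is_normalized(form, text):
        return str.__str__(text)
    return unicodedata.normalize(form, text)


value = MyStr("abc")
print(type(normalize_returning_input("NFC", value)).__name__)      # MyStr
print(type(normalize_returning_exact_str("NFC", value)).__name__)  # str
```

For exact `str` inputs both helpers return the same object, so the optimization is preserved for the common case.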
CPython versions tested on:
3.11, 3.13
Operating systems tested on:
Windows