The function unicodedata.normalize() should always return an instance of the built-in str type. · Issue #129569 · python/cpython · GitHub

The function unicodedata.normalize() should always return an instance of the built-in str type. #129569

Closed
Hizuru3 opened this issue Feb 2, 2025 · 7 comments · Fixed by webfutureiorepo/cpython#31
Labels
extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error

Comments

@Hizuru3
Contributor
Hizuru3 commented Feb 2, 2025

Bug report

Bug description:

The current implementation of unicodedata.normalize() returns a new reference to the input string when the data is already normalized. It is fine for instances of the built-in str type. However, if the function receives an instance of a subclass of str, the return type becomes inconsistent.

import unicodedata

class MyStr(str):
	pass

s1 = unicodedata.normalize('NFKC', MyStr('\u00c5'))   # U+00C5 (already normalized)
s2 = unicodedata.normalize('NFKC', MyStr('A\u030a'))  # U+0041 U+030A (not normalized)

print(type(s1), type(s2))		# <class '__main__.MyStr'> <class 'str'>

In addition, passing instances of user-defined str subclasses can lead to unexpected sharing of modifiable attributes:

import unicodedata

class MyStr(str):
	pass


original = MyStr('ascii string')
original.is_original = True

verified = unicodedata.normalize('NFKC', original)
verified.is_original = False

print(original.is_original)		# False

The solution would be to use the PyUnicode_FromObject() API for early returns in the normalize() function implementation instead of Py_NewRef() to make sure that the function always returns an instance of the built-in str type.
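
Until a fixed interpreter is available, calling code can guard against the subclass leak itself. The following is a minimal sketch, not the CPython patch: `to_exact_str` and `normalize_exact` are hypothetical helper names, and `str.__str__` is used here only as a rough Python-level analogue of what PyUnicode_FromObject() does in C (hand back an exact `str` as is, otherwise copy the text into one).

```python
import unicodedata


def to_exact_str(value: str) -> str:
    """Return value unchanged if it is an exact str, else a plain str copy."""
    if type(value) is str:
        return value
    # str.__str__ ignores any __str__ override on the subclass and
    # produces an instance of the built-in str type with the same text.
    return str.__str__(value)


def normalize_exact(form: str, text: str) -> str:
    """Hypothetical wrapper: normalize, then coerce to the built-in str."""
    return to_exact_str(unicodedata.normalize(form, text))


class MyStr(str):
    pass


original = MyStr('ascii string')
original.is_original = True

verified = normalize_exact('NFKC', original)
print(type(verified) is str)   # True
print(verified is original)    # False
print(original.is_original)    # True
```

Because the wrapper never hands back the caller's own object when that object is a subclass instance, attributes set on `original` cannot be reached through `verified`.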

CPython versions tested on:

3.11, 3.13

Operating systems tested on:

Windows


@Hizuru3 Hizuru3 added the type-bug An unexpected behavior, bug, or error label Feb 2, 2025
@picnixz picnixz added the extension-modules C modules in the Modules dir label Feb 2, 2025
@picnixz
Member
picnixz commented Feb 2, 2025

Is it really an issue that we return the object as is? Can't we document that behaviour instead? Some existing code may rely on this feature, so I think we should have a proper deprecation period where we would warn in the future that non-str objects will not be accepted and that their underlying str() representation will be used.

We also need to document that change. I'm not against making a true string for the normalized value but this needs to be communicated properly to the end user.
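
For illustration only, a deprecation period of the kind suggested here could look roughly like the sketch below. `normalize_with_warning` is a hypothetical wrapper and the warning text is invented; nothing like it exists in `unicodedata`.

```python
import unicodedata
import warnings


def normalize_with_warning(form: str, text: str) -> str:
    """Hypothetical shim: warn when a str subclass instance is passed in."""
    if type(text) is not str:
        # Invented message, shown only to illustrate the proposal.
        warnings.warn(
            "passing a str subclass to normalize() is deprecated; "
            "a plain str will be returned in a future version",
            DeprecationWarning,
            stacklevel=2,
        )
    return unicodedata.normalize(form, text)
```

The return type is left untouched during such a period; only the warning signals the upcoming change.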

@picnixz picnixz added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Feb 2, 2025
@serhiy-storchaka
Member

I consider this a bug. An optimization should not affect the type of the result, and returning the original object is merely an optimization.
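
That principle can be phrased as a check on return types. This is a sketch that assumes only the four normalization forms accepted by unicodedata.normalize(); `leaky_forms` is a hypothetical helper, and what it reports depends on the interpreter version in use.

```python
import unicodedata

FORMS = ('NFC', 'NFD', 'NFKC', 'NFKD')


class MyStr(str):
    pass


def leaky_forms(text: str) -> list[str]:
    """Return the forms for which normalize() does not yield an exact str."""
    return [
        form for form in FORMS
        if type(unicodedata.normalize(form, MyStr(text))) is not str
    ]


# With the fix applied both lists are empty. On an affected interpreter,
# forms under which the input is already normalized may be reported.
print(leaky_forms('ascii string'))
print(leaky_forms('A\u030a'))
```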

@picnixz
Member
picnixz commented Feb 2, 2025

I guess it also makes sense in that sense. Ok for categorizing it as a bug then.

@picnixz
Member
picnixz commented Feb 2, 2025

Note that @corona10 removed the backport labels on the PR, so maybe we could discuss here whether to backport this as well?

@picnixz picnixz added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels Feb 2, 2025
@corona10
Member
corona10 commented Feb 2, 2025

If @serhiy-storchaka considers this a bug, then we can backport it. I wasn't sure whether this was a behavior change or a bug :)

vstinner added a commit that referenced this issue Feb 21, 2025
…in str (#129570)

Co-authored-by: Victor Stinner <vstinner@python.org>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Feb 21, 2025
…built-in str (pythonGH-129570)

(cherry picked from commit c359fcd)

Co-authored-by: Hizuru <106918920+Hizuru3@users.noreply.github.com>
Co-authored-by: Victor Stinner <vstinner@python.org>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Feb 21, 2025
…built-in str (pythonGH-129570)

(cherry picked from commit c359fcd)

Co-authored-by: Hizuru <106918920+Hizuru3@users.noreply.github.com>
Co-authored-by: Victor Stinner <vstinner@python.org>
@malemburg
Member
malemburg commented Feb 21, 2025

Since this is a breaking change, shouldn't a deprecation period be applied? Also note that this behavior has been in existence for 20+ years.

Also see https://discuss.python.org/t/the-function-unicodedata-normalize-should-always-return-an-instance-of-the-built-in-str-type/79090/10 for the discussion.

@vstinner
Member

Oh, I didn't notice that there was a discussion on discuss.python.org about this change.

6 participants