The function unicodedata.normalize() should always return an instance of the built-in str type. #129569

Hizuru3 · 2025-02-02T09:39:25Z

Bug report

Bug description:

The current implementation of unicodedata.normalize() returns a new reference to the input string when the data is already normalized. This is fine for instances of the built-in str type. However, if the function receives an instance of a str subclass, the return type depends on the input: the subclass instance itself when the data is already normalized, and a built-in str otherwise.

import unicodedata

class MyStr(str):
    pass

s1 = unicodedata.normalize('NFKC', MyStr('\u00c5'))   # U+00C5 (already normalized)
s2 = unicodedata.normalize('NFKC', MyStr('A\u030a'))  # U+0041 U+030A (not normalized)

print(type(s1), type(s2))  # <class '__main__.MyStr'> <class 'str'>

In addition, passing instances of user-defined str subclasses can lead to unexpected sharing of modifiable attributes:

import unicodedata

class MyStr(str):
    pass

original = MyStr('ascii string')
original.is_original = True

verified = unicodedata.normalize('NFKC', original)
verified.is_original = False

print(original.is_original)  # False

The solution would be to use the PyUnicode_FromObject() API for early returns in the normalize() function implementation instead of Py_NewRef() to make sure that the function always returns an instance of the built-in str type.
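On interpreters that still return the subclass instance, a caller-side workaround could look roughly like the sketch below. The helper name normalize_to_str is hypothetical and not part of unicodedata; it only approximates in Python what PyUnicode_FromObject() does in C, namely copying a subclass instance into a true str and leaving an exact str alone.

```python
import unicodedata


def normalize_to_str(form: str, text: str) -> str:
    """Normalize text and always return an exact built-in str.

    Hypothetical helper for illustration; not part of unicodedata.
    """
    result = unicodedata.normalize(form, text)
    if type(result) is not str:
        # str.__str__ copies a subclass instance into a plain str and
        # ignores any __str__ override defined on the subclass.
        result = str.__str__(result)
    return result


class MyStr(str):
    pass


original = MyStr('ascii string')
original.is_original = True

verified = normalize_to_str('NFKC', original)

print(type(verified) is str)   # True
print(original.is_original)    # True: the result no longer shares state
```

Because the result is a separate plain str, it has no instance attributes to share with the original object, regardless of whether the input was already normalized.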

CPython versions tested on:

3.11, 3.13

Operating systems tested on:

Windows

Linked PRs


picnixz · 2025-02-02T13:47:32Z

Is it really an issue that we return the object as is? Can't we document that behaviour instead? Some existing code may rely on this feature, so I think we should have a proper deprecation period during which we would warn that, in the future, non-str objects will not be accepted and that their underlying str() representation will be used.

We also need to document that change. I'm not against making a true string for the normalized value but this needs to be communicated properly to the end user.

serhiy-storchaka · 2025-02-02T14:09:15Z

I consider this a bug. An optimization should not affect the type of the result, and returning the original object is merely an optimization.

picnixz · 2025-02-02T14:24:48Z

I guess that makes sense too. OK to categorize it as a bug, then.

picnixz · 2025-02-02T14:25:21Z


Note that @corona10 removed the backport labels on the PR, so maybe we could discuss here whether to backport this as well?

corona10 · 2025-02-02T14:27:08Z

If @serhiy-storchaka considers this a bug, then we can backport it. I was unsure whether this was a behavior change or a bug :)


…built-in str (pythonGH-129570) (cherry picked from commit c359fcd) Co-authored-by: Hizuru <106918920+Hizuru3@users.noreply.github.com> Co-authored-by: Victor Stinner <vstinner@python.org>

malemburg · 2025-02-21T14:01:49Z

Since this is a breaking change, shouldn't a deprecation period be applied? Also note that this behavior has existed for 20+ years.

Also see https://discuss.python.org/t/the-function-unicodedata-normalize-should-always-return-an-instance-of-the-built-in-str-type/79090/10 for the discussion.

vstinner · 2025-02-21T14:17:57Z

Oh, I didn't notice that there was a discussion on discuss.python.org about this change.

Hizuru3 added the type-bug An unexpected behavior, bug, or error label Feb 2, 2025

Hizuru3 mentioned this issue Feb 2, 2025

gh-129569: The function unicodedata.normalize() always returns built-in str #129570

Merged

picnixz added the extension-modules C modules in the Modules dir label Feb 2, 2025

picnixz added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Feb 2, 2025

picnixz added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels Feb 2, 2025

vstinner added a commit that referenced this issue Feb 21, 2025

gh-129569: The function unicodedata.normalize() always returns built-…

c359fcd

…in str (#129570) Co-authored-by: Victor Stinner <vstinner@python.org>

This was referenced Feb 21, 2025

[3.13] gh-129569: The function unicodedata.normalize() always returns built-in str (GH-129570) #130403

Closed

[3.12] gh-129569: The function unicodedata.normalize() always returns built-in str (GH-129570) #130404

Closed

vstinner closed this as completed Feb 21, 2025

sourcery-ai bot mentioned this issue Feb 21, 2025

[pull] main from python:main webfutureiorepo/cpython#31

Merged
