Improve UTF-8 decode speed #126024

methane · 2024-10-27T01:35:18Z

Feature or enhancement

Proposal:

DuckDB and orjson use PyUnicode_4BYTE_DATA() instead of PyUnicode_FromStringAndSize() because they consider Python's UTF-8 decoder slow.

vs DuckDB

DuckDB decoder: https://github.com/duckdb/duckdb/blob/e791508e9bc2eb84bc87eb794074f4893093b743/tools/pythonpkg/src/numpy/array_wrapper.cpp#L215
PyUnicode_FromStringAndSize is 4x faster when decoding a long ASCII string.
DuckDB is 1.5x faster when decoding a short non-ASCII string.

vs orjson

I have not created a simple extension module containing only the orjson UTF-8 decoder because I am not yet familiar with Rust.

When running their benchmark suite, using PyUnicode_FromStringAndSize slows down decoding of twitter.json. twitter.json contains many non-ASCII (almost all UCS2) strings. Most of them are Japanese text of ~140 chars.

Why the Python decoder is slow & possible optimization

The current Python decoder tries to decode the input as ASCII first. When the input is not ASCII, the decoder needs to convert the buffer into latin1/UCS2/UCS4.
When the UTF-8 input is 120 bytes, the decoder allocates space for 120 code points. But if all code points are 3-byte UTF-8 sequences, the resulting string has only 40 code points, so the decoder needs to reallocate the string at the end.
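The 120-versus-40 arithmetic above can be checked from Python. This is only an illustration of the size mismatch, not a view into the decoder itself; the sample character and the variable names are assumptions chosen for the example.

```python
# "あ" (U+3042) encodes to 3 bytes in UTF-8, so 40 of them give 120 bytes.
text = "あ" * 40
data = text.encode("utf-8")

# A decoder that reserves one slot per input byte would reserve this many.
worst_case_slots = len(data)

# The decoded string actually needs only this many code points.
actual_codepoints = len(data.decode("utf-8"))

print(worst_case_slots, actual_codepoints)  # 120 40
```

The reserved size is three times the final size here, which is the gap the reallocation at the end has to close.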

When the text is long, the reallocation cost is relatively small. But for short (~200 byte) strings, reallocation is much slower than decoding.

Can we avoid some reallocation without slowing down ASCII decoding?
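One conceivable answer, sketched here in Python purely for illustration, is to keep the ASCII scan as the fast path and compute the exact code point count only once a non-ASCII byte is seen. The helper names `ascii_prefix_len` and `exact_codepoint_count` are hypothetical, the input is assumed to be valid UTF-8, and this is not necessarily the approach taken in the linked PRs.

```python
def ascii_prefix_len(data: bytes) -> int:
    """Hypothetical helper: number of leading bytes that are plain ASCII."""
    for i, b in enumerate(data):
        if b >= 0x80:
            return i
    return len(data)


def exact_codepoint_count(data: bytes) -> int:
    """Hypothetical helper: code points in *valid* UTF-8 input.

    Pure-ASCII input returns right after the prefix scan. Otherwise every
    byte after the prefix that is not a continuation byte (0b10xxxxxx)
    starts a new code point, so counting those gives the exact size.
    """
    n = ascii_prefix_len(data)
    if n == len(data):
        return n  # fast path: all ASCII, no extra pass needed
    return n + sum(1 for b in data[n:] if (b & 0xC0) != 0x80)


sample = ("ab" + "あ" * 40).encode("utf-8")   # 2 ASCII bytes + 120 bytes
print(len(sample), exact_codepoint_count(sample))  # 122 42
```

With the exact count known before allocation, the string could be created at its final size, at the price of one extra pass over the non-ASCII tail. Whether that pass is cheaper than a reallocation for short inputs would have to be measured.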

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

https://discuss.python.org/t/pep-756-c-api-add-pyunicode-export-and-pyunicode-import-c-functions/63891/53

Linked PRs

gh-126024: unicodeobject: optimize find_first_nonascii #127790

methane added the type-feature A feature request or enhancement label Oct 27, 2024

bedevere-app bot mentioned this issue Oct 27, 2024

gh-126024: optimize UTF-8 decoder for short non-ASCII string #126025

Merged

methane linked a pull request Oct 27, 2024 that will close this issue

gh-126024: optimize UTF-8 decoder for short non-ASCII string #126025

Merged

methane added the performance Performance or resource usage label Oct 27, 2024

Eclips4 added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Oct 27, 2024

methane mentioned this issue Nov 1, 2024

Rework client API and prepared statements, and improve DuckDB -> Pandas conversion duckdb/duckdb#1260

Merged

methane added a commit that referenced this issue Nov 29, 2024

gh-126024: optimize UTF-8 decoder for short non-ASCII string (#126025)

322b486

methane closed this as completed in #126025 Nov 29, 2024

picnixz mentioned this issue Dec 3, 2024

UBSan: misaligned memory loads in Objects/dictobject.c #127563

Closed

bedevere-app bot mentioned this issue Dec 3, 2024

gh-126024: fix UBSan failure in unicodeobject.c:find_first_nonascii #127566

Merged

colesbury pushed a commit that referenced this issue Dec 6, 2024

gh-126024: fix UBSan failure in unicodeobject.c:find_first_nonascii (…

36c6178

…GH-127566)

colesbury added a commit to colesbury/cpython that referenced this issue Dec 6, 2024

Revert "pythongh-126024: fix UBSan failure in `unicodeobject.c:find_f…

4b022a7

…irst_nonascii` (pythonGH-127566)" Some hypothesis tests are failing. This reverts commit 36c6178.

This was referenced Dec 6, 2024

Revert "gh-126024: fix UBSan failure in unicodeobject.c:find_first_nonascii (GH-127566)" #127695

Closed

gh-126024: Use only memcpy for unaligned loads in find_first_nonascii #127769

Closed

encukou pushed a commit that referenced this issue Dec 13, 2024

gh-126024: unicodeobject: optimize find_first_nonascii (GH-127790)

5dd775b

Remove 1 branch.

srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this issue Jan 8, 2025

pythongh-126024: optimize UTF-8 decoder for short non-ASCII string (p…

ea2aa0a

…ython#126025)

srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this issue Jan 8, 2025

pythongh-126024: fix UBSan failure in `unicodeobject.c:find_first_non…

3e0f748

…ascii` (pythonGH-127566)

srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this issue Jan 8, 2025

pythongh-126024: unicodeobject: optimize find_first_nonascii (pythonG…

2be555f

…H-127790) Remove 1 branch.

ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025

pythongh-126024: optimize UTF-8 decoder for short non-ASCII string (p…

19b9628

…ython#126025)
