python: use PyUnicode_FromStringAndSize() #14895

methane · 2024-11-19T09:07:07Z

DuckDB introduced an optimization for its UTF-8 decoder.
It is up to 40% faster for short non-ASCII strings,
but it is 4x slower for long ASCII strings.

Python has optimized code for decoding ASCII, so decoding UTF-8 that contains a long ASCII part is faster with Python than with UTF8Proc::UTF8ToCodepoint.
I am also optimizing how CPython handles the short non-ASCII case.

ref: python/cpython#126025 (comment)
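
The two cases mentioned above can be timed on the CPython side with a small sketch like the one below. It only measures CPython's own UTF-8 decoder through `bytes.decode`, not DuckDB's `UTF8Proc::UTF8ToCodepoint` path, and the helper name `decode_seconds` and the sample inputs are hypothetical, chosen for illustration.

```python
import timeit


def decode_seconds(data: bytes, number: int = 1000) -> float:
    """Return the total time, in seconds, to decode `data` as UTF-8 `number` times."""
    return timeit.timeit(lambda: data.decode("utf-8"), number=number)


# Hypothetical sample inputs, chosen only to illustrate the two cases.
LONG_ASCII = b"a" * 100_000                      # long, ASCII-only payload
SHORT_NON_ASCII = "こんにちは".encode("utf-8")    # short, non-ASCII payload

if __name__ == "__main__":
    for label, data in (
        ("long ASCII", LONG_ASCII),
        ("short non-ASCII", SHORT_NON_ASCII),
    ):
        print(f"{label}: {decode_seconds(data):.6f} s for 1000 decodes")
```

Absolute numbers depend on the machine and the Python version, so this is only useful for comparing runs on the same setup.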

Background

Third-party code that uses the PEP 393 based API depends heavily on current CPython internals, which makes it difficult to evolve those internals (e.g. to use UTF-8 as the internal representation of Unicode).
Using PEP 393 slows down Python implementations other than CPython that use a UTF-8 string representation, e.g. PyPy.
PyUnicode_FromStringAndSize is part of the Stable ABI. Moving from a non-Stable ABI to the Stable ABI makes it possible to build Python modules that work with several Python versions.
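
For readers unfamiliar with the function, here is a minimal sketch of what `PyUnicode_FromStringAndSize` does, called through `ctypes.pythonapi` on CPython. This is for illustration only: DuckDB calls the function from C++, and the wrapper name `utf8_to_str` is hypothetical.

```python
import ctypes

# Bind the C API function exported by the running CPython interpreter.
# Signature: PyObject *PyUnicode_FromStringAndSize(const char *str, Py_ssize_t size)
_from_string_and_size = ctypes.pythonapi.PyUnicode_FromStringAndSize
_from_string_and_size.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t]
_from_string_and_size.restype = ctypes.py_object


def utf8_to_str(data: bytes) -> str:
    """Build a str from a UTF-8 buffer and an explicit byte length.

    The caller passes only bytes and a size; the interpreter chooses the
    internal string representation itself.
    """
    return _from_string_and_size(data, len(data))


if __name__ == "__main__":
    print(utf8_to_str("héllo".encode("utf-8")))
```

Because the caller never touches the string's internal layout, the same call keeps working if an interpreter changes how it stores strings.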

Mytherin · 2024-11-19T16:38:58Z

Thanks! Playing around with this locally, it seems to perform about as well as the current implementation, in which case I definitely agree we should switch back to the stable API.

Top-N: Improve performance with large heaps, and correctly call Reduce (duckdb/duckdb#14900)
python: use PyUnicode_FromStringAndSize() (duckdb/duckdb#14895)
Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>

python: use Unicode_FromStringAndSize

dad2ed7

Mytherin merged commit fbbfc4a into duckdb:main Nov 19, 2024
17 checks passed

methane deleted the python-use-capi branch November 20, 2024 03:44

github-actions bot mentioned this pull request Dec 24, 2024

vendor: Update vendored sources to duckdb/duckdb@8e2c944aac9b39b0482569e1a2ada87e035c2d57 duckdb/duckdb-r#711

Merged
