Improve UTF-8 decode speed #126024
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs)
performance (Performance or resource usage)
type-feature (A feature request or enhancement)
Feature or enhancement
Proposal:
DuckDB and orjson don't use PyUnicode_FromStringAndSize(); they use PyUnicode_4BYTE_DATA() instead, because they consider Python's UTF-8 decoder slow.
vs DuckDB
vs orjson
I didn't create a simple extension module containing only the orjson UTF-8 decoder because I am not familiar with Rust yet.
When running their benchmark suite, using PyUnicode_FromStringAndSize() slows down decoding twitter.json. twitter.json contains many non-ASCII (mostly UCS2) strings; most of them are Japanese text of roughly 140 characters.
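For context, the straightforward path is a single call to PyUnicode_FromStringAndSize(buf, size), which decodes the UTF-8 and may have to reallocate its result while doing so. Below is a minimal sketch, in the spirit of the bypass DuckDB and orjson use, of a two-pass conversion that allocates the str object exactly once. It is illustrative only (not their actual code); decode_one and decode_valid_utf8 are hypothetical names, and it assumes the input is already known to be valid UTF-8 (as a JSON parser would have verified), so all error handling is omitted.

```c
#include <Python.h>

/* Decode one code point from valid UTF-8 and advance the pointer.
 * No validation: the caller guarantees well-formed input. */
static Py_UCS4
decode_one(const unsigned char **pp)
{
    const unsigned char *p = *pp;
    unsigned char c = *p;
    Py_UCS4 ch;
    int n;
    if (c < 0x80)      { ch = c;        n = 1; }
    else if (c < 0xE0) { ch = c & 0x1F; n = 2; }
    else if (c < 0xF0) { ch = c & 0x0F; n = 3; }
    else               { ch = c & 0x07; n = 4; }
    for (int i = 1; i < n; i++)
        ch = (ch << 6) | (p[i] & 0x3F);
    *pp = p + n;
    return ch;
}

/* Build a str from valid UTF-8 with a single, exact allocation. */
static PyObject *
decode_valid_utf8(const char *buf, Py_ssize_t size)
{
    const unsigned char *p = (const unsigned char *)buf;
    const unsigned char *end = p + size;
    Py_ssize_t nchars = 0;
    Py_UCS4 maxchar = 0;

    /* Pass 1: count code points and find the largest one, so the
     * result can be allocated once with the right width (kind). */
    while (p < end) {
        Py_UCS4 ch = decode_one(&p);
        if (ch > maxchar)
            maxchar = ch;
        nchars++;
    }

    PyObject *str = PyUnicode_New(nchars, maxchar);
    if (str == NULL)
        return NULL;

    /* Pass 2: decode again, writing straight into the new object. */
    int kind = PyUnicode_KIND(str);
    void *data = PyUnicode_DATA(str);
    p = (const unsigned char *)buf;
    for (Py_ssize_t i = 0; i < nchars; i++)
        PyUnicode_WRITE(kind, data, i, decode_one(&p));
    return str;
}
```

The point of the two passes is only that knowing the length and the maximum code point up front allows one exact PyUnicode_New() call, trading a second scan of the input for the reallocation an incremental writer might otherwise do.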
Why the Python decoder is slow & a possible optimization
When the text is long, the reallocation cost is relatively small. But for short (~200 byte) strings, reallocation is much slower than the decoding itself.
Can we avoid some of the reallocation without slowing down ASCII decoding?
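The linked PRs add a helper called find_first_nonascii to unicodeobject.c. One common way to implement such a scan is to test one machine word at a time for a byte with the high bit set; the rough sketch below only illustrates that technique and is not the actual CPython code. The idea is that the pure-ASCII prefix can be handled on a cheap fast path before any real decoding (and any widening or resizing of the result) has to happen.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the index of the first byte >= 0x80, or `size` if all bytes
 * are ASCII. Scans one machine word at a time; the 0x80...80 mask
 * truncates to the right value on 32-bit platforms. */
static size_t
find_first_nonascii(const unsigned char *s, size_t size)
{
    size_t i = 0;
    const size_t mask = (size_t)0x8080808080808080ULL;

    while (i + sizeof(size_t) <= size) {
        size_t word;
        memcpy(&word, s + i, sizeof(word));  /* alignment-safe load */
        if (word & mask)
            break;                           /* some byte has its high bit set */
        i += sizeof(size_t);
    }
    /* Finish byte by byte: the tail, or the word that stopped the loop. */
    while (i < size && s[i] < 0x80)
        i++;
    return i;
}
```

Real implementations also have to weigh alignment and the cost of the scalar tail; the sketch favours brevity over those details.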
Has this already been discussed elsewhere?
I have already discussed this feature proposal on Discourse
Links to previous discussion of this feature:
https://discuss.python.org/t/pep-756-c-api-add-pyunicode-export-and-pyunicode-import-c-functions/63891/53
Linked PRs
unicodeobject.c:find_first_nonascii #127566
Revert "unicodeobject.c:find_first_nonascii (GH-127566)" #127695