-
-
Notifications
You must be signed in to change notification settings - Fork 32.1k
gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8 #120248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
PyUnicode_FromFormat() now decodes the format string from UTF-8 with the "replace" error handler, instead of decoding it from ASCII. Remove unused 'consumed' parameter of unicode_decode_utf8_writer().
I chose the "replace" error handler since it's hard to debug decoding errors (UnicodeDecodeError) at the C level in a function creating a string. For example, does the decoding error comes from the format string or an argument? If it's an argument, which one? Well, change my mind :-) I'm open to use the "strict" error handler for the format string and for @serhiy-storchaka @methane: Would you mind to review this change? |
|
|
But why? If you want to include a non-ASCII string, you can pass it as a separate argument with the PyUnicode_Format("%s", "\xe2\x82\xac") |
I prefer "strict" because "hard to notice" is also hard to debug. |
I would like to accept UTF-8 format string to make functions consistent: use UTF-8 basically everywhere. It's also to use the UTF-8 decoder (with |
@serhiy-storchaka @methane: Would you mind to review the updated PR?
I modified the change to use the strict error handler. I also modified the implementation to still raise UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 21: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/vstinner/python/main/Lib/test/test_capi/test_unicode.py", line 391, in test_from_format
PyUnicode_FromFormat(b'invalid format string\xff: %s', b'abc')
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vstinner/python/main/Lib/test/test_capi/test_unicode.py", line 377, in PyUnicode_FromFormat
return _PyUnicode_FromFormat(format, *cargs)
ValueError: PyUnicode_FromFormatV() expects a valid UTF-8-encoded format string, got an invalid UTF-8 string |
Replace PyErr_Format() with PyErr_SetString()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that this change is necessary, but I do not strongly oppose it.
if (unicode_decode_utf8_writer(&writer, f, len, | ||
_Py_ERROR_STRICT, "strict") < 0) { | ||
PyObject *exc = PyErr_GetRaisedException(); | ||
PyErr_SetString(PyExc_ValueError, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why raise ValueError explicitly? If you want a ValueError for compatibility, UnicodeDecode is a subclass of ValueError, so this is a backward compatible change. Other functions which take const char *
do not raise ValueError explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message helps debugging such issue: it points directly to the format string.
Well, my first motivation for this change was to reuse the more efficient ASCII and UTF-8 decoders and Benchmark: diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index b139b46c826..4efef31ef4c 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -3305,6 +3305,22 @@ function_set_warning(PyObject *Py_UNUSED(module), PyObject *Py_UNUSED(args))
Py_RETURN_NONE;
}
+static PyObject *
+bench(PyObject *Py_UNUSED(module), PyObject *args)
+{
+ const char *format;
+ if (!PyArg_ParseTuple(args, "y", &format)) {
+ return NULL;
+ }
+
+
+ PyObject *str = PyUnicode_FromFormat(format, 123);
+ assert(str != NULL);
+ Py_DECREF(str);
+
+ Py_RETURN_NONE;
+}
+
static PyMethodDef TestMethods[] = {
{"set_errno", set_errno, METH_VARARGS},
{"test_config", test_config, METH_NOARGS},
@@ -3446,6 +3462,7 @@ static PyMethodDef TestMethods[] = {
{"check_pyimport_addmodule", check_pyimport_addmodule, METH_VARARGS},
{"test_weakref_capi", test_weakref_capi, METH_NOARGS},
{"function_set_warning", function_set_warning, METH_NOARGS},
+ {"bench", bench, METH_VARARGS},
{NULL, NULL} /* sentinel */
};
Script: import pyperf
import _testcapi
runner = pyperf.Runner()
runner.bench_func('bench 3', _testcapi.bench, b'x' * 3 + b'%i')
runner.bench_func('bench 30', _testcapi.bench, b'x' * 30 + b'%i')
runner.bench_func('bench 100', _testcapi.bench, b'x' * 100 + b'%i') Result:
|
@erlend-aasland @corona10: Do you have an opinion on this change? |
Since switching to UTF-8 seems to be controversial and my main motivation was to optimize the code, I wrote PR gh-120796 which keeps ASCII but optimizes the code using similar code paths: strchr() + ucs1lib_find_max_char(). There is a similar speedup. I close this PR. |
I am not so strongly against this idea, I only asked about the reason. In any case, errors in the format string should not be ignored. |
Well, I'm not convinced myself anymore, so I prefer to abandon this PR. |
PyUnicode_FromFormat() now decodes the format string from UTF-8 with the "replace" error handler, instead of decoding it from ASCII.
Remove unused 'consumed' parameter of unicode_decode_utf8_writer().
📚 Documentation preview 📚: https://cpython-previews--120248.org.readthedocs.build/