8000 gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8 by vstinner · Pull Request #120248 · python/cpython · GitHub
[go: up one dir, main page]

Skip to content

gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8 #120248

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

vstinner
Copy link
Member
@vstinner vstinner commented Jun 7, 2024

PyUnicode_FromFormat() now decodes the format string from UTF-8 with the "replace" error handler, instead of decoding it from ASCII.

Remove unused 'consumed' parameter of unicode_decode_utf8_writer().


📚 Documentation preview 📚: https://cpython-previews--120248.org.readthedocs.build/

PyUnicode_FromFormat() now decodes the format string from UTF-8 with
the "replace" error handler, instead of decoding it from ASCII.

Remove unused 'consumed' parameter of unicode_decode_utf8_writer().
@vstinner
Copy link
Member Author
vstinner commented Jun 7, 2024

I chose the "replace" error handler since it's hard to debug decoding errors (UnicodeDecodeError) at the C level in a function creating a string. For example, does the decoding error comes from the format string or an argument? If it's an argument, which one?

Well, change my mind :-) I'm open to use the "strict" error handler for the format string and for %s arguments.

@serhiy-storchaka @methane: Would you mind to review this change?

@vstinner
Copy link
Member Author
vstinner commented Jun 7, 2024

Well, change my mind :-) I'm open to use the "strict" error handler for the format string and for %s arguments.

PyUnicode_FromFormat() is strict for anything else:

  • width is too big
  • %c argument is out of the Unicode range
  • etc.

@vstinner
Copy link
Member Author
vstinner commented Jun 7, 2024

PyUnicode_FromFormat() is used by PyErr_Format(), PyErr_FormatUnraisable(), and will be used by the incoming PyUnicodeWriter_Format().

@serhiy-storchaka
Copy link
Member

But why? If you want to include a non-ASCII string, you can pass it as a separate argument with the %s format unit.

PyUnicode_Format("%s", "\xe2\x82\xac")

@methane
Copy link
Member
methane commented Jun 10, 2024

I chose the "replace" error handler since it's hard to debug decoding errors (UnicodeDecodeError) at the C level in a function creating a string. For example, does the decoding error comes from the format string or an argument? If it's an argument, which one?

Well, change my mind :-) I'm open to use the "strict" error handler for the format string and for %s arguments.

I prefer "strict" because "hard to notice" is also hard to debug.

@vstinner
Copy link
Member Author

But why? If you want to include a non-ASCII string, you can pass it as a separate argument with the %s format unit.

I would like to accept UTF-8 format string to make functions consistent: use UTF-8 basically everywhere. It's also to use the UTF-8 decoder (with strchr('%') to get the string length) instead of parsing manually the string for check for non-ASCII characters.

@vstinner
Copy link
Member Author
8000

@methane:

I prefer "strict" because "hard to notice" is also hard to debug.

Ok, I created a dedicated PR for that: #120307.

@vstinner
Copy link
Member Author

@serhiy-storchaka @methane: Would you mind to review the updated PR?

@methane:

I prefer "strict" because "hard to notice" is also hard to debug.

I modified the change to use the strict error handler.

I also modified the implementation to still raise ValueError if the format string is not a valid UTF-8 string, but chain the exception to the internal UnicodeDecodeError which contains details. Example:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 21: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vstinner/python/main/Lib/test/test_capi/test_unicode.py", line 391, in test_from_format
    PyUnicode_FromFormat(b'invalid format string\xff: %s', b'abc')
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vstinner/python/main/Lib/test/test_capi/test_unicode.py", line 377, in PyUnicode_FromFormat
    return _PyUnicode_FromFormat(format, *cargs)
ValueError: PyUnicode_FromFormatV() expects a valid UTF-8-encoded format string, got an invalid UTF-8 string

Replace PyErr_Format() with PyErr_SetString()
Copy link
Member
@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that this change is necessary, but I do not strongly oppose it.

if (unicode_decode_utf8_writer(&writer, f, len,
_Py_ERROR_STRICT, "strict") < 0) {
PyObject *exc = PyErr_GetRaisedException();
PyErr_SetString(PyExc_ValueError,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why raise ValueError explicitly? If you want a ValueError for compatibility, UnicodeDecode is a subclass of ValueError, so this is a backward compatible change. Other functions which take const char * do not raise ValueError explicitly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message helps debugging such issue: it points directly to the format string.

@vstinner
Copy link
Member Author

@serhiy-storchaka:

I do not think that this change is necessary, but I do not strongly oppose it.

Well, my first motivation for this change was to reuse the more efficient ASCII and UTF-8 decoders and strchr(). It makes PyUnicode_FromFormat() between 1.08x (format string of 30 characters) and 1.21x faster (format string of 100 characters). The speedup should be even better with longer format string.

Benchmark:

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index b139b46c826..4efef31ef4c 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -3305,6 +3305,22 @@ function_set_warning(PyObject *Py_UNUSED(module), PyObject *Py_UNUSED(args))
     Py_RETURN_NONE;
 }
 
+static PyObject *
+bench(PyObject *Py_UNUSED(module), PyObject *args)
+{
+    const char *format;
+    if (!PyArg_ParseTuple(args, "y", &format)) {
+        return NULL;
+    }
+
+
+    PyObject *str = PyUnicode_FromFormat(format, 123);
+    assert(str != NULL);
+    Py_DECREF(str);
+
+    Py_RETURN_NONE;
+}
+
 static PyMethodDef TestMethods[] = {
     {"set_errno",               set_errno,                       METH_VARARGS},
     {"test_config",             test_config,                     METH_NOARGS},
@@ -3446,6 +3462,7 @@ static PyMethodDef TestMethods[] = {
     {"check_pyimport_addmodule", check_pyimport_addmodule, METH_VARARGS},
     {"test_weakref_capi", test_weakref_capi, METH_NOARGS},
     {"function_set_warning", function_set_warning, METH_NOARGS},
+    {"bench", bench, METH_VARARGS},
     {NULL, NULL} /* sentinel */
 };
 

Script:

import pyperf
import _testcapi

runner = pyperf.Runner()
runner.bench_func('bench 3', _testcapi.bench, b'x' * 3 + b'%i')
runner.bench_func('bench 30', _testcapi.bench, b'x' * 30 + b'%i')
runner.bench_func('bench 100', _testcapi.bench, b'x' * 100 + b'%i')

Result:

+----------------+--------+----------------------+
| Benchmark      | ref    | change               |
+================+========+======================+
| bench 30       | 215 ns | 200 ns: 1.08x faster |
+----------------+--------+----------------------+
| bench 100      | 252 ns | 208 ns: 1.21x faster |
+----------------+--------+----------------------+
| Geometric mean | (ref)  | 1.09x faster         |
+----------------+--------+----------------------+

Benchmark hidden because not significant (1): bench 3

@vstinner
Copy link
Member Author

@erlend-aasland @corona10: Do you have an opinion on this change?

@vstinner vstinner changed the title gh-119182: Decode PyUnicode_FromFormat() format from UTF-8 gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8 Jun 11, 2024
@vstinner
Copy link
Member Author

Since switching to UTF-8 seems to be controversial and my main motivation was to optimize the code, I wrote PR gh-120796 which keeps ASCII but optimizes the code using similar code paths: strchr() + ucs1lib_find_max_char(). There is a similar speedup. I close this PR.

@vstinner vstinner closed this Jun 20, 2024
@vstinner vstinner deleted the format_utf8 branch June 20, 2024 12:51
@serhiy-storchaka
Copy link
Member

I am not so strongly against this idea, I only asked about the reason. In any case, errors in the format string should not be ignored.

@vstinner
Copy link
Member Author
vstinner commented Jun 20, 2024

Well, I'm not convinced myself anymore, so I prefer to abandon this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0