gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8 #120248

vstinner · 2024-06-07T20:40:05Z

PyUnicode_FromFormat() now decodes the format string from UTF-8 with the "replace" error handler, instead of decoding it from ASCII.

Remove unused 'consumed' parameter of unicode_decode_utf8_writer().

Issue: [C API] Add an efficient public PyUnicodeWriter API #119182

📚 Documentation preview 📚: https://cpython-previews--120248.org.readthedocs.build/

vstinner · 2024-06-07T20:42:34Z

I chose the "replace" error handler because decoding errors (UnicodeDecodeError) are hard to debug at the C level in a function that creates a string. For example, does the decoding error come from the format string or from an argument? If it's an argument, which one?

Well, change my mind :-) I'm open to using the "strict" error handler for the format string and for %s arguments.
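
The trade-off between the two handlers can be seen from Python, where bytes.decode() accepts the same error handler names. A minimal sketch, assuming a made-up input with an invalid 0xff byte (this only illustrates the handlers, not the C code path):

```python
# Illustrative input: ASCII text followed by 0xff, which is never valid in UTF-8.
data = b"invalid format string\xff: %s"

# "replace" silently substitutes U+FFFD for the undecodable byte.
replaced = data.decode("utf-8", errors="replace")
print(replaced)

# "strict" raises UnicodeDecodeError and reports the offending position.
try:
    data.decode("utf-8", errors="strict")
except UnicodeDecodeError as exc:
    strict_error = exc  # keep a reference; "exc" is cleared after the block
    print(exc.start, exc.reason)
```

With "replace" the caller gets a string and no signal that anything went wrong; with "strict" the error carries the byte offset but not which C argument it came from.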

@serhiy-storchaka @methane: Would you mind reviewing this change?

vstinner · 2024-06-07T20:47:08Z

> Well, change my mind :-) I'm open to using the "strict" error handler for the format string and for %s arguments.

PyUnicode_FromFormat() is already strict about everything else: a width that is too big, a %c argument that is out of the Unicode range, etc.

vstinner · 2024-06-07T20:55:12Z

PyUnicode_FromFormat() is used by PyErr_Format() and PyErr_FormatUnraisable(), and will be used by the upcoming PyUnicodeWriter_Format().

serhiy-storchaka · 2024-06-10T05:40:30Z

But why? If you want to include a non-ASCII string, you can pass it as a separate argument with the %s format unit.

PyUnicode_FromFormat("%s", "\xe2\x82\xac")
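
For reference, the three bytes passed as the argument in that call are the UTF-8 encoding of U+20AC (the euro sign). A quick check in Python, used here only to illustrate the byte sequence:

```python
# The three bytes from the C string literal "\xe2\x82\xac".
euro_bytes = b"\xe2\x82\xac"

# Decoding them as UTF-8 yields a single code point.
euro = euro_bytes.decode("utf-8")
print(euro, hex(ord(euro)))
```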

methane · 2024-06-10T07:07:15Z

> I chose the "replace" error handler because decoding errors (UnicodeDecodeError) are hard to debug at the C level in a function that creates a string. For example, does the decoding error come from the format string or from an argument? If it's an argument, which one?

> Well, change my mind :-) I'm open to using the "strict" error handler for the format string and for %s arguments.

I prefer "strict" because "hard to notice" is also hard to debug.

vstinner · 2024-06-10T08:25:47Z

> But why? If you want to include a non-ASCII string, you can pass it as a separate argument with the %s format unit.

I would like to accept UTF-8 format strings to make the functions consistent: use UTF-8 basically everywhere. It also makes it possible to use the UTF-8 decoder (with strchr('%') to get the string length) instead of manually parsing the string to check for non-ASCII characters.
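
A rough Python sketch of that idea, not the actual C implementation: find the next '%' (the role strchr() plays in C), decode the literal run before it in a single call, and leave the format specifier to separate handling. The function name split_literal_runs and the simplified "'%' plus one character" specifier are assumptions made for illustration.

```python
def split_literal_runs(fmt: bytes) -> list[str]:
    """Decode the literal text between '%' specifiers, one run at a time.

    Hypothetical helper: real specifiers can carry flags, width and
    precision; here each one is assumed to be '%' plus one character.
    """
    runs = []
    pos = 0
    while pos < len(fmt):
        percent = fmt.find(b"%", pos)  # plays the role of strchr(f, '%')
        end = len(fmt) if percent == -1 else percent
        if end > pos:
            # Decode the whole literal run with the UTF-8 decoder in one call.
            runs.append(fmt[pos:end].decode("utf-8"))
        if percent == -1:
            break
        pos = percent + 2  # skip '%' and its conversion character
    return runs


print(split_literal_runs("prix: %i \u20ac, total %s".encode("utf-8")))
```

The point of the design is that the decoder, rather than a hand-written byte loop, decides whether each run is valid.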

vstinner · 2024-06-10T09:08:52Z

@methane:

> I prefer "strict" because "hard to notice" is also hard to debug.

Ok, I created a dedicated PR for that: #120307.

vstinner · 2024-06-11T10:52:36Z

@serhiy-storchaka @methane: Would you mind reviewing the updated PR?

@methane:

> I prefer "strict" because "hard to notice" is also hard to debug.

I modified the change to use the strict error handler.

I also modified the implementation to still raise ValueError if the format string is not a valid UTF-8 string, but chain the exception to the internal UnicodeDecodeError which contains details. Example:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 21: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vstinner/python/main/Lib/test/test_capi/test_unicode.py", line 391, in test_from_format
    PyUnicode_FromFormat(b'invalid format string\xff: %s', b'abc')
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vstinner/python/main/Lib/test/test_capi/test_unicode.py", line 377, in PyUnicode_FromFormat
    return _PyUnicode_FromFormat(format, *cargs)
ValueError: PyUnicode_FromFormatV() expects a valid UTF-8-encoded format string, got an invalid UTF-8 string
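
The same chaining pattern can be mimicked in pure Python. A minimal sketch, assuming a hypothetical decode_format() helper that stands in for the C code (it is not part of any real API): raising ValueError inside the except block attaches the UnicodeDecodeError as __context__, which is what produces a "During handling of the above exception" traceback like the one above.

```python
def decode_format(fmt: bytes) -> str:
    """Hypothetical stand-in for the C-level format string decoding."""
    try:
        return fmt.decode("utf-8")
    except UnicodeDecodeError:
        # Raising inside the handler implicitly chains the original error.
        raise ValueError("expected a valid UTF-8-encoded format string")


try:
    decode_format(b"invalid format string\xff: %s")
except ValueError as exc:
    caught = exc  # keep a reference; "exc" is cleared after the block
    print(type(exc.__context__).__name__, exc.__context__.start)
```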

Replace PyErr_Format() with PyErr_SetString()

serhiy-storchaka

I do not think that this change is necessary, but I do not strongly oppose it.

Doc/c-api/unicode.rst

serhiy-storchaka · 2024-06-11T11:28:22Z

Objects/unicodeobject.c

+            if (unicode_decode_utf8_writer(&writer, f, len,
+                                           _Py_ERROR_STRICT, "strict") < 0) {
+                PyObject *exc = PyErr_GetRaisedException();
+                PyErr_SetString(PyExc_ValueError,


Why raise ValueError explicitly? If you want a ValueError for compatibility, UnicodeDecodeError is a subclass of ValueError, so this is a backward-compatible change. Other functions which take const char * do not raise ValueError explicitly.
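
The subclass relationship mentioned here is easy to confirm from Python; the snippet below only demonstrates the exception hierarchy, not the C code under review:

```python
# UnicodeDecodeError derives from UnicodeError, which derives from ValueError.
print(UnicodeDecodeError.__mro__)

try:
    b"\xff".decode("utf-8")
except ValueError as exc:  # a ValueError handler also catches UnicodeDecodeError
    caught = exc  # keep a reference; "exc" is cleared after the block
    print(type(exc).__name__)
```

Code that already catches ValueError would therefore keep working if the UnicodeDecodeError were propagated unchanged.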

The error message helps debug such issues: it points directly to the format string.

Lib/test/test_capi/test_unicode.py

vstinner · 2024-06-11T12:04:40Z

@serhiy-storchaka:

> I do not think that this change is necessary, but I do not strongly oppose it.

Well, my first motivation for this change was to reuse the more efficient ASCII and UTF-8 decoders and strchr(). It makes PyUnicode_FromFormat() between 1.08x (format string of 30 characters) and 1.21x faster (format string of 100 characters). The speedup should be even better with longer format strings.

Benchmark:

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index b139b46c826..4efef31ef4c 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -3305,6 +3305,22 @@ function_set_warning(PyObject *Py_UNUSED(module), PyObject *Py_UNUSED(args))
     Py_RETURN_NONE;
 }
 
+static PyObject *
+bench(PyObject *Py_UNUSED(module), PyObject *args)
+{
+    const char *format;
+    if (!PyArg_ParseTuple(args, "y", &format)) {
+        return NULL;
+    }
+
+
+    PyObject *str = PyUnicode_FromFormat(format, 123);
+    assert(str != NULL);
+    Py_DECREF(str);
+
+    Py_RETURN_NONE;
+}
+
 static PyMethodDef TestMethods[] = {
     {"set_errno",               set_errno,                       METH_VARARGS},
     {"test_config",             test_config,                     METH_NOARGS},
@@ -3446,6 +3462,7 @@ static PyMethodDef TestMethods[] = {
     {"check_pyimport_addmodule", check_pyimport_addmodule, METH_VARARGS},
     {"test_weakref_capi", test_weakref_capi, METH_NOARGS},
     {"function_set_warning", function_set_warning, METH_NOARGS},
+    {"bench", bench, METH_VARARGS},
     {NULL, NULL} /* sentinel */
 };

Script:

import pyperf
import _testcapi

runner = pyperf.Runner()
runner.bench_func('bench 3', _testcapi.bench, b'x' * 3 + b'%i')
runner.bench_func('bench 30', _testcapi.bench, b'x' * 30 + b'%i')
runner.bench_func('bench 100', _testcapi.bench, b'x' * 100 + b'%i')

Result:

+----------------+--------+----------------------+
| Benchmark      | ref    | change               |
+================+========+======================+
| bench 30       | 215 ns | 200 ns: 1.08x faster |
+----------------+--------+----------------------+
| bench 100      | 252 ns | 208 ns: 1.21x faster |
+----------------+--------+----------------------+
| Geometric mean | (ref)  | 1.09x faster         |
+----------------+--------+----------------------+

Benchmark hidden because not significant (1): bench 3
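
As a sanity check on the table, the speedup factors can be recomputed from the rounded timings. This is a small sketch using only the numbers shown above; pyperf works from unrounded measurements, so the last digit can differ slightly, and the geometric mean row presumably also accounts for the hidden "bench 3" result.

```python
# Timings in nanoseconds copied from the result table: (ref, change).
timings = {"bench 30": (215, 200), "bench 100": (252, 208)}

# Speedup is the reference time divided by the time with the change applied.
speedups = {name: ref / change for name, (ref, change) in timings.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.3f}x")
```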

vstinner · 2024-06-11T12:10:57Z

@erlend-aasland @corona10: Do you have an opinion on this change?

Objects/unicodeobject.c

vstinner · 2024-06-20T12:51:33Z

Since switching to UTF-8 seems to be controversial and my main motivation was to optimize the code, I wrote PR gh-120796, which keeps ASCII but optimizes the code using similar code paths: strchr() + ucs1lib_find_max_char(). It gives a similar speedup. I'm closing this PR.

serhiy-storchaka · 2024-06-20T13:28:58Z

I am not so strongly against this idea, I only asked about the reason. In any case, errors in the format string should not be ignored.

vstinner · 2024-06-20T19:04:03Z

Well, I'm not convinced myself anymore, so I prefer to abandon this PR.

gh-119182: Decode PyUnicode_FromFormat() format from UTF-8

3d5bca4

PyUnicode_FromFormat() now decodes the format string from UTF-8 with the "replace" error handler, instead of decoding it from ASCII. Remove unused 'consumed' parameter of unicode_decode_utf8_writer().

bedevere-app bot mentioned this pull request Jun 7, 2024

[C API] Add an efficient public PyUnicodeWriter API #119182

Closed

bedevere-app bot added the awaiting core review label Jun 7, 2024

Update test_exceptions

6a87915

vstinner mentioned this pull request Jun 10, 2024

gh-119182: Use strict error handler in PyUnicode_FromFormat() #120307

Closed

Use strict error handler

e830944

Fix error handling

242e6cb

Replace PyErr_Format() with PyErr_SetString()

serhiy-storchaka reviewed Jun 11, 2024

vstinner added 2 commits June 11, 2024 14:08

Add tests on truncated UTF-8 format strings

d04269f

Don't mention the strict error handler

94da5e7

vstinner changed the title ~~gh-119182: Decode PyUnicode_FromFormat() format from UTF-8~~ gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8 Jun 11, 2024

serhiy-storchaka reviewed Jun 11, 2024

Revert consumed parameter

89fd69a

vstinner closed this Jun 20, 2024

vstinner deleted the format_utf8 branch June 20, 2024 12:51

vstinner mentioned this pull request Jun 20, 2024

gh-119182: Optimize PyUnicode_FromFormat() #120796

Merged
