gh-110235: Raise `TypeError` for duplicate/unknown fields in PyStructSequence constructor #110258

XuehaiPan · 2023-10-03T06:17:10Z

In this PR, we iterate over dict keys and check if there are duplicate fields provided by sequence or there are any unexpected field names.

Issue: PyStructSequence constructor ignores unknown field names #110235

…nstructor

Objects/structseq.c

serhiy-storchaka

The code in the constructor is too complicated, and it slow down normal case, because it tests fields already passed in the sequence argument.

Simply count the number of fields found in the dict argument. If it less than the size of the dict, then raise a TypeError.

I think that it is not needed to make the error message more specific, test whether some of keys match the fields already specified in sequence argument or are not field names at all -- it only adds a lot of complicated code for a little gain. If users came with requests for more details, then we may provide them, but currently it is a waste of efforts.

Lib/test/test_structseq.py

XuehaiPan · 2023-10-03T11:41:54Z

I updated the implementation to only count the number of unexpected fields rather than list them in the error message.

serhiy-storchaka

It is still too complicated. No need to check for multiple values for field.

XuehaiPan · 2023-10-04T08:38:06Z

It is still too complicated. No need to check for multiple values for field.

The signature for structseq.__new__() is:

SomeStructSeq(sequence: Iterable[Any], dict: Dict[str, Any]) -> SomeStructSeq

In this PR, I raise an error for values that come from both sequence and dict. If we don't do this, in the result tuple, the value from dict will override the value from sequence.

IMO, it should raise an error as we did for namedtuple.__new__(*args, **kwargs). If a value is provided in 8000 both args and kwargs, it will raise a TypeError.

mport collections
MyTuple = collections.namedtuple('MyTuple', ['a', 'b', 'c', 'd'])

MyTuple(1, 2, 3, 4, a=5, b=6)  # TypeError: MyTuple.__new__() got multiple values for argument 'a'

But namedtuple and structseq are slightly different. In structseq, the arguments sequence and dict do not have * and ** unpacking. Any thoughts?

def func(a, **kwargs):
    pass

func(1, a=2)  # TypeError: func() got multiple values for argument 'a'

serhiy-storchaka · 2023-10-04T08:50:54Z

I meant that it will automatically raise TypeError if dict contains extra keys. No need to check it specially, and no need to generate more specific error message. Simply:

for i in range(len, max_len):
    name = fields[i].name
    if name in dict:
        items[i] = dict[name]
        count++
    else:
        items[i] = None
    i++
if count < len(dict):
    raise TypeError("simple generic error message")

In C it will be just few additional lines of code.

XuehaiPan · 2023-10-04T09:19:22Z

I meant that it will automatically raise TypeError if dict contains extra keys.

Thanks for the code snippet. My point is #110258 (comment):

The signature for structseq.__new__() is:
SomeStructSeq(sequence: Iterable[Any], dict: Dict[str, Any]) -> SomeStructSeq
In this PR, I raise an error for values that come from both sequence and dict. If we don't do this, in the result tuple, the value from dict will override the value from sequence.

For example:

time.struct_time
fields: [
    "tm_year",    #  0, visiable
    "tm_mon",     #  1, visiable
    "tm_mday",    #  2, visiable
    "tm_hour",    #  3, visiable
    "tm_min",     #  4, visiable
    "tm_sec",     #  5, visiable
    "tm_wday",    #  6, visiable
    "tm_yday",    #  7, visiable
    "tm_isdst",   #  8, visiable
    "tm_zone",    #  9, invisible
    "tm_gmtoff",  # 10, invisible
]
n_sequence_fields: 9
n_unnamed_fields: 0
n_fields: 11

In the code snippet (#110258 (comment)), it will raise an error if there is a visible field name or an unknown name in dict (already part of this PR). But it will silently override the value for invisible fields when the value came from both sequence and dict.

time.struct_time(seqeunce=list(range(10)), dict={'tm_zone': -1})
# tm_zone is firstly set from sequence[9], then it will be override by dict['tm_zone'] without raising an error

The implementation of this PR is roughly equivalent to:

  len = sequence.__len__()
  min_len = n_sequence_fields
  max_len = n_fields
  
  for i, value in enumerate(sequence):
      items[i] = value  # for i >= min_len, it will set invisible fields
  
  count = 0
- for i in range(len, max_len):
+ for i in range(min_len, max_len):
      name = fields[i].name
+     if i < len:
+         if name in dict:
+             raise TypeError("duplicate invisible field values")
+         continue
      if name in dict:
          items[i] = dict[name]
          count += 1
      else:
          items[i] = None
  if count < dict.__len__():
      raise TypeError("simple generic error message")

serhiy-storchaka · 2023-10-04T10:05:00Z

The following code:

diff --git a/Objects/structseq.c b/Objects/structseq.c
index 0ca622edc2..08af71a88d 100644
--- a/Objects/structseq.c
+++ b/Objects/structseq.c
@@ -216,6 +216,7 @@ structseq_new_impl(PyTypeObject *type, PyObject *arg, PyObject *dict)
         res->ob_item[i] = Py_NewRef(v);
     }
     Py_DECREF(arg);
+    Py_ssize_t count = 0;
     for (; i < max_len; ++i) {
         PyObject *ob = NULL;
         if (dict != NULL) {
@@ -228,8 +229,18 @@ structseq_new_impl(PyTypeObject *type, PyObject *arg, PyObject *dict)
         if (ob == NULL) {
             ob = Py_NewRef(Py_None);
         }
+        else {
+            count++;
+        }
         res->ob_item[i] = ob;
     }
+    if (dict != NULL && PyDict_GET_SIZE(dict) > count) {
+        PyErr_Format(PyExc_TypeError,
+                     "%.500s() got unexpected field name(s).",
+                     type->tp_name);
+        Py_DECREF(res);
+        return NULL;
+    }
 
     _PyObject_GC_TRACK(res);
     return (PyObject*) res;

passes your tests (ignoring error message). It raises error for your example

>>> time.struct_time(sequence=list(range(10)), dict={'tm_zone': -1})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: time.struct_time() got unexpected field name(s).

Look, your loop can be split on two, range(min_len, len) and range(len, max_len). In the latter loop i < len is always false, so this can be removed. In the former loop it is always true, and the code in this "if" block is the only one that executed in the loop. It raises a TypeError if name is in dict. But if any name is found in dict in the former loop, it cannot be counted in the later loop and you get a TypeError in any case.

XuehaiPan · 2023-10-04T13:33:56Z

passes your tests (ignoring error message). It raises error for your example

I check this and it works. I have updated my implementation to raise a single generic error.

XuehaiPan · 2023-10-04T13:36:41Z

Objects/structseq.c

-        PyObject *ob = NULL;
-        if (dict != NULL) {
-            const char *name = type->tp_members[i-n_unnamed_fields].name;
+    if (dict != NULL && PyDict_GET_SIZE(dict) > 0) {


I add this guard check before the for-loop to avoid unnecessary dictionary lookup for empty dict.

XuehaiPan · 2023-10-04T15:11:12Z

Thanks for the code review and suggestions!

…ructSequence constructor (pythonGH-110258)

Raise TypeError for duplicate/unknown fields in PyStructSequence co…

da93f2f

…nstructor

bedevere-app bot mentioned this pull request Oct 3, 2023

PyStructSequence constructor ignores unknown field names #110235

Closed

📜🤖 Added by blurb_it.

9632370

XuehaiPan commented Oct 3, 2023

View reviewed changes

Objects/structseq.c Outdated Show resolved Hide resolved

Add test for duplicate/unknown fields in PyStructSequence constructor

464cbaf

XuehaiPan marked this pull request as ready for review October 3, 2023 06:34

bedevere-app bot added the awaiting review label Oct 3, 2023

XuehaiPan added 6 commits October 3, 2023 14:38

Add test for duplicate/unknown fields in PyStructSequence constructor

11a8435

Handle case for PyStructSequence with unnamed fields

9fc1501

Handle case for PyStructSequence with unnamed fields

981e403

Add more test cases

67954db

Merge branch 'main' into structseq-new

06c5d44

Add more test cases

9c3f7d2

serhiy-storchaka reviewed Oct 3, 2023

View reviewed changes

Lib/test/test_structseq.py Outdated Show resolved Hide resolved

Lib/test/test_structseq.py Outdated Show resolved Hide resolved

XuehaiPan added 4 commits October 3, 2023 18:18

Move tests to better places

c927e1e

Revert test changes for structseq.__reduce__

c173451

Refactor and simplify error handling

704ccaa

Refactor and simplify error handling

8f216b8

XuehaiPan added 3 commits October 3, 2023 20:06

Also check the input dict size

04a19cc

Merge branch 'main' into structseq-new

782fa4a

Merge branch 'main' into structseq-new

297459e

serhiy-storchaka reviewed Oct 4, 2023

View reviewed changes

XuehaiPan added 3 commits October 4, 2023 21:29

Split for-loop for better code readability

bd2782c

Remove unnecessary error checking

04a554d

Merge branch 'main' into structseq-new

41e4d0b

XuehaiPan commented Oct 4, 2023

View reviewed changes

serhiy-storchaka approved these changes Oct 4, 2023

View reviewed changes

bedevere-app bot added awaiting core review and removed awaiting review labels Oct 4, 2023

serhiy-storchaka merged commit 9561648 into python:main Oct 4, 2023

bedevere-app bot removed the awaiting core review label Oct 4, 2023

XuehaiPan deleted the structseq-new branch October 4, 2023 20:43

Glyphack pushed a commit to Glyphack/cpython that referenced this pull request Sep 2, 2024

pythongh-110235: Raise TypeError for duplicate/unknown fields in PySt…

609f3dd

…ructSequence constructor (pythonGH-110258)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

gh-110235: Raise `TypeError` for duplicate/unknown fields in PyStructSequence constructor #110258

gh-110235: Raise `TypeError` for duplicate/unknown fields in PyStructSequence constructor #110258

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gh-110235: Raise TypeError for duplicate/unknown fields in PyStructSequence constructor #110258

gh-110235: Raise TypeError for duplicate/unknown fields in PyStructSequence constructor #110258

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

gh-110235: Raise `TypeError` for duplicate/unknown fields in PyStructSequence constructor #110258

gh-110235: Raise `TypeError` for duplicate/unknown fields in PyStructSequence constructor #110258