dtype richcompare doesn't respect protocol #5345

pitrou · 2014-12-03T14:58:56Z

When a type's __eq__ (or tp_richcompare in C) doesn't understand the other type, it should return NotImplemented so that the other type's __eq__ has a chance to be called. Unfortunately, NumPy's ArrayDesrc doesn't respect that convention and raises TypeError instead, making comparisons fragile:

>>> import numpy as np
>>> class A(object):
...     def __eq__(self, other):
...         return True
... 
>>> dt = np.dtype('i8')
>>> A() == dt
True
>>> dt == A()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: data type not understood

It is generally considered bad practice to raise TypeError in a comparison method, and can prevent advanced uses: such as defining a custom type comparable with ArrayDescr, or simply mixing ArrayDescr with other types in e.g. a dictionary.

Indeed let's say I use the standard lru_cache, then hash collisions can produce unwanted errors because of the dictionary used internally comparing the keys:

>>> @functools.lru_cache()
... def f(x): return str(x)
... 
>>> 
>>> class A(object):
...   def __hash__(self):   
...      return hash(dt)
... 
>>> f(dt)
'int64'
>>> f(A())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/34/lib/python3.4/functools.py", line 446, in wrapper
    link = cache_get(key)
TypeError: data type not understood

The text was updated successfully, but these errors were encountered:

njsmith · 2014-12-03T23:40:32Z

Ugh. dtype hashing is... kinda a mess.

Dtypes are hashable objects that are nonetheless mutable (though
fortunately rarely mutated), and whose == believes that it knows
perfectly well how to handle all object types. It handles them like:

if isinstance(other, np.dtype):
# do a real comparison
else:
return self == np.dtype(other)

np.dtype coercion itself is happy to interpret a startling variety of
objects, including strings, types, tuples, lists, dicts, and None.
(For fun, try: np.dtype("f8") == None.) It can also fail in a variety
of ways, and "Errors should never pass silently". Unfortunately this
coercion is very widely used in real code (things like 'if arr.dtype
== "float32": ...'; if someone writes "flaot32" by accident then
letting this pass silently would be pretty bad indeed).

And, of course, dtype.hash makes no attempt to be consistent with
the hash functions for these other types. Which means that if you put
objects of any of the above types into the same dict as a dtype,
whether they end up counting as the same object will depend on your
hash seed and bucket count.

In theory we could try to clean up bits of this (like, maybe suppress
the error in some cases and return NotImplemented in others?), but
that would still leave a lot of mess behind. The best idea I'm
thinking of right now is to just say, if you want to put dtype objects
in a dict, then only put dtype objects in that dict. Or find some
other way to ensure that they will not be compared to non-dtype
objects (e.g., replace each object with a tuple (True, dt) or (False,
non_dt)).

Any better ideas?

On Wed, Dec 3, 2014 at 2:58 PM, Antoine Pitrou notifications@github.com wrote:

When a type's eq (or tp_richcompare in C) doesn't understand the other type, it should return NotImplemented so that the other type's eq has a chance to be called. Unfortunately, NumPy's ArrayDesrc doesn't respect that convention and raises TypeError instead, making comparisons fragile:

import numpy as np
class A(object):
... def eq(self, other):
... return True
...
dt = np.dtype('i8')
A() == dt
True
dt == A()
Traceback (most recent call last):
File "", line 1, in
TypeError: data type not understood

It is generally considered bad practice to raise TypeError in a comparison method, and can prevent advanced uses: such as defining a custom type comparable with ArrayDescr, or simply mixing ArrayDescr with other types in e.g. a dictionary.

Indeed let's say I use the standard lru_cache, then hash collisions can produce unwanted errors because of the dictionary used internally comparing the keys:

@functools.lru_cache()
... def f(x): return str(x)
...

class A(object):
... def hash(self):
... return hash(dt)
...
f(dt)
'int64'
f(A())
Traceback (most recent call last):
File "", line 1, in
File "/home/antoine/34/lib/python3.4/functools.py", line 446, in wrapper
link = cache_get(key)
TypeError: data type not understood

—
Reply to this email directly or view it on GitHub.

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

pitrou · 2014-12-04T00:54:55Z

if someone writes "flaot32" by accident then letting this pass silently would be pretty bad indeed

Would it? It's not any worse then doing it when the object being compared is a plain string.
If people worry aboyt typos they probably should write np.float32 instead of "float32" :-)

pitrou · 2014-12-04T00:56:05Z

As for heterogenous dicts, that's why I brought the lru_cache() example. There are situations where it can make sense to have different key types in a dict.

njsmith · 2014-12-04T04:33:25Z

Yeah, in an ideal world dtype.eq probably wouldn't try to coerce
strings (or anything else) in the first place, but that's not how numpy's
original designers did things. So == "float32" is a common idiom, and
you've got to support the users you have, not the users you want...

I definitely see the use case for heterogenous dicts. I just don't see a
way to make it work without massively breaking the dtype API -- AFAICT it
won't work reliably unless we get rid of not just the TypeErrors on failed
coercions, but the coercions altogether. That's why I was asking if you had
a better idea :-).
On 4 Dec 2014 00:56, "Antoine Pitrou" notifications@github.com wrote:

As for heterogenous dicts, that's why I brought the lru_cache() example.
There are situations where it can make sense to have different key types in
a dict.

—
Reply to this email directly or view it on GitHub
#5345 (comment).

stefansjs · 2019-06-27T18:58:39Z

Ouch, I just found this out by trying to put a dtype into a type that wants to be hashable. Thanks for the clear writeup! I understand why equality is violates the python API, but couldn't hash be fixed?

I'm imagining

if you compare equivalent to a builtin type, return the hash for that type instead of yourself.
if you are some kind of object/array type hash your string description, plus some alignment info

I can definitely see an argument for hashing dtypes differently from builtin types even though the equality function would return true in that case (which would help me immensely for my use case). In this case, maybe you could return the hash of immutable parameters, like byte order, alignment, size, and builtin type used to construct you. Is this reasonable?

Further, as a workaround, for my custom type, is it sufficient to do something like this?

def hash_dtype(np_dtype):
    """ Provides a stable hash for dtypes by hashing some (but not all) properties """
    hash_properties = (np_dtype.type, np_dtype.byteorder, np_dtype.alignment, np_dtype.str)
    return hash(hash_properties)

seberg · 2021-02-24T20:33:26Z

Closing this as a duplicate of gh-7242 (both issues have quite a lot of discussion).

seberg mentioned this issue Feb 24, 2021

DEP: np.dtype('float64') == None and np.dtype(None) should be deprecated #18434

Open

seberg closed this as completed Feb 24, 2021

brandon-b-miller mentioned this issue Jul 12, 2021

[FEA] Use extension dtype for numerical columns rapidsai/cudf#8707

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

dtype richcompare doesn't respect protocol #5345

dtype richcompare doesn't respect protocol #5345

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dtype richcompare doesn't respect protocol #5345

dtype richcompare doesn't respect protocol #5345

Comments

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!