-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
datetime64 breaks equality and hash invariant. #3836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
very likely related to #3800 which fixes integer hashing. Probably a similar fix can be applied. |
currently numpy just hashes the int64, which breaks the invariant if two datetime represent the same time in different units. can timezone based times (Y/M/W/D) ever compare equal to non zone units (s/ms...)? |
I don't think there's any real concern about speed here, e.g. just blowing |
I don't know anything about datetime (for what is this undocumented num variable in the metadata(?!)), but I think this should work: def hash(input)
basemeta = smallest_unit // attoseconds
PyLong num, PyLong denom = get_base_conversion(input, basemeta)
if timezone:
base = get_timezone_to_D_base_conversion() // 356 12, 7
num, denom = mix(base, num, denom)
PyLong result = convert_to_base_unit(num, denom, input.asPyLong)
return hash(result) which basically means reimplementing a bunch of the existing conversion stuff (for casting) in PyLong. |
do we still want to do this for 1.8 or defer it to a 1.8.1? |
This bug is already there in 1.7, right? And has no PR yet? I think we just leave it for 1.8.1/1.9. |
Ah, this explains some unexpected behavior I've been seeing.
It means that |
Another way in which this fails today: now = datetime.datetime.now()
np_now = np.datetime64(now)
assert now == np_now
assert hash(now) == hash(np_now) # fails |
Pandas has a PR open now about this here: pandas-dev/pandas#50960 which may make sense to upstream as is or similar. (Sorry had the wrong link first) |
For (a bit) of visibility: The PR currently converts to datetime and if that is out of range does something similar to a tuple hash: But, if anyone has a thought on a better scheme that would also be good. We do need to extract the year anyway to decide whether we need to use the builtin python hashing. |
It looks as though the
datetime64
dtype breaks the Python rule thatx == y
should implyhash(x) == hash(y)
. This broke a Pandas application that was grouping on dates, and then doing a dictionary lookup to find the lines of aDataFrame
associated to a particular date.The text was updated successfully, but these errors were encountered: