-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
BUG MaskedArray __eq__ wrong for masked scalar, multi-d recarray #8590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
numpy/ma/core.py
Outdated
# Structured array. Mask element if all parts are masked. | ||
mask = mask.view(bool_).reshape(mask.shape+(-1,)).all(-1) | ||
|
||
if check.shape == () and check.dtype.names: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still wondering whether it would be better to use .view(type(self))
as long as that is not mvoid
while the shape is that of an array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you mean in the following line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, indeed. In the old code, the result is unconditionally turned into type(self)
, which is a problem if one compares a MaskedArray
with an mvoid
(where the latter is a single element, which will always have precedence since it is a subclass of MaskedArray
.
Possibly, it is better solved in mvoid
...
Anyway, as written the code works, but let me know if there are other suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, probably it can also be solved by defining _compare
in mvoid
, but I agree this is fine.
(also, not sure if the test above for check.dtype.names
should be something like check.dtype == np.void
to account for dtype('V4')
... but maybe that is overthinking it since I suspect truly void types don't work well with masks anyway)
numpy/ma/core.py
Outdated
def __eq__(self, other): | ||
""" | ||
Check whether other equals self elementwise. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use a bit more explanation of the role of the mask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use a bit more explanation of the role of the mask.
Wasn't there before, but happy to do so (but will wait for possible other comments first).
numpy/ma/core.py
Outdated
mask = mask_or(self._mask, omask) | ||
if mask.dtype.names: | ||
# Structured array. Mask element if all parts are masked. | ||
mask = mask.view(bool_).reshape(mask.shape+(-1,)).all(-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is the old behavior, but just want to note that any
makes more sense to me than all
.
With all
, I think it causes the result of the comparison to depend on the data at a masked value, eg:
>>> m = np.ma.mvoid((1,2), mask=(0,1), dtype='i4,f4')
>>> n = np.ma.mvoid((1,0), dtype='i4,f4')
>>> o = np.ma.mvoid((1,1), dtype='i4,f4')
>>> (n == m), (m == n), (o == m)
(True, masked, False)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahaldane To be clear, is that the behaviour you want, or the behaviour that all
implies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the behavior I get with this PR as it is right now (I fetchd + tested it).
I actually didn't work out why n == m
is different from m ==n
yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should m == n
return masked
or True
? Either seem somewhat reasonable (but obviously both as they are right now is bad!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you ask, I have been going back and forth on it...
After some thought, the only consistent result I see is to return masked
for all three cases, except that we need some extra code so that m == m
, ie
>>> p = np.ma.mvoid((1,7), mask=(0,1), dtype='i4,f4')
>>> (m == p), (m == m)
True, True
which I think requires replacement of mask_or
by mask_xor
, and then use any
instead of all
. But possibly we should have m == m
return masked
, I'm still a little undecided.
Returning True
seems worse to me because intuitively I think n
and m
don't represent equal values, and secondly because the code would have to be much more complicated: We would have to reimplements numpy's internal structure comparison (somewhere in the C code) inside of MaskedArray to ignore fields with masked values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahaldane: I'm suggesting all
semantics on the flattened view of the masked data (when such a thing can exist), not on the mask alone. I'm suggesting this because that's what the semantics of normal structured arrays are.
Flattening the data in your example, we get consistent behaviour
>>> o = np.ma.array([[1]], mask=[[True]])
>>> p = np.ma.array([[0]], mask=[[True]])
>>> (o == p).all(axis=-1)
masked_array(data = [--],
mask = [ True],
fill_value = True)
I'd expect your example to return a 1D array, not a scalar, but I assume that's just a mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm with @eric-wieser here, but maybe best to postpone further discussion until I have the next version up; @ahaldane's example shows that this one is more broken than I had thought (though an improvement on what was there!). In the meantime, I also found that other parts of np.ma
are relying on comparison of two masked items being returned as masked True
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, I just noticed there are unused, half-implemented methods _get_recordmask
and _set_recordmask
defined for MaskedArray...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't np.ma
lovely...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of work an exploration was put into it, which hopefully we can learn from in a future tidying/reimplementation. (incidentally, if more numpy members who know about maskedarrays are joining recently, maybe we can make a reimplementation happen soon)
With respect to this PR: I recognize that, as often happens, this PR might be snowballing now that the unit tests are failing upon further changes. I might be in favor of getting this PR through now as a partial fix (at least m == m[0]
will work). What do you think?
Besides the Also, congrats on becoming a numpy member @mhvk ! I'm very happy to have someone else who often works on MaskedArrays here. |
@ahaldane - hmm, that's a nasty problem you point out: ideally, we would ignore any masked values in the comparison, but that is not how it was done (or done in my PR). Needs some thought... |
@ahaldane - on the "snowballing" -- I think it is not too bad, but I agree that if I don't get to something better by the end of the week, we probably should take this as at least an initial step. |
@mhvk all right, sounds good |
1be0dc1
to
6b84bdb
Compare
@ahaldane, @eric-wieser - I pushed an updated version of |
numpy/ma/tests/test_core.py
Outdated
assert_equal(n == o, False) | ||
assert_equal(n == m, False) | ||
assert_equal(o == m, True) | ||
assert_equal(o == n, False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think that every single one of these tests should return np.masked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, can any two mvoid
objects return np.masked
on comparison with this patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My argument would be that there are 3 cases here:
- No matter what values the masked fields take, two objects are always equal
- No matter what values the masked fields take, two objects are always not equal
- The equality depends on the values of the masked fields
Since the first two of these cases are represented by True
and False
, surely we should use masked
to represent the third case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, mvoid
comparisons can yield masked
, but only if all elements are masked (in at least one of the two):
m = mvoid((1, 2.), mask=(0, 1), dtype='i4,f4')
p = mvoid((1, 2.), mask=(1, 0), dtype=m.dtype)
m == p
# masked
I probably should add that to the test cases...
@eric-wieser - replying in the main thread so things do not get lost: I was actually working by an analogy (I think) you made earlier, that the structured arrays should work the same way as would be the case if one tested equality over an axis. Now for that, masked elements are simply ignored:
I think this is actually reasonably logical given the interpretation that masking elements means one should ignore them: if both arrays mask a given element, just don't care, if one does not mask it, it cannot possibly be equal to a masked one. |
Actually, @eric-wieser, your real example was:
while with my PR,
so I'm not consistent. Darn. |
@mhvk : Perhaps the test suite should try every permutation of mask / value, and compare structured vs unstructured comparison? |
This is a little of a further-out idea, but I'd like to try it out on you. Maybe we want masked structure comparison to be fundamentally different from plain structure comparison. What I mean is, maybe instead of returning a single True/False value per element, comparison should return a new structure similar to the
Incidentally, plain structure comparison is itself somewhat poorly defined and gives you lots of deprecation warnings in various cases. Its conceivable we could also change plain structure comparisons to do a similar thing (just without the masks). However I haven't thought it through, and that might be too big a change. |
7c1847a
to
ed86a0b
Compare
@eric-wieser, @ahaldane - a new version, which now explicitly tests that the structured comparison gives the same result as doing |
numpy/ma/tests/test_core.py
Outdated
b = array([(1, 1), (2, 2)], mask=[(0, 1), (1, 0)], dtype=ndtype) | ||
test = (a == b) | ||
assert_equal(test, [True, False]) | ||
assert_equal(test, [True, True]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the test I had to change (as well as its equivalent for ne
).
@ahaldane - now that |
ed86a0b
to
7f191b5
Compare
Ok, I like the behavior now and I am happy that the code seems quite elegant now compared to what was there before! I'll merge in an in an hour or two, if there are no further comments. (We can keep the "more logical" possible way for another time, if ever. It's such a niche case I don't think its a priority) |
7f191b5
to
400e0a6
Compare
@ahaldane - yes, I liked as well that the code ended up becoming reasonably elegant! Note that I realised I had not updated the docstring for |
numpy/ma/core.py
Outdated
check._mask = mask | ||
mask = mask_or(smask, omask) | ||
|
||
odata = other.data if isinstance(other, MaskedArray) else other |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems inconsistent with the duck-typing we seem to use elsewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, done.
numpy/ma/core.py
Outdated
# elements, masked or not. To avoid this, we set all masked | ||
# elements in self to other, so they'll count as equal. | ||
# To prepare, we ensure we have the right shape. | ||
sbroadcast = np.broadcast_to(self, np.broadcast(self, other).shape, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be clearer as just sbroadcast, odatabroadcast = np.broadcast_arrays(self, odata, subok=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, on second thought, I agree clarity beats speed here (I essentially wrote out what np.broadcast_arrays
does, but skipping broadcasting odata
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhvk: I'm also suggesting you pass odatabroadcast
rather than odata
into filled
, for clarity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, I'm starting to think that this should be the default behaviour of filled
, allowing filling with a larger fill-value than the dimensions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, that breaks tests, it turns out: broadcast_arrays
is not guaranteed to return a new instance, and since I set the mask right after, I effectively change self
. Not sure that is desired behaviour, but outside of the scope of this PR...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And looking into this, I realise mask
is not properly broadcast either (and never was): this fails
np.ma.array([0, 1], mask=[0, 1]) == np.ma.array([[0, 2]])
(it fails in __repr__
but would otherwise be bad too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eric-wieser - as you'll see, I had to leave the broadcasting alone, since otherwise I had to make a copy. I agree with you that, ideally, filled
would just do the right thing itself. Even better if the filled
function could take a mask
rather than just calling the method. But I think that's best left for another PR!
numpy/ma/core.py
Outdated
|
||
check = compare(sdata, odata) | ||
|
||
if isinstance(check, (bool_, bool)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably better as if np.ndim(check) == 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, unless this is deliberately trying to return 0d arrays and not scalars
numpy/ma/core.py
Outdated
# Note that this works automatically for structured arrays too. | ||
check[mask] = compare(smask, omask)[mask] | ||
|
||
check = check.view(MaskedArray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a function somewhere within this file for picking the appropriate subclass of MaskedArray
c540303
to
04f83e5
Compare
colons inserted... |
Great. I'll leave this a day in case @ahaldane has any more comments or the tests fail, then looks good to merge. Thanks for tolerating my hole-picking! |
Nice catches @eric-wieser ! I just read through it again, and nothing sticks out to me. Feel free to merge, if you don't I'll do it tomorrow. |
"""Check whether other does not equal self elementwise. | ||
|
||
When either of the elements is masked, the result is masked as well, | ||
but the underlying boolean data are still set, with self and other |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any tests of this underlying bool data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, note how the tests all do something like test = (?? != ??)
and then test both the values and the mask.
I think this needs a release note, since we actually did change behaviour - it was a narrow and buggy enough case that probably no one is affected, but if they are, then it would be nice to have a release note to point at |
Yes, fine to add a quick note to the release: @charris - this is a bug fix mainly still, which I think should go in 1.12.1. But do I add a note to 1.12.1-notes.rst or to |
numpy/ma/tests/test_core.py
Outdated
@@ -1327,14 +1327,32 @@ def test_eq_on_structured(self): | |||
test = (a == a) | |||
assert_equal(test, [True, True]) | |||
assert_equal(test.mask, [False, False]) | |||
test = (a == a[0]) | |||
assert_equal(test, [True, False]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is trying to test the underlying values, I think it needs a .data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're presuming that numpy functions correctly account for subclasses.... ;-)
In [2]: test = np.ma.MaskedArray([True, True], mask=[False, True])
In [3]: assert_equal(test, [True, True])
In [4]: test = np.ma.MaskedArray([True, False], mask=[False, True])
In [5]: assert_equal(test, [True, False])
(they should, of course, but that's for another time....)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanna bet they don't?
In [2]: test = np.ma.MaskedArray([True, True], mask=[False, True])
In [3]: assert_equal(test, [True, True]) # correct, as before
In [4]: assert_equal(test, [True, False]) # uh oh...
In [5]: assert_equal(test, [True, np.array("oh dear", dtype=object)]) # oh geez
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grrrrrr.... OK, will change.
04f83e5
to
7e8bfd6
Compare
@mhvk When this showed up I was tempted to make it a backport, but because of the changes in tested behavior decided not to. I'd like to branch 1.13 by the end of next month, so the delay should not be excessive. |
As to where to make release note additions for maintenance releases, it varies ;) Generally it should start in current master, but I did it the other way round in the 1.12.0 release because of the numerous updates. As a more practical matter, the 1.12.1 notes don't yet exist in the 1.12.x branch. |
7e8bfd6
to
3e6a69f
Compare
OK, added a changelog entry. |
doc/release/1.13.0-notes.rst
Outdated
solved. In the process, it was ensured that in getting the result for a | ||
structured array, masked fields are properly ignored, i.e., the result is equal | ||
if all fields that are non-masked in both are equal, thus making the behaviour | ||
identical to what one gets by comparing a regular array and then doing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regular would be better as un-structured
here, since regular falsely implies "not masked"
3e6a69f
to
f49708b
Compare
In the process of trying to fix the "questionable behaviour in `MaskedArray.__eq__`" (numpygh-8589), it became clear that the code was buggy. E.g., `ma == ma[0]` failed if `ma` held a structured dtype; multi-d structured dtypes failed generally; and, more worryingly, a masked scalar comparison could be wrong: `np.ma.MaskedArray(1, mask=True) == 0` yields True. This commit solves these problems, adding tests to prevent regression. In the process, it also ensures that the results for structured arrays always equals what one would get by logically combining the results over individual parts of the structure.
f49708b
to
3435dd9
Compare
OK, done. I cancelled the builds, since this passed already. |
Not sure I'm comfortable merging this without the builds... Problem with rewriting history is I have no way of knowing if you accidentally messed something up, because I can't just see the one-word diff you intended to make. I'm sure you didn't, but it'd be nice to have a passed test to prove it. |
Thanks for your patience, @mhvk! |
Thanks for all the comments, @eric-wieser -- the result is a much better new method than I had initially, one that does more at the same time as being smaller. |
In the process of trying to fix the "questionable behaviour in
MaskedArray.__eq__
" (#8589), it became clear that the code was more than a little buggy. E.g.,ma == ma[0]
failed ifma
held a structured dtype; multi-d structured dtypes failed generally; and, more worryingly, a masked scalar comparison could be wrong:It doesn't help to do tests on data filled with 0 if one doesn't consistently check the mask after... (for the rest, see the new test cases)