8000 BUG MaskedArray __eq__ wrong for masked scalar, multi-d recarray by mhvk · Pull Request #8590 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

BUG MaskedArray __eq__ wrong for masked scalar, multi-d recarray #8590

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Feb 28, 2017

Conversation

mhvk
Copy link
Contributor
@mhvk mhvk commented Feb 9, 2017

In the process of trying to fix the "questionable behaviour in MaskedArray.__eq__" (#8589), it became clear that the code was more than a little buggy. E.g., ma == ma[0] failed if ma held a structured dtype; multi-d structured dtypes failed generally; and, more worryingly, a masked scalar comparison could be wrong:

np.ma.MaskedArray(1, mask=True) == 0
# True

It doesn't help to do tests on data filled with 0 if one doesn't consistently check the mask after... (for the rest, see the new test cases)

numpy/ma/core.py Outdated
# Structured array. Mask element if all parts are masked.
mask = mask.view(bool_).reshape(mask.shape+(-1,)).all(-1)

if check.shape == () and check.dtype.names:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still wondering whether it would be better to use .view(type(self)) as long as that is not mvoid while the shape is that of an array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you mean in the following line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, indeed. In the old code, the result is unconditionally turned into type(self), which is a problem if one compares a MaskedArray with an mvoid (where the latter is a single element, which will always have precedence since it is a subclass of MaskedArray.

Possibly, it is better solved in mvoid...

Anyway, as written the code works, but let me know if there are other suggestions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, probably it can also be solved by defining _compare in mvoid, but I agree this is fine.

(also, not sure if the test above for check.dtype.names should be something like check.dtype == np.void to account for dtype('V4')... but maybe that is overthinking it since I suspect truly void types don't work well with masks anyway)

numpy/ma/core.py Outdated
def __eq__(self, other):
"""
Check whether other equals self elementwise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a bit more explanation of the role of the mask.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a bit more explanation of the role of the mask.

Wasn't there before, but happy to do so (but will wait for possible other comments first).

@charris
Copy link
Member
charris commented Feb 10, 2017

@ahaldane @shoyer Comment?

numpy/ma/core.py Outdated
mask = mask_or(self._mask, omask)
if mask.dtype.names:
# Structured array. Mask element if all parts are masked.
mask = mask.view(bool_).reshape(mask.shape+(-1,)).all(-1)
Copy link
Member
@ahaldane ahaldane Feb 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is the old behavior, but just want to note that any makes more sense to me than all.

With all, I think it causes the result of the comparison to depend on the data at a masked value, eg:

>>> m = np.ma.mvoid((1,2), mask=(0,1), dtype='i4,f4')
>>> n = np.ma.mvoid((1,0), dtype='i4,f4')
>>> o = np.ma.mvoid((1,1), dtype='i4,f4')
>>> (n == m), (m == n), (o == m)
(True, masked, False)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahaldane To be clear, is that the behaviour you want, or the behaviour that all implies?

Copy link
Membe 8000 r

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the behavior I get with this PR as it is right now (I fetchd + tested it).

I actually didn't work out why n == m is different from m ==n yet.

Copy link
Member
@eric-wieser eric-wieser Feb 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should m == n return masked or True? Either seem somewhat reasonable (but obviously both as they are right now is bad!)

Copy link
Member
@ahaldane ahaldane Feb 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you ask, I have been going back and forth on it...

After some thought, the only consistent result I see is to return masked for all three cases, except that we need some extra code so that m == m, ie

>>> p = np.ma.mvoid((1,7), mask=(0,1), dtype='i4,f4')
>>> (m == p), (m == m)
True, True

which I think requires replacement of mask_or by mask_xor, and then use any instead of all. But possibly we should have m == m return masked, I'm still a little undecided.

Returning True seems worse to me because intuitively I think n and m don't represent equal values, and secondly because the code would have to be much more complicated: We would have to reimplements numpy's internal structure comparison (somewhere in the C code) inside of MaskedArray to ignore fields with masked values.

Copy link
Member
@eric-wieser eric-wieser Feb 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahaldane: I'm suggesting all semantics on the flattened view of the masked data (when such a thing can exist), not on the mask alone. I'm suggesting this because that's what the semantics of normal structured arrays are.

Flattening the data in your example, we get consistent behaviour

>>> o = np.ma.array([[1]], mask=[[True]])
>>> p = np.ma.array([[0]], mask=[[True]])
>>> (o == p).all(axis=-1)
masked_array(data = [--],
             mask = [ True],
       fill_value = True)

I'd expect your example to return a 1D array, not a scalar, but I assume that's just a mistake.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm with @eric-wieser here, but maybe best to postpone further discussion until I have the next version up; @ahaldane's example shows that this one is more broken than I had thought (though an improvement on what was there!). In the meantime, I also found that other parts of np.ma are relying on comparison of two masked items being returned as masked True...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I just noticed there are unused, half-implemented methods _get_recordmask and _set_recordmask defined for MaskedArray...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't np.ma lovely...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of work an exploration was put into it, which hopefully we can learn from in a future tidying/reimplementation. (incidentally, if more numpy members who know about maskedarrays are joining recently, maybe we can make a reimplementation happen soon)

With respect to this PR: I recognize that, as often happens, this PR might be snowballing now that the unit tests are failing upon further changes. I might be in favor of getting this PR through now as a partial fix (at least m == m[0] will work). What do you think?

@ahaldane
Copy link
Member
ahaldane commented Feb 14, 2017

Besides the any vs all examples, the rest looks good.

Also, congrats on becoming a numpy member @mhvk ! I'm very happy to have someone else who often works on MaskedArrays here.

@mhvk
Copy link
Contributor Author
mhvk commented Feb 15, 2017

@ahaldane - hmm, that's a nasty problem you point out: ideally, we would ignore any masked values in the comparison, but that is not how it was done (or done in my PR). Needs some thought...

@mhvk
Copy link
Contributor Author
mhvk commented Feb 15, 2017

@ahaldane - on the "snowballing" -- I think it is not too bad, but I agree that if I don't get to something better by the end of the week, we probably should take this as at least an initial step.

@ahaldane
Copy link
Member

@mhvk all right, sounds good

@mhvk mhvk force-pushed the ma/eq_ne_axis_bug branch from 1be0dc1 to 6b84bdb Compare February 16, 2017 15:04
@mhvk
Copy link
Contributor Author
mhvk commented Feb 16, 2017

@ahaldane, @eric-wieser - I pushed an updated version of MaskedArray.__eq__ -- I think I got all the corner cases right now and sufficient comments that it should be clear what is the intent for whoever looks at this next... (in the process, I also got rid of the rather hacky mask.view(bool_), which would not work if the memory layout of the mask is non-contiguous.)

assert_equal(n == o, False)
assert_equal(n == m, False)
assert_equal(o == m, True)
assert_equal(o == n, False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think that every single one of these tests should return np.masked

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, can any two mvoid objects return np.masked on comparison with this patch?

Copy link
Member
@eric-wieser eric-wieser Feb 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My argument would be that there are 3 cases here:

  • No matter what values the masked fields take, two objects are always equal
  • No matter what values the masked fields take, two objects are always not equal
  • The equality depends on the values of the masked fields

Since the first two of these cases are represented by True and False, surely we should use masked to represent the third case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, mvoid comparisons can yield masked, but only if all elements are masked (in at least one of the two):

m = mvoid((1, 2.), mask=(0, 1), dtype='i4,f4')
p = mvoid((1, 2.), mask=(1, 0), dtype=m.dtype)
m == p
# masked

I probably should add that to the test cases...

@mhvk
Copy link
Contributor Author
mhvk commented Feb 16, 2017

@eric-wieser - replying in the main thread so things do not get lost: I was actually working by an analogy (I think) you made earlier, that the structured arrays should work the same way as would be the case if one tested equality over an axis. Now for that, masked elements are simply ignored:

np.all(np.ma.array([True, False], mask=(0,0)))
# False
np.all(np.ma.array([True, False], mask=(0,1)))
# True
np.all(np.ma.array([True, False], mask=(1,1)))
# masked

I think this is actually reasonably logical given the interpretation that masking elements means one should ignore them: if both arrays mask a given element, just don't care, if one does not mask it, it cannot possibly be equal to a masked one.

@mhvk
Copy link
Contributor Author
mhvk commented Feb 16, 2017

Actually, @eric-wieser, your real example was:

m = np.ma.array((1,2), mask=(0,1))
n = np.ma.array((1,0))
m == n
# masked_array(data = [True --],
#              mask = [False  True],
#        fill_value = True)
(m == n).all()
# True

while with my PR,

m = np.ma.array((1,2), mask=(0,1), dtype='i4,i4')
n = np.ma.array((1,0), dtype=m.dtype)
m == n
#  False

so I'm not consistent.

Darn.

@eric-wieser
Copy link
Member

@mhvk : Perhaps the test suite should try every permutation of mask / value, and compare structured vs unstructured comparison?

@ahaldane
Copy link
Member

This is a little of a further-out idea, but I'd like to try it out on you.

Maybe we want masked structure comparison to be fundamentally different from plain structure comparison.

What I mean is, maybe instead of returning a single True/False value per element, comparison should return a new structure similar to the mask of a masked structured array, which preserves all fields but makes them booleans. So

>>> m = np.ma.array([(1,2)], mask=[(0,1)], dtype='i4,f4')
>>> n = np.ma.array([(1,1)],  dtype='i4,f4')
>>> m == n
masked_array(data = [(True, --)],
             mask = [(False,  True)],
       fill_value = ( True,  True),
            dtype = [('f0', '?'), ('f1', '?')])

Incidentally, plain structure comparison is itself somewhat poorly defined and gives you lots of deprecation warnings in various cases. Its conceivable we could also change plain structure comparisons to do a similar thing (just without the masks). However I haven't thought it through, and that might be too big a change.

@mhvk mhvk force-pushed the ma/eq_ne_axis_bug branch 3 times, most recently from 7c1847a to ed86a0b Compare February 18, 2017 21:49
@mhvk
Copy link
Contributor Author
mhvk commented Feb 18, 2017

@eric-wieser, @ahaldane - a new version, which now explicitly tests that the structured comparison gives the same result as doing .all() (or .any() for ne) over an axis. I think this makes by far the most sense given the present setup, but do note that it meant I had to change a test -- which apparently did not assume this (I think almost certainly an oversight, and writing tests to match the code). So, there is now a risk of regressions (though given how buggy this part was, I think a fairly small one).

b = array([(1, 1), (2, 2)], mask=[(0, 1), (1, 0)], dtype=ndtype)
test = (a == b)
assert_equal(test, [True, False])
assert_equal(test, [True, True])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the test I had to change (as well as its equivalent for ne).

@mhvk
Copy link
Contributor Author
mhvk commented Feb 18, 2017

@ahaldane - now that __eq__ is always guaranteed to return an array, I think it would indeed be most logical to let it return a structured bool array. I do fear, though, that that ship has sailed long ago -- it would simply break too much code, which relies on such structured array comparisons to work item by item. That said, possibly one should have a new function that does it the more logical way?

@mhvk mhvk force-pushed the ma/eq_ne_axis_bug branch from ed86a0b to 7f191b5 Compare February 18, 2017 22:03
@ahaldane
Copy link
Member

Ok, I like the behavior now and I am happy that the code seems quite elegant now compared to what was there before!

I'll merge in an in an hour or two, if there are no further comments.

(We can keep the "more logical" possible way for another time, if ever. It's such a niche case I don't think its a priority)

@mhvk mhvk force-pushed the ma/eq_ne_axis_bug branch from 7f191b5 to 400e0a6 Compare February 19, 2017 17:09
@mhvk
Copy link
Contributor Author
mhvk commented Feb 19, 2017

@ahaldane - yes, I liked as well that the code ended up becoming reasonably elegant! Note that I realised I had not updated the docstring for __ne__, so I pushed a last changed version...

numpy/ma/core.py Outdated
check._mask = mask
mask = mask_or(smask, omask)

odata = other.data if isinstance(other, MaskedArray) else other
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems inconsistent with the duck-typing we seem to use elsewhere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done.

numpy/ma/core.py Outdated
# elements, masked or not. To avoid this, we set all masked
# elements in self to other, so they'll count as equal.
# To prepare, we ensure we have the right shape.
sbroadcast = np.broadcast_to(self, np.broadcast(self, other).shape,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be clearer as just sbroadcast, odatabroadcast = np.broadcast_arrays(self, odata, subok=True)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, on second thought, I agree clarity beats speed here (I essentially wrote out what np.broadcast_arrays does, but skipping broadcasting odata).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhvk: I'm also suggesting you pass odatabroadcast rather than odata into filled, for clarity

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I'm starting to think that this should be the default behaviour of filled, allowing filling with a larger fill-value than the dimensions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, that breaks tests, it turns out: broadcast_arrays is not guaranteed to return a new instance, and since I set the mask right after, I effectively change self. Not sure that is desired behaviour, but outside of the scope of this PR...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And looking into this, I realise mask is not properly broadcast either (and never was): this fails

np.ma.array([0, 1], mask=[0, 1]) == np.ma.array([[0, 2]])

(it fails in __repr__ but would otherwise be bad too)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eric-wieser - as you'll see, I had to leave the broadcasting alone, since otherwise I had to make a copy. I agree with you that, ideally, filled would just do the right thing itself. Even better if the filled function could take a mask rather than just calling the method. But I think that's best left for another PR!

numpy/ma/core.py Outdated

check = compare(sdata, odata)

if isinstance(check, (bool_, bool)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably better as if np.ndim(check) == 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, unless this is deliberately trying to return 0d arrays and not scalars

numpy/ma/core.py Outdated
# Note that this works automatically for structured arrays too.
check[mask] = compare(smask, omask)[mask]

check = check.view(MaskedArray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a function somewhere within this file for picking the appropriate subclass of MaskedArray

@mhvk mhvk force-pushed the ma/eq_ne_axis_bug branch from c540303 to 04f83e5 Compare February 21, 2017 22:21
@mhvk
Copy link
Contributor Author
mhvk commented Feb 21, 2017

colons inserted...

@eric-wieser
Copy link
Member

Great. I'll leave this a day in case @ahaldane has any more comments or the tests fail, then looks good to merge. Thanks for tolerating my hole-picking!

@ahaldane
Copy link
Member

Nice catches @eric-wieser !

I just read through it again, and nothing sticks out to me. Feel free to merge, if you don't I'll do it tomorrow.

"""Check whether other does not equal self elementwise.

When either of the elements is masked, the result is masked as well,
but the underlying boolean data are still set, with self and other
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any tests of this underlying bool data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, note how the tests all do something like test = (?? != ??) and then test both the values and the mask.

@eric-wieser
Copy link
Member

I think this needs a release note, since we actually did change behaviour - it was a narrow and buggy enough case that probably no one is affected, but if they are, then it would be nice to have a release note to point at

@mhvk
Copy link
Contributor Author
mhvk commented Feb 22, 2017

Yes, fine to add a quick note to the release: @charris - this is a bug fix mainly still, which I think should go in 1.12.1. But do I add a note to 1.12.1-notes.rst or to 1.13.0-notes.rst?

@@ -1327,14 +1327,32 @@ def test_eq_on_structured(self):
test = (a == a)
assert_equal(test, [True, True])
assert_equal(test.mask, [False, False])
test = (a == a[0])
assert_equal(test, [True, False])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is trying to test the underlying values, I think it needs a .data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're presuming that numpy functions correctly account for subclasses.... ;-)

In [2]: test = np.ma.MaskedArray([True, True], mask=[False, True])
In [3]: assert_equal(test, [True, True])
In [4]: test = np.ma.MaskedArray([True, False], mask=[False, True])
In [5]: assert_equal(test, [True, False])

(they should, of course, but that's for another time....)

Copy link
Member
@eric-wieser eric-wieser Feb 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wanna bet they don't?

In [2]: test = np.ma.MaskedArray([True, True], mask=[False, True])
In [3]: assert_equal(test, [True, True])  # correct, as before
In [4]: assert_equal(test, [True, False])  # uh oh...
In [5]: assert_equal(test, [True, np.array("oh dear", dtype=object)])  # oh geez

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grrrrrr.... OK, will change.

@mhvk mhvk force-pushed the ma/eq_ne_axis_bug branch from 04f83e5 to 7e8bfd6 Compare February 22, 2017 16:40
@charris
Copy link
Member
charris commented Feb 22, 2017

@mhvk When this showed up I was tempted to make it a backport, but because of the changes in tested behavior decided not to. I'd like to branch 1.13 by the end of next month, so the delay should not be excessive.

@charris
Copy link
Member
charris commented Feb 22, 2017

As to where to make release note additions for maintenance releases, it varies ;) Generally it should start in current master, but I did it the other way round in the 1.12.0 release because of the numerous updates. As a more practical matter, the 1.12.1 notes don't yet exist in the 1.12.x branch.

@eric-wieser eric-wieser added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Feb 26, 2017
@mhvk mhvk force-pushed the ma/eq_ne_axis_bug branch from 7e8bfd6 to 3e6a69f Compare February 27, 2017 15:04
@mhvk
Copy link
Contributor Author
mhvk commented Feb 27, 2017

OK, added a changelog entry.

solved. In the process, it was ensured that in getting the result for a
structured array, masked fields are properly ignored, i.e., the result is equal
if all fields that are non-masked in both are equal, thus making the behaviour
identical to what one gets by comparing a regular array and then doing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regular would be better as un-structured here, since regular falsely implies "not masked"

@mhvk mhvk force-pushed the ma/eq_ne_axis_bug branch from 3e6a69f to f49708b Compare February 27, 2017 16:23
mhvk added 2 commits February 27, 2017 11:24
In the process of trying to fix the "questionable behaviour in
`MaskedArray.__eq__`" (numpygh-8589), it became clear that the code was
buggy. E.g., `ma == ma[0]` failed if `ma` held a structured dtype;
multi-d structured dtypes failed generally; and, more worryingly,
a masked scalar comparison could be wrong:
`np.ma.MaskedArray(1, mask=True) == 0` yields True.

This commit solves these problems, adding tests to prevent regression.
In the process, it also ensures that the results for structured arrays
always equals what one would get by logically combining the results
over individual parts of the structure.
@mhvk mhvk force-pushed the ma/eq_ne_axis_bug branch from f49708b to 3435dd9 Compare February 27, 2017 16:24
@mhvk
Copy link
Contributor Author
mhvk commented Feb 27, 2017

OK, done. I cancelled the builds, since this passed already.

@eric-wieser
Copy link
Member

Not sure I'm comfortable merging this without the builds... Problem with rewriting history is I have no way of knowing if you accidentally messed something up, because I can't just see the one-word diff you intended to make. I'm sure you didn't, but it'd be nice to have a passed test to prove it.

@eric-wieser eric-wieser reopened this Feb 27, 2017
@eric-wieser
Copy link
Member

Thanks for your patience, @mhvk!

@mhvk
Copy link
Contributor Author
mhvk commented Feb 28, 2017

Thanks for all the comments, @eric-wieser -- the result is a much better new method than I had initially, one that does more at the same time as being smaller.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
00 - Bug 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes component: numpy.ma masked arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0