-
-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Handle nan/masked values Axes.vlines and hlines #7408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I'm not holding it up for that line.
@@ -979,14 +979,14 @@ def hlines(self, y, xmin, xmax, colors='k', linestyles='solid', | |||
|
|||
verts = [((thisxmin, thisy), (thisxmax, thisy)) | |||
for thisxmin, thisxmax, thisy in zip(xmin, xmax, y)] | |||
coll = mcoll.LineCollection(verts, colors=colors, | |||
lines = mcoll.LineCollection(verts, colors=colors, | |||
linestyles=linestyles, label=label) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this line need to indent to keep things lined up (pep8 and all that...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @phobson !
This supports nans in the input, but not masked arrays; in most cases we try to support both. If you want to handle masked arrays, you could run the arguments through a function analogous to Overall, we have a somewhat confusing mixture of handling missing points via nans (which only works for floating point, and which requires nanmax etc.) and via masked arrays, for which methods work mostly as they do on plain ndarrays. |
Second thought: for these functions, would it make more sense to simply filter out the bad values at the start? I don't see any point in pushing the handling of bad values down into the min and max calculations. Admittedly, I haven't looked closely. |
@efiring Filtering was my thought when I first went into this. Let me mess around with feeding masked arrays into this function before we merge. In either case, handling both masked arrays and sequences with invalid ( |
With the current implementation in this PR, masked arrays does not work as expected. For example %matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
fig3, ax5 = plt.subplots()
x5 = np.ma.masked_equal([2, 4, 6, 8, 10, 12], 8)
ymin5 = np.ma.masked_equal([0, 1, -1, 0, 2, 1], 2)
ymax5 = np.ma.masked_equal([13, 14, 15, 16, 17, 18], 18)
ax5.vlines(x5, ymin5, ymax5, colors='k', linewidth=2) Produces a plot with 5 lines. Only the line where I'll get back to this later this weekend. |
Filtering a single array is easy: x, ymin, ymax = cbook.delete_masked_points(x, ymin, ymax) and I think that will take care of everything. |
@efiring Great. Thanks for the tip. I'll include that and the example from my previous comment as a test. |
Closes GH matplotlib#7406. Also applied fixes to Axes.hlines and wrote tests for both.
39b5b24
to
747c3cd
Compare
complete reworked how nan/masked values are handled. need fresh review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, but would like a clarification on the question below.
minx = min(xmin.min(), xmax.min()) | ||
maxx = max(xmin.max(), xmax.max()) | ||
minx = np.min([xmin.min(), xmax.min()]) | ||
maxx = np.max([xmin.max(), xmax.max()]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the rationale behind the change from using the builtin min
and max
to numpy's? If we're talking micro-performance-optimization, the builtins will be faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally, vlines
used the builtin min
and max
, while hlines
used a mix of builtin and numpy. I was going for consistency.
To my eyes, the numpy syntax is more familiar and it feels "safer" (not sure if that's true). I wasn't considering performance with those changes.
My understanding is also that it's faster:
x = numpy.ranndom.normal(size=3700)
%timeit x.max()
The slowest run took 53.27 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.88 µs per loop
where(2.9 * 55 ~ 160 µs)
%timeit max(x)
1000 loops, best of 3: 231 µs per loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm referring specifically to this change:
minx = min(xmin.min(), xmax.min())
to
minx = np.min([xmin.min(), xmax.min()])
and similar for max()
--not the change from min(ymin)
to ymin.min()
, etc. This latter change I agree with. For the former, I dislike using numpy's min
just to find the minimum of two scalars. You end up creating an extra list, which then descends into numpy and creates an array. On my system, taking the minimum of two scalars takes 10us with numpy, 4us with the builtin. There's also more for my eye to scan across with how you've changed it.
But I don't want to bikeshed this to death (6us is not going to have any impact on matplotlib's slowness)--but I'd be interested if we as developers have consensus on what the convention is or should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree: builtin min is cleaner for comparing two scalars. As @anntzer has pointed out, however, there is a difference in the way NaN is handled. Numpy also has minimum
, which can be used with two scalars and is faster than using np.min
:
In [1]: import numpy as np
In [2]: min(np.nan, 1.1)
nan
In [3]: min(1.1, np.nan)
1.1
In [7]: timeit np.minimum(2.2, 1.1)
The slowest run took 16.35 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 1.13 µs per loop
In [8]: timeit min(2.2, 1.1)
The slowest run took 11.75 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 197 ns per loop
In [9]: timeit np.min([2.2, 1.1])
The slowest run took 260.96 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 9.37 µs per loop
Conclusion: if there is any possibility of nans, use np.minimum; otherwise, builtin min is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dislike using numpy's min just to find the minimum of two scalars.
Ahh that makes sense. I'll make the change. You've convinced me :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@efiring since we're using e.g., cbook.delete_masked_points(x, ymin, ymax)
I think the possibility for NaNs has been removed.
This is a little simpler since we're finding the min or max of two scalars (avoids an intermediate list + diving into numpy). Also synced up the decorator between the two functions.
Test failure is a weird linkchecker failure--seems unrelated. |
Link checker failed because the documentation does not exist; the failure is here and appears to be related to this PR. |
Wow, not sure how I missed that. |
We should probably change Travis to not run link checker when the doc build fails. |
That failure is really confusing to me because the tests pass and the docs build on my local branch. Nonetheless, I think the latest commit (deleting masked/invalid points after we make sure the first arg is an iterable) should do the trick. |
All green--I think we're good here. |
If this is actually intended for 2.0, it will need to be cherry-picked. |
[MRG] Handle nan/masked values Axes.vlines and hlines Conflicts: lib/matplotlib/axes/_axes.py We keep the v2.x version on 2 lines: use unpack_labeled_data, instead of _preprocess_data, which is only in master.
back-ported to v2.x as ad192f0 |
thanks for the guidance and merging everyone |
Closes #7406.
Also applied fixes to Axes.hlines and wrote tests for both.