-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
ENH: Weighted quantile #9211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Weighted quantile #9211
Conversation
This comment has been minimized.
This comment has been minimized.
numpy/lib/function_base.py
Outdated
if weights is None: | ||
Nx = ap.shape[-1] | ||
indices = q * (Nx - 1) | ||
indices = np.broadcast_to(indices, ap[..., 0].shape + q.shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not ap.shape[:-1]
?
numpy/lib/function_base.py
Outdated
def _take1d(vec, inds): | ||
return take(vec, inds) | ||
|
||
vec_take = np.vectorize(_take1d, signature='(n),(m)->(m)') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this a superset of arraysort
above? Just use this for both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That might work, but I think having separate names (and signature
) provides more clarity on what these two routines are trying to accomplish. I think It certainly helps understanding/debugging the code. Unless there's a real performance impact I prefer to keep it the way it is, though I'd defer to your judgment.
numpy/lib/function_base.py
Outdated
An array of weights associated with the values in `a`. Each value in | ||
`a` contributes to the average according to its associated weight. | ||
The weights array can either be 1-D (in which case its length must be | ||
the size of `a` along the given axis) or of the same shape as `a`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be relaxed to "Either 1D, or such that weights.ndim == a.ndim
and that weights
and a
are broadcastable"
This comment has been minimized.
This comment has been minimized.
numpy/lib/function_base.py
Outdated
indices = q * (Nx - 1) | ||
indices = np.broadcast_to(indices, ap.shape[:-1] + q.shape) | ||
ind_sorted = np.argsort(ap, axis=-1) # sort values long axis | ||
ap = arraysort(ap, ind_sorted) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without weights it should not be slower than the old percentile, so partition should be used and the vectorization of the index sort avoided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just fixed it in the latest push.
this does seem almost out of scope for numpy and more something for a statistics package, has this been previously discussed? |
The explanation of weights can be substantially simplified if you make the connection between the normalized weights and repeating the data in the sample in the integer case. For example, if you have [1,2,3] with weight [1,2,3] this will be have identically to the case where there are no weighs and the data are [1,2,2,3,3,3]. In Stata, for example, this type if weighting is known as fweighting (for frequency). There are other interpretations of weights that may result in a different estimator. |
I think this PR belongs in numpy: I've tried to address the comments above in the latest push. The explanation of how I compute the weighted quantile is still a bit cryptic, but I can't find a better way at the moment. |
Anything I could do to help expedite this? It seems that the merge conflict in #9213 is rather small, and I could help bridge it with this PR as well. |
@chunweiyuan: Thanks for pointing out the merge conflict, I can go fix that. There was a lack of decision in the comments of #9213 whether |
Doesn't seem like there's much response to my email related to the API, so I think the way forward is to keep both |
numpy/lib/function_base.py
Outdated
@@ -4202,61 +4239,166 @@ def percentile(a, q, axis=None, out=None, | |||
Returns | |||
------- | |||
percentile : scalar or ndarray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be quantile, not percentile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed. This baby is now just waiting for #9213........
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably needs an entry somewhere in the release notes.
@@ -3185,6 +3185,85 @@ def test_nan_behavior(self): | |||
a, [0.3, 0.6], (0, 2), interpolation='nearest'), b) | |||
|
|||
|
|||
class TestQuantile(object): | |||
|
|||
def test_reqression(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a q
instaead of a g
. If I am wrong, ignore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right.
actual = np.quantile(ar, q=0.5, weights=np.ones(8).reshape(2, 1, 4)) | ||
assert_almost_equal(actual, expected) | ||
|
||
def test_various_weights(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about integer weights?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think line 3233 there tests integer weights.
numpy/lib/function_base.py
Outdated
""" | ||
Compute the qth percentile of the data along the specified axis. | ||
|
||
Returns the qth percentile(s) of the array elements. | ||
|
||
Uses quantile() underneath, with | ||
1.) weights=None, and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the parens.
numpy/lib/function_base.py
Outdated
""" | ||
Compute the qth percentile of the data along the specified axis. | ||
|
||
Returns the qth percentile(s) of the array elements. | ||
|
||
Uses quantile() underneath, with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be removed or modified. There does not need to be any implementation information in the docs. Instead, you could say something like "Calling percentile
is identical to calling quantile
with weights
set to None
and the percentiles divided by 100."
numpy/lib/function_base.py
Outdated
-------- | ||
quantile | ||
""" | ||
return quantile(a, q=np.asarray(q)/100., axis=axis, weights=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could save on a call to asarray
by doing q=true_divide(q, 100)
@chunweiyuan For future reference, it is probably better to rebase your branch onto master rather than merging master into your branch. The history will look much cleaner and you won't have to deal with a stray merge commit later down the line. |
Sounds good. Old habits die hard. Now this branch is just waiting on #9213, I believe. |
Looks like you did something screwy with a merge/rebase there. I've seen that behaviour before, but don't remember the cause. |
Ok, looks like actually github was being screwy - changing the base twice has removed the commits from #9475 from this PR, which was what was weird in the first place. |
…n lib.function_base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211). Move weights normalization to function_base._validate_and_ureduce_weights(), and speed improvement to conditional statement therein (numpy#9211). Additional docs to how weights are applied for weighted quantile calculation in lib.function_base (numpy#9211). Consolidate all np.vectorize() algo into one in lib.function_base._quantile() (numpy#9211). Better mapping from weight space previous/next indexes to real space in lib.function_base._quantile(), with additional comments. (numpy#9211)
…s in lib.function_base (numpy#9211). Add tests to ensure consistency 1.) between get_virtual_index/fix_gamma and discrite_shortcut in lib.function_base._QuantileMethods, and 2.) in decimal weights. (numpy#9211) Remove nan-index check in function_base._get_indexes. (numpy#9211)
a52d539
to
8907353
Compare
…base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211)
…n lib.function_base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211). Move weights normalization to function_base._validate_and_ureduce_weights(), and speed improvement to conditional statement therein (numpy#9211). Additional docs to how weights are applied for weighted quantile calculation in lib.function_base (numpy#9211). Consolidate all np.vectorize() algo into one in lib.function_base._quantile() (numpy#9211). Better mapping from weight space previous/next indexes to real space in lib.function_base._quantile(), with additional comments. (numpy#9211)
…s in lib.function_base (numpy#9211). Add tests to ensure consistency 1.) between get_virtual_index/fix_gamma and discrite_shortcut in lib.function_base._QuantileMethods, and 2.) in decimal weights. (numpy#9211) Remove nan-index check in function_base._get_indexes. (numpy#9211)
8907353
to
ea8df3f
Compare
ea8df3f
to
3b9a3e6
Compare
…base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211)
…n lib.function_base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211). Move weights normalization to function_base._validate_and_ureduce_weights(), and speed improvement to conditional statement therein (numpy#9211). Additional docs to how weights are applied for weighted quantile calculation in lib.function_base (numpy#9211). Consolidate all np.vectorize() algo into one in lib.function_base._quantile() (numpy#9211). Better mapping from weight space previous/next indexes to real space in lib.function_base._quantile(), with additional comments. (numpy#9211)
…s in lib.function_base (numpy#9211). Add tests to ensure consistency 1.) between get_virtual_index/fix_gamma and discrite_shortcut in lib.function_base._QuantileMethods, and 2.) in decimal weights. (numpy#9211) Remove nan-index check in function_base._get_indexes. (numpy#9211)
…base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211)
…n lib.function_base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211). Move weights normalization to function_base._validate_and_ureduce_weights(), and speed improvement to conditional statement therein (numpy#9211). Additional docs to how weights are applied for weighted quantile calculation in lib.function_base (numpy#9211). Consolidate all np.vectorize() algo into one in lib.function_base._quantile() (numpy#9211). Better mapping from weight space previous/next indexes to real space in lib.function_base._quantile(), with additional comments. (numpy#9211)
…s in lib.function_base (numpy#9211). Add tests to ensure consistency 1.) between get_virtual_index/fix_gamma and discrite_shortcut in lib.function_base._QuantileMethods, and 2.) in decimal weights. (numpy#9211) Remove nan-index check in function_base._get_indexes. (numpy#9211)
Various other changes pertaining to api signatures.
3b9a3e6
to
479f098
Compare
Sorry, but closing this as I think I hinted before; it is too unwieldy and also a few open questions that need clearer answers IMO. If you have input, please reply to this thread: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/JHKKUXFT6Z2KTZQZHX6EKQSLSJCFNPYM/ |
…base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211)
…n lib.function_base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211). Move weights normalization to function_base._validate_and_ureduce_weights(), and speed improvement to conditional statement therein (numpy#9211). Additional docs to how weights are applied for weighted quantile calculation in lib.function_base (numpy#9211). Consolidate all np.vectorize() algo into one in lib.function_base._quantile() (numpy#9211). Better mapping from weight space previous/next indexes to real space in lib.function_base._quantile(), with additional comments. (numpy#9211)
…s in lib.function_base (numpy#9211). Add tests to ensure consistency 1.) between get_virtual_index/fix_gamma and discrite_shortcut in lib.function_base._QuantileMethods, and 2.) in decimal weights. (numpy#9211) Remove nan-index check in function_base._get_indexes. (numpy#9211)
Various other changes pertaining to api signatures.
…base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211)
…n lib.function_base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211). Move weights normalization to function_base._validate_and_ureduce_weights(), and speed improvement to conditional statement therein (numpy#9211). Additional docs to how weights are applied for weighted quantile calculation in lib.function_base (numpy#9211). Consolidate all np.vectorize() algo into one in lib.function_base._quantile() (numpy#9211). Better mapping from weight space previous/next indexes to real space in lib.function_base._quantile(), with additional comments. (numpy#9211)
…s in lib.function_base (numpy#9211). Add tests to ensure consistency 1.) between get_virtual_index/fix_gamma and discrite_shortcut in lib.function_base._QuantileMethods, and 2.) in decimal weights. (numpy#9211) Remove nan-index check in function_base._get_indexes. (numpy#9211)
Various other changes pertaining to api signatures.
Creates
quantile()
function that allowsweights
input.Fixes #6326, #8935
The following are plotted the same way as the graphs at the bottom of the
np.percentile
docs, but with an added weights argument:weights=[1, 2, 3, 4]
:weights = [2, 1, 1, 2]
: