8000 ENH: Weighted quantile by chunweiyuan · Pull Request #9211 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: Weighted quantile #9211

New issue
8000

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed

Conversation

chunweiyuan
Copy link
@chunweiyuan chunweiyuan commented Jun 2, 2017

Creates quantile() function that allows weights input.

Fixes #6326, #8935


The following are plotted the same way as the graphs at the bottom of the np.percentile docs, but with an added weights argument:

weights=[1, 2, 3, 4]:
weights_1_2_3_4

weights = [2, 1, 1, 2]:
weights_2_1_1_2

@eric-wieser

This comment has been minimized.

if weights is None:
Nx = ap.shape[-1]
indices = q * (Nx - 1)
indices = np.broadcast_to(indices, ap[..., 0].shape + q.shape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not ap.shape[:-1]?

def _take1d(vec, inds):
return take(vec, inds)

vec_take = np.vectorize(_take1d, signature='(n),(m)->(m)')
Copy link
Member
@eric-wieser eric-wieser Jun 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a superset of arraysort above? Just use this for both?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might work, but I think having separate names (and signature) provides more clarity on what these two routines are trying to accomplish. I think It certainly helps understanding/debugging the code. Unless there's a real performance impact I prefer to keep it the way it is, though I'd defer to your judgment.

An array of weights associated with the values in `a`. Each value in
`a` contributes to the average according to its associated weight.
The weights array can either be 1-D (in which case its length must be
the size of `a` along the given axis) or of the same shape as `a`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be relaxed to "Either 1D, or such that weights.ndim == a.ndim and that weights and a are broadcastable"

@chunweiyuan

This comment has been minimized.

indices = q * (Nx - 1)
indices = np.broadcast_to(indices, ap.shape[:-1] + q.shape)
ind_sorted = np.argsort(ap, axis=-1) # sort values long axis
ap = arraysort(ap, ind_sorted)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without weights it should not be slower than the old percentile, so partition should be used and the vectorization of the index sort avoided.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just fixed it in the latest push.

@juliantaylor
Copy link
Contributor

this does seem almost out of scope for numpy and more something for a statistics package, has this been previously discussed?

@bashtage
Copy link
Contributor
bashtage commented Jun 3, 2017

The explanation of weights can be substantially simplified if you make the connection between the normalized weights and repeating the data in the sample in the integer case.

For example, if you have [1,2,3] with weight [1,2,3] this will be have identically to the case where there are no weighs and the data are [1,2,2,3,3,3]. In Stata, for example, this type if weighting is known as fweighting (for frequency). There are other interpretations of weights that may result in a different estimator.

@chunweiyuan
Copy link
Author

I think this PR belongs in numpy:
1.) it's nothing more than an extension of percentile, and
2.) I've gone through great pain to make sure the underlying interpolation scheme, when weights is not None, is consistent with the old percentile. In the new tests there are cases where all weights are identical, and quantile renders the same result as the old percentile. There's another case where weights=[0, 1, 2], and quantile gives the same result as using the percentile on a simple repetition of the original data.

I've tried to address the comments above in the latest push. The explanation of how I compute the weighted quantile is still a bit cryptic, but I can't find a better way at the moment.

@chunweiyuan chunweiyuan changed the title Weighted quantile ENH: Weighted quantile Jun 11, 2017
@chunweiyuan
Copy link
Author

Anything I could do to help expedite this? It seems that the merge conflict in #9213 is rather small, and I could help bridge it with this PR as well.

@eric-wieser
Copy link
Member
8000

@chunweiyuan: Thanks for pointing out the merge conflict, I can go fix that.

There was a lack of decision in the comments of #9213 whether quantile should be in the API, if we already have percentile. Perhaps asking the mailing list for opinions would resolve that.

@chunweiyuan
Copy link
Author

Doesn't seem like there's much response to my email related to the API, so I think the way forward is to keep both quantile and percentile. There was, however, one suggestion related to using an O(n) partition algorithm instead of O(n log n) (argsort) in the code. I'm not sure how that would be addressed since I need ind_sorted in my code to sort the weights array.

@@ -4202,61 +4239,166 @@ def percentile(a, q, axis=None, out=None,
Returns
-------
percentile : scalar or ndarray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be quantile, not percentile

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed. This baby is now just waiting for #9213........

Copy link
Contributor
@madphysicist madphysicist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs an entry somewhere in the release notes.

@@ -3185,6 +3185,85 @@ def test_nan_behavior(self):
a, [0.3, 0.6], (0, 2), interpolation='nearest'), b)


class TestQuantile(object):

def test_reqression(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a q instaead of a g. If I am wrong, ignore.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right.

actual = np.quantile(ar, q=0.5, weights=np.ones(8).reshape(2, 1, 4))
assert_almost_equal(actual, expected)

def test_various_weights(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about integer weights?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think line 3233 there tests integer weights.

"""
Compute the qth percentile of the data along the specified axis.

Returns the qth percentile(s) of the array elements.

Uses quantile() underneath, with
1.) weights=None, and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the parens.

"""
Compute the qth percentile of the data along the specified axis.

Returns the qth percentile(s) of the array elements.

Uses quantile() underneath, with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be removed or modified. There does not need to be any implementation information in the docs. Instead, you could say something like "Calling percentile is identical to calling quantile with weights set to None and the percentiles divided by 100."

--------
quantile
"""
return quantile(a, q=np.asarray(q)/100., axis=axis, weights=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could save on a call to asarray by doing q=true_divide(q, 100)

@madphysicist
Copy link
Contributor

@chunweiyuan For future reference, it is probably better to rebase your branch onto master rather than merging master into your branch. The history will look much cleaner and you won't have to deal with a stray merge commit later down the line.

@chunweiyuan
Copy link
Author

Sounds good. Old habits die hard. Now this branch is just waiting on #9213, I believe.

@eric-wieser
Copy link
Member

Looks like you did something screwy with a merge/rebase there. I've seen that behaviour before, but don't remember the cause.

@eric-wieser eric-wieser changed the base branch from master to maintenance/1.13.x August 11, 2017 18:53
@eric-wieser eric-wieser changed the base branch from maintenance/1.13.x to master August 11, 2017 18:54
@eric-wieser
Copy link
Member

Ok, looks like actually github was being screwy - changing the base twice has removed the commits from #9475 from this PR, which was what was weird in the first place.

chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Aug 6, 2023
…n lib.function_base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211).

    Move weights normalization to function_base._validate_and_ureduce_weights(), and speed improvement to conditional statement therein (numpy#9211).

    Additional docs to how weights are applied for weighted quantile calculation in lib.function_base (numpy#9211).

    Consolidate all np.vectorize() algo into one in lib.function_base._quantile() (numpy#9211).

    Better mapping from weight space previous/next indexes to real space in lib.function_base._quantile(), with additional comments. (numpy#9211)
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Aug 6, 2023
…s in lib.function_base (numpy#9211).

    Add tests to ensure consistency 1.) between get_virtual_index/fix_gamma and discrite_shortcut in lib.function_base._QuantileMethods, and 2.) in decimal weights. (numpy#9211)

    Remove nan-index check in function_base._get_indexes. (numpy#9211)
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Aug 6, 2023
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Aug 7, 2023
…base

     and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211)
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Aug 7, 2023
…n lib.function_base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211).

    Move weights normalization to function_base._validate_and_ureduce_weights(), and speed improvement to conditional statement therein (numpy#9211).

    Additional docs to how weights are applied for weighted quantile calculation in lib.function_base (numpy#9211).

    Consolidate all np.vectorize() algo into one in lib.function_base._quantile() (numpy#9211).

    Better mapping from weight space previous/next indexes to real space in lib.function_base._quantile(), with additional comments. (numpy#9211)
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Aug 7, 2023
…s in lib.function_base (numpy#9211).

    Add tests to ensure consistency 1.) between get_virtual_index/fix_gamma and discrite_shortcut in lib.function_base._QuantileMethods, and 2.) in decimal weights. (numpy#9211)

    Remove nan-index check in function_base._get_indexes. (numpy#9211)
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Aug 7, 2023
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Aug 7, 2023
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Aug 25, 2023
…base

     and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211)
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Aug 25, 2023
…n lib.function_base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211).

    Move weights normalization to function_base._validate_and_ureduce_weights(), and speed improvement to conditional statement therein (numpy#9211).

    Additional docs to how weights are applied for weighted quantile calculation in lib.function_base (numpy#9211).

    Consolidate all np.vectorize() algo into one in lib.function_base._quantile() (numpy#9211).

    Better mapping from weight space previous/next indexes to real space in lib.function_base._quantile(), with additional comments. (numpy#9211)
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Aug 25, 2023
…s in lib.function_base (numpy#9211).

    Add tests to ensure consistency 1.) between get_virtual_index/fix_gamma and discrite_shortcut in lib.function_base._QuantileMethods, and 2.) in decimal weights. (numpy#9211)

    Remove nan-index check in function_base._get_indexes. (numpy#9211)
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Aug 25, 2023
…base

     and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211)
…n lib.function_base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211).

    Move weights normalization to function_base._validate_and_ureduce_weights(), and speed improvement to conditional statement therein (numpy#9211).

    Additional docs to how weights are applied for weighted quantile calculation in lib.function_base (numpy#9211).

    Consolidate all np.vectorize() algo into one in lib.function_base._quantile() (numpy#9211).

    Better mapping from weight space previous/next indexes to real space in lib.function_base._quantile(), with additional comments. (numpy#9211)
…s in lib.function_base (numpy#9211).

    Add tests to ensure consistency 1.) between get_virtual_index/fix_gamma and discrite_shortcut in lib.function_base._QuantileMethods, and 2.) in decimal weights. (numpy#9211)

    Remove nan-index check in function_base._get_indexes. (numpy#9211)
Various other changes pertaining to api signatures.
@seberg
Copy link
Member
seberg commented Nov 8, 2023

Sorry, but closing this as I think I hinted before; it is too unwieldy and also a few open questions that need clearer answers IMO. If you have input, please reply to this thread: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/JHKKUXFT6Z2KTZQZHX6EKQSLSJCFNPYM/

@seberg seberg closed this Nov 8, 2023
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Nov 8, 2023
…base

     and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211)
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Nov 8, 2023
…n lib.function_base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211).

    Move weights normalization to function_base._validate_and_ureduce_weights(), and speed improvement to conditional statement therein (numpy#9211).

    Additional docs to how weights are applied for weighted quantile calculation in lib.function_base (numpy#9211).

    Consolidate all np.vectorize() algo into one in lib.function_base._quantile() (numpy#9211).

    Better mapping from weight space previous/next indexes to real space in lib.function_base._quantile(), with additional comments. (numpy#9211)
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Nov 8, 2023
…s in lib.function_base (numpy#9211).

    Add tests to ensure consistency 1.) between get_virtual_index/fix_gamma and discrite_shortcut in lib.function_base._QuantileMethods, and 2.) in decimal weights. (numpy#9211)

    Remove nan-index check in function_base._get_indexes. (numpy#9211)
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Nov 8, 2023
Various other changes pertaining to api signatures.
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Nov 8, 2023
…base

     and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211)
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Nov 8, 2023
…n lib.function_base and nanquantile/nanpercentile in lib.nanfunctions (numpy#9211).

    Move weights normalization to function_base._validate_and_ureduce_weights(), and speed improvement to conditional statement therein (numpy#9211).

    Additional docs to how weights are applied for weighted quantile calculation in lib.function_base (numpy#9211).

    Consolidate all np.vectorize() algo into one in lib.function_base._quantile() (numpy#9211).

    Better mapping from weight space previous/next indexes to real space in lib.function_base._quantile(), with additional comments. (numpy#9211)
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Nov 8, 2023
…s in lib.function_base (numpy#9211).

    Add tests to ensure consistency 1.) between get_virtual_index/fix_gamma and discrite_shortcut in lib.function_base._QuantileMethods, and 2.) in decimal weights. (numpy#9211)

    Remove nan-index check in function_base._get_indexes. (numpy#9211)
chunweiyuan pushed a commit to chunweiyuan/numpy that referenced this pull request Nov 8, 2023
Various other changes pertaining to api signatures.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
01 - Enhancement 64 - Good Idea Inactive PR with a good start or idea. Consider studying it if you are working on a related issue. component: numpy.lib
Projects
Development

Successfully merging this pull request may close these issues.

weighted percentile
0