Weighted quantile option in nanpercentile() #8935
Comments
In your example, is `w_dict` …? As far as I can tell, you're essentially asking to add a `weights` argument. One thing that concerns me: we have … Do you want to go ahead and submit a PR that patches `np.nanpercentile`? |
Also, I'd be pretty in favor of renaming `percentile` to `quantile`, keeping the former as:

```python
def percentile(..., q):
    return quantile(..., q=np.asarray(q) / 100)
```

But that's for another issue (#8936). |
The dimension prescribed in `w_dict` …

If no objection I'll go ahead and try to patch `np.nanpercentile`. |
Ah, I forgot that … I think this would make more sense in numpy with … |
Actually, the weights can be applied along any single dimension of the array, whether in `np.percentile` or `np.nanpercentile`. |
I'm perfectly fine with not using a dict for weights. |
It might make sense to allow for different weight vectors on different columns/rows. OTOH, requiring the weights to broadcast to the original array would produce very confusing behavior for the relatively common case of 1-D weights and axis != -1... |
Let's not reinvent the wheel here. We already do the following for `np.average`: weights either match the shape of the array, or are 1-D with length equal to `a.shape[axis]`. |
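For reference, a quick illustration of that `np.average` convention (arbitrary numbers, my addition):

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)

# 1-D weights are matched against a.shape[axis] ...
print(np.average(a, axis=1, weights=[1.0, 2.0, 3.0]))  # [1.33..., 4.33...]

# ... while same-shape weights are always accepted
print(np.average(a, axis=0, weights=np.ones_like(a)))  # [1.5, 2.5, 3.5]
```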
Feel like I'm ready to create a PR for this. I've added … |
File the PR against master. Better to submit it early and fix it up than to try to get everything right the first time, especially if people end up disagreeing with your goal. |
any updates on this one? |
@chunweiyuan @eric-wieser any updates? is someone working on this? |
Yes, this is being worked on in #9211 |
The PR #9211 seems to be stuck because of different views of what a "weight" is. I think we should first clarify that point and then have an implementation plan.

1. Definition of weights

Requirements: …

Statistical interpretation as case weights: …

Definition of quantiles: …

Plugging in the empirical distribution in definitions 2 and 3 results in definitions of empirical quantiles well suited for testing.

2. Implementation plan

If the properties above are agreed on, a possible implementation plan could be: …

3. Alternatives

Ask scipy maintainers if weighted quantiles could be a fit for scipy. |
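The empirical object being plugged in is presumably the weighted empirical CDF; spelling out the standard definition (my addition, not quoted from the thread):

$$\hat F_w(x) = \frac{\sum_{i=1}^{n} w_i \,\mathbb{1}\{y_i \le x\}}{\sum_{i=1}^{n} w_i},$$

which reduces to the usual ECDF for unit weights and is invariant under rescaling all $w_i$ by a constant.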
@seberg @eric-wieser ping |
Using the weighted empirical CDF seems good, but the question is how/if you generalize it to the other methods. The second PR attempt used something like what landed in xarray: https://github.com/pydata/xarray/blob/b4e3cbcf17374b68477ed3ff7a8a52c82837ad91/xarray/core/weighted.py#L338
As given above, some methods depend on the sample size, so we have to calculate an effective sample size, and this means a distinction between the methods.
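One standard candidate for that quantity (my suggestion for concreteness; the thread does not commit to one) is Kish's effective sample size:

$$n_{\mathrm{eff}} = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2},$$

which equals $n$ for equal weights and shrinks as the weights become more unequal.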
I don't believe in adding something not fully functional. We work with 1-D internally anyway (IIRC) using … I would accept it if we limit ourselves to methods that do not need a correction factor (the default is continuous but has no correction term). |
As quantiles are often poorly defined, I wrote a little blog post: https://lorentzen.ch/index.php/2023/02/11/quantiles-and-their-estimation/.
After giving it much thought, my recommendation is to start with the weighted version of the inverted CDF method only. The reason is that it unambiguously generalizes to weights, which can be integers or floats. |
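For concreteness, a minimal 1-D sketch of this weighted inverted-CDF rule (function name and edge-case handling are mine, not from any PR):

```python
import numpy as np

def weighted_quantile_inverted_cdf(a, q, weights):
    """Smallest sample whose normalized cumulative weight reaches q."""
    a = np.asarray(a, dtype=float)
    w = np.asarray(weights, dtype=float)
    order = np.argsort(a)
    a, w = a[order], w[order]
    cw = np.cumsum(w)                                 # weighted ECDF steps
    k = np.searchsorted(cw, q * cw[-1], side="left")  # first step reaching q
    return a[k]

x = [1.0, 2.0, 3.0, 4.0]
print(weighted_quantile_inverted_cdf(x, 0.5, [1, 1, 1, 1]))  # 2.0
print(np.quantile(x, 0.5, method="inverted_cdf"))            # 2.0 (NumPy >= 1.22)
```

Integer weights behave exactly like repeated observations here, and rescaling all weights by a constant leaves the result unchanged.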
JuliaStats has something here; the discussion isn't super convincingly thorough to me. But they use our default (interpolate/R7), and I think for that the normalization should indeed be irrelevant. They do seem to ensure that … |
@robjhyndman would you happen to have any guidance with regards to weighted quantiles? It seems probably clear within certain limitations, i.e. we can define it for frequency weights with a check that they are integers. But I really wonder if we can find prior art or even literature? (E.g. the JuliaStats people seemed to not find clear prior art...) Or maybe you can formulate a preference for some solution? |
To the best of my knowledge, the distinction of weights has no impact on point estimates of the mean, only on the estimation of its variance. The same applies to point estimates of quantiles, which should not depend on the interpretation (better: assumptions) of weights; only the variance of this estimate (which we are not talking about) would depend on it. |
I am not convinced for the non-trivial … |
Very hard to give a short answer.

1st attempt: …

2nd attempt, with more theory: mean and quantiles are both elicitable, i.e. there exists a loss (or scoring) function such that the argmin of its expectation is the mean/quantile. |
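Spelling this out with the standard definitions (my addition): the mean minimizes expected squared error, while the $\alpha$-quantile minimizes the expected pinball loss

$$S_\alpha(z, y) = \left(\mathbb{1}\{y \le z\} - \alpha\right)(z - y).$$

A weighted sample quantile then drops out the same way a weighted mean does:

$$\hat{q}_\alpha \in \operatorname*{arg\,min}_z \sum_i w_i\, S_\alpha(z, y_i), \qquad \bar{y}_w = \operatorname*{arg\,min}_z \sum_i w_i (z - y_i)^2,$$

so at the level of point estimates the weights enter identically in both cases.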
@lorentzenchr my counter-argument is also short: given … |
Your argument is correct for the median but not for general quantiles (at least I would expect). I can also translate this into a proof that …

The unbiased estimation is obviously better at the job, but you can see the similarity, I think. |
I don't think this is very hard to deal with as such. We probably just have to: …

I just would prefer someone with a stronger statistics background to bless such a choice. We have been insisting on … |
1. Could we have a call?
2. The median unbiased estimator is maybe not the best one to generalize with weights.
3. I have to - more or less - strongly disagree: … The line …
4. Please let us drop the distinction between frequency, analytical and other weights for the estimation of quantiles (or do you have a literature reference for it? I'd be interested). We don't do that for the mean either. |
I don't disagree with your PR, beyond having to check how we handle … |
So we had a short chat: …

Remark: Sadly, there is hardly any literature pointing at definitions or solutions. |
I wanted to understand the … |
I started with pretty much agreeing but now need to pivot to an example first. One potential problem for me, I think, was how I interpret(ed) the weights. I tried to interpret them as measurement uncertainty, and my intuition and thinking process failed me badly on it. This is maybe no surprise: quantiles being robust, using measurement uncertainty as weights likely doesn't make sense.

Alternatively, I can interpret them as wanting to correct for sampling bias. Example: I sample 10 people from a village and 10 from a city, but I know the total population has 2/3 living in the city and I want to correct for that. First, I thought: aha, this works out with what you said. Then I thought: wait, how would I actually want to choose my weights in the above example (ignore problems with …)? |
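To make the bias-correction reading concrete (my own illustration with arbitrary data): with 2/3 of the population in the city but a 50/50 sample, each city respondent should carry twice the weight of a village respondent.

```python
import numpy as np

rng = np.random.default_rng(0)
village = rng.normal(30.0, 5.0, size=10)  # arbitrary illustrative values
city = rng.normal(40.0, 5.0, size=10)

sample = np.concatenate([village, city])
# population is 1/3 village, 2/3 city, but the sample is 50/50,
# so each city observation gets twice the weight of a village one
w = np.concatenate([np.ones(10, dtype=int), np.full(10, 2, dtype=int)])

# emulate frequency weights by repetition (exact for integer weights)
print(np.median(np.repeat(sample, w)))
```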
Measurement uncertainty can be seen as variance and, from an estimation theorist's point of view, weights inversely proportional to the (true) variance are optimal. (But this argument might only apply to estimation of the mean.) I would divide the cases as follows: …

I personally won't implement weighted versions for 2). For the sample bias correction example and a quantile that relies on the ECDF, you can apply any weights as long as their ratio is 1/2; scale does not matter. |
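A tiny check of that scale claim (my illustration; `method="inverted_cdf"` requires NumPy >= 1.22):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
for w in ([1, 2, 1], [10, 20, 10]):   # same ratios, different scale
    repeated = np.repeat(a, w)        # emulate frequency weights by repetition
    print(np.quantile(repeated, 0.5, method="inverted_cdf"))  # 2.0 both times
```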
I can be convinced of a minimal thing, but what is that? Call it …?

Also pinging @anntzer, who was involved in another issue discussion.

What would really help moving things here is literature. Asking for things to be included in NumPy without clear literature is a very big ask from the start.

The other thing that would help me is if we could give examples of what weights mean. I can be on board better with …

I admit again though: I think for "inverted CDF" the type of weights doesn't matter; it is safe to implement. But how many users actually want to use "inverted CDF"?

EDIT: Unless we explicitly add … |
My understanding: …

To give one use case: gradient boosted trees for estimating quantiles, see "Greedy function approximation: A gradient boosting machine", Eq. (14). The weights are given by (the absolute value of) the base learner, which can be any non-negative value. |
Just noticed that Wikipedia has a "method [that] extends the above approach in a natural way" [citation needed]…, which defines it for the plotting-position methods in the way that I tried to think along (I think I got it slightly different/wrong, but I suspected as much when starting with the implementation).

I can probably say the same thing in many ways. For example, I suspect that for your definition we can say that duplicating/repeating the same data is an invariant for the result, so that all interpretations of weights are equivalent.

The question is still what the proposed API is. Use …? |
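If I recall that Wikipedia section correctly (my paraphrase, not quoted from the thread), the weighted extension replaces the plotting position $(k - 1/2)/n$ by

$$p_k = \frac{1}{S_n}\left(S_k - \frac{w_k}{2}\right), \qquad S_k = \sum_{j=1}^{k} w_j,$$

which reduces to $(k - 1/2)/n$ for unit weights; interpolation between adjacent $p_k$ then proceeds as in the unweighted case.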
Hmmm, hard to put into perspective how to interpret the weights and what … |
Concerning API: as long as there is no behavioral difference, i.e. as long as no method can have both fweights and aweights with different computational logic, a single argument seems sufficient.

@glemaitre and @lucyleeow, you did quite some work on weighted quantiles in scikit-learn/scikit-learn#17768. Do you have further insights into this topic?

About literature: it is really hard to find anything. This blogpost also did a survey of the literature and software packages. Because of the lack of anything, the same author wrote this paper (but note: not peer reviewed). |
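For reference, NumPy itself already keeps the two kinds apart in `np.cov`, which takes both `fweights` (integer frequencies) and `aweights` (float reliability weights); a quick illustration with arbitrary numbers:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
f = np.array([1, 2, 1])          # fweights: integer repeat counts
aw = np.array([0.5, 1.0, 0.5])   # aweights: relative (float) importance

print(np.cov(x, fweights=f))     # == np.cov(np.repeat(x, f)) -> 0.666...
print(np.cov(x, aweights=aw))    # different normalization    -> 0.8
```

For point estimates of quantiles the two would coincide wherever no sample-size correction enters, which is exactly the question in this thread.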
🎉 yet another way to define the quantile where you use more than two samples in a final interpolation. |
I can see that working with a single `weights` argument. |
As references for different kinds of weights (with mean/average in mind), which are a good read: … |
For our work we frequently need to compute weighted quantiles. This is especially important when we need to weigh data from recent years more heavily in making predictions.
I've put together a function (called `weighted_quantile`) largely based on the source code of `np.percentile`. It allows one to input weights along a single dimension, as a dict `w_dict`. Below are some manual tests, using `xarray` as a starting point.

When all weights = 1, it's identical to using `np.percentile`: …

Now different weights: …

Also handles nan values like `np.nanpercentile`: …

Lastly, different interpolation schemes are consistent: …

We wonder if it's ok to make this feature part of numpy, probably in `np.nanpercentile`?
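(The original test snippets did not survive above; the following is a purely hypothetical reconstruction of the described interface, with the exact signature assumed rather than taken from the issue.)

```python
import numpy as np

a = np.array([[1.0, 2.0, np.nan],
              [4.0, 5.0, 6.0]])

# Hypothetical: weights attached to a single dimension, keyed by axis.
w_dict = {1: np.array([1.0, 2.0, 3.0])}

# Intended behavior per the issue text (weighted_quantile is not defined here):
#   weighted_quantile(a, 50, w_dict={1: np.ones(3)})  == np.nanpercentile(a, 50, axis=1)
#   weighted_quantile(a, 50, w_dict=w_dict)           -> weighted medians, skipping NaNs
print(np.nanpercentile(a, 50, axis=1))   # the unweighted baseline: [1.5, 5.0]
```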