Weighted quantile option in nanpercentile() · Issue #8935 · numpy/numpy · GitHub

Closed
chunweiyuan opened this issue Apr 12, 2017 · 38 comments

@chunweiyuan
chunweiyuan commented Apr 12, 2017

For our work we frequently need to compute weighted quantiles. This is especially important when we need to weigh data from recent years more heavily in making predictions.

I've put together a function (called weighted_quantile), largely based on the source code of np.percentile. It allows one to supply weights along a single dimension, as a dict w_dict. Below are some manual tests, using xarray as a starting point:

When all weights = 1, it's identical to using np.percentile:

>>> ar0
<xarray.DataArray (x: 3, y: 4)>
array([[3, 4, 8, 1],
       [5, 3, 7, 9],
       [4, 9, 6, 2]])
Coordinates:
  * x        (x) |S1 'a' 'b' 'c'
  * y        (y) int64 0 1 2 3
>>> np.percentile(ar0, q=[25, 50, 75], axis=-1)
array([[ 2.5 ,  4.5 ,  3.5 ],
       [ 3.5 ,  6.  ,  5.  ],
       [ 5.  ,  7.5 ,  6.75]])
>>> weighted_quantile(da=ar0, q=[0.25, 0.5, 0.75], dim='y', w_dict={'y': [1,1,1,1]})
<xarray.DataArray (quantile: 3, x: 3)>
array([[ 2.5 ,  4.5 ,  3.5 ],
       [ 3.5 ,  6.  ,  5.  ],
       [ 5.  ,  7.5 ,  6.75]])
Coordinates:
  * x         (x) |S1 'a' 'b' 'c'
  * quantile  (quantile) float64 0.25 0.5 0.75

Now different weights:

>>> weighted_quantile(da=ar0, q=[0.25, 0.5, 0.75], dim='y', w_dict={'y': [1,2,3,4.0]})
<xarray.DataArray (quantile: 3, x: 3)>
array([[ 3.25    ,  5.666667,  4.333333],
       [ 4.      ,  7.      ,  5.333333],
       [ 6.      ,  8.      ,  6.75    ]])
Coordinates:
  * x         (x) |S1 'a' 'b' 'c'
  * quantile  (quantile) float64 0.25 0.5 0.75

Also handles nan values like np.nanpercentile:

>>> ar
<xarray.DataArray (x: 2, y: 2, z: 2)>
array([[[ nan,   3.],
        [ nan,   5.]],

       [[  8.,   1.],
        [ nan,   0.]]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 0 1
  * z        (z) int64 8 9
>>> da_stacked = ar.stack(mi=['x', 'y'])
>>> out = weighted_quantile(da=ar, q=[0.25, 0.5, 0.75], dim=['x', 'y'], w_dict={'x': [1, 1]})
>>> out
<xarray.DataArray (quantile: 3, z: 2)>
array([[ 8.  ,  0.75],
       [ 8.  ,  2.  ],
       [ 8.  ,  3.5 ]])
Coordinates:
  * z         (z) int64 8 9
  * quantile  (quantile) float64 0.25 0.5 0.75
>>> np.nanpercentile(da_stacked, q=[25, 50, 75], axis=-1)
array([[ 8.  ,  0.75],
       [ 8.  ,  2.  ],
       [ 8.  ,  3.5 ]])

Lastly, different interpolation schemes are consistent:

>>> out = weighted_quantile(da=ar, q=[0.25, 0.5, 0.75], dim=['x', 'y'], w_dict={'x': [1, 1]}, interpolation='nearest')
>>> out
<xarray.DataArray (quantile: 3, z: 2)>
array([[ 8.,  1.],
       [ 8.,  3.],
       [ 8.,  3.]])
Coordinates:
  * z         (z) int64 8 9
  * quantile  (quantile) float64 0.25 0.5 0.75
>>> np.nanpercentile(da_stacked, q=[25, 50, 75], axis=-1, interpolation='nearest')
array([[ 8.,  1.],
       [ 8.,  3.],
       [ 8.,  3.]])

We wonder if it would be OK to make this feature part of numpy, probably in np.nanpercentile?

@eric-wieser
Member

In your example, is w_dict.keys() ever different to {dim}?

As far as I can tell, you're essentially asking to add a weights parameter to percentile (and nanpercentile) in order to make it match np.average, which already has one.

One thing that concerns me - we have average with a weights parameter, and mean without (#8864). If we're going to add weights to everything (median is definitely worthy if percentile is), perhaps we should decide whether to merge these first.

Do you want to go ahead and submit a PR that patches percentile? Getting nanpercentile right after that shouldn't be too hard, and it saves you iteration time.

@eric-wieser
Member
eric-wieser commented Apr 12, 2017

Also, I'd be pretty in favor of renaming np.percentile to np.quantile, and adding something like

def percentile(..., q):
    return quantile(..., q=np.asarray(q) / 100)

But that's for another issue (#8936)

@chunweiyuan
Author

The dimension prescribed in w_dict does not have to be part of dim:

>>> out = weighted_quantile(da=ar, q=[0.25, 0.5, 0.75], dim=['z', 'y'], w_dict={'x': [1, 1]})
>>> out
<xarray.DataArray (quantile: 3, x: 2)>
array([[ 3.5,  0.5],
       [ 4. ,  1. ],
       [ 4.5,  4.5]])
Coordinates:
  * x         (x) |S1 'a' 'b'
  * quantile  (quantile) float64 0.25 0.5 0.75

If there's no objection, I'll go ahead and try to patch np.percentile with a PR then. Thanks.

@eric-wieser
Member
eric-wieser commented Apr 12, 2017

Ah, I forgot that dim could take a list. Presumably, any key in w_dict not in dim is ignored in your code?

I think this would make more sense in numpy with weights being a broadcastable array rather than a dict - try to match the semantics of average

@chunweiyuan
Author

Actually, the weights can be applied along any single dimension of the array, whether in dim or not. The constraints on the weights are:

  1. They must be a 1-D iterable with the same length as the weighted dimension.
  2. All elements must be numeric.
  3. No negative weights.
  4. The weights must sum to a value greater than 0.

I'm perfectly fine with not using a dict for weights.
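
For illustration, a hypothetical validation helper for the constraints above might look like the following (the name and signature are made up for this sketch, not part of any proposed API):

import numpy as np

def _validate_weights(weights, axis_length):
    """Check the constraints listed above for a 1-D weight vector (illustrative sketch)."""
    w = np.asarray(weights, dtype=float)  # fails if elements are not numeric
    if w.ndim != 1 or w.shape[0] != axis_length:
        raise ValueError("weights must be 1-D and match the length of the weighted dimension")
    if np.any(w < 0):
        raise ValueError("weights must not be negative")
    if not w.sum() > 0:
        raise ValueError("weights must sum to a value greater than 0")
    return w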

@njsmith
Member
njsmith commented Apr 13, 2017

1-D iterable

It might make sense to allow for different weight vectors on different columns/rows. OTOH requiring the weights to broadcast to the original array would produce very confusing behavior for the relatively common case of 1d weights and axis != -1...

@eric-wieser
Member

Let's not reinvent the wheel here. We already do the following for np.average:

weights : array_like, optional

An array of weights associated with the values in a. Each value in a contributes to the average according to its associated weight. The weights array can either be 1-D (in which case its length must be the size of a along the given axis) or of the same shape as a. If weights=None, then all data in a are assumed to have a weight equal to one.
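
For illustration, a small example of that existing np.average contract (real numpy API, arbitrary numbers):

import numpy as np

a = np.arange(6).reshape(2, 3)

# 1-D weights: the length must equal a.shape[axis]
np.average(a, axis=1, weights=[1, 1, 2])   # array([1.25, 4.25])

# full-shape weights: one weight per element
np.average(a, weights=np.ones_like(a))     # 2.5, same as a.mean()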

@chunweiyuan
Author

I feel like I'm ready to create a PR for this. I've added quantile in function_base.py, wrapped percentile with it, and added tests and a docstring. Is there a checklist of things I should look at before submitting the PR? Which branch should the PR target?

@eric-wieser
Member

File the PR against master. Better to submit it early and fix it up than to try to get everything right the first time, especially if people end up disagreeing with your goal.

@adrinjalali
Contributor

any updates on this one?

@amitport
amitport commented Dec 4, 2019

@chunweiyuan @eric-wieser Any updates? Is someone working on this?

@eric-wieser
Member
eric-wieser commented Dec 4, 2019

Yes, this is being worked on in #9211

@lorentzenchr
Contributor
lorentzenchr commented Jan 14, 2023

The PR #9211 seems to be stuck because of different views of what a "weight" is. I think we should first clarify that point and then have an implementation plan.

1. Definition of weights

Requirements:

  1. I think np.quantile(x, weights=w) should be the same in spirit as np.average(x, weights=w).
  2. Non-negativity: any $w_i \geq 0$ is allowed.
  3. For integer weight values, the same result can be obtained by replicating $x_i$ exactly $w_i$ times.
  4. Multiplying the weights by a constant does not change the quantile, i.e. $w$ and $c*w$ give the same quantile value for $c>0$.

Statistical interpretation as case weights
These requirements fit perfectly to case weights. A quantile at level alpha is a property $Q_\alpha$ of a probability distribution $F$, e.g. $Q_\alpha(F)$. Note that a quantile $Q_\alpha(F)$ can be set valued which leads to different possibilities for empirical quantiles (see method). The empirical version arises by plugging in the empirical distribution $\hat{F}$ of a given sample $x$. Case weights $w$ are then naturally defined on this distribution in a relative sense by the weighted empirical CDF $\hat{F}(t) = \frac{1}{\sum_i w_i} \sum_i w_i 1_{x_i \leq t}$.
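
A minimal numpy sketch of that weighted empirical CDF (the function name is made up for illustration):

import numpy as np

def weighted_ecdf(x, w, t):
    """F_hat(t) = sum_i w_i * 1{x_i <= t} / sum_i w_i."""
    x, w = np.asarray(x, dtype=float), np.asarray(w, dtype=float)
    return np.sum(w * (x <= t)) / np.sum(w)

x = [1.0, 2.0, 3.0]
weighted_ecdf(x, [1, 1, 1], t=2.0)  # 2/3, the plain (unweighted) ECDF
weighted_ecdf(x, [1, 2, 1], t=2.0)  # 3/4, weight 2 acts like repeating the value 2.0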

Definition of quantiles
Quantiles can be defined in at least 3 equivalent ways, see https://doi.org/10.48550/arXiv.0912.0902:

  1. Any value $q$ with $\lim_{y\uparrow q} F(y) \leq \alpha \leq F(q)$
  2. Any solution $q$ to the optimization problem $argmin_{z} E[S(z, Y)]$ with $Y$ distributed as $F$ and a consistent scoring function $S$ for the quantiles like Koenker's pinball loss $S(z, y) = (1_{z\geq y} - \alpha) (z - y)$.
  3. Any solution $q$ to the equation $E[V(q, Y)] = 0$ with identification function $V(z, y) = 1_{z\geq y} - \alpha$. This can be interpreted as the first-order condition for point 2.

Plugging in the empirical distribution in definitions 2 and 3 yields definitions of empirical quantiles that are well suited for testing.

2. Implementation plan

If the properties above are agreed on, a possible implementation plan could be:

  1. Extend tests for quantiles to check basic properties (TST: add tests for numpy.quantile #23129); a small sketch follows after this plan:
    • Multiplying the data by a constant $c>0$ gives $c*q$,
    • Adding a constant $c$ to the data gives $q + c$,
    • $q$ is a minimizer, see the 2nd definition above, maybe checked via the 3rd definition.
  2. Implement weighted quantiles
    • Start with 1d arrays, keep it private, don't release yet
    • Maybe start with one single method and extend later
    • Extend with axis support and release
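
A minimal sketch of the invariance checks from step 1, written against the current unweighted np.quantile (illustrative test code only, not the content of #23129):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
alpha, c = 0.75, 3.0
q = np.quantile(x, alpha)

# scale equivariance: quantile(c * x) == c * quantile(x) for c > 0
assert np.isclose(np.quantile(c * x, alpha), c * q)

# shift equivariance: quantile(x + c) == quantile(x) + c
assert np.isclose(np.quantile(x + c, alpha), q + c)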

3. Alternatives

Ask scipy maintainers if weighted quantiles could be a fit for scipy.

@lorentzenchr
Contributor

@seberg @eric-wieser ping

@seberg
Member
seberg commented Jan 20, 2023
  1. For integer weight values, the same result can be obtained by replicating x_i exactly w_i times

Using the weighted empirical CDF seems good, but the question is how/if you generalize it to the method=... variants that are contiguous. One of the PRs had a w >= 1 check, because this was somehow tricky (IIRC).

The second PR started out using something like what landed in xarray: https://github.com/pydata/xarray/blob/b4e3cbcf17374b68477ed3ff7a8a52c82837ad91/xarray/core/weighted.py#L338
which uses "Kish's effective sample size".
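
For reference, Kish's effective sample size is simply (sum of the weights)^2 / (sum of the squared weights); a quick sketch:

import numpy as np

def kish_effective_n(w):
    """Kish's effective sample size: (sum w)**2 / sum(w**2)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

kish_effective_n([1, 1, 1, 1])   # 4.0   -- equal weights give the plain sample size
kish_effective_n([1, 1, 1, 10])  # ~1.64 -- one dominating weight shrinks the effective n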

  1. Multiplying the weights by a constant does not change the quantile [...]

As given above, some methods depend on the sample-size, so we have to calculate an effective sample size and this means a distinction between the methods.
I suspect what xarray is doing is just fine, but I doubt that it should be termed "frequency" weights; it looks more like aweights.
For the non-contiguous and uncorrected methods the results should be identical to frequency weights, though.

Implement weighted quantiles [...]

I don't believe in adding something that is not fully functional. We work with 1-D internally anyway (IIRC) using the ureduce mechanism.
Anything that is not yet functional should live in its own small repo somewhere and we can discuss it there. Or on a draft PR maybe.

If necessary, I would accept limiting ourselves to methods that do not have a correction factor (the default is contiguous but has no correction term).

@lorentzenchr
Contributor

As quantiles are often poorly defined, I wrote a little blog post: https://lorentzen.ch/index.php/2023/02/11/quantiles-and-their-estimation/.

If necessary, I would accept limiting ourselves to methods that do not have a correction factor (the default is contiguous but has no correction term).

After giving it much thought, my recommendation is to start with the weighted version of the inverted CDF method only. The reason is that it unambiguously generalizes to weights, which can be integers or floats.
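
For concreteness, a 1-D sketch of what such a weighted inverted-CDF quantile could look like (names and details here are illustrative, not the implementation that eventually landed): the quantile at level alpha is the smallest sorted value whose normalized cumulative weight reaches alpha.

import numpy as np

def weighted_quantile_inverted_cdf(x, alpha, weights):
    """Smallest x_i (in sorted order) whose cumulative weight fraction reaches alpha."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    order = np.argsort(x)
    cum_w = np.cumsum(w[order])
    idx = np.searchsorted(cum_w, alpha * cum_w[-1], side="left")
    return x[order][idx]

x = np.array([1.0, 2.0, 3.0, 4.0])

# unit weights reproduce the unweighted method
weighted_quantile_inverted_cdf(x, 0.5, [1, 1, 1, 1])               # 2.0
np.quantile(x, 0.5, method="inverted_cdf")                         # 2.0

# integer weights behave like repeating values
weighted_quantile_inverted_cdf(x, 0.5, [1, 1, 3, 1])               # 3.0
np.quantile([1., 2., 3., 3., 3., 4.], 0.5, method="inverted_cdf")  # 3.0

Because the weights only enter through the normalized cumulative sum, rescaling them by any positive constant leaves the result unchanged, and integer weights reproduce replication.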

@seberg
Member
seberg commented Aug 3, 2023

JuliaStats has something here; the discussion there isn't super convincing/thorough to me. But they use our default (interpolate/R7), and I think for that the normalization should indeed be irrelevant. They do seem to ensure that fweights > 1 makes it possible for the upper index to be the same as the lower one (something the old PR also tried hard to achieve, but IIRC, in my opinion, too hard, because it first normalized them to >= 1).

@seberg
Member
seberg commented Aug 3, 2023

@robjhyndman would you happen to have any guidance with regard to weighted quantiles? It seems reasonably clear within certain limitations, i.e. we can define it for frequency weights with a check that they are >= 1 or 0, and for some methods that constraint shouldn't matter. Non-frequency weights for the default R7 method were also OK, IIRC.

But I really wonder if we can find prior art or even literature? (e.g. the JuliaStats people seemed to not find clear prior art...) Or maybe you can formulate a preference for some solution?

@lorentzenchr
Contributor

To the best of my knowledge, the distinction between kinds of weights has no impact on point estimates of the mean, only on the estimation of its variance. The same applies to point estimates of quantiles, which should not depend on the interpretation (better: assumptions) of the weights; only the variance of this estimate (which we are not talking about) would depend on it.
#8935 (comment) was my best attempt to give precise meaning to the notion of weights in the context of point estimates of quantiles, i.e. empirical quantiles.

@seberg
Member
seberg commented Aug 3, 2023

The same applies to point estimates of quantiles

I am not convinced for the non-trivial methods: the reason weights influence variance estimation is the ddof unbiasing. And as far as I understand, that is very comparable to the unbiasing for some of the methods (not the default R7, which is biased, like ddof=0, to my understanding).

@lorentzenchr
Contributor
lorentzenchr commented Aug 3, 2023

Very hard to give a short answer.

1st attempt:
(edit) The reason for different methods is not (or less so) bias or ddof, but the set-valued nature of quantiles: each method selects one possible choice out of the set of possible quantiles for a given probability level. Those choices then have different bias properties for specific probability distributions (assuming the data are sampled from them).

2nd attempt, with more theory: Mean and quantiles are both elicitable, i.e. there exists a loss (or scoring) function such that the argmin of its expectation is the mean/quantile.
The variance, on the other hand, is not elicitable alone, only if paired with the mean. That means that estimation of the variance requires the mean and that is the reason for ddof considerations.
The same story applies to the expected shortfall which needs to be paired with the corresponding quantile (or value-at-risk).
So we have the analogy:

  1. mean <-> quantile
  2. variance <-> expected shortfall

@seberg
Member
seberg commented Aug 4, 2023

@lorentzenchr my counter argument is also short: given x = np.arange(10).astype(np.float64)**2 and x2 = concatenate((x, x)):

  • mean(x) and mean(x2) is the same, as you say.
  • np.quantile(x, 0.75, method="median_unbiased") and np.quantile(x2, 0.75, method="median_unbiased") are not :). (Or use 0 or 1 as the probability level.)

Your argument is correct for the median but not for general quantiles (at least I would expect). I can also translate this into a proof that the quantile is directly related to the standard deviation, because we can directly estimate a standard deviation from it:

import numpy as np

# Estimate sigma of a normal distribution via quantiles:
two_sigma = .954499736
outside_2sigma = (1 - two_sigma) / 2
# We expect the q_levels to be separated by 4 sigma:
q_levels = [outside_2sigma, 1-outside_2sigma]

# Let's do 1000 experiments observing each 50 samples:
obs = np.random.normal(size=(1000, 50))

q_unbiased = np.quantile(obs, q_levels, method="normal_unbiased", axis=-1)
sigma_estimates = (q_unbiased[1] - q_unbiased[0]) / 4

q_biased = np.quantile(obs, q_levels, axis=-1)
sigma_estimates_biased = (q_biased[1] - q_biased[0]) / 4

print("mean unbiased estimate from quantiles:", sigma_estimates.mean())
print("mean direct unbiased estimate:", obs.std(axis=-1, ddof=1).mean())
print()
print("mean biased estimate from quantiles:", sigma_estimates_biased.mean())
print("mean direct biased estimate:", obs.std(axis=-1, ddof=0).mean())

The unbiased estimation is obviously better at the job, but you can see the similarity I think.
EDIT: I am not sure something like the above (with a bit more thought about which level to use best) is weird if I expect a normal distribution but with a few bad outliers?

@seberg
Member
seberg commented Aug 4, 2023

I don't think this is very hard to deal with as such. We probably just have to:

  • Decide that there is nothing weird about aweights<1, every data point is still a valid data point (maybe even those with weight=0).
  • Use the same scaling for the number of observations as std uses for aweights.

I just would prefer someone with a stronger statistics background to bless such a choice. We have been insisting on fweights being the only right way in one of the PRs and then even getting that wrong with respect to unbiasing...

@lorentzenchr
Contributor

1

Could we have a call?

2

The median unbiased estimator is maybe not the best one to generalize with weights.
My proposition was always to start with the inverted cdf version.

3

Your argument is correct for the median but not for general quantiles

I have to, more or less, strongly disagree:

  • The example with x and x2 only shows that the method "median_unbiased" gives results that lie outside of the set of quantiles from the empirical CDF, which is the single-element set {49} for both x and x2. Proof: 49 is the only number that minimizes the pinball loss.
    import numpy as np
    from sklearn.metrics import mean_pinball_loss
    
    x = np.arange(10).astype(np.float64)**2
    x2 = np.concatenate((x, x))
    
    np.quantile(x, 0.75, method="median_unbiased")   # 50.25
    np.quantile(x2, 0.75, method="median_unbiased")  # 49.0
    
    # 49 is the unique 75% quantile of x
    mean_pinball_loss(x, np.full_like(x, 49), alpha=0.75)  # 9.825
    
    # A bit below
    mean_pinball_loss(x, np.full_like(x, 49 - 1e-2), alpha=0.75)  # 9.8255 is larger
    
    # A bit above
    mean_pinball_loss(x, np.full_like(x, 49 + 1e-2), alpha=0.75)  # 9.8255 is larger
    
    # Now the 75% quantile of x2
    mean_pinball_loss(x2, np.full_like(x2, 49), alpha=0.75)  # 9.825
    
    # A bit below
    mean_pinball_loss(x2, np.full_like(x2, 49 - 1e-2), alpha=0.75)  # 9.8255 is larger
    
    # A bit above
    mean_pinball_loss(x2, np.full_like(x2, 49 + 1e-2), alpha=0.75)  # 9.8255 is larger
    Note that the pinball loss is convex.
    Also note that (forms of) unbiasedness of estimators often comes at a cost. In our case the "median unbiasedness" has a cost.

I can also also translate this into a proof that quantile is directly related to the standard deviation because we can directly estimate a standard deviation from it

The line two_sigma = .954499736 is the suspicious line.
Without knowing the distribution in advance, you would not be able to relate variance with a quantile.
In other words, it would only be fair to also estimate the number two_sigma. This corresponds to first estimating the mean as a necessary ingredient for estimating the variance.

4

Please let us drop the distinction between frequency, analytical and other weights for the estimation of quantiles (or do you have a literature reference for it? I'd be interested). We don't do that for the mean either.

@seberg
Member
seberg commented Aug 4, 2023
  1. Yes, if you like. I think my email is public, or you can join the numpy slack or so. Or you could plan to come to the NumPy community/triage meeting (I am almost always there, although maybe spotty in the next weeks).
  2. Yes, but if we have a plan to continue (and that will happen) and then find out we need to distinguish aweights and fweights, I am not happy if we just dismiss it now. I may be OK if we explicitly accept that. But to me it still looks very likely that we need it, so I am not quite convinced we can ignore the problem.
  3. I don't follow:
    • gives results that lie outside of the set of the quantiles from the empirical CDF

      Doesn't that just mean that this is unproblematic for method="inverted_cdf"? I don't disagree with that.

    • The unbiased versions are estimating population quantiles from samples. Making assumptions about the population to improve population estimates derived from finite sampling seems perfectly normal to me. And it works: method="normal_unbiased" is an unbiased population estimate when the input is sampled from a normal distribution.

  4. JuliaStats does make a distinction. I don't think they need it for normalization, but maybe they make the distinction only because (I have to look closer, this may be wrong!):
    • Frequency weights with weight > 1 means that each data point has a finite support on the q/probability level.
    • While for non-frequency weights they modify the interpolation point between samples but the support stays the same. (the quantile value isn't constant for a while)

I don't disagree with your PR; beyond having to check how we handle weights=0, you don't have either of the two problems.

@lorentzenchr
Contributor

So we had a short chat:

  1. For the inverted cdf method, there is no need for a distinction of weights.
  2. For the default "linear" method, we need to investigate.
  3. Even if we introduce fweight anticipating that we need it for other methods to support weights, we would still only require weight >= 0.
  4. We could consider requiring weight > 0 so that we do not need to deal with this case.

Remark: Sadly, there is hardly any literature pointing at definitions or solutions.

@seberg
Member
seberg commented Aug 4, 2023

I wanted to understand the aweights better, and I think I was wrong: aweights don't change the normalization factor at all, they are normalized to give the same N. fweights do change it, though, since the effective N is increased with respect to ddof.
Honestly, we have all of this only in np.cov; I wonder if there is even much point in having the concept at all (which doesn't mean we don't have to make a clear choice in this context and at least document it).
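
A quick check of that np.cov behavior (fweights and aweights are the actual np.cov parameters; the numbers are just an example):

import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 4.0]])

# integer fweights behave like repeating observation vectors:
np.allclose(np.cov(x, fweights=[1, 2, 1]),
            np.cov(np.array([[1.0, 2.0, 2.0, 3.0],
                             [2.0, 1.0, 1.0, 4.0]])))   # True

# rescaling aweights by a positive constant does not change the result:
np.allclose(np.cov(x, aweights=[0.2, 0.5, 0.3]),
            np.cov(x, aweights=[2.0, 5.0, 3.0]))        # True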

@lorentzenchr
Contributor

Investigate Method = Linear

Case 1: x=[1, 2, 3]

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1., 2, 3])
q = np.linspace(0, 1, 300)
plt.plot(q, np.quantile(x, q), label="linear")
plt.scatter([0, 1/2, 1], x, color="C0")
plt.plot(q, np.quantile(x, q, method="inverted_cdf"), label="inverted_cdf")
plt.scatter([1/3, 2/3], np.quantile(x, [1/3, 2/3], method="inverted_cdf"), color="C1")
plt.scatter(np.zeros_like(x), x, color="k")
plt.legend()

[plot omitted; x-axis = probability level, y-axis = quantile, black dots = data points]

Case 2: x=[1, 2, 2, 3] same as x=[1, 2, 3] with weights=[1, 2, 1]
x = np.array([1., 2, 2, 3])
q = np.linspace(0, 1, 300)
plt.plot(q, np.quantile(x, q), label="linear")
plt.scatter([0, 1/3, 2/3, 1], x, color="C0")
plt.text(1/3, 1.5, "       1/3      ",
    ha="left", va="center", size=16,
    bbox=dict(boxstyle="rarrow, pad=0.", fc="white", ec="C0", lw=2))
plt.plot(q, np.quantile(x, q, method="inverted_cdf"), label="inverted_cdf")
plt.scatter([1/4, 1/2, 3/4], np.quantile(x, [1/4, 1/2, 3/4], method="inverted_cdf"), color="C1")
plt.scatter([0, 0, 0.05, 0], x, color="k")
plt.legend()

[plot omitted]

Case 3: x=[1, 2, 2, 2, 3] same as x=[1, 2, 3] with weights=[1, 3, 1]
x = np.array([1., 2, 2, 2, 3])
q = np.linspace(0, 1, 300)
plt.plot(q, np.quantile(x, q), label="linear")
plt.scatter([0, 1/4, 1/2, 3/4, 1], x, color="C0")
plt.text(1/4, 1.5, "             2/4              ",
    ha="left", va="center", rotation=0, size=15,
        bbox=dict(boxstyle="rarrow, pad=0.", fc="white", ec="C0", lw=2))
plt.plot(q, np.quantile(x, q, method="inverted_cdf"), label="inverted_cdf")
plt.scatter([1/5, 2/5, 3/5, 4/5], np.quantile(x, [1/5, 2/5, 3/5, 4/5], method="inverted_cdf"), color="C1")
plt.scatter([0, 0, 0.05, 0.1, 0], x, color="k")
plt.legend()

[plot omitted]

Case 4: x=[1, 1, 2, 2, 3, 3] same as x=[1, 2, 3] with weights=[2, 2, 2]

x = np.array([1., 1., 2, 2, 3, 3])
q = np.linspace(0, 1, 300)
plt.plot(q, np.quantile(x, q), label="linear")
plt.scatter([0, 1/5, 2/5, 3/5, 4/5, 1], x, color="C0")
plt.plot(q, np.quantile(x, q, method="inverted_cdf"), label="inverted_cdf")
plt.scatter([0, 0.05, 0, 0.05, 0, 0.05], x, color="k")
plt.legend()

[plot omitted]

Summary

For method linear, frequency weights have a more natural interpretation.
If we normalized the weights to sum to 1, we could not reproduce Case 4 with all weights = 2, because we would end up with Case 1 (x=[1, 2, 3]).

Personal summary: I would still allow all values of weights > 0 (perfectly fine for inverted cdf) and, in case someone implements weights for method linear, say that for the linear method the scale of the weights actually matters, such that Case 1 and Case 4 are not the same.

@seberg
Member
seberg commented Aug 5, 2023

I started with pretty much agreeing, but now I need to pivot to an example first.

One potential problem for me, I think, was how I interpret(ed) the weights. I tried to interpret them as measurement uncertainty, and my intuition and thinking process failed me badly on it. This is maybe no surprise: quantiles being robust, using measurement uncertainty as weights likely doesn't make much sense.

Alternatively, I can interpret them as wanting to correct for sampling bias. Example: I sample 10 people from a village and 10 from a city, but I know the total population has 2/3 living in the city and I want to correct for that.

First, I thought: aha, this works out with what you said. Then I thought: wait, how would I actually want to choose my weights in the above example (ignoring problems with N for unbiasing)? I am not certain I wouldn't intuitively lean towards ensuring 0 < weights <= 1 (which actually would make this what I thought of as aweights).

@lorentzenchr
Contributor
lorentzenchr commented Aug 6, 2023

Measurement uncertainty can be seen as variance and, from an estimation theorist's point of view, weights inversely proportional to the (true) variance are optimal. (But this argument might only apply to estimation of the mean.)

I would divide the cases as follows:

  1. All methods that can be defined in terms of the empirical CDF. The interpretation of weights is then the same as for ECDFs (multiple possibilities). So we don't reinvent the wheel.
  2. Methods that don't derive from the ECDF, e.g. all (or most?) methods relying on plotting points. Here, a weight of 1 has a special meaning and more constraints on weights might apply.

I personally won't implement weighted versions for 2).

For the sample bias correction example and a quantile that relies on the ECDF, you can apply any weights as long as their ratio is 1/2; the scale does not matter.

@seberg
Member
seberg commented Aug 7, 2023

I can be convinced of a minimal thing, but what is that? Call it fweights and do whatever @lorentzenchr is confident about? That seems safe to me (as said above, that allows 0 < fweights < 1 because at least for some methods there is nothing weird). I suppose fweights can generalize to everything, but for those with plotting points we are not quite sure yet about 0 < fweights < 1 values?

Also pinging @anntzer, who was involved in another issue discussion. What would really help move things along here is literature. Asking for things to be included in NumPy without clear literature is a very big ask from the start.

The other thing that would help me is if we could give examples of what the weights mean. I can get on board better with fweights (yes, we only use this name once in a niche function, so it isn't great), mainly because the above example of trying to adjust for sampling bias of sub-populations seems fairly "obvious" to me, I am not convinced that fweights is quite the right thing for that use case, and I would like users to at least have a chance to notice that.

I admit again though: I think for "inverted CDF" the type of weights doesn't matter, it is safe to implement. But how many users actually want to use "inverted CDF"?

EDIT: Unless we explicitly add weights now and limit it a lot, and accept that in the future you might need to add aweights or fweights additionally and most methods would need to know (possibly giving us three types of weights one of which only works for methods where it doesn't matter; although we could deprecate it again, but...).

@lorentzenchr
Contributor

My understanding:

  1. For ECDF-based methods, we only ever need weights and let the user do the interpretation. Constraint: 0 <= weights.
  2. Plotting-position-based methods: only need weights, which is always interpreted as frequency weights. Constraint: 1 <= weights. (Or just don't implement those at all.)

To give one use case: gradient boosted trees for estimating quantiles, see "Greedy function approximation: A gradient boosting machine", Eq. (14). The weights are given by (the absolute value of) the base learner, which can be any non-negative value.

@seberg
Member
seberg commented Aug 7, 2023

Just noticed that Wikipedia has "method extends the above approach in a natural way" [citation needed]... which defines it for the plotting-point methods in the way I tried to think along (I think I got it slightly different/wrong, but I suspected as much when starting the implementation).

I can probably say the same thing in many ways. For example, I suspect that for your definition we can say that duplicating/repeating the same data is an invariant for the result, so that all interpretations of weights are equivalent.

The question is still: what is the proposed API? Use weights but accept that this might be a bad name for some methods? (Another path could be to have weights now and add fweights later if we generalize weights to mean non-frequency ones.)

Eq (14). The weights are given by (the absolute value of) the base learner which can be any non-negative value.

Hmmm, hard to put into perspective how to interpret the weights and what quantile operation I would use if this were unweighted.
Since this is used to update an iterative algorithm (although I am not sure it is minimized itself), I am tempted to think that you would want to use a contiguous/plotting-point definition for this to converge well?

@lorentzenchr
Contributor

Concerning API: As long as there is no behavioral difference, i.e. as long as no method can have both fweights and aweights with different computational logic, a single argument weights is enough, isn't it?

@glemaitre and @lucyleeow you did quite some work on weighted quantiles in scikit-learn/scikit-learn#17768. Do you have further insights into this topic?
Maybe also @rkern @harrelfe @mayer79?

About literature: It is really hard to find anything. This blog post also surveyed the literature and software packages. Because of the lack of anything, the same author wrote this paper (but note: it is not peer reviewed).

@seberg
Member
seberg commented Aug 8, 2023

🎉 Yet another way to define the quantile, where you use more than two samples in a final interpolation. (Squinting at it, my gut thinks this might be a definition useful for measurement-error "weights", probably with a good amount of iid assumption. But I probably shouldn't trust my gut :))

@seberg
Member
seberg commented Aug 8, 2023

a single argument weights is enough, isn't it?

I can see that working with a single weights argument if it is clear enough what is useful for each method (I am not certain this is the case, but I am happy to defer to you, and it isn't terrible if we get it wrong).
Or alternatively, we agree on one "default" kind of weights and just stay compatible with it for now. (My instincts would say that isn't frequency weights, but that may be my background.) One point is that I think np.cov is so niche that it seems fine to ignore it as prior art. If anything, we should look at downstream definitions/uses of weights. If scipy/statsmodels/sklearn use weights almost exclusively for one style of weights, then we can really just agree on that here and make sure we don't add anything that isn't compatible with it.

@lorentzenchr
Contributor

As references for different weights (with mean/average in mind), which are a good read:

@lorentzenchr
Contributor

@seberg Shall we call this issue closed by #24254? If someone needs another method to support weights, she can open a new issue (and explain how the method works).

seberg closed this as completed Mar 14, 2024