_weighted_percentile does not lead to the same result as np.median #17370
ping @lucyleeow Since you already looked at the code, you might have some intuition why this is the case. We should actually have the above test as a regression test for our `_weighted_percentile`.
Does `np.median` take the average of the two middle values?

Yes, numpy takes the average of the two middle values.
pinging @NicolasHug @adrinjalali @ogrisel. Making things consistent will have an impact on existing behaviour. It would be interesting to make the change and see which tests break.
For `_weighted_percentile`, we take the 'lower' percentile, which is the default behaviour. We could amend it to interpolate between the two candidate values instead. In the end, I think the difference between the 'lower' weighted percentile and an interpolated weighted percentile would generally be small for large sample sizes. Though, I guess the gap between the two middle values would also affect how different they are.
For `np.percentile`, the default interpolation is 'linear', so for the 50th percentile it gives the same result as `np.median`.
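For concreteness, a minimal pure-numpy illustration of the behaviours described above (the `method` keyword name assumes numpy >= 1.22; older releases call it `interpolation`):

```python
import numpy as np

x = np.asarray([0.0, 1.0, 2.0, 3.0])

# np.median averages the two middle values for an even number of samples.
print(np.median(x))                           # 1.5

# The default 'linear' method of np.percentile agrees with np.median ...
print(np.percentile(x, 50, method="linear"))  # 1.5

# ... whereas taking the 'lower' of the two candidates gives a different value.
print(np.percentile(x, 50, method="lower"))   # 1.0
```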
I think that our `_weighted_percentile` should behave like `np.median` in this case.
Our `_weighted_percentile` takes the 'lower' value, while the documentation of `np.median` says that, for an even number of elements, the median is the average of the two middle values of a sorted copy of the data.

I do not know how to generalize this to account for sample weights. Therefore closing.
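For integer weights, one natural reference point (used implicitly in the comments below) is the unweighted median of the data repeated according to the weights. A minimal sketch:

```python
import numpy as np

x = np.asarray([0.0, 0.5, 1.0])
w = np.asarray([2, 1, 1])

# Repeat each sample according to its integer weight and take the plain median.
x_repeated = np.repeat(x, w)  # [0.  0.  0.5 1. ]
print(np.median(x_repeated))  # 0.25 (average of the two middle values 0 and 0.5)
```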
Coming back to this, wouldn't it be possible to symmetrize the percentile computation?

```python
>>> import numpy as np
>>> x = np.linspace(0, 1, 4)
>>> w = np.asarray([2, 1, 1, 2])
>>> np.percentile(x, 50, weights=w, method="inverted_cdf")
array(0.33333333)
>>> -np.percentile(-x, 50, weights=w, method="inverted_cdf")
np.float64(0.6666666666666666)
>>> (
...     np.percentile(x, 50, weights=w, method="inverted_cdf")
...     - np.percentile(-x, 100 - 50, weights=w, method="inverted_cdf")
... ) / 2
np.float64(0.5)
```

Or when there is an odd number of data points but an even sum of integer weights:

```python
>>> import numpy as np
>>> x = np.asarray([0, 0.5, 1])
>>> w = np.asarray([2, 1, 1])
>>> np.percentile(x, 50, weights=w, method="inverted_cdf")
array(0.)
>>> (
...     np.percentile(x, 50, weights=w, method="inverted_cdf")
...     - np.percentile(-x, 100 - 50, weights=w, method="inverted_cdf")
... ) / 2
np.float64(0.25)
>>> x_repeated = np.asarray([0, 0, 0.5, 1])
>>> np.median(x_repeated)
np.float64(0.25)
```

If that symmetrization happened internally, one would probably not need to sort the data twice.

EDIT: I reported the problem with numpy 2's `inverted_cdf` weighted percentile upstream.
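A sketch of what such an internally symmetrized computation could look like; the helper name `symmetrized_weighted_percentile` is hypothetical and not part of numpy or scikit-learn, and the snippet assumes numpy >= 2.0, where `np.percentile` accepts `weights` (only with `method="inverted_cdf"`):

```python
import numpy as np


def symmetrized_weighted_percentile(x, q, weights):
    """Hypothetical helper (not part of numpy or scikit-learn).

    Averages the weighted inverted-CDF percentile computed from both ends of
    the sorted data.  Requires numpy >= 2.0 for the ``weights`` argument of
    ``np.percentile``.
    """
    lower = np.percentile(x, q, weights=weights, method="inverted_cdf")
    upper = -np.percentile(-x, 100 - q, weights=weights, method="inverted_cdf")
    return (lower + upper) / 2


x = np.asarray([0.0, 0.5, 1.0])
w = np.asarray([2, 1, 1])
print(symmetrized_weighted_percentile(x, 50, w))  # 0.25
print(np.median(np.repeat(x, w)))                 # 0.25, matches the repeated data
```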
Reopening to get feedback about what people (e.g. @lorentzenchr, @snath-xoc, @antoinebaker, @glemaitre, @lucyleeow) think about the validity of the above suggestion.

I still think that the lack of symmetry of the weighted percentile can cause an (arguably small but) very surprising bias in many applications of scikit-learn and statistical analysis in general. So if there exists a cheap and easy way to remove this surprising behavior by default, it could spare some future difficult-to-investigate reports by understandably surprised users. This might also help us enforce the usual weighting/repetition semantics of `sample_weight`.

EDIT: meanwhile, we could update our own `_weighted_percentile` helper accordingly.
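As a sanity check of the repetition semantics mentioned above, here is a property-style sketch (again assuming numpy >= 2.0 for weighted percentiles) comparing the symmetrized weighted median with the plain median of the repeated data for random integer weights:

```python
import numpy as np

# Property we would like for integer weights: the symmetrized weighted median
# equals the plain median of the data repeated according to the weights.
rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=7)
    w = rng.integers(1, 4, size=7)
    lower = np.percentile(x, 50, weights=w, method="inverted_cdf")
    upper = -np.percentile(-x, 50, weights=w, method="inverted_cdf")
    assert np.isclose((lower + upper) / 2, np.median(np.repeat(x, w)))
```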
I'm not sure what we should recover for a generic percentile:

```python
import numpy as np

x = np.linspace(0, 1, 4)
w = np.asarray([2, 1, 1, 1])
x_repeated = x.repeat(repeats=w)
q = 60
p_lin = np.percentile(x_repeated, q, method="linear")
p_inf = np.percentile(x, q, weights=w, method="inverted_cdf")
p_sup = -np.percentile(-x, 100 - q, weights=w, method="inverted_cdf")
p_mid = (p_inf + p_sup) / 2
# p_lin=0.467 p_inf=0.333 p_sup=0.667 p_mid=0.500
```

If I understood correctly, the symmetrized `p_mid` does not recover the linear-interpolation percentile `p_lin` computed on the repeated data for a generic `q`.
The original issue was about the surprising discrepancy between `_weighted_percentile` and `np.median`. But one could also argue that the lack of symmetry in the result of the weighted percentile itself is surprising. I might have mistakenly confused the two cases, but apparently the situation is more complex. Are there cases where my symmetrization suggestion above would be problematic? I think it's ok for a generic percentile not to match the linear interpolation on repeated data, as long as the median case is consistent.
Preface: I still consider this a non-issue because it's mainly about cosmetics. The impact of different quantile methods on real use cases is well within the (usual) statistical uncertainty.

Main part: the symmetrised version that @ogrisel is proposing is, if I'm not mistaken, the same as the `averaged_inverted_cdf` method.

Epilogue: do we really want to discuss quantile methods again? See Hyndman & Fan (1996) and "Sample quantiles 20 years later" (Hyndman, 2016).
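If that is indeed the case, it can be spot-checked on the first example above by comparing the symmetrized weighted result with the unweighted `averaged_inverted_cdf` method applied to the repeated data (numpy 2.x only accepts `weights` together with `method="inverted_cdf"`, hence the detour through repetition):

```python
import numpy as np

x = np.linspace(0, 1, 4)
w = np.asarray([2, 1, 1, 2])
x_repeated = np.repeat(x, w)

# Symmetrized weighted inverted_cdf (numpy >= 2.0 for the `weights` argument).
lower = np.percentile(x, 50, weights=w, method="inverted_cdf")
upper = -np.percentile(-x, 50, weights=w, method="inverted_cdf")
print((lower + upper) / 2)                                             # 0.5

# Hyndman & Fan type 2 on the repeated data, without weights.
print(np.percentile(x_repeated, 50, method="averaged_inverted_cdf"))   # 0.5
print(np.median(x_repeated))                                           # 0.5
```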
Thanks. I think we should contribute it to numpy and backport it to scikit-learn (and add array API support) and use it as the default to lower potential astonishment.
So it doesn't get lost:
While reviewing a test in #16937, it appears that our implementation of `_weighted_percentile` with unit `sample_weight` leads to a different result than `np.median`, which is a bit problematic for consistency.

In gradient boosting, it breaks the loss equivalence because the initial predictions are different. We could bypass this issue by always computing the median using `_weighted_percentile` there.
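A minimal way to reproduce the flavor of the discrepancy without relying on the private helper, using numpy's 'lower' method as a stand-in for the behaviour of `_weighted_percentile` with unit weights (the private function's exact signature and behaviour may differ across scikit-learn versions):

```python
import numpy as np

# With an even number of samples, np.median averages the two middle values,
# while a 'lower'-style percentile picks the smaller one.
y = np.asarray([1.0, 2.0, 3.0, 4.0])
print(np.median(y))                          # 2.5
print(np.percentile(y, 50, method="lower"))  # 2.0
```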