_weighted_percentile does not lead to the same result as np.median #17370

Open · glemaitre opened this issue May 28, 2020 · 17 comments

@glemaitre (Member) commented May 28, 2020

While reviewing a test in #16937, it appeared that our implementation of _weighted_percentile with unit sample_weight leads to a different result than np.median, which is a bit problematic for consistency.

In gradient boosting, it breaks the loss equivalence because the initial predictions are different. We could bypass this issue by always computing the median using _weighted_percentile there.

import pytest
import numpy as np
from sklearn.utils.stats import _weighted_percentile

rng = np.random.RandomState(42)
X = rng.randn(10)
X.sort()
sample_weight = np.ones(X.shape)

median_numpy = np.median(X)
median_numpy_percentile = np.percentile(X, 50)
median_sklearn = _weighted_percentile(X, sample_weight, percentile=50.0)

# n is even, so np.median averages the two middle values ...
assert median_numpy == pytest.approx(np.mean(X[[4, 5]]))
# ... while _weighted_percentile returns the lower one.
assert median_sklearn == pytest.approx(X[4])
assert median_numpy == median_numpy_percentile
# This last assertion therefore fails:
assert median_sklearn == pytest.approx(median_numpy)
@glemaitre (Member, Author)

ping @lucyleeow Since you already looked at the code, you might have some intuition about why this is the case. We should actually have the above test as a regression test for our _weighted_percentile.

@glemaitre (Member, Author) commented May 28, 2020

Is numpy interpolating while we are just doing an np.searchsorted?

@glemaitre (Member, Author)

Yes, numpy takes the mean of x[i] and x[i + 1] while we take x[i].

@glemaitre (Member, Author)

np.percentile is consistent with np.median

@glemaitre (Member, Author)

pinging @NicolasHug @adrinjalali @ogrisel. Making things consistent will impact HistGradientBoosting* with the LAD loss.

It would be interesting to make the change and see which tests break.

@lucyleeow (Member) commented May 28, 2020

For np.median, if n is odd, the median is the middle value (after sorting), and if n is even it is the average of the 2 middle values.

We take the 'lower' percentile, as we use the default parameter np.searchsorted(side='left'). Thus when n is even, _weighted_percentile is not the same as np.median for unit weights. It is the same when n is odd.

We could amend _weighted_percentile to use the interpolated median, described well here: #6217 (comment) (I have an implementation of this already, as I didn't realise _weighted_percentile existed). With that implementation, _weighted_percentile always gives the same result as np.median with unit weights.

In the end, I think the difference between the 'lower' weighted percentile and the interpolated weighted percentile would generally be small for large n. Though, I guess the gap between the '2 middle values' would also affect how different they are.
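
For illustration, a minimal sketch of the 'lower' behaviour described above, using searchsorted on the cumulative weights (this mirrors the description, not scikit-learn's actual code):

import numpy as np

x = np.arange(10, dtype=float)    # already sorted, n even
w = np.ones_like(x)               # unit weights
cum_w = np.cumsum(w)
threshold = 50 / 100 * cum_w[-1]  # weight mass at the 50th percentile
idx = np.searchsorted(cum_w, threshold, side="left")
print(x[idx])        # 4.0 -> the 'lower' median
print(np.median(x))  # 4.5 -> average of the two middle values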

@lucyleeow (Member) commented May 28, 2020

For np.percentile, you can select how it is calculated with a parameter:

interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}

The default is 'linear', which gives the same result as np.median.
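
For example (recent numpy renamed the parameter from interpolation to method, which is what is assumed below):

import numpy as np

x = np.arange(10, dtype=float)  # n even
for method in ("linear", "lower", "higher", "midpoint", "nearest"):
    print(method, np.percentile(x, 50, method=method))
# 'linear' (the default) gives 4.5, matching np.median;
# 'lower' gives 4.0, matching _weighted_percentile with unit weights.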

@glemaitre (Member, Author)

I think that our _weighted_percentile should offer the same default, and maybe an interpolation parameter if required.

@lucyleeow (Member)

'linear' = linear interpolation between the two nearest data points.

@lorentzenchr (Member)

Our _weighted_percentile implements the method inverted_cdf. It is important to note that quantiles are interval-valued quantities: within that interval, one is free to choose any value. On top of that, different sample estimators exist. So we are well within statistical uncertainty.
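
For instance, with an even number of points, every value between the two middle values minimizes the sum of absolute deviations, so each one is an equally valid median:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
# Every candidate in [2, 3] yields the same sum of absolute deviations.
for c in np.linspace(2.0, 3.0, 5):
    print(c, np.abs(x - c).sum())  # always 4.0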

np.median says

Given a vector V of length N, the median of V is the middle value of a sorted copy of V, V_sorted, i.e., V_sorted[(N-1)/2], when N is odd, and the average of the two middle values of V_sorted when N is even.

I do not know how to generalize this to account for sample weights. Therefore closing.

@ogrisel (Member) commented Oct 15, 2024

Coming back to this, wouldn't it be possible to symmetrize the percentile computation?

>>> import numpy as np
>>> x = np.linspace(0, 1, 4)
>>> w = np.asarray([2, 1, 1, 2])
>>> np.percentile(x, 50, weights=w, method="inverted_cdf")
array(0.33333333)
>>> -np.percentile(-x, 50, weights=w, method="inverted_cdf")
np.float64(0.6666666666666666)
>>> (
...      np.percentile(x, 50, weights=w, method="inverted_cdf")
...      - np.percentile(-x, 100 - 50, weights=w, method="inverted_cdf")
... ) / 2
np.float64(0.5)

Or when there is an odd number of data points but an even sum of integer weights:

>>> import numpy as np
>>> x = np.asarray([0, 0.5, 1])
>>> w = np.asarray([2, 1, 1])
>>> np.percentile(x, 50, weights=w, method="inverted_cdf")
array(0.)
>>> (
...     np.percentile(x, 50, weights=w, method="inverted_cdf")
...     - np.percentile(-x, 100 - 50, weights=w, method="inverted_cdf")
... ) / 2
np.float64(0.25)

>>> x_repeated = np.asarray([0, 0, 0.5, 1])
>>> np.median(x_repeated)
np.float64(0.25)

If that symmetrization happened internally, one would probably not need to sort the data twice.

EDIT: I reported the problem with numpy 2's percentile with method="inverted_cdf", which is similar to our _weighted_percentile. The symmetrization strategy could be contributed both to numpy and to scikit-learn (for backward compatibility with older numpy versions and for array API support).
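
A hypothetical helper wrapping the idea above (the name symmetrized_inverted_cdf is for illustration only; numpy >= 2.0 is assumed for the weights argument):

import numpy as np

def symmetrized_inverted_cdf(x, q, weights):
    # Average the 'lower' inverted-CDF percentile with the 'upper' one
    # obtained by negating both the data and the percentile.
    lower = np.percentile(x, q, weights=weights, method="inverted_cdf")
    upper = -np.percentile(-x, 100 - q, weights=weights, method="inverted_cdf")
    return (lower + upper) / 2

x = np.linspace(0, 1, 4)
w = np.asarray([2, 1, 1, 2])
print(symmetrized_inverted_cdf(x, 50, w))  # 0.5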

ogrisel reopened this Oct 15, 2024
@ogrisel (Member) commented Oct 15, 2024

Reopening to get feedback about what people (e.g. @lorentzenchr, @snath-xoc, @antoinebaker, @glemaitre, @lucyleeow) think about the validity of the above suggestion.

I still think that the lack of symmetry of the weighted percentile can cause an (arguably small but) very surprising bias in many applications of scikit-learn and statistical analysis in general. So if there exists a cheap and easy way to remove this surprising behavior by default, it could spare us some future difficult-to-investigate reports from understandably surprised users.

This might also help us enforce the usual weighting/repetition semantics of sample_weight in scikit-learn estimators faster and more consistently in case an estimator relies on different percentile/quantile implementations for the weighted and unweighted cases.

EDIT: meanwhile, we could update check_sample_weight_equivalence to ensure that sample_weight.sum() is always odd, e.g. by adding +1 to the last weight if needed (whatever the value of the random seed), to sweep this problem under the rug in our common tests.
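
A minimal sketch of that adjustment, assuming integer-valued weights:

import numpy as np

rng = np.random.RandomState(0)
sample_weight = rng.randint(0, 5, size=10)
# Force an odd weight sum so the two middle values always coincide.
if sample_weight.sum() % 2 == 0:
    sample_weight[-1] += 1
assert sample_weight.sum() % 2 == 1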

@antoinebaker (Contributor)

I'm not sure what we should recover for a generic percentile q, but the symmetrization trick does not recover np.percentile(x_repeated, q, method="linear").

import numpy as np

x = np.linspace(0, 1, 4)
w = np.asarray([2, 1, 1, 1])
x_repeated = x.repeat(repeats=w)
q = 60
p_lin = np.percentile(x_repeated, q, method="linear")
p_inf = np.percentile(x, q, weights=w, method="inverted_cdf")
p_sup = -np.percentile(-x, 100 - q, weights=w, method="inverted_cdf")
p_mid = (p_inf + p_sup) / 2
# p_lin=0.467  p_inf=0.333  p_sup=0.667  p_mid=0.500

If I understood correctly, np.median(x_repeated) is the same as np.percentile(x_repeated, 50, method="linear"), so perhaps we would like the same for a generic q? Or is the issue only for the median (where some code uses np.median and other code uses _weighted_percentile)?

@ogrisel (Member) commented Oct 15, 2024

Or is the issue only for the median (where some code uses np.median and other code uses _weighted_percentile)?

The original issue was about the surprising discrepancy between np.median and _weighted_percentile with unit weights and an even number of data points.

But one could also argue that the lack of symmetry in the result of _weighted_percentile with an even-valued sum of integer weights is surprising in general and could be a source of breakage for our common tests.

I might have conflated the two cases, but apparently the situation is more complex.

Are there cases where my symmetrized_inverted_cdf proposal (with q=50 and unit weights) would not match np.median?

I think it's ok for symmetrized_inverted_cdf to not match linear for q != 50, as long as it always outputs a valid percentile value.
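
One way to probe that question empirically, as a quick brute-force sketch (numpy >= 2.0 is assumed for the weights argument):

import numpy as np

rng = np.random.RandomState(0)
for n in range(1, 9):
    for _ in range(100):
        x = rng.randn(n)
        w = np.ones(n)
        lower = np.percentile(x, 50, weights=w, method="inverted_cdf")
        upper = -np.percentile(-x, 50, weights=w, method="inverted_cdf")
        # The symmetrized estimate matches np.median for unit weights.
        assert np.isclose((lower + upper) / 2, np.median(x))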

@lorentzenchr (Member)

Preface: I still consider this a non-issue because it's mainly about cosmetics. The impact of different quantile methods on real use cases is well within the (usual) statistical uncertainty.

Main Part: The symmetrised version that @ogrisel is proposing is, if I'm not mistaken, the same as method="averaged_inverted_cdf". This is the only other method (besides "inverted_cdf") for which I would consider implementing a weighted version in numpy.
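
That equivalence is easy to spot-check on an unweighted example (as of numpy 2.x, only method="inverted_cdf" accepts weights, so unit weights are implicit here):

import numpy as np

x = np.arange(10, dtype=float)
for q in (25, 50, 75):
    avg = np.percentile(x, q, method="averaged_inverted_cdf")
    sym = (np.percentile(x, q, method="inverted_cdf")
           - np.percentile(-x, 100 - q, method="inverted_cdf")) / 2
    print(q, avg, sym)  # avg and sym coincide for each q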

Epilogue: Do we really want to discuss quantile methods again? See Hyndman & Fan (1996) and "Sample quantiles 20 years later" (Hyndman, 2016).

@ogrisel (Member) commented Oct 15, 2024

Thanks. I think we should contribute it to numpy, backport it to scikit-learn (and add array API support), and use it as the default to lower potential astonishment.

@StefanieSenger (Contributor)

So it doesn't get lost: test_weighted_median_equal_weights is also affected by this.
