While reviewing a test in #16937, it appears that our implementation of `_weighted_percentile` with unit `sample_weight` leads to a different result than `np.median`, which is problematic for consistency.

In gradient boosting, it breaks the loss equivalence because the initial predictions are different. We could bypass this issue by always computing the median using `_weighted_percentile` there.
```python
import pytest
import numpy as np

from sklearn.utils.stats import _weighted_percentile

rng = np.random.RandomState(42)
X = rng.randn(10)
X.sort()
sample_weight = np.ones(X.shape)

median_numpy = np.median(X)
median_numpy_percentile = np.percentile(X, 50)
median_sklearn = _weighted_percentile(X, sample_weight, percentile=50.0)

# With an even number of samples, np.median averages the two middle values...
assert median_numpy == pytest.approx(np.mean(X[[4, 5]]))
# ...while _weighted_percentile returns the lower of the two.
assert median_sklearn == pytest.approx(X[4])
assert median_numpy == median_numpy_percentile
# This last assertion fails, showing the inconsistency.
assert median_sklearn == pytest.approx(median_numpy)
```
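For what it's worth, the discrepancy looks consistent with `_weighted_percentile` effectively using a "lower" style percentile definition (take the first value whose cumulative weight reaches the target fraction), whereas `np.median` / `np.percentile` default to linear interpolation between the two middle order statistics. A minimal sketch of that reading, using only NumPy (the `method="lower"` argument assumes NumPy >= 1.22; older versions spell it `interpolation="lower"`):

```python
import numpy as np

rng = np.random.RandomState(42)
X = np.sort(rng.randn(10))

# Default percentile: linear interpolation, averaging the two middle
# values for an even-length array.
median_averaged = np.median(X)

# "lower" method: pick the lower of the two middle order statistics,
# which matches the value _weighted_percentile returns above.
median_lower = np.percentile(X, 50, method="lower")

assert median_averaged == (X[4] + X[5]) / 2
assert median_lower == X[4]
```

If that reading is right, one fix would be to make `_weighted_percentile` interpolate between adjacent values when the target cumulative weight falls between them, so the unit-weight case reduces exactly to `np.median`.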