8000 [WIP] BUG make _weighted_percentile(data, ones, 50) consistent with numpy.median(data) by glemaitre · Pull Request #17377 · scikit-learn/scikit-learn · GitHub

Closed
wants to merge 52 commits

Conversation

@glemaitre (Member) commented May 28, 2020

closes #17370
closes #6189

Add a new parameter to select the interpolation strategy. With the default, the median computed by NumPy and by our implementation should be the same.
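As a rough illustration of the idea (a hypothetical sketch, not the PR's actual implementation; `weighted_percentile_nearest` is an invented name), a weighted percentile with a "nearest"-style strategy picks the first sorted sample whose cumulative weight reaches the requested fraction, and with unit weights on odd-length data it coincides with `numpy.median`:

```python
import numpy as np

def weighted_percentile_nearest(data, weights, percentile=50):
    """Hypothetical sketch: first sorted value whose cumulative weight
    reaches `percentile` percent of the total weight."""
    order = np.argsort(data)
    sorted_data = np.asarray(data, dtype=float)[order]
    cum_weights = np.cumsum(np.asarray(weights, dtype=float)[order])
    threshold = percentile / 100.0 * cum_weights[-1]
    idx = np.searchsorted(cum_weights, threshold)
    return sorted_data[idx]

data = [3.0, 1.0, 5.0, 2.0, 4.0]  # odd length: no interpolation ambiguity
ones = np.ones(len(data))
print(weighted_percentile_nearest(data, ones))  # 3.0
print(np.median(data))                          # 3.0
```

On even-length data the two can disagree, which is exactly the inconsistency this PR addresses.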

@adrinjalali (Member)

I was wondering about this when I was working on sample weights on HGBT. So I'm all in :)

If we have one cython version, then we could use it in other places such as HGBT as well.

@glemaitre (Member Author)

> I was wondering about this when I was working on sample weights on HGBT. So I'm all in :)

So I think that I am converging here (the tests comparing against NumPy pass). I'm not sure we actually need "lower" and "higher": basically, "nearest" should be fast, and "linear" would match NumPy's default and allow interpolation.

> If we have one cython version, then we could use it in other places such as HGBT as well.

I rely heavily on NumPy there. If we are fine with using "nearest" internally in the HGBT, I am not sure cythonizing would improve things much.
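For context on the two strategies discussed above, here is a small demo (using `method=`, the spelling newer NumPy releases use in place of the older `interpolation=` keyword): "linear" is NumPy's default and can return a value between two samples, while "nearest" always returns an actual sample:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 10.0])  # even length: the strategies differ

# "linear" (the NumPy default) interpolates between the two middle samples.
print(np.percentile(data, 50))  # 2.5, same as np.median(data)

# "nearest" snaps to an existing sample, so no new value is fabricated.
nearest = np.percentile(data, 50, method="nearest")
print(nearest in data)  # True
```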


    if last_y_pred is not None:
-       assert_array_almost_equal(last_y_pred, y_pred)
+       assert_allclose(last_y_pred, y_pred)
@glemaitre (Member Author)

@lucyleeow I added your test here. Using the "nearest" strategy everywhere seems to be the winner, as it keeps the sample_weight semantics right.

@glemaitre (Member Author)

I merged a couple of things in this PR to be able to check the integration in the gradient boosting as well.

If the tests pass, I would propose to cut this PR into 3 parts:

  • implementation of the _weighted_percentile with its own test;
  • changing boston with make_regression in the gradient boosting;
  • change the call from np.median to np.percentile(..., interpolation="nearest") to ensure that we have the same behaviour with sample_weight=None and sample_weight=np.ones(...).
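The consistency goal of the third item above can be sketched as follows (a hedged example; `method=` is the newer spelling of the `interpolation=` keyword mentioned in the list): on odd-length data, the "nearest" percentile is an actual sample and agrees exactly with np.median, which is the behaviour to preserve when all weights are one:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=101)  # odd length: all strategies agree on the median

# The "nearest" median is an actual sample from the data ...
med_nearest = np.percentile(data, 50, method="nearest")

# ... and on odd-length data it matches np.median exactly, the consistency
# this PR wants between sample_weight=None and sample_weight=np.ones(...).
assert med_nearest == np.median(data)
print(med_nearest)
```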

@jnothman (Member)

> If the tests pass, I would propose to cut this PR into 3 parts:

So this is no longer for review?

@glemaitre (Member Author)

No, I think that I need to cut it into smaller pieces because the diff is too large. I don't want to make the reviewing process even harder. I will take care of it this week.

@glemaitre glemaitre changed the title BUG make _weighted_percentile behave as NumPy [WIP] BUG make _weighted_percentile behave as NumPy Jun 29, 2020
Base automatically changed from master to main January 22, 2021 10:52
@ogrisel ogrisel changed the title [WIP] BUG make _weighted_percentile behave as NumPy [WIP] BUG make _weighted_percentile(data, ones, 50) consistent with numpy.median(data) Jan 16, 2024
@lorentzenchr (Member)

See #17370 (comment)


Successfully merging this pull request may close these issues.

  • _weighted_percentile does not lead to the same result than np.median
  • Make _weighted_percentile more robust
6 participants