FIX Ignore zero sample weights in precision recall curve by albertvillanova · Pull Request #18328 · scikit-learn/scikit-learn

Merged: 25 commits, Apr 5, 2021

Conversation

@albertvillanova (Contributor) commented Sep 2, 2020:

Fixes #16065.

Supersedes and closes #16319.
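The intended behavior of this fix can be sketched with made-up toy data: after the change, a sample with weight exactly 0 contributes nothing, so the curve should match the one obtained by simply dropping that sample.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical toy data; the third sample gets zero weight.
y_true = np.array([0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.9, 0.35, 0.8, 0.65])
w = np.array([1.0, 1.0, 0.0, 1.0, 1.0])

# With this fix, zero-weight samples are ignored entirely...
p0, r0, t0 = precision_recall_curve(y_true, y_score, sample_weight=w)

# ...so the result matches computing the curve without those samples.
keep = w != 0
p1, r1, t1 = precision_recall_curve(
    y_true[keep], y_score[keep], sample_weight=w[keep]
)
assert np.allclose(p0, p1) and np.allclose(r0, r1) and np.allclose(t0, t1)
```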

albertvillanova marked this pull request as ready for review on September 3, 2020.
albertvillanova changed the title from "Ignore zero sample weights in precision recall curve" to "FIX Ignore zero sample weights in precision recall curve" on Sep 3, 2020.
@rth (Member) left a comment:

Thanks a lot @albertvillanova! A few comments below.

```python
if len(np.unique(y_true)) != 2:
    raise ValueError("Only one class present in y_true. Detection error "
                     "tradeoff curve is not defined in that case.")
```

Member:

Why this change? I would imagine it's better to check for it before computing the _binary_clf_curve?

Contributor Author:

As it was, if you passed a multiclass y_true, it would raise the "Only one class present..." ValueError.

With the reorder, a multiclass ValueError is raised by _binary_clf_curve.

Member:

This needs to be in whats_new since it's changing behavior.

@albertvillanova (Contributor Author) commented Feb 7, 2021:

@adrinjalali are you sure?

When passing a multiclass y_true, a ValueError was raised before and is still raised now; that does not change. The only change is the error message:

  • Before: "Only one class present...", which was wrong.
  • Now: "multiclass format is not supported".

Base automatically changed from master to main January 22, 2021 10:53
@lorentzenchr (Member) left a comment:

I've a few minor comments.

Comment on lines 702 to 703:

```python
sample_weight = column_or_1d(sample_weight)
_check_sample_weight(sample_weight, y_true)
```
Member:

Suggested change:

```diff
-sample_weight = column_or_1d(sample_weight)
-_check_sample_weight(sample_weight, y_true)
+sample_weight = _check_sample_weight(sample_weight, y_true)
```

Member:

_check_sample_weight internally calls check_array. This proposed change is the only one I consider important and not cosmetic.
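For context, a minimal sketch of what _check_sample_weight does. Note that it is a private helper in sklearn.utils.validation, so its exact behavior may change between versions:

```python
import numpy as np
from sklearn.utils.validation import _check_sample_weight

X = np.zeros((4, 2))  # any data with n_samples == 4

# None is expanded to unit weights of shape (n_samples,)...
w = _check_sample_weight(None, X)
assert w.shape == (4,) and np.all(w == 1.0)

# ...and an array-like is validated (length check against X)
# and converted to a float ndarray.
w = _check_sample_weight([1, 2, 3, 4], X)
assert w.shape == (4,) and w.dtype == np.float64
```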

Contributor Author:

Done. Thank you for pointing this out.

Member:

I think you can remove the line sample_weight = column_or_1d(sample_weight).

Member:

After that change, I think it'll be an LGTM from my side :smirk:

Member:

In this case, we can't call _check_sample_weight.

@lorentzenchr (Member) commented Jan 28, 2021:

@scikit-learn/core-devs Could someone help out?
According to the docstrings, _binary_clf_curve accepts only 1d array-likes of shape (n_samples,). Nevertheless, the code for _binary_clf_curve, dating back more than 7 yeary at #2251 d8f90d1, calls column_or_1d on y_true, y_score and sample_weight and therefore also accepts shape (n_samples, 1).

Shall we adopt the docstings or be stricter to the inputs and actually require shape (n_sampels,).
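The shape behavior in question can be checked directly; a quick sketch with made-up arrays (column_or_1d is importable from sklearn.utils):

```python
import numpy as np
from sklearn.utils import column_or_1d

# A column vector of shape (n_samples, 1) is ravelled to (n_samples,)...
col = np.array([[0.2], [0.8], [0.4]])
assert column_or_1d(col).shape == (3,)

# ...while a genuinely 2-d array is rejected.
try:
    column_or_1d(np.zeros((3, 2)))
except ValueError:
    pass  # expected: not a column vector, cannot be ravelled safely
```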

Member:

I don't think we should change the behavior. When there's a discrepancy between the code and the docs, we tend to fix the docs. In this case, I'd be happy keeping column_or_1d, since it makes the API more convenient to work with and doesn't have much of a consequence. As for the docs, I'm not sure we need to update them, but I wouldn't object to an update; I'd need to think more about it.

Member:

@adrinjalali Thanks for your expert judgement. In this case, I'd leave the docs as is.
The only open question is then whether to use _check_sample_weight, which enforces shape (n_samples,), or column_or_1d for sample_weight, but not both.
@albertvillanova For me, the choice is yours :smirk:

Member:

Oh, I realize: the current solution, with column_or_1d before _check_sample_weight, works as expected.

@lorentzenchr (Member):

I propose to open a new issue for the renaming and deprecation in precision_recall_curve from predict_proba to y_score and ask for other core devs' opinion there.

@lorentzenchr (Member) left a comment:

LGTM. Thanks for your patience.

@adrinjalali (Member) left a comment:

Comment on lines +703 to +707:

```python
sample_weight = _check_sample_weight(sample_weight, y_true)
nonzero_weight_mask = sample_weight != 0
y_true = y_true[nonzero_weight_mask]
y_score = y_score[nonzero_weight_mask]
sample_weight = sample_weight[nonzero_weight_mask]
```
Member:

Instead of filtering out zero sample weights, would it not be better to compute the scores in a way that zero sample weights are simply not taken into account? This proposed change creates a discontinuity between 0. and epsilon, which I don't think is desired.

Member:

Whatever the reason you have some zeros in sample_weight, it's better to always get arrays of the same length. Imagine y_score = clf.predict_proba(X) with y_score.shape[0] == X.shape[0] == sample_weight.shape[0]. Then filtering out zero-weight samples here in _binary_clf_curve is quite convenient.

> This proposed change results in a difference between 0. and epsilon, which I don't think is desired.

Could you elaborate?
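The length-consistency point made above amounts to filtering all per-sample arrays with one boolean mask, so they stay index-aligned; a minimal sketch with made-up data:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0])
y_score = np.array([0.1, 0.9, 0.4, 0.7])
sample_weight = np.array([1.0, 0.0, 1.0, 1.0])

# One boolean mask filters every per-sample array at once, so the
# three arrays keep the same length and stay index-aligned.
nonzero = sample_weight != 0
y_true, y_score, sample_weight = (
    y_true[nonzero], y_score[nonzero], sample_weight[nonzero]
)
assert y_true.shape == y_score.shape == sample_weight.shape == (3,)
```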

Contributor Author:

Sorry @adrinjalali, I do not understand your point either...

Member:

Let's say we want a weighted average of a bunch of numbers. One way is to first filter out data with zero weight; the other is to compute something like sum(w[i]*x[i]) / sum(w[i]) and ignore the fact that some of those weights are zero. The calculation itself takes care of zero sample weights.

Now in this case, there's an issue with having samples with zero weight in the data. My question is: if we just filter out the zero weights, then what happens to a sample with weight equal to 1e-32, for instance? That's practically zero and should be treated [almost] the same as the ones with zero sample weight.
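For a plain weighted average, the two approaches described above agree exactly, which can be checked with a tiny made-up example:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.5, 0.0])

# Computing with the zero weight left in...
avg_all = np.sum(w * x) / np.sum(w)

# ...equals filtering the zero-weight sample out first.
mask = w != 0
avg_filtered = np.sum(w[mask] * x[mask]) / np.sum(w[mask])

assert np.isclose(avg_all, avg_filtered)  # both are 1.5
```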

Member:

AFAIK, _binary_clf_curve is only used here in _ranking.py. It calculates counts cumsum(y * weight), which are the ingredients for precision, recall and the like. You never have ... / sum(weight); this would cancel out in ratios of counts anyway. Therefore, I would expect that filtering out zero-weight samples is equivalent.
The real problem mentioned in #16065, however, seems to be the different values of thresholds, i.e. thresholds with only zero weights should be dropped. This PR solves this problem (by removing those samples).

Suggestion: Should we just add a test with weight1=[1, 1, 1e-30] and weight2=[1, 1, 0] for several relevant scores/errors/losses?
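The suggested test could be sketched roughly as follows (the data here is made up; since a near-zero weight is not filtered out, the robust check is on the resulting scores rather than on exact threshold arrays):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = np.array([0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.9, 0.35, 0.8, 0.65])
w_tiny = np.array([1.0, 1.0, 1e-30, 1.0, 1.0])
w_zero = np.array([1.0, 1.0, 0.0, 1.0, 1.0])

# A practically-zero weight and an exactly-zero weight should give
# (almost) the same score for weight-aware ranking metrics.
for score in (average_precision_score, roc_auc_score):
    s_tiny = score(y_true, y_score, sample_weight=w_tiny)
    s_zero = score(y_true, y_score, sample_weight=w_zero)
    assert np.isclose(s_tiny, s_zero)
```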

Member:

Can we reach a consensus? I would like to get this one closed.

Member:

I played around a little bit, I'm happy with this solution as is :)


@lorentzenchr (Member):

@adrinjalali I guess you're busy with other important tasks. Do you still intend to review this one or should we look out for another reviewer?

@adrinjalali (Member):

I did what I wanted to do (#18328 (comment)), I'm happy for this to be merged :)

@lorentzenchr (Member):

@adrinjalali Thanks for confirming. A green check mark would have been enough :smirk:

@albertvillanova Could you move the whatsnew entry from 0.24 to 1.0 and thereby solve merge conflicts (after merging main)? After that, I'm eager to hit the green button.

@albertvillanova (Contributor Author):

Done @lorentzenchr 😉

lorentzenchr merged commit c957eb3 into scikit-learn:main on Apr 5, 2021.
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request on Apr 19, 2021 (Co-authored-by: Alonso Silva Allende <alonsosilva@gmaiil.com>).
glemaitre mentioned this pull request on Apr 22, 2021.
Successfully merging this pull request may close these issues.

[Feature Request/Improvement] Dealing with weight=0 in precision_recall_curve
5 participants