Document and adapt isclose() usage #4864
Comments
Can you give an example where this is failing? Maybe we need to handle certain situations better?
Whenever you have very small probabilities, for example from a very unbalanced and slightly noisy classification. These small probabilities can still be predictive, though (I had a real-life case). However in …
Now that you say it, I agree this should be handled differently. I'd suggest not tampering with the raw data and just sorting the floats as they are. On the other hand, I keep reimplementing aggregation procedures to compress millions of precision_recall_curve points into a plottable number of points. For that I usually look at min(precision) and max(precision) for each small recall range. I can imagine that using something similar to this min/max approach for a given recall resolution (e.g. [r_i ... r_i+0.01] ranges) would be more robust if you really want to combine presumably equal points (a sketch follows). So this aggregation is needed anyway, but it should be performed in the appropriate way.
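A minimal sketch of that min/max aggregation, assuming the goal is just to thin a dense precision-recall curve for plotting; the function name `thin_pr_curve` and the `bin_width` parameter are illustrative, not part of scikit-learn:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def thin_pr_curve(y_true, y_score, bin_width=0.01):
    """Keep only the min/max precision within each small recall range (illustrative)."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # assign every point to a recall bin of width bin_width
    bins = np.floor(recall / bin_width).astype(int)
    points = []
    for b in np.unique(bins):
        mask = bins == b
        r = b * bin_width
        # keep the extremes of precision inside this recall bin
        points.append((r, precision[mask].min()))
        points.append((r, precision[mask].max()))
    return np.array(points)
```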
One student recently ran into this issue when upgrading his scikit-learn version.
Same here, using it for ROC curves: isclose groups all scores below a certain level, creating plots with missing values below a certain fpr (below 1% for me). A good way to overcome this for me was to use predict_log_proba() instead of predict_proba() (see the example below).
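A minimal sketch of that workaround; the toy dataset, class imbalance, and choice of LogisticRegression are illustrative. Since the log is a monotonic transform, the ROC curve itself is unchanged, but tiny probabilities become well-separated scores:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# a deliberately unbalanced toy problem (illustrative only)
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# log-probabilities are a monotonic transform of predict_proba, so the ROC
# curve is the same, but very small probabilities are no longer within an
# absolute closeness tolerance of each other
log_scores = clf.predict_log_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, log_scores)
```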
I would be in favour of removing the isclose. Other opinions? (@jnothman?)
Correct me if I am mistaken, but just dropping the isclose can create a mess when dealing with big data sets (the fpr and tpr arrays could become extremely large, making it almost certainly impossible to plot without some processing). If people need to deal with such low score values, I can see 3 viable solutions, taking into consideration that only a small percentage of users would ever hit that limit. Good documentation is viable in any case.

3) Myself, I would be much more inclined towards an approach that aggregates the end result (output) rather than the input data, that is, grouping the fpr and tpr values (using a similar or even zero tolerance). As an example, when I used the default metrics.roc_curve on big data sets, it produced millions of points for FPR and TPR (most of them redundant, at least for plotting purposes), which is similar to not using isclose and getting an even bigger number of distinct thresholds and consequently fpr and tpr values.

To solve this, given the nature of ROC curve plots (mostly vertical and horizontal lines that need only two sets of coordinates), I used a very simple method that keeps only the values at the ends of those linear parts (a sketch follows this comment). Depending on the tolerance and the ROC plot itself, this can reduce the number of values to usually close to 1%, which makes plotting much easier and faster and outweighs any computational cost introduced by the method. Actually, I would suggest that anyone use this kind of thinning before plotting their fpr and tpr values. Even with zero tolerance (no loss of information), the number of values drops to close to the aforementioned 1%. Cheers
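A minimal sketch of such post-hoc thinning, assuming the curve is only needed for plotting; the function name `thin_roc_points` and the `tol` parameter are illustrative, not part of scikit-learn:

```python
import numpy as np

def thin_roc_points(fpr, tpr, tol=0.0):
    """Drop interior points of (nearly) horizontal or vertical segments (illustrative)."""
    fpr, tpr = np.asarray(fpr), np.asarray(tpr)
    keep = np.ones(len(fpr), dtype=bool)
    for i in range(1, len(fpr) - 1):
        # a point is redundant if one of its coordinates does not change
        # (within tol) towards either of its neighbours
        flat_x = abs(fpr[i] - fpr[i - 1]) <= tol and abs(fpr[i + 1] - fpr[i]) <= tol
        flat_y = abs(tpr[i] - tpr[i - 1]) <= tol and abs(tpr[i + 1] - tpr[i]) <= tol
        if flat_x or flat_y:
            keep[i] = False
    return fpr[keep], tpr[keep]
```

With tol=0.0 only points lying strictly inside axis-parallel runs are dropped, so the plotted step curve is identical to the original.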
Certainly 3) seems to be the cleanest solution. If you want to solve the plotting issue, then solve the plotting issue; do not mess with the values beforehand. The plotting issue can be solved in different, independent ways. I believe Bokeh can even handle it without any user adjustment.
Appears to be a duplicate of #3864. Would someone like to summarise, there, the discussion here?
sklearn.metrics introduces isclose() in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/ranking.py, which can leave the unaware data practitioner with hours of debugging. In very unbalanced classification, probabilities/scores can be very small and yet meaningful. This, however, will cause unexpectedly missing precision_recall points because isclose treats values within 10e-6 of each other as equal.

I'd suggest placing a warning about isclose in the documentation and also replacing the absolute epsilon with a relative closeness comparison in order to avoid these problems with small probabilities in unbalanced classification (see the sketch below).
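A minimal sketch of the absolute-versus-relative distinction being proposed, not the actual scikit-learn code; the two scores and the 10e-6 tolerance are used only for illustration:

```python
import numpy as np

# two distinct but very small scores, as might come out of an unbalanced classifier
p1, p2 = 3e-7, 9e-7

# absolute tolerance: the scores are treated as equal, so their thresholds collapse
absolutely_close = abs(p1 - p2) <= 10e-6                   # True

# purely relative tolerance: the scores stay distinct
relatively_close = np.isclose(p1, p2, rtol=1e-6, atol=0)   # False

print(absolutely_close, bool(relatively_close))            # True False
```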