FIX adapt epsilon value depending on the dtype of the input by Safikh · Pull Request #24354 · scikit-learn/scikit-learn · GitHub

FIX adapt epsilon value depending on the dtype of the input #24354


Merged · 36 commits into scikit-learn:main on Nov 10, 2022

Conversation

@Safikh (Contributor) commented Sep 4, 2022

Reference Issues/PRs

Fixes #24315

What does this implement/fix? Explain your changes.

Change the default epsilon value in log_loss from 1e-15 to "auto", which equals the machine epsilon of y_pred's dtype when y_pred is a numpy float array, and otherwise falls back to 1e-15 as before.
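
A minimal sketch of that dispatch (hypothetical helper name; not the actual PR diff):

import numpy as np

def resolve_eps(y_pred, eps="auto"):
    # eps="auto" follows the precision of y_pred's dtype when it is a numpy
    # floating array; any other input keeps the old 1e-15 default.
    if eps != "auto":
        return eps
    if isinstance(y_pred, np.ndarray) and np.issubdtype(y_pred.dtype, np.floating):
        return np.finfo(y_pred.dtype).eps
    return 1e-15

print(resolve_eps(np.array([0.5], dtype=np.float32)))  # ~1.19e-07
print(resolve_eps([0.1, 0.9]))                         # 1e-15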


@glemaitre changed the title from "Change default epsilon in logloss metric from 1e-15 to 1e-7" to "FIX adapt epsilon value depending on the dtype of the input" on Sep 5, 2022
@Safikh (Contributor, Author) commented Sep 5, 2022

@glemaitre Made the changes as per your comment.

@glemaitre (Member) left a comment:

Note to other maintainers:
We should not forget to add @gsiisg as Co-Author when merging the PR.

Comment on lines 283 to 285
- |Fix| :func:`metrics.logloss` takes "auto" as default eps value and it will be equal to
eps value of the `y_pred` if `y_pred` is numpy float else it will be 1e-15. This change was
made to be able to handle float16 and float32 numpy arrays
Member:

Suggested change
- |Fix| :func:`metrics.logloss` takes "auto" as default eps value and it will be equal to
eps value of the `y_pred` if `y_pred` is numpy float else it will be 1e-15. This change was
made to be able to handle float16 and float32 numpy arrays
- |Fix| automatically set `eps` in :func:`metrics.logloss` depending on the input
arrays. `eps` was previously too small by default when passing lower precision
floating point arrays.

Member:

We should also add an entry in the "Change models" section stating that we switch from 1e-15 to np.finfo(np.float64).eps by default.
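
For reference, the two defaults differ by roughly a factor of 4.5:

import numpy as np

# New default for float64 inputs vs. the old hard-coded 1e-15.
print(np.finfo(np.float64).eps)  # 2.220446049250313e-16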

Contributor (author):

Where is the "Change models" section in the codebase?

@Micky774 (Contributor) left a comment:

Thanks for the PR @Safikh! A few small notes.

y_true = [[0, 1]]
y_score = np.array([1], dtype=dtype)
loss = log_loss(y_true, y_score, eps="auto")
assert_allclose(loss, 0.000977, atol=1e-3)
@Micky774 (Contributor) commented Sep 6, 2022:

It would be clearer if this value were computed rather than hard-coded.
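
For example, something along these lines (a sketch, not the PR's final test; float16's machine epsilon is the value hard-coded above):

import numpy as np

dtype = np.float16
eps = np.finfo(dtype).eps   # ~0.000977 for float16
# A perfect prediction is clipped to 1 - eps, so the loss is -log(1 - eps),
# which is approximately eps itself.
expected = -np.log1p(-eps)
print(expected)             # ~0.000977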

Comment on lines 283 to 287
- |Fix| automatically set `eps` in :func:`metrics.logloss` depending on the input
arrays. `eps` was previously too small by default when passing lower precision
floating point arrays.
:pr:`24354` by :user:`Safiuddin Khaja <Safikh>` and
:user:`gsiisg <gsiisg>`
Contributor:

The changelog entry should reflect that you are adding a new keyword option to eps and changing the default value of eps.

@gsiisg commented Sep 6, 2022

Big thanks to @Safikh and @glemaitre and all contributors!

This is my first time contributing to scikit-learn, so I wasn't familiar with the process and created some confusion with a second pull request; I'll stick with this one from now on.

I just want to add a note to the eps documentation: all y_pred input passes through sklearn.utils.check_array, which casts plain Python int/float to int64/float64. So even when the dtype is not explicit, the input is treated with 64-bit precision unless specified otherwise (e.g. np.float16 or np.float32). At first I was worried that y_pred.dtype would error out for input like [1], but I later realized it is cast to array([1]), which is 64-bit by default.
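
A quick illustration (not part of the PR) of that casting behavior:

import numpy as np
from sklearn.utils import check_array

# Plain Python lists are promoted to 64-bit arrays, so
# np.finfo(y_pred.dtype).eps is well defined even for list input.
y_pred = check_array([0.1, 0.9], ensure_2d=False)
print(y_pred.dtype)                # float64
print(np.finfo(y_pred.dtype).eps)  # 2.220446049250313e-16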

@Safikh (Contributor, Author) commented Sep 7, 2022

I'm facing a weird bug: if I set eps to the dtype's eps, the test sklearn/metrics/tests/test_common.py::test_not_symmetric_metric fails. If I set it to something else that is not a multiple of eps, it works.

@gsiisg left a comment:

I think we should avoid hard-coding dtype=np.float64 in check_array(); it would blow up the memory used if the original input was np.float16/32, wouldn't it? And if y_pred is cast to 64-bit, then we wouldn't have a problem with eps=1e-15 in the first place.

@Micky774 (Contributor) commented Sep 8, 2022

> I think we should avoid hard-coding dtype=np.float64 in check_array(); it would blow up the memory used if the original input was np.float16/32, wouldn't it? And if y_pred is cast to 64-bit, then we wouldn't have a problem with eps=1e-15 in the first place.

You're right. I think it should instead be dtype=[np.float64, np.float32, np.float16], in which case anything other than those floating types is converted to np.float64. In the int{32, 64} case it converts to np.float64, which should be fine.
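
A small illustration (not from the PR) of check_array's dtype-list semantics:

import numpy as np
from sklearn.utils import check_array

allowed = [np.float64, np.float32, np.float16]
# Dtypes already in the list are preserved as-is ...
kept = check_array(np.array([0.5], dtype=np.float16), ensure_2d=False, dtype=allowed)
# ... anything else is converted to the first entry, np.float64.
converted = check_array(np.array([1], dtype=np.int32), ensure_2d=False, dtype=allowed)
print(kept.dtype)       # float16
print(converted.dtype)  # float64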

y_pred = check_array(
    y_pred, ensure_2d=False, dtype=[np.float64, np.float32, np.float16]
)
eps = np.finfo(y_pred.dtype).eps * 1.0001 if eps == "auto" else eps
Member:

What is the reason for multiplying by 1.0001? This looks really arbitrary.

@Safikh (Contributor, Author) commented Sep 12, 2022:

The test sklearn/metrics/tests/test_common.py::test_not_symmetric_metric fails when eps is exactly the dtype's eps. I have not been able to identify why it happens, but a very minute change to eps seems to fix it.

Member:

Yep, the test is not adapted for log_loss. Indeed, the loss is symmetric when it contains only 0/1 values, but not otherwise. Basically, since it takes y_true and y_proba, the test makes little sense.

We should correct this.
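
An illustration (not from the test suite) of the 0/1 case where symmetry does hold:

import numpy as np
from sklearn.metrics import log_loss

# With hard 0/1 values, both argument orders clip to the same eps terms,
# so swapping y_true and y_pred gives the same loss.
y1 = np.array([0, 1, 1, 0])
y2 = np.array([1, 1, 0, 0])
print(np.isclose(log_loss(y1, y2), log_loss(y2, y1)))  # True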

Contributor (author):

So, should I remove log_loss from the symmetric metrics? Then there would be no need for modifying the eps?

Member:

Yes, we need to remove the loss from the symmetric metrics and remove this 1.0001.

@glemaitre glemaitre self-requested a review September 12, 2022 07:53
Safikh and others added 5 commits September 12, 2022 20:16
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Comment on lines 321 to 323
- |Fix| add a `"auto"` option to `eps` in :func:`metrics.logloss`.
This option will automatically set the `eps` value depending on the data
type `y_pred`.
@Micky774 (Contributor) commented Sep 12, 2022:

The wording here still needs to be more explicit.

Also, at this point I would consider this an enhancement which also happens to fix a bug. Wondering what the maintainers think.

Suggested change
- |Fix| add a `"auto"` option to `eps` in :func:`metrics.logloss`.
This option will automatically set the `eps` value depending on the data
type `y_pred`.
- |Enhancement| Adds an `"auto"` option to `eps` in :func:`metrics.logloss`.
This option will automatically set the `eps` value depending on the data
type of `y_pred`. In addition, the default value of `eps` is changed from
`1e-15` to the new `"auto"` option.

Member:

Fair enough.

@glemaitre glemaitre self-requested a review September 13, 2022 13:15
@Safikh (Contributor, Author) commented Sep 29, 2022

Hi @glemaitre, I think this PR is complete. Is there anything that needs to be changed?

@glemaitre (Member) left a comment:

We need to remove this 1.0001.

y_pred = check_array(
    y_pred, ensure_2d=False, dtype=[np.float64, np.float32, np.float16]
)
eps = np.finfo(y_pred.dtype).eps * 1.0001 if eps == "auto" else eps
Member:

Yes, we need to remove the loss from the symmetric metrics and remove this 1.0001.
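
For reference, a sketch of the snippet once the 1.0001 factor is dropped, as requested (assuming no other changes):

y_pred = check_array(
    y_pred, ensure_2d=False, dtype=[np.float64, np.float32, np.float16]
)
eps = np.finfo(y_pred.dtype).eps if eps == "auto" else eps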

@glemaitre (Member) left a comment:

LGTM. Thanks @Safikh.

@Micky774 @ogrisel, do you want to have a look?

@Micky774 (Contributor) left a comment:

Overall looks good. I had a couple of small wording nits, and a concern regarding testing the np.float16 dtype. If those are addressed then this should be ready to merge.

@Micky774 (Contributor) left a comment:

LGTM. @glemaitre, feel free to merge if the changes are still acceptable to you.

@glemaitre glemaitre merged commit f8986ee into scikit-learn:main Nov 10, 2022
@glemaitre (Member) commented:

Thanks @Safikh. Merging.

@gsiisg commented Nov 10, 2022

Thanks everyone!

@Safikh Safikh deleted the logloss_float32_fix branch November 11, 2022 04:40

Successfully merging this pull request may close these issues.

log_loss giving nan when input is np.float32 and eps is default