[MRG + 1] Fix gradient boosting overflow and various other float comparison on ==#7970
[MRG + 1] Fix gradient boosting overflow and various other float comparison on ==#7970amueller merged 10 commits intoscikit-learn:masterfrom
Conversation
|
I am wondering how float comparisons were handled before, when there was no isclose() function. |
|
@chenhe95 yeah usually you want "close to zero" so you can always do norm < eps |
| denominator = np.sum(sample_weight * (y - residual) * (1 - y + residual)) | ||
|
|
||
| if denominator == 0.0: | ||
| if isclose(denominator, 0., rtol=0., atol=np.float64(1e-150)): |
There was a problem hiding this comment.
silly question but is that a scalar? I feel the code denominator < np.float64(1e-150) easier to understand.
There was a problem hiding this comment.
It turned out a lot messier than I had anticipated.
I originally planned to just use isclose(denominator, 0.) which seemed pretty clean, but the default tolerance was only at 1e-8.
I suppose it is good to just do abs(denominator) < 1e-150 here.
|
I also went and did some pep8 housekeeping on the file gradient_boosting.py Which I wasn't quite sure what to do And here flake8 suggested doing just Let me know what you guys think |
There was a problem hiding this comment.
Apart from reverting all PEP8 changes unrelated to the PR, this LGTM... Thx!
| tree.value[leaf, 0, 0] = _weighted_percentile(diff, sample_weight, percentile=50) | ||
| diff = (y.take(terminal_region, axis=0) - | ||
| pred.take(terminal_region, axis=0)) | ||
| tree.value[leaf, 0, 0] = \ |
There was a problem hiding this comment.
Could you avoid backslash and rather do
_weighted_percentile(diff
sample_weight...)| gamma = stats.scoreatpercentile(np.abs(diff), self.alpha * 100) | ||
| else: | ||
| gamma = _weighted_percentile(np.abs(diff), sample_weight, self.alpha * 100) | ||
| gamma = _weighted_percentile( |
There was a problem hiding this comment.
Why are you cleaning up flake8 issues for code that is not modified in this PR... It creates merged conflicts with other PRs. In general we try to enforce flake8 only for the code that is being modified in the PR... +1 for reverting these changes...
|
|
||
| if denominator == 0.0: | ||
| # prevents overflow and division by zero | ||
| if abs(denominator) < 1e-150: |
There was a problem hiding this comment.
Instead of 1e-150 you could use np.finfo(np.float32).eps... There is precedent for it here
There was a problem hiding this comment.
Hmm.. I am not sure. @raghavrv do you have a strong preference for np.finfo(np.float32).eps or do you think 1e-150 is still fine? I personally prefer 1e-150 because the case where this algorithm was failing at was when denominator was around 1e-309, so I felt that 1e-150 was appropriate since it's about half of 300.
>>> np.finfo(np.float).eps
2.2204460492503131e-16
>>> np.finfo(np.double).eps
2.2204460492503131e-16
It's just that I am kind of worried that those values are too large compared to 1e-150 and I'm not sure if it will cause any rounding errors.
There was a problem hiding this comment.
>>> np.finfo(np.float32).tiny
1.1754944e-38
I suppose this seems okay!
There was a problem hiding this comment.
I was suggesting np.finfo(np.double).tiny, which is close to e-300 for 32 64 bit and much less for 64 bit systems...
There was a problem hiding this comment.
>>> np.finfo(np.double).tiny
2.2250738585072014e-308There was a problem hiding this comment.
Thing is if your system is 32 bit, denominator - which is the result of np.sum - can only be as low as (np.finfo(np.float<arch>).tiny) IIUC...
There was a problem hiding this comment.
I actually wasn't sure about this, because to the power of -308 seemed a bit too small as well, and it can easily overflow for not very large numerators
>>> 4/np.finfo(np.double).tiny
__main__:1: RuntimeWarning: overflow encountered in double_scalars
inf
Which was originally why I had 1e-150
| # Check input | ||
| X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'], dtype=DTYPE) | ||
| X, y = check_X_y( | ||
| X, y, accept_sparse=['csr', 'csc', 'coo'], dtype=DTYPE) |
There was a problem hiding this comment.
+1 for reverting all pep8 changes unrelated to the PR... :)
|
Is this meant to be MRG, not WIP? |
|
@jnothman I think I am going to revert the flake8 fixes and then set the title to MRG. |
|
Thanks for the feedback everyone! I have reverted the flake8 things. Let me know how it looks! |
|
Could you make it |
|
Sorry I missed your comment here... |
|
1e-150 is fine. np.finfo(np.double).tiny is too small. |
|
Okay, it's reverted back to 1e-150 |
|
AppVeyor is claiming that the log is empty and failing. |
|
LGTM, thanks |
|
Could you please add a bug fix entry to whats_new.rst? Thanks |
|
Hm can we add tests that no warning is raised? Or is that too tricky? Otherwise lgtm. |
|
Oh, there's actually a ValueError in the issue. You should add a test to ensure this value error doesn't happen any more after your fix. |
|
I am unsure if the ValueError is easily reproducible since the original reporter of the error said
But I am fairly confident that this will fix the ValueError because the float will not be compared to |
|
The point about adding a test is that we don't accidentally introduce the same bug down the road. |
|
Okay, I'll see what I can do to come up with some test cases. |
|
Any luck on this, @chenhe95? |
|
Hmm.. not really. The last few days of finals have been rough and I have been working on my other CountFeaturizer pull request. |
|
@amueller, this is the sort of thing that I suspect we can only reasonably test by separating out a smaller private helper as a unit and testing that. I am inclined to merge the patch even if we can't build a test with ease. |
|
Waiting for @amueller to voice his opinion on to what extent a test is necessary. |
|
LGTM. |
|
Thanks @chenhe95!! |
…arison on == (scikit-learn#7970) * reintroduced isclose() and flake8 fixes to fixes.py * changed == 0.0 to isclose(...) * example changes * changed back to abs() < epsilon * flake8 convention on file * reverted flake8 fixes * reverted flake8 fixes (2) * np.finfo(np.float32).tiny instead of hard coded epsilon 1e-150 * reverted to 1e-150 * whats new modified
…arison on == (scikit-learn#7970) * reintroduced isclose() and flake8 fixes to fixes.py * changed == 0.0 to isclose(...) * example changes * changed back to abs() < epsilon * flake8 convention on file * reverted flake8 fixes * reverted flake8 fixes (2) * np.finfo(np.float32).tiny instead of hard coded epsilon 1e-150 * reverted to 1e-150 * whats new modified
…arison on == (scikit-learn#7970) * reintroduced isclose() and flake8 fixes to fixes.py * changed == 0.0 to isclose(...) * example changes * changed back to abs() < epsilon * flake8 convention on file * reverted flake8 fixes * reverted flake8 fixes (2) * np.finfo(np.float32).tiny instead of hard coded epsilon 1e-150 * reverted to 1e-150 * whats new modified
…arison on == (scikit-learn#7970) * reintroduced isclose() and flake8 fixes to fixes.py * changed == 0.0 to isclose(...) * example changes * changed back to abs() < epsilon * flake8 convention on file * reverted flake8 fixes * reverted flake8 fixes (2) * np.finfo(np.float32).tiny instead of hard coded epsilon 1e-150 * reverted to 1e-150 * whats new modified
…arison on == (scikit-learn#7970) * reintroduced isclose() and flake8 fixes to fixes.py * changed == 0.0 to isclose(...) * example changes * changed back to abs() < epsilon * flake8 convention on file * reverted flake8 fixes * reverted flake8 fixes (2) * np.finfo(np.float32).tiny instead of hard coded epsilon 1e-150 * reverted to 1e-150 * whats new modified
…arison on == (scikit-learn#7970) * reintroduced isclose() and flake8 fixes to fixes.py * changed == 0.0 to isclose(...) * example changes * changed back to abs() < epsilon * flake8 convention on file * reverted flake8 fixes * reverted flake8 fixes (2) * np.finfo(np.float32).tiny instead of hard coded epsilon 1e-150 * reverted to 1e-150 * whats new modified
…arison on == (scikit-learn#7970) * reintroduced isclose() and flake8 fixes to fixes.py * changed == 0.0 to isclose(...) * example changes * changed back to abs() < epsilon * flake8 convention on file * reverted flake8 fixes * reverted flake8 fixes (2) * np.finfo(np.float32).tiny instead of hard coded epsilon 1e-150 * reverted to 1e-150 * whats new modified

Reference Issue
Fix #7717
What does this implement/fix? Explain your changes.
Before, the code was using == to compare float values and dividing by "zero (~10e-309)" which caused an overflow.
Now I made it so that it's
There are several other instances of this happening, which may cause an error and I want to also address those later on.
In addition, this brings back the numpy.isclose() method which is a standardized way of computing if two float scalars or matrices of arbitrary size are almost close to a threshold.