IsotonicRegression gives NANs on normal data · Issue #2507 · scikit-learn/scikit-learn · GitHub

Closed
Felix-neko opened this issue Oct 9, 2013 · 22 comments

Comments

@Felix-neko

I have a problem with IsotonicRegression: it produces NaNs when fitting data whose values are close to zero (but still greater than sys.float_info.min).

I pickled some data that triggers the bug and uploaded it to SendSpace:

http://www.sendspace.com/file/0i18ib

Below is an example that reproduces the problem.

# -*- coding: utf-8 -*-

import pickle

[xArray, yArray, weightArray, pPredicted] = pickle.load(open("bugreport.dmp", 'r'))
# xArray and yArray are the raw data I want to fit; weightArray holds the
# sample weights. There are no NaNs among them.

import sklearn
print sklearn.__version__  # scikit-learn 0.14.1

import sklearn.isotonic

regression = sklearn.isotonic.IsotonicRegression()
regression.fit(xArray, yArray, sample_weight=weightArray)
print regression.predict(xArray)  # Oh no! It gives some NaNs!

@Felix-neko
Author

P.S. Found on Python 2.7 (64-bit version) / Win7.

@ogrisel
Member
ogrisel commented Oct 11, 2013

Unfortunately I cannot unpickle the data on my box with numpy 1.7.1:

Traceback (most recent call last):
  File "<ipython-input-8-0d0d340ac566>", line 1, in <module>
    [xArray, yArray, weightArray, pPredicted] = pickle.load(open("bugreport.dmp", 'r'))
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1378, in load
    return Unpickler(file).load()
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1124, in find_class
    __import__(module)
ImportError: No module named multiarray

Maybe you can save the data in a version-independent format, for instance with numpy.savez or numpy.savez_compressed: http://docs.scipy.org/doc/numpy/reference/generated/numpy.savez.html
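As a sketch, saving and reloading the four arrays with numpy.savez_compressed could look like this (the key names are illustrative, not from the original report):

import numpy as np

# Save the arrays in NumPy's portable .npz format.
np.savez_compressed("bugreport.npz",
                    x=xArray, y=yArray, w=weightArray, p=pPredicted)

# Reload them later, independently of the NumPy version used to save.
loaded = np.load("bugreport.npz")
xArray, yArray = loaded["x"], loaded["y"]
weightArray, pPredicted = loaded["w"], loaded["p"]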

@Felix-neko
Author

Here is an NPZ file: http://www.sendspace.com/file/hx06eh

The code, updated accordingly:

# -*- coding: utf-8 -*-

import numpy as np

loadedData = np.load("bugreport.npz")
[xArray, yArray, weightArray, pPredicted] = loadedData["a"]
# xArray and yArray are the raw data I want to fit; weightArray holds the
# sample weights. There are no NaNs among them.

import sklearn
print sklearn.__version__  # scikit-learn 0.14.1

import sklearn.isotonic

regression = sklearn.isotonic.IsotonicRegression()
regression.fit(xArray, yArray, sample_weight=weightArray)
print regression.predict(xArray)  # Oh no! It gives some NaNs!

@ogrisel
Member
ogrisel commented Oct 11, 2013

Indeed, I can reproduce the failures on a 64-bit OS X 10.8 box, so this is neither Windows- nor 32-bit-specific. Maybe @agramfort or @NelleV might want to have a look.

@Felix-neko
Author

I used the 64-bit version of Windows 7, if that matters.

@NelleV
Member
NelleV commented Oct 17, 2013

As mentioned by mail, this is due to the fact that there are zeros in the weights.
We could easily fix the problem in our code by removing the X and y entries with zero weight, but a problem would remain: since the zeros sit at the bottom and top of the X range, we would not be able to predict for those values either (and the code would return NaN instead of raising an error).
Another possible fix would be to add some epsilon to the weights, so that we never divide by zero.
A last fix would be to remove the zero-weighted Xs and then predict by the nearest value, but that seems like a hack to me.
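A minimal sketch of the first option, done on the caller's side with the arrays from the report above (not a change to the library itself):

import numpy as np
import sklearn.isotonic

# Keep only the samples with strictly positive weight before fitting.
mask = np.asarray(weightArray) > 0
regression = sklearn.isotonic.IsotonicRegression()
regression.fit(xArray[mask], yArray[mask], sample_weight=weightArray[mask])

# Caveat from above: if the dropped samples sit at the extremes of xArray,
# those x values now fall outside the fitted range and still predict as NaN.
print(regression.predict(xArray))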

@ogrisel
Member
ogrisel commented Oct 17, 2013

Wouldn't it make more sense to just drop samples with 0 weights prior to fitting?

@ogrisel
Member
ogrisel commented Oct 17, 2013

Otherwise, if zero-weight samples are meaningless, we can always throw a ValueError exception stating explicitly that zero is an invalid value for a weight.
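A sketch of what such a check might look like (hypothetical helper, not the actual scikit-learn code):

import numpy as np

def check_sample_weight(sample_weight):
    # Reject zero (or negative) weights explicitly instead of letting
    # them propagate NaNs through the fit.
    sample_weight = np.asarray(sample_weight, dtype=np.float64)
    if np.any(sample_weight <= 0):
        raise ValueError("sample_weight must be strictly positive; "
                         "got %d non-positive entries"
                         % np.sum(sample_weight <= 0))
    return sample_weight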

@NelleV
Member
NelleV commented Oct 17, 2013

It would totally make sense, but we are only able to predict the regression for new values between X.min() and X.max(). Hence, in this case, the function would still return NaN where it does right now (because the scipy interp1d function is set up to return NaN when a value is outside the range of the fit, rather than raise an exception).
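That interp1d behaviour can be seen in a standalone sketch:

import numpy as np
from scipy.interpolate import interp1d

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 4.0])

# With bounds_error=False, interp1d returns fill_value (NaN here) for
# queries outside [x.min(), x.max()] instead of raising an exception.
f = interp1d(x, y, bounds_error=False, fill_value=np.nan)
print(f([-0.5, 1.5, 2.5]))  # [ nan  2.5  nan]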

@GaelVaroquaux
Member

Wouldn't it make more sense to just drop samples with 0 weights prior to fitting?

And warn in such a situation.

@Felix-neko
Author

I think it would be really useful to add extrapolation:
if the x values passed for prediction are below min(xTraining), predict(x) should not break, but should return the value at the left border of the fitted range (and likewise at the right border).
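Until something like that exists in the library, one way to get this behaviour is to clip the query points to the training range before predicting (a sketch; xTraining and xNew are placeholder names):

import numpy as np

# Clamp out-of-range queries to the fitted domain, so predict() returns the
# boundary value instead of NaN for x below min(xTraining) or above
# max(xTraining).
xClipped = np.clip(xNew, xTraining.min(), xTraining.max())
predictions = regression.predict(xClipped)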

@Felix-neko
Author

About dropping x values with zero weights: I think this should be done on the library side. We users often forget to check our weights for zeros, or have to write wrapper code around the IsotonicRegression class to do it automatically. It's much better when it works out of the box.

@amueller amueller added the Bug label Jan 23, 2015
@amueller
Member

Unfortunately the data is not available anymore.
I tried reproducing with something like this:

import numpy as np
import matplotlib.pyplot as plt
import sklearn.isotonic

regression = sklearn.isotonic.IsotonicRegression()
n_samples = 60

x = np.linspace(-3, 3, n_samples)
y = x + np.random.uniform(size=n_samples)
w = np.random.uniform(size=n_samples)
w[5:8] = 0  # a few zero weights in the middle of the range

regression.fit(x, y, sample_weight=w)
plt.plot(x, y)
plt.plot(x, regression.predict(x))
print regression.predict(x)

I'm not sure if this matches the original setup. The fit actually just runs and never finishes, as far as I can tell.

@amueller
Member

The out-of-bounds issue seems to have been tackled in #3147 and #3199.
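For reference, a sketch of how this looks from the user side in releases that include the out_of_bounds option (assuming that is the parameter those PRs introduced):

import numpy as np
from sklearn.isotonic import IsotonicRegression

x = np.linspace(-3, 3, 20)
y = x + np.random.uniform(size=20)

# out_of_bounds="clip" clamps predictions for queries outside the training
# range to the boundary values instead of returning NaN.
regression = IsotonicRegression(out_of_bounds="clip")
regression.fit(x, y)
print(regression.predict(np.array([-10.0, 0.0, 10.0])))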

@mjbommar
Contributor

Just today I had issues with a client on 0.16-git where fit followed by transform gives very different results from fit_transform. I think some regressions or issues may have been introduced by the recent weighting changes.

I confirmed the regression by testing between 0.15.2 and HEAD, but I won't have time to prepare a clean test case without client data until later this week. Maybe someone can look?


@amueller
Member

I am just looking at the code, and I don't see how this could happen. I have:

import numpy as np
import sklearn.isotonic
from sklearn.utils import shuffle


regression = sklearn.isotonic.IsotonicRegression()
n_samples = 60

x = np.linspace(-3, 3, n_samples)
y = x + np.random.uniform(size=n_samples)
w = np.random.uniform(size=n_samples)
#w[5:8] = 0
x, y, w = shuffle(x, y, w)

regression.fit(x, y, sample_weight=w)
fit_then_transform = regression.fit(x, y, sample_weight=w).transform(x)
fit_transform = regression.fit_transform(x, y, sample_weight=w)
print(np.abs(fit_transform - fit_then_transform).max())

4.4408920985e-16

@amueller
Member

Just masking out zero-weight points doesn't work with the current implementation of fit_transform. I think we need to handle them separately in the Cython code. Alternatively, we could remove them and add them back in, but that seems like more hassle.

@mjbommar
Contributor

Not sure how many of these are related, but here's an example with clean data showing that fit and fit_transform return different results for data with ties in X. This was generated with HEAD as of about two hours ago:
http://nbviewer.ipython.org/urls/gist.githubusercontent.com/mjbommar/74fcefdcd0f2b1a5f708/raw/4742691db799101091598922cd0808f1eb5f07f2/isotonic_test_case_20150129.json

I have another 0.16-git install whose site-packages/sklearn is dated Dec 4. I compared md5sums and confirmed it is at commit 81613d2:
https://github.com/scikit-learn/scikit-learn/tree/81613d2d2fa07bc310fb7d4abb4231ce78772fad/sklearn/isotonic.py

I patched this install with isotonic.py from commits d255866 and a9ea55f; the former gives a norm of 0 in the ties case, while the latter reproduces the non-zero norm and the error case. In other words, I think we can ping @ragv about a9ea55f as the culprit.
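Until the clean test case is ready, a self-contained sketch of the kind of check involved (synthetic data with ties in X, not the client data from the notebook):

import numpy as np
from sklearn.isotonic import IsotonicRegression

# Several samples share the same x value, i.e. X contains ties.
x = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 4.0])
y = np.array([0.5, 1.5, 1.0, 2.0, 3.0, 2.5, 4.0, 5.0])

fit_then_transform = IsotonicRegression().fit(x, y).transform(x)
fit_transform = IsotonicRegression().fit_transform(x, y)

# On an unaffected build this norm is ~0; the regression described above
# makes it non-zero when X contains ties.
print(np.abs(fit_then_transform - fit_transform).max())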

@mjbommar
Contributor

BTW, I responded to this originally by email, not realizing how old the issue was... should we create a new issue for this regression, @amueller?

@amueller
Member

Yes, please create a new issue, as I think it is unrelated.

@amueller
Member

Damn, a9ea55f was my idea... let's see what it broke...

@amueller
Member

Fixed by #4297.
