[MRG + 2] FIX dtypes to conform to the stricter type cast rules of numpy 1.10 by raghavrv · Pull Request #5398 · scikit-learn/scikit-learn
Merged · 5 commits · Oct 15, 2015
4 changes: 2 additions & 2 deletions .travis.yml
@@ -17,8 +17,8 @@ env:
- DISTRIB="conda" PYTHON_VERSION="2.6" INSTALL_MKL="false"
NUMPY_VERSION="1.6.2" SCIPY_VERSION="0.11.0"
# This environment tests the newest supported anaconda env
- DISTRIB="conda" PYTHON_VERSION="3.4" INSTALL_MKL="true"
NUMPY_VERSION="1.8.1" SCIPY_VERSION="0.14.0"
- DISTRIB="conda" PYTHON_VERSION="3.5" INSTALL_MKL="true"
NUMPY_VERSION="1.10.1" SCIPY_VERSION="0.16.0"
raghavrv (Member, Author):
@lesteve @ogrisel Is this better than * since we know which version we are testing against?

install: source continuous_integration/install.sh
script: bash continuous_integration/test_script.sh
after_success:
16 changes: 8 additions & 8 deletions doc/modules/linear_model.rst
@@ -180,7 +180,7 @@ the parameter vector.
The implementation in the class :class:`Lasso` uses coordinate descent as
the algorithm to fit the coefficients. See :ref:`least_angle_regression`
for another implementation::

>>> from sklearn import linear_model
>>> clf = linear_model.Lasso(alpha = 0.1)
>>> clf.fit([[0, 0], [1, 1]], [0, 1])
@@ -1079,9 +1079,9 @@ of a given degree. It can be used as follows::
[4, 5]])
>>> poly = PolynomialFeatures(degree=2)
>>> poly.fit_transform(X)
array([[ 1, 0, 1, 0, 0, 1],
[ 1, 2, 3, 4, 6, 9],
[ 1, 4, 5, 16, 20, 25]])
array([[ 1., 0., 1., 0., 0., 1.],
[ 1., 2., 3., 4., 6., 9.],
[ 1., 4., 5., 16., 20., 25.]])

The features of ``X`` have been transformed from :math:`[x_1, x_2]` to
:math:`[1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]`, and can now be used within
@@ -1125,10 +1125,10 @@ This way, we can solve the XOR problem with a linear classifier::
>>> y = X[:, 0] ^ X[:, 1]
>>> X = PolynomialFeatures(interaction_only=True).fit_transform(X)
>>> X
array([[1, 0, 0, 0],
[1, 0, 1, 0],
[1, 1, 0, 0],
[1, 1, 1, 1]])
array([[ 1., 0., 0., 0.],
[ 1., 0., 1., 0.],
[ 1., 1., 0., 0.],
[ 1., 1., 1., 1.]])
>>> clf = Perceptron(fit_intercept=False, n_iter=10, shuffle=False).fit(X, y)
>>> clf.score(X, y)
1.0
12 changes: 6 additions & 6 deletions doc/modules/preprocessing.rst
@@ -484,9 +484,9 @@ Often it's useful to add complexity to the model by considering nonlinear features
[4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X) # doctest: +ELLIPSIS
array([[ 1, 0, 1, 0, 0, 1],
[ 1, 2, 3, 4, 6, 9],
[ 1, 4, 5, 16, 20, 25]])
array([[ 1., 0., 1., 0., 0., 1.],
[ 1., 2., 3., 4., 6., 9.],
[ 1., 4., 5., 16., 20., 25.]])

The features of X have been transformed from :math:`(X_1, X_2)` to :math:`(1, X_1, X_2, X_1^2, X_1X_2, X_2^2)`.

Expand All @@ -499,9 +499,9 @@ In some cases, only interaction terms among features are required, and it can be
[6, 7, 8]])
>>> poly = PolynomialFeatures(degree=3, interaction_only=True)
>>> poly.fit_transform(X) # doctest: +ELLIPSIS
array([[ 1, 0, 1, 2, 0, 0, 2, 0],
[ 1, 3, 4, 5, 12, 15, 20, 60],
[ 1, 6, 7, 8, 42, 48, 56, 336]])
array([[ 1., 0., 1., 2., 0., 0., 2., 0.],
[ 1., 3., 4., 5., 12., 15., 20., 60.],
[ 1., 6., 7., 8., 42., 48., 56., 336.]])

The features of X have been transformed from :math:`(X_1, X_2, X_3)` to :math:`(1, X_1, X_2, X_3, X_1X_2, X_1X_3, X_2X_3, X_1X_2X_3)`.

8 changes: 4 additions & 4 deletions sklearn/cross_decomposition/pls_.py
@@ -14,7 +14,7 @@
import numpy as np
from scipy import linalg
from ..utils import arpack
from ..utils.validation import check_is_fitted
from ..utils.validation import check_is_fitted, FLOAT_DTYPES

__all__ = ['PLSCanonical', 'PLSRegression', 'PLSSVD']

@@ -375,14 +375,14 @@ def transform(self, X, Y=None, copy=True):
x_scores if Y is not given, (x_scores, y_scores) otherwise.
"""
check_is_fitted(self, 'x_mean_')
X = check_array(X, copy=copy)
X = check_array(X, copy=copy, dtype=FLOAT_DTYPES)
# Normalize
X -= self.x_mean_
X /= self.x_std_
# Apply rotation
x_scores = np.dot(X, self.x_rotations_)
if Y is not None:
Y = check_array(Y, ensure_2d=False, copy=copy)
Y = check_array(Y, ensure_2d=False, copy=copy, dtype=FLOAT_DTYPES)
if Y.ndim == 1:
Y = Y.reshape(-1, 1)
Y -= self.y_mean_
raghavrv (Member, Author):
@GaelVaroquaux @ogrisel @lesteve I am also wondering whether, instead of explicitly typecasting Y to float (assuming y_mean_ is float), we should:

  1. typecast y_mean_ to Y.dtype or
  2. typecast Y to y_mean_.dtype?

raghavrv (Member, Author):
This would (?) preserve int as it is without typecasting it when y_mean_ is also int!

raghavrv (Member, Author):
Okay!!
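As a side note for this thread, a minimal standalone sketch (assuming numpy >= 1.10 with its 'same_kind' casting default; the Y / y_mean names are only illustrative) of why the in-place centering fails for integer input:

import numpy as np

# In-place subtraction of a float mean from an int array cannot be cast back
# to int under numpy >= 1.10's 'same_kind' rule; the out-of-place form upcasts.
Y = np.array([[1, 2], [3, 4]], dtype=np.int64)
y_mean = Y.mean(axis=0)                # float64

try:
    Y -= y_mean                        # TypeError: cannot cast float64 result to int64
except TypeError as exc:
    print(exc)

Y = Y - y_mean                         # fine: result is a new float64 array
print(Y.dtype)                         # float64

Casting y_mean_ to Y.dtype (option 1) would keep an integer Y intact but would silently truncate the mean, which is presumably why the PR forces a float dtype via check_array instead.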

@@ -410,7 +410,7 @@ def predict(self, X, copy=True):
be an issue in high dimensional space.
"""
check_is_fitted(self, 'x_mean_')
X = check_array(X, copy=copy)
X = check_array(X, copy=copy, dtype=FLOAT_DTYPES)
# Normalize
X -= self.x_mean_
X /= self.x_std_
raghavrv (Member, Author):
Same question here w.r.t. x_mean_ and x_std_?

8 changes: 4 additions & 4 deletions sklearn/decomposition/fastica_.py
@@ -18,6 +18,7 @@
from ..utils import check_array, as_float_array, check_random_state
from ..utils.extmath import fast_dot
from ..utils.validation import check_is_fitted
from ..utils.validation import FLOAT_DTYPES

__all__ = ['fastica', 'FastICA']

@@ -261,7 +262,7 @@ def my_g(x):
fun_args = {} if fun_args is None else fun_args
# make interface compatible with other decompositions
# a copy is required only for non whitened data
X = check_array(X, copy=whiten).T
X = check_array(X, copy=whiten, dtype=FLOAT_DTYPES).T

alpha = fun_args.get('alpha', 1.0)
if not 1 <= alpha <= 2:
@@ -540,7 +541,7 @@ def transform(self, X, y=None, copy=True):
"""
check_is_fitted(self, 'mixing_')

X = check_array(X, copy=copy)
X = check_array(X, copy=copy, dtype=FLOAT_DTYPES)
if self.whiten:
X -= self.mean_

@@ -563,8 +564,7 @@ def inverse_transform(self, X, copy=True):
"""
check_is_fitted(self, 'mixing_')

Member:
Why is that not check_array here?

Member:
and why would we want to copy? fast_dot doesn't change inplace, right? That wouldn't make any sense.

Member:
Was there an error here?

raghavrv (Member, Author):
Yes, X += self.mean_ made numpy 1.10 raise an error since X is not explicitly float but mean_ was... (is this fix okay, or should I rather do check_array(X, dtype=FLOAT_DTYPES)?)

Member:
so even if self.mixing_ is float, fast_dot(X, self.mixing_.T) is not? That is somewhat surprising to me.

Member:
fast_dot does casting according to my quick check on numpy 1.10

raghavrv (Member, Author):
Oh! So I must've done this pre-emptively I think... I'll check against master once to confirm!

raghavrv (Member, Author):
Have replaced with check_array...! And yes, I did this pre-emptively since there was a failure w.r.t. FastICA here -

======================================================================
ERROR: sklearn.tests.test_common.test_non_meta_estimators('FastICA', <class 'sklearn.decomposition.fastica_.FastICA'>)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/scikit-learn/sklearn/utils/testing.py", line 317, in wrapper
    return fn(*args, **kwargs)
  File "/scikit-learn/sklearn/utils/estimator_checks.py", line 669, in check_estimators_dtypes
    estimator.fit(X_train, y)
  File "/scikit-learn/sklearn/decomposition/fastica_.py", line 522, in fit
    self._fit(X, compute_sources=False)
  File "/scikit-learn/sklearn/decomposition/fastica_.py", line 478, in _fit
    compute_sources=compute_sources, return_n_iter=True)
  File "/scikit-learn/sklearn/decomposition/fastica_.py", line 300, in fastica
    X -= X_mean[:, np.newaxis]
TypeError: Cannot cast ufunc subtract output from dtype('float64') to dtype('int64') with casting rule 'same_kind'

Member:
ok
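As a hedged sketch of the fix settled on here (not the exact FastICA code path, just the pattern): validating with a float dtype up front makes the later in-place centering safe.

import numpy as np
from sklearn.utils.validation import check_array, FLOAT_DTYPES

# Force a float dtype during validation so that in-place ops such as
# `X -= X_mean` can no longer hit the 'same_kind' casting TypeError above.
X_int = np.arange(6, dtype=np.int64).reshape(3, 2)
X = check_array(X_int, dtype=FLOAT_DTYPES)    # integer input becomes float64
X -= X.mean(axis=0)                           # now a float64 -= float64 operation
print(X.dtype)                                # float64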

Member:
Should we set copy=(copy and self.whiten)? There is no need to trigger a copy when self.whiten is False.

if copy:
X = X.copy()
X = check_array(X, copy=(copy and self.whiten), dtype=FLOAT_DTYPES)
X = fast_dot(X, self.mixing_.T)
if self.whiten:
X += self.mean_
5 changes: 3 additions & 2 deletions sklearn/ensemble/weight_boosting.py
@@ -29,7 +29,7 @@
from numpy.core.umath_tests import inner1d

from .base import BaseEnsemble
from ..base import ClassifierMixin, RegressorMixin
from ..base import ClassifierMixin, RegressorMixin, is_regressor
from ..externals import six
from ..externals.six.moves import zip
from ..externals.six.moves import xrange as range
@@ -107,7 +107,8 @@ def fit(self, X, y, sample_weight=None):
dtype = None
accept_sparse = ['csr', 'csc']

X, y = check_X_y(X, y, accept_sparse=accept_sparse, dtype=dtype)
X, y = check_X_y(X, y, accept_sparse=accept_sparse, dtype=dtype,
y_numeric=is_regressor(self))

if sample_weight is None:
# Initialize weights to 1 / n_samples
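A small self-contained sketch of what the y_numeric=is_regressor(self) change above means for the shared AdaBoost fit: only the regressor variant asks check_X_y for a numeric y, so classifiers keep accepting string class labels.

from sklearn.base import is_regressor
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor

# is_regressor inspects the estimator type, letting the shared fit() decide
# whether y must be numeric at validation time.
print(is_regressor(AdaBoostRegressor()))   # True  -> check_X_y(..., y_numeric=True)
print(is_regressor(AdaBoostClassifier()))  # False -> string labels remain valid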
2 changes: 2 additions & 0 deletions sklearn/linear_model/least_angle.py
@@ -1074,6 +1074,8 @@ def fit(self, X, y):
"""
self.fit_path = True
X, y = check_X_y(X, y, y_numeric=True)
X = as_float_array(X, copy=self.copy_X)
Member:
why? why not dtypes?

Member:
is that for memory efficiency? Then at least y was already converted, right?

raghavrv (Member, Author):
You mean dtype=FLOAT_DTYPES in check_X_y correct?

Member:
Ah, wait, I thought y_numeric made y floating, but that is only if it has dtype object. Never mind, sorry for the noise. I didn't have coffee yet.

raghavrv (Member, Author):
Actually we could squash the three lines into

X, y = check_X_y(X, y, y_numeric=True, dtype=FLOAT_DTYPES, copy=self.copy_X)

Can I do that without any side effect?

Member:
Well, it wouldn't convert y, and it would upcast int32 to float64.

raghavrv (Member, Author):
Ah okay!!

Member:
Well, if you think upcasting int32 to float64 is an issue, we should get it fixed everywhere.

Member:
check_X_y never upcasts integers:

>>> check_X_y([[1], [2]], np.array([1, 2], dtype=np.int32), y_numeric=True, dtype=[np.float64, np.float32])[1].dtype
dtype('int32')

y_numeric=True just checks for non-object dtypes in y. We could benefit from a y_dtype=[np.int64, np.int32] option to replace y_numeric at some point but this is outside of the scope of this PR.

y = as_float_array(y, copy=self.copy_X)

# init cross-validation generator
cv = check_cv(self.cv, X, y, classifier=False)
30 changes: 14 additions & 16 deletions sklearn/preprocessing/data.py
@@ -323,7 +323,7 @@ def transform(self, X):
"""
check_is_fitted(self, 'scale_')

X = check_array(X, copy=self.copy, ensure_2d=False)
X = check_array(X, copy=self.copy, ensure_2d=False, dtype=FLOAT_DTYPES)
Member:
That's good!

Member:
😕 is that not going to cast an int32 dtype into a float64:

import numpy as np

from sklearn.utils.validation import check_array, FLOAT_DTYPES
check_array(np.arange(10, dtype='i4'), dtype=FLOAT_DTYPES).dtype

returns dtype('float64').

On the other hand as_float_array seems to do the right thing.

raghavrv (Member, Author):
Thanks loic! will fix it!!

Member:
You get an array taking twice the memory when 'int32' is cast to 'float64', but this may not happen that much in practice, I guess.

Member:
I'm not sure that would be too difficult to do, but I also don't know if it's high up on the priority list. We could fix that up if someone files an issue saying that they run out of memory after feeding integer data into some model.
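To make the trade-off in this thread concrete, a small sketch (assumed behaviour of these helpers at the time of the PR, worth re-checking against master) comparing the two validators on float32 and int32 input:

import numpy as np
from sklearn.utils import as_float_array
from sklearn.utils.validation import check_array, FLOAT_DTYPES

X_f32 = np.ones((3, 2), dtype=np.float32)
X_i32 = np.ones((3, 2), dtype=np.int32)

# check_array with FLOAT_DTYPES keeps existing float dtypes but converts any
# integer input to float64, doubling the memory for int32 data.
print(check_array(X_f32, dtype=FLOAT_DTYPES).dtype)   # float32
print(check_array(X_i32, dtype=FLOAT_DTYPES).dtype)   # float64

# as_float_array converts int32 to float32 instead, keeping the same footprint.
print(as_float_array(X_f32).dtype)                    # float32
print(as_float_array(X_i32).dtype)                    # float32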

if X.ndim == 1:
warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)

Expand All @@ -341,7 +341,7 @@ def inverse_transform(self, X):
"""
check_is_fitted(self, 'scale_')

X = check_array(X, copy=self.copy, ensure_2d=False)
X = check_array(X, copy=self.copy, ensure_2d=False, dtype=FLOAT_DTYPES)
if X.ndim == 1:
warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)

@@ -1068,14 +1068,14 @@ class PolynomialFeatures(BaseEstimator, TransformerMixin):
[4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1, 0, 1, 0, 0, 1],
[ 1, 2, 3, 4, 6, 9],
[ 1, 4, 5, 16, 20, 25]])
array([[ 1., 0., 1., 0., 0., 1.],
[ 1., 2., 3., 4., 6., 9.],
[ 1., 4., 5., 16., 20., 25.]])
>>> poly = PolynomialFeatures(interaction_only=True)
>>> poly.fit_transform(X)
array([[ 1, 0, 1, 0],
[ 1, 2, 3, 6],
[ 1, 4, 5, 20]])
array([[ 1., 0., 1., 0.],
[ 1., 2., 3., 6.],
[ 1., 4., 5., 20.]])

Attributes
----------
@@ -1149,7 +1149,7 @@ def transform(self, X, y=None):
"""
check_is_fitted(self, ['n_input_features_', 'n_output_features_'])

X = check_array(X)
X = check_array(X, dtype=FLOAT_DTYPES)
n_samples, n_features = X.shape

if n_features != self.n_input_features_:
@@ -1313,7 +1313,7 @@ def binarize(X, threshold=0.0, copy=True):

Parameters
----------
XX : {array-like, sparse matrix}, shape [n_samples, n_features]
X : {array-like, sparse matrix}, shape [n_samples, n_features]
The data to binarize, element by element.
scipy.sparse matrices should be in CSR or CSC format to avoid an
un-necessary copy.
@@ -1438,7 +1438,7 @@ def fit(self, K, y=None):
-------
self : returns an instance of self.
"""
K = check_array(K)
K = check_array(K, dtype=FLOAT_DTYPES)
n_samples = K.shape[0]
self.K_fit_rows_ = np.sum(K, axis=0) / n_samples
self.K_fit_all_ = self.K_fit_rows_.sum() / n_samples
Expand All @@ -1461,9 +1461,7 @@ def transform(self, K, y=None, copy=True):
"""
check_is_fitted(self, 'K_fit_all_')

K = check_array(K)
if copy:
K = K.copy()
K = check_array(K, copy=copy, dtype=FLOAT_DTYPES)

K_pred_cols = (np.sum(K, axis=1) /
               self.K_fit_rows_.shape[0])[:, np.newaxis]
@@ -1503,7 +1501,7 @@ def add_dummy_feature(X, value=1.0):
array([[ 1., 0., 1.],
[ 1., 1., 0.]])
"""
X = check_array(X, accept_sparse=['csc', 'csr', 'coo'])
X = check_array(X, accept_sparse=['csc', 'csr', 'coo'], dtype=FLOAT_DTYPES)
n_samples, n_features = X.shape
shape = (n_samples, n_features + 1)
if sparse.issparse(X):
@@ -1558,7 +1556,7 @@ def _transform_selected(X, transform, selected="all", copy=True):
if selected == "all":
return transform(X)

X = check_array(X, accept_sparse='csc', copy=copy)
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

if len(selected) == 0:
return X
8 changes: 8 additions & 0 deletions sklearn/preprocessing/tests/test_data.py
@@ -1,3 +1,4 @@

# Authors:
#
# Giorgio Patrini
@@ -1247,6 +1248,13 @@ def test_binarizer():
X_bin = binarizer.transform(X)
if init is not list:
assert_true(X_bin is X)

binarizer = Binarizer(copy=False)
X_float = np.array([[1, 0, 5], [2, 3, -1]], dtype=np.float64)
X_bin = binarizer.transform(X_float)
if init is not list:
assert_true(X_bin is X_float)

X_bin = toarray(X_bin)
assert_equal(np.sum(X_bin == 0), 2)
assert_equal(np.sum(X_bin == 1), 4)
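A quick standalone sketch of the behaviour the added assertions cover: with float input and copy=False, Binarizer is expected to work in place and return the very same array object.

import numpy as np
from sklearn.preprocessing import Binarizer

X_float = np.array([[1, 0, 5], [2, 3, -1]], dtype=np.float64)
binarizer = Binarizer(copy=False)           # threshold defaults to 0.0
X_bin = binarizer.transform(X_float)

print(X_bin is X_float)                     # True: no copy was made
print(X_bin)                                # entries binarized to 0. / 1. in place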
19 changes: 16 additions & 3 deletions sklearn/preprocessing/tests/test_imputation.py
@@ -1,3 +1,4 @@

import numpy as np
from scipy import sparse

@@ -12,7 +13,7 @@
from sklearn import grid_search
from sklearn import tree
from sklearn.random_projection import sparse_random_matrix


def _check_statistics(X, X_true,
strategy, statistics, missing_values):
@@ -121,6 +122,18 @@ def test_imputation_mean_median_only_zero():
statistics_median, 0)


def safe_median(arr, *args, **kwargs):
# np.median([]) raises a TypeError for numpy >= 1.10.1
length = arr.size if hasattr(arr, 'size') else len(arr)
return np.nan if length == 0 else np.median(arr, *args, **kwargs)


def safe_mean(arr, *args, **kwargs):
# np.mean([]) raises a RuntimeWarning for numpy >= 1.10.1
Member:
and before, I think. But the statement is not wrong, I guess ^^

length = arr.size if hasattr(arr, 'size') else len(arr)
Member:
the line is missing here now.

raghavrv (Member, Author):
lol sorry :P done :)

return np.nan if length == 0 else np.mean(arr, *args, **kwargs)


def test_imputation_mean_median():
# Test imputation using the mean and median strategies, when
# missing_values != 0.
Expand All @@ -134,9 +147,9 @@ def test_imputation_mean_median():
values = np.arange(1, shape[0]+1)
values[4::2] = - values[4::2]

tests = [("mean", "NaN", lambda z, v, p: np.mean(np.hstack((z, v)))),
tests = [("mean", "NaN", lambda z, v, p: safe_mean(np.hstack((z, v)))),
("mean", 0, lambda z, v, p: np.mean(v)),
("median", "NaN", lambda z, v, p: np.median(np.hstack((z, v)))),
("median", "NaN", lambda z, v, p: safe_median(np.hstack((z, v)))),
("median", 0, lambda z, v, p: np.median(v))]

for strategy, test_missing_values, true_value_fun in tests:
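For completeness, a hedged sketch of the numpy behaviour the safe_mean / safe_median helpers above guard against, per the comments in the diff (exact behaviour depends on the numpy version):

import numpy as np
import warnings

empty = np.array([])

# np.mean([]) returns nan but emits a RuntimeWarning on recent numpy releases.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(np.mean(empty))                              # nan
print([str(w.message) for w in caught])                # includes 'Mean of empty slice'

# The guarded pattern used by the helpers: skip the reduction for empty input,
# avoiding the error np.median([]) raises on numpy >= 1.10.1.
value = np.nan if empty.size == 0 else np.median(empty)
print(value)                                           # nan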