[MRG] Large sparse matrix support by jnothman · Pull Request #11327 · scikit-learn/scikit-learn · GitHub

[MRG] Large sparse matrix support #11327


Merged · 39 commits · Jun 25, 2018
Conversation

@jnothman (Member) commented Jun 20, 2018

Supersedes and closes #9678, given @kdhingra307's silence
Fixes #9545
Fixes #4149
Partially addresses #2969

TODO:

  • set accept_large_sparse=True by default
  • review error message wording

Dhingra and others added 25 commits September 3, 2017 05:46

  • …sparse to check_array and new functon sparse_indices_check
  • …into sparse_indices_check
  • Merging changes between fork and master: …arse validation function has been extended to COO and CSC too
@jnothman jnothman added this to the 0.20 milestone Jun 20, 2018
@jnothman jnothman changed the title [WIP] Large sparse matrix support [MRG] Large sparse matrix support Jun 20, 2018
@jnothman jnothman changed the title [MRG] Large sparse matrix support [WIP] Large sparse matrix support Jun 20, 2018
@jnothman jnothman changed the title [WIP] Large sparse matrix support [MRG] Large sparse matrix support Jun 20, 2018
@rth (Member) left a comment

Thanks for continuing this PR!


def test_check_array_accept_large_sparse_no_exception(X_64bit):
    # When large sparse are allowed
    if LARGE_SPARSE_SUPPORTED:
Member:

Maybe mark this with pytest.mark.skipif(not LARGE_SPARSE_SUPPORTED) instead? Otherwise we will be generating 64-bit sparse matrices in the fixture with a scipy that doesn't support them.

Member Author:

No, we want to do this: we want to test what happens if the user passes one that they could not have constructed directly with scipy.
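The disagreement above hinges on being able to hand an estimator a 64-bit-indexed sparse matrix even when scipy would not build one by default. A minimal sketch of how such an input can be constructed (assuming numpy and scipy; this is the casting trick, not the PR's actual fixture):

```python
import numpy as np
import scipy.sparse as sp

# Build an ordinary CSR matrix, then cast its index arrays to int64 to
# simulate a "large" sparse matrix without allocating > 2**31 entries.
X = sp.random(10, 10, density=0.3, format="csr", random_state=0)
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)
print(X.indices.dtype)  # int64
```

Because only the index dtype changes, the matrix stays tiny but exercises the same validation path as a genuinely large one.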



def test_check_array_accept_large_sparse_raise_exception(X_64bit):
    print(X_64bit)
Member:

Forgotten print statement.

Member Author:

And strangely, I think the forgotten print statement was causing the test failure on old scipy

@@ -297,6 +303,9 @@ def _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy,
    if isinstance(accept_sparse, six.string_types):
        accept_sparse = [accept_sparse]

    # Indices Datatype regulation
Member:
"regulation" -> "validation" ?

@jnothman (Member Author) commented Jun 20, 2018 via email

@ogrisel (Member) left a comment

LGTM besides the following comments.

@@ -973,7 +973,8 @@ def non_negative_factorization(X, W=None, H=None, n_components=None,
     factorization with the beta-divergence. Neural Computation, 23(9).
     """

-    X = check_array(X, accept_sparse=('csr', 'csc'), dtype=float)
+    X = check_array(X, accept_sparse=('csr', 'csc'),
+                    dtype=float)
Member:

cosmetics: this change looks useless.

@@ -1226,7 +1227,8 @@ def fit_transform(self, X, y=None, W=None, H=None):
     W : array, shape (n_samples, n_components)
         Transformed data.
     """
-    X = check_array(X, accept_sparse=('csr', 'csc'), dtype=float)
+    X = check_array(X, accept_sparse=('csr', 'csc'),
+                    dtype=float)
Member:

same comment here.

@@ -598,6 +640,13 @@ def check_X_y(X, y, accept_sparse=False, dtype="numeric", order=None,
        deprecated in version 0.19 "and will be removed in 0.21. Use
        ``accept_sparse=False`` instead.

    accept_large_sparse : bool (default=True)
        If a CSR, CSC, COO or BSR sparse matrix is supplied and accepted by
        accept_sparse, accept_large_sparse will cause it to be accepted only
Member:

accept_large_sparse=False instead of just accept_large_sparse.
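Once merged, the parameter is exercised through ``check_array``/``check_X_y``; a small usage sketch (assuming scikit-learn >= 0.20, where this PR's behavior is available):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils import check_array

# Simulate a large sparse matrix via 64-bit index arrays.
X = sp.random(5, 5, density=0.5, format="csr", random_state=0)
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)

# Accepted by default (accept_large_sparse=True).
check_array(X, accept_sparse=True)

# Rejected when the estimator cannot handle 64-bit indices.
try:
    check_array(X, accept_sparse=True, accept_large_sparse=False)
except ValueError as exc:
    print(exc)
```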

" to 0.14.0 or above" % scipy_version)
raise TypeError("Only sparse matrices with 32-bit integer"
" indices are accepted. Got %s indices."
% indices_datatype)
Member:

Other scikit-learn input validation checks tend to raise ValueError when the type of the container is ok (array, sparse matrix, dataframe) but the dtype is invalid:

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> LogisticRegression().fit(np.array([['invalid'], ['invalid']], dtype=object), [0, 1])
Traceback (most recent call last):
  File "<ipython-input-8-d601c7dabfdc>", line 1, in <module>
    LogisticRegression().fit(np.array([['invalid'], ['invalid']], dtype=object), [0, 1])
  File "/home/ogrisel/code/scikit-learn/sklearn/linear_model/logistic.py", line 1218, in fit
    order="C")
  File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 671, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 494, in check_array
    array = np.asarray(array, dtype=dtype, order=order)
  File "/home/ogrisel/.virtualenvs/py36/lib/python3.6/site-packages/numpy/core/numeric.py", line 492, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'invalid'

Therefore we might want to raise ValueError here.

Member Author:

Hmm. We raise TypeError for wrong sparse format. I can change to ValueError...?

Member Author:

I can change this, but I feel like TypeError is more precise.

@@ -433,6 +434,40 @@ def pairwise_estimator_convert_X(X, estimator, kernel=linear_kernel):
    return X


def _generate_sparse_matrix(X_csr):
    """Generate sparse matrices with {32,64}bit indices of diverse format
Member:

Side comment: once this is merged, I think there are some other existing tests where this function could be useful (cc @glemaitre )
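A hypothetical re-implementation of such a generator (names and details assumed here, not the actual test helper):

```python
import numpy as np
import scipy.sparse as sp

def generate_sparse_variants(X_csr):
    """Yield X in several sparse formats, each with 32- and 64-bit indices."""
    for fmt in ("csr", "csc", "coo", "bsr"):
        X = X_csr.asformat(fmt)
        yield X
        # 64-bit variant: cast whichever attributes hold the indices.
        X64 = X.copy()
        if fmt == "coo":
            X64.row = X64.row.astype(np.int64)
            X64.col = X64.col.astype(np.int64)
        else:
            X64.indices = X64.indices.astype(np.int64)
            X64.indptr = X64.indptr.astype(np.int64)
        yield X64

X = sp.random(6, 6, density=0.4, format="csr", random_state=0)
variants = list(generate_sparse_variants(X))
print(len(variants))  # 8: four formats, two index widths each
```

Parametrizing tests over such variants is what lets a single test body cover every format/index-width combination.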

@jnothman (Member Author):

Want to give an opinion on ValueError vs TypeError?? ;)

@ogrisel (Member) commented Jun 22, 2018

I think I'm still in favor (+0) of ValueError for consistency's sake, considering the existing behavior of our estimators when fed numpy arrays with an invalid dtype.

@ogrisel (Member) commented Jun 22, 2018

One more data point: we explicitly reject complex dtyped arrays with ValueError in sklearn validation (not a numpy side-effect):

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> LogisticRegression().fit(np.array([[1.2 + 1j]]), [0])
Traceback (most recent call last):
  File "<ipython-input-9-70badd21e619>", line 1, in <module>
    LogisticRegression().fit(np.array([[1.2 + 1j]]), [0])
  File "/home/ogrisel/code/scikit-learn/sklearn/linear_model/logistic.py", line 1218, in fit
    order="C")
  File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 671, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 497, in check_array
    "{}\n".format(array))
ValueError: Complex data not supported
[[1.2+1.j]]

@jnothman (Member Author):

@rth does this have your +1?

@rth (Member) left a comment

I don't have a strong opinion on TypeError vs ValueError. I feel that for complex arrays (and possibly 32/64-bit indices) a TypeError would have made more sense; at least scipy raises it in case of a dtype mismatch (see e.g. scipy/scipy#8360 (comment) or this line). But ValueError seems fine for consistency's sake.

LGTM. Thanks @jnothman !

@rth rth merged commit a7e1711 into scikit-learn:master Jun 25, 2018
@jnothman (Member Author) commented Jun 25, 2018 via email
