-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG] Large sparse matrix support #11327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…sparse to check_array and new functon sparse_indices_check
…into sparse_indices_check Merging changes between fork and master:
…arse validation function has been extended to COO and CSC too
…into work Aligning with the top
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for continuing this PR!
|
||
def test_check_array_accept_large_sparse_no_exception(X_64bit): | ||
# When large sparse are allowed | ||
if LARGE_SPARSE_SUPPORTED: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe mark this with pytest.mark.skipif(not LARGE_SPARSE_SUPPORTED)
instead? otherwise we will be generating 64 bit sparse in the fixture with a scipy that doesn't support it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we want to do this: we want to test what happens if the user passes one that they could not have constructed directly with scipy.
|
||
|
||
def test_check_array_accept_large_sparse_raise_exception(X_64bit): | ||
print(X_64bit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgotten print statement..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And strangely, I think the forgotten print statement was causing the test failure on old scipy
sklearn/utils/validation.py
Outdated
@@ -297,6 +303,9 @@ def _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy, | |||
if isinstance(accept_sparse, six.string_types): | |||
accept_sparse = [accept_sparse] | |||
|
|||
# Indices Datatype regulation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"regulation" -> "validation" ?
Tests are now passing.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM besides the following comments.
sklearn/decomposition/nmf.py
Outdated
@@ -973,7 +973,8 @@ def non_negative_factorization(X, W=None, H=None, n_components=None, | |||
factorization with the beta-divergence. Neural Computation, 23(9). | |||
""" | |||
|
|||
X = check_array(X, accept_sparse=('csr', 'csc'), dtype=float) | |||
X = check_array(X, accept_sparse=('csr', 'csc'), | |||
dtype=float) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cosmetics: this change looks useless.
sklearn/decomposition/nmf.py
Outdated
@@ -1226,7 +1227,8 @@ def fit_transform(self, X, y=None, W=None, H=None): | |||
W : array, shape (n_samples, n_components) | |||
Transformed data. | |||
""" | |||
X = check_array(X, accept_sparse=('csr', 'csc'), dtype=float) | |||
X = check_array(X, accept_sparse=('csr', 'csc'), | |||
dtype=float) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment here.
@@ -598,6 +640,13 @@ def check_X_y(X, y, accept_sparse=False, dtype="numeric", order=None, | |||
deprecated in version 0.19 "and will be removed in 0.21. Use | |||
``accept_sparse=False`` instead. | |||
|
|||
accept_large_sparse : bool (default=True) | |||
If a CSR, CSC, COO or BSR sparse matrix is supplied and accepted by | |||
accept_sparse, accept_large_sparse will cause it to be accepted only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accept_large_sparse=False
instead of just accept_large_sparse
.
sklearn/utils/validation.py
Outdated
" to 0.14.0 or above" % scipy_version) | ||
raise TypeError("Only sparse matrices with 32-bit integer" | ||
" indices are accepted. Got %s indices." | ||
% indices_datatype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other scikit-learn input validation checks tend to raise ValueError
when the type of the container is ok (array, sparse matrix, dataframe) but the dtype is invalid:
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> LogisticRegression().fit(np.array([['invalid'], ['invalid']], dtype=object), [0, 1])
Traceback (most recent call last):
File "<ipython-input-8-d601c7dabfdc>", line 1, in <module>
LogisticRegression().fit(np.array([['invalid'], ['invalid']], dtype=object), [0, 1])
File "/home/ogrisel/code/scikit-learn/sklearn/linear_model/logistic.py", line 1218, in fit
order="C")
File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 671, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 494, in check_array
array = np.asarray(array, dtype=dtype, order=order)
File "/home/ogrisel/.virtualenvs/py36/lib/python3.6/site-packages/numpy/core/numeric.py", line 492, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'invalid'
Therefore we might want to raise ValueError
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. We raise TypeError for wrong sparse format. I can change to ValueError...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change this, but I feel like TypeError is more precise.
@@ -433,6 +434,40 @@ def pairwise_estimator_convert_X(X, estimator, kernel=linear_kernel): | |||
return X | |||
|
|||
|
|||
def _generate_sparse_matrix(X_csr): | |||
"""Generate sparse matrices with {32,64}bit indices of diverse format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side comment: once this is merged, I think there are some other existing tests where this function could be useful (cc @glemaitre )
Want to give an opinion on |
I think I'm still in favor (+0) of |
One more data point: we explicitly reject complex dtyped arrays with >>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> LogisticRegression().fit(np.array([[1.2 + 1j]]), [0])
Traceback (most recent call last):
File "<ipython-input-9-70badd21e619>", line 1, in <module>
LogisticRegression().fit(np.array([[1.2 + 1j]]), [0])
File "/home/ogrisel/code/scikit-learn/sklearn/linear_model/logistic.py", line 1218, in fit
order="C")
File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 671, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 497, in check_array
"{}\n".format(array))
ValueError: Complex data not supported
[[1.2+1.j]] |
@rth does this have your +1? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion on the TypeError
vs ValueError
. I feel that for complex arrays (and possibly 32/64 bit indices), a TypeError
would have made more sense. At least scipy does raise it in case of dtype mismatch (see e.g. scipy/scipy#8360 (comment) or this line) But ValueError
seems fine for consistency sake.
LGTM. Thanks @jnothman !
Yay! Great to close some old issues!
|
Supersedes and closes #9678, given @kdhingra307's silence
Fixes #9545
Fixes #4149
Partially addresses #2969
TODO: