[MRG] Large sparse matrix support by jnothman · Pull Request #11327 · scikit-learn/scikit-learn · GitHub

[MRG] Large sparse matrix support #11327


Merged · 39 commits · Jun 25, 2018
Conversation

@jnothman (Member) commented Jun 20, 2018

Supersedes and closes #9678, given @kdhingra307's silence
Fixes #9545
Fixes #4149
Partially addresses #2969

TODO:

  • set accept_large_sparse=True by default
  • review error message wording

Dhingra and others added 25 commits September 3, 2017 05:46

  • …sparse to check_array and new functon sparse_indices_check
  • …into sparse_indices_check
  • Merging changes between fork and master: …arse validation function has been extended to COO and CSC too
@jnothman jnothman added this to the 0.20 milestone Jun 20, 2018
@jnothman jnothman changed the title [WIP] Large sparse matrix support [MRG] Large sparse matrix support Jun 20, 2018
@jnothman jnothman changed the title [MRG] Large sparse matrix support [WIP] Large sparse matrix support Jun 20, 2018
@jnothman jnothman changed the title [WIP] Large sparse matrix support [MRG] Large sparse matrix support Jun 20, 2018
@rth (Member) left a comment

Thanks for continuing this PR!


def test_check_array_accept_large_sparse_no_exception(X_64bit):
    # When large sparse are allowed
    if LARGE_SPARSE_SUPPORTED:
Member:

Maybe mark this with pytest.mark.skipif(not LARGE_SPARSE_SUPPORTED) instead? Otherwise we will be generating 64-bit sparse matrices in the fixture with a scipy that doesn't support them.

Member Author:

No, we want to do this: we want to test what happens if the user passes one that they could not have constructed directly with scipy.
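The disagreement above hinges on being able to hand an estimator a 64-bit-indexed sparse matrix even when scipy would not build one by default. A minimal sketch of how such an input can be constructed (assuming numpy and scipy; this is the casting trick, not the PR's actual fixture):

```python
import numpy as np
import scipy.sparse as sp

# Build an ordinary CSR matrix, then cast its index arrays to int64 to
# simulate a "large" sparse matrix without allocating > 2**31 entries.
X = sp.random(10, 10, density=0.3, format="csr", random_state=0)
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)
print(X.indices.dtype)  # int64
```

Because only the index dtype changes, the matrix stays tiny but exercises the same validation path as a genuinely large one.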



def test_check_array_accept_large_sparse_raise_exception(X_64bit):
    print(X_64bit)
Member:

Forgotten print statement.

Member Author:

And strangely, I think the forgotten print statement was causing the test failure on old scipy

@@ -297,6 +303,9 @@ def _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy,
    if isinstance(accept_sparse, six.string_types):
        accept_sparse = [accept_sparse]

    # Indices Datatype regulation
Member:
"regulation" -> "validation" ?

@jnothman (Member Author) commented Jun 20, 2018 via email

@ogrisel (Member) left a comment

LGTM besides the following comments.

@@ -973,7 +973,8 @@ def non_negative_factorization(X, W=None, H=None, n_components=None,
     factorization with the beta-divergence. Neural Computation, 23(9).
     """

-    X = check_array(X, accept_sparse=('csr', 'csc'), dtype=float)
+    X = check_array(X, accept_sparse=('csr', 'csc'),
+                    dtype=float)
Member:

cosmetics: this change looks useless.

@@ -1226,7 +1227,8 @@ def fit_transform(self, X, y=None, W=None, H=None):
     W : array, shape (n_samples, n_components)
         Transformed data.
     """
-    X = check_array(X, accept_sparse=('csr', 'csc'), dtype=float)
+    X = check_array(X, accept_sparse=('csr', 'csc'),
+                    dtype=float)
Member:

same comment here.

@@ -598,6 +640,13 @@ def check_X_y(X, y, accept_sparse=False, dtype="numeric", order=None,
        deprecated in version 0.19 "and will be removed in 0.21. Use
        ``accept_sparse=False`` instead.

    accept_large_sparse : bool (default=True)
        If a CSR, CSC, COO or BSR sparse matrix is supplied and accepted by
        accept_sparse, accept_large_sparse will cause it to be accepted only
Member:

accept_large_sparse=False instead of just accept_large_sparse.
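Once merged, the parameter is exercised through ``check_array``/``check_X_y``; a small usage sketch (assuming scikit-learn >= 0.20, where this PR's behavior is available):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils import check_array

# Simulate a large sparse matrix via 64-bit index arrays.
X = sp.random(5, 5, density=0.5, format="csr", random_state=0)
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)

# Accepted by default (accept_large_sparse=True).
check_array(X, accept_sparse=True)

# Rejected when the estimator cannot handle 64-bit indices.
try:
    check_array(X, accept_sparse=True, accept_large_sparse=False)
except ValueError as exc:
    print(exc)
```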

" to 0.14.0 or above" % scipy_version)
raise TypeError("Only sparse matrices with 32-bit integer"
" indices are accepted. Got %s indices."
% indices_datatype)
Member:

Other scikit-learn input validation checks tend to raise ValueError when the type of the container is ok (array, sparse matrix, dataframe) but the dtype is invalid:

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> LogisticRegression().fit(np.array([['invalid'], ['invalid']], dtype=object), [0, 1])
Traceback (most recent call last):
  File "<ipython-input-8-d601c7dabfdc>", line 1, in <module>
    LogisticRegression().fit(np.array([['invalid'], ['invalid']], dtype=object), [0, 1])
  File "/home/ogrisel/code/scikit-learn/sklearn/linear_model/logistic.py", line 1218, in fit
    order="C")
  File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 671, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 494, in check_array
    array = np.asarray(array, dtype=dtype, order=order)
  File "/home/ogrisel/.virtualenvs/py36/lib/python3.6/site-packages/numpy/core/numeric.py", line 492, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'invalid'

Therefore we might want to raise ValueError here.

Member Author:

Hmm. We raise TypeError for wrong sparse format. I can change to ValueError...?

Member Author:

I can change this, but I feel like TypeError is more precise.

@@ -433,6 +434,40 @@ def pairwise_estimator_convert_X(X, estimator, kernel=linear_kernel):
    return X


def _generate_sparse_matrix(X_csr):
    """Generate sparse matrices with {32,64}bit indices of diverse format
Member:

Side comment: once this is merged, I think there are some other existing tests where this function could be useful (cc @glemaitre )
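A hypothetical re-implementation of such a generator (names and details assumed here, not the actual test helper):

```python
import numpy as np
import scipy.sparse as sp

def generate_sparse_variants(X_csr):
    """Yield X in several sparse formats, each with 32- and 64-bit indices."""
    for fmt in ("csr", "csc", "coo", "bsr"):
        X = X_csr.asformat(fmt)
        yield X
        # 64-bit variant: cast whichever attributes hold the indices.
        X64 = X.copy()
        if fmt == "coo":
            X64.row = X64.row.astype(np.int64)
            X64.col = X64.col.astype(np.int64)
        else:
            X64.indices = X64.indices.astype(np.int64)
            X64.indptr = X64.indptr.astype(np.int64)
        yield X64

X = sp.random(6, 6, density=0.4, format="csr", random_state=0)
variants = list(generate_sparse_variants(X))
print(len(variants))  # 8: four formats, two index widths each
```

Parametrizing tests over such variants is what lets a single test body cover every format/index-width combination.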

@jnothman (Member Author):

Want to give an opinion on ValueError vs TypeError?? ;)

@ogrisel (Member) commented Jun 22, 2018

I think I'm still in favor (+0) of ValueError for consistency's sake, considering the existing behavior of our estimators when fed numpy arrays with an invalid dtype.

@ogrisel (Member) commented Jun 22, 2018

One more data point: we explicitly reject complex dtyped arrays with ValueError in sklearn validation (not a numpy side-effect):

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> LogisticRegression().fit(np.array([[1.2 + 1j]]), [0])
Traceback (most recent call last):
  File "<ipython-input-9-70badd21e619>", line 1, in <module>
    LogisticRegression().fit(np.array([[1.2 + 1j]]), [0])
  File "/home/ogrisel/code/scikit-learn/sklearn/linear_model/logistic.py", line 1218, in fit
    order="C")
  File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 671, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 497, in check_array
    "{}\n".format(array))
ValueError: Complex data not supported
[[1.2+1.j]]

@jnothman (Member Author):

@rth does this have your +1?

@rth (Member) left a comment

I don't have a strong opinion on TypeError vs ValueError. I feel that for complex arrays (and possibly 32/64-bit indices) a TypeError would have made more sense; at least scipy raises it in case of a dtype mismatch (see e.g. scipy/scipy#8360 (comment) or this line). But ValueError seems fine for consistency's sake.

LGTM. Thanks @jnothman !

@rth rth merged commit a7e1711 into scikit-learn:master Jun 25, 2018
@jnothman (Member Author) commented Jun 25, 2018 via email
