Update cython code to support 64 bit indexed sparse inputs · Issue #2969 · scikit-learn/scikit-learn


Closed · 12 of 45 tasks
ogrisel opened this issue Mar 14, 2014 · 9 comments

@ogrisel (Member) commented Mar 14, 2014

In scipy master (to be released as 0.14), scipy sparse matrices can now be indexed with 64 bit integers.

This means that we will probably need to use fused types for indptr and indices arrays whenever we deal with CSC or CSR datastructures in our Cython code base.
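For illustration, a 64 bit indexed CSR matrix can be simulated by casting the index arrays by hand (scipy performs this promotion automatically once the number of stored non-zeros no longer fits in an int32); any Cython routine reading indices / indptr then has to accept int64 values:

import numpy as np
import scipy.sparse as sp

# A small CSR matrix; by default its index arrays are 32 bit.
X = sp.rand(10, 1000, format='csr', random_state=0)
print(X.indices.dtype, X.indptr.dtype)   # int32 int32

# Cast the index arrays to int64 to mimic a matrix with more than
# 2**31 - 1 stored values.
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)
print(X.indices.dtype, X.indptr.dtype)   # int64 int64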

Edit by @rth in Nov 2017: added the status of support for 64 bit CSR indices in different parts of the code, as discussed in #2969 (comment).

  • preprocessing_normalize[l1-False]
  • preprocessing_normalize[l1-True]
  • preprocessing_normalize[l2-False]
  • preprocessing_normalize[l2-True]
  • decomposition_truncatedsvd[randomized]
  • decomposition_truncatedsvd[arpack]
  • decomposition_nmf[cd]
  • decomposition_nmf[mu]
  • linear_model_logisticregression[liblinear-l1]
  • linear_model_sgdclassifier[modified_huber-elasticnet]
  • linear_model_sgdclassifier[squared_hinge-l1]
  • linear_model_sgdclassifier[squared_hinge-l2]
  • linear_model_sgdclassifier[squared_hinge-elasticnet]
  • linear_model_sgdclassifier[perceptron-l1]
  • linear_model_sgdclassifier[perceptron-l2]
  • linear_model_sgdclassifier[perceptron-elasticnet]
  • linear_model_sgdregressor[squared_loss-l1]
  • linear_model_sgdregressor[squared_loss-l2]
  • linear_model_sgdregressor[squared_loss-elasticnet]
  • linear_model_sgdregressor[huber-l1]
  • linear_model_sgdregressor[huber-l2]
  • linear_model_sgdregressor[huber-elasticnet]
  • linear_model_sgdregressor[epsilon_insensitive-l1]
  • linear_model_sgdregressor[epsilon_insensitive-l2]
  • linear_model_sgdregressor[epsilon_insensitive-elasticnet]
  • linear_model_sgdregressor[squared_epsilon_insensitive-l1]
  • linear_model_sgdregressor[squared_epsilon_insensitive-l2]
  • linear_model_sgdregressor[squared_epsilon_insensitive-elasticnet]
  • linear_model_estimator[linear_model.LinearRegression]
  • linear_model_estimator[linear_model.ElasticNet]
  • linear_model_estimator[svm.LinearSVC]
  • linear_model_logisticregression[liblinear-l2]
  • linear_model_sgdclassifier[log-l1]
  • linear_model_sgdclassifier[log-l2]
  • linear_model_sgdclassifier[log-elasticnet]
  • linear_model_sgdclassifier[modified_huber-l1]
  • linear_model_sgdclassifier[modified_huber-l2]
  • linear_model_estimator[svm.SVC]
  • linear_model_estimator[tree.DecisionTreeClassifier]
  • linear_model_logisticregression[newton-cg-l2]
  • linear_model_logisticregression[lbfgs-l2]
  • linear_model_logisticregression[sag-l2]
  • linear_model_sgdclassifier[hinge-l1]
  • linear_model_sgdclassifier[hinge-l2]
  • linear_model_sgdclassifier[hinge-elasticnet]
@larsmans (Member):

Except that when we construct sparse matrices, we should still use int / np.intc because older SciPy won't handle larger index arrays.

For this reason, I suggest we do this on an as-needed basis (handling more than 2 billion points at once isn't a good idea for most algorithms anyway...)
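As a rough sketch of what "as-needed" handling could look like (downcast_csr_indices is a hypothetical helper, not existing scikit-learn API): when the values fit, the index arrays can be cast back to np.intc before handing the matrix to code that only understands 32 bit indices.

import numpy as np
import scipy.sparse as sp

def downcast_csr_indices(X):
    """Cast CSR index arrays to 32 bit when the values allow it (illustrative only)."""
    int32_max = np.iinfo(np.intc).max
    # indptr values are bounded by nnz; column indices are bounded by the shape.
    if X.nnz <= int32_max and max(X.shape) <= int32_max:
        X.indices = X.indices.astype(np.intc)
        X.indptr = X.indptr.astype(np.intc)
    return X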

@ogrisel (Member, Author) commented Mar 16, 2014

Indeed. However, one should keep in mind that the 2 billion limit means 2 billion non-zero values, not 2 billion samples. That is still big, but it could happen for a large text corpus vectorized with the hashing vectorizer, for instance. And both MiniBatchKMeans and SGDClassifier (and friends) could deal with such a training set.
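Back-of-the-envelope arithmetic behind that remark (illustrative numbers only):

# 25 million documents with roughly 100 hashed features each already exceed
# the int32 limit on the number of stored non-zero values.
n_docs, nnz_per_doc = 25 * 10**6, 100
print(n_docs * nnz_per_doc)  # 2500000000
print(2**31 - 1)             # 2147483647, the int32 limit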

@larsmans (Member):

Related SciPy bug: scipy/scipy#3465

@rth (Member) commented Jun 19, 2017

To give an update on this issue, below is the test status of a few common estimators applied to a CSR array with 64 bit indices. These results were obtained by running test_64bit_csr_indices.py on master; the full traceback can be found here.

A short summary:

  1. Anything that uses liblinear (and possibly other bundled C, as opposed to Cython, code) will segfault when given CSR arrays with 64 bit indices (e.g. LogisticRegression(), LinearSVC(), etc.). This is fairly critical IMO, and even if sparse arrays with 64 bit indices won't be supported there in the near future (or at all), it would be good to check the indices dtype and raise a Python exception when appropriate (see the sketch after the test output below). This is also the reason these tests need to be run with pytest-xdist using the -n 1 option, so that pytest can recover from a crashed interpreter.
  2. To make the SGD related estimators work, sklearn.utils.seq_dataset.CSRDataset (at least) needs to gain support for 64 bit indices.
  3. To make normalization work, sklearn.utils.sparsefuncs_fast.inplace_csr_row_normalize_l2 etc. need to be updated.
$ py.test -sv -n 1 test_64bit_csr_indices.py
============================= test session starts ==============================
platform linux -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
plugins: xdist-1.16.0

[gw0] FAILED test_preprocessing_normalize[l1-False] 
[gw0] PASSED test_preprocessing_normalize[l1-True] 
[gw0] FAILED test_preprocessing_normalize[l2-False] 
[gw0] PASSED test_preprocessing_normalize[l2-True] 
[gw0] PASSED test_decomposition_truncatedsvd[randomized] 
[gw0] PASSED test_decomposition_truncatedsvd[arpack] 
[gw0] PASSED test_decomposition_nmf[cd] 
[gw0] PASSED test_decomposition_nmf[mu] 
[gw0] FAILED test_linear_model_logisticregression[liblinear-l1] 
[gw1] FAILED test_linear_model_sgdclassifier[modified_huber-elasticnet] 
[gw1] FAILED test_linear_model_sgdclassifier[squared_hinge-l1] 
[gw1] FAILED test_linear_model_sgdclassifier[squared_hinge-l2] 
[gw1] FAILED test_linear_model_sgdclassifier[squared_hinge-elasticnet] 
[gw1] FAILED test_linear_model_sgdclassifier[perceptron-l1] 
[gw1] FAILED test_linear_model_sgdclassifier[perceptron-l2] 
[gw1] FAILED test_linear_model_sgdclassifier[perceptron-elasticnet] 
[gw1] FAILED test_linear_model_sgdregressor[squared_loss-l1] 
[gw1] FAILED test_linear_model_sgdregressor[squared_loss-l2] 
[gw1] FAILED test_linear_model_sgdregressor[squared_loss-elasticnet] 
[gw1] FAILED test_linear_model_sgdregressor[huber-l1] 
[gw1] FAILED test_linear_model_sgdregressor[huber-l2] 
[gw1] FAILED test_linear_model_sgdregressor[huber-elasticnet] 
[gw1] FAILED test_linear_model_sgdregressor[epsilon_insensitive-l1] 
[gw1] FAILED test_linear_model_sgdregressor[epsilon_insensitive-l2] 
[gw1] FAILED test_linear_model_sgdregressor[epsilon_insensitive-elasticnet] 
[gw1] FAILED test_linear_model_sgdregressor[squared_epsilon_insensitive-l1] 
[gw1] FAILED test_linear_model_sgdregressor[squared_epsilon_insensitive-l2] 
[gw1] FAILED test_linear_model_sgdregressor[squared_epsilon_insensitive-elasticnet] 
[gw1] PASSED test_linear_model_estimator[linear_model.LinearRegression] 
[gw1] PASSED test_linear_model_estimator[linear_model.ElasticNet] 
[gw1] FAILED test_linear_model_estimator[svm.LinearSVC] 
[gw2] FAILED test_linear_model_logisticregression[liblinear-l2] 
[gw3] FAILED test_linear_model_sgdclassifier[log-l1] 
[gw3] FAILED test_linear_model_sgdclassifier[log-l2] 
[gw3] FAILED test_linear_model_sgdclassifier[log-elasticnet] 
[gw3] FAILED test_linear_model_sgdclassifier[modified_huber-l1] 
[gw3] FAILED test_linear_model_sgdclassifier[modified_huber-l2] 
[gw3] FAILED test_linear_model_estimator[svm.SVC] 
[gw3] PASSED test_linear_model_estimator[tree.DecisionTreeClassifier] 
[gw3] PASSED test_linear_model_logisticregression[newton-cg-l2] 
[gw3] PASSED test_linear_model_logisticregression[lbfgs-l2] 
[gw3] FAILED test_linear_model_logisticregression[sag-l2] 
[gw3] FAILED test_linear_model_sgdclassifier[hinge-l1] 
[gw3] FAILED test_linear_model_sgdclassifier[hinge-l2] 
[gw3] FAILED test_linear_model_sgdclassifier[hinge-elasticnet] 
==================== 34 failed, 11 passed in 52.64 seconds =====================
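Regarding point 1 above, a hypothetical input check (not existing scikit-learn API) could turn the segfault into a readable Python error until 64 bit indices are actually supported; a minimal sketch:

import numpy as np
import scipy.sparse as sp

def _check_index_dtype(X):
    """Reject CSR/CSC input whose index arrays are not 32 bit (illustrative sketch)."""
    if sp.issparse(X) and X.format in ('csr', 'csc') and X.indices.dtype != np.int32:
        raise ValueError(
            "Sparse input with %r index arrays is not supported by this "
            "estimator; please convert indices and indptr to int32."
            % X.indices.dtype)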

@jnothman (Member) commented Jun 19, 2017 via email

@jnothman (Member):
It would be good to have this in the form of an estimator check.

@rth (Member) commented Sep 3, 2017
> To make the SGD related estimators work, sklearn.utils.seq_dataset.CSRDataset (at least) needs to gain support for 64 bit indices.

Actually, this is probably a non-issue, since SGD related estimators are likely to be trained in batches with partial_fit anyway, in which case scipy will automatically downcast the indices to 32 bit when slicing:

import scipy.sparse

# Build a small CSR matrix and force its index arrays to 64 bit.
X = scipy.sparse.rand(10, 1000, format='csr')
X.indices = X.indices.astype('int64')
X.indptr = X.indptr.astype('int64')
print(X.indices.dtype)       # -> "int64"
# Slicing rebuilds the index arrays, and scipy downcasts them to 32 bit.
print(X[:5].indices.dtype)   # -> "int32"

@jnothman (Member) commented Sep 3, 2017 via email

@jnothman (Member):
I vote we close this and open individual issues for SGD, Libsvm (maybe never fix?) and liblinear (probably never fix)
