BUG: ValueError when using StandardScaler on large sparse matrix · Issue #9575 · scikit-learn/scikit-learn · GitHub

Closed
aerval opened this issue Aug 17, 2017 · 2 comments

aerval commented Aug 17, 2017

I am training a logistic regression model on a large (approx. 34e6 x 1000, about 7% non-zero) sparse (CSR) matrix, using StandardScaler for preprocessing. When preprocessing the whole matrix with StandardScaler.fit, I get ValueError: Buffer dtype mismatch, expected 'int' but got 'long', with the following traceback:

Traceback (most recent call last):
  [...]
  File "/[...]/sklearn/preprocessing/data.py", line 560, in fit
    return self.partial_fit(X, y)
  File "/[...]/sklearn/preprocessing/data.py", line 600, in partial_fit
    self.mean_, self.var_ = mean_variance_axis(X, axis=0)
  File "/[...]/sklearn/utils/sparsefuncs.py", line 90, in mean_variance_axis
    return _csr_mean_var_axis0(X)
  File "sklearn/utils/sparsefuncs_fast.pyx", line 74, in sklearn.utils.sparsefuncs_fast.csr_mean_variance_axis0 (sklearn/utils/sparsefuncs_fast.c:4248)
  File "sklearn/utils/sparsefuncs_fast.pyx", line 77, in sklearn.utils.sparsefuncs_fast._csr_mean_variance_axis0 (sklearn/utils/sparsefuncs_fast.c:5062)

I assume this is because the number of stored matrix elements surpasses 2 ** 31 (though not necessarily as a direct type-casting error, since smaller submatrices can be used for fitting, and partial fitting does not result in any errors).
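A quick back-of-the-envelope check supports that hypothesis (the dimensions and density are taken from the report above; the exact nnz is an estimate derived from them):

```python
# Rough count of stored elements for a 34e6 x 1000 CSR matrix at ~7% density.
n_rows, n_cols, density = 34_000_000, 1000, 0.07
nnz = int(n_rows * n_cols * density)

print(nnz)              # ~2.38 billion stored values
print(nnz > 2**31 - 1)  # True: past the signed 32-bit range, so scipy
                        # promotes the CSR index arrays to 64-bit ints,
                        # which the Cython routine then rejects
```

Note that ~2.38e9 crosses 2 ** 31 but not 2 ** 32; the signed 32-bit limit is what matters for scipy's index dtype promotion.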

rth (Member) commented Aug 17, 2017

@aerval Yes, StandardScaler doesn't yet support large sparse CSR matrices with 64-bit indices (see #2969 for a more detailed discussion). Applying sklearn.preprocessing.normalize on smaller chunks of the array could be a temporary (and inefficient) workaround...
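Since the reporter notes that partial fitting on smaller submatrices does succeed, a chunked workaround might look like the sketch below. The matrix here is small stand-in data; the shape, density, and chunk size are illustrative only:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

# Small stand-in for the large CSR matrix from the report.
rng = np.random.RandomState(0)
X = sp.random(2000, 50, density=0.07, format="csr", random_state=rng)

# with_mean=False: centering a sparse matrix would densify it.
scaler = StandardScaler(with_mean=False)

# Feed row chunks so each partial_fit call sees a matrix whose nnz
# stays well below the 32-bit index limit.
chunk = 500
for start in range(0, X.shape[0], chunk):
    scaler.partial_fit(X[start:start + chunk])

X_scaled = scaler.transform(X)  # still sparse CSR
```

The incremental statistics accumulated by partial_fit match those of a single full fit, so the transform is equivalent; only the fitting is split into pieces.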

aerval (Author) commented Aug 17, 2017

Okay, thanks, I did not find that one.
