BUG: ValueError when using StandardScaler on large sparse matrix · Issue #9575 · scikit-learn/scikit-learn · GitHub

Closed
aerval opened this issue Aug 17, 2017 · 2 comments

aerval commented Aug 17, 2017

I am training a logistic regression model on a large (approx. 34e6 x 1000, about 7% non-zero) sparse (CSR) matrix, using StandardScaler for preprocessing. When preprocessing the whole matrix with StandardScaler.fit, I get ValueError: Buffer dtype mismatch, expected 'int' but got 'long', with the following traceback:

Traceback (most recent call last):
  [...]
  File "/[...]/sklearn/preprocessing/data.py", line 560, in fit
    return self.partial_fit(X, y)
  File "/[...]/sklearn/preprocessing/data.py", line 600, in partial_fit
    self.mean_, self.var_ = mean_variance_axis(X, axis=0)
  File "/[...]/sklearn/utils/sparsefuncs.py", line 90, in mean_variance_axis
    return _csr_mean_var_axis0(X)
  File "sklearn/utils/sparsefuncs_fast.pyx", line 74, in sklearn.utils.sparsefuncs_fast.csr_mean_variance_axis0 (sklearn/utils/sparsefuncs_fast.c:4248)
  File "sklearn/utils/sparsefuncs_fast.pyx", line 77, in sklearn.utils.sparsefuncs_fast._csr_mean_variance_axis0 (sklearn/utils/sparsefuncs_fast.c:5062)

I assume this is because the number of stored matrix elements surpasses 2 ** 31 (though not necessarily as a direct type-casting error, since smaller submatrices can be used for fitting, and partial fitting does not result in any errors).
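A quick back-of-the-envelope check supports that hypothesis (the dimensions and density are taken from the report above; the exact nnz is an estimate derived from them):

```python
# Rough count of stored elements for a 34e6 x 1000 CSR matrix at ~7% density.
n_rows, n_cols, density = 34_000_000, 1000, 0.07
nnz = int(n_rows * n_cols * density)

print(nnz)              # ~2.38 billion stored values
print(nnz > 2**31 - 1)  # True: past the signed 32-bit range, so scipy
                        # promotes the CSR index arrays to 64-bit ints,
                        # which the Cython routine then rejects
```

Note that ~2.38e9 crosses 2 ** 31 but not 2 ** 32; the signed 32-bit limit is what matters for scipy's index dtype promotion.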

rth (Member) commented Aug 17, 2017

@aerval Yes, StandardScaler doesn't yet support large sparse CSR matrices with 64-bit indices (see #2969 for a more detailed discussion). Applying sklearn.preprocessing.normalize on smaller chunks of the array could be a temporary (and inefficient) workaround...
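Since the reporter notes that partial fitting on smaller submatrices does succeed, a chunked workaround might look like the sketch below. The matrix here is small stand-in data; the shape, density, and chunk size are illustrative only:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

# Small stand-in for the large CSR matrix from the report.
rng = np.random.RandomState(0)
X = sp.random(2000, 50, density=0.07, format="csr", random_state=rng)

# with_mean=False: centering a sparse matrix would densify it.
scaler = StandardScaler(with_mean=False)

# Feed row chunks so each partial_fit call sees a matrix whose nnz
# stays well below the 32-bit index limit.
chunk = 500
for start in range(0, X.shape[0], chunk):
    scaler.partial_fit(X[start:start + chunk])

X_scaled = scaler.transform(X)  # still sparse CSR
```

The incremental statistics accumulated by partial_fit match those of a single full fit, so the transform is equivalent; only the fitting is split into pieces.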

aerval (Author) commented Aug 17, 2017

Okay, thanks, I did not find that one.
