I am training a logistic regression model on a large sparse (CSR) matrix (approx. 34e6 × 1000, about 7% non-zero entries). I am using StandardScaler for preprocessing. When fitting on the whole matrix with StandardScaler.fit, I get ValueError: Buffer dtype mismatch, expected 'int' but got 'long', with the following traceback:
Traceback (most recent call last):
[...]
File "/[...]/sklearn/preprocessing/data.py", line 560, in fit
return self.partial_fit(X, y)
File "/[...]/sklearn/preprocessing/data.py", line 600, in partial_fit
self.mean_, self.var_ = mean_variance_axis(X, axis=0)
File "/[...]/sklearn/utils/sparsefuncs.py", line 90, in mean_variance_axis
return _csr_mean_var_axis0(X)
File "sklearn/utils/sparsefuncs_fast.pyx", line 74, in sklearn.utils.sparsefuncs_fast.csr_mean_variance_axis0 (sklearn/utils/sparsefuncs_fast.c:4248)
File "sklearn/utils/sparsefuncs_fast.pyx", line 77, in sklearn.utils.sparsefuncs_fast._csr_mean_variance_axis0 (sklearn/utils/sparsefuncs_fast.c:5062)
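I assume this is because the number of stored matrix elements exceeds 2 ** 31 - 1, the largest value a signed 32-bit index can hold (the count here is roughly 2.4e9, which is still below 2 ** 32). It does not look like a direct type-casting error: smaller submatrices can be used for fitting, and partial fitting does not result in any errors.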
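For reference, a quick back-of-the-envelope check of the element count against the 32-bit limits. This is only a sketch: the shape and density are the approximate figures quoted above, and the variable names are made up for illustration.

```python
# Approximate figures from the report (not exact values).
n_rows = 34_000_000
n_cols = 1000
density = 0.07

# Estimated number of stored (non-zero) entries in the CSR matrix.
approx_nnz = int(n_rows * n_cols * density)

int32_max = 2**31 - 1   # largest value a signed 32-bit index can hold
uint32_range = 2**32    # size of the full 32-bit range

print("approx. nnz:   ", approx_nnz)
print("int32 max:     ", int32_max)
print("exceeds int32: ", approx_nnz > int32_max)
print("exceeds 2**32: ", approx_nnz > uint32_range)
```

On an actual matrix X, the index width in use can be read from X.indices.dtype and X.indptr.dtype.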
@aerval Yes, StandardScaler doesn't yet support large sparse CSR matrices with 64-bit indices (see #2969 for a more detailed discussion). Applying sklearn.preprocessing.normalize on smaller chunks of the array could be a temporary (and inefficient) workaround...
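Since the report says partial fitting works, a chunked fit might look roughly like the sketch below. Note that it uses StandardScaler.partial_fit rather than normalize. The helper name fit_scaler_in_chunks, the chunk size, and the small random test matrix are all made up for illustration, and it assumes each row block is small enough for its indices to fit in 32 bits.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler


def fit_scaler_in_chunks(X, chunk_rows=1_000_000):
    """Fit a StandardScaler on a CSR matrix by feeding it row blocks.

    Hypothetical helper, not part of scikit-learn.
    """
    # with_mean=False: centering would densify sparse input, and
    # StandardScaler refuses sparse data when with_mean=True.
    scaler = StandardScaler(with_mean=False)
    for start in range(0, X.shape[0], chunk_rows):
        # Row slicing a CSR matrix yields a smaller CSR matrix.
        scaler.partial_fit(X[start:start + chunk_rows])
    return scaler


# Small stand-in matrix; the real one is far too large for an example.
X = sp.random(5000, 20, density=0.07, format="csr", random_state=0)
scaler = fit_scaler_in_chunks(X, chunk_rows=1000)

# The incremental statistics should agree with a single full fit.
reference = StandardScaler(with_mean=False).fit(X)
print(np.allclose(scaler.var_, reference.var_))
```

Whether transform on the full matrix hits the same dtype error is not something this sketch checks; applying scaler.transform block by block in the same loop would be the cautious option.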