ValueError: Buffer dtype mismatch, expected 'int' but got 'long' #13526

Closed
Hoeze opened this issue Mar 26, 2019 · 7 comments

Comments

@Hoeze
Hoeze commented Mar 26, 2019

I'm trying to fit a logistic regression on a sparse matrix, but the fit fails with a ValueError:

model = LogisticRegression(
    C=1,
    solver='sag',
    random_state=0,
    tol=0.0001,
    max_iter=100,
    verbose=1,
    warm_start=True,
    n_jobs=64,
    penalty='l2',
    dual=False,
    multi_class='ovr',
)
model.fit(train_data.inputs, train_data.targets)
/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/sklearn/base.py:253: UserWarning: Trying to unpickle estimator StandardScaler from version 0.20.0 when using version 0.20.3. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
[Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers.
Traceback (most recent call last):
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-4-5936f410e945>", line 58, in <module>
    model.fit(train_data_cadd.inputs, train_data_cadd.targets)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/comet_ml/monkey_patching.py", line 244, in wrapper
    return_value = original(*args, **kwargs)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/sklearn/linear_model/logistic.py", line 1363, in fit
    for class_, warm_start_coef_ in zip(classes_, warm_start_coef))
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 930, in __call__
    self.retrieve()
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 833, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 567, in __call__
    return self.func(*args, **kwargs)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/sklearn/linear_model/logistic.py", line 792, in logistic_regression_path
    is_saga=(solver == 'saga'))
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/sklearn/linear_model/sag.py", line 305, in sag_solver
    dataset, intercept_decay = make_dataset(X, y, sample_weight, random_state)
  File "/opt/modules/i12g/anaconda/3-5.0.1/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 84, in make_dataset
    seed=seed)
  File "sklearn/utils/seq_dataset.pyx", line 259, in sklearn.utils.seq_dataset.CSRDataset.__cinit__
ValueError: Buffer dtype mismatch, expected 'int' but got 'long'
>>> sklearn.__version__
Out[5]: '0.20.3'
>>> train_data.inputs
Out[6]: 
<28034374x904 sparse matrix of type '<class 'numpy.float32'>'
	with 2223406363 stored elements in Compressed Sparse Row format>

Is there some way I can still train on my data?

Originally posted by @Hoeze in #10758 (comment)

@Hoeze changed the title from "I still got the same error with v0.20.3" to "ValueError: Buffer dtype mismatch, expected 'int' but got 'long'" on Mar 26, 2019
@jnothman
Member

Please provide a complete, runnable example that reproduces your issue. It's very clear that the code you provided has little to do with the output provided.

@Hoeze
Author
Hoeze commented Mar 26, 2019

@jnothman

import numpy as np
import scipy.sparse
from sklearn.linear_model import LogisticRegression

size = 222

# a handful of random non-zeros in a matrix with the same shape as the real data
data = np.random.uniform(size=size)
row = np.random.randint(low=0, high=28034374, size=size, dtype=int)
col = np.random.randint(low=0, high=904, size=size, dtype=int)

inputs = scipy.sparse.csr_matrix((data, (row, col)), shape=(28034374, 904))
# force 64-bit index arrays, mimicking what SciPy produces automatically
# once a matrix holds more than 2**31 - 1 stored elements
inputs.indptr = inputs.indptr.astype(np.int64)
inputs.indices = inputs.indices.astype(np.int64)

targets = np.random.randint(low=0, high=2, size=[28034374])



model = LogisticRegression(
    C=1,
    solver='sag',
    random_state=0,
    tol=0.0001,
    max_iter=100,
    verbose=1,
    warm_start=True,
    n_jobs=64,
    penalty='l2',
    dual=False,
    multi_class='ovr',
)
model.fit(inputs, targets)

In this example I had to cast the indices to int64 by hand. In my real dataset, however, which consists of vstacked CSR matrices, those int64 index arrays are already in place because the matrix holds more than 2**31 - 1 stored elements.

I also tried casting the indices of my real data to int32, but then Python simply crashes with exit code -1.
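
A minimal sketch of why that int32 cast cannot work here (just an illustration, reusing the element count from the traceback above): once a CSR matrix stores more than 2**31 - 1 elements, the last indptr value no longer fits in a signed 32-bit integer, and downcasting wraps around instead of raising.

import numpy as np

int32_max = np.iinfo(np.int32).max   # 2_147_483_647

# the matrix in the traceback holds ~2.22 billion stored elements,
# so its final indptr value already exceeds the int32 range
nnz = 2_223_406_363
print(nnz > int32_max)               # True

# casting such an index array down to int32 silently wraps around,
# corrupting the matrix, which is consistent with the hard crash
indptr_tail = np.array([nnz], dtype=np.int64)
print(indptr_tail.astype(np.int32))  # [-2071560933]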

@rth
Member
rth commented Mar 26, 2019

Thanks for the report @Hoeze and the reproducible example!

It is a known issue reported in #11355

Closing this as a duplicate (to avoid splitting discussions), but a contribution to address this issue would be very welcome.

@rth rth closed this as completed Mar 26, 2019
@rth
Member
rth commented Mar 26, 2019

(Also please feel free to comment here or in the above-linked issue to continue the discussion).

@yujianll

Hi, I also ran into this issue. Is there any workaround for this?

Thanks!

@Hoeze
Author
Hoeze commented May 18, 2021

@yujianll you could go for:

  1. use a dense matrix if it fits in memory
  2. train iteratively on batches of the full dataset, each with fewer than 2^31 stored elements, maybe even using Dask (see the sketch below)
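
A rough sketch of option 2 (an illustration only, not something I ran on the dataset from this issue; SGDClassifier with a log loss stands in for LogisticRegression because it supports partial_fit, and on scikit-learn < 1.1 the argument is loss='log' rather than 'log_loss'):

import numpy as np
from sklearn.linear_model import SGDClassifier

def fit_in_batches(inputs, targets, batch_rows=1_000_000):
    # incremental logistic-style model, trained one slice of rows at a time
    clf = SGDClassifier(loss="log_loss", penalty="l2", random_state=0)
    classes = np.unique(targets)
    n_rows = inputs.shape[0]
    for start in range(0, n_rows, batch_rows):
        stop = min(start + batch_rows, n_rows)
        X_batch = inputs[start:stop]
        # each slice holds far fewer than 2**31 stored elements, so its
        # index arrays fit into int32; cast explicitly in case SciPy
        # kept the parent's int64 index dtype
        X_batch.indices = X_batch.indices.astype(np.int32)
        X_batch.indptr = X_batch.indptr.astype(np.int32)
        clf.partial_fit(X_batch, targets[start:stop], classes=classes)
    return clf

model = fit_in_batches(inputs, targets)  # inputs/targets as in the example above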

@yujianll

@Hoeze Thanks!
