index type `np.int32_t` causes issue in `_csr_polynomial_expansion` #16803

jianlingzhong · 2020-03-30T00:33:46Z

I ran into an issue when trying to construct polynomial expansion features from a large sparse matrix input:

import scipy as sp
import scipy.sparse
from sklearn.preprocessing import PolynomialFeatures

# <10000x120006 sparse matrix of type '<class 'numpy.float64'>'
#  with 4800 stored elements in COOrdinate format>
x = sp.sparse.rand(10000, 120006, density=0.000004)

# interaction-only degree-2 expansion; the last line raises the ValueError below
pf = PolynomialFeatures(interaction_only=True, include_bias=False, degree=2)
xinter = pf.fit_transform(x)

This fails with ValueError: negative column index found:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-a32d56bebd65> in <module>
----> 1 xinter = pf.fit_transform(x)

~/anaconda2/envs/py37/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    688         if y is None:
    689             # fit method of arity 1 (unsupervised transformation)
--> 690             return self.fit(X, **fit_params).transform(X)
    691         else:
    692             # fit method of arity 2 (supervised transformation)

~/anaconda2/envs/py37/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in transform(self, X)
   1571                     break
   1572                 to_stack.append(Xp_next)
-> 1573             XP = sparse.hstack(to_stack, format='csr')
   1574         elif sparse.isspmatrix_csc(X) and self.degree < 4:
   1575             return self.transform(X.tocsr()).tocsc()

~/anaconda2/envs/py37/lib/python3.7/site-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
    463 
    464     """
--> 465     return bmat([blocks], format=format, dtype=dtype)
    466 
    467 

~/anaconda2/envs/py37/lib/python3.7/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
    572         for j in range(N):
    573             if blocks[i,j] is not None:
--> 574                 A = coo_matrix(blocks[i,j])
    575                 blocks[i,j] = A
    576                 block_mask[i,j] = True

~/anaconda2/envs/py37/lib/python3.7/site-packages/scipy/sparse/coo.py in __init__(self, arg1, shape, dtype, copy)
    170                     self._shape = check_shape(arg1.shape)
    171                 else:
--> 172                     coo = arg1.tocoo()
    173                     self.row = coo.row
    174                     self.col = coo.col

~/anaconda2/envs/py37/lib/python3.7/site-packages/scipy/sparse/compressed.py in tocoo(self, copy)
   1016         from .coo import coo_matrix
   1017         return coo_matrix((self.data, (row, col)), self.shape, copy=copy,
-> 1018                           dtype=self.dtype)
   1019 
   1020     tocoo.__doc__ = spmatrix.tocoo.__doc__

~/anaconda2/envs/py37/lib/python3.7/site-packages/scipy/sparse/coo.py in __init__(self, arg1, shape, dtype, copy)
    196             self.data = self.data.astype(dtype, copy=False)
    197 
--> 198         self._check()
    199 
    200     def reshape(self, *args, **kwargs):

~/anaconda2/envs/py37/lib/python3.7/site-packages/scipy/sparse/coo.py in _check(self)
    289                 raise ValueError('negative row index found')
    290             if self.col.min() < 0:
--> 291                 raise ValueError('negative column index found')
    292 
    293     def transpose(self, axes=None, copy=False):

ValueError: negative column index found

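The problem is not with scipy, which correctly sets the index dtype to int64. The values stored in the column index array have already overflowed by the time scipy sees them: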

> /venv/lib/python3.6/site-packages/scipy/sparse/coo.py(291)_check()
    289                 raise ValueError('negative row index found')
    290             if self.col.min() < 0:
--> 291                 raise ValueError('negative column index found')
    292 
    293     def transpose(self, axes=None, copy=False):

ipdb> self.col.max()
2147482788
ipdb> self.col.dtype
dtype('int64')
ipdb> self.col.min()
-2147480639
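
A rough sketch of why the indices can go negative, assuming the overflow happens in the expanded column index: for an interaction-only degree-2 expansion of 120006 features, the number of output columns is C(120006, 2), which is larger than the int32 maximum. The variable names below are illustrative only and are not taken from the sklearn source.

```python
import numpy as np

n_features = 120006

# Number of degree-2 interaction-only output columns: C(n_features, 2)
n_output = n_features * (n_features - 1) // 2
int32_max = np.iinfo(np.int32).max

print(n_output, int32_max, n_output > int32_max)

# A column index one past the int32 maximum wraps around to a negative
# value when it is stored in a 32-bit integer.
idx = np.array([int32_max + 1], dtype=np.int64)
wrapped = idx.astype(np.int32)
print(wrapped)
```

Upcasting such an array to int64 afterwards does not recover the original values, which would be consistent with the int64 dtype and negative minimum seen in the debugger output above.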

I believe the issue is in sklearn/preprocessing/_data.py, which calls _csr_polynomial_expansion, which in turn uses int32 as the index type in the Cython code:

scikit-learn/sklearn/preprocessing/_csr_polynomial_expansion.pyx

Line 11 in ada94ae

ctypedef np.int32_t INDEX_T

and:

scikit-learn/sklearn/preprocessing/_csr_polynomial_expansion.pyx

Line 51 in ada94ae

ndarray[INDEX_T, ndim=1] indices,

I am wondering if there is a quick fix for this. Thanks!


rth · 2020-03-30T09:23:37Z

Thanks for the report @jianlingzhong !

So scipy csr_matrix will automatically upcast indices and indptr to int64 in cases when int32 is not enough.
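
A small sketch of that upcasting behaviour, assuming a reasonably recent scipy: an empty CSR matrix whose column count fits in int32 gets 32-bit index arrays, while one whose shape exceeds the int32 range gets int64 index arrays.

```python
import numpy as np
from scipy import sparse

# Shape fits comfortably in int32: scipy picks a 32-bit index dtype.
small = sparse.csr_matrix((1, 10))

# 2**31 columns cannot be indexed with int32: scipy picks int64.
large = sparse.csr_matrix((1, 2**31))

print(small.indices.dtype, small.indptr.dtype)
print(large.indices.dtype, large.indptr.dtype)
```

Any Cython routine that declares its index arguments as int32 only will not accept the second kind of matrix without a cast.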

I think we should use Cython fused types in scikit-learn/sklearn/preprocessing/_csr_polynomial_expansion.pyx for indices, to support it.

A pull request to fix it would be welcome.

jianlingzhong · 2020-03-31T02:00:36Z

Thanks @rth. Fused types aren't the only change needed. What I did was change the definition of INDEX_T to

ctypedef fused INDEX_T:
    np.int32_t
    np.int64_t

The code compiles fine. However, the same problem still occurs.

I also need to change how _csr_polynomial_expansion is called, so that the index arrays are explicitly int64:

Xp_next = _csr_polynomial_expansion(X.data,
                                    X.indices.astype(np.int64),
                                    X.indptr.astype(np.int64),
                                    np.int64(X.shape[1]),
                                    self.interaction_only,
                                    deg)

After this, it runs fine on my large input sparse matrix.

This is less than ideal as I imagine we don't need to cast to int64 every time we call _csr_polynomial_expansion.
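
One possible way to avoid the unconditional cast, sketched here with hypothetical helpers (expanded_width and pick_index_dtype are not part of sklearn), would be to compute the width of the expanded block up front and only switch to int64 when that width does not fit in int32. This is only an illustration of the idea, not a tested patch, and it does not account for the number of stored elements overflowing indptr.

```python
from math import comb

import numpy as np


def expanded_width(n_features, degree, interaction_only):
    """Number of output columns for a single degree (hypothetical helper)."""
    if interaction_only:
        # products of `degree` distinct features
        return comb(n_features, degree)
    # monomials of exactly `degree`, repetition allowed
    return comb(n_features + degree - 1, degree)


def pick_index_dtype(n_features, degree, interaction_only):
    """Pick int64 only when the expanded width exceeds the int32 range."""
    width = expanded_width(n_features, degree, interaction_only)
    return np.int64 if width > np.iinfo(np.int32).max else np.int32


print(pick_index_dtype(120006, 2, True))
print(pick_index_dtype(100, 2, True))
```

With something like this, the astype calls above could be made conditional on the chosen dtype, so small inputs would keep their int32 index arrays.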

If you have any suggestions on how to change this quickly, I'd be happy to submit a pull request.

rth added Bug Large Scale labels Mar 30, 2020

cmarmo added the module:preprocessing label Apr 2, 2020

jianlingzhong mentioned this issue Apr 2, 2020

[MRG] FIX index overflow error in sparse matrix polynomial expansion #16831

Closed

wdevazelhes mentioned this issue Mar 15, 2021

[WIP] FIX index overflow error in sparse matrix polynomial expansion (bis) #19676

Closed

frrad mentioned this issue Mar 20, 2021

ENH more efficient _num_combinations calculation in PolynomialFeatures #19734

Merged

thomasjpfan moved this to Delegate📪 in Quansight's scikit-learn Project Board Apr 13, 2022

thomasjpfan added this to Quansight's scikit-learn Project Board Apr 13, 2022

thomasjpfan moved this from Delegate📪 to Todo📬 in Quansight's scikit-learn Project Board Apr 28, 2022

ogrisel mentioned this issue Jun 16, 2022

[RFC] Support for int64 indexed SciPy sparse matrices in Cython code #23653

Open

Micky774 moved this from Todo📬 to In Progress🏗 in Quansight's scikit-learn Project Board Jun 20, 2022

Micky774 mentioned this issue Jun 22, 2022

ENH Allow for appropriate dtype us in preprocessing.PolynomialFeatures for sparse matrices #23731

Merged

ogrisel closed this as completed in #23731 May 4, 2023

github-project-automation bot moved this from In Progress🏗 to Done🚀 in Quansight's scikit-learn Project Board May 4, 2023
