Cython code for PolynomialFeatures should use int64s for indices. #17554

AWNystrom · 2020-06-10T06:38:15Z

Describe the workflow you want to enable

The code in preprocessing/_csr_polynomial_expansion.pyx produces the components of a CSR matrix with column values represented as int32s. For inputs with a sufficiently large number of columns, this can cause overflow, which leads to negative indices in the resulting CSR matrix.

Here's an instance of a user running into this problem:
https://stackoverflow.com/questions/60920877/scipy-sparse-matrix-negative-column-index-found

The dimensionality of a polynomial expansion grows quickly with the input dimensionality. With int32 column indices, a second-degree expansion without bias already overflows at an input dimensionality of 65,535: it produces 2,147,516,415 output columns, more than the int32 maximum of 2,147,483,647.
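
The threshold can be checked with a few lines of arithmetic. The sketch below is not taken from scikit-learn: `expanded_dim` and `as_int32` are hypothetical helpers, the count assumes a degree-2 expansion without bias (d linear terms plus d(d+1)/2 second-degree terms), and `ctypes.c_int32` is used only to mimic how a value wraps when stored in a 32-bit signed integer.

```python
import ctypes

INT32_MAX = 2**31 - 1


def expanded_dim(d):
    """Columns in a degree-2 expansion of d features, without the bias column.

    Hypothetical helper: d linear terms plus d * (d + 1) / 2 products
    (squares and pairwise interactions).
    """
    return d + d * (d + 1) // 2


def as_int32(value):
    """Mimic storing value in a 32-bit signed integer (wraps on overflow)."""
    return ctypes.c_int32(value).value


for d in (65_534, 65_535):
    n_out = expanded_dim(d)
    largest_index = n_out - 1
    # Print: input width, output width, whether it exceeds int32,
    # and what the largest column index looks like once stored as int32.
    print(d, n_out, n_out > INT32_MAX, as_int32(largest_index))
```

With 65,534 input columns the largest column index still fits in an int32; with 65,535 it wraps to a negative number, which matches the negative indices reported above.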

Describe your proposed solution

Use int64s as the index type in the Cython file. The type is currently specified as an int32 in a typedef, so this should be a simple fix.

Describe alternatives you've considered, if relevant

The type necessary to store the output dimensionality could easily be determined. INDEX_T could be made a fused type. The call to _csr_polynomial_expansion could support two specializations, one for int32 and another for int64. A wrapper around them could decide which to call based on the input dimensionality.

This approach seems like overkill.
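
For reference, the wrapper idea above could look roughly like the following. This is only a Python sketch under stated assumptions, not the Cython code: `pick_index_dtype`, `expand`, and the two `_expand_*` stand-ins are hypothetical names, and the size formula covers only a degree-2 expansion without bias.

```python
INT32_MAX = 2**31 - 1


def expanded_dim(n_features):
    """Output columns of a degree-2 expansion without bias (assumed case)."""
    return n_features + n_features * (n_features + 1) // 2


def pick_index_dtype(n_features):
    """Return the name of the narrowest index type that fits the output."""
    largest_index = expanded_dim(n_features) - 1
    return "int32" if largest_index <= INT32_MAX else "int64"


# Stand-ins for the two specializations a fused INDEX_T would generate.
def _expand_int32(n_features):
    return ("int32 specialization", expanded_dim(n_features))


def _expand_int64(n_features):
    return ("int64 specialization", expanded_dim(n_features))


_SPECIALIZATIONS = {"int32": _expand_int32, "int64": _expand_int64}


def expand(n_features):
    """Hypothetical wrapper: choose a specialization from the input width."""
    return _SPECIALIZATIONS[pick_index_dtype(n_features)](n_features)


print(expand(1_000))
print(expand(100_000))
```

The wrapper only needs the input dimensionality (and, in a fuller version, the degree and bias settings) to decide which specialization to call.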

jnothman · 2020-06-10T11:35:20Z

Sounds reasonable, though since it should be possible to estimate which size is needed, we should be able to use fused types to support both. Would that be worthwhile?

AWNystrom · 2020-06-10T19:52:04Z

If you think the added code complexity would be worth it, that's totally doable.

ra1nty · 2020-11-23T06:19:28Z

Hi all,

I ran into this problem a couple of days ago and found this issue. Is anyone actively working on this? If not, I can open a simple PR to fix it. Also, do we want to use a fused type or simply int64? IMO we should use int64 for consistency.

AWNystrom · 2020-11-23T10:11:33Z

I’ve not done anything for this. Have you, Joel?

AWNystrom · 2020-11-25T10:13:58Z

Tagging @jnothman to increase visibility.

AWNystrom · 2020-11-29T06:06:40Z

@ra1nty, since we're not hearing back from @jnothman, let's work out the best fix. You mentioned that we should use int64 for consistency. Where else do we do that? I find that solution nicer as it's much simpler.

jnothman · 2020-11-29T12:42:05Z

Open a PR for switching to int64 indices, but it seems strange to not continue to efficiently support int32-indexed input, even if we always give int64-indexed output. We probably implemented this when scipy's int64-indexed sparse matrices were not yet widely used.

AWNystrom · 2020-11-29T21:11:38Z

I realize there are two separate issues here. One is supporting both int32 and int64 indices as *input*, and the other is supporting both as *output*. I think both types should be supported as input, but it makes sense to only support int64 as output.

wdevazelhes · 2021-02-26T11:21:09Z

Hi, how is this issue going? I'm working on tests for sparse inputs, and one test fails because of this issue (see comment #13246 (comment)).
I could open a PR with a quick fix to support int64 as input, for instance by replacing this line:

scikit-learn/sklearn/preprocessing/_csr_polynomial_expansion.pyx

Line 12 in 94abe05

ctypedef np.int32_t INDEX_T

by

ctypedef fused INDEX_T:
    np.int32_t
    np.int64_t

right?
I think it would make the test I'm working on pass.
(Note that I'm not familiar with Cython, so it's just a guess.)

Then, in a second step, another PR could ensure that the returned indices (in the output) are always int64, as suggested above by @AWNystrom?

thomasjpfan · 2021-02-26T13:07:38Z

There was a PR related to this issue: #16831. If you would like to work on this, please comment on the PR asking whether the contributor is still working on it.

wdevazelhes · 2021-03-01T07:03:20Z

Thanks for the pointer @thomasjpfan, I hadn't seen it, I will comment there

niuk-a · 2021-07-12T20:04:25Z

take

AWNystrom · 2021-08-26T20:24:32Z

How's this going?

niuk-a · 2021-08-26T21:31:10Z

Sorry, looks like I forgot to mention this issue in my PR. Now I've fixed it.

niuk-a · 2021-08-26T21:55:38Z

@AWNystrom, could you help me?
How can I fix the problem with "label the PR with 'No Changelog Needed' to bypass this check"?

AWNystrom · 2021-09-16T21:38:09Z

@niuk-a, not sure. @jnothman?

frrad · 2021-09-16T22:27:56Z

Seems like you can fix this by adding a changelog entry, as described here:

scikit-learn/.github/workflows/check-changelog.yml

Lines 49 to 65 in d7cecb3

    echo "A Changelog entry is missing."
    echo ""
    echo "Please add an entry to the changelog at 'doc/whats_new/v*.rst'"
    echo "to document your change assuming that the PR will be merged"
    echo "in time for the next release of scikit-learn."
    echo ""
    echo "Look at other entries in that file for inspiration and please"
    echo "reference this pull request using the ':pr:' directive and"
    echo "credit yourself (and other contributors if applicable) with"
    echo "the ':user:' directive."
    echo ""
    echo "If you see this error and there is already a changelog entry,"
    echo "check that the PR number is correct."
    echo ""
    echo "If you believe that this PR does no warrant a changelog"
    echo "entry, say so in a comment so that a maintainer will label "
    echo "the PR with 'No Changelog Needed' to bypass this check."

AWNystrom · 2021-12-12T00:48:55Z

Any luck with this?

AWNystrom added the New Feature label Jun 10, 2020

adrinjalali added the help wanted, Moderate, and module:preprocessing labels Jul 10, 2020

wdevazelhes mentioned this issue Feb 26, 2021

[WIP] Common test for equivalence between sparse and dense matrices. #13246

Closed

cmarmo added the cython label May 12, 2021

github-actions bot assigned niuk-a Jul 12, 2021

github-actions bot removed the help wanted label Jul 12, 2021

niuk-a mentioned this issue Aug 26, 2021

[WIP] FIX index overflow error in sparse matrix polynomial expansion … #20524

Closed

ogrisel mentioned this issue Jun 16, 2022

[RFC] Support for int64 indexed SciPy sparse matrices in Cython code #23653

Open

Micky774 mentioned this issue Jun 22, 2022

ENH Allow for appropriate dtype us in preprocessing.PolynomialFeatures for sparse matrices #23731

Merged

ogrisel closed this as completed in #23731 May 4, 2023
