8000 Cython code for PolynomialFeatures should use int64s for indices. · Issue #17554 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Cython code for PolynomialFeatures should use int64s for indices. #17554

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
AWNystrom opened this issue Jun 10, 2020 · 18 comments · Fixed by #23731
Closed

Cython code for PolynomialFeatures should use int64s for indices. #17554

AWNystrom opened this issue Jun 10, 2020 · 18 comments · Fixed by #23731
Assignees
Labels
cython Moderate Anything that requires some knowledge of conventions and best practices module:preprocessing New Feature

Comments

@AWNystrom
Copy link
Contributor

Describe the workflow you want to enable

The code in preprocessing/_csr_polynomial_expansion.pyx produces the components of a CSR matrix with column values represented as int32s. For inputs with a sufficiently large number of columns, this can cause overflow, which leads to negative indices in the resulting CSR matrix.

Here's an instance of a user running into this problem:
https://stackoverflow.com/questions/60920877/scipy-sparse-matrix-negative-column-index-found

The dimensionality of a polynomial expansion grows quickly with respect to the input dimensionality. Using int32s to represent columns, a second degree expansion without bias is overwhelmed with an input dimensionality of 65,535.

Describe your proposed solution

Use int64s as the index type in the cython file. The type is currently specified as an int32 in a typedef, so this should be a simple fix.

Describe alternatives you've considered, if relevant

The type necessary to store the output dimensionality could easily be determined. INDEX_T could be made a fused type. The call to _csr_polynomial_expansion could support two specializations, one for int32 and another for int64. A wrapper around them could decide which to call based on the input dimensionality.

This approach seems like overkill.

@jnothman
Copy link
Member
jnothman commented Jun 10, 2020 via email

@AWNystrom
Copy link
Contributor Author

If you think the added code complexity would be worth it, that's totally doable.

@adrinjalali adrinjalali added help wanted Moderate Anything that requires some knowledge of conventions and best practices module:preprocessing labels Jul 10, 2020
@ra1nty
Copy link
ra1nty commented Nov 23, 2020

Hi all,

I ran into this problem a couple of days ago and found this issue. Wondering if anyone is actively working on this? If not, I can add a simple PR to fix this. Also, do we want to use fused type or simply int64 instead? IMO we should use int64 for consistency.

@AWNystrom
Copy link
Contributor Author
AWNystrom commented Nov 23, 2020 via email

@AWNystrom
Copy link
Contributor Author

Tagging @jnothman to increase visibility.

@AWNystrom
Copy link
Contributor Author

@ra1nty, since we're not hearing back from @jnothman, let's work out the best fix. You mentioned that we should use int64 for consistency. Where else do we do that? I find that solution nicer as it's much simpler.

@jnothman
Copy link
Member

Open a PR for switching to int64 indices, but it seems strange to not continue to efficiently support int32-indexed input, even if we always give int64-indexed output. We probably implemented this when scipy's int64-indexed sparse matrices were not yet widely used.

@AWNystrom
Copy link
Contributor Author
AWNystrom commented Nov 29, 2020 via email

@wdevazelhes
Copy link
Contributor

Hi, how is this issue going ? I'm working on tests for sparse inputs, and one test fails because of the issue here (see comment #13246 (comment))
I could do a PR for a quick fix to support int64 as inputs, for instance by replacing this line :

by

ctypedef fused INDEX_T:
    np.int32_t
    np.int64_t

right ?
I think it would make the test I'm working on to pass
(Note that I'm not familiar with cython so it's just a guess)

And then in a second step another PR could ensure the returned indices (in outputs) are always int64 as suggested above by @AWNystrom ?

@thomasjpfan
Copy link
Member

There was a PR working related to this issue: #16831. If you would like to work on this, please comment on the PR asking if the contributor is still working on it.

@wdevazelhes
Copy link
Contributor

There was a PR working related to this issue: #16831. If you would like to work on this, please comment on the PR asking if the contributor is still working on it.

Thanks for the pointer @thomasjpfan, I hadn't seen it, I will comment there

@cmarmo cmarmo added the cython label May 12, 2021
@niuk-a
Copy link
Contributor
niuk-a commented Jul 12, 2021

take

@AWNystrom
Copy link
Contributor Author

How's this going?

@niuk-a
Copy link
Contributor
niuk-a commented Aug 26, 2021

How's this going?

Sorry, looks like i forgot to mention this issue in my PR. Now l've fixed it.

@niuk-a
Copy link
Contributor
niuk-a commented Aug 26, 2021

@AWNystrom, could you help me?
How can I fix problem with "label the PR with 'No Changelog Needed' to bypass this check"?

@AWNystrom
Copy link
Contributor Author

@niuk-a, not sure. @jnothman?

@frrad
Copy link
Contributor
frrad commented Sep 16, 2021

seems like you can fix this by adding a changelog entry as described here

echo "A Changelog entry is missing."
echo ""
echo "Please add an entry to the changelog at 'doc/whats_new/v*.rst'"
echo "to document your change assuming that the PR will be merged"
echo "in time for the next release of scikit-learn."
echo ""
echo "Look at other entries in that file for inspiration and please"
echo "reference this pull request using the ':pr:' directive and"
echo "credit yourself (and other contributors if applicable) with"
echo "the ':user:' directive."
echo ""
echo "If you see this error and there is already a changelog entry,"
echo "check that the PR number is correct."
echo ""
echo" If you believe that this PR does no warrant a changelog"
echo "entry, say so in a comment so that a maintainer will label "
echo "the PR with 'No Changelog Needed' to bypass this check."

8000

@AWNystrom
Copy link
Contributor Author
AWNystrom commented Dec 12, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cython Moderate Anything that requires some knowledge of conventions and best practices module:preprocessing New Feature
Projects
None yet
9 participants
0