-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
Cython code for PolynomialFeatures should use int64s for indices. #17554
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Sounds reasonable, though since it should be possible to estimate which
size is needed, we should be able to use fused types to support both. Would
that be worthwhile?
|
If you think the added code complexity would be worth it, that's totally doable. |
Hi all, I ran into this problem a couple of days ago and found this issue. Wondering if anyone is actively working on this? If not, I can add a simple PR to fix this. Also, do we want to use fused type or simply int64 instead? IMO we should use int64 for consistency. |
I’ve not done anything for this. Have you, Joel?
…On Sun, Nov 22, 2020 at 10:19 PM Rain ***@***.***> wrote:
Hi all,
I ran into this problem a couple of days ago and found this issue.
Wondering if anyone is actively working on this? If not, I can add a simple
PR to fix this. Also, do we want to use fused type or simply int64 instead?
IMO we should use int64 for consistency.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#17554 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AALI3MYWI5XCU6N2KP33XFTSRH5H5ANCNFSM4N2CJXDA>
.
|
Tagging @jnothman to increase visibility. |
Open a PR for switching to int64 indices, but it seems strange to not continue to efficiently support int32-indexed input, even if we always give int64-indexed output. We probably implemented this when scipy's int64-indexed sparse matrices were not yet widely used. |
I realize there are two separate issues here. One is supporting both int32
and int64 indices as *input*, and the other is supporting both as *output*.
I think both types should be supported as input, but it makes sense to only
support int64 as output.
…On Sun, Nov 29, 2020 at 4:42 AM Joel Nothman ***@***.***> wrote:
Open a PR for switching to int64 indices, but it seems strange to not
continue to efficiently support int32-indexed input, even if we always give
int64-indexed output. We probably implemented this when scipy's
int64-indexed sparse matrices were not yet widely used.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#17554 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AALI3M4PYW6UYHBRLL4RQTDSSI6STANCNFSM4N2CJXDA>
.
|
Hi, how is this issue going ? I'm working on tests for sparse inputs, and one test fails because of the issue here (see comment #13246 (comment))
by ctypedef fused INDEX_T:
np.int32_t
np.int64_t right ? And then in a second step another PR could ensure the returned indices (in outputs) are always |
There was a PR working related to this issue: #16831. If you would like to work on this, please comment on the PR asking if the contributor is still working on it. |
Thanks for the pointer @thomasjpfan, I hadn't seen it, I will comment there |
take |
How's this going? |
Sorry, looks like i forgot to mention this issue in my PR. Now l've fixed it. |
@AWNystrom, could you help me? |
seems like you can fix this by adding a changelog entry as described here scikit-learn/.github/workflows/check-changelog.yml Lines 49 to 65 in d7cecb3
|
Any luck with this?
…On Thu, Sep 16, 2021 at 3:28 PM Frederick Robinson ***@***.***> wrote:
seems like you can fix this by adding a changelog entry as described here
https://github.com/scikit-learn/scikit-learn/blob/d7cecb3b718f84ee5a3f5d33462721644f50d3b4/.github/workflows/check-changelog.yml#L49-L65
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#17554 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AALI3M5CZLWNXYLLTGRN3KTUCJVPPANCNFSM4N2CJXDA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
Describe the workflow you want to enable
The code in preprocessing/_csr_polynomial_expansion.pyx produces the components of a CSR matrix with column values represented as int32s. For inputs with a sufficiently large number of columns, this can cause overflow, which leads to negative indices in the resulting CSR matrix.
Here's an instance of a user running into this problem:
https://stackoverflow.com/questions/60920877/scipy-sparse-matrix-negative-column-index-found
The dimensionality of a polynomial expansion grows quickly with respect to the input dimensionality. Using int32s to represent columns, a second degree expansion without bias is overwhelmed with an input dimensionality of 65,535.
Describe your proposed solution
Use int64s as the index type in the cython file. The type is currently specified as an int32 in a typedef, so this should be a simple fix.
Describe alternatives you've considered, if relevant
The type necessary to store the output dimensionality could easily be determined. INDEX_T could be made a fused type. The call to _csr_polynomial_expansion could support two specializations, one for int32 and another for int64. A wrapper around them could decide which to call based on the input dimensionality.
This approach seems like overkill.
The text was updated successfully, but these errors were encountered: