PR for Issue #1466 #1823
Conversation
…big data. Indent a few copy/pasted function declarations for consistency. Fixes scikit-learn#1466.
Just the one that you modified.
Would it be feasible to do a regression test? Maybe using a single tree of depth 1 or something?
@@ -1139,7 +1139,7 @@ cdef class ClassificationCriterion(Criterion):
         if self.n_classes == NULL:
             raise MemoryError()

-        cdef int label_count_stride = -1
+        cdef Py_ssize_t label_count_stride = -1
I am not sure what happens here. Does it underflow?
The same goes for the other changes. We have to make sure Py_ssize_t values do not wrap around.
int is already a signed type (to allow negative array indices). Py_ssize_t is just a wider signed type, so I think it's fine.
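As an illustrative sketch of the failure mode (not code from this PR; NumPy is used here only for its fixed-width integer types), this shows the wraparound a 32-bit int suffers on large index arithmetic, and why a 64-bit signed type such as Py_ssize_t on a 64-bit platform does not:

```python
import numpy as np

# A 32-bit signed int holds values up to 2**31 - 1 = 2147483647.
stride = np.int32(2_000_000_000)        # still in range for int32

# Doubling it overflows int32 and wraps to a negative value
# (NumPy prints a RuntimeWarning but completes the operation).
print(stride * np.int32(2))             # -294967296

# The same arithmetic in a 64-bit signed type (what Py_ssize_t
# typically is on 64-bit platforms) stays well within range.
print(np.int64(stride) * np.int64(2))   # 4000000000
```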
Yup, right, makes sense now. The s stands for *signed* size_t.
Such a regression test would have to create a huge array...
I haven't looked at the code closely. So how huge, exactly?
Bigger than what a 32-bit system can handle (more than about 4 GB?).
Actually, 2 billion elements is enough, because a signed 32-bit int overflows just past 2^31 ≈ 2.1e9. The main problem is that you need >2e9 * (the per-element size in bytes) of memory just to build such an array.
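For scale, a back-of-the-envelope sketch (the 8-byte element size is an assumption, not something stated in the thread):

```python
# Smallest element count whose flat index overflows a signed 32-bit int:
n_elements = 2**31                      # 2147483648

bytes_per_element = 8                   # assumption: float64 data
gib_needed = n_elements * bytes_per_element / 2**30
print(f"~{gib_needed:.0f} GiB")         # ~16 GiB, impractical for a CI test
```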
@larsmans Ok, so I guess that answers my feasibility question. +1 for merge :)
Looks good to me as well. It does not solve all problems (billions of rows and/or features), but it is already quite an improvement. Thank you for the fix @erg!
Fixes the crash when addressing large datasets. It does not fix having billions of rows/columns.
I checked in the _tree.c file cythonized with Cython 0.18. If you run
make cython
a lot of .c files change. How often is it worth pushing out regenerated .c files?