Conversation
…big data. Indent a few copy/pasted function declarations for consistency. Fixes scikit-learn#1466.
Just the one that you modified.

Would it be feasible to do a regression test? Maybe using a single tree of depth 1 or something?
I am not sure what happens here. Does it underflow?
Same goes for the other changes. We have to make sure Py_ssize_t values do not flip around.
int is already a signed type (which is what allows negative array indices). Py_ssize_t is just a wider signed type, so I think it's fine.
Yup, right, makes sense now. s stands for *signed* size_t.
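For illustration only (my sketch, not code from this PR, and the dimensions are hypothetical): this is how a flattened index computed in a 32-bit C int "flips around" to a negative value once a dataset exceeds 2**31 - 1 cells, while a wide signed index like Py_ssize_t stays correct.

```python
def as_int32(x):
    """Simulate C int (32-bit two's-complement) wraparound."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

# Hypothetical shape: 3e9 cells, which exceeds 2**31 - 1.
n_rows, n_cols = 100_000, 30_000
flat = n_rows * n_cols       # exact in Python, like a 64-bit Py_ssize_t
print(flat)                  # 3000000000
print(as_int32(flat))        # -1294967296: a negative "index", hence the crash
```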
Such a regression test would have to create a huge array...

I didn't look closely at the code. So how huge, exactly?

Bigger than what a 32-bit system can handle (more than about 4 GB?).
Actually, 2 billion elements is enough. The main problem is that you need >2e9 * …
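Back-of-the-envelope on the test size (my arithmetic, not from the thread): triggering the overflow requires more than 2**31 - 1 elements, so even at one byte per element the test array would exceed 2 GiB.

```python
n = 2**31            # smallest element count that overflows a signed 32-bit index
print(n)             # 2147483648
print(n / 2**30)     # 2.0  -> GiB at 1 byte per element (e.g. int8)
print(n * 8 / 2**30) # 16.0 -> GiB at 8 bytes per element (float64)
```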
@larsmans Ok, so I guess that answers my feasibility question. +1 for merge :)

Looks good to me as well. It does not solve all problems (billions of rows and/or features), but it is already quite an improvement. Thank you for the fix, @erg!
Fixes the crash when addressing large datasets. Does not fix handling billions of rows/columns.
I checked in the _tree.c file cythonized with Cython 0.18. If you run make cython, a lot of .c files change. How often is it worth pushing out new .c files?