PR for Issue #1466 #1823
Conversation
…big data. Indent a few copy/pasted function declarations for consistency. Fixes scikit-learn#1466.
Just the one that you modified.
Would it be feasible to do a regression test? Maybe using a single tree of depth 1 or something?
@@ -1139,7 +1139,7 @@ cdef class ClassificationCriterion(Criterion):
         if self.n_classes == NULL:
             raise MemoryError()

-        cdef int label_count_stride = -1
+        cdef Py_ssize_t label_count_stride = -1
I am not sure what happens here. Does it underflow?
The same goes for the other changes. We have to make sure Py_ssize_t values do not wrap around.
int is already a signed type (to allow negative array indices). Py_ssize_t is just a wider signed type, so I think it's fine.
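As an illustrative sketch of the failure mode (not code from this PR; NumPy is used here only for its fixed-width integer types), this shows the wraparound a 32-bit int suffers on large index arithmetic, and why a 64-bit signed type such as Py_ssize_t on a 64-bit platform does not:

```python
import numpy as np

# A 32-bit signed int holds values up to 2**31 - 1 = 2147483647.
stride = np.int32(2_000_000_000)        # still in range for int32

# Doubling it overflows int32 and wraps to a negative value
# (NumPy prints a RuntimeWarning but completes the operation).
print(stride * np.int32(2))             # -294967296

# The same arithmetic in a 64-bit signed type (what Py_ssize_t
# typically is on 64-bit platforms) stays well within range.
print(np.int64(stride) * np.int64(2))   # 4000000000
```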
Yup, right, makes sense now. The s stands for *signed* size_t.
Such a regression test would have to create a huge array...
I haven't looked at the code closely. So how huge, exactly?
Bigger than what a 32-bit system can handle (more than about 4 GB?).
Actually, 2 billion elements is enough, because a signed 32-bit int overflows just past 2^31 ≈ 2.1e9. The main problem is that you need >2e9 * (the per-element size in bytes) of memory just to build such an array.
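For scale, a back-of-the-envelope sketch (the 8-byte element size is an assumption, not something stated in the thread):

```python
# Smallest element count whose flat index overflows a signed 32-bit int:
n_elements = 2**31                      # 2147483648

bytes_per_element = 8                   # assumption: float64 data
gib_needed = n_elements * bytes_per_element / 2**30
print(f"~{gib_needed:.0f} GiB")         # ~16 GiB, impractical for a CI test
```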
@larsmans Ok, so I guess that answers my feasibility question. +1 for merge :)
Looks good to me as well. It does not solve all problems (billions of rows and/or features), but it is already quite an improvement. Thank you for the fix @erg!
Fixes the crash when addressing large datasets. It does not fix having billions of rows/columns.
I checked in the _tree.c file cythonized with Cython 0.18. If you run
make cython
a lot of .c files change. How often is it worth pushing out regenerated .c files?