8000 PR for Issue #1466 by erg · Pull Request #1823 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

PR for Issue #1466 #1823

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 2, 2013
Merged

PR for Issue #1466 #1823

merged 3 commits into from
Apr 2, 2013

Conversation

erg
Copy link
Contributor
@erg erg commented Apr 1, 2013

Fixes the crash with addressing large datasets. Does not fix having billions of rows/columns.

I checked in the _tree.c cythonized file from cython 18.0. If you run make cython a lot of .c files change. How often is it worth pushing out new .c files?

erg added 2 commits April 1, 2013 08:03
…big data.

Indent a few copy/pasted function declarations for consistency.
Fixes scikit-learn#1466.
@GaelVaroquaux
Copy link
Member

I checked in the _tree.c cythonized file from cython 18.0. If you run make
cython a lot of .c files change. How often is it worth pushing out new .c
files?

Just the one that you modified.

@amueller
Copy link
Member
amueller commented Apr 1, 2013

Would it be feasible to do a regression test? Maybe using a single tree of depth 1 or something?

@@ -1139,7 +1139,7 @@ cdef class ClassificationCriterion(Criterion):
if self.n_classes == NULL:
raise MemoryError()

cdef int label_count_stride = -1
cdef Py_ssize_t label_count_stride = -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what happens here. Does it underflow?

Same goes for the other changes. We have to make sure Py_ssize_t values do not flip around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int is already a signed value (to do negative array indices). Py_ssize_t is just a wider signed variable so I think it's fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, right, makes sense now. s stands for *signed* size_t.

@larsmans
Copy link
Member
larsmans commented Apr 2, 2013

Such a regression test would have to create a huge array...

@amueller
Copy link
Member
amueller commented Apr 2, 2013

I didn't have an exact look at the code. So how huge exactly?

@glouppe
Copy link
Contributor
glouppe commented Apr 2, 2013

Bigger than what 32bits system can handle (about >4Gb?).

@larsmans
Copy link
Member
larsmans commented Apr 2, 2013

Actually, 2 billion elements is enough because int is usually 31 bits magnitude. We'd also need to make sure the test is only run on a machine where sizeof(Py_ssize_t) > sizeof(int), but I guess that's doable.

The main problem is that you need >2e9 * sizeof(double), which is 16GB... so I'd need to run the test on a big headless node instead of my laptop :p

@amueller
Copy link
Member
amueller commented Apr 2, 2013

n_samples * n_features must be bigger than max int, right?
No, I confused myself. The number of features must be large for x_stride to be large?
I should really study this code more closely.

@amueller
Copy link
Member
amueller commented Apr 2, 2013

@larsmans Ok, so I guess that answers my feasibility question. +1 for merge :)

@glouppe
Copy link
Contributor
glouppe commented Apr 2, 2013

Looks good to me as well. It does not solve all problems (billions of rows and/or features), but it is already quite an improvement. Thank you for the fix @erg !

glouppe added a commit that referenced this pull request Apr 2, 2013
@glouppe glouppe merged commit c87d45d into scikit-learn:master Apr 2, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0