crash in _tree.so when calling fit() on large tree ensemble #1818 · scikit-learn/scikit-learn · GitHub
Closed
erg opened this issue Mar 27, 2013 · 11 comments

Comments
@erg
Contributor
erg commented Mar 27, 2013

Perhaps it's using 32-bit ints as the indices instead of u64?

from sklearn.ensemble import ExtraTreesClassifier
import numpy as np

np.random.seed(32)
clf = ExtraTreesClassifier(n_estimators=1000,
                           max_depth=None,
                           max_features=None,
                           min_samples_split=5,
                           min_samples_leaf=5)
print "Making X"
# 25000 * 100000 = 2,500,000,000 elements, which exceeds
# 2**31 - 1 = 2,147,483,647, the largest signed 32-bit int.
X = np.random.randn(25000, 100000)
print "X size", np.size(X)
print "Making y"
y = np.random.randint(10, size=(25000, 1))
print "Fitting..."
clf.fit(X, y)
[erg@pliny src]$ [master*] gdb python2
GNU gdb (GDB) 7.5.1
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/python2...(no debugging symbols found)...done.
(gdb) run crash.py 
Starting program: /usr/bin/python2 crash.py
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
warning: Could not load shared library symbols for linux-vdso.so.1.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Making X
X size 2500000000
Making y
Fitting...

Program received signal SIGSEGV, Segmentation fault.
0x00007fffe967d5f7 in __pyx_f_7sklearn_4tree_5_tree_4Tree_find_random_split ()
   from /usr/lib/python2.7/site-packages/sklearn/tree/_tree.so
(gdb) bt
#0  0x00007fffe967d5f7 in __pyx_f_7sklearn_4tree_5_tree_4Tree_find_random_split ()
   from /usr/lib/python2.7/site-packages/sklearn/tree/_tree.so
#1  0x00007fffe9680f2e in __pyx_f_7sklearn_4tree_5_tree_4Tree_recursive_partition ()
   from /usr/lib/python2.7/site-packages/sklearn/tree/_tree.so
#2  0x00007fffe9675008 in __pyx_f_7sklearn_4tree_5_tree_4Tree_build ()
   from /usr/lib/python2.7/site-packages/sklearn/tree/_tree.so
#3  0x00007fffe9671af0 in __pyx_pw_7sklearn_4tree_5_tree_4Tree_11build ()
   from /usr/lib/python2.7/site-packages/sklearn/tree/_tree.so
#4  0x00007ffff7afd192 in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#5  0x00007ffff7afeedd in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#6  0x00007ffff7afd15c in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#7  0x00007ffff7afeedd in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#8  0x00007ffff7a8f63f in function_call () from /usr/lib/libpython2.7.so.1.0
#9  0x00007ffff7a6b8be in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#10 0x00007ffff7af9cce in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#11 0x00007ffff7afeedd in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#12 0x00007ffff7a8f536 in function_call () from /usr/lib/libpython2.7.so.1.0
#13 0x00007ffff7a6b8be in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#14 0x00007ffff7a79e78 in instancemethod_call () from /usr/lib/libpython2.7.so.1.0
#15 0x00007ffff7a6b8be in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#16 0x00007ffff7abe5b2 in slot_tp_init () from /usr/lib/libpython2.7.so.1.0
#17 0x00007ffff7abe24c in type_call () from /usr/lib/libpython2.7.so.1.0
#18 0x00007ffff7a6b8be in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#19 0x00007ffff7afab69 in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#20 0x00007ffff7afda83 in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#21 0x00007ffff7afeedd in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#22 0x00007ffff7a8f536 in function_call () from /usr/lib/libpython2.7.so.1.0
#23 0x00007ffff7a6b8be in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#24 0x00007ffff7a79e78 in instancemethod_call () from /usr/lib/libpython2.7.so.1.0
#25 0x00007ffff7a6b8be in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#26 0x00007ffff7abe962 in slot_tp_call () from /usr/lib/libpython2.7.so.1.0
#27 0x00007ffff7a6b8be in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#28 0x00007ffff7afab69 in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#29 0x00007ffff7afeedd in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#30 0x00007ffff7afd15c in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#31 0x00007ffff7afeedd in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#32 0x00007ffff7afefb2 in PyEval_EvalCode () from /usr/lib/libpython2.7.so.1.0
#33 0x00007ffff7b17eea in run_mod () from /usr/lib/libpython2.7.so.1.0
#34 0x00007ffff7b18ce2 in PyRun_FileExFlags () from /usr/lib/libpython2.7.so.1.0
#35 0x00007ffff7b196fb in PyRun_SimpleFileExFlags () from /usr/lib/libpython2.7.so.1.0
#36 0x00007ffff7b2a9f2 in Py_Main () from /usr/lib/libpython2.7.so.1.0
#37 0x00007ffff747aa15 in __libc_start_main () from /usr/lib/libc.so.6
#38 0x0000000000400741 in _start ()
@larsmans
Member

Yep, it's using 32-bit indices, without a check for whether its inputs are small enough. Ping @glouppe.

(Were you actually handling a 20GB dataset, or is this just a test?)
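
For concreteness, here is a minimal sketch (not from the original report) of the arithmetic at issue, assuming a flat index of the form row * n_features + col stored in a signed 32-bit int:

n_samples, n_features = 25000, 100000
total = n_samples * n_features   # 2,500,000,000 array elements
INT32_MAX = 2**31 - 1            # 2,147,483,647

print(total > INT32_MAX)         # True: flat indices no longer fit in a C int

# Two's-complement wraparound a 32-bit signed int would produce:
wrapped = total - 2**32 if total >= 2**31 else total
print(wrapped)                   # -1794967296 -> garbage pointer offset, SIGSEGV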

@erg
Contributor Author
erg commented Mar 27, 2013

I had some data that was large enough to trigger the bug, but these sizes are arbitrary.

@erg
Contributor Author
erg commented Mar 27, 2013

@larsmans
Member

We should be using Py_ssize_t (or np.npy_intp, which is what NumPy uses).
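
A quick way to see the widths involved (illustrative; output shown for a 64-bit build):

import numpy as np

# np.intp is NumPy's pointer-sized integer (npy_intp, same width as
# Py_ssize_t): 64 bits on a 64-bit platform, wide enough for 2.5e9 indices.
print(np.iinfo(np.int32).max)    # 2147483647 -- too small for 2.5e9 elements
print(np.iinfo(np.intp).max)     # 9223372036854775807 on 64-bit builds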

@glouppe
Contributor
glouppe commented Mar 27, 2013

This bug is indeed known (see #1466) and should be solved by using Py_ssize_t instead.

@erg
Contributor Author
erg commented Mar 27, 2013

Isn't it basically just a search-and-replace? Why wait for someone to report crashers instead of just fixing it?

@glouppe
Contributor
glouppe commented Mar 27, 2013

Yes, basically. PRs are welcome.

@erg
Contributor Author
erg commented Mar 27, 2013

With adversarial datasets, I can get more errors. I think it also errors out if you swap the dimensions.

clf = ExtraTreesClassifier(n_estimators=1000,
                           max_depth=None,
                           max_features=None,
                           min_samples_split=5,
                           min_samples_leaf=5)
# A single dimension of 3,000,000,000 already exceeds 2**31 - 1,
# so the shape itself cannot be stored in a C int.
X = np.zeros((2, 3000000000), dtype=float)
y = np.zeros((2, 1))
clf.fit(X, y)
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-9-1a154f38b9fb> in <module>()
----> 1 clf.fit(X, y)

/usr/lib/python2.7/site-packages/sklearn/ensemble/forest.pyc in fit(self, X, y, sample_weight)
    372                 seeds[i],
    373                 verbose=self.verbose)
--> 374             for i in range(n_jobs))
    375 
    376         # Reduce

/usr/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    512         try:
    513             for function, args, kwargs in iterable:
--> 514                 self.dispatch(function, args, kwargs)
    515 
    516             self.retrieve()

/usr/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch(self, func, args, kwargs)
    309         """
    310         if self._pool is None:
--> 311             job = ImmediateApply(func, args, kwargs)
    312             index = len(self._jobs)
    313             if not _verbosity_filter(index, self.verbose):

/usr/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __init__(self, func, args, kwargs)
    133         # Don't delay the application, to avoid keeping the input
    134         # arguments in memory
--> 135         self.results = func(*args, **kwargs)
    136 
    137     def get(self):

/usr/lib/python2.7/site-packages/sklearn/ensemble/forest.pyc in _parallel_build_trees(n_trees, forest, X, y, sample_weight, sample_mask, X_argsorted, seeds, verbose)
    106                      sample_mask=sample_mask,
    107                      X_argsorted=X_argsorted,
--> 108                      check_input=False)
    109 
    110         trees.append(tree)

/usr/lib/python2.7/site-packages/sklearn/tree/tree.pyc in fit(self, X, y, sample_mask, X_argsorted, check_input, sample_weight)
    365                                 min_samples_split, self.min_samples_leaf,
    366                                 self.min_density, max_features,
--> 367                                 self.find_split_, random_state)
    368 
    369         self.tree_.build(X, y,

/usr/lib/python2.7/site-packages/sklearn/tree/_tree.so in sklearn.tree._tree.Tree.__cinit__ (sklearn/tree/_tree.c:2376)()

OverflowError: value too large to convert to int
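
Until the index types are widened, a caller-side guard along these lines would at least fail fast with a clear message instead of crashing (a sketch only; check_index_bounds is not part of scikit-learn's API):

INT32_MAX = 2**31 - 1

def check_index_bounds(X):
    # Hypothetical guard: refuse arrays whose total size or largest
    # dimension cannot be represented as a signed 32-bit C int.
    if X.size > INT32_MAX or max(X.shape) > INT32_MAX:
        raise ValueError("array too large for 32-bit indexing: "
                         "%d elements, shape %r" % (X.size, X.shape))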

@larsmans
Member

Closing as duplicate -- feel free to continue discussing, but I'd like to keep the issue tracker clean.

@amueller
Member

On 03/27/2013 10:08 PM, erg wrote:

> Isn't it basically just a search-and-replace? Why wait for someone to
> report crashers instead of just fixing it?

So why didn't you submit a PR yet?

@erg
Contributor Author
erg commented Mar 31, 2013

I made a patch that fixes the splitting problem but haven't submitted it yet. I ran it overnight with this patch, and while it didn't finish training my trees, it didn't crash either.

erg@fa7d0b6

The remaining bug is overflow on datasets with a large number of rows or columns. I don't know whether there are any subtleties in changing the Cython code, or whether the change would slow it down.
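
As a sanity check on the width difference (illustrative), ctypes can report the sizes of the two C types on a given platform:

import ctypes

# c_int stays 4 bytes even on 64-bit Linux/x86-64, while c_ssize_t
# (Py_ssize_t) matches the pointer width at 8 bytes -- so switching the
# index type widens the addressable range on 64-bit builds.
print(ctypes.sizeof(ctypes.c_int))      # 4
print(ctypes.sizeof(ctypes.c_ssize_t))  # 8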
