JoblibException thrown when passing "fit_params={'sample_weights': weights}" to RandomizedSearchCV with RandomForestClassifier · Issue #8062 · scikit-learn/scikit-learn · GitHub

JoblibException thrown when passing "fit_params={'sample_weights': weights}" to RandomizedSearchCV with RandomForestClassifier #8062

Closed
ianlcassidy opened this issue Dec 15, 2016 · 8 comments · Fixed by #8068
Labels
Bug, Easy (well-defined and straightforward way to resolve)

Comments

@ianlcassidy

Description

A JoblibException is thrown when passing fit_params={'sample_weight': weights} to RandomizedSearchCV with RandomForestClassifier as the base classifier. The error does not occur when ExtraTreesClassifier is the base classifier.

Steps/Code to Reproduce

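from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

def tune_rf(X_train, Y_train, cv, n_iter=25,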
            scoring='accuracy', weights=None, n_jobs=-1):
    # instantiate classifier
    clf = RandomForestClassifier(n_jobs=n_jobs, random_state=42)

    # specify parameters and distributions to sample from
    param_dist = {'n_estimators': randint(25, 1000),
                  'min_samples_split': randint(2, 10),
                  'min_samples_leaf': randint(1, 10),
                  'max_depth': randint(1, 10),
                  'criterion': ['entropy', 'gini'],
                  'max_features': ['sqrt', 'log2']}

    # run randomized search
    if weights is None:
        random_search = RandomizedSearchCV(clf,
                                           param_distributions=param_dist,
                                           n_iter=n_iter,
                                           scoring=scoring,
                                           n_jobs=n_jobs,
                                           cv=cv)
    else:
        random_search = RandomizedSearchCV(clf,
                                           param_distributions=param_dist,
                                           n_iter=n_iter,
                                           scoring=scoring,
                                           n_jobs=n_jobs,
                                           cv=cv,
                                           fit_params={'sample_weight': weights})
    random_search.fit(X_train, Y_train)
    return random_search

Expected Results

No error is thrown.

Actual Results

---------------------------------------------------------------------------
JoblibException                           Traceback (most recent call last)
<ipython-input-10-020f007024f9> in <module>()
      2 # random_search_rf = tune_rf(X_train, Y_train, cv, n_iter=n_iter_search, scoring=scoring)
      3 random_search_rf = tune_rf(X_train, Y_train, cv, n_iter=n_iter_search, n_jobs=-1,
----> 4                            scoring=scoring, weights=train_weights)
      5 print('RandomizedSearchCV took %.2f seconds for %d candidates'
      6       ' parameter settings.' % ((time() - start), n_iter_search))

/Users/icassidy/Desktop/mackbet/model/helper_functions.py in tune_rf(X_train, Y_train, cv, n_iter, scoring, weights, n_jobs)
    186                                            cv=cv,
    187                                            fit_params={'sample_weight': weights})
--> 188     random_search.fit(X_train, Y_train)
    189     return random_search
    190 

/Users/icassidy/Documents/Development/venv/lib/python2.7/site-packages/sklearn/model_selection/_search.pyc in fit(self, X, y, groups)
   1183                                           self.n_iter,
   1184                                           random_state=self.random_state)
-> 1185         return self._fit(X, y, groups, sampled_params)

/Users/icassidy/Documents/Development/venv/lib/python2.7/site-packages/sklearn/model_selection/_search.pyc in _fit(self, X, y, groups, parameter_iterable)
    560                                   return_times=True, return_parameters=True,
    561                                   error_score=self.error_score)
--> 562           for parameters in parameter_iterable
    563           for train, test in cv.split(X, y, groups))
    564 

/Users/icassidy/Documents/Development/venv/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    766                 # consumption.
    767                 self._iterating = False
--> 768             self.retrieve()
    769             # Make sure that we get a last message telling us we are done
    770             elapsed_time = time.time() - self._start_time

/Users/icassidy/Documents/Development/venv/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in retrieve(self)
    717                     ensure_ready = self._managed_backend
    718                     backend.abort_everything(ensure_ready=ensure_ready)
--> 719                 raise exception
    720 
    721     def __call__(self, iterable):

JoblibException: JoblibException
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py in _run_module_as_main(mod_name='ipykernel.__main__', alter_argv=1)
    157     pkg_name = mod_name.rpartition('.')[0]
    158     main_globals = sys.modules["__main__"].__dict__
    159     if alter_argv:
    160         sys.argv[0] = fname
    161     return _run_code(code, main_globals, None,
--> 162                      "__main__", fname, loader, pkg_name)
        fname = '/Users/icassidy/Documents/Development/venv/lib/python2.7/site-packages/ipykernel/__main__.py'
        loader = <pkgutil.ImpLoader instance>
        pkg_name = 'ipykernel'

...there's a lot more stuff printed after this

Versions

Darwin-15.6.0-x86_64-i386-64bit
('Python', '2.7.10 (default, Oct 23 2015, 19:19:21) \n[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)]')
('NumPy', '1.11.2')
('SciPy', '0.18.1')
('Scikit-Learn', '0.18')

@ianlcassidy
Author

Just upgraded to ('Scikit-Learn', '0.18.1') and the error is still thrown.

@jnothman
Member
jnothman commented Dec 15, 2016 via email

@ianlcassidy
Author
ianlcassidy commented Dec 15, 2016

Here is the full error traceback:
traceback.txt

Do you need a better example of the code that is failing? I'm pretty sure that running the function I pasted above with dummy X_train and Y_train data would fail in the same way.

@jnothman
Member

Yes, we need to call check_array on sample_weight in BaseRandomForest.
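
As a rough illustration of the suggestion above, the sketch below coerces a list-like sample_weight into a NumPy array before any array operations touch it. The helper name coerce_sample_weight and its length check are assumptions made for this example. It is not the patch from #8068, and it uses np.asarray instead of scikit-learn's check_array only to keep the example small.

```python
import numpy as np


def coerce_sample_weight(sample_weight, n_samples):
    """Hypothetical helper: return sample_weight as a 1-D float64 array, or None."""
    if sample_weight is None:
        return None
    # np.asarray accepts lists, tuples and arrays alike, so a plain Python
    # list of weights ends up with the same type as an ndarray input.
    weights = np.asarray(sample_weight, dtype=np.float64)
    if weights.ndim != 1:
        raise ValueError("sample_weight must be 1-D, got %d dimensions"
                         % weights.ndim)
    if weights.shape[0] != n_samples:
        raise ValueError("sample_weight has %d entries, expected %d"
                         % (weights.shape[0], n_samples))
    return weights


# Usage: a list goes in, a float64 ndarray comes out.
coerced = coerce_sample_weight([1, 2, 3], n_samples=3)
```

Doing the coercion once at the top of fit would mean later code can rely on array methods regardless of what the caller passed in.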

jnothman added the Bug, Easy (well-defined and straightforward way to resolve), and Need Contributor labels on Dec 15, 2016
@Don86
Contributor
Don86 commented Dec 16, 2016

Hi, I'd like to work on this. (: Is this related to #8064?

@kushagraagrawal

Hi, I'd like to take this on. Can someone tell me how to get started?

@ianlcassidy
Author

@jnothman I'm having trouble reproducing this error with dummy data. I thought the following code was going to throw the same error, but it doesn't.

import numpy as np
from scipy.stats import randint
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

def tune_rf(X, Y, weights, cv, n_iter=25, scoring='accuracy'):
    # instantiate classifier
    clf = RandomForestClassifier(n_jobs=-1, random_state=42)

    # specify parameters and distributions to sample from
    param_dist = {'n_estimators': randint(25, 1000),
                  'min_samples_split': randint(2, 10),
                  'min_samples_leaf': randint(1, 10),
                  'max_depth': randint(1, 10),
                  'criterion': ['entropy', 'gini'],
                  'max_features': ['sqrt', 'log2']}

    # run randomized search
    random_search = RandomizedSearchCV(clf,
                                       param_distributions=param_dist,
                                       n_iter=n_iter,
                                       scoring=scoring,
                                       n_jobs=-1,
                                       cv=cv,
                                       fit_params={'sample_weight': weights})
    random_search.fit(X, Y)
    return random_search

X_train = np.random.randn(100, 5)
Y_train = np.zeros(100)
idx = np.unique(np.random.choice(100, 50))
Y_train[idx] = 1
train_weights = np.random.rand(100) + 1

scoring = 'neg_log_loss'
cv = model_selection.KFold(n_splits=5, random_state=42)

r = tune_rf(X_train, Y_train, train_weights, cv, scoring=scoring)

It still definitely happens on my machine with the specific dataset I am using. It's weird that the error does not happen when RandomForestClassifier is replaced by ExtraTreesClassifier. I will continue to try to reproduce the error with dummy data and post anything I find.

@ianlcassidy
Author
ianlcassidy commented Dec 16, 2016

Ohhh, I see. In my original code train_weights is a list and not a numpy array as it is in the code snippet I posted above.
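
Until the library handles lists itself, a caller-side workaround is to convert the weights to a NumPy array before putting them in fit_params. The sketch below shows the conversion and one array operation that a plain list does not support. The weight values and the counts array are made up for illustration, and the idea that the forest code scales weights elementwise is an assumption, not something confirmed in this thread.

```python
import numpy as np

# Hypothetical weights stored as a plain Python list, as in the original report.
train_weights = [1.2, 1.7, 1.1, 1.9]

# Illustrative per-sample counts (for example, how often each row is drawn
# in a bootstrap sample). These numbers are invented for the example.
counts = np.array([2, 0, 1, 1])

# Workaround: coerce the list to a float ndarray before passing it on.
weights_array = np.asarray(train_weights, dtype=np.float64)

# An ndarray supports copy() and elementwise in-place multiplication,
# which is the kind of array API that estimator code may rely on.
scaled = weights_array.copy()
scaled *= counts

# This dict is what would be handed to RandomizedSearchCV in the report above.
fit_params = {'sample_weight': weights_array}
```

Because scaled is a copy, weights_array itself is left unchanged and can be reused across fits.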
