Scalar fit_params no longer handled. Was: Singleton array (insert value here) cannot be considered a valid collection. · Issue #15805 · scikit-learn/scikit-learn

Closed
cia05rf opened this issue Dec 5, 2019 · 17 comments · Fixed by #15863

Comments

@cia05rf commented Dec 5, 2019

Description

TypeError: Singleton array array(True) cannot be considered a valid collection.

Steps/Code to Reproduce

Found when running RandomizedSearchCV with LightGBM; previously this worked fine. The latest release requires that all the **fit_params be checked for 'sliceability', which is difficult when some fit params are scalars such as early_stopping_rounds = 5.

#Import the modules
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

#X_train, y_train, X_valid, y_valid and lgbm_custom_loss are assumed
#to already be defined

#Create parameters grid
#Create fixed parameters
mod_fixed_params = {
    'boosting_type':'gbdt'
    ,'random_state':0
    ,'silent':False
    ,'objective':'multiclass'
    ,'num_class':np.unique(y_train)
    ,'min_samples_split':200 #Should be between 0.5-1% of samples
    ,'min_samples_leaf':50
    ,'subsample':0.8
}
search_params = {
    'fixed':{
        'cv':3
        ,'n_iter':80
        ,'verbose':True
        ,'random_state':0
    }
    ,'variable':{
        'learning_rate':[0.1,0.01,0.005]
        ,'num_leaves':np.linspace(10,1010,100,dtype=int)
        ,'max_depth':np.linspace(2,22,10,dtype=int)
    }
}
fit_params = {
    'verbose':True
    ,'eval_set':[(X_valid,y_valid)]
    ,'eval_metric':lgbm_custom_loss
    ,'early_stopping_rounds':5
}

#Setup the model
lgb_mod = lgb.LGBMClassifier(**mod_fixed_params)
#Add the search grid
seed = np.random.seed(0)
gbm = RandomizedSearchCV(lgb_mod,search_params['variable'],**search_params['fixed'])
#Fit the model
gbm.fit(X_train,y_train,**fit_params)
print('Best parameters found by grid search are: {}'.format(gbm.best_params_))

I've traced the error through; it starts at line 652 of model_selection/_search.py.


Expected Results

Expected LightGBM to run through RandomizedSearchCV.

Actual Results

TypeError: Singleton array array(True) cannot be considered a valid collection.
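This failure can be reproduced without LightGBM at all: sklearn's `indexable` utility coerces a scalar to a 0-d numpy array, and the consistency check then rejects it. A minimal sketch of what 0.22's search code does to each fit_params value:

```python
from sklearn.utils import indexable

# A sample-aligned list is accepted, but a scalar like verbose=True
# is coerced to a 0-d numpy array, which the length check rejects.
try:
    indexable([1, 2, 3], True)
except TypeError as exc:
    print(exc)  # Singleton array array(True) cannot be considered a valid collection.
```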

Versions

@cia05rf (Author) commented Dec 5, 2019

Don't know why versions didn't come through:
System:
python: 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)]
executable: C:\Users\Robert\Anaconda3\python.exe
machine: Windows-10-10.0.18362-SP0

Python dependencies:
pip: 19.3.1
setuptools: 41.0.1
sklearn: 0.22
numpy: 1.17.4
scipy: 1.1.0
Cython: 0.28.5
pandas: 0.25.0
matplotlib: 3.1.2
joblib: 0.14.0

Built with OpenMP: True

@jnothman (Member) commented Dec 5, 2019 via email

@cia05rf (Author) commented Dec 6, 2019

I'm trying to work out if you're being sarcastic or not. If you aren't, then thanks.

@jnothman (Member) commented Dec 7, 2019 via email

@jnothman added this to the 0.22.1 milestone Dec 7, 2019
@cia05rf (Author) commented Dec 8, 2019

Thanks, I think this is best sorted on the LightGBM GitHub so I'll raise an issue there.

For anyone who comes across this, I've found a workaround by making the changes below to sklearn/model_selection/_search.py -> fit.

Currently at line 651

# make sure fit_params are sliceable
- fit_params_values = indexable(*fit_params.values())
- fit_params = dict(zip(fit_params.keys(), fit_params_values))
+# fit_params_values = indexable(*fit_params.values())
+# fit_params = dict(zip(fit_params.keys(), fit_params_values))

@jnothman changed the title "Singleton array (insert value here) cannot be considered a valid collection." → "Scalar fit_params no longer handled. Was: Singleton array (insert value here) cannot be considered a valid collection." Dec 8, 2019
@adrinjalali (Member)

I'm really not sure this is a regression on our side, though. early_stopping_rounds does very much sound like an init parameter in our conventions, and I'd rather keep it that way.

@jnothman (Member)

I'm really not sure if this is a regression on our side though.

It's been raised 2-3 times in the couple of weeks since 0.22 was released, and not before.

In 0.21.X:

X, y, groups = indexable(X, y, groups)
n_splits = cv.get_n_splits(X, y, groups)

in 0.22.X

X, y, groups = indexable(X, y, groups)
# make sure fit_params are sliceable
fit_params_values = indexable(*fit_params.values())
fit_params = dict(zip(fit_params.keys(), fit_params_values))

early_stopping_rounds does very much sound like an init parameter in our conventions

It does. But we have tacitly supported this behaviour for many, many releases and have changed the behaviour without warning. The support is more than tacit in the sense that _fit_and_score explicitly makes use of a helper that bypasses fit params that are not samplewise:

def _index_param_value(X, v, indices):
    """Private helper function for parameter value indexing."""
    if not _is_arraylike(v) or _num_samples(v) != _num_samples(X):
        # pass through: skip indexing
        return v

Thus the previous behaviour could be understood as supported and intended behaviour, even though it was untested (with respect to search at least).
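That pass-through behaviour can be illustrated with a small self-contained re-implementation (a sketch following the 0.21-era helper; `index_param_value` and `_is_arraylike` here are simplified stand-ins, not the actual sklearn internals):

```python
import numpy as np

def _is_arraylike(x):
    # Simplified stand-in for sklearn's private _is_arraylike check.
    return hasattr(x, "__len__") or hasattr(x, "shape") or hasattr(x, "__array__")

def index_param_value(X, v, indices):
    """Slice sample-aligned fit params for a CV split; pass everything
    else (scalars, length-mismatched values) through untouched."""
    if not _is_arraylike(v) or len(v) != len(X):
        # pass through: skip indexing
        return v
    return np.asarray(v)[indices]

X = [[0], [1], [2], [3]]
print(index_param_value(X, 5, [0, 2]))                 # scalar -> 5
print(index_param_value(X, [10, 11, 12, 13], [0, 2]))  # aligned -> [10 12]
```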

Yes, we can change behaviour around things that do not conform to our conventions, but the change was introduced by @amueller in #14702 and was incidental to that PR. If we are going to change our handling of popular if non-conforming estimators, it should be done intentionally, and incidental changes should indeed be reverted in patch releases, IMO.

@ogrisel (Member) commented Dec 11, 2019

+1 for at least restoring backward compat in 0.22.1 (to restore support for scalar fit params possibly with deprecation warning).

+0 for keeping (undocumented) support for scalar fit params indefinitely without deprecation warning as they feel harmless to me.

@rwjmiller

@jnothman (Member)

I think we should continue to support scalars, but consider deprecating support for array-likes of size != n_samples in 0.23.

I think for 0.22 we need to ensure similar behaviour in search and other cross validation, which means full backwards compatibility with 0.21.

Also note that the documentation is completely opaque regarding the handling.

@jnothman (Member)

The solution should introduce _check_fit_params to check consistent length / indexable while passing through scalars.
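A rough sketch of what such a helper could do (the name `check_fit_params` and its exact rules here are illustrative; the real implementation in the linked fix may differ):

```python
from sklearn.utils import indexable

def check_fit_params(X, fit_params):
    """Illustrative sketch: make sample-aligned fit params indexable
    for CV slicing, while passing scalars through untouched."""
    n_samples = len(X)
    checked = {}
    for key, value in fit_params.items():
        if not hasattr(value, "__len__") or len(value) != n_samples:
            # scalar (e.g. early_stopping_rounds=5) or a value that
            # is not sample-aligned: pass through unchanged
            checked[key] = value
        else:
            # sample-aligned: run through indexable so it can be sliced
            checked[key] = indexable(value)[0]
    return checked

X = [[0], [1], [2]]
params = check_fit_params(X, {"early_stopping_rounds": 5,
                              "sample_weight": [0.1, 0.2, 0.7]})
print(params["early_stopping_rounds"])  # 5, passed through untouched
```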

@NicolasHug (Member) commented Dec 12, 2019

Let's not deprecate non-aligned fit_params just yet; we need to think about it carefully first. Non-aligned fit_params are one proposal for implementing the new warm start API (#15105).

We might also want to add feature-aligned params in the future, who knows

@jnothman (Member) commented Dec 12, 2019 via email

@qinhanmin2014 (Member)

Not being sarcastic. It's a bit frustrating that third party libraries ignore our convention of only having sample-aligned arguments to fit

Sorry but do we mention this behavior in the doc? This seems strange.

@ogrisel (Member) commented Dec 30, 2019

Not being sarcastic. It's a bit frustrating that third party libraries ignore our convention of only having sample-aligned arguments to fit

No, this was never mentioned. We said "data-dependent". Feature-aligned arrays are still an option, although we would have to discuss the use case.

Another use case is passing a custom validation set (X_val and y_val) for models that support early stopping. Passing those as constructor params would mean they are stored as attributes on the estimator, which makes the object unnecessarily big for pickling / deployment. Passing them as fit params would be possible, but they are not sample-aligned with X_train, and routing in pipelines/column-transformer/feature-union/grid-search/cross-val might be challenging. Or maybe not.

@qinhanmin2014 (Member)

No this was never mentioned.

@jnothman Could you please explain your comment "It's a bit frustrating that third party libraries ignore our convention of only having sample-aligned arguments to fit"? Thanks.

No this was never mentioned. We said "data-dependent". feature aligned arrays is still an option although we would have to discuss the use case.

I saw "data-dependent" in our doc, but I think it's difficult to define what's "data-dependent", and I guess "data-dependent" parameters don't need to be indexable.

@jnothman (Member)

I understand that we ordinarily avoid parameters to fit that are not sample-aligned. This is a convention that other estimator developers have not followed. In theory, other data-dependent parameters belong there, but we have not done this AFAIK.
