8000 [MRG+1] Sparse bagging tests by msalahi · Pull Request #3076 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG+1] Sparse bagging tests #3076

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 18, 2014
Merged

[MRG+1] Sparse bagging tests #3076

merged 2 commits into from
Apr 18, 2014

Conversation

msalahi
Copy link
@msalahi msalahi commented Apr 16, 2014

Added a test each for BaggingClassifier and BaggingRegressor to train on dense/sparse versions of the same data with various input parameters, and make sure the outputs are equal. Constrained tests to just SVC/SVR models. It should be up to various models' tests to make sure they handle sparse data correctly. These tests should just make sure BaggingClassifier doesn't throw a wrench in the works along the way.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 9d7493e on msalahi:sparse-bagging-tests into 6f6de86 on scikit-learn:master.

assert_array_equal(
sparse_results,
dense_results,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cosmetics: please put this assertion on one line as it is less than 80 columns wide.

@ogrisel
Copy link
Member
ogrisel commented Apr 17, 2014

@msalahi how long do the test take? So according to travis and your new test, bagging already supports sparse input right? I would not have expected that on oldish scipy versions.


base_estimator = SVC()
for params in grid:
sparse_format = csc_matrix
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please try with csr_matrix and coo_matrix as well? If some of them fail we will have to fix that by improving the input validation logic.

@glouppe
Copy link
Contributor
glouppe commented Apr 17, 2014

I would not have expected that on oldish scipy versions.

Same here, I am quite surprised this passes. I would double-check to be sure that data is not densified somewhere along the lines...

@msalahi
Copy link
Author
msalahi commented Apr 17, 2014

test_sparse_classification takes ~1.06s, test_sparse_regression takes ~0.6s. I took out the various matrix types / model types to get test speeds down to the speeds they're currently at, but I'm happy to add the different matrix types back if you're cool with the tests running one or two seconds longer.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling f38db56 on msalahi:sparse-bagging-tests into 9f07963 on scikit-learn:master.

iris.target,
random_state=rng)
grid = ParameterGrid({"max_samples": [0.5, 1.0],
"max_features": [1, 2, 4],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can only test max_features in [2, 4] to speed up the test.

@ogrisel
Copy link
Member
ogrisel commented Apr 17, 2014

To speed up the tests you could avoid exploring all the possible combinations of parameters but rather a fixed subset such as:

parameter_sets = [
    {"max_samples": 0.5,
     "max_features": 2,
     "bootstrap": True,
     "bootstrap_samples": True},
    {"max_samples": 1.0,
     "max_features": 4,
     "bootstrap": True,
     "bootstrap_samples": True},
    {"max_features": 2,
     "bootstrap": False,
     "bootstrap_samples": True},
    {"max_samples": 0.5,
     "bootstrap": True,
     "bootstrap_samples": False},
]

@msalahi
Copy link
Author
msalahi commented Apr 17, 2014

thanks, i'll keep that in mind as i add the various matrix types back. the tests definitely break on coo_matrix, as they don't support slicing, which happens on line 99 and line 113 in bagging.py:

https://github.com/msalahi/scikit-learn/blob/master/sklearn/ensemble/bagging.py#L99

but csr and csc both work fine.

@msalahi
Copy link
Author
msalahi commented Apr 17, 2014

i too am surprised about the fact that it doesn't break on older scipy, but i can't find out where it would densify input. i'm using SVC() to test the functionality of bagging in these tests. if bagging did, at some point along the line, densify the input, then the SVC would never see sparse data. so, i put a print statement (the ultimate debugging tool) in the fit() function of libSVM (which SVC inherits) to print out the type of the input. that yielded <class 'scipy.sparse.csc.csc_matrix'> ... so i don't know, i think it might magically work?

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling 593bdc8 on msalahi:sparse-bagging-tests into b70a481 on scikit-learn:master.

@ogrisel
Copy link
Member
ogrisel commented Apr 17, 2014

I think we can merge this PR as it is and open a new one to handle the coo case.

For instance it could be made possible to have sklearn.utils.validation.check_arrays extended to support:

X, y = check_arrays(X, y, sparse_format=('csr', 'csc'))

when sparse_format is given as a tuple of supported formats instead of a single format string and X.getformat() not in sparse_format, check_arrays would internally do:

new_format = sparse_format[0]
X = getattr(X, 'to' + new_format)()

@glouppe what is your opinion?

@ogrisel ogrisel changed the title Sparse bagging tests [MRG+1] Sparse bagging tests Apr 17, 2014
@ogrisel
Copy link
Member
ogrisel commented Apr 17, 2014

@msalahi what are the new running times with the smaller parameter sets?

@ogrisel
Copy link
Member
ogrisel commented Apr 17, 2014

Actually before merging we should try to make sure that the data does not get converted into a dense array internally by BaggingClassifier. This could be achieved by using a CustomSVC that would record the nature of the training data in a custom attribute, e.g.:

class CustomSVC(SVC):
     """SVC variant that records the nature of the training set"""

    def fit(self, X, y):
         super(CustomSVC, self).fit(X, y)
         self.data_type_ = type(X)
         return self

@ogrisel
Copy link
Member
ogrisel commented Apr 17, 2014

This looks great to me. @glouppe any further comment?

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling d31dee1 on msalahi:sparse-bagging-tests into 47080ec on scikit-learn:master.

@glouppe
Copy link
Contributor
glouppe commented Apr 18, 2014

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0