[MRG+1] Sparse bagging tests #3076
Conversation
```python
assert_array_equal(
    sparse_results,
    dense_results,
)
```
cosmetics: please put this assertion on one line as it is less than 80 columns wide.
@msalahi how long do the tests take? So according to Travis and your new test, bagging already supports sparse input, right? I would not have expected that on oldish scipy versions.
```python
base_estimator = SVC()
for params in grid:
    sparse_format = csc_matrix
```
Could you please try with `csr_matrix` and `coo_matrix` as well? If some of them fail, we will have to fix that by improving the input validation logic.

Same here, I am quite surprised this passes. I would double-check to be sure that the data is not densified somewhere along the line...
|
```python
    iris.target,
    random_state=rng)
grid = ParameterGrid({"max_samples": [0.5, 1.0],
                      "max_features": [1, 2, 4],
```
Maybe you can only test `max_features` in `[2, 4]` to speed up the test.
To speed up the tests you could avoid exploring all the possible combinations of parameters and instead use a fixed subset such as:

```python
parameter_sets = [
    {"max_samples": 0.5,
     "max_features": 2,
     "bootstrap": True,
     "bootstrap_features": True},
    {"max_samples": 1.0,
     "max_features": 4,
     "bootstrap": True,
     "bootstrap_features": True},
    {"max_features": 2,
     "bootstrap": False,
     "bootstrap_features": True},
    {"max_samples": 0.5,
     "bootstrap": True,
     "bootstrap_features": False},
]
```
thanks, i'll keep that in mind as i add the various matrix types back. the tests definitely break on `coo_matrix`, as it doesn't support slicing, which happens on line 99 and line 113 in bagging.py: https://github.com/msalahi/scikit-learn/blob/master/sklearn/ensemble/bagging.py#L99 but csr and csc both work fine.
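The coo limitation mentioned above is easy to reproduce with scipy alone. This is a standalone illustration, not code from the PR:

```python
import numpy as np
from scipy.sparse import coo_matrix, csc_matrix, csr_matrix

X = np.arange(12).reshape(4, 3)
indices = np.array([0, 2])

# csr and csc both support row selection with an index array,
# which is the kind of slicing bagging performs on the input
rows_csr = csr_matrix(X)[indices]
rows_csc = csc_matrix(X)[indices]

# coo_matrix does not implement indexing at all, so the same
# expression raises a TypeError
try:
    coo_matrix(X)[indices]
    coo_supports_indexing = True
except TypeError:
    coo_supports_indexing = False
```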
i too am surprised that it doesn't break on older scipy, but i can't find where it would densify the input. i'm using SVC() to test the functionality of bagging in these tests. if bagging did, at some point along the line, densify the input, then the SVC would never see sparse data. so, i put a print statement (the ultimate debugging tool) in the fit() function of libSVM (which SVC inherits) to print out the type of the data it receives.
I think we can merge this PR as it is and open a new one to handle the coo case. For instance it could be made possible to have `X, y = check_arrays(X, y, sparse_format=('csr', 'csc'))` with a conversion fallback such as:

```python
new_format = sparse_format[0]
X = getattr(X, 'to' + new_format)()
```

@glouppe what is your opinion?
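A minimal sketch of that conversion idea (the `check_arrays` helper belongs to an old scikit-learn API and has since been replaced, so this snippet uses plain scipy; the variable names are illustrative):

```python
import numpy as np
from scipy.sparse import coo_matrix

# formats the downstream estimator is assumed to accept
sparse_format = ('csr', 'csc')

X = coo_matrix(np.eye(3))  # caller passes an unsupported format
if X.format not in sparse_format:
    # every scipy sparse matrix provides tocsr/tocsc converters,
    # so fall back to the first accepted format
    new_format = sparse_format[0]
    X = getattr(X, 'to' + new_format)()
```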
@msalahi what are the new running times with the smaller parameter sets?
Actually before merging we should try to make sure that the data does not get converted into a dense array internally by BaggingClassifier. This could be achieved by using a custom base estimator such as:

```python
class CustomSVC(SVC):
    """SVC variant that records the nature of the training set"""
    def fit(self, X, y):
        super(CustomSVC, self).fit(X, y)
        self.data_type_ = type(X)
        return self
```
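Plugged into a bagging fit, such a recording estimator lets a test assert that every base estimator actually saw sparse data. A sketch under the assumption that scikit-learn and scipy are installed; the try/except covers the rename of `base_estimator` to `estimator` in later scikit-learn releases:

```python
from scipy.sparse import csr_matrix, spmatrix
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

class CustomSVC(SVC):
    """SVC variant that records the type of the training data."""
    def fit(self, X, y, sample_weight=None):
        super().fit(X, y, sample_weight=sample_weight)
        self.data_type_ = type(X)
        return self

X, y = load_iris(return_X_y=True)

kwargs = dict(n_estimators=3, random_state=0)
try:
    clf = BaggingClassifier(estimator=CustomSVC(), **kwargs)
except TypeError:  # older scikit-learn spells it base_estimator
    clf = BaggingClassifier(base_estimator=CustomSVC(), **kwargs)

clf.fit(csr_matrix(X), y)

# if bagging densified the data internally, data_type_ would be
# numpy.ndarray rather than a scipy sparse matrix type
seen_sparse = all(issubclass(est.data_type_, spmatrix)
                  for est in clf.estimators_)
```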
This looks great to me. @glouppe any further comment?
+1
Added a test each for BaggingClassifier and BaggingRegressor that trains on dense and sparse versions of the same data with various input parameters and checks that the outputs are equal. Constrained the tests to just SVC/SVR models: it should be up to the various models' own tests to make sure they handle sparse data correctly. These tests should just make sure BaggingClassifier doesn't throw a wrench in the works along the way.
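A condensed sketch of what such a dense/sparse equivalence test can look like (parameter values are illustrative, not necessarily the exact ones from the PR; the try/except covers the later rename of `base_estimator` to `estimator` in scikit-learn):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_sparse = csr_matrix(X)

def make_bagging():
    params = dict(n_estimators=3, max_samples=0.5, max_features=2,
                  random_state=1)
    try:
        return BaggingClassifier(estimator=SVC(), **params)
    except TypeError:  # older scikit-learn spells it base_estimator
        return BaggingClassifier(base_estimator=SVC(), **params)

# same random_state, so both ensembles draw identical samples/features
dense_pred = make_bagging().fit(X, y).predict(X)
sparse_pred = make_bagging().fit(X_sparse, y).predict(X_sparse)

# the ensemble should predict identically regardless of input format
np.testing.assert_array_equal(dense_pred, sparse_pred)
```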