[WIP] Bagging ensemble meta-estimator by glouppe · Pull Request #2198 · scikit-learn/scikit-learn · GitHub

[WIP] Bagging ensemble meta-estimator #2198


Closed
glouppe wants to merge 32 commits

Conversation

@glouppe (Contributor) commented Jul 23, 2013

Hi,

This is a very early PR for a meta-estimator implementing ensemble averaging/voting. The idea is a meta-estimator that can take any type of base estimator as input (not only trees) and build an ensemble out of it. This should work quite well for estimators with high variance (typically trees, GBRT, neural networks); see the usage sketch after the TODO list below.


TODO list:

  • rename to BaggingClassifier and BaggingRegressor.
  • add subsampling hyper-parameter.
  • add subsampling_features hyper-parameter.
  • documentation
  • tests
  • examples
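
For context, here is a minimal usage sketch of the intended meta-estimator, referenced above. The BaggingClassifier name is taken from the TODO list; the exact constructor signature is an assumption, modeled on the estimator that later shipped in scikit-learn.

    # Hypothetical usage: wrap an arbitrary base estimator (not only
    # trees) and aggregate its predictions over bootstrap replicates.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier  # name assumed from the TODO list
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    clf = BaggingClassifier(KNeighborsClassifier(), n_estimators=10,
                            random_state=0).fit(X, y)
    print(clf.score(X, y))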

@pprett (Member) commented Jul 23, 2013

For me, RandomPatches sounds good: extracting random patches from the feature matrix X...

# Compute the out-of-bag mask for each fitted base estimator: samples
# that were not drawn to train that estimator remain True.
for estimator in self.estimators_:
    mask = np.ones(n_samples, dtype=np.bool)
    mask[estimator.indices_] = False

Review comment (Member):

Could you add some comments on what's going on below? It seems like you have two cases, depending on whether or not the estimator supports predict_proba.

glouppe (author) replied:

Basically, if predict_proba is not supported, I make the base estimators vote. I'll add some comments to clarify things.
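
A hedged sketch of those two cases (the helper below is hypothetical, for illustration only, and is not the PR's actual code):

    import numpy as np

    def aggregate(estimators, X, classes):
        # `classes` is assumed to be a NumPy array of the class labels.
        if hasattr(estimators[0], "predict_proba"):
            # Case 1: the base estimator supports predict_proba, so
            # average the class probabilities over all members.
            proba = np.mean([est.predict_proba(X) for est in estimators],
                            axis=0)
        else:
            # Case 2: no predict_proba, so each estimator casts one hard
            # vote per sample and votes are tallied per class.
            proba = np.zeros((X.shape[0], len(classes)))
            for est in estimators:
                pred = est.predict(X)
                for k, c in enumerate(classes):
                    proba[pred == c, k] += 1
            proba /= len(estimators)
        return classes[np.argmax(proba, axis=1)]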

@pprett (Member) commented Jul 23, 2013

Code looks good to me, but documentation is missing. It would be great if you could add an example and/or incorporate RandomPatches into one of our examples.

@ogrisel (Member) commented Jul 25, 2013

I think the RandomPatches name is too confusing, as people might expect that it's only relevant to 2D structured data like images or other computer-vision-related tasks.

ResampledClassifiers sounds both concise and explicit to me.

@amueller (Member) commented:

How about ResampledEnsemble?

@amueller (Member) commented:

Actually, maybe ResampledClassifiers is better.

@GaelVaroquaux (Member) commented:

I think the RandomPatches name is too confusing, as people might expect that it's only relevant to 2D structured data like images or other computer-vision-related tasks.

Fully agreed.

ResampledClassifiers sounds both concise and explicit to me.

Sounds good. Or maybe BaggingClassifier: I don't care about the mathematical exactitudes of little details like the fact that this does more than bootstrap: bagging has captured the popular imagination beyond bootstrap.


@ogrisel (Member) commented Aug 4, 2013

Sounds good. Or maybe BaggingClassifier: I don't care about the mathematical exactitudes of little details like the fact that this does more than bootstrap: bagging has captured the popular imagination beyond bootstrap.

Also the recent paper on Google ad click prediction calls the "feature subsampling" strategy (without replacement) "Feature Bagging". So indeed Bagging ain't what it used to be. Therefore I am ok with abusing the name as well.

To sum up, I am OK with either ResampledClassifiers or BaggingClassifier.

@agramfort (Member) commented:
I think that @glouppe did not like Bagging, so we went for ResampledClassifiers, but I am fine with BaggingClassifier too.

emsrc and others added 15 commits August 18, 2013 19:17
- Added 'cosine_distances' function to sklearn.metrics.pairwise.
- Added 'cosine' as metric in 'pairwise_distances' function.
- Corrected doc string of same function, because all metrics based on
the 'manhattan_distances' function (i.e. 'cityblock', 'l1', and
'manhattan') do currently NOT support sparse matrices.
- Added corresponding unit test.
Cosine distance metric for sparse matrices
[MRG] remove warnings in univariate feature selection
… random-patches

Conflicts:
	sklearn/ensemble/__init__.py
@glouppe (Contributor, author) commented Aug 20, 2013

Ping. Just to let you know, I am making progress on this. I have renamed the classes to BaggingClassifier (resp. BaggingRegressor) and added everything that I wanted. This meta-estimator can now handle pasting, bagging, random subspaces, or all of them at once (i.e., random patches).

The only things left are the narrative documentation and writing an example :)
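
For readers following along, here is a hedged sketch of how those four strategies map onto sampling hyper-parameters. The parameter names below assume the BaggingClassifier API that eventually shipped in scikit-learn and may not match this branch exactly.

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    base = DecisionTreeClassifier()
    # Pasting: random subsets of samples, drawn without replacement.
    pasting = BaggingClassifier(base, max_samples=0.5, bootstrap=False)
    # Bagging: samples drawn with replacement (the bootstrap).
    bagging = BaggingClassifier(base, max_samples=1.0, bootstrap=True)
    # Random subspaces: random subsets of the features.
    subspaces = BaggingClassifier(base, max_features=0.5, bootstrap=False)
    # Random patches: random subsets of both samples and features.
    patches = BaggingClassifier(base, max_samples=0.5, max_features=0.5)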

@glouppe (Contributor, author) commented Aug 20, 2013

Also, I have rebased on top of master, but the history in the pull request seems to be kind of screwed up :s It contains duplicate or unrelated commits. Any guess on how to clean that up?

@GaelVaroquaux (Member) commented:

Also, I have rebased on top of master, but the history in the pull request seems to be kind of screwed up :s It contains duplicate or unrelated commits. Any guess on how to clean that up?

Yeah, rebasing confuses git. Unfortunately, the only way out of this, AFAIK, is to create a new PR.

@glouppe closed this Aug 20, 2013
@glouppe mentioned this pull request Aug 20, 2013