MRG Feature stacker #1173
Conversation
clf = RandomizedLogisticRegression(verbose=False, C=1., random_state=42,
                                   scaling=scaling, n_resampling=50, tol=1e-3)
feature_scores_sp = clf.fit(X_sp, y).scores_
assert_equal(feature_scores, feature_scores_sp)
This hunk seems to be unrelated to this PR.
whoops sorry, forked from wrong branch. just a sec.
Very interesting. I want an example first! (then documentation and tests :) |
on it :) |
@amueller to avoid forking from non-master branches you should use something such as http://volnitsky.com/project/git-prompt/ |
features.append(trans.transform(X))
issparse = [sparse.issparse(f) for f in features]
if np.any(issparse):
    features = sparse.hstack(features).tocsr()
Maybe the tocsr() can be avoided. The downstream model might prefer CSC, for instance ElasticNet.
Then again, bugs crop up every now and then where estimators that are supposed to handle any sparse format turn out to only handle CSR. It's a good defensive strategy to produce CSR by default (and it's unfortunate that sparse.hstack doesn't do this already).
I wrote this thing in the heat of the battle and I don't remember if there was a reason or if it was just a precaution. I'm inclined to think that I put it there because something, somewhere, broke.
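For context, a minimal sketch (not part of the PR) of why the explicit conversion matters: scipy's sparse.hstack returns a COO matrix regardless of the input formats, so the caller has to convert to whatever the downstream estimator prefers.

import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.eye(3))
b = sparse.csr_matrix(np.ones((3, 2)))

stacked = sparse.hstack([a, b])
print(stacked.format)          # 'coo' -- hstack does not preserve the CSR format of its inputs
print(stacked.tocsr().format)  # 'csr' -- the defensive default discussed above
print(stacked.tocsc().format)  # 'csc' -- what e.g. coordinate-descent models such as ElasticNet prefer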
Yes, it should derive from transformer mixin. |
Added a toy example. |
I think such a feature stack should provide some way to do feature group normalization in one way or another. But this probably requires some experiments to know which normalization pattern is useful on such a beast in practice. Does anybody have practical experience or insight to share on this? |
GREAT idea! However, I don't like the name. I tried to find a "plumbing equivalent" of this class to keep with the pipeline metaphor, but I can't seem to find it. It's not quite a tee, as it connects the various streams back together in the end. Maybe one of the other devs is more experienced with plumbing? :) |
BTW I think the example could be improved by using a less trivial dataset (e.g. the digits dataset) and showing that the cross-validated score of the best grid-searched parameter set for the pipeline with stacked features is better than for pipelines with the individual feature transformers used separately. |
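A rough sketch of the kind of example suggested here, written with the class under its eventual name FeatureUnion; the dataset, transformers and parameter grid are illustrative guesses, not part of this PR (import paths follow current scikit-learn releases):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

# Two feature transformers applied in parallel; their outputs are concatenated
# and fed into a single classifier.
features = FeatureUnion([("pca", PCA(n_components=10)),
                         ("kbest", SelectKBest(k=10))])
clf = Pipeline([("features", features), ("svm", LinearSVC())])

# Grid search reaches into the stacked transformers via nested parameter names.
param_grid = {"features__pca__n_components": [5, 10, 20],
              "features__kbest__k": [5, 10, 20]}
search = GridSearchCV(clf, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)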
@larsmans maybe |
Glad you like it. The estimator, and the example even more, are in a very rough state. I wasn't sure if there was interest, and I had to leave my desk without really testing the example. I'll try to polish it asap. Thanks for your suggestions. I don't think this exists in plumbing, btw; it's a T followed by a Y.

Lars Buitinck notifications@github.com wrote:

This message was sent from my Android mobile phone with K-9 Mail. |
My favorite so far |
I also like … Hm, I think I like … |
+1 for |
+1 for |
In my application, I found the … Oh, and @ogrisel, for the normalization, each feature should be normalized separately, right? |
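One possible shape for per-feature-group normalization, sketched with names from later scikit-learn releases (FeatureUnion, StandardScaler); this is only an illustration of "normalize each block separately", not something implemented in this PR: each transformer is wrapped in its own small Pipeline whose last step scales that block before the union concatenates everything.

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Each branch normalizes its own feature block before concatenation,
# so no block dominates merely because of its scale.
features = FeatureUnion([
    ("pca", Pipeline([("extract", PCA(n_components=10)),
                      ("scale", StandardScaler())])),
    ("kbest", Pipeline([("extract", SelectKBest(k=10)),
                        ("scale", StandardScaler())])),
])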
We might introduce an |
@larsmans ok, will do that. Should be easy enough. |
Having a bit of a hard time creating a good example :-/ |
Have you been able to use this kind of tool successfully for your Kaggle contest? If so, then we can stick to a simplistic toy example and say in the narrative documentation which kind of feature bundle was proven useful in practice on which kind of problem (e.g. PCA features + raw TF-IDF for text classification). |
I can tell you how successful I was tomorrow ;) |
This estimator applies a list of transformer objects in parallel to the
input data, then concatenates the results. This is useful to combine
several feature extraction mechanisms into a single estimator.
single feature representation?
I prefer it the way it is; getting the features out is not the important part, the important part is formulating it as an estimator.
I misunderstood what you meant. Since you're talking about extraction mechanisms, it may be clearer to say "in a single transformer".
agreed
Nice idea indeed! |
@mblondel any votes on the name? |
Some I like include |
Name votes: (If I counted correctly, which is unlikely given my degree in math) |
+1 for FeatureUnion. I would have thought of FeatureConcat (FeatureConcatenator?), but FeatureUnion |
Renamed, think this is good to go. |
Any more comments? (GitHub claims this cannot be merged, but I just rebased, so it should be a fast-forward merge.) |
This cannot be merged into master currently, but apart from that +1 for merging :) |
LGTM. 👍 for merge. Thanks @amueller ! |
Thank you for this convenient transformer. In my application I had to hack it a bit, and I wonder whether the feature I wanted could be more generally useful. Basically, sometimes you want to concatenate the same feature extractor multiple times and have some of the parameters tied when grid searching. In my case, I was learning a hyphenator, so my data points consist of 2 strings: the one to the left of the current position and the one to the right of the current position. For this I defined a class:

class HomogeneousFeatureUnion(FeatureUnion):
    def set_params(self, **params):
        # Broadcast every parameter to all sub-transformers, tying them together.
        for key, value in params.iteritems():
            for _, transf in self.transformer_list:
                transf.set_params(**{key: value})

This can be easily extended to support both tied params and specific params. I'm not sure whether I overengineered this, but I still have the feeling that this might pop up in other people's applications, so I wanted to raise the question. |
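A hypothetical usage sketch of that idea (the step names, vectorizer settings and parameter grid below are invented for illustration; in the real application each branch would also select its own context string before vectorizing): one grid entry sets the tied parameter on both vectorizers at once.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search at the time of this PR
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Uses the HomogeneousFeatureUnion class defined above: its set_params
# broadcasts parameters to every sub-transformer.
union = HomogeneousFeatureUnion([
    ("left", CountVectorizer(analyzer="char_wb")),
    ("right", CountVectorizer(analyzer="char_wb")),
])
clf = Pipeline([("contexts", union), ("svm", LinearSVC())])

# A single ngram_range is applied to *both* vectorizers during the search.
param_grid = {"contexts__ngram_range": [(1, 2), (1, 3), (2, 4)]}
search = GridSearchCV(clf, param_grid)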
This estimator provides a Y piece for the pipeline. I used it to combine word ngrams and char ngrams into a single transformer. Basically it just concatenates the output of several transformers into one large feature. If you think this is helpful, I'll add some docs and an example.
With this, together with Pipeline, one can build arbitrarily complex graphs (with one source and one sink) of estimators in sklearn :)
TODO
- tests
- narrative documentation
- example
Thanks to the awesome implementation of BaseEstimator, grid search simply works, though with complicated graphs you get parameter names like feature_stacker__first_feature__feature_selection__percentile (more or less from my code ^^).
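As a rough illustration of the word-ngrams-plus-char-ngrams use case described above (written with the final FeatureUnion name; the vectorizer settings are made-up values, not the ones used for this PR):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Word n-grams and character n-grams are extracted in parallel and concatenated
# into one sparse feature matrix that feeds a single classifier.
text_features = FeatureUnion([
    ("words", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("chars", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
])
clf = Pipeline([("features", text_features), ("svm", LinearSVC())])

# Nested parameter names follow the stacked structure, e.g.:
#   features__words__ngram_range, features__chars__ngram_range, svm__C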