Algorithm description for RandomizedLogisticRegression and RandomizedLasso is inaccurate · Issue #6493 · scikit-learn/scikit-learn · GitHub


Closed
hlin117 opened this issue Mar 5, 2016 · 6 comments

Comments

@hlin117
Contributor
hlin117 commented Mar 5, 2016

The algorithm descriptions for RandomizedLogisticRegression and RandomizedLasso are as follows:

Randomized Logistic Regression
Randomized Regression works by resampling the train data and computing a LogisticRegression on each resampling. In short, the features selected more often are good features. It is also known as stability selection.

Randomized Lasso.
Randomized Lasso works by resampling the train data and computing a Lasso on each resampling. In short, the features selected more often are good features. It is also known as stability selection.

I don't think these descriptions are accurate. In the original paper, the randomized lasso (and, by extension, the randomized logistic regression) is defined as follows:

[Screenshot of the Randomized Lasso definition from the paper: a weighted Lasso objective in which the penalty on each coefficient beta_k is scaled by a random per-feature weight W_k.]

(We would then find multiple values of beta-hat using randomly chosen values for W)

In other words, the algorithm randomly perturbs the penalty weights of the features; it does not resample the training set and fit to those samples (i.e., it does not bootstrap).

As the documentation is currently written, it sounds like we are resampling the training set in a bootstrap-like fashion. It should instead clarify that we randomly reweight each feature every time we fit Lasso / LogisticRegression to the data.
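To make the reweighting concrete, here is a minimal sketch of the idea (not the scikit-learn implementation): each feature column is scaled by a random weight drawn from [alpha, 1], an ordinary Lasso is fit to the rescaled data, and selection frequencies are accumulated over repetitions. The regularization strength 0.1 and weakness 0.5 are arbitrary illustrative values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       random_state=0)

alpha_weak = 0.5        # "weakness" parameter alpha in (0, 1]
n_resamplings = 50
selected = np.zeros(X.shape[1])

for _ in range(n_resamplings):
    # Random per-feature weights W_k; scaling X[:, k] by W_k is
    # equivalent to penalizing beta_k by lambda / W_k.
    W = rng.uniform(alpha_weak, 1.0, size=X.shape[1])
    lasso = Lasso(alpha=0.1).fit(X * W, y)
    selected += lasso.coef_ != 0

scores = selected / n_resamplings   # selection frequency per feature
print(scores)
```

Features with a high selection frequency across the randomized fits are the "stable" ones.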

Thoughts, @agramfort, @GaelVaroquaux ?

@clamus
clamus commented Mar 6, 2016

I thought this was confusing too. After looking briefly at the code it seems that the algorithm actually does randomization via both the W_k as well as subsampling the data. This "double" randomization is described in Remark 4 of Meinshausen & Bühlmann, and this is what they recommend. Maybe the documentation can be extended to mention the double randomization?
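For illustration, the "double" randomization amounts to combining both sources of noise in each iteration: a random subsample of the rows plus random per-feature weights. This is only a sketch under those assumptions, not the library's code; the subsample fraction of one half and the weight range [0.5, 1] are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       random_state=0)
n, p = X.shape
selected = np.zeros(p)
n_resamplings = 50

for _ in range(n_resamplings):
    idx = rng.choice(n, size=n // 2, replace=False)   # subsample rows
    W = rng.uniform(0.5, 1.0, size=p)                 # reweight features
    coef = Lasso(alpha=0.1).fit(X[idx] * W, y[idx]).coef_
    selected += coef != 0

scores = selected / n_resamplings
print(scores)
```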

@hlin117
Contributor Author
hlin117 commented Mar 6, 2016

I think that's a great idea, @clamus. Thanks for checking the source code. At the very least though, the documentation should mention the randomization of the weights, because that's what Meinshausen & Bühlmann defined as their "Randomized Lasso / Logistic Regression".

@agramfort
Member
agramfort commented Mar 6, 2016 via email

@clamus
clamus commented Mar 6, 2016

I'll do it if it is ok with @hlin117

@hlin117
Contributor Author
hlin117 commented Mar 6, 2016

@clamus. Sure, be my guest.

Can you write the docs so that the equation is embedded in there too? Something similar to the Lasso documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

@clamus
clamus commented Mar 6, 2016

yep sure
