[MRG+1] Fix to documentation and docstring of randomized lasso and randomized logistic regression #6498
Conversation
.. math:: \hat{w_I} = \mathrm{arg}\min_{w} \frac{1}{2n_I} \sum_{i \in I} (y_i - x_i^T w)^2 + \alpha \sum_{j=1}^p \frac{|w_j|}{s_j},

where :math:`s_j \in \{s, 1\}` are independent trials of a fair Bernoulli random variable.
Let's be consistent with the docstrings of the functions. `s` is described as `alpha` in the docstrings:

    scaling : float, optional, default=0.5
        The alpha parameter in the stability selection article used to
        randomly scale the features. Should be between 0 and 1.
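(For context, this is roughly how the parameter was exposed at the time. The snippet below is illustrative only; `RandomizedLasso` has since been deprecated and removed from scikit-learn, so treat the exact signature as an assumption.)

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RandomizedLasso  # existed around scikit-learn 0.17

# Toy regression data just to show where `scaling` plugs in.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, random_state=0)

rlasso = RandomizedLasso(scaling=0.5, n_resampling=200, random_state=0).fit(X, y)
print(rlasso.scores_)  # per-feature selection frequencies (the stability scores)
```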
I thought about this but I am not sure. `alpha` is used throughout sklearn linear models as the penalty parameter. The unfortunate thing is that `alpha` is the scaling factor in the paper. Probably using `alpha` in the equations will be more confusing. Maybe the solution is to change the docstring for `scaling` and reference that this is `s` in the equation. What do you think?
Very good observation. Though, to stay true to the source code, I would say it's better to replace `alpha` with `C` and `s` with `alpha`. We would ideally want someone to look at the model's description and understand the model by looking at its objective function.
But you do make a good point about trying to stay true to the paper; I would put in parentheses what the variables `C` and `alpha` represent in the original paper.
(Making the point that `w` here is `beta` in the paper, and `s_j` here is `w_k` in the paper, seems like overkill.)
Yup, I agree with you. Staying internally consistent is better than matching the paper's notation. Though, there is an additional "issue": `alpha` and `C` in the code refer to the penalization in the Lasso and logistic regression models, respectively, and in the code the scaler is not `alpha` but a variable called `weights`. I just submitted a change that does not call the scaling parameter `alpha`, but just `s`. This might be better. What do you think?
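As an aside, the reason the docstring talks about "scaling the features" is that penalizing `|w_j| / s_j` is equivalent to rescaling column `j` by `s_j` before a plain Lasso fit and then mapping the coefficients back. Here is a minimal sketch of that reformulation (illustrative only, not the actual scikit-learn implementation; the helper name is made up):

```python
import numpy as np
from sklearn.linear_model import Lasso

def randomized_lasso_fit(X, y, alpha=1.0, scaling=0.5, random_state=None):
    """One randomized-penalty Lasso fit (sketch only)."""
    rng = np.random.RandomState(random_state)
    n_features = X.shape[1]
    # s_j in {scaling, 1}, independent fair Bernoulli trials (the `s` in the equation)
    s = np.where(rng.randint(2, size=n_features) == 1, scaling, 1.0)
    # Penalizing |w_j| / s_j is the same as a plain Lasso on columns rescaled by s_j ...
    lasso = Lasso(alpha=alpha).fit(X * s, y)
    # ... followed by mapping the coefficients back to the original scale.
    return s * lasso.coef_
```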
These are very good edits. Thanks for the clarification in the documentation, @clamus.
Thanks so much for the comments @hlin117! I fixed most of them, except the two things I mention now in the commit message.
@@ -267,9 +283,6 @@ of features non zero.
   Journal of the Royal Statistical Society, 72 (2010)
   http://arxiv.org/pdf/0809.2932

-* F. Bach, "Model-Consistent Sparse Estimation through the Bootstrap"
-  http://hal.inria.fr/hal-00354771/
why?
I think that Bach's procedure (the Bolasso) does things somewhat differently from what is implemented in sklearn. Also, what is implemented does match Meinshausen & Buhlmann's stability selection procedure.
Specifically, Bach's Bolasso gets bootstrap replicates of the data, by sampling data pairs with replacement or by resampling residuals of a lasso fit, then fits a regular lasso to each bootstrap replicate, and at last intersects the active sets across replicates. It also gives theoretical guidance on how to choose the regularization parameter for the lasso in both the high- and low-dimensional settings.
On the other hand, Meinshausen & Buhlmann's stability procedure gets several subsamples of the data pairs (each a sample of smaller size than the original, drawn without replacement), then fits the modified lasso where the penalty of a random subset of features is scaled, and then counts the number of times each feature is selected across subsamples for each value of the regularization parameter. This gives a score for each feature and regularization parameter, which is then thresholded. The paper then tells you to pick, for each feature, the highest score across the thresholded regularization parameters.
So, I think they are different. What do you think? I might have missed something.
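To make the contrast concrete, here is a rough sketch of the two schemes as described above (illustrative only, not the actual Bolasso or scikit-learn implementations; the function names and parameters are made up for this example):

```python
import numpy as np
from sklearn.linear_model import Lasso

def bolasso_support(X, y, alpha=1.0, n_bootstrap=32, random_state=None):
    """Bach's Bolasso (sketch): plain Lasso on bootstrap replicates,
    then intersect the active sets across replicates."""
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    support = np.ones(n_features, dtype=bool)
    for _ in range(n_bootstrap):
        idx = rng.randint(n_samples, size=n_samples)      # resample pairs with replacement
        coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
        support &= coef != 0                               # keep only always-selected features
    return support

def stability_scores(X, y, alpha=1.0, scaling=0.5, sample_fraction=0.75,
                     n_resampling=200, random_state=None):
    """Meinshausen & Buhlmann stability selection (sketch): subsample without
    replacement, randomly scale the penalty of a subset of features, and count
    how often each feature is selected."""
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_resampling):
        idx = rng.permutation(n_samples)[:int(sample_fraction * n_samples)]
        s = np.where(rng.randint(2, size=n_features) == 1, scaling, 1.0)
        coef = Lasso(alpha=alpha).fit(X[idx] * s, y[idx]).coef_
        counts += coef != 0
    return counts / n_resampling   # selection frequencies, to be thresholded
```

The paper runs this over a grid of regularization values and thresholds the scores; the sketch uses a single alpha just to keep it short.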
That's a good point; now I think credit to Bach's Bolasso needs to be included. What about keeping the reference and adding a little note clarifying that what is implemented is the stability selection procedure, and that a similar randomization approach can be obtained via the Bolasso?
@agramfort, I added back the reference to Bach's Bolasso and explained in the documentation that the implementation follows the stability selection method.
CircleCI fails as http://apt.typesafe.com is not reachable. Uhm?
LGTM
@agramfort, do you have any opinions of what the scaling factor … We were also wondering whether …
In the stability selection method, a subsample of the data is fit to a
L1-penalized model where the penalty of a random subset of coefficients has
been scaled. Specifically, given a subsample of the data
:math:`(y_i, x_i), i \in I`, where :math:`I \subset \{1, 2, \ldots, n\}` is a
I think it's more consistent in scikit-learn to use `(x_i, y_i)` instead of `(y_i, x_i)`. (Citation / reference needed.)
Makes sense, will fix this.
Also, what citation is missing? I didn't see it.
When I said "Citation / reference needed", I meant that I needed to find a reference in the scikit-learn documentation that uses `(x_i, y_i)` instead of `(y_i, x_i)`.
You are right about the order. Here is an example: http://scikit-learn.org/stable/modules/sgd.html#mathematical-formulation
I think the current variable names for scaling, etc., are the most consistent with
@hlin117 and @agramfort, thanks for reviewing this!
@clamus: My pleasure. Thanks for working on this. Hope you contribute more to scikit-learn!
@agramfort, should I rebase to include the fixes in upstream/master that fixed the issue with Travis?
Yes, if you can, please.
Force-pushed from ae68dc6 to 3a83071.
Thanks for your contribution! Merging.
You are welcome!
Fixes #6493
@hlin117 and @agramfort, this is ready.
Todo:
`scaling` in relation to the equations.