[MRG+1] Fix to documentation and docstring of randomized lasso and randomized logistic regression · Pull Request #6498 · scikit-learn/scikit-learn (by clamus)

Merged · 5 commits · Mar 19, 2016

47 changes: 33 additions & 14 deletions doc/modules/feature_selection.rst
@@ -173,8 +173,8 @@ L1-based feature selection
sparse solutions: many of their estimated coefficients are zero. When the goal
is to reduce the dimensionality of the data to use with another classifier,
they can be used along with :class:`feature_selection.SelectFromModel`
to select the non-zero coefficients. In particular, sparse estimators useful for
this purpose are the :class:`linear_model.Lasso` for regression, and
to select the non-zero coefficients. In particular, sparse estimators useful
for this purpose are the :class:`linear_model.Lasso` for regression, and
of :class:`linear_model.LogisticRegression` and :class:`svm.LinearSVC`
for classification::

@@ -234,15 +234,34 @@ Randomized sparse models

.. currentmodule:: sklearn.linear_model

The limitation of L1-based sparse models is that faced with a group of
very correlated features, they will select only one. To mitigate this
problem, it is possible to use randomization techniques, reestimating the
sparse model many times perturbing the design matrix or sub-sampling data
and counting how many times a given regressor is selected.
In terms of feature selection, there are some well-known limitations of
L1-penalized models for regression and classification. For example, it is
known that the Lasso will tend to select an individual variable out of a group
of highly correlated features. Furthermore, even when the correlation between
features is not too high, the conditions under which L1-penalized methods
consistently select "good" features can be restrictive in general.

To mitigate this problem, it is possible to use randomization techniques such
as those presented in [B2009]_ and [M2010]_. The latter technique, known as
stability selection, is implemented in the module :mod:`sklearn.linear_model`.
In the stability selection method, a subsample of the data is fit with an
L1-penalized model where the penalty of a random subset of coefficients has
been scaled. Specifically, given a subsample of the data
:math:`(x_i, y_i), i \in I`, where :math:`I \subset \{1, 2, \ldots, n\}` is a
random subset of the data of size :math:`n_I`, the following modified Lasso
fit is obtained:

.. math:: \hat{w_I} = \mathrm{arg}\min_{w} \frac{1}{2n_I} \sum_{i \in I} (y_i - x_i^T w)^2 + \alpha \sum_{j=1}^p \frac{ \vert w_j \vert}{s_j},

where :math:`s_j \in \{s, 1\}` are independent trials of a fair Bernoulli

Contributor:
Let's be consistent with the docstrings of the functions. s is described as alpha in the docstrings:

    scaling : float, optional, default=0.5
        The alpha parameter in the stability selection article used to
        randomly scale the features. Should be between 0 and 1.

Author:
I thought about this, but I am not sure. alpha is used throughout sklearn linear models as the penalty parameter; the unfortunate thing is that alpha is the scaling factor in the paper. Using alpha in the equations would probably be more confusing. Maybe the solution is to change the docstring for scaling and note that it is s in the equation. What do you think?

Contributor:
Very good observation. Though, to stay true to the source code, I would say it's better to replace alpha with C and s with alpha. We would ideally want someone to look at the model's description and understand the model by looking at its objective function.

But you do make a good point about trying to stay true to the paper; I would put in parentheses what the variables C and alpha represent in the original paper.

(Making the point that w here is beta in the paper, and s_j here is w_k in the paper, seems like overkill.)

Author:
Yup, agreed: staying consistent internally matters more than matching the paper's notation. There is an additional "issue", though: alpha and C in the code refer to the penalization in the Lasso and logistic regression models, respectively, and in the code the scaler is not alpha but a variable called weights. I just submitted a change that does not call the scaling parameter alpha, but simply s. This might be better. What do you think?

random variable, and :math:`0<s<1` is the scaling factor. By repeating this
procedure across different random subsamples and Bernoulli trials, one can
count the fraction of times the randomized procedure selected each feature,
and use these fractions as scores for feature selection.
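
As a rough illustration of the procedure above, here is a minimal NumPy/scikit-learn
sketch. It is not the actual implementation in `sklearn/linear_model/randomized_l1.py`;
the helper name, the fixed `alpha` and the synthetic defaults are assumptions made
for the example::

    import numpy as np
    from sklearn.linear_model import Lasso

    def stability_scores(X, y, alpha=0.05, s=0.5, sample_fraction=0.75,
                         n_resampling=200, random_state=0):
        # Fraction of resamplings in which each feature gets a non-zero coefficient.
        rng = np.random.RandomState(random_state)
        n_samples, n_features = X.shape
        n_I = int(sample_fraction * n_samples)
        counts = np.zeros(n_features)
        for _ in range(n_resampling):
            I = rng.choice(n_samples, n_I, replace=False)        # random subsample
            s_j = np.where(rng.rand(n_features) < 0.5, s, 1.0)   # fair Bernoulli: s or 1
            # Scaling the penalty on w_j by 1/s_j is equivalent to fitting a plain
            # Lasso on columns multiplied by s_j (substitute w_j = s_j * v_j);
            # the non-zero pattern of the coefficients is unchanged.
            lasso = Lasso(alpha=alpha).fit(X[I] * s_j, y[I])
            counts += lasso.coef_ != 0
        return counts / n_resampling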

:class:`RandomizedLasso` implements this strategy for regression
settings, using the Lasso, while :class:`RandomizedLogisticRegression` uses the
logistic regression and is suitable for classification tasks. To get a full
path of stability scores you can use :func:`lasso_stability_path`.
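
For instance, a hedged usage sketch on synthetic data (the estimator and function
signatures follow the docstrings in this PR, but the dataset and the way the scores
are thresholded are purely illustrative)::

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import RandomizedLasso, lasso_stability_path

    X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                           noise=1.0, random_state=0)

    rlasso = RandomizedLasso(alpha='aic', scaling=0.5, sample_fraction=0.75,
                             n_resampling=200, selection_threshold=0.25,
                             random_state=0)
    rlasso.fit(X, y)
    selected = np.where(rlasso.scores_ >= rlasso.selection_threshold)[0]
    print(selected)  # indices of features with stable support

    # Full path of stability scores over a grid of penalties.
    alpha_grid, scores_path = lasso_stability_path(X, y, random_state=0)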

.. figure:: ../auto_examples/linear_model/images/plot_sparse_recovery_003.png
@@ -263,12 +282,12 @@ of features non zero.

.. topic:: References:

* N. Meinshausen, P. Buhlmann, "Stability selection",
Journal of the Royal Statistical Society, 72 (2010)
http://arxiv.org/pdf/0809.2932
.. [B2009] F. Bach, "Model-Consistent Sparse Estimation through the
Bootstrap." http://hal.inria.fr/hal-00354771/

* F. Bach, "Model-Consistent Sparse Estimation through the Bootstrap"
http://hal.inria.fr/hal-00354771/
.. [M2010] N. Meinshausen, P. Buhlmann, "Stability selection",
Journal of the Royal Statistical Society, 72 (2010)
http://arxiv.org/pdf/0809.2932

Tree-based feature selection
----------------------------
@@ -324,4 +343,4 @@ Then, a :class:`sklearn.ensemble.RandomForestClassifier` is trained on the
transformed output, i.e. using only relevant features. You can perform
similar operations with the other feature selection methods and also
classifiers that provide a way to evaluate feature importances of course.
See the :class:`sklearn.pipeline.Pipeline` examples for more details.
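
For example, a hedged sketch of that pattern (the particular selector, classifier and
synthetic data below are illustrative, not prescribed by the documentation)::

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                               random_state=0)
    clf = Pipeline([
        # L1-based selection keeps only features with non-zero coefficients ...
        ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
        # ... and the classifier is then trained on the reduced feature set.
        ('classification', RandomForestClassifier())
    ])
    clf.fit(X, y)
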
44 changes: 27 additions & 17 deletions sklearn/linear_model/randomized_l1.py
@@ -187,9 +187,13 @@ def _randomized_lasso(X, y, weights, mask, alpha=1., verbose=False,
class RandomizedLasso(BaseRandomizedLinearModel):
"""Randomized Lasso.

Randomized Lasso works by resampling the train data and computing
a Lasso on each resampling. In short, the features selected more
often are good features. It is also known as stability selection.
Randomized Lasso works by subsampling the training data and
computing a Lasso estimate where the penalty of a random subset of
coefficients has been scaled. By performing this double
randomization several times, the method assigns high scores to
features that are repeatedly selected across randomizations. This
is known as stability selection. In short, features selected more
often are considered good features.

Read more in the :ref:`User Guide <randomized_l1>`.

@@ -201,8 +205,9 @@ class RandomizedLasso(BaseRandomizedLinearModel):
article which is scaling.

scaling : float, optional
The alpha parameter in the stability selection article used to
randomly scale the features. Should be between 0 and 1.
The s parameter used to randomly scale the penalty of different
features (see :ref:`User Guide <randomized_l1>` for details).
Should be between 0 and 1.

sample_fraction : float, optional
The fraction of samples to be used in each randomized design.
@@ -226,11 +231,11 @@ class RandomizedLasso(BaseRandomizedLinearModel):
If True, the regressors X will be normalized before regression.
This parameter is ignored when `fit_intercept` is set to False.
When the regressors are normalized, note that this makes the
hyperparameters learnt more robust and almost independent of the number
of samples. The same property is not valid for standardized data.
However, if you wish to standardize, please use
`preprocessing.StandardScaler` before calling `f 8000 it` on an estimator
with `normalize=False`.
hyperparameters learned more robust and almost independent of
the number of samples. The same property is not valid for
standardized data. However, if you wish to standardize, please
use `preprocessing.StandardScaler` before calling `fit` on an
estimator with `normalize=False`.

precompute : True | False | 'auto'
Whether to use a precomputed Gram matrix to speed up
@@ -307,7 +312,7 @@ class RandomizedLasso(BaseRandomizedLinearModel):

See also
--------
RandomizedLogisticRegression, LogisticRegression
RandomizedLogisticRegression, Lasso, ElasticNet

Contributor:
I don't think it would hurt to mention LogisticRegression here. Yes, you could argue that LogisticRegression serves a different purpose than a feature selection method, but mentioning it might help users distinguish the use cases of these methods.

Author:
My rationale is based on the data type that needs to be modeled, so methods for categorical and metric responses are kind of far apart. This separation is maintained in the "See also" sections of the lasso and logistic regression estimators in sklearn, as they do not reference each other. I thought it would be good to keep this separation. What do you think?

Contributor:
@agramfort, do you have any opinions?

Member:
agreed. +1 for the change.

"""
def __init__(self, alpha='aic', scaling=.5, sample_fraction=.75,
n_resampling=200, selection_threshold=.25,
@@ -378,9 +383,13 @@ def _randomized_logistic(X, y, weights, mask, C=1., verbose=False,
class RandomizedLogisticRegression(BaseRandomizedLinearModel):
"""Randomized Logistic Regression

Randomized Regression works by resampling the train data and computing
a LogisticRegression on each resampling. In short, the features selected
more often are good features. It is also known as stability selection.
Randomized Logistic Regression works by subsampling the training
data and fitting an L1-penalized LogisticRegression model where the
penalty of a random subset of coefficients has been scaled. By
performing this double randomization several times, the method
assigns high scores to features that are repeatedly selected across
randomizations. This is known as stability selection. In short,
features selected more often are considered good features.

Read more in the :ref:`User Guide <randomized_l1>`.

@@ -390,8 +399,9 @@ class RandomizedLogisticRegression(BaseRandomizedLinearModel):
The regularization parameter C in the LogisticRegression.

scaling : float, optional, default=0.5
The alpha parameter in the stability selection article used to
randomly scale the features. Should be between 0 and 1.
The s parameter used to randomly scale the penalty of different
features (see :ref:`User Guide <randomized_l1>` for details).
Should be between 0 and 1.

sample_fraction : float, optional, default=0.75
The fraction of samples to be used in each randomized design.
@@ -484,7 +494,7 @@ class RandomizedLogisticRegression(BaseRandomizedLinearModel):

See also
--------
RandomizedLasso, Lasso, ElasticNet
RandomizedLasso, LogisticRegression
"""
def __init__(self, C=1, scaling=.5, sample_fraction=.75,
n_resampling=200,
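
As with RandomizedLasso above, a hedged usage sketch for the classification case (the
toy data and the thresholding are illustrative; the constructor arguments mirror the
defaults shown in this diff)::

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import RandomizedLogisticRegression

    X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                               random_state=0)

    rlogistic = RandomizedLogisticRegression(C=1, scaling=0.5, sample_fraction=0.75,
                                             n_resampling=200, random_state=0)
    rlogistic.fit(X, y)
    selected = np.where(rlogistic.scores_ >= rlogistic.selection_threshold)[0]
    print(selected)  # features repeatedly selected across randomizations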