[MRG] ENH: Support centering in LogisticRegression by kernc · Pull Request #1 · kernc/scikit-learn · GitHub
[MRG] ENH: Support centering in LogisticRegression #1

Open · wants to merge 33 commits into base: master

33 commits
53713c9
Fix for issue #6352
tracer0tong Feb 17, 2016
65a2b8f
Fixed codestyle
tracer0tong Feb 18, 2016
eb242c2
ENH: FeatureHasher now accepts string values.
devashishd12 Jan 15, 2016
876f123
Do not ignore files starting with _ in nose
lesteve Feb 29, 2016
2e7d9ad
FIX: improve docs of randomized lasso
Mar 7, 2016
bf81451
Fix consistency in docs and docstring
Mar 7, 2016
a754e09
Added ref to Bach and improved docs
Mar 7, 2016
f81e5aa
Try to fix link to pdf
Mar 7, 2016
3a83071
fix x and y order
Mar 7, 2016
4eca0c9
updated info for cross_val_score
ohld Mar 14, 2016
c2eaf75
Merge pull request #6173 from dsquareindia/featurehasher_fix
MechCoder Mar 19, 2016
b64e992
Merge pull request #6542 from ohld/make_scorer-link
glouppe Mar 19, 2016
e2e6bde
Merge pull request #6498 from clamus/rand-lasso-fix-6493
glouppe Mar 19, 2016
9691824
Merge pull request #6466 from lesteve/nose-ignore-files-tweak
glouppe Mar 19, 2016
7580746
Fix broken link in ABOUT
bryandeng Mar 20, 2016
bd6b313
Merge pull request #6565 from bryandeng/doc-link
agramfort Mar 20, 2016
549474d
[gardening] Fix NameError ("estimator" not defined). Remove unused va…
practicalswift Mar 20, 2016
e228581
Merge pull request #6566 from practicalswift/fix-nameerror-and-remove…
jnothman Mar 21, 2016
e9492b7
LabelBinarizer single label case now works for sparse and dense case
devashishd12 Jan 24, 2016
528533d
MAINT: Simplify n_features_to_select in RFECV
MechCoder Mar 21, 2016
945cb7e
Merge pull request #6221 from dsquareindia/LabelBinarizer_fix
MechCoder Mar 21, 2016
54af09e
Fixing typos in logistic regression docs
hlin117 Mar 21, 2016
146f461
Merge pull request #6575 from hlin117/logregdocs
agramfort Mar 22, 2016
07a6433
Fix typo in html target
Mar 22, 2016
b3c2219
Add the possibility to add prior to Gaussian Naive Bayes
Jan 18, 2016
5a046c7
Update whatsnew
MechCoder Mar 22, 2016
5d92bd5
Merge pull request #6579 from nlathia/issue-6541
TomDLT Mar 23, 2016
65b570b
Update scorer.py
lizsz Mar 19, 2016
56d625f
Merge pull request #6569 from MechCoder/minor
TomDLT Mar 23, 2016
22d7cd5
Make dump_svmlight_file support sparse y
yenchenlin Feb 18, 2016
eed5fc5
Merge pull request #6395 from yenchenlin1994/make-dump_svmlight_file-…
TomDLT Mar 23, 2016
afc058f
Merge pull request #6376 from tracer0tong/issue_6352
TomDLT Mar 24, 2016
612cd9e
ENH: Support data centering in LogisticRegression
kernc Mar 17, 2016
2 changes: 1 addition & 1 deletion doc/about.rst
@@ -63,7 +63,7 @@ High quality PNG and SVG logos are available in the `doc/logos/ <https://github.
Funding
-------

`INRIA <http://inria.fr>`_ actively supports this project. It has
`INRIA <http://www.inria.fr>`_ actively supports this project. It has
provided funding for Fabian Pedregosa (2010-2012), Jaques Grobler
(2012-2013) and Olivier Grisel (2013-2015) to work on this project
full-time. It also hosts coding sprints and other events.
47 changes: 33 additions & 14 deletions doc/modules/feature_selection.rst
@@ -173,8 +173,8 @@ L1-based feature selection
sparse solutions: many of their estimated coefficients are zero. When the goal
is to reduce the dimensionality of the data to use with another classifier,
they can be used along with :class:`feature_selection.SelectFromModel`
to select the non-zero coefficients. In particular, sparse estimators useful for
this purpose are the :class:`linear_model.Lasso` for regression, and
to select the non-zero coefficients. In particular, sparse estimators useful
for this purpose are the :class:`linear_model.Lasso` for regression, and
of :class:`linear_model.LogisticRegression` and :class:`svm.LinearSVC`
for classification::

@@ -234,15 +234,34 @@ Randomized sparse models

.. currentmodule:: sklearn.linear_model

The limitation of L1-based sparse models is that faced with a group of
very correlated features, they will select only one. To mitigate this
problem, it is possible to use randomization techniques, reestimating the
sparse model many times perturbing the design matrix or sub-sampling data
and counting how many times a given regressor is selected.
In terms of feature selection, there are some well-known limitations of
L1-penalized models for regression and classification. For example, it is
known that the Lasso will tend to select an individual variable out of a group
of highly correlated features. Furthermore, even when the correlation between
features is not too high, the conditions under which L1-penalized methods
consistently select "good" features can be restrictive in general.

To mitigate this problem, it is possible to use randomization techniques such
as those presented in [B2009]_ and [M2010]_. The latter technique, known as
stability selection, is implemented in the module :mod:`sklearn.linear_model`.
In the stability selection method, a subsample of the data is fit to an
L1-penalized model where the penalty of a random subset of coefficients has
been scaled. Specifically, given a subsample of the data
:math:`(x_i, y_i), i \in I`, where :math:`I \subset \{1, 2, \ldots, n\}` is a
random subset of the data of size :math:`n_I`, the following modified Lasso
fit is obtained:

.. math:: \hat{w_I} = \mathrm{arg}\min_{w} \frac{1}{2n_I} \sum_{i \in I} (y_i - x_i^T w)^2 + \alpha \sum_{j=1}^p \frac{ \vert w_j \vert}{s_j},

where :math:`s_j \in \{s, 1\}` are independent trials of a fair Bernoulli
random variable, and :math:`0<s<1` is the scaling factor. By repeating this
procedure across different random subsamples and Bernoulli trials, one can
count the fraction of times the randomized procedure selected each feature,
and use these fractions as scores for feature selection.

:class:`RandomizedLasso` implements this strategy for regression
settings, using the Lasso, while :class:`RandomizedLogisticRegression` uses the
logistic regression and is suitable for classification tasks. To get a full
path of stability scores you can use :func:`lasso_stability_path`.
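
A minimal sketch of the stability-selection workflow described above, assuming
the :class:`RandomizedLasso` estimator of this scikit-learn vintage; the
``alpha``, ``scaling`` and ``selection_threshold`` values are illustrative
only::

    import numpy as np
    from sklearn.linear_model import RandomizedLasso

    rng = np.random.RandomState(0)
    X = rng.randn(100, 10)
    # only the first three features carry signal
    y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.1 * rng.randn(100)

    # scaling plays the role of the Bernoulli factor s above;
    # selection_threshold keeps features selected in >= 25% of resamplings
    rlasso = RandomizedLasso(alpha=0.025, scaling=0.5, n_resampling=200,
                             selection_threshold=0.25, random_state=0)
    rlasso.fit(X, y)
    print(rlasso.scores_)        # selection frequency per feature
    print(rlasso.get_support())  # boolean mask of selected features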

.. figure:: ../auto_examples/linear_model/images/plot_sparse_recovery_003.png
@@ -263,12 +282,12 @@ of features non zero.

.. topic:: References:

* N. Meinshausen, P. Buhlmann, "Stability selection",
Journal of the Royal Statistical Society, 72 (2010)
http://arxiv.org/pdf/0809.2932
.. [B2009] F. Bach, "Model-Consistent Sparse Estimation through the
Bootstrap." http://hal.inria.fr/hal-00354771/

* F. Bach, "Model-Consistent Sparse Estimation through the Bootstrap"
http://hal.inria.fr/hal-00354771/
.. [M2010] N. Meinshausen, P. Buhlmann, "Stability selection",
Journal of the Royal Statistical Society, 72 (2010)
http://arxiv.org/pdf/0809.2932

Tree-based feature selection
----------------------------
@@ -324,4 +343,4 @@ Then, a :class:`sklearn.ensemble.RandomForestClassifier` is trained on the
transformed output, i.e. using only relevant features. You can perform
similar operations with the other feature selection methods and also
classifiers that provide a way to evaluate feature importances of course.
See the :class:`sklearn.pipeline.Pipeline` examples for more details.
2 changes: 1 addition & 1 deletion doc/modules/outlier_detection.rst
@@ -76,7 +76,7 @@ but regular, observation outside the frontier.
:class:`svm.OneClassSVM` object.

.. figure:: ../auto_examples/svm/images/plot_oneclass_001.png
:target: ../auto_examples/svm/plot_oneclasse.html
:target: ../auto_examples/svm/plot_oneclass.html
:align: center
:scale: 75%

11 changes: 11 additions & 0 deletions doc/whats_new.rst
@@ -50,6 +50,10 @@ New features
Enhancements
............

- :class:`feature_extraction.FeatureHasher` now accepts string values.
(`#6173 <https://github.com/scikit-learn/scikit-learn/pull/6173>`_) By `Ryad Zenine`_
and `Devashish Deshpande`_.
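
A small sketch of the new behaviour, assuming the semantics of #6173 where a
string value ``v`` for key ``k`` is hashed as the feature ``k=v`` with an
implicit value of 1::

    from sklearn.feature_extraction import FeatureHasher

    h = FeatureHasher(n_features=8)
    X = h.transform([{'city': 'Dubai', 'temperature': 33},
                     {'city': 'London', 'temperature': 12}])
    print(X.toarray())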

- The cross-validation iterators are replaced by cross-validation splitters
available from :mod:`model_selection`. These expose a ``split`` method
that takes in the data and yields a generator for the different splits.
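
For example, a splitter from :mod:`model_selection` is constructed once and its
``split`` method is then called on the data (released 0.18 API assumed here)::

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(20).reshape(10, 2)
    kf = KFold()                      # a splitter object, not an iterator
    for train_idx, test_idx in kf.split(X):
        print(train_idx, test_idx)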
@@ -117,6 +121,9 @@ Enhancements
- Added ``inverse_transform`` function to :class:`decomposition.nmf` to compute
data matrix of original shape. By `Anish Shah`_.

- :class:`naive_bayes.GaussianNB` now accepts data-independent class-priors
through the parameter ``priors``. By `Guillaume Lemaitre`_.
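
A brief sketch; the priors below are arbitrary, must sum to 1, and are given in
the order of the sorted class labels::

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    X = np.array([[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = GaussianNB(priors=[0.7, 0.3]).fit(X, y)
    print(clf.class_prior_)           # reflects the supplied priors, not the data
    print(clf.predict([[-0.8, -1]]))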

Bug fixes
.........

@@ -4121,3 +4128,7 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
.. _Jonathan Arfa: https://github.com/jarfa

.. _Anish Shah: https://github.com/AnishShah

.. _Ryad Zenine: https://github.com/ryadzenine

.. _Guillaume Lemaitre: https://github.com/glemaitre
1 change: 1 addition & 0 deletions setup.cfg
@@ -19,6 +19,7 @@ with-doctest = 1
doctest-tests = 1
doctest-extension = rst
doctest-fixtures = _fixture
ignore-files=^setup\.py$
#doctest-options = +ELLIPSIS,+NORMALIZE_WHITESPACE

[wheelhouse_uploader]
6 changes: 6 additions & 0 deletions sklearn/cross_validation.py
@@ -1438,6 +1438,12 @@ def cross_val_score(estimator, X, y=None, scoring=None, cv=None, n_jobs=1,
-------
scores : array of float, shape=(len(list(cv)),)
Array of scores of the estimator for each run of the cross validation.

See Also
---------
:func:`sklearn.metrics.make_scorer`:
Make a scorer from a performance metric or loss function.

"""
X, y = indexable(X, y)

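
For context, a usage sketch pairing :func:`sklearn.metrics.make_scorer` with
``cross_val_score`` from the then-current ``sklearn.cross_validation`` module;
the estimator and metric choices here are arbitrary::

    from sklearn.cross_validation import cross_val_score
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import make_scorer, f1_score

    iris = load_iris()
    scorer = make_scorer(f1_score, average='macro')
    scores = cross_val_score(LogisticRegression(), iris.data, iris.target,
                             scoring=scorer, cv=5)
    print(scores.mean())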
46 changes: 33 additions & 13 deletions
@@ -276,7 +276,8 @@ def load_svmlight_files(files, n_features=None, dtype=np.float64,


def _dump_svmlight(X, y, f, multilabel, one_based, comment, query_id):
is_sp = int(hasattr(X, "tocsr"))
X_is_sp = int(hasattr(X, "tocsr"))
y_is_sp = int(hasattr(y, "tocsr"))
if X.dtype.kind == 'i':
value_pattern = u("%d:%d")
else:
@@ -302,7 +303,7 @@ def _dump_svmlight(X, y, f, multilabel, one_based, comment, query_id):
f.writelines(b("# %s\n" % line) for line in comment.splitlines())

for i in range(X.shape[0]):
if is_sp:
if X_is_sp:
span = slice(X.indptr[i], X.indptr[i + 1])
row = zip(X.indices[span], X.data[span])
else:
Expand All @@ -312,10 +313,16 @@ def _dump_svmlight(X, y, f, multilabel, one_based, comment, query_id):
s = " ".join(value_pattern % (j + one_based, x) for j, x in row)

if multilabel:
nz_labels = np.where(y[i] != 0)[0]
if y_is_sp:
nz_labels = y[i].nonzero()[1]
else:
nz_labels = np.where(y[i] != 0)[0]
labels_str = ",".join(label_pattern % j for j in nz_labels)
else:
labels_str = label_pattern % y[i]
if y_is_sp:
labels_str = label_pattern % y.data[i]
else:
labels_str = label_pattern % y[i]

if query_id is not None:
feat = (labels_str, query_id[i], s)
@@ -341,9 +348,10 @@ def dump_svmlight_file(X, y, f, zero_based=True, comment=None, query_id=None,
Training vectors, where n_samples is the number of samples and
n_features is the number of features.

y : array-like, shape = [n_samples] or [n_samples, n_labels]
Target values. Class labels must be an integer or float, or array-like
objects of integer or float for multilabel classifications.
y : {array-like, sparse matrix}, shape = [n_samples (, n_labels)]
Target values. Class labels must be an
integer or float, or array-like objects of integer or float for
multilabel classifications.

f : string or file-like in binary mode
If string, specifies the path that will contain the data.
@@ -385,19 +393,31 @@ def dump_svmlight_file(X, y, f, zero_based=True, comment=None, query_id=None,
if six.b("\0") in comment:
raise ValueError("comment string contains NUL byte")

y = np.asarray(y)
if y.ndim != 1 and not multilabel:
raise ValueError("expected y of shape (n_samples,), got %r"
% (y.shape,))
yval = check_array(y, accept_sparse='csr', ensure_2d=False)
if sp.issparse(yval):
if yval.shape[1] != 1 and not multilabel:
raise ValueError("expected y of shape (n_samples, 1),"
" got %r" % (yval.shape,))
else:
if yval.ndim != 1 and not multilabel:
raise ValueError("expected y of shape (n_samples,), got %r"
% (yval.shape,))

Xval = check_array(X, accept_sparse='csr')
if Xval.shape[0] != y.shape[0]:
if Xval.shape[0] != yval.shape[0]:
raise ValueError("X.shape[0] and y.shape[0] should be the same, got"
" %r and %r instead." % (Xval.shape[0], y.shape[0]))
" %r and %r instead." % (Xval.shape[0], yval.shape[0]))

# We had some issues with CSR matrices with unsorted indices (e.g. #1501),
# so sort them here, but first make sure we don't modify the user's X.
# TODO We can do this cheaper; sorted_indices copies the whole matrix.
if yval is y and hasattr(yval, "sorted_indices"):
y = yval.sorted_indices()
else:
y = yval
if hasattr(y, "sort_indices"):
y.sort_indices()

if Xval is X and hasattr(Xval, "sorted_indices"):
X = Xval.sorted_indices()
else:
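
A quick sketch of what this change enables, assuming the patched
``dump_svmlight_file`` accepts a sparse CSR ``y`` (here a multilabel target) as
exercised by the tests below::

    import numpy as np
    import scipy.sparse as sp
    from io import BytesIO
    from sklearn.datasets import dump_svmlight_file

    X = sp.csr_matrix(np.array([[1., 0., 2.], [0., 3., 0.]]))
    y = sp.csr_matrix(np.array([[0, 1, 0], [1, 0, 1]]))  # sparse multilabel targets

    f = BytesIO()
    dump_svmlight_file(X, y, f, multilabel=True)
    print(f.getvalue().decode('ascii'))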
118 changes: 68 additions & 50 deletions sklearn/datasets/tests/test_svmlight_format.py
@@ -2,6 +2,7 @@
import gzip
from io import BytesIO
import numpy as np
import scipy.sparse as sp
import os
import shutil
from tempfile import NamedTemporaryFile
@@ -200,67 +201,84 @@ def test_invalid_filename():


def test_dump():
Xs, y = load_svmlight_file(datafile)
Xd = Xs.toarray()
X_sparse, y_dense = load_svmlight_file(datafile)
X_dense = X_sparse.toarray()
y_sparse = sp.csr_matrix(y_dense)

# slicing a csr_matrix can unsort its .indices, so test that we sort
# those correctly
Xsliced = Xs[np.arange(Xs.shape[0])]

for X in (Xs, Xd, Xsliced):
for zero_based in (True, False):
for dtype in [np.float32, np.float64, np.int32]:
f = BytesIO()
# we need to pass a comment to get the version info in;
# LibSVM doesn't grok comments so they're not put in by
# default anymore.
dump_svmlight_file(X.astype(dtype), y, f, comment="test",
zero_based=zero_based)
f.seek(0)

comment = f.readline()
try:
comment = str(comment, "utf-8")
except TypeError: # fails in Python 2.x
pass

assert_in("scikit-learn %s" % sklearn.__version__, comment)

comment = f.readline()
try:
comment = str(comment, "utf-8")
except TypeError: # fails in Python 2.x
pass

assert_in(["one", "zero"][zero_based] + "-based", comment)

                X2, y2 = load_svmlight_file(f, dtype=dtype,
                                            zero_based=zero_based)
                assert_equal(X2.dtype, dtype)
                assert_array_equal(X2.sorted_indices().indices, X2.indices)
                if dtype == np.float32:
                    assert_array_almost_equal(
                        # allow a rounding error at the last decimal place
                        Xd.astype(dtype), X2.toarray(), 4)
                else:
                    assert_array_almost_equal(
                        # allow a rounding error at the last decimal place
                        Xd.astype(dtype), X2.toarray(), 15)
                assert_array_equal(y, y2)
X_sliced = X_sparse[np.arange(X_sparse.shape[0])]
y_sliced = y_sparse[np.arange(y_sparse.shape[0])]

for X in (X_sparse, X_dense, X_sliced):
for y in (y_sparse, y_dense, y_sliced):
for zero_based in (True, False):
for dtype in [np.float32, np.float64, np.int32]:
f = BytesIO()
# we need to pass a comment to get the version info in;
# LibSVM doesn't grok comments so they're not put in by
# default anymore.

if (sp.issparse(y) and y.shape[0] == 1):
# make sure y's shape is: (n_samples, n_labels)
# when it is sparse
y = y.T

dump_svmlight_file(X.astype(dtype), y, f, comment="test",
zero_based=zero_based)
f.seek(0)

comment = f.readline()
try:
comment = str(comment, "utf-8")
except TypeError: # fails in Python 2.x
pass

assert_in("scikit-learn %s" % sklearn.__version__, comment)

comment = f.readline()
try:
comment = str(comment, "utf-8")
except TypeError: # fails in Python 2.x
pass

assert_in(["one", "zero"][zero_based] + "-based", comment)

X2, y2 = load_svmlight_file(f, dtype=dtype,
zero_based=zero_based)
assert_equal(X2.dtype, dtype)
assert_array_equal(X2.sorted_indices().indices, X2.indices)

X2_dense = X2.toarray()

                    if dtype == np.float32:
                        # allow a rounding error at the last decimal place
                        assert_array_almost_equal(
                            X_dense.astype(dtype), X2_dense, 4)
                        assert_array_almost_equal(
                            y_dense.astype(dtype), y2, 4)
                    else:
                        # allow a rounding error at the last decimal place
assert_array_almost_equal(
X_dense.astype(dtype), X2_dense, 15)
assert_array_almost_equal(
y_dense.astype(dtype), y2, 15)


def test_dump_multilabel():
X = [[1, 0, 3, 0, 5],
[0, 0, 0, 0, 0],
[0, 5, 0, 1, 0]]
y = [[0, 1, 0], [1, 0, 1], [1, 1, 0]]
f = BytesIO()
dump_svmlight_file(X, y, f, multilabel=True)
f.seek(0)
# make sure it dumps multilabel correctly
assert_equal(f.readline(), b("1 0:1 2:3 4:5\n"))
assert_equal(f.readline(), b("0,2 \n"))
assert_equal(f.readline(), b("0,1 1:5 3:1\n"))
y_dense = [[0, 1, 0], [1, 0, 1], [1, 1, 0]]
y_sparse = sp.csr_matrix(y_dense)
for y in [y_dense, y_sparse]:
f = BytesIO()
dump_svmlight_file(X, y, f, multilabel=True)
f.seek(0)
# make sure it dumps multilabel correctly
assert_equal(f.readline(), b("1 0:1 2:3 4:5\n"))
assert_equal(f.readline(), b("0,2 \n"))
assert_equal(f.readline(), b("0,1 1:5 3:1\n"))


def test_dump_concise():