[MRG] ENH: Support centering in LogisticRegression by kernc · Pull Request #1 · kernc/scikit-learn · GitHub
[MRG] ENH: Support centering in LogisticRegression #1

Open · wants to merge 33 commits into base: master

33 commits
53713c9
Fix for issue #6352
tracer0tong Feb 17, 2016
65a2b8f
Fixed codestyle
tracer0tong Feb 18, 2016
eb242c2
ENH: FeatureHasher now accepts string values.
devashishd12 Jan 15, 2016
876f123
Do not ignore files starting with _ in nose
lesteve Feb 29, 2016
2e7d9ad
FIX: improve docs of randomized lasso
Mar 7, 2016
bf81451
Fix consistency in docs and docstring
Mar 7, 2016
a754e09
Added ref to Bach and improved docs
Mar 7, 2016
f81e5aa
Try to fix link to pdf
Mar 7, 2016
3a83071
fix x and y order
Mar 7, 2016
4eca0c9
updated info for cross_val_score
ohld Mar 14, 2016
c2eaf75
Merge pull request #6173 from dsquareindia/featurehasher_fix
MechCoder Mar 19, 2016
b64e992
Merge pull request #6542 from ohld/make_scorer-link
glouppe Mar 19, 2016
e2e6bde
Merge pull request #6498 from clamus/rand-lasso-fix-6493
glouppe Mar 19, 2016
9691824
Merge pull request #6466 from lesteve/nose-ignore-files-tweak
glouppe Mar 19, 2016
7580746
Fix broken link in ABOUT
bryandeng Mar 20, 2016
bd6b313
Merge pull request #6565 from bryandeng/doc-link
agramfort Mar 20, 2016
549474d
[gardening] Fix NameError ("estimator" not defined). Remove unused va…
practicalswift Mar 20, 2016
e228581
Merge pull request #6566 from practicalswift/fix-nameerror-and-remove…
jnothman Mar 21, 2016
e9492b7
LabelBinarizer single label case now works for sparse and dense case
devashishd12 Jan 24, 2016
528533d
MAINT: Simplify n_features_to_select in RFECV
MechCoder Mar 21, 2016
945cb7e
Merge pull request #6221 from dsquareindia/LabelBinarizer_fix
MechCoder Mar 21, 2016
54af09e
Fixing typos in logistic regression docs
hlin117 Mar 21, 2016
146f461
Merge pull request #6575 from hlin117/logregdocs
agramfort Mar 22, 2016
07a6433
Fix typo in html target
Mar 22, 2016
b3c2219
Add the possibility to add prior to Gaussian Naive Bayes
Jan 18, 2016
5a046c7
Update whatsnew
MechCoder Mar 22, 2016
5d92bd5
Merge pull request #6579 from nlathia/issue-6541
TomDLT Mar 23, 2016
65b570b
Update scorer.py
lizsz Mar 19, 2016
56d625f
Merge pull request #6569 from MechCoder/minor
TomDLT Mar 23, 2016
22d7cd5
Make dump_svmlight_file support sparse y
yenchenlin Feb 18, 2016
eed5fc5
Merge pull request #6395 from yenchenlin1994/make-dump_svmlight_file-…
TomDLT Mar 23, 2016
afc058f
Merge pull request #6376 from tracer0tong/issue_6352
TomDLT Mar 24, 2016
612cd9e
ENH: Support data centering in LogisticRegression
kernc Mar 17, 2016
2 changes: 1 addition & 1 deletion doc/about.rst
@@ -63,7 +63,7 @@ High quality PNG and SVG logos are available in the `doc/logos/ <https://github.
Funding
-------

`INRIA <http://inria.fr>`_ actively supports this project. It has
`INRIA <http://www.inria.fr>`_ actively supports this project. It has
provided funding for Fabian Pedregosa (2010-2012), Jaques Grobler
(2012-2013) and Olivier Grisel (2013-2015) to work on this project
full-time. It also hosts coding sprints and other events.
47 changes: 33 additions & 14 deletions doc/modules/feature_selection.rst
@@ -173,8 +173,8 @@ L1-based feature selection
sparse solutions: many of their estimated coefficients are zero. When the goal
is to reduce the dimensionality of the data to use with another classifier,
they can be used along with :class:`feature_selection.SelectFromModel`
to select the non-zero coefficients. In particular, sparse estimators useful for
this purpose are the :class:`linear_model.Lasso` for regression, and
to select the non-zero coefficients. In particular, sparse estimators useful
for this purpose are the :class:`linear_model.Lasso` for regression, and
of :class:`linear_model.LogisticRegression` and :class:`svm.LinearSVC`
for classification::

@@ -234,15 +234,34 @@ Randomized sparse models

.. currentmodule:: sklearn.linear_model

The limitation of L1-based sparse models is that faced with a group of
very correlated features, they will select only one. To mitigate this
problem, it is possible to use randomization techniques, reestimating the
sparse model many times perturbing the design matrix or sub-sampling data
and counting how many times a given regressor is selected.
In terms of feature selection, there are some well-known limitations of
L1-penalized models for regression and classification. For example, it is
known that the Lasso will tend to select an individual variable out of a group
of highly correlated features. Furthermore, even when the correlation between
features is not too high, the conditions under which L1-penalized methods
consistently select "good" features can be restrictive in general.

To mitigate this problem, it is possible to use randomization techniques such
as those presented in [B2009]_ and [M2010]_. The latter technique, known as
stability selection, is implemented in the module :mod:`sklearn.linear_model`.
In the stability selection method, a subsample of the data is fit to an
L1-penalized model where the penalty of a random subset of coefficients has
been scaled. Specifically, given a subsample of the data
:math:`(x_i, y_i), i \in I`, where :math:`I \subset \{1, 2, \ldots, n\}` is a
random subset of the data of size :math:`n_I`, the following modified Lasso
fit is obtained:

.. math:: \hat{w_I} = \mathrm{arg}\min_{w} \frac{1}{2n_I} \sum_{i \in I} (y_i - x_i^T w)^2 + \alpha \sum_{j=1}^p \frac{ \vert w_j \vert}{s_j},

where :math:`s_j \in \{s, 1\}` are independent trials of a fair Bernoulli
random variable, and :math:`0<s<1` is the scaling factor. By repeating this
procedure across different random subsamples and Bernoulli trials, one can
count the fraction of times the randomized procedure selected each feature,
and use these fractions as scores for feature selection.

:class:`RandomizedLasso` implements this strategy for regression
settings, using the Lasso, while :class:`RandomizedLogisticRegression` uses the
logistic regression and is suitable for classification tasks. To get a full
path of stability scores you can use :func:`lasso_stability_path`.
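
A minimal sketch of the stability-selection workflow described above, assuming
the :class:`RandomizedLasso` estimator of this scikit-learn vintage; the
``alpha``, ``scaling`` and ``selection_threshold`` values are illustrative
only::

    import numpy as np
    from sklearn.linear_model import RandomizedLasso

    rng = np.random.RandomState(0)
    X = rng.randn(100, 10)
    # only the first three features carry signal
    y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.1 * rng.randn(100)

    # scaling plays the role of the Bernoulli factor s above;
    # selection_threshold keeps features selected in >= 25% of resamplings
    rlasso = RandomizedLasso(alpha=0.025, scaling=0.5, n_resampling=200,
                             selection_threshold=0.25, random_state=0)
    rlasso.fit(X, y)
    print(rlasso.scores_)        # selection frequency per feature
    print(rlasso.get_support())  # boolean mask of selected features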

.. figure:: ../auto_examples/linear_model/images/plot_sparse_recovery_003.png
@@ -263,12 +282,12 @@ of features non zero.

.. topic:: References:

* N. Meinshausen, P. Buhlmann, "Stability selection",
Journal of the Royal Statistical Society, 72 (2010)
http://arxiv.org/pdf/0809.2932
.. [B2009] F. Bach, "Model-Consistent Sparse Estimation through the
Bootstrap." http://hal.inria.fr/hal-00354771/

* F. Bach, "Model-Consistent Sparse Estimation through the Bootstrap"
http://hal.inria.fr/hal-00354771/
.. [M2010] N. Meinshausen, P. Buhlmann, "Stability selection",
Journal of the Royal Statistical Society, 72 (2010)
http://arxiv.org/pdf/0809.2932

Tree-based feature selection
----------------------------
@@ -324,4 +343,4 @@ Then, a :class:`sklearn.ensemble.RandomForestClassifier` is trained on the
transformed output, i.e. using only relevant features. You can perform
similar operations with the other feature selection methods and also
classifiers that provide a way to evaluate feature importances of course.
See the :class:`sklearn.pipeline.Pipeline` examples for more details.
2 changes: 1 addition & 1 deletion doc/modules/outlier_detection.rst
@@ -76,7 +76,7 @@ but regular, observation outside the frontier.
:class:`svm.OneClassSVM` object.

.. figure:: ../auto_examples/svm/images/plot_oneclass_001.png
:target: ../auto_examples/svm/plot_oneclasse.html
:target: ../auto_examples/svm/plot_oneclass.html
:align: center
:scale: 75%

11 changes: 11 additions & 0 deletions doc/whats_new.rst
@@ -50,6 +50,10 @@ New features
Enhancements
............

- :class:`feature_extraction.FeatureHasher` now accepts string values.
(`#6173 <https://github.com/scikit-learn/scikit-learn/pull/6173>`_) By `Ryad Zenine`_
and `Devashish Deshpande`_.
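
A small sketch of the new behaviour, assuming the semantics of #6173 where a
string value ``v`` for key ``k`` is hashed as the feature ``k=v`` with an
implicit value of 1::

    from sklearn.feature_extraction import FeatureHasher

    h = FeatureHasher(n_features=8)
    X = h.transform([{'city': 'Dubai', 'temperature': 33},
                     {'city': 'London', 'temperature': 12}])
    print(X.toarray())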

- The cross-validation iterators are replaced by cross-validation splitters
available from :mod:`model_selection`. These expose a ``split`` method
that takes in the data and yields a generator for the different splits.
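
For example, a splitter from :mod:`model_selection` is constructed once and its
``split`` method is then called on the data (released 0.18 API assumed here)::

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(20).reshape(10, 2)
    kf = KFold()                      # a splitter object, not an iterator
    for train_idx, test_idx in kf.split(X):
        print(train_idx, test_idx)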
@@ -117,6 +121,9 @@ Enhancements
- Added ``inverse_transform`` function to :class:`decomposition.nmf` to compute
data matrix of original shape. By `Anish Shah`_.

- :class:`naive_bayes.GaussianNB` now accepts data-independent class-priors
through the parameter ``priors``. By `Guillaume Lemaitre`_.
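
A brief sketch; the priors below are arbitrary, must sum to 1, and are given in
the order of the sorted class labels::

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    X = np.array([[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = GaussianNB(priors=[0.7, 0.3]).fit(X, y)
    print(clf.class_prior_)           # reflects the supplied priors, not the data
    print(clf.predict([[-0.8, -1]]))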

Bug fixes
.........

@@ -4121,3 +4128,7 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
.. _Jonathan Arfa: https://github.com/jarfa

.. _Anish Shah: https://github.com/AnishShah

.. _Ryad Zenine: https://github.com/ryadzenine

.. _Guillaume Lemaitre: https://github.com/glemaitre
1 change: 1 addition & 0 deletions setup.cfg
@@ -19,6 +19,7 @@ with-doctest = 1
doctest-tests = 1
doctest-extension = rst
doctest-fixtures = _fixture
ignore-files=^setup\.py$
#doctest-options = +ELLIPSIS,+NORMALIZE_WHITESPACE

[wheelhouse_uploader]
6 changes: 6 additions & 0 deletions sklearn/cross_validation.py
@@ -1438,6 +1438,12 @@ def cross_val_score(estimator, X, y=None, scoring=None, cv=None, n_jobs=1,
-------
scores : array of float, shape=(len(list(cv)),)
Array of scores of the estimator for each run of the cross validation.

See Also
---------
:func:`sklearn.metrics.make_scorer`:
Make a scorer from a performance metric or loss function.

"""
X, y = indexable(X, y)

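
For context, a usage sketch pairing :func:`sklearn.metrics.make_scorer` with
``cross_val_score`` from the then-current ``sklearn.cross_validation`` module;
the estimator and metric choices here are arbitrary::

    from sklearn.cross_validation import cross_val_score
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import make_scorer, f1_score

    iris = load_iris()
    scorer = make_scorer(f1_score, average='macro')
    scores = cross_val_score(LogisticRegression(), iris.data, iris.target,
                             scoring=scorer, cv=5)
    print(scores.mean())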
46 changes: 33 additions & 13 deletions
@@ -276,7 +276,8 @@ def load_svmlight_files(files, n_features=None, dtype=np.float64,


def _dump_svmlight(X, y, f, multilabel, one_based, comment, query_id):
is_sp = int(hasattr(X, "tocsr"))
X_is_sp = int(hasattr(X, "tocsr"))
y_is_sp = int(hasattr(y, "tocsr"))
if X.dtype.kind == 'i':
value_pattern = u("%d:%d")
else:
@@ -302,7 +303,7 @@ def _dump_svmlight(X, y, f, multilabel, one_based, comment, query_id):
f.writelines(b("# %s\n" % line) for line in comment.splitlines())

for i in range(X.shape[0]):
if is_sp:
if X_is_sp:
span = slice(X.indptr[i], X.indptr[i + 1])
row = zip(X.indices[span], X.data[span])
else:
Expand All @@ -312,10 +313,16 @@ def _dump_svmlight(X, y, f, multilabel, one_based, comment, query_id):
s = " ".join(value_pattern % (j + one_based, x) for j, x in row)

if multilabel:
nz_labels = np.where(y[i] != 0)[0]
if y_is_sp:
nz_labels = y[i].nonzero()[1]
else:
nz_labels = np.where(y[i] != 0)[0]
labels_str = ",".join(label_pattern % j for j in nz_labels)
else:
labels_str = label_pattern % y[i]
if y_is_sp:
labels_str = label_pattern % y.data[i]
else:
labels_str = label_pattern % y[i]

if query_id is not None:
feat = (labels_str, query_id[i], s)
@@ -341,9 +348,10 @@ def dump_svmlight_file(X, y, f, zero_based=True, comment=None, query_id=None,
Training vectors, where n_samples is the number of samples and
n_features is the number of features.

y : array-like, shape = [n_samples] or [n_samples, n_labels]
Target values. Class labels must be an integer or float, or array-like
objects of integer or float for multilabel classifications.
y : {array-like, sparse matrix}, shape = [n_samples (, n_labels)]
Target values. Class labels must be an
integer or float, or array-like objects of integer or float for
multilabel classifications.

f : string or file-like in binary mode
If string, specifies the path that will contain the data.
@@ -385,19 +393,31 @@ def dump_svmlight_file(X, y, f, zero_based=True, comment=None, query_id=None,
if six.b("\0") in comment:
raise ValueError("comment string contains NUL byte")

y = np.asarray(y)
if y.ndim != 1 and not multilabel:
raise ValueError("expected y of shape (n_samples,), got %r"
% (y.shape,))
yval = check_array(y, accept_sparse='csr', ensure_2d=False)
if sp.issparse(yval):
if yval.shape[1] != 1 and not multilabel:
raise ValueError("expected y of shape (n_samples, 1),"
" got %r" % (yval.shape,))
else:
if yval.ndim != 1 and not multilabel:
raise ValueError("expected y of shape (n_samples,), got %r"
% (yval.shape,))

Xval = check_array(X, accept_sparse='csr')
if Xval.shape[0] != y.shape[0]:
if Xval.shape[0] != yval.shape[0]:
raise ValueError("X.shape[0] and y.shape[0] should be the same, got"
" %r and %r instead." % (Xval.shape[0], y.shape[0]))
" %r and %r instead." % (Xval.shape[0], yval.shape[0]))

# We had some issues with CSR matrices with unsorted indices (e.g. #1501),
# so sort them here, but first make sure we don't modify the user's X.
# TODO We can do this cheaper; sorted_indices copies the whole matrix.
if yval is y and hasattr(yval, "sorted_indices"):
y = yval.sorted_indices()
else:
y = yval
if hasattr(y, "sort_indices"):
y.sort_indices()

if Xval is X and hasattr(Xval, "sorted_indices"):
X = Xval.sorted_indices()
else:
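
A quick sketch of what this change enables, assuming the patched
``dump_svmlight_file`` accepts a sparse CSR ``y`` (here a multilabel target) as
exercised by the tests below::

    import numpy as np
    import scipy.sparse as sp
    from io import BytesIO
    from sklearn.datasets import dump_svmlight_file

    X = sp.csr_matrix(np.array([[1., 0., 2.], [0., 3., 0.]]))
    y = sp.csr_matrix(np.array([[0, 1, 0], [1, 0, 1]]))  # sparse multilabel targets

    f = BytesIO()
    dump_svmlight_file(X, y, f, multilabel=True)
    print(f.getvalue().decode('ascii'))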
118 changes: 68 additions & 50 deletions sklearn/datasets/tests/test_svmlight_format.py
@@ -2,6 +2,7 @@
import gzip
from io import BytesIO
import numpy as np
import scipy.sparse as sp
import os
import shutil
from tempfile import NamedTemporaryFile
@@ -200,67 +201,84 @@ def test_invalid_filename():


def test_dump():
Xs, y = load_svmlight_file(datafile)
Xd = Xs.toarray()
X_sparse, y_dense = load_svmlight_file(datafile)
X_dense = X_sparse.toarray()
y_sparse = sp.csr_matrix(y_dense)

# slicing a csr_matrix can unsort its .indices, so test that we sort
# those correctly
Xsliced = Xs[np.arange(Xs.shape[0])]

for X in (Xs, Xd, Xsliced):
for zero_based in (True, False):
for dtype in [np.float32, np.float64, np.int32]:
f = BytesIO()
# we need to pass a comment to get the version info in;
# LibSVM doesn't grok comments so they're not put in by
# default anymore.
dump_svmlight_file(X.astype(dtype), y, f, comment="test",
zero_based=zero_based)
f.seek(0)

comment = f.readline()
try:
comment = str(comment, "utf-8")
except TypeError: # fails in Python 2.x
pass

assert_in("scikit-learn %s" % sklearn.__version__, comment)

comment = f.readline()
try:
comment = str(comment, "utf-8")
except TypeError: # fails in Python 2.x
pass

assert_in(["one", "zero"][zero_based] + "-based", comment)

                X2, y2 = load_svmlight_file(f, dtype=dtype,
                                            zero_based=zero_based)
                assert_equal(X2.dtype, dtype)
                assert_array_equal(X2.sorted_indices().indices, X2.indices)
                if dtype == np.float32:
                    assert_array_almost_equal(
                        # allow a rounding error at the last decimal place
                        Xd.astype(dtype), X2.toarray(), 4)
                else:
                    assert_array_almost_equal(
                        # allow a rounding error at the last decimal place
                        Xd.astype(dtype), X2.toarray(), 15)
                assert_array_equal(y, y2)
X_sliced = X_sparse[np.arange(X_sparse.shape[0])]
y_sliced = y_sparse[np.arange(y_sparse.shape[0])]

for X in (X_sparse, X_dense, X_sliced):
for y in (y_sparse, y_dense, y_sliced):
for zero_based in (True, False):
for dtype in [np.float32, np.float64, np.int32]:
f = BytesIO()
# we need to pass a comment to get the version info in;
# LibSVM doesn't grok comments so they're not put in by
# default anymore.

if (sp.issparse(y) and y.shape[0] == 1):
# make sure y's shape is: (n_samples, n_labels)
# when it is sparse
y = y.T

dump_svmlight_file(X.astype(dtype), y, f, comment="test",
zero_based=zero_based)
f.seek(0)

comment = f.readline()
try:
comment = str(comment, "utf-8")
except TypeError: # fails in Python 2.x
pass

assert_in("scikit-learn %s" % sklearn.__version__, comment)

comment = f.readline()
try:
comment = str(comment, "utf-8")
except TypeError: # fails in Python 2.x
pass

assert_in(["one", "zero"][zero_based] + "-based", comment)

X2, y2 = load_svmlight_file(f, dtype=dtype,
zero_based=zero_based)
assert_equal(X2.dtype, dtype)
assert_array_equal(X2.sorted_indices().indices, X2.indices)

X2_dense = X2.toarray()

                    if dtype == np.float32:
                        # allow a rounding error at the last decimal place
                        assert_array_almost_equal(
                            X_dense.astype(dtype), X2_dense, 4)
                        assert_array_almost_equal(
                            y_dense.astype(dtype), y2, 4)
                    else:
                        # allow a rounding error at the last decimal place
assert_array_almost_equal(
X_dense.astype(dtype), X2_dense, 15)
assert_array_almost_equal(
y_dense.astype(dtype), y2, 15)


def test_dump_multilabel():
X = [[1, 0, 3, 0, 5],
[0, 0, 0, 0, 0],
[0, 5, 0, 1, 0]]
y = [[0, 1, 0], [1, 0, 1], [1, 1, 0]]
f = BytesIO()
dump_svmlight_file(X, y, f, multilabel=True)
f.seek(0)
# make sure it dumps multilabel correctly
assert_equal(f.readline(), b("1 0:1 2:3 4:5\n"))
assert_equal(f.readline(), b("0,2 \n"))
assert_equal(f.readline(), b("0,1 1:5 3:1\n"))
y_dense = [[0, 1, 0], [1, 0, 1], [1, 1, 0]]
y_sparse = sp.csr_matrix(y_dense)
for y in [y_dense, y_sparse]:
f = BytesIO()
dump_svmlight_file(X, y, f, multilabel=True)
f.seek(0)
# make sure it dumps multilabel correctly
assert_equal(f.readline(), b("1 0:1 2:3 4:5\n"))
assert_equal(f.readline(), b("0,2 \n"))
assert_equal(f.readline(), b("0,1 1:5 3:1\n"))


def test_dump_concise():