Merge branch 'master' into circle-noplot · scikit-learn/scikit-learn@2d001a6 · GitHub

Commit 2d001a6

Merge branch 'master' into circle-noplot

2 parents 7096f3d + c171561

15 files changed: 101 additions, 66 deletions

    .gitattributes

    Lines changed: 1 addition & 30 deletions
@@ -1,30 +1 @@
-/sklearn/__check_build/_check_build.c -diff
-/sklearn/_isotonic.c -diff
-/sklearn/cluster/_dbscan_inner.cpp -diff
-/sklearn/cluster/_hierarchical.cpp -diff
-/sklearn/cluster/_k_means.c -diff
-/sklearn/cluster/_k_means_elkan.c -diff
-/sklearn/datasets/_svmlight_format.c -diff
-/sklearn/decomposition/_online_lda.c -diff
-/sklearn/decomposition/cdnmf_fast.c -diff
-/sklearn/ensemble/_gradient_boosting.c -diff
-/sklearn/feature_extraction/_hashing.c -diff
-/sklearn/linear_model/cd_fast.c -diff
-/sklearn/linear_model/sgd_fast.c -diff
-/sklearn/linear_model/sag_fast.c -diff
-/sklearn/metrics/pairwise_fast.c -diff
-/sklearn/neighbors/ball_tree.c -diff
-/sklearn/neighbors/kd_tree.c -diff
-/sklearn/svm/liblinear.c -diff
-/sklearn/svm/libsvm.c -diff
-/sklearn/svm/libsvm_sparse.c -diff
-/sklearn/tree/_tree.c -diff
-/sklearn/tree/_utils.c -diff
-/sklearn/utils/arrayfuncs.c -diff
-/sklearn/utils/graph_shortest_path.c -diff
-/sklearn/utils/lgamma.c -diff
-/sklearn/utils/_logistic_sigmoid.c -diff
-/sklearn/utils/murmurhash.c -diff
-/sklearn/utils/seq_dataset.c -diff
-/sklearn/utils/sparsefuncs_fast.c -diff
-/sklearn/utils/weight_vector.c -diff
+/doc/whats_new.rst merge=union

    build_tools/circle/push_doc.sh

    Lines changed: 3 additions & 1 deletion
@@ -24,9 +24,11 @@ MSG="Pushing the docs to $dir/ for branch: $CIRCLE_BRANCH, commit $CIRCLE_SHA1"
 
 cd $HOME
 if [ ! -d $DOC_REPO ];
-then git clone "git@github.com:scikit-learn/"$DOC_REPO".git";
+then git clone --depth 1 --no-checkout "git@github.com:scikit-learn/"$DOC_REPO".git";
 fi
 cd $DOC_REPO
+git config core.sparseCheckout true
+echo $dir > .git/info/sparse-checkout
 git checkout $CIRCLE_BRANCH
 git reset --hard origin/$CIRCLE_BRANCH
 git rm -rf $dir/ && rm -rf $dir/

    circle.yml

    Lines changed: 2 additions & 2 deletions
@@ -9,9 +9,9 @@ dependencies:
     - ./build_tools/circle/build_doc.sh:
         timeout: 3600 # seconds
 test:
-  # Grep error on the documentation
   override:
-    - cat ~/log.txt && if grep -q "Traceback (most recent call last):" ~/log.txt; then false; else true; fi
+    # override is needed otherwise nosetests is run by default
+    - echo "Documentation has been built in the 'dependencies' step. No additional test to run"
 deployment:
   push:
     branch: /^master$|^[0-9]+\.[0-9]+\.X$/

    doc/modules/clustering.rst

    Lines changed: 10 additions & 9 deletions
@@ -746,17 +746,18 @@ by black points below.
 
 .. topic:: Implementation
 
-    The algorithm is non-deterministic, but the core samples will
-    always belong to the same clusters (although the labels may be
-    different). The non-determinism comes from deciding to which cluster a
-    non-core sample belongs. A non-core sample can have a distance lower
-    than ``eps`` to two core samples in different clusters. By the
+    The DBSCAN algorithm is deterministic, always generating the same clusters
+    when given the same data in the same order. However, the results can differ when
+    data is provided in a different order. First, even though the core samples
+    will always be assigned to the same clusters, the labels of those clusters
+    will depend on the order in which those samples are encountered in the data.
+    Second and more importantly, the clusters to which non-core samples are assigned
+    can differ depending on the data order. This would happen when a non-core sample
+    has a distance lower than ``eps`` to two core samples in different clusters. By the
     triangular inequality, those two core samples must be more distant than
     ``eps`` from each other, or they would be in the same cluster. The non-core
-    sample is assigned to whichever cluster is generated first, where
-    the order is determined randomly. Other than the ordering of
-    the dataset, the algorithm is deterministic, making the results relatively
-    stable between runs on the same data.
+    sample is assigned to whichever cluster is generated first in a pass
+    through the data, and so the results will depend on the data ordering.
 
     The current implementation uses ball trees and kd-trees
     to determine the neighborhood of points,
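The order-dependence described by the new text can be demonstrated directly. The following sketch is not part of this commit; the data values, ``eps`` and ``min_samples`` are illustrative assumptions, chosen so that one borderline point sits within ``eps`` of core samples from two different clusters.

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two tight 1-D groups with a single borderline point (0.85) in between.
    # Every group member is a core sample; 0.85 is non-core but within eps
    # of one core sample from each group.
    X = np.array([0.0, 0.1, 0.2, 0.3, 0.4,    # group A
                  0.85,                        # borderline, non-core
                  1.3, 1.4, 1.5, 1.6, 1.7]     # group B
                 ).reshape(-1, 1)

    forward = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    backward = DBSCAN(eps=0.5, min_samples=5).fit_predict(X[::-1])[::-1]

    # The same two clusters are found either way, but the borderline point
    # is grouped with whichever cluster happens to be expanded first.
    print(forward[5] == forward[4])    # grouped with A on the forward pass
    print(backward[5] == backward[4])  # grouped with B on the reversed pass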

    doc/modules/grid_search.rst

    Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ distribution. After describing these tools we detail
 
 Note that it is common that a small subset of those parameters can have a large
 impact on the predictive or computation performance of the model while others
-can be left to their default values. It is recommend to read the docstring of
+can be left to their default values. It is recommended to read the docstring of
 the estimator class to get a finer understanding of their expected behavior,
 possibly by reading the enclosed reference to the literature.
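As a hedged illustration of that advice (an assumed SVC example, not taken from the guide), only the influential parameters are searched while everything else keeps its default value:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    # C and gamma dominate SVC behavior; all other parameters stay at
    # their defaults, keeping the grid small.
    param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
    search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
    print(search.best_params_)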

    doc/modules/model_evaluation.rst

    Lines changed: 12 additions & 0 deletions
@@ -1133,6 +1133,12 @@ are predicted. This is useful if you want to know how many top-scored-labels
 you have to predict in average without missing any true one. The best value
 of this metrics is thus the average number of true labels.
 
+.. note::
+
+    Our implementation's score is 1 greater than the one given in Tsoumakas
+    et al., 2010. This extends it to handle the degenerate case in which an
+    instance has 0 true labels.
+
 Formally, given a binary indicator matrix of the ground truth labels
 :math:`y \in \left\{0, 1\right\}^{n_\text{samples} \times n_\text{labels}}` and the
 score associated with each label
@@ -1236,6 +1242,12 @@ Here is a small example of usage of this function::
     >>> label_ranking_loss(y_true, y_score)
     0.0
 
+
+.. topic:: References:
+
+    * Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010). Mining multi-label data. In
+      Data mining and knowledge discovery handbook (pp. 667-685). Springer US.
+
 .. _regression_metrics:
 
 Regression metrics
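A small check of the convention the new note documents, on an assumed toy example: with this definition, the best attainable ``coverage_error`` equals the average number of true labels per sample.

    import numpy as np
    from sklearn.metrics import coverage_error

    y_true = np.array([[1, 0, 0],
                       [0, 1, 1]])
    y_score = np.array([[0.9, 0.1, 0.2],   # the true label is ranked first
                        [0.1, 0.8, 0.7]])  # both true labels are in the top 2
    # (1 + 2) / 2 = 1.5, the average number of true labels per sample.
    print(coverage_error(y_true, y_score))  # 1.5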

    doc/modules/model_persistence.rst

    Lines changed: 4 additions & 0 deletions
@@ -81,6 +81,10 @@ additional metadata should be saved along the pickled model:
 This should make it possible to check that the cross-validation score is in the
 same range as before.
 
+Since a model internal representation may be different on two different
+architectures, dumping a model on one architecture and loading it on
+another architecture is not supported.
+
 If you want to know more about these issues and explore other possible
 serialization methods, please refer to this
 `talk by Alex Gaynor <http://pyvideo.org/video/2566/pickles-are-for-delis-not-software>`_.
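A minimal sketch of the workflow the surrounding section recommends, assuming a plain pickle-based store; the bundle keys are illustrative, not a sklearn API:

    import pickle
    import sklearn
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression().fit(X, y)

    bundle = {
        "model": model,
        "sklearn_version": sklearn.__version__,           # reload under the same version
        "training_data": "sklearn.datasets.load_iris",    # reference to the data
        "cv_score": cross_val_score(model, X, y).mean(),  # compare after reloading
    }
    with open("model.pkl", "wb") as f:
        pickle.dump(bundle, f)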

    doc/whats_new.rst

    Lines changed: 5 additions & 0 deletions
@@ -85,6 +85,11 @@ Enhancements
   do not set attributes on the estimator.
   :issue:`7533` by :user:`Ekaterina Krivich <kiote>`.
 
+- For sparse matrices, :func:`preprocessing.normalize` with ``return_norm=True``
+  will now raise a ``NotImplementedError`` with 'l1' or 'l2' norm and with norm 'max'
+  the norms returned will be the same as for dense matrices (:issue:`7771`).
+  By `Ang Lu <https://github.com/luang008>`_.
+
 Bug fixes
 .........

    examples/classification/plot_lda_qda.py

    Lines changed: 6 additions & 2 deletions
@@ -1,9 +1,13 @@
 """
 ====================================================================
-Linear and Quadratic Discriminant Analysis with confidence ellipsoid
+Linear and Quadratic Discriminant Analysis with covariance ellipsoid
 ====================================================================
 
-Plot the confidence ellipsoids of each class and decision boundary
+This example plots the covariance ellipsoids of each class and
+decision boundary learned by LDA and QDA. The ellipsoids display
+the double standard deviation for each class. With LDA, the
+standard deviation is the same for all the classes, while each
+class has its own standard deviation with QDA.
 """
 print(__doc__)
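The distinction the new description draws can be condensed as follows, in a hedged sketch on assumed random data rather than the example script itself: LDA estimates a single covariance shared by all classes, while QDA estimates one per class.

    import numpy as np
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [3, 3]])
    y = np.array([0] * 50 + [1] * 50)

    lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
    qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)
    print(lda.covariance_.shape)  # one shared (2, 2) covariance matrix
    print(len(qda.covariance_))   # a list with one covariance per class (2)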

    examples/datasets/plot_iris_dataset.py

    Lines changed: 3 additions & 3 deletions
@@ -31,7 +31,7 @@
 # import some data to play with
 iris = datasets.load_iris()
 X = iris.data[:, :2]  # we only take the first two features.
-Y = iris.target
+y = iris.target
 
 x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
 y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
@@ -40,7 +40,7 @@
 plt.clf()
 
 # Plot the training points
-plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
+plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
 plt.xlabel('Sepal length')
 plt.ylabel('Sepal width')
 
@@ -54,7 +54,7 @@
 fig = plt.figure(1, figsize=(8, 6))
 ax = Axes3D(fig, elev=-150, azim=110)
 X_reduced = PCA(n_components=3).fit_transform(iris.data)
-ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y,
+ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y,
            cmap=plt.cm.Paired)
 ax.set_title("First three PCA directions")
 ax.set_xlabel("1st eigenvector")

    sklearn/metrics/pairwise.py

    Lines changed: 4 additions & 4 deletions
@@ -752,7 +752,7 @@ def polynomial_kernel(X, Y=None, degree=3, gamma=None, coef0=1):
     degree : int, default 3
 
     gamma : float, default None
-        if None, defaults to 1.0 / n_samples_1
+        if None, defaults to 1.0 / n_features
 
     coef0 : int, default 1
@@ -786,7 +786,7 @@ def sigmoid_kernel(X, Y=None, gamma=None, coef0=1):
     Y : ndarray of shape (n_samples_2, n_features)
 
     gamma : float, default None
-        If None, defaults to 1.0 / n_samples_1
+        If None, defaults to 1.0 / n_features
 
     coef0 : int, default 1
@@ -822,7 +822,7 @@ def rbf_kernel(X, Y=None, gamma=None):
     Y : array of shape (n_samples_Y, n_features)
 
     gamma : float, default None
-        If None, defaults to 1.0 / n_samples_X
+        If None, defaults to 1.0 / n_features
 
     Returns
     -------
@@ -857,7 +857,7 @@ def laplacian_kernel(X, Y=None, gamma=None):
     Y : array of shape (n_samples_Y, n_features)
 
     gamma : float, default None
-        If None, defaults to 1.0 / n_samples_X
+        If None, defaults to 1.0 / n_features
 
     Returns
     -------
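The corrected default is easy to verify; a quick check with ``rbf_kernel`` on assumed random data:

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    X = np.random.RandomState(0).rand(5, 3)  # n_features = 3
    # gamma=None falls back to 1.0 / n_features, as the docstrings now say.
    assert np.allclose(rbf_kernel(X), rbf_kernel(X, gamma=1.0 / 3))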

    sklearn/metrics/ranking.py

    Lines changed: 4 additions & 0 deletions
@@ -633,6 +633,10 @@ def coverage_error(y_true, y_score, sample_weight=None):
     Ties in ``y_scores`` are broken by giving maximal rank that would have
     been assigned to all tied values.
 
+    Note: Our implementation's score is 1 greater than the one given in
+    Tsoumakas et al., 2010. This extends it to handle the degenerate case
+    in which an instance has 0 true labels.
+
     Read more in the :ref:`User Guide <coverage_error>`.
 
     Parameters

    sklearn/neighbors/binary_tree.pxi

    Lines changed: 11 additions & 11 deletions
@@ -297,9 +297,9 @@ Query for k-nearest neighbors
     >>> X = np.random.random((10, 3))  # 10 points in 3 dimensions
     >>> tree = {BinaryTree}(X, leaf_size=2)    # doctest: +SKIP
     >>> dist, ind = tree.query([X[0]], k=3)    # doctest: +SKIP
-    >>> print ind  # indices of 3 closest neighbors
+    >>> print(ind)  # indices of 3 closest neighbors
     [0 3 1]
-    >>> print dist  # distances to 3 closest neighbors
+    >>> print(dist)  # distances to 3 closest neighbors
     [ 0.          0.19662693  0.29473397]
 
 Pickle and Unpickle a tree. Note that the state of the tree is saved in the
@@ -313,9 +313,9 @@ pickle operation: the tree needs not be rebuilt upon unpickling.
     >>> s = pickle.dumps(tree)                  # doctest: +SKIP
     >>> tree_copy = pickle.loads(s)             # doctest: +SKIP
     >>> dist, ind = tree_copy.query(X[0], k=3)  # doctest: +SKIP
-    >>> print ind  # indices of 3 closest neighbors
+    >>> print(ind)  # indices of 3 closest neighbors
     [0 3 1]
-    >>> print dist  # distances to 3 closest neighbors
+    >>> print(dist)  # distances to 3 closest neighbors
     [ 0.          0.19662693  0.29473397]
 
 Query for neighbors within a given radius
@@ -324,10 +324,10 @@ Query for neighbors within a given radius
     >>> np.random.seed(0)
     >>> X = np.random.random((10, 3))  # 10 points in 3 dimensions
     >>> tree = {BinaryTree}(X, leaf_size=2)  # doctest: +SKIP
-    >>> print tree.query_radius(X[0], r=0.3, count_only=True)
+    >>> print(tree.query_radius(X[0], r=0.3, count_only=True))
     3
     >>> ind = tree.query_radius(X[0], r=0.3)  # doctest: +SKIP
-    >>> print ind  # indices of neighbors within distance 0.3
+    >>> print(ind)  # indices of neighbors within distance 0.3
     [3 0 1]
 
@@ -623,7 +623,7 @@ cdef class NeighborsHeap:
         dist_arr[0] = val
         ind_arr[0] = i_val
 
-        #descend the heap, swapping values until the max heap criterion is met
+        # descend the heap, swapping values until the max heap criterion is met
         i = 0
         while True:
             ic1 = 2 * i + 1
@@ -1282,9 +1282,9 @@ cdef class BinaryTree:
     >>> X = np.random.random((10, 3))  # 10 points in 3 dimensions
     >>> tree = BinaryTree(X, leaf_size=2)  # doctest: +SKIP
     >>> dist, ind = tree.query(X[0], k=3)  # doctest: +SKIP
-    >>> print ind  # indices of 3 closest neighbors
+    >>> print(ind)  # indices of 3 closest neighbors
     [0 3 1]
-    >>> print dist  # distances to 3 closest neighbors
+    >>> print(dist)  # distances to 3 closest neighbors
     [ 0.          0.19662693  0.29473397]
     """
     # XXX: we should allow X to be a pre-built tree.
@@ -1415,10 +1415,10 @@ cdef class BinaryTree:
     >>> np.random.seed(0)
     >>> X = np.random.random((10, 3))  # 10 points in 3 dimensions
     >>> tree = BinaryTree(X, leaf_size=2)  # doctest: +SKIP
-    >>> print tree.query_radius(X[0], r=0.3, count_only=True)
+    >>> print(tree.query_radius(X[0], r=0.3, count_only=True))
     3
     >>> ind = tree.query_radius(X[0], r=0.3)  # doctest: +SKIP
-    >>> print ind  # indices of neighbors within distance 0.3
+    >>> print(ind)  # indices of neighbors within distance 0.3
     [3 0 1]
     """
     if count_only and return_distance:
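Since the doctests above are skipped ({BinaryTree} is a template placeholder), here is a small runnable counterpart using the concrete KDTree class generated from this template:

    import numpy as np
    from sklearn.neighbors import KDTree

    np.random.seed(0)
    X = np.random.random((10, 3))        # 10 points in 3 dimensions
    tree = KDTree(X, leaf_size=2)
    dist, ind = tree.query([X[0]], k=3)  # query takes a 2-D array of points
    print(ind)   # indices of the 3 closest neighbors, e.g. [[0 3 1]]
    print(dist)  # distances to the 3 closest neighbors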

    sklearn/preprocessing/data.py

    Lines changed: 17 additions & 3 deletions
@@ -1325,6 +1325,16 @@ def normalize(X, norm='l2', axis=1, copy=True, return_norm=False):
     return_norm : boolean, default False
         whether to return the computed norms
 
+    Returns
+    -------
+    X : {array-like, sparse matrix}, shape [n_samples, n_features]
+        Normalized input X.
+
+    norms : array, shape [n_samples] if axis=1 else [n_features]
+        An array of norms along given axis for X.
+        When X is sparse, a NotImplementedError will be raised
+        for norm 'l1' or 'l2'.
+
     See also
     --------
     Normalizer: Performs normalization using the ``Transformer`` API
@@ -1346,15 +1356,19 @@ def normalize(X, norm='l2', axis=1, copy=True, return_norm=False):
         X = X.T
 
     if sparse.issparse(X):
+        if return_norm and norm in ('l1', 'l2'):
+            raise NotImplementedError("return_norm=True is not implemented "
+                                      "for sparse matrices with norm 'l1' "
+                                      "or norm 'l2'")
         if norm == 'l1':
             inplace_csr_row_normalize_l1(X)
         elif norm == 'l2':
             inplace_csr_row_normalize_l2(X)
         elif norm == 'max':
             _, norms = min_max_axis(X, 1)
-            norms = norms.repeat(np.diff(X.indptr))
-            mask = norms != 0
-            X.data[mask] /= norms[mask]
+            norms_elementwise = norms.repeat(np.diff(X.indptr))
+            mask = norms_elementwise != 0
+            X.data[mask] /= norms_elementwise[mask]
     else:
         if norm == 'l1':
             norms = np.abs(X).sum(axis=1)
    sklearn/preprocessing/tests/test_data.py

    Lines changed: 18 additions & 0 deletions
@@ -1315,6 +1315,24 @@ def test_normalize():
 
     assert_array_almost_equal(row_sums, ones)
 
+    # Test return_norm
+    X_dense = np.array([[3.0, 0, 4.0], [1.0, 0.0, 0.0], [2.0, 3.0, 0.0]])
+    for norm in ('l1', 'l2', 'max'):
+        _, norms = normalize(X_dense, norm=norm, return_norm=True)
+        if norm == 'l1':
+            assert_array_almost_equal(norms, np.array([7.0, 1.0, 5.0]))
+        elif norm == 'l2':
+            assert_array_almost_equal(norms, np.array([5.0, 1.0, 3.60555127]))
+        else:
+            assert_array_almost_equal(norms, np.array([4.0, 1.0, 3.0]))
+
+    X_sparse = sparse.csr_matrix(X_dense)
+    for norm in ('l1', 'l2'):
+        assert_raises(NotImplementedError, normalize, X_sparse,
+                      norm=norm, return_norm=True)
+    _, norms = normalize(X_sparse, norm='max', return_norm=True)
+    assert_array_almost_equal(norms, np.array([4.0, 1.0, 3.0]))
+
 
 def test_binarizer():
     X_ = np.array([[1, 0, 5], [2, 3, -1]])
