Merge branch 'master' into circle-noplot · scikit-learn/scikit-learn@2d001a6 · GitHub

Commit 2d001a6

Merge branch 'master' into circle-noplot
2 parents 7096f3d + c171561 commit 2d001a6

File tree

15 files changed: +101 additions, −66 deletions


.gitattributes

Lines changed: 1 addition & 30 deletions
@@ -1,30 +1 @@
-/sklearn/__check_build/_check_build.c -diff
-/sklearn/_isotonic.c -diff
-/sklearn/cluster/_dbscan_inner.cpp -diff
-/sklearn/cluster/_hierarchical.cpp -diff
-/sklearn/cluster/_k_means.c -diff
-/sklearn/cluster/_k_means_elkan.c -diff
-/sklearn/datasets/_svmlight_format.c -diff
-/sklearn/decomposition/_online_lda.c -diff
-/sklearn/decomposition/cdnmf_fast.c -diff
-/sklearn/ensemble/_gradient_boosting.c -diff
-/sklearn/feature_extraction/_hashing.c -diff
-/sklearn/linear_model/cd_fast.c -diff
-/sklearn/linear_model/sgd_fast.c -diff
-/sklearn/linear_model/sag_fast.c -diff
-/sklearn/metrics/pairwise_fast.c -diff
-/sklearn/neighbors/ball_tree.c -diff
-/sklearn/neighbors/kd_tree.c -diff
-/sklearn/svm/liblinear.c -diff
-/sklearn/svm/libsvm.c -diff
-/sklearn/svm/libsvm_sparse.c -diff
-/sklearn/tree/_tree.c -diff
-/sklearn/tree/_utils.c -diff
-/sklearn/utils/arrayfuncs.c -diff
-/sklearn/utils/graph_shortest_path.c -diff
-/sklearn/utils/lgamma.c -diff
-/sklearn/utils/_logistic_sigmoid.c -diff
-/sklearn/utils/murmurhash.c -diff
-/sklearn/utils/seq_dataset.c -diff
-/sklearn/utils/sparsefuncs_fast.c -diff
-/sklearn/utils/weight_vector.c -diff
+/doc/whats_new.rst merge=union

build_tools/circle/push_doc.sh

Lines changed: 3 additions & 1 deletion
@@ -24,9 +24,11 @@ MSG="Pushing the docs to $dir/ for branch: $CIRCLE_BRANCH, commit $CIRCLE_SHA1"
 
 cd $HOME
 if [ ! -d $DOC_REPO ];
-then git clone "git@github.com:scikit-learn/"$DOC_REPO".git";
+then git clone --depth 1 --no-checkout "git@github.com:scikit-learn/"$DOC_REPO".git";
 fi
 cd $DOC_REPO
+git config core.sparseCheckout true
+echo $dir > .git/info/sparse-checkout
 git checkout $CIRCLE_BRANCH
 git reset --hard origin/$CIRCLE_BRANCH
 git rm -rf $dir/ && rm -rf $dir/

circle.yml

Lines changed: 2 additions & 2 deletions
@@ -9,9 +9,9 @@ dependencies:
     - ./build_tools/circle/build_doc.sh:
         timeout: 3600 # seconds
 test:
-  # Grep error on the documentation
   override:
-    - cat ~/log.txt && if grep -q "Traceback (most recent call last):" ~/log.txt; then false; else true; fi
+    # override is needed otherwise nosetests is run by default
+    - echo "Documentation has been built in the 'dependencies' step. No additional test to run"
 deployment:
   push:
     branch: /^master$|^[0-9]+\.[0-9]+\.X$/

doc/modules/clustering.rst

Lines changed: 10 additions & 9 deletions
@@ -746,17 +746,18 @@ by black points below.
 
 .. topic:: Implementation
 
-    The algorithm is non-deterministic, but the core samples will
-    always belong to the same clusters (although the labels may be
-    different). The non-determinism comes from deciding to which cluster a
-    non-core sample belongs. A non-core sample can have a distance lower
-    than ``eps`` to two core samples in different clusters. By the
+    The DBSCAN algorithm is deterministic, always generating the same clusters
+    when given the same data in the same order. However, the results can differ when
+    data is provided in a different order. First, even though the core samples
+    will always be assigned to the same clusters, the labels of those clusters
+    will depend on the order in which those samples are encountered in the data.
+    Second and more importantly, the clusters to which non-core samples are assigned
+    can differ depending on the data order. This would happen when a non-core sample
+    has a distance lower than ``eps`` to two core samples in different clusters. By the
     triangular inequality, those two core samples must be more distant than
    ``eps`` from each other, or they would be in the same cluster. The non-core
-    sample is assigned to whichever cluster is generated first, where
-    the order is determined randomly. Other than the ordering of
-    the dataset, the algorithm is deterministic, making the results relatively
-    stable between runs on the same data.
+    sample is assigned to whichever cluster is generated first in a pass
+    through the data, and so the results will depend on the data ordering.
 
     The current implementation uses ball trees and kd-trees
     to determine the neighborhood of points,
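
A small illustrative sketch of the behaviour described in the new paragraph (not part of this commit; the data values and ``eps``/``min_samples`` are chosen only so that one borderline sample is non-core and lies within ``eps`` of core samples from two different clusters):

import numpy as np
from sklearn.cluster import DBSCAN

# two dense groups plus one borderline point between them
X = np.array([[0.0], [0.2], [0.4], [0.6],
              [2.0], [2.2], [2.4], [2.6],
              [1.3]])
params = dict(eps=0.75, min_samples=4)

labels_forward = DBSCAN(**params).fit_predict(X)
labels_reversed = DBSCAN(**params).fit_predict(X[::-1])[::-1]  # re-aligned to X

# The core samples form the same two clusters either way, but the borderline
# point (last row) joins whichever cluster is expanded first, so it ends up
# with the 0.x group in one ordering and the 2.x group in the other.
print(labels_forward)
print(labels_reversed)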

doc/modules/grid_search.rst

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ distribution. After describing these tools we detail
 
 Note that it is common that a small subset of those parameters can have a large
 impact on the predictive or computation performance of the model while others
-can be left to their default values. It is recommend to read the docstring of
+can be left to their default values. It is recommended to read the docstring of
 the estimator class to get a finer understanding of their expected behavior,
 possibly by reading the enclosed reference to the literature.
 

doc/modules/model_evaluation.rst

Lines changed: 12 additions & 0 deletions
@@ -1133,6 +1133,12 @@ are predicted. This is useful if you want to know how many top-scored-labels
 you have to predict in average without missing any true one. The best value
 of this metrics is thus the average number of true labels.
 
+.. note::
+
+    Our implementation's score is 1 greater than the one given in Tsoumakas
+    et al., 2010. This extends it to handle the degenerate case in which an
+    instance has 0 true labels.
+
 Formally, given a binary indicator matrix of the ground truth labels
 :math:`y \in \left\{0, 1\right\}^{n_\text{samples} \times n_\text{labels}}` and the
 score associated with each label
@@ -1236,6 +1242,12 @@ Here is a small example of usage of this function::
     >>> label_ranking_loss(y_true, y_score)
     0.0
 
+
+.. topic:: References:
+
+    * Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010). Mining multi-label data. In
+      Data mining and knowledge discovery handbook (pp. 667-685). Springer US.
+
 .. _regression_metrics:
 
 Regression metrics

doc/modules/model_persistence.rst

Lines changed: 4 additions & 0 deletions
@@ -81,6 +81,10 @@ additional metadata should be saved along the pickled model:
 This should make it possible to check that the cross-validation score is in the
 same range as before.
 
+Since a model internal representation may be different on two different
+architectures, dumping a model on one architecture and loading it on
+another architecture is not supported.
+
 If you want to know more about these issues and explore other possible
 serialization methods, please refer to this
 `talk by Alex Gaynor <http://pyvideo.org/video/2566/pickles-are-for-delis-not-software>`_.
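
As a hedged illustration of the advice above about saving metadata along with the pickled model (the estimator and dictionary keys here are illustrative, not taken from the diff):

import pickle
import sys

import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression().fit(X, y)

# bundle the estimator with the metadata the section recommends recording
payload = {
    "model": model,
    "sklearn_version": sklearn.__version__,
    "python_version": sys.version,
    "training_data": "iris (bundled with scikit-learn)",
}
with open("model.pkl", "wb") as f:
    pickle.dump(payload, f)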

doc/whats_new.rst

Lines changed: 5 additions & 0 deletions
@@ -85,6 +85,11 @@ Enhancements
      do not set attributes on the estimator.
      :issue:`7533` by :user:`Ekaterina Krivich <kiote>`.
 
+   - For sparse matrices, :func:`preprocessing.normalize` with ``return_norm=True``
+     will now raise a ``NotImplementedError`` with 'l1' or 'l2' norm and with norm 'max'
+     the norms returned will be the same as for dense matrices (:issue:`7771`).
+     By `Ang Lu <https://github.com/luang008>`_.
+
 Bug fixes
 .........
 
examples/classification/plot_lda_qda.py

Lines changed: 6 additions & 2 deletions
@@ -1,9 +1,13 @@
 """
 ====================================================================
-Linear and Quadratic Discriminant Analysis with confidence ellipsoid
+Linear and Quadratic Discriminant Analysis with covariance ellipsoid
 ====================================================================
 
-Plot the confidence ellipsoids of each class and decision boundary
+This example plots the covariance ellipsoids of each class and
+decision boundary learned by LDA and QDA. The ellipsoids display
+the double standard deviation for each class. With LDA, the
+standard deviation is the same for all the classes, while each
+class has its own standard deviation with QDA.
 """
 print(__doc__)
 
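
The updated docstring describes covariance ellipsoids drawn at two standard deviations. A minimal sketch of how such an ellipse can be drawn for a fitted LDA estimator on synthetic data (this is not the example's actual plotting code):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
X = np.r_[rng.randn(50, 2), rng.randn(50, 2) + [3, 3]]
y = np.r_[np.zeros(50), np.ones(50)]
lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)

# ellipse axes come from the eigendecomposition of the shared covariance
vals, vecs = np.linalg.eigh(lda.covariance_)               # eigenvalues ascending
angle = np.degrees(np.arctan2(vecs[1, -1], vecs[0, -1]))   # largest eigenvector
width, height = 4.0 * np.sqrt(vals[::-1])                  # 2 std dev on each side

ax = plt.gca()
for mean in lda.means_:
    ax.add_patch(Ellipse(mean, width, height, angle=angle, alpha=0.3))
ax.scatter(X[:, 0], X[:, 1], c=y, s=10)
plt.show()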

examples/datasets/plot_iris_dataset.py

Lines changed: 3 additions & 3 deletions
@@ -31,7 +31,7 @@
 # import some data to play with
 iris = datasets.load_iris()
 X = iris.data[:, :2]  # we only take the first two features.
-Y = iris.target
+y = iris.target
 
 x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
 y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
@@ -40,7 +40,7 @@
 plt.clf()
 
 # Plot the training points
-plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
+plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
 plt.xlabel('Sepal length')
 plt.ylabel('Sepal width')
 
@@ -54,7 +54,7 @@
 fig = plt.figure(1, figsize=(8, 6))
 ax = Axes3D(fig, elev=-150, azim=110)
 X_reduced = PCA(n_components=3).fit_transform(iris.data)
-ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y,
+ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y,
            cmap=plt.cm.Paired)
 ax.set_title("First three PCA directions")
 ax.set_xlabel("1st eigenvector")

sklearn/metrics/pairwise.py

Lines changed: 4 additions & 4 deletions
@@ -752,7 +752,7 @@ def polynomial_kernel(X, Y=None, degree=3, gamma=None, coef0=1):
     degree : int, default 3
 
     gamma : float, default None
-        if None, defaults to 1.0 / n_samples_1
+        if None, defaults to 1.0 / n_features
 
     coef0 : int, default 1
 
@@ -786,7 +786,7 @@ def sigmoid_kernel(X, Y=None, gamma=None, coef0=1):
     Y : ndarray of shape (n_samples_2, n_features)
 
     gamma : float, default None
-        If None, defaults to 1.0 / n_samples_1
+        If None, defaults to 1.0 / n_features
 
     coef0 : int, default 1
 
@@ -822,7 +822,7 @@ def rbf_kernel(X, Y=None, gamma=None):
     Y : array of shape (n_samples_Y, n_features)
 
     gamma : float, default None
-        If None, defaults to 1.0 / n_samples_X
+        If None, defaults to 1.0 / n_features
 
     Returns
     -------
@@ -857,7 +857,7 @@ def laplacian_kernel(X, Y=None, gamma=None):
     Y : array of shape (n_samples_Y, n_features)
 
     gamma : float, default None
-        If None, defaults to 1.0 / n_samples_X
+        If None, defaults to 1.0 / n_features
 
     Returns
     -------
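
A quick check of the corrected default (assuming a current scikit-learn; nothing here comes from the diff itself): with gamma=None the kernels fall back to 1.0 / n_features, so passing that value explicitly yields the same kernel matrix.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.RandomState(0).rand(5, 3)             # 5 samples, 3 features
K_default = rbf_kernel(X)                            # gamma=None
K_explicit = rbf_kernel(X, gamma=1.0 / X.shape[1])   # 1.0 / n_features
assert np.allclose(K_default, K_explicit)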

sklearn/metrics/ranking.py

Lines changed: 4 additions & 0 deletions
@@ -633,6 +633,10 @@ def coverage_error(y_true, y_score, sample_weight=None):
     Ties in ``y_scores`` are broken by giving maximal rank that would have
     been assigned to all tied values.
 
+    Note: Our implementation's score is 1 greater than the one given in
+    Tsoumakas et al., 2010. This extends it to handle the degenerate case
+    in which an instance has 0 true labels.
+
     Read more in the :ref:`User Guide <coverage_error>`.
 
     Parameters
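
A small worked example of the convention stated in the note (using the public sklearn.metrics.coverage_error): when a sample's only true label is the top-scored one, its coverage counts as 1 rather than 0.

import numpy as np
from sklearn.metrics import coverage_error

y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])
y_score = np.array([[1.0, 0.5, 0.2],    # true label ranked 1st -> coverage 1
                    [1.0, 0.2, 0.1]])   # true label ranked 3rd -> coverage 3
print(coverage_error(y_true, y_score))  # (1 + 3) / 2 = 2.0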

sklearn/neighbors/binary_tree.pxi

Lines changed: 11 additions & 11 deletions
@@ -297,9 +297,9 @@ Query for k-nearest neighbors
 >>> X = np.random.random((10, 3))  # 10 points in 3 dimensions
 >>> tree = {BinaryTree}(X, leaf_size=2)       # doctest: +SKIP
 >>> dist, ind = tree.query([X[0]], k=3)       # doctest: +SKIP
->>> print ind  # indices of 3 closest neighbors
+>>> print(ind)  # indices of 3 closest neighbors
 [0 3 1]
->>> print dist  # distances to 3 closest neighbors
+>>> print(dist)  # distances to 3 closest neighbors
 [ 0. 0.19662693 0.29473397]
 
 Pickle and Unpickle a tree. Note that the state of the tree is saved in the
@@ -313,9 +313,9 @@ pickle operation: the tree needs not be rebuilt upon unpickling.
 >>> s = pickle.dumps(tree)                    # doctest: +SKIP
 >>> tree_copy = pickle.loads(s)               # doctest: +SKIP
 >>> dist, ind = tree_copy.query(X[0], k=3)    # doctest: +SKIP
->>> print ind  # indices of 3 closest neighbors
+>>> print(ind)  # indices of 3 closest neighbors
 [0 3 1]
->>> print dist  # distances to 3 closest neighbors
+>>> print(dist)  # distances to 3 closest neighbors
 [ 0. 0.19662693 0.29473397]
 
 Query for neighbors within a given radius
@@ -324,10 +324,10 @@ Query for neighbors within a given radius
 >>> np.random.seed(0)
 >>> X = np.random.random((10, 3))  # 10 points in 3 dimensions
 >>> tree = {BinaryTree}(X, leaf_size=2)       # doctest: +SKIP
->>> print tree.query_radius(X[0], r=0.3, count_only=True)
+>>> print(tree.query_radius(X[0], r=0.3, count_only=True))
 3
 >>> ind = tree.query_radius(X[0], r=0.3)      # doctest: +SKIP
->>> print ind  # indices of neighbors within distance 0.3
+>>> print(ind)  # indices of neighbors within distance 0.3
 [3 0 1]
 
 
@@ -623,7 +623,7 @@ cdef class NeighborsHeap:
         dist_arr[0] = val
         ind_arr[0] = i_val
 
-        #descend the heap, swapping values until the max heap criterion is met
+        # descend the heap, swapping values until the max heap criterion is met
         i = 0
         while True:
             ic1 = 2 * i + 1
@@ -1282,9 +1282,9 @@ cdef class BinaryTree:
         >>> X = np.random.random((10, 3))  # 10 points in 3 dimensions
         >>> tree = BinaryTree(X, leaf_size=2)     # doctest: +SKIP
         >>> dist, ind = tree.query(X[0], k=3)     # doctest: +SKIP
-        >>> print ind  # indices of 3 closest neighbors
+        >>> print(ind)  # indices of 3 closest neighbors
         [0 3 1]
-        >>> print dist  # distances to 3 closest neighbors
+        >>> print(dist)  # distances to 3 closest neighbors
         [ 0. 0.19662693 0.29473397]
         """
         # XXX: we should allow X to be a pre-built tree.
@@ -1415,10 +1415,10 @@ cdef class BinaryTree:
         >>> np.random.seed(0)
         >>> X = np.random.random((10, 3))  # 10 points in 3 dimensions
         >>> tree = BinaryTree(X, leaf_size=2)     # doctest: +SKIP
-        >>> print tree.query_radius(X[0], r=0.3, count_only=True)
+        >>> print(tree.query_radius(X[0], r=0.3, count_only=True))
         3
         >>> ind = tree.query_radius(X[0], r=0.3)  # doctest: +SKIP
-        >>> print ind  # indices of neighbors within distance 0.3
+        >>> print(ind)  # indices of neighbors within distance 0.3
         [3 0 1]
         """
         if count_only and return_distance:
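
For reference, a runnable Python 3 version of the doctest snippets updated above, written against the public KDTree class (the {BinaryTree} placeholder stands in for KDTree/BallTree):

import numpy as np
from sklearn.neighbors import KDTree

np.random.seed(0)
X = np.random.random((10, 3))               # 10 points in 3 dimensions
tree = KDTree(X, leaf_size=2)

dist, ind = tree.query(X[:1], k=3)          # query with a 2-D array
print(ind)                                  # indices of the 3 closest neighbors
print(dist)                                 # distances to the 3 closest neighbors
print(tree.query_radius(X[:1], r=0.3, count_only=True))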

sklearn/preprocessing/data.py

Lines changed: 17 additions & 3 deletions
@@ -1325,6 +1325,16 @@ def normalize(X, norm='l2', axis=1, copy=True, return_norm=False):
     return_norm : boolean, default False
         whether to return the computed norms
 
+    Returns
+    -------
+    X : {array-like, sparse matrix}, shape [n_samples, n_features]
+        Normalized input X.
+
+    norms : array, shape [n_samples] if axis=1 else [n_features]
+        An array of norms along given axis for X.
+        When X is sparse, a NotImplementedError will be raised
+        for norm 'l1' or 'l2'.
+
     See also
     --------
     Normalizer: Performs normalization using the ``Transformer`` API
@@ -1346,15 +1356,19 @@ def normalize(X, norm='l2', axis=1, copy=True, return_norm=False):
         X = X.T
 
     if sparse.issparse(X):
+        if return_norm and norm in ('l1', 'l2'):
+            raise NotImplementedError("return_norm=True is not implemented "
+                                      "for sparse matrices with norm 'l1' "
+                                      "or norm 'l2'")
         if norm == 'l1':
             inplace_csr_row_normalize_l1(X)
         elif norm == 'l2':
             inplace_csr_row_normalize_l2(X)
         elif norm == 'max':
             _, norms = min_max_axis(X, 1)
-            norms = norms.repeat(np.diff(X.indptr))
-            mask = norms != 0
-            X.data[mask] /= norms[mask]
+            norms_elementwise = norms.repeat(np.diff(X.indptr))
+            mask = norms_elementwise != 0
+            X.data[mask] /= norms_elementwise[mask]
     else:
         if norm == 'l1':
             norms = np.abs(X).sum(axis=1)
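
A short usage sketch of the behaviour added above (not code from the commit): return_norm works for dense input with any norm, but raises NotImplementedError for sparse input with 'l1' or 'l2'.

import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

X = np.array([[3.0, 0.0, 4.0]])
X_normed, norms = normalize(X, norm='l2', return_norm=True)
print(norms)                                # l2 norm of the row: 5.0

X_sparse = sparse.csr_matrix(X)
try:
    normalize(X_sparse, norm='l2', return_norm=True)
except NotImplementedError as exc:
    print(exc)                              # raised for sparse input with 'l1'/'l2'

_, max_norms = normalize(X_sparse, norm='max', return_norm=True)
print(max_norms)                            # row-wise max: 4.0, same as for dense input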

sklearn/preprocessing/tests/test_data.py

Lines changed: 18 additions & 0 deletions
@@ -1315,6 +1315,24 @@ def test_normalize():
 
         assert_array_almost_equal(row_sums, ones)
 
+    # Test return_norm
+    X_dense = np.array([[3.0, 0, 4.0], [1.0, 0.0, 0.0], [2.0, 3.0, 0.0]])
+    for norm in ('l1', 'l2', 'max'):
+        _, norms = normalize(X_dense, norm=norm, return_norm=True)
+        if norm == 'l1':
+            assert_array_almost_equal(norms, np.array([7.0, 1.0, 5.0]))
+        elif norm == 'l2':
+            assert_array_almost_equal(norms, np.array([5.0, 1.0, 3.60555127]))
+        else:
+            assert_array_almost_equal(norms, np.array([4.0, 1.0, 3.0]))
+
+    X_sparse = sparse.csr_matrix(X_dense)
+    for norm in ('l1', 'l2'):
+        assert_raises(NotImplementedError, normalize, X_sparse,
+                      norm=norm, return_norm=True)
+    _, norms = normalize(X_sparse, norm='max', return_norm=True)
+    assert_array_almost_equal(norms, np.array([4.0, 1.0, 3.0]))
+
 
 def test_binarizer():
     X_ = np.array([[1, 0, 5], [2, 3, -1]])
