Merge branch 'master' into fix-stop-words-validation · scikit-learn/scikit-learn@4ff060e · GitHub

Commit 4ff060e

Merge branch 'master' into fix-stop-words-validation
2 parents 9874994 + c676981 commit 4ff060e

90 files changed, +1980 -687 lines changed

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+import matplotlib.pyplot as plt
+import numpy as np
+import scipy.sparse as sparse
+from sklearn.preprocessing import PolynomialFeatures
+from time import time
+
+degree = 2
+trials = 3
+num_rows = 1000
+dimensionalities = np.array([1, 2, 8, 16, 32, 64])
+densities = np.array([0.01, 0.1, 1.0])
+csr_times = {d: np.zeros(len(dimensionalities)) for d in densities}
+dense_times = {d: np.zeros(len(dimensionalities)) for d in densities}
+transform = PolynomialFeatures(degree=degree, include_bias=False,
+                               interaction_only=False)
+
+for trial in range(trials):
+    for density in densities:
+        for dim_index, dim in enumerate(dimensionalities):
+            print(trial, density, dim)
+            X_csr = sparse.random(num_rows, dim, density).tocsr()
+            X_dense = X_csr.toarray()
+            # CSR
+            t0 = time()
+            transform.fit_transform(X_csr)
+            csr_times[density][dim_index] += time() - t0
+            # Dense
+            t0 = time()
+            transform.fit_transform(X_dense)
+            dense_times[density][dim_index] += time() - t0
+
+csr_linestyle = (0, (3, 1, 1, 1, 1, 1))  # densely dashdotdotted
+dense_linestyle = (0, ())  # solid
+
+fig, axes = plt.subplots(nrows=len(densities), ncols=1, figsize=(8, 10))
+for density, ax in zip(densities, axes):
+
+    ax.plot(dimensionalities, csr_times[density] / trials,
+            label='csr', linestyle=csr_linestyle)
+    ax.plot(dimensionalities, dense_times[density] / trials,
+            label='dense', linestyle=dense_linestyle)
+    ax.set_title("density %0.2f, degree=%d, n_samples=%d" %
+                 (density, degree, num_rows))
+    ax.legend()
+    ax.set_xlabel('Dimensionality')
+    ax.set_ylabel('Time (seconds)')
+
+plt.tight_layout()
+plt.show()

conftest.py

Lines changed: 3 additions & 2 deletions
@@ -16,8 +16,9 @@
 PYTEST_MIN_VERSION = '3.3.0'
 
 if LooseVersion(pytest.__version__) < PYTEST_MIN_VERSION:
-    raise('Your version of pytest is too old, you should have at least '
-          'pytest >= {} installed.'.format(PYTEST_MIN_VERSION))
+    raise ImportError('Your version of pytest is too old, you should have '
+                      'at least pytest >= {} installed.'
+                      .format(PYTEST_MIN_VERSION))
 
 
 def pytest_addoption(parser):
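A side note on the fix above: in Python 3, ``raise`` only accepts ``BaseException`` subclasses or instances, so raising a bare string fails with a ``TypeError`` instead of reporting the version problem. A minimal sketch (not part of the commit) of the failure mode:

# Sketch only: why the old pattern is broken.
try:
    raise('pytest is too old')   # raising a plain str is not allowed in Python 3
except TypeError as exc:
    print(exc)                   # "exceptions must derive from BaseException"

# The fix wraps the message in a proper exception type instead:
# raise ImportError('pytest >= {} is required.'.format(PYTEST_MIN_VERSION))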

doc/developers/contributing.rst

Lines changed: 13 additions & 4 deletions
@@ -1143,6 +1143,16 @@ data dependent. A tolerance stopping criterion ``tol`` is not directly
 data dependent (although the optimal value according to some scoring
 function probably is).
 
+When ``fit`` is called, any previous call to ``fit`` should be ignored. In
+general, calling ``estimator.fit(X1)`` and then ``estimator.fit(X2)`` should
+be the same as only calling ``estimator.fit(X2)``. However, this may not be
+true in practice when ``fit`` depends on some random process, see
+:term:`random_state`. Another exception to this rule is when the
+hyper-parameter ``warm_start`` is set to ``True`` for estimators that
+support it. ``warm_start=True`` means that the previous state of the
+trainable parameters of the estimator are reused instead of using the
+default initialization strategy.
+
 Estimated Attributes
 ^^^^^^^^^^^^^^^^^^^^
 

@@ -1151,9 +1161,8 @@ ending with trailing underscore, for example the coefficients of
 some regression estimator would be stored in a ``coef_`` attribute after
 ``fit`` has been called.
 
-The last-mentioned attributes are expected to be overridden when
-you call ``fit`` a second time without taking any previous value into
-account: **fit should be idempotent**.
+The estimated attributes are expected to be overridden when you call ``fit``
+a second time.
 
 Optional Arguments
 ^^^^^^^^^^^^^^^^^^

@@ -1209,7 +1218,7 @@ the correct interface more easily.
 and optionally the mixin classes in ``sklearn.base``.
 For example, below is a custom classifier, with more examples included
 in the scikit-learn-contrib
-`project template <https://github.com/scikit-learn-contrib/project-template/blob/master/skltemplate/template.py>`__.
+`project template <https://github.com/scikit-learn-contrib/project-template/blob/master/skltemplate/_template.py>`__.
 
 >>> import numpy as np
 >>> from sklearn.base import BaseEstimator, ClassifierMixin
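The ``fit``/``warm_start`` convention documented above can be illustrated with a small sketch. This is not part of the commit; it only assumes an estimator that supports ``warm_start``, such as ``SGDClassifier``, and made-up data:

# Sketch: refitting vs. warm-starting an estimator.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X1, y1 = rng.randn(50, 3), rng.randint(0, 2, 50)
X2, y2 = rng.randn(50, 3), rng.randint(0, 2, 50)

# Default behaviour: the second fit ignores the first one entirely, so this
# is equivalent to fitting a fresh estimator on X2 (up to the randomness
# controlled by random_state).
clf = SGDClassifier(random_state=0, max_iter=5)
clf.fit(X1, y1)
clf.fit(X2, y2)

# warm_start=True: the coefficients learned on (X1, y1) are reused as the
# starting point for the fit on (X2, y2) instead of a fresh initialization.
clf_warm = SGDClassifier(random_state=0, max_iter=5, warm_start=True)
clf_warm.fit(X1, y1)
clf_warm.fit(X2, y2)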

doc/modules/classes.rst

Lines changed: 12 additions & 2 deletions
@@ -846,6 +846,7 @@ details.
    metrics.jaccard_similarity_score
    metrics.log_loss
    metrics.matthews_corrcoef
+   metrics.multilabel_confusion_matrix
    metrics.precision_recall_curve
    metrics.precision_recall_fscore_support
    metrics.precision_score

@@ -904,7 +905,7 @@ details.
 
    metrics.adjusted_mutual_info_score
    metrics.adjusted_rand_score
-   metrics.calinski_harabaz_score
+   metrics.calinski_harabasz_score
    metrics.davies_bouldin_score
    metrics.completeness_score
    metrics.cluster.contingency_matrix

@@ -1496,6 +1497,15 @@ Utilities from joblib:
 Recently deprecated
 ===================
 
+To be removed in 0.23
+---------------------
+
+.. autosummary::
+   :toctree: generated/
+   :template: deprecated_function.rst
+
+   metrics.calinski_harabaz_score
+
 
 To be removed in 0.22
 ---------------------

@@ -1513,4 +1523,4 @@ To be removed in 0.22
    :template: deprecated_function.rst
 
    covariance.graph_lasso
-   datasets.fetch_mldata
+   datasets.fetch_mldata

doc/modules/clustering.rst

Lines changed: 11 additions & 13 deletions
@@ -387,7 +387,7 @@ is updated according to the following equation:
 
 .. math::
 
-    x_i^{t+1} = x_i^t + m(x_i^t)
+    x_i^{t+1} = m(x_i^t)
 
 Where :math:`N(x_i)` is the neighborhood of samples within a given distance
 around :math:`x_i` and :math:`m` is the *mean shift* vector that is computed for each

@@ -1551,7 +1551,7 @@ Advantages
 - **Upper-bounded at 1**: Values close to zero indicate two label
   assignments that are largely independent, while values close to one
   indicate significant agreement. Further, values of exactly 0 indicate
-  **purely** independent label assignments and a AMI of exactly 1 indicates
+  **purely** independent label assignments and a FMI of exactly 1 indicates
   that the two label assignments are equal (with or without permutation).
 
 - **No assumption is made on the cluster structure**: can be used

@@ -1652,17 +1652,16 @@ Drawbacks
 * :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py` : In this example
   the silhouette analysis is used to choose an optimal value for n_clusters.
 
-.. _calinski_harabaz_index:
+.. _calinski_harabasz_index:
 
-Calinski-Harabaz Index
+Calinski-Harabasz Index
 ----------------------
-
-If the ground truth labels are not known, the Calinski-Harabaz index
-(:func:`sklearn.metrics.calinski_harabaz_score`) - also known as the Variance
+If the ground truth labels are not known, the Calinski-Harabasz index
+(:func:`sklearn.metrics.calinski_harabasz_score`) - also known as the Variance
 Ratio Criterion - can be used to evaluate the model, where a higher
-Calinski-Harabaz score relates to a model with better defined clusters.
+Calinski-Harabasz score relates to a model with better defined clusters.
 
-For :math:`k` clusters, the Calinski-Harabaz score :math:`s` is given as the
+For :math:`k` clusters, the Calinski-Harabasz score :math:`s` is given as the
 ratio of the between-clusters dispersion mean and the within-cluster
 dispersion:
 

@@ -1689,17 +1688,16 @@ points in cluster :math:`q`.
 >>> X = dataset.data
 >>> y = dataset.target
 
-In normal usage, the Calinski-Harabaz index is applied to the results of a
+In normal usage, the Calinski-Harabasz index is applied to the results of a
 cluster analysis.
 
 >>> import numpy as np
 >>> from sklearn.cluster import KMeans
 >>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
 >>> labels = kmeans_model.labels_
->>> metrics.calinski_harabaz_score(X, labels)  # doctest: +ELLIPSIS
+>>> metrics.calinski_harabasz_score(X, labels)  # doctest: +ELLIPSIS
 561.62...
 
-
 Advantages
 ~~~~~~~~~~
 

@@ -1712,7 +1710,7 @@ Advantages
 Drawbacks
 ~~~~~~~~~
 
-- The Calinski-Harabaz index is generally higher for convex clusters than other
+- The Calinski-Harabasz index is generally higher for convex clusters than other
   concepts of clusters, such as density based clusters like those obtained
   through DBSCAN.
 
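For the corrected mean shift equation above, each point is replaced by the (kernel-weighted) mean of its neighborhood rather than shifted by it. A toy sketch of one update step with a flat kernel (not part of the commit; the data and bandwidth are made up):

import numpy as np

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0]])
bandwidth = 1.0   # hypothetical neighborhood radius
x = X[0]

# N(x): samples within the bandwidth of x (flat kernel).
neighborhood = X[np.linalg.norm(X - x, axis=1) < bandwidth]

# m(x): mean of the neighborhood; the update sets x^{t+1} = m(x^t).
x_next = neighborhood.mean(axis=0)
print(x_next)     # close to the centroid of the first three points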
doc/modules/compose.rst

Lines changed: 7 additions & 5 deletions
@@ -107,10 +107,10 @@ This is particularly important for doing grid searches::
 >>> grid_search = GridSearchCV(pipe, param_grid=param_grid)
 
 Individual steps may also be replaced as parameters, and non-final steps may be
-ignored by setting them to ``None``::
+ignored by setting them to ``'passthrough'``::
 
 >>> from sklearn.linear_model import LogisticRegression
->>> param_grid = dict(reduce_dim=[None, PCA(5), PCA(10)],
+>>> param_grid = dict(reduce_dim=['passthrough', PCA(5), PCA(10)],
 ...                   clf=[SVC(), LogisticRegression()],
 ...                   clf__C=[0.1, 10, 100])
 >>> grid_search = GridSearchCV(pipe, param_grid=param_grid)

@@ -486,17 +486,19 @@ the transformation::
        [0.5, 0.5],
        [1. , 0. ]])
 
-The :func:`~sklearn.compose.make_columntransformer` function is available
+The :func:`~sklearn.compose.make_column_transformer` function is available
 to more easily create a :class:`~sklearn.compose.ColumnTransformer` object.
 Specifically, the names will be given automatically. The equivalent for the
 above example would be::
 
   >>> from sklearn.compose import make_column_transformer
   >>> column_trans = make_column_transformer(
   ...     ('city', CountVectorizer(analyzer=lambda x: [x])),
-  ...     ('title', CountVectorizer()))
+  ...     ('title', CountVectorizer()),
+  ...     remainder=MinMaxScaler())
   >>> column_trans  # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
-  ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
+  ColumnTransformer(n_jobs=None, remainder=MinMaxScaler(copy=True, ...),
+                    sparse_threshold=0.3,
                     transformer_weights=None,
                     transformers=[('countvectorizer-1', ...)
 
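The ``'passthrough'`` value documented above also works outside a grid search, for example via ``set_params``. A small sketch (not part of the commit; the pipeline and data are made up):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC

X = np.random.RandomState(0).randn(20, 5)
y = np.array([0, 1] * 10)

pipe = Pipeline([('reduce_dim', PCA(n_components=2)), ('clf', SVC())])

# Setting a non-final step to 'passthrough' skips it: the data is handed to
# the next step unchanged, while the step keeps its name in the pipeline.
pipe.set_params(reduce_dim='passthrough')
pipe.fit(X, y)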
doc/modules/ensemble.rst

Lines changed: 7 additions & 8 deletions
@@ -594,21 +594,20 @@ learners. Decision trees have a number of abilities that make them
 valuable for boosting, namely the ability to handle data of mixed type
 and the ability to model complex functions.
 
-Similar to other boosting algorithms GBRT builds the additive model in
-a forward stagewise fashion:
+Similar to other boosting algorithms, GBRT builds the additive model in
+a greedy fashion:
 
 .. math::
 
-  F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)
+  F_m(x) = F_{m-1}(x) + \gamma_m h_m(x),
 
-At each stage the decision tree :math:`h_m(x)` is chosen to
-minimize the loss function :math:`L` given the current model
-:math:`F_{m-1}` and its fit :math:`F_{m-1}(x_i)`
+where the newly added tree :math:`h_m` tries to minimize the loss :math:`L`,
+given the previous ensemble :math:`F_{m-1}`:
 
 .. math::
 
-  F_m(x) = F_{m-1}(x) + \arg\min_{h} \sum_{i=1}^{n} L(y_i,
-  F_{m-1}(x_i) + h(x))
+  h_m = \arg\min_{h} \sum_{i=1}^{n} L(y_i,
+  F_{m-1}(x_i) + h(x_i)).
 
 The initial model :math:`F_{0}` is problem specific, for least-squares
 regression one usually chooses the mean of the target values.
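To make the corrected formula concrete: with a least-squares loss, the tree added at stage m is fit to the residuals of the previous ensemble. A short, purely illustrative sketch (not part of the commit; it uses made-up data and a fixed learning rate in place of the line-search step gamma_m):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

# F_0: for least-squares regression, the mean of the targets.
F = np.full_like(y, y.mean())

# A few boosting stages: each new tree h_m minimizes the squared loss of
# F_{m-1}(x_i) + h(x_i), i.e. it is fit to the current residuals y - F.
learning_rate = 0.1
for m in range(10):
    h = DecisionTreeRegressor(max_depth=2).fit(X, y - F)
    F += learning_rate * h.predict(X)

print("training MSE:", np.mean((y - F) ** 2))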

doc/modules/metrics.rst

Lines changed: 28 additions & 0 deletions
@@ -33,6 +33,34 @@ the kernel:
     2. ``S = 1. / (D / np.max(D))``
 
 
+.. currentmodule:: sklearn.metrics
+
+The distances between the row vectors of ``X`` and the row vectors of ``Y``
+can be evaluated using :func:`pairwise_distances`. If ``Y`` is omitted the
+pairwise distances of the row vectors of ``X`` are calculated. Similarly,
+:func:`pairwise.pairwise_kernels` can be used to calculate the kernel between `X`
+and `Y` using different kernel functions. See the API reference for more
+details.
+
+    >>> import numpy as np
+    >>> from sklearn.metrics import pairwise_distances
+    >>> from sklearn.metrics.pairwise import pairwise_kernels
+    >>> X = np.array([[2, 3], [3, 5], [5, 8]])
+    >>> Y = np.array([[1, 0], [2, 1]])
+    >>> pairwise_distances(X, Y, metric='manhattan')
+    array([[ 4.,  2.],
+           [ 7.,  5.],
+           [12., 10.]])
+    >>> pairwise_distances(X, metric='manhattan')
+    array([[0., 3., 8.],
+           [3., 0., 5.],
+           [8., 5., 0.]])
+    >>> pairwise_kernels(X, Y, metric='linear')
+    array([[ 2.,  7.],
+           [ 3., 11.],
+           [ 5., 18.]])
+
+
 .. currentmodule:: sklearn.metrics.pairwise
 
 .. _cosine_similarity:
