Commit 85e4a5f

Merge branch 'master' of github.com:scikit-learn/scikit-learn into select_categorical
2 parents 017afab + 63df99d commit 85e4a5f

75 files changed (+19813 lines, -838 lines)

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,7 @@
 *.so
 *~
 .#*
+*.lprof
 *.swp
 *.swo
 .DS_Store
@@ -40,6 +41,7 @@ nips2010_pdf/
 *.tgz

 examples/cluster/joblib
+reuters/
 benchmarks/bench_covertype_data/

 *.prefs

benchmarks/bench_covertype.py

Lines changed: 1 addition & 1 deletion
@@ -109,7 +109,7 @@ def load_data(dtype=np.float32, order='F'):
     print("Loading dataset...")
     data = fetch_covtype(download_if_missing=True, shuffle=True,
                          random_state=opts.random_seed)
-    X, y = data.data, data.target
+    X, y = data['data'], data['target']
     if order.lower() == 'f':
         X = np.asfortranarray(X)

doc/datasets/covtype.rst

Lines changed: 2 additions & 1 deletion
@@ -14,6 +14,7 @@ Some of the features are boolean indicators,
 while others are discrete or continuous measurements.

 ``sklearn.datasets.fetch_covtype`` will load the covertype dataset;
-it returns a ``Bunch`` object with the feature matrix in the ``data`` member
+it returns a dictionary-like object
+with the feature matrix in the ``data`` member
 and the target values in ``target``.
 The dataset will be downloaded from the web if necessary.
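As a quick aside (not part of the commit), a minimal sketch of the dictionary-like access the updated docs describe; note that the first call downloads the full covertype dataset:

    from sklearn.datasets import fetch_covtype

    # fetch_covtype returns a dictionary-like Bunch; the keys match the
    # members named above
    data = fetch_covtype(download_if_missing=True, shuffle=True, random_state=0)
    X, y = data['data'], data['target']
    print(X.shape, y.shape)  # (581012, 54) and (581012,)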

doc/datasets/index.rst

Lines changed: 5 additions & 5 deletions
@@ -32,11 +32,11 @@ numpy array X and an array of length n_samples containing the targets y.

 The toy datasets as well as the 'real world' datasets and the datasets
 fetched from mldata.org have more sophisticated structure.
-These functions return a ``bunch`` (which is a dictionary that is
-accessible with the 'dict.key' syntax).
-All datasets have at least two keys, ``data``, containg an array of shape
-``n_samples x n_features`` (except for 20newsgroups) and ``target``, a numpy
-array of length ``n_features``, containing the targets.
+These functions return a dictionary-like object holding at least two items:
+an array of shape ``n_samples`` * ``n_features`` with key ``data``
+(except for 20newsgroups)
+and a NumPy array of length ``n_features``, containing the target values,
+with key ``target``.

 The datasets also contain a description in ``DESCR`` and some contain
 ``feature_names`` and ``target_names``.
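The same access pattern holds for the toy loaders; a minimal editorial sketch with the iris loader (not part of the commit):

    from sklearn.datasets import load_iris

    iris = load_iris()
    print(iris['data'].shape)     # (150, 4)
    print(iris['target'].shape)   # (150,)
    print(iris['feature_names'])  # names of the four measurements
    print(iris['DESCR'][:60])     # start of the free-text description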

doc/modules/classes.rst

Lines changed: 1 addition & 0 deletions
@@ -217,6 +217,7 @@ Samples generator
    decomposition.KernelPCA
    decomposition.FactorAnalysis
    decomposition.FastICA
+   decomposition.TruncatedSVD
    decomposition.NMF
    decomposition.SparsePCA
    decomposition.MiniBatchSparsePCA

doc/modules/clustering.rst

Lines changed: 28 additions & 27 deletions
@@ -55,20 +55,20 @@ Overview of clustering methods
      - number of clusters
      - Very large `n_samples`, medium `n_clusters` with
        :ref:`MiniBatch code <mini_batch_kmeans>`
-     - General-purpose, even cluster size, flat geometry, not too many clusters
-     - Distances between points
+     - General-purpose, even cluster size, flat geometry, not too many clusters
+     - Distances between points

    * - :ref:`Affinity propagation <affinity_propagation>`
-     - damping, sample preference
+     - damping, sample preference
      - Not scalable with n_samples
      - Many clusters, uneven cluster size, non-flat geometry
      - Graph distance (e.g. nearest-neighbor graph)

    * - :ref:`Mean-shift <mean_shift>`
-     - bandwidth
+     - bandwidth
      - Not scalable with n_samples
      - Many clusters, uneven cluster size, non-flat geometry
-     - Distances between points
+     - Distances between points

    * - :ref:`Spectral clustering <spectral_clustering>`
      - number of clusters
@@ -80,13 +80,13 @@ Overview of clustering methods
      - number of clusters
      - Large `n_samples` and `n_clusters`
      - Many clusters, possibly connectivity constraints
-     - Distances between points
+     - Distances between points

    * - :ref:`DBSCAN <dbscan>`
      - neighborhood size
      - Very large `n_samples`, medium `n_clusters`
      - Non-flat geometry, uneven cluster sizes
-     - Distances between nearest points
+     - Distances between nearest points

    * - :ref:`Gaussian mixtures <mixture>`
      - many
@@ -116,7 +116,7 @@ be specified. It scales well to large number of samples and has been used
 across a large range of application areas in many different fields. It is
 also equivalent to the expectation-maximization algorithm when setting the
 covariance matrix to be diagonal, equal and small. The K-means algorithm
-aims to choose centroids :math:`C` that minimise the within cluster sum of
+aims to choose centroids :math:`C` that minimise the within cluster sum of
 squares objective function with a dataset :math:`X` with :math:`n` samples:

 .. math:: J(X, C) = \sum_{i=0}^{n}\min_{\mu_j \in C}(||x_j - \mu_i||^2)
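As an editorial aside on the objective shown in this hunk (not part of the commit), a minimal NumPy sketch that evaluates J(X, C) for a fitted model and compares it with the estimator's inertia_ attribute:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(0)
    X = rng.rand(100, 2)

    km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

    # J(X, C): squared distance from each sample to its nearest centroid, summed
    sq_dists = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
    J = sq_dists.min(axis=1).sum()

    print(J, km.inertia_)  # the two numbers should agree up to floating point error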
@@ -156,8 +156,8 @@ centroids to be (generally) distant from each other, leading to provably better
 results than random initialisation.

 A parameter can be given to allow K-means to be run in parallel, called
-`n_jobs`. Giving this parameter a positive value uses that many processors
-(default=1). A value of -1 uses all processors, with -2 using one less, and so
+`n_jobs`. Giving this parameter a positive value uses that many processors
+(default=1). A value of -1 uses all processors, with -2 using one less, and so
 on. Parallelization generally speeds up computation at the cost of memory (in
 this case, multiple copies of centroids need to be stored, one for each job).

@@ -500,16 +500,17 @@ separated by areas of low density. Due to this rather generic view, clusters
 found by DBSCAN can be any shape, as opposed to k-means which assumes that
 clusters are convex shaped. The central component to the DBSCAN is the concept
 of *core samples*, which are samples that are in areas of high density. A
-cluster is therefore a set of core samples, each highly similar to each other
-and a set of non-core samples that are similar to a core sample (but are not
+cluster is therefore a set of core samples, each close to each other
+(measured by some distance measure)
+and a set of non-core samples that are close to a core sample (but are not
 themselves core samples). There are two parameters to the algorithm,
-`min_points` and `eps`, which define formally what we mean when we say *dense*.
-A higher `min_points` or lower `eps` indicate higher density necessary to form
+`min_samples` and `eps`, which define formally what we mean when we say *dense*.
+A higher `min_samples` or lower `eps` indicate higher density necessary to form
 a cluster.

 More formally, we define a core sample as being a sample in the dataset such
-that there exists `min_samples` other samples with a similarity higher than
-`eps` to it, which are defined as *neighbors* of the core sample. This tells
+that there exist `min_samples` other samples within a distance of
+`eps`, which are defined as *neighbors* of the core sample. This tells
 us that the core sample is in a dense area of the vector space. A cluster
 is a set of core samples, that can be built by recursively by taking a core
 sample, finding all of its neighbors that are core samples, finding all of
@@ -520,24 +521,24 @@ are on the fringes of a cluster.

 Any core sample is part of a cluster, by definition. Further, any cluster has
 at least `min_samples` points in it, following the definition of a core
-sample. For any sample that is not a core sample, and does not have a
-similarity higher than `eps` to a core sample, it is considered an outlier by
+sample. For any sample that is not a core sample, and does have a
+distance higher than `eps` to any core sample, it is considered an outlier by
 the algorithm.

 The algorithm is non-deterministic, however the core samples themselves will
 always belong to the same clusters (although the labels themselves may be
 different). The non-determinism comes from deciding on which cluster a
-non-core sample belongs to. A non-core sample can be have a similarity higher
+non-core sample belongs to. A non-core sample can have a distance lower
 than `eps` to two core samples in different classes. Following from the
-triangular inequality, those two core samples would be less similar than
+triangular inequality, those two core samples would be more distant than
 `eps` from each other -- else they would be in the same class. The non-core
 sample is simply assigned to which ever cluster is generated first, where
 the order is determined randomly within the code. Other than the ordering of,
 the dataset, the algorithm is deterministic, making the results relatively
 stable between iterations on the same data.

 In the figure below, the color indicates cluster membership, with large circles
-indicating core samples found by the algorithm. Smaller circles are non-core
+indicating core samples found by the algorithm. Smaller circles are non-core
 samples that are still part of a cluster. Moreover, the outliers are indicated
 by black points below.
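To make `eps` and `min_samples` concrete, a minimal sketch (not part of the commit; the parameter values are illustrative only):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs

    # two well-separated, fairly dense blobs
    X, _ = make_blobs(n_samples=100, centers=2, cluster_std=0.4, random_state=0)

    db = DBSCAN(eps=0.5, min_samples=5).fit(X)

    print(set(db.labels_))               # cluster ids; -1, if present, marks outliers
    print(len(db.core_sample_indices_))  # how many samples qualified as core samples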

@@ -819,7 +820,7 @@ Drawbacks

 * :ref:`example_cluster_plot_adjusted_for_chance_measures.py`: Analysis of
   the impact of the dataset size on the value of clustering measures
-  for random assignments. This example also includes the Adjusted Rand
+  for random assignments. This example also includes the Adjusted Rand
   Index.


@@ -864,7 +865,7 @@ following equation, from Vinh, Epps, and Bailey, (2009). In this equation,
        \frac{a_i!b_j!(N-a_i)!(N-b_j)!}{N!n_{ij}!(a_i-n_{ij})!(b_j-n_{ij})!
        (N-a_i-b_j+n_{ij})!}

-Using the expected value, the adjusted mutual information can then be
+Using the expected value, the adjusted mutual information can then be
 calculated using a similar form to that of the adjusted Rand index:

 .. math:: \text{AMI} = \frac{\text{MI} - E[\text{MI}]}{\max(H(U), H(V)) - E[\text{MI}]}
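A minimal sketch (not part of the commit) contrasting raw mutual information with the chance-adjusted AMI discussed in this hunk:

    from sklearn.metrics import adjusted_mutual_info_score, mutual_info_score

    labels_true = [0, 0, 0, 1, 1, 1]
    labels_pred = [1, 1, 0, 0, 2, 2]

    # raw MI is not corrected for chance; AMI subtracts the expected MI
    print(mutual_info_score(labels_true, labels_pred))
    print(adjusted_mutual_info_score(labels_true, labels_pred))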
@@ -875,7 +876,7 @@ calculated using a similar form to that of the adjusted Rand index:
    knowledge reuse framework for combining multiple partitions". Journal of
    Machine Learning Research 3: 583–617. doi:10.1162/153244303321897735

- * Vinh, Epps, and Bailey, (2009). "Information theoretic measures
+ * Vinh, Epps, and Bailey, (2009). "Information theoretic measures
    for clusterings comparison". Proceedings of the 26th Annual International
    Conference on Machine Learning - ICML '09.
    doi:10.1145/1553374.1553511. ISBN 9781605585161.
@@ -1045,7 +1046,7 @@ mean of homogeneity and completeness**:

   .. [B2011] `Identication and Characterization of Events in Social Media
      <http://www.cs.columbia.edu/~hila/hila-thesis-distributed.pdf>`_, Hila
-     Becker, PhD Thesis.
+     Becker, PhD Thesis.

 .. _silhouette_coefficient:

@@ -1073,7 +1074,7 @@ The Silhoeutte Coefficient *s* for a single sample is then given as:

 .. math:: s = \frac{b - a}{max(a, b)}

-The Silhouette Coefficient for a set of samples is given as the mean of the
+The Silhouette Coefficient for a set of samples is given as the mean of the
 Silhouette Coefficient for each sample.


@@ -1091,7 +1092,7 @@ cluster analysis.
   >>> from sklearn.cluster import KMeans
   >>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
   >>> labels = kmeans_model.labels_
-  >>> metrics.silhouette_score(X, labels, metric='euclidean')
+  >>> metrics.silhouette_score(X, labels, metric='euclidean')
   ... # doctest: +ELLIPSIS
   0.55...
doc/modules/cross_validation.rst

Lines changed: 5 additions & 2 deletions
@@ -277,8 +277,11 @@ not waste much data as only one sample is removed from the learning set::
 Leave-P-Out - LPO
 -----------------

-:class:`LeavePOut` is very similar to *Leave-One-Out*, as it creates all the
-possible training/test sets by removing :math:`P` samples from the complete set.
+:class:`LeavePOut` is very similar to :class:`LeaveOneOut` as it creates all
+the possible training/test sets by removing :math:`p` samples from the complete
+set. For :math:`n` samples, this produces :math:`{n \choose p}` train-test
+pairs. Unlike :class:`LeaveOneOut` and :class:`KFold`, the test sets will
+overlap for :math:`p > 1`.

 Example of Leave-2-Out::
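The Leave-2-Out listing itself lies outside this hunk; to make the "n choose p" count concrete, a small pure-Python sketch (not part of the commit, written with itertools so it does not depend on any particular scikit-learn version):

    from itertools import combinations

    n, p = 4, 2
    samples = range(n)

    # every p-sized subset of the indices is used once as the test set
    splits = [(sorted(set(samples) - set(test)), list(test))
              for test in combinations(samples, p)]

    for train, test in splits:
        print(train, test)

    print(len(splits))  # 6, i.e. "4 choose 2"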
doc/modules/decomposition.rst

Lines changed: 80 additions & 0 deletions
@@ -232,6 +232,86 @@ factorization, while larger values shrink many coefficients to zero.
      R. Jenatton, G. Obozinski, F. Bach, 2009


+.. _LSA:
+
+Truncated singular value decomposition and latent semantic analysis
+===================================================================
+
+:class:`TruncatedSVD` implements a variant of singular value decomposition
+(SVD) that only computes the :math:`k` largest singular values,
+where :math:`k` is a user-specified parameter.
+
+When truncated SVD is applied to term-document matrices
+(as returned by ``CountVectorizer`` or ``TfidfVectorizer``),
+this transformation is known as
+`latent semantic analysis <http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf>`_
+(LSA), because it transforms such matrices
+to a "semantic" space of low dimensionality.
+In particular, LSA is known to combat the effects of synonymy and polysemy
+(both of which roughly mean there are multiple meanings per word),
+which cause term-document matrices to be overly sparse
+and exhibit poor similarity under measures such as cosine similarity.
+
+.. note::
+    LSA is also known as latent semantic indexing, LSI,
+    though strictly that refers to its use in persistent indexes
+    for information retrieval purposes.
+
+Mathematically, truncated SVD applied to training samples :math:`X`
+produces a low-rank approximation :math:`X`:
+
+.. math::
+    X \approx X_k = U_k \Sigma_k V_k^\top
+
+After this operation, :math:`U_k \Sigma_k^\top`
+is the transformed training set with :math:`k` features
+(called ``n_components`` in the API).
+
+To also transform a test set :math:`X`, we multiply it with :math:`V_k`:
+
+.. math::
+    X' = X V_k^\top
+
+.. note::
+    Most treatments of LSA in the natural language processing (NLP)
+    and information retrieval (IR) literature
+    swap the axis of the matrix :math:`X` so that it has shape
+    ``n_features`` × ``n_samples``.
+    We present LSA in a different way that matches the scikit-learn API better,
+    but the singular values found are the same.
+
+:class:`TruncatedSVD` is very similar to :class:`PCA`, but differs
+in that it works on sample matrices :math:`X` directly
+instead of their covariance matrices.
+When the columnwise (per-feature) means of :math:`X`
+are subtracted from the feature values,
+truncated SVD on the resulting matrix is equivalent to PCA.
+In practical terms, this means
+that the :class:`TruncatedSVD` transformer accepts ``scipy.sparse``
+matrices without the need to densify them,
+as densifying may fill up memory even for medium-sized document collections.
+
+While the :class:`TruncatedSVD` transformer
+works with any (sparse) feature matrix,
+using it on tf–idf matrices is recommended over raw frequency counts
+in an LSA/document processing setting.
+In particular, sublinear scaling and inverse document frequency
+should be turned on (``sublinear_tf=True, use_idf=True``)
+to bring the feature values closer to a Gaussian distribution,
+compensating for LSA's erroneous assumptions about textual data.
+
+.. topic:: Examples:
+
+   * :ref:`example_document_clustering.py`
+
+.. topic:: References:
+
+   * Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze (2008),
+     *Introduction to Information Retrieval*, Cambridge University Press,
+     chapter 18: `Matrix decompositions & latent semantic indexing
+     <http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf>`_
+
+
 .. _DictionaryLearning:

 Dictionary Learning
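A minimal usage sketch of the class documented in the new section above (not part of the commit; the toy corpus is made up for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets",
        "computers process text documents",
    ]

    # sparse tf-idf term-document matrix; TruncatedSVD consumes it without densifying
    tfidf = TfidfVectorizer(sublinear_tf=True, use_idf=True)
    X = tfidf.fit_transform(docs)

    svd = TruncatedSVD(n_components=2)
    X_lsa = svd.fit_transform(X)  # shape: (n_documents, 2)

    print(X_lsa.shape)
    print(svd.explained_variance_ratio_)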

doc/modules/feature_extraction.rst

Lines changed: 19 additions & 0 deletions
@@ -653,6 +653,25 @@ The :class:`HashingVectorizer` also comes with the following limitations:
   model. A :class:`TfidfTransformer` can be appended to it in a pipeline if
   required.

+Performing out-of-core scaling with HashingVectorizer
+------------------------------------------------------
+
+An interesting development of using a :class:`HashingVectorizer` is the ability
+to perform `out-of-core`_ scaling. This means that we can learn from data that
+does not fit into the computer's main memory.
+
+.. _out-of-core: http://en.wikipedia.org/wiki/Out-of-core_algorithm.
+
+A strategy to implement out-of-core scaling is to stream data to the estimator
+in mini-batches. Each mini-batch is vectorized using :class:`HashingVectorizer`
+so as to guarantee that the input space of the estimator has always the same
+dimensionality. The amount of memory used at any time is thus bounded by the
+size of a mini-batch. Although there is no limit to the amount of data that can
+be ingested using such an approach, from a practical point of view the learning
+time is often limited by the CPU time one wants to spend on the task.
+
+For a full-fledged example of out-of-core scaling in a text classification
+task see :ref:`example_applications_plot_out_of_core_classification.py`.

 Customizing the vectorizer classes
 ----------------------------------
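A minimal sketch of the mini-batch strategy described in the added section (not part of the commit; the mini-batch generator is a made-up stand-in, and any estimator providing ``partial_fit`` would do):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2 ** 18)  # fixed, data-independent input space
    clf = SGDClassifier()

    def iter_minibatches():
        # stand-in for a generator that streams (texts, labels) from disk or network
        yield ["spam spam spam", "meeting at noon"], [1, 0]
        yield ["cheap pills", "quarterly report attached"], [1, 0]

    classes = [0, 1]  # all classes must be declared up front for partial_fit
    for texts, labels in iter_minibatches():
        X_batch = vectorizer.transform(texts)  # same dimensionality for every batch
        clf.partial_fit(X_batch, labels, classes=classes)

    print(clf.predict(vectorizer.transform(["free pills now"])))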

doc/modules/metrics.rst

Lines changed: 1 addition & 1 deletion
@@ -96,7 +96,7 @@ The chi squared kernel is given by

 .. math::

-        k(x, y) = exp(-\gamma * \sum_i (x[i] - y[i]) ** 2 / (x[i] + y[i]))
+        k(x, y) = \exp \left (-\gamma \sum_i \frac{(x[i] - y[i]) ^ 2}{x[i] + y[i]} \right )

 The data is assumed to be non-negative, and is often normalized to have an L1-norm of one.
 The normalization is rationalized with the connection to the chi squared distance,
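For reference, the kernel in the corrected formula is exposed as ``sklearn.metrics.pairwise.chi2_kernel``; a minimal sketch (not part of the commit):

    import numpy as np
    from sklearn.metrics.pairwise import chi2_kernel

    # non-negative, L1-normalized rows, as the documentation above assumes
    X = np.array([[0.2, 0.5, 0.3],
                  [0.1, 0.1, 0.8]])

    K = chi2_kernel(X, gamma=1.0)
    print(K)  # symmetric matrix with ones on the diagonal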

doc/sphinxext/gen_rst.py

Lines changed: 4 additions & 5 deletions
@@ -873,15 +873,14 @@ def embed_code_links(app, exception):
                     str_repl[name_html] = link_pattern % (link, name_html)
             # do the replacement in the html file
             if len(str_repl) > 0:
-                with open(full_fname, 'rt') as fid:
+                with open(full_fname, 'rb') as fid:
                     lines_in = fid.readlines()
-                fid.close()
-                with open(full_fname, 'wt') as fid:
+                with open(full_fname, 'wb') as fid:
                     for line in lines_in:
+                        line = line.decode('utf-8')
                         for name, link in str_repl.iteritems():
                             line = line.replace(name, link)
-                    fid.write(line)
-                    fid.close()
+                        fid.write(line.encode('utf-8'))
     except urllib2.HTTPError, e:
         print ("The following HTTP Error has occurred:\n")
         print e.code