Merge pull request #4887 from amueller/tsne_fixes · scikit-learn/scikit-learn@770d089 · GitHub

Commit 770d089

Merge pull request #4887 from amueller/tsne_fixes
Rebase of barnes-hut TSNE
2 parents 37bd275 + fe49161 · commit 770d089

File tree

11 files changed: +31628 −1594 lines changed


doc/modules/manifold.rst

+74 −17
@@ -480,30 +480,65 @@ t-distributed Stochastic Neighbor Embedding (t-SNE)
 t-SNE (:class:`TSNE`) converts affinities of data points to probabilities.
 The affinities in the original space are represented by Gaussian joint
 probabilities and the affinities in the embedded space are represented by
-Student's t-distributions. The Kullback-Leibler (KL) divergence of the joint
+Student's t-distributions. This allows t-SNE to be particularly sensitive
+to local structure and has a few other advantages over existing techniques:
+
+* Revealing the structure at many scales on a single map
+* Revealing data that lie in multiple, different manifolds or clusters
+* Reducing the tendency to crowd points together at the center
+
+While Isomap, LLE and variants are best suited to unfold a single continuous
+low dimensional manifold, t-SNE will focus on the local structure of the data
+and will tend to extract clustered local groups of samples as highlighted on
+the S-curve example. This ability to group samples based on the local structure
+might be beneficial to visually disentangle a dataset that comprises several
+manifolds at once, as is the case in the digits dataset.
+
+The Kullback-Leibler (KL) divergence of the joint
 probabilities in the original space and the embedded space will be minimized
 by gradient descent. Note that the KL divergence is not convex, i.e.
 multiple restarts with different initializations will end up in local minima
 of the KL divergence. Hence, it is sometimes useful to try different seeds
-and select the embedding with the lowest KL divergence.
+and select the embedding with the lowest KL divergence.
+
+The disadvantages to using t-SNE are roughly:
+
+* t-SNE is computationally expensive, and can take several hours on million-sample
+  datasets where PCA will finish in seconds or minutes
+* The Barnes-Hut t-SNE method is limited to two or three dimensional embeddings.
+* The algorithm is stochastic and multiple restarts with different seeds can
+  yield different embeddings. However, it is perfectly legitimate to pick the
+  embedding with the least error.
+* Global structure is not explicitly preserved. This problem is mitigated by
+  initializing points with PCA (using `init='pca'`).


 .. figure:: ../auto_examples/manifold/images/plot_lle_digits_013.png
    :target: ../auto_examples/manifold/plot_lle_digits.html
    :align: center
    :scale: 50

-
+Optimizing t-SNE
+----------------
 The main purpose of t-SNE is visualization of high-dimensional data. Hence,
 it works best when the data will be embedded on two or three dimensions.

 Optimizing the KL divergence can be a little bit tricky sometimes. There are
-three parameters that control the optimization of t-SNE:
+five parameters that control the optimization of t-SNE and therefore possibly
+the quality of the resulting embedding:

+* perplexity
 * early exaggeration factor
 * learning rate
 * maximum number of iterations
-
+* angle (not used in the exact method)
+
+The perplexity is defined as :math:`k = 2^S` where :math:`S` is the Shannon
+entropy of the conditional probability distribution. The perplexity of a
+:math:`k`-sided die is :math:`k`, so that :math:`k` is effectively the number of
+nearest neighbors t-SNE considers when generating the conditional probabilities.
+Larger perplexities lead to more nearest neighbors and are less sensitive to small
+structure. Larger datasets tend to require larger perplexities.
 The maximum number of iterations is usually high enough and does not need
 any tuning. The optimization consists of two phases: the early exaggeration
 phase and the final optimization. During early exaggeration the joint
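The optimization advice in the hunk above (non-convex KL objective, multiple seeds, `init='pca'`) maps directly onto the public :class:`TSNE` API. Below is a minimal, illustrative sketch (not part of the diff) that fits a few differently seeded embeddings of the digits dataset and keeps the run with the lowest fitted ``kl_divergence_``; the dataset and the choice of three seeds are arbitrary::

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, _ = load_digits(return_X_y=True)

    best_embedding, best_kl = None, np.inf
    for seed in (0, 1, 2):
        # perplexity ~ effective number of neighbors; init='pca' helps preserve
        # global structure; random_state seeds the stochastic optimization
        tsne = TSNE(n_components=2, perplexity=30, init='pca', random_state=seed)
        embedding = tsne.fit_transform(X)
        if tsne.kl_divergence_ < best_kl:  # keep the run with the lowest KL divergence
            best_embedding, best_kl = embedding, tsne.kl_divergence_

    print("best KL divergence: %.3f" % best_kl)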
@@ -514,19 +549,37 @@ divergence could increase during this phase. Usually it does not have to be
 tuned. A critical parameter is the learning rate. If it is too low gradient
 descent will get stuck in a bad local minimum. If it is too high the KL
 divergence will increase during optimization. More tips can be found in
-Laurens van der Maaten's FAQ (see references).
+Laurens van der Maaten's FAQ (see references). The last parameter, angle,
+is a tradeoff between performance and accuracy. Larger angles imply that we
+can approximate larger regions by a single point, leading to better speed
+but less accurate results.

-Standard t-SNE that has been implemented here is usually much slower than
-other manifold learning algorithms. The optimization is quite difficult
-and the computation of the gradient is on :math:`O[d N^2]`, where :math:`d`
-is the number of output dimensions and :math:`N` is the number of samples.
+Barnes-Hut t-SNE
+----------------

-While Isomap, LLE and variants are best suited to unfold a single continuous
-low dimensional manifold, t-SNE will focus on the local structure of the data
-and will tend to extract clustered local groups of samples as highlighted on
-the S-curve example. This ability to group samples based on the local structure
-might be beneficial to visually disentangle a dataset that comprises several
-manifolds at once as is the case in the digits dataset.
+The Barnes-Hut t-SNE that has been implemented here is usually much slower than
+other manifold learning algorithms. The optimization is quite difficult
+and the computation of the gradient is :math:`O[d N \log(N)]`, where :math:`d`
+is the number of output dimensions and :math:`N` is the number of samples. The
+Barnes-Hut method improves on the exact method, whose complexity is
+:math:`O[d N^2]`, but has several other notable differences:
+
+* The Barnes-Hut implementation only works when the target dimensionality is 3
+  or less. The 2D case is typical when building visualizations.
+* Barnes-Hut only works with dense input data. Sparse data matrices can only be
+  embedded with the exact method, or can be approximated by a dense low rank
+  projection, for instance using :class:`sklearn.decomposition.TruncatedSVD`.
+* Barnes-Hut is an approximation of the exact method. The approximation is
+  parameterized with the angle parameter; therefore the angle parameter is
+  unused when ``method="exact"``.
+* Barnes-Hut is significantly more scalable. Barnes-Hut can be used to embed
+  hundreds of thousands of data points while the exact method can handle
+  thousands of samples before becoming computationally intractable.
+
+For visualization purposes (which is the main use case of t-SNE), using the
+Barnes-Hut method is strongly recommended. The exact t-SNE method is useful
+for checking the theoretical properties of the embedding, possibly in higher
+dimensional space, but is limited to small datasets due to computational constraints.

 Also note that the digits labels roughly match the natural grouping found by
 t-SNE while the linear 2D projection of the PCA model yields a representation
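As a companion to the new Barnes-Hut section, here is a small, hedged sketch (not part of the diff) of the workflow it describes: sparse input is first reduced to a dense low rank representation with :class:`sklearn.decomposition.TruncatedSVD`, then embedded with ``method='barnes_hut'``; the exact method is shown only for comparison and ignores ``angle``. The synthetic sparse matrix and the chosen sizes are arbitrary::

    import scipy.sparse as sp
    from sklearn.decomposition import TruncatedSVD
    from sklearn.manifold import TSNE

    # Arbitrary sparse data standing in for e.g. a bag-of-words matrix.
    X_sparse = sp.random(1000, 500, density=0.01, format='csr', random_state=0)

    # Barnes-Hut needs dense input: apply a dense low rank projection first.
    X_dense = TruncatedSVD(n_components=50, random_state=0).fit_transform(X_sparse)

    # Approximate embedding: larger angle = coarser approximation, faster run.
    embedding_bh = TSNE(n_components=2, method='barnes_hut', angle=0.5,
                        random_state=0).fit_transform(X_dense)

    # Exact embedding: O(d N^2), angle is ignored; feasible only for small N.
    embedding_exact = TSNE(n_components=2, method='exact',
                           random_state=0).fit_transform(X_dense)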
@@ -547,9 +600,13 @@ the internal structure of the data.
   (2008)

 * `"t-Distributed Stochastic Neighbor Embedding"
-  <http://homepage.tudelft.nl/19j49/t-SNE.html>`_
+  <http://lvdmaaten.github.io/tsne/>`_
   van der Maaten, L.J.P.

+* `"Accelerating t-SNE using Tree-Based Algorithms."
+  <http://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf>`_
+  L.J.P. van der Maaten. Journal of Machine Learning Research 15(Oct):3221-3245, 2014.
+
 Tips on practical use
 =====================

doc/whats_new.rst

+6 −1
@@ -54,6 +54,9 @@ New features

 Enhancements
 ............
+- :class:`manifold.TSNE` now supports approximate optimization via the
+  Barnes-Hut method, leading to much faster fitting. By Christopher Erick Moody.
+  (`#4025 <https://github.com/scikit-learn/scikit-learn/pull/4025>`_)

 - :class:`cluster.mean_shift_.MeanShift` now supports parallel execution,
   as implemented in the ``mean_shift`` function. By `Martino Sorbaro`_.
@@ -182,7 +185,9 @@ Enhancements
   , to set the seed of the pseudo random generator used in ``sag`` solver. By `Tom Dupre la Tour`_.

 - Added optional parameter ``warm_start`` in
-  :class:`linear_model.LogisticRegression`. If set to True, the solvers ``lbfgs``, ``newton-cg`` and ``sag`` will be initialized with the coefficients computed in the previous fit. By `Tom Dupre la Tour`_.
+  :class:`linear_model.LogisticRegression`. If set to True, the solvers
+  ``lbfgs``, ``newton-cg`` and ``sag`` will be initialized with the
+  coefficients computed in the previous fit. By `Tom Dupre la Tour`_.

 - Added ``sample_weight`` support to :class:`linear_model.LogisticRegression` for
   the ``lbfgs``, ``newton-cg``, and ``sag`` solvers. By `Valentin Stolbunov`_.
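For the ``warm_start`` changelog entry above, a brief illustrative sketch (not part of the diff) of the described behaviour: with ``warm_start=True`` the second call to ``fit`` starts from the coefficients of the previous fit rather than from scratch, which can speed up refitting after a small parameter change::

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, random_state=0)

    clf = LogisticRegression(solver='lbfgs', warm_start=True)
    clf.fit(X, y)            # first fit starts from zero coefficients
    clf.set_params(C=0.1)    # change the regularization strength
    clf.fit(X, y)            # second fit reuses clf.coef_ as its starting point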

0 commit comments
