[MRG+1] Integration of GSoC2015 Gaussian Mixture (first step) by tguillemot · Pull Request #6666 · scikit-learn/scikit-learn · GitHub
[MRG+1] Integration of GSoC2015 Gaussian Mixture (first step) #6666

Merged

merged 2 commits on Apr 21, 2016

Changes from all commits

64 changes: 32 additions & 32 deletions doc/modules/dp-derivation.rst
@@ -13,7 +13,7 @@ as covariance matrices.
The inference algorithm is the one from the following paper:

* `Variational Inference for Dirichlet Process Mixtures
  <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.4467&rep=rep1&type=pdf>`_
David Blei, Michael Jordan. Bayesian Analysis, 2006

While this paper presents the parts of the inference algorithm that
@@ -36,7 +36,7 @@ is necessary to invert the covariance/precision matrices and compute
its determinant, hence the cubic term).

This implementation is expected to scale at least as well as EM for
-the mixture of Gaussians.
+the Gaussian mixture.
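
A rough way to see where the cubic term comes from is to time the per-component linear algebra on a single D x D covariance matrix. This is only an illustrative sketch (the matrices and sizes below are made up)::

    import time
    import numpy as np

    rng = np.random.RandomState(0)
    for D in (10, 50, 100, 200):
        A = rng.randn(D, D)
        cov = np.dot(A, A.T) + D * np.eye(D)   # a well-conditioned D x D covariance
        start = time.time()
        np.linalg.inv(cov)                     # O(D^3) inversion
        np.linalg.slogdet(cov)                 # O(D^3) log-determinant
        print(D, time.time() - start)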

Update rules for VB inference
==============================
@@ -78,7 +78,7 @@ The variational distribution we'll use is
\sigma_k &\sim& Gamma(a_{k}, b_{k}) \\
z_{i} &\sim& Discrete(\nu_{z_i}) \\
\end{array}
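
To make the factorized family above concrete, here is a hedged NumPy/SciPy sketch that draws one sample from each factor; the shapes and parameter values are placeholders, not the values computed by the algorithm, and q(mu_k) is read as a unit-covariance Gaussian as the entropy term below suggests::

    import numpy as np
    from scipy import stats

    K, D = 5, 2                        # truncation level and dimension (placeholders)
    rng = np.random.RandomState(0)

    # q(phi_k) = Beta(gamma_{k,1}, gamma_{k,2})
    gamma1, gamma2 = np.ones(K), np.full(K, 2.0)
    phi = stats.beta(gamma1, gamma2).rvs(random_state=rng)

    # q(mu_k) = Normal(nu_{mu_k}, I)
    nu_mu = rng.randn(K, D)
    mu = nu_mu + rng.randn(K, D)

    # q(sigma_k) = Gamma(a_k, b_k), with b_k an inverse scale (rate)
    a, b = np.full(K, 2.0), np.ones(K)
    sigma = stats.gamma(a, scale=1.0 / b).rvs(random_state=rng)

    # q(z_i) = Discrete(nu_{z_i})
    nu_z = np.full(K, 1.0 / K)
    z = rng.choice(K, p=nu_z)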


The bound
...........
@@ -88,7 +88,7 @@ The variational bound is
.. math::

\begin{array}{rcl}
\log P(X) &\ge&
\sum_k (E_q[\log P(\phi_k)] - E_q[\log Q(\phi_k)]) \\
&&
+\sum_k \left( E_q[\log P(\mu_k)] - E_q[\log Q(\mu_k)] \right) \\
@@ -99,28 +99,28 @@ The variational bound is
&&
+\sum_i E_q[\log P(X_t)]
\end{array}


**The bound for** :math:`\phi_k`

.. math::

\begin{array}{rcl}
E_q[\log Beta(1,\alpha)] - E[\log Beta(\gamma_{k,1},\gamma_{k,2})]
&=&
\log \Gamma(1+\alpha) - \log \Gamma(\alpha) \\ &&
+(\alpha-1)(\Psi(\gamma_{k,2})-\Psi(\gamma_{k,1}+\gamma_{k,2})) \\ &&
- \log \Gamma(\gamma_{k,1}+\gamma_{k,2}) + \log \Gamma(\gamma_{k,1}) +
\log \Gamma(\gamma_{k,2}) \\ &&
-
(\gamma_{k,1}-1)(\Psi(\gamma_{k,1})-\Psi(\gamma_{k,1}+\gamma_{k,2}))
\\ &&
-
(\gamma_{k,2}-1)(\Psi(\gamma_{k,2})-\Psi(\gamma_{k,1}+\gamma_{k,2}))
\end{array}
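
The expression above transcribes almost line by line into SciPy; a small helper (the function name is mine) for a single component::

    from scipy.special import digamma, gammaln

    def phi_bound_term(alpha, gamma1, gamma2):
        """E_q[log P(phi_k)] - E_q[log Q(phi_k)] for one stick-breaking weight."""
        psi_sum = digamma(gamma1 + gamma2)
        return (gammaln(1 + alpha) - gammaln(alpha)
                + (alpha - 1) * (digamma(gamma2) - psi_sum)
                - gammaln(gamma1 + gamma2) + gammaln(gamma1) + gammaln(gamma2)
                - (gamma1 - 1) * (digamma(gamma1) - psi_sum)
                - (gamma2 - 1) * (digamma(gamma2) - psi_sum))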


**The bound for** :math:`\mu_k`


.. math::

@@ -131,11 +131,11 @@ The variational bound is
- \int\!d\mu_f q(\mu_f) \log Q(\mu_f) \\
&=&
- \frac{D}{2}\log 2\pi - \frac{1}{2} ||\nu_{\mu_k}||^2 - \frac{D}{2}
+ \frac{D}{2} \log 2\pi e
\end{array}
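
For completeness, the same term for :math:`\mu_k` as a direct transcription (again a helper of my own, assuming unit covariances as in the display)::

    import numpy as np

    def mu_bound_term(nu_mu_k):
        """E_q[log P(mu_k)] - E_q[log Q(mu_k)] for one component, unit covariances."""
        D = nu_mu_k.shape[0]
        return (-0.5 * D * np.log(2 * np.pi)
                - 0.5 * np.dot(nu_mu_k, nu_mu_k)
                - 0.5 * D
                + 0.5 * D * np.log(2 * np.pi * np.e))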


**The bound for** :math:`\sigma_k`

Here I'll use the inverse scale parametrization of the gamma
distribution.
@@ -155,21 +155,21 @@ distribution.
\begin{array}{rcl}
&& E_q[\log P(z)] - E_q[\log Q(z)] \\
&=&
\sum_{k} \left(
\left(\sum_{j=k+1}^K \nu_{z_{i,j}}\right)(\Psi(\gamma_{k,2})-\Psi(\gamma_{k,1}+\gamma_{k,2}))
+ \nu_{z_{i,k}}(\Psi(\gamma_{k,1})-\Psi(\gamma_{k,1}+\gamma_{k,2}))
- \log \nu_{z_{i,k}} \right)
\end{array}


**The bound for** :math:`X`

Recall that there is no need for a :math:`Q(X)` so this bound is just

.. math::

\begin{array}{rcl}
E_q[\log P(X_i)] &=& \sum_k \nu_{z_k} \left( - \frac{D}{2}\log 2\pi
+\frac{D}{2} (\Psi(a_k) - \log(b_k))
-\frac{a_k}{2b_k} (||X_i - \nu_{\mu_k}||^2+D) - \log 2 \pi e \right)
\end{array}
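
A transcription of this per-sample term, vectorized over components (the helper name and array shapes are my own conventions: ``nu_mu`` is (K, D); ``a``, ``b`` and ``nu_z`` are length K), following the display above verbatim::

    import numpy as np
    from scipy.special import digamma

    def x_bound_term(X_i, nu_z, nu_mu, a, b):
        """E_q[log P(X_i)], transcribed from the display above (spherical case)."""
        D = X_i.shape[0]
        sq_norm = np.sum((X_i - nu_mu) ** 2, axis=1)     # ||X_i - nu_{mu_k}||^2 for each k
        per_component = (-0.5 * D * np.log(2 * np.pi)
                         + 0.5 * D * (digamma(a) - np.log(b))
                         - a / (2.0 * b) * (sq_norm + D)
                         - np.log(2 * np.pi * np.e))
        return np.sum(nu_z * per_component)
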
@@ -186,7 +186,7 @@ The updates

\begin{array}{rcl}
\gamma_{k,1} &=& 1+\sum_i \nu_{z_{i,k}} \\
\gamma_{k,2} &=& \alpha + \sum_i \sum_{j > k} \nu_{z_{i,j}}.
\end{array}
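
In NumPy, with the responsibilities stored as an (n_samples, K) array, these two updates are one reduction each (a sketch with my own naming)::

    import numpy as np

    def update_gamma(resp, alpha):
        """Stick-breaking updates; resp[i, k] holds nu_{z_{i,k}}."""
        Nk = resp.sum(axis=0)                   # sum_i nu_{z_{i,k}}
        K = Nk.shape[0]
        gamma1 = 1.0 + Nk
        gamma2 = alpha + np.array([Nk[k + 1:].sum() for k in range(K)])
        return gamma1, gamma2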


@@ -204,7 +204,7 @@ The gradient is

so the update is

.. math::

\nu_{\mu_k} = \frac{\sum_i \frac{\nu_{z_{i,k}}b_k}{a_k}X_i}{1+\sum_i \frac{\nu_{z_{i,k}}b_k}{a_k}}

@@ -299,9 +299,9 @@ have
.. math::

\begin{array}{rcl}
E_q[\log P(X_i)] &=& \sum_k \nu_{z_k} \Big( - \frac{D}{2}\log 2\pi
+\frac{1}{2}\sum_d (\Psi(a_{k,d}) - \log(b_{k,d})) \\
&&
-\frac{1}{2}((X_i - \nu_{\mu_k})^T\bm{\frac{a_k}{b_k}}(X_i - \nu_{\mu_k})+ \sum_d \sigma_{k,d})- \log 2 \pi e \Big)
\end{array}

@@ -315,7 +315,7 @@ The updates only change for :math:`\mu` (to weight them with the new

**The update for** :math:`\mu`

.. math::

\nu_{\mu_k} = \left(\mathbf{I}+\sum_i \frac{\nu_{z_{i,k}}\mathbf{b_k}}{\mathbf{a_k}}\right)^{-1}\left(\sum_i \frac{\nu_{z_{i,k}}b_k}{a_k}X_i\right)

@@ -328,13 +328,13 @@ of the bound:

.. math::

\log Q(\sigma_{k,d}) = -\sigma_{k,d} + \sum_i \nu_{z_{i,k}}\frac{1}{2}\log \sigma_{k,d}
- \frac{\sigma_{k,d}}{2}\sum_i \nu_{z_{i,k}} ((X_{i,d}-\mu_{k,d})^2 + 1)


Hence

.. math::

a_{k,d} = 1 + \frac{1}{2} \sum_i \nu_{z_{i,k}}

@@ -381,7 +381,7 @@ There are two changes in the lower-bound: for :math:`\Sigma` and for :math:`X`.
\begin{array}{rcl}
\frac{D^2}{2}\log 2 + \sum_d \log \Gamma(\frac{D+1-d}{2}) \\
- \frac{aD}{2}\log 2 + \frac{a}{2} \log |\mathbf{B}| + \sum_d \log \Gamma(\frac{a+1-d}{2}) \\
+ \frac{a-D}{2}\left(\sum_d \Psi\left(\frac{a+1-d}{2}\right)
+ D \log 2 + \log |\mathbf{B}|\right) \\
+ \frac{1}{2} a \mathbf{tr}[\mathbf{B}-\mathbf{I}]
\end{array}
@@ -392,10 +392,10 @@
.. math::

\begin{array}{rcl}
E_q[\log P(X_i)] &=& \sum_k \nu_{z_k} \Big( - \frac{D}{2}\log 2\pi
+\frac{1}{2}\left(\sum_d \Psi\left(\frac{a+1-d}{2}\right)
+ D \log 2 + \log |\mathbf{B}|\right) \\
&&
-\frac{1}{2}((X_i - \nu_{\mu_k})a\mathbf{B}(X_i - \nu_{\mu_k})+ a\mathbf{tr}(\mathbf{B}))- \log 2 \pi e \Big)
\end{array}

@@ -482,9 +482,9 @@ The updates
..............

All that changes in the updates is that the update for mu uses only
the proper sigma and the updates for a and B don't have a sum over K, so

.. math::

\nu_{\mu_k} = \left(\mathbf{I}+ a_k\mathbf{B_k}\sum_i \nu_{z_{i,k}}\right)^{-1}
\left(a_k\mathbf{B_k}\sum_i \nu_{z_{i,k}} X_i\right)
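
A sketch of this update with NumPy, one linear solve per component (``resp`` is the (n_samples, K) responsibility matrix, ``B`` a (K, D, D) array; the helper is mine)::

    import numpy as np

    def update_mu_full(resp, X, a, B):
        """mu update for the full-covariance case; resp[i, k] holds nu_{z_{i,k}}."""
        n_samples, D = X.shape
        K = resp.shape[1]
        nu_mu = np.empty((K, D))
        for k in range(K):
            Nk = resp[:, k].sum()                     # sum_i nu_{z_{i,k}}
            Sk = np.dot(resp[:, k], X)                # sum_i nu_{z_{i,k}} X_i
            lhs = np.eye(D) + a[k] * B[k] * Nk        # I + a_k B_k sum_i nu_{z_{i,k}}
            nu_mu[k] = np.linalg.solve(lhs, a[k] * np.dot(B[k], Sk))
        return nu_mu
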
62 changes: 30 additions & 32 deletions doc/modules/mixture.rst
@@ -2,9 +2,9 @@

.. _gmm:

-===================================================
+=======================
Gaussian mixture models
-===================================================
+=======================

.. currentmodule:: sklearn.mixture

@@ -19,7 +19,7 @@ components are also provided.
:align: center
:scale: 50%

**Two-component Gaussian mixture model:** *data points, and equi-probability surfaces of
the model.*

A Gaussian mixture model is a probabilistic model that assumes all the
@@ -33,25 +33,25 @@ Scikit-learn implements different classes to estimate Gaussian
mixture models that correspond to different estimation strategies,
detailed below.

-Gaussian Mixture Model
-======================
+Gaussian Mixture
+================

-The :class:`GMM` object implements the
+The :class:`GaussianMixture` object implements the
:ref:`expectation-maximization <expectation_maximization>` (EM)
algorithm for fitting mixture-of-Gaussian models. It can also draw
confidence ellipsoids for multivariate models, and compute the
Bayesian Information Criterion to assess the number of clusters in the
-data. A :meth:`GMM.fit` method is provided that learns a Gaussian
+data. A :meth:`GaussianMixture.fit` method is provided that learns a Gaussian
Mixture Model from train data. Given test data, it can assign to each
sample the Gaussian it most probably belongs to using
-the :meth:`GMM.predict` method.
+the :meth:`GaussianMixture.predict` method.

..
Alternatively, the probability of each
sample belonging to the various Gaussians may be retrieved using the
-:meth:`GMM.predict_proba` method.
+:meth:`GaussianMixture.predict_proba` method.

-The :class:`GMM` comes with different options to constrain the covariance
+The :class:`GaussianMixture` comes with different options to constrain the covariance
of the different classes estimated: spherical, diagonal, tied or full
covariance.
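
A minimal usage sketch of the estimator described above, assuming the new class is importable as ``sklearn.mixture.GaussianMixture`` (synthetic data, parameters chosen only for illustration)::

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(100, 2),
                   rng.randn(100, 2) + [5, 5]])

    gmm = GaussianMixture(n_components=2, covariance_type='full').fit(X)
    labels = gmm.predict(X)          # hard assignment to the most probable Gaussian
    proba = gmm.predict_proba(X)     # per-sample responsibilities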

@@ -63,37 +63,37 @@ covariance.
.. topic:: Examples:

* See :ref:`example_mixture_plot_gmm_covariances.py` for an example of
-using a GMM for clustering on the iris dataset.
+using the Gaussian mixture for clustering on the iris dataset.

* See :ref:`example_mixture_plot_gmm_pdf.py` for an example on plotting the
density estimation.

-Pros and cons of class :class:`GMM`: expectation-maximization inference
-------------------------------------------------------------------------
+Pros and cons of class :class:`GaussianMixture`
+-----------------------------------------------

Pros
.....

-:Speed: it is the fastest algorithm for learning mixture models
+:Speed: It is the fastest algorithm for learning mixture models

-:Agnostic: as this algorithm maximizes only the likelihood, it
+:Agnostic: As this algorithm maximizes only the likelihood, it
will not bias the means towards zero, or bias the cluster sizes to
have specific structures that might or might not apply.

Cons
....

-:Singularities: when one has insufficiently many points per
+:Singularities: When one has insufficiently many points per
mixture, estimating the covariance matrices becomes difficult,
and the algorithm is known to diverge and find solutions with
infinite likelihood unless one regularizes the covariances artificially.

-:Number of components: this algorithm will always use all the
+:Number of components: This algorithm will always use all the
components it has access to, needing held-out data
or information theoretical criteria to decide how many components to use
in the absence of external cues.

Selecting the number of components in a classical GMM
------------------------------------------------------

The BIC criterion can be used to select the number of components in a GMM
@@ -110,7 +110,7 @@ number of components for a Gaussian mixture model.
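
A hedged sketch of BIC-based selection with the new estimator (synthetic two-cluster data; the loop bounds are arbitrary)::

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + [4, 4]])

    bic = {}
    for k in range(1, 7):
        bic[k] = GaussianMixture(n_components=k,
                                 covariance_type='full').fit(X).bic(X)
    best_k = min(bic, key=bic.get)   # lower BIC is better
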
.. topic:: Examples:

* See :ref:`example_mixture_plot_gmm_selection.py` for an example
-of model selection performed with classical GMM.
+of model selection performed with a classical Gaussian mixture.

.. _expectation_maximization:

@@ -131,18 +131,15 @@ origin) and computes for each point a probability of being generated by
each component of the model. Then, one tweaks the
parameters to maximize the likelihood of the data given those
assignments. Repeating this process is guaranteed to always converge
to a local optimum.
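
The E-step/M-step alternation described above fits in a few lines of NumPy; this is a didactic sketch of one iteration (my own helper, not the scikit-learn implementation, and with no covariance regularization)::

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, weights, means, covs):
        """One EM iteration for a Gaussian mixture (didactic, no regularization)."""
        K = len(weights)
        # E-step: responsibilities r[i, k] proportional to weights[k] * N(X_i | means[k], covs[k])
        r = np.column_stack([weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
                             for k in range(K)])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the weighted points
        Nk = r.sum(axis=0)
        weights = Nk / X.shape[0]
        means = np.array([np.dot(r[:, k], X) / Nk[k] for k in range(K)])
        covs = np.array([np.dot((r[:, k, None] * (X - means[k])).T, X - means[k]) / Nk[k]
                         for k in range(K)])
        return weights, means, covs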

.. _vbgmm:

VBGMM: variational Gaussian mixtures
====================================

The :class:`VBGMM` object implements a variant of the Gaussian mixture
-model with :ref:`variational inference <variational_inference>` algorithms. The API is identical to
-:class:`GMM`. It is essentially a middle-ground between :class:`GMM`
-and :class:`DPGMM`, as it has some of the properties of the Dirichlet
-process.
+model with :ref:`variational inference <variational_inference>` algorithms.

Pros and cons of class :class:`VBGMM`: variational inference
------------------------------------------------------------
@@ -205,7 +202,7 @@ DPGMM: Infinite Gaussian mixtures

The :class:`DPGMM` object implements a variant of the Gaussian mixture
model with a variable (but bounded) number of components using the
-Dirichlet Process. The API is identical to :class:`GMM`.
+Dirichlet Process.
This class doesn't require the user to choose the number of
components, and at the expense of extra computational time the user
only needs to specify a loose upper bound on this number and a
@@ -228,17 +225,18 @@ components on a dataset composed of 2 clusters. We can see that the DPGMM is
able to limit itself to only 2 components whereas the GMM fits the data with too
many components. Note that with very few observations, the DPGMM can take a
conservative stand, and fit only one component. **On the right** we are fitting
-a dataset not well-depicted by a mixture of Gaussian. Adjusting the `alpha`
+a dataset not well-depicted by a Gaussian mixture. Adjusting the `alpha`
parameter of the DPGMM controls the number of components used to fit this
data.
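
A short sketch with the existing :class:`DPGMM` estimator, assuming its ``n_components``/``alpha``/``covariance_type`` constructor arguments (values here are illustrative only)::

    import numpy as np
    from sklearn.mixture import DPGMM

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(150, 2), rng.randn(150, 2) + [6, 0]])

    # A loose upper bound of 10 components; the Dirichlet process prior
    # (controlled by alpha) decides how many are effectively used.
    dpgmm = DPGMM(n_components=10, alpha=1.0, covariance_type='diag').fit(X)
    used_components = np.unique(dpgmm.predict(X))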

.. topic:: Examples:

* See :ref:`example_mixture_plot_gmm.py` for an example on plotting the
-confidence ellipsoids for both :class:`GMM` and :class:`DPGMM`.
+confidence ellipsoids for both :class:`GaussianMixture`
+and :class:`DPGMM`.

-* :ref:`example_mixture_plot_gmm_sin.py` shows using :class:`GMM` and
-:class:`DPGMM` to fit a sine wave
+* :ref:`example_mixture_plot_gmm_sin.py` shows using
+:class:`GaussianMixture` and :class:`DPGMM` to fit a sine wave

Pros and cons of class :class:`DPGMM`: Dirichlet process mixture model
----------------------------------------------------------------------