[MRG+1] Integration of GSoC2015 Gaussian Mixture (first step) by tguillemot · Pull Request #6666 · scikit-learn/scikit-learn · GitHub
[MRG+1] Integration of GSoC2015 Gaussian Mixture (first step) #6666

Merged

merged 2 commits on Apr 21, 2016

Changes from all commits

64 changes: 32 additions & 32 deletions doc/modules/dp-derivation.rst
@@ -13,7 +13,7 @@ as covariance matrices.
The inference algorithm is the one from the following paper:

* `Variational Inference for Dirichlet Process Mixtures
  <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.4467&rep=rep1&type=pdf>`_
David Blei, Michael Jordan. Bayesian Analysis, 2006

While this paper presents the parts of the inference algorithm that
@@ -36,7 +36,7 @@ is necessary to invert the covariance/precision matrices and compute
its determinant, hence the cubic term).

This implementation is expected to scale at least as well as EM for
-the mixture of Gaussians.
+the Gaussian mixture.
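
A rough way to see where the cubic term comes from is to time the per-component linear algebra on a single D x D covariance matrix. This is only an illustrative sketch (the matrices and sizes below are made up)::

    import time
    import numpy as np

    rng = np.random.RandomState(0)
    for D in (10, 50, 100, 200):
        A = rng.randn(D, D)
        cov = np.dot(A, A.T) + D * np.eye(D)   # a well-conditioned D x D covariance
        start = time.time()
        np.linalg.inv(cov)                     # O(D^3) inversion
        np.linalg.slogdet(cov)                 # O(D^3) log-determinant
        print(D, time.time() - start)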

Update rules for VB inference
==============================
@@ -78,7 +78,7 @@ The variational distribution we'll use is
\sigma_k &\sim& Gamma(a_{k}, b_{k}) \\
z_{i} &\sim& Discrete(\nu_{z_i}) \\
\end{array}
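
To make the factorized family above concrete, here is a hedged NumPy/SciPy sketch that draws one sample from each factor; the shapes and parameter values are placeholders, not the values computed by the algorithm, and q(mu_k) is read as a unit-covariance Gaussian as the entropy term below suggests::

    import numpy as np
    from scipy import stats

    K, D = 5, 2                        # truncation level and dimension (placeholders)
    rng = np.random.RandomState(0)

    # q(phi_k) = Beta(gamma_{k,1}, gamma_{k,2})
    gamma1, gamma2 = np.ones(K), np.full(K, 2.0)
    phi = stats.beta(gamma1, gamma2).rvs(random_state=rng)

    # q(mu_k) = Normal(nu_{mu_k}, I)
    nu_mu = rng.randn(K, D)
    mu = nu_mu + rng.randn(K, D)

    # q(sigma_k) = Gamma(a_k, b_k), with b_k an inverse scale (rate)
    a, b = np.full(K, 2.0), np.ones(K)
    sigma = stats.gamma(a, scale=1.0 / b).rvs(random_state=rng)

    # q(z_i) = Discrete(nu_{z_i})
    nu_z = np.full(K, 1.0 / K)
    z = rng.choice(K, p=nu_z)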


The bound
...........
@@ -88,7 +88,7 @@ The variational bound is
.. math::

\begin{array}{rcl}
\log P(X) &\ge&
\sum_k (E_q[\log P(\phi_k)] - E_q[\log Q(\phi_k)]) \\
&&
+\sum_k \left( E_q[\log P(\mu_k)] - E_q[\log Q(\mu_k)] \right) \\
@@ -99,28 +99,28 @@ The variational bound is
&&
+\sum_i E_q[\log P(X_t)]
\end{array}


**The bound for** :math:`\phi_k`

.. math::

\begin{array}{rcl}
E_q[\log Beta(1,\alpha)] - E[\log Beta(\gamma_{k,1},\gamma_{k,2})]
&=&
\log \Gamma(1+\alpha) - \log \Gamma(\alpha) \\ &&
+(\alpha-1)(\Psi(\gamma_{k,2})-\Psi(\gamma_{k,1}+\gamma_{k,2})) \\ &&
- \log \Gamma(\gamma_{k,1}+\gamma_{k,2}) + \log \Gamma(\gamma_{k,1}) +
\log \Gamma(\gamma_{k,2}) \\ &&
-
(\gamma_{k,1}-1)(\Psi(\gamma_{k,1})-\Psi(\gamma_{k,1}+\gamma_{k,2}))
\\ &&
-
(\gamma_{k,2}-1)(\Psi(\gamma_{k,2})-\Psi(\gamma_{k,1}+\gamma_{k,2}))
\end{array}
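
The expression above transcribes almost line by line into SciPy; a small helper (the function name is mine) for a single component::

    from scipy.special import digamma, gammaln

    def phi_bound_term(alpha, gamma1, gamma2):
        """E_q[log P(phi_k)] - E_q[log Q(phi_k)] for one stick-breaking weight."""
        psi_sum = digamma(gamma1 + gamma2)
        return (gammaln(1 + alpha) - gammaln(alpha)
                + (alpha - 1) * (digamma(gamma2) - psi_sum)
                - gammaln(gamma1 + gamma2) + gammaln(gamma1) + gammaln(gamma2)
                - (gamma1 - 1) * (digamma(gamma1) - psi_sum)
                - (gamma2 - 1) * (digamma(gamma2) - psi_sum))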


**The bound for** :math:`\mu_k`


.. math::

@@ -131,11 +131,11 @@ The variational bound is
- \int\!d\mu_f q(\mu_f) \log Q(\mu_f) \\
&=&
- \frac{D}{2}\log 2\pi - \frac{1}{2} ||\nu_{\mu_k}||^2 - \frac{D}{2}
+ \frac{D}{2} \log 2\pi e
\end{array}
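
For completeness, the same term for :math:`\mu_k` as a direct transcription (again a helper of my own, assuming unit covariances as in the display)::

    import numpy as np

    def mu_bound_term(nu_mu_k):
        """E_q[log P(mu_k)] - E_q[log Q(mu_k)] for one component, unit covariances."""
        D = nu_mu_k.shape[0]
        return (-0.5 * D * np.log(2 * np.pi)
                - 0.5 * np.dot(nu_mu_k, nu_mu_k)
                - 0.5 * D
                + 0.5 * D * np.log(2 * np.pi * np.e))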


**The bound for** :math:`\sigma_k`

Here I'll use the inverse scale parametrization of the gamma
distribution.
@@ -155,21 +155,21 @@ distribution.
\begin{array}{rcl}
&& E_q[\log P(z)] - E_q[\log Q(z)] \\
&=&
\sum_{k} \left(
\left(\sum_{j=k+1}^K \nu_{z_{i,j}}\right)(\Psi(\gamma_{k,2})-\Psi(\gamma_{k,1}+\gamma_{k,2}))
+ \nu_{z_{i,k}}(\Psi(\gamma_{k,1})-\Psi(\gamma_{k,1}+\gamma_{k,2}))
- \log \nu_{z_{i,k}} \right)
\end{array}


**The bound for** :math:`X`

Recall that there is no need for a :math:`Q(X)` so this bound is just

.. math::

\begin{array}{rcl}
E_q[\log P(X_i)] &=& \sum_k \nu_{z_k} \left( - \frac{D}{2}\log 2\pi
+\frac{D}{2} (\Psi(a_k) - \log(b_k))
-\frac{a_k}{2b_k} (||X_i - \nu_{\mu_k}||^2+D) - \log 2 \pi e \right)
\end{array}
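
A transcription of this per-sample term, vectorized over components (the helper name and array shapes are my own conventions: ``nu_mu`` is (K, D); ``a``, ``b`` and ``nu_z`` are length K), following the display above verbatim::

    import numpy as np
    from scipy.special import digamma

    def x_bound_term(X_i, nu_z, nu_mu, a, b):
        """E_q[log P(X_i)], transcribed from the display above (spherical case)."""
        D = X_i.shape[0]
        sq_norm = np.sum((X_i - nu_mu) ** 2, axis=1)     # ||X_i - nu_{mu_k}||^2 for each k
        per_component = (-0.5 * D * np.log(2 * np.pi)
                         + 0.5 * D * (digamma(a) - np.log(b))
                         - a / (2.0 * b) * (sq_norm + D)
                         - np.log(2 * np.pi * np.e))
        return np.sum(nu_z * per_component)
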
@@ -186,7 +186,7 @@ The updates

\begin{array}{rcl}
\gamma_{k,1} &=& 1+\sum_i \nu_{z_{i,k}} \\
\gamma_{k,2} &=& \alpha + \sum_i \sum_{j > k} \nu_{z_{i,j}}.
\end{array}
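
In NumPy, with the responsibilities stored as an (n_samples, K) array, these two updates are one reduction each (a sketch with my own naming)::

    import numpy as np

    def update_gamma(resp, alpha):
        """Stick-breaking updates; resp[i, k] holds nu_{z_{i,k}}."""
        Nk = resp.sum(axis=0)                   # sum_i nu_{z_{i,k}}
        K = Nk.shape[0]
        gamma1 = 1.0 + Nk
        gamma2 = alpha + np.array([Nk[k + 1:].sum() for k in range(K)])
        return gamma1, gamma2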


@@ -204,7 +204,7 @@ The gradient is

so the update is

.. math::

\nu_{\mu_k} = \frac{\sum_i \frac{\nu_{z_{i,k}}b_k}{a_k}X_i}{1+\sum_i \frac{\nu_{z_{i,k}}b_k}{a_k}}

@@ -299,9 +299,9 @@ have
.. math::

\begin{array}{rcl}
E_q[\log P(X_i)] &=& \sum_k \nu_{z_k} \Big( - \frac{D}{2}\log 2\pi
+\frac{1}{2}\sum_d (\Psi(a_{k,d}) - \log(b_{k,d})) \\
&&
-\frac{1}{2}((X_i - \nu_{\mu_k})^T\bm{\frac{a_k}{b_k}}(X_i - \nu_{\mu_k})+ \sum_d \sigma_{k,d})- \log 2 \pi e \Big)
\end{array}

@@ -315,7 +315,7 @@ The updates only change for :math:`\mu` (to weight them with the new

**The update for** :math:`\mu`

.. math::

\nu_{\mu_k} = \left(\mathbf{I}+\sum_i \frac{\nu_{z_{i,k}}\mathbf{b_k}}{\mathbf{a_k}}\right)^{-1}\left(\sum_i \frac{\nu_{z_{i,k}}b_k}{a_k}X_i\right)

@@ -328,13 +328,13 @@ of the bound:

.. math::

\log Q(\sigma_{k,d}) = -\sigma_{k,d} + \sum_i \nu_{z_{i,k}}\frac{1}{2}\log \sigma_{k,d}
- \frac{\sigma_{k,d}}{2}\sum_i \nu_{z_{i,k}} ((X_{i,d}-\mu_{k,d})^2 + 1)


Hence

.. math::

a_{k,d} = 1 + \frac{1}{2} \sum_i \nu_{z_{i,k}}

@@ -381,7 +381,7 @@ There are two changes in the lower-bound: for :math:`\Sigma` and for :math:`X`.
\begin{array}{rcl}
\frac{D^2}{2}\log 2 + \sum_d \log \Gamma(\frac{D+1-d}{2}) \\
- \frac{aD}{2}\log 2 + \frac{a}{2} \log |\mathbf{B}| + \sum_d \log \Gamma(\frac{a+1-d}{2}) \\
+ \frac{a-D}{2}\left(\sum_d \Psi\left(\frac{a+1-d}{2}\right)
+ D \log 2 + \log |\mathbf{B}|\right) \\
+ \frac{1}{2} a \mathbf{tr}[\mathbf{B}-\mathbf{I}]
\end{array}
@@ -392,10 +392,10 @@
.. math::

\begin{array}{rcl}
E_q[\log P(X_i)] &=& \sum_k \nu_{z_k} \Big( - \frac{D}{2}\log 2\pi
+\frac{1}{2}\left(\sum_d \Psi\left(\frac{a+1-d}{2}\right)
+ D \log 2 + \log |\mathbf{B}|\right) \\
&&
-\frac{1}{2}((X_i - \nu_{\mu_k})a\mathbf{B}(X_i - \nu_{\mu_k})+ a\mathbf{tr}(\mathbf{B}))- \log 2 \pi e \Big)
\end{array}

@@ -482,9 +482,9 @@ The updates
..............

All that changes in the updates is that the update for mu uses only
the proper sigma and the updates for a and B don't have a sum over K, so

.. math::

\nu_{\mu_k} = \left(\mathbf{I}+ a_k\mathbf{B_k}\sum_i \nu_{z_{i,k}}\right)^{-1}
\left(a_k\mathbf{B_k}\sum_i \nu_{z_{i,k}} X_i\right)
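
A sketch of this update with NumPy, one linear solve per component (``resp`` is the (n_samples, K) responsibility matrix, ``B`` a (K, D, D) array; the helper is mine)::

    import numpy as np

    def update_mu_full(resp, X, a, B):
        """mu update for the full-covariance case; resp[i, k] holds nu_{z_{i,k}}."""
        n_samples, D = X.shape
        K = resp.shape[1]
        nu_mu = np.empty((K, D))
        for k in range(K):
            Nk = resp[:, k].sum()                     # sum_i nu_{z_{i,k}}
            Sk = np.dot(resp[:, k], X)                # sum_i nu_{z_{i,k}} X_i
            lhs = np.eye(D) + a[k] * B[k] * Nk        # I + a_k B_k sum_i nu_{z_{i,k}}
            nu_mu[k] = np.linalg.solve(lhs, a[k] * np.dot(B[k], Sk))
        return nu_mu
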
62 changes: 30 additions & 32 deletions doc/modules/mixture.rst
@@ -2,9 +2,9 @@

.. _gmm:

-===================================================
+=======================
Gaussian mixture models
-===================================================
+=======================

.. currentmodule:: sklearn.mixture

@@ -19,7 +19,7 @@ components are also provided.
:align: center
:scale: 50%

**Two-component Gaussian mixture model:** *data points, and equi-probability surfaces of
the model.*

A Gaussian mixture model is a probabilistic model that assumes all the
@@ -33,25 +33,25 @@ Scikit-learn implements different classes to estimate Gaussian
mixture models that correspond to different estimation strategies,
detailed below.

-Gaussian Mixture Model
-======================
+Gaussian Mixture
+================

-The :class:`GMM` object implements the
+The :class:`GaussianMixture` object implements the
:ref:`expectation-maximization <expectation_maximization>` (EM)
algorithm for fitting mixture-of-Gaussian models. It can also draw
confidence ellipsoids for multivariate models, and compute the
Bayesian Information Criterion to assess the number of clusters in the
-data. A :meth:`GMM.fit` method is provided that learns a Gaussian
+data. A :meth:`GaussianMixture.fit` method is provided that learns a Gaussian
Mixture Model from train data. Given test data, it can assign to each
sample the Gaussian it most probably belongs to using
-the :meth:`GMM.predict` method.
+the :meth:`GaussianMixture.predict` method.

..
Alternatively, the probability of each
sample belonging to the various Gaussians may be retrieved using the
-:meth:`GMM.predict_proba` method.
+:meth:`GaussianMixture.predict_proba` method.

-The :class:`GMM` comes with different options to constrain the covariance
+The :class:`GaussianMixture` comes with different options to constrain the covariance
of the different classes estimated: spherical, diagonal, tied or full
covariance.
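
A minimal usage sketch of the estimator described above, assuming the new class is importable as ``sklearn.mixture.GaussianMixture`` (synthetic data, parameters chosen only for illustration)::

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(100, 2),
                   rng.randn(100, 2) + [5, 5]])

    gmm = GaussianMixture(n_components=2, covariance_type='full').fit(X)
    labels = gmm.predict(X)          # hard assignment to the most probable Gaussian
    proba = gmm.predict_proba(X)     # per-sample responsibilities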

@@ -63,37 +63,37 @@ covariance.
.. topic:: Examples:

* See :ref:`example_mixture_plot_gmm_covariances.py` for an example of
-using a GMM for clustering on the iris dataset.
+using the Gaussian mixture for clustering on the iris dataset.

* See :ref:`example_mixture_plot_gmm_pdf.py` for an example on plotting the
density estimation.

-Pros and cons of class :class:`GMM`: expectation-maximization inference
-------------------------------------------------------------------------
+Pros and cons of class :class:`GaussianMixture`
+-----------------------------------------------

Pros
.....

-:Speed: it is the fastest algorithm for learning mixture models
+:Speed: It is the fastest algorithm for learning mixture models

-:Agnostic: as this algorithm maximizes only the likelihood, it
+:Agnostic: As this algorithm maximizes only the likelihood, it
will not bias the means towards zero, or bias the cluster sizes to
have specific structures that might or might not apply.

Cons
....

-:Singularities: when one has insufficiently many points per
+:Singularities: When one has insufficiently many points per
mixture, estimating the covariance matrices becomes difficult,
and the algorithm is known to diverge and find solutions with
infinite likelihood unless one regularizes the covariances artificially.

-:Number of components: this algorithm will always use all the
+:Number of components: This algorithm will always use all the
components it has access to, needing held-out data
or information theoretical criteria to decide how many components to use
in the absence of external cues.

Selecting the number of components in a classical GMM
------------------------------------------------------

The BIC criterion can be used to select the number of components in a GMM
@@ -110,7 +110,7 @@ number of components for a Gaussian mixture model.
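
A hedged sketch of BIC-based selection with the new estimator (synthetic two-cluster data; the loop bounds are arbitrary)::

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + [4, 4]])

    bic = {}
    for k in range(1, 7):
        bic[k] = GaussianMixture(n_components=k,
                                 covariance_type='full').fit(X).bic(X)
    best_k = min(bic, key=bic.get)   # lower BIC is better
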
.. topic:: Examples:

* See :ref:`example_mixture_plot_gmm_selection.py` for an example
-of model selection performed with classical GMM.
+of model selection performed with a classical Gaussian mixture.

.. _expectation_maximization:

@@ -131,18 +131,15 @@ origin) and computes for each point a probability of being generated by
each component of the model. Then, one tweaks the
parameters to maximize the likelihood of the data given those
assignments. Repeating this process is guaranteed to always converge
to a local optimum.
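
The E-step/M-step alternation described above fits in a few lines of NumPy; this is a didactic sketch of one iteration (my own helper, not the scikit-learn implementation, and with no covariance regularization)::

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, weights, means, covs):
        """One EM iteration for a Gaussian mixture (didactic, no regularization)."""
        K = len(weights)
        # E-step: responsibilities r[i, k] proportional to weights[k] * N(X_i | means[k], covs[k])
        r = np.column_stack([weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
                             for k in range(K)])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the weighted points
        Nk = r.sum(axis=0)
        weights = Nk / X.shape[0]
        means = np.array([np.dot(r[:, k], X) / Nk[k] for k in range(K)])
        covs = np.array([np.dot((r[:, k, None] * (X - means[k])).T, X - means[k]) / Nk[k]
                         for k in range(K)])
        return weights, means, covs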

.. _vbgmm:

VBGMM: variational Gaussian mixtures
====================================

The :class:`VBGMM` object implements a variant of the Gaussian mixture
-model with :ref:`variational inference <variational_inference>` algorithms. The API is identical to
-:class:`GMM`. It is essentially a middle-ground between :class:`GMM`
-and :class:`DPGMM`, as it has some of the properties of the Dirichlet
-process.
+model with :ref:`variational inference <variational_inference>` algorithms.

Pros and cons of class :class:`VBGMM`: variational inference
------------------------------------------------------------
@@ -205,7 +202,7 @@ DPGMM: Infinite Gaussian mixtures

The :class:`DPGMM` object implements a variant of the Gaussian mixture
model with a variable (but bounded) number of components using the
-Dirichlet Process. The API is identical to :class:`GMM`.
+Dirichlet Process.
This class doesn't require the user to choose the number of
components, and at the expense of extra computational time the user
only needs to specify a loose upper bound on this number and a
@@ -228,17 +225,18 @@ components on a dataset composed of 2 clusters. We can see that the DPGMM is
able to limit itself to only 2 components whereas the GMM fits the data with too
many components. Note that with very few observations, the DPGMM can take a
conservative stand, and fit only one component. **On the right** we are fitting
-a dataset not well-depicted by a mixture of Gaussian. Adjusting the `alpha`
+a dataset not well-depicted by a Gaussian mixture. Adjusting the `alpha`
parameter of the DPGMM controls the number of components used to fit this
data.
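
A short sketch with the existing :class:`DPGMM` estimator, assuming its ``n_components``/``alpha``/``covariance_type`` constructor arguments (values here are illustrative only)::

    import numpy as np
    from sklearn.mixture import DPGMM

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(150, 2), rng.randn(150, 2) + [6, 0]])

    # A loose upper bound of 10 components; the Dirichlet process prior
    # (controlled by alpha) decides how many are effectively used.
    dpgmm = DPGMM(n_components=10, alpha=1.0, covariance_type='diag').fit(X)
    used_components = np.unique(dpgmm.predict(X))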

.. topic:: Examples:

* See :ref:`example_mixture_plot_gmm.py` for an example on plotting the
-confidence ellipsoids for both :class:`GMM` and :class:`DPGMM`.
+confidence ellipsoids for both :class:`GaussianMixture`
+and :class:`DPGMM`.

-* :ref:`example_mixture_plot_gmm_sin.py` shows using :class:`GMM` and
-:class:`DPGMM` to fit a sine wave
+* :ref:`example_mixture_plot_gmm_sin.py` shows using
+:class:`GaussianMixture` and :class:`DPGMM` to fit a sine wave

Pros and cons of class :class:`DPGMM`: Dirichlet process mixture model
----------------------------------------------------------------------