@@ -702,19 +702,20 @@ of a high-dimensional distribution by constructing a supporting hyperplane
in the feature space corresponding to the kernel, which effectively
separates the data set from the origin with maximum margin.
- The One-Class SVM solves the following primal problem:
+ For the training sample :math:`(x_i)_{i=1}^l` with weights :math:`(w_i)_{i=1}^l`,
+ :math:`\sum_{i=1}^l w_i > 0`, the One-Class SVM solves the following primal problem:

.. math::

- \min_{\rho,\xi,w} \frac12 w^Tw - \rho + \frac{1}{\nu l} \sum_{i=1}^l \xi_i\,,\\
+ \min_{\rho,\xi,w} \frac12 w^Tw - \rho + \frac{1}{\nu W} \sum_{i=1}^l w_i \xi_i\,,\\

- \textrm{subject to } & w^T\phi(x_i) \geq \rho - \xi_i\,,\\
- & \xi_i \geq 0, i=1, \ldots, l\,,
+ \textrm{subject to } & w^T\phi(x_i) \geq \rho - \xi_i\,,\\
+ & \xi_i \geq 0\,,\, i=1, \ldots, l\,,

where :math:`\phi(\cdot)` is the feature map associated with the
- kernel :math:`K(\cdot,\cdot)`.
+ kernel :math:`K(\cdot,\cdot)`, and :math:`W = \sum_{i=1}^l w_i`.
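As a quick illustration (not part of this commit), the weighted primal problem above can be exercised through the public ``OneClassSVM`` estimator; the sketch below assumes that ``sample_weight`` passed to ``fit`` plays the role of the :math:`w_i`::

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.RandomState(0)
    X = rng.randn(100, 2)                    # training sample (x_i), i = 1..l
    w = rng.uniform(0.5, 2.0, size=100)      # non-negative sample weights w_i

    # nu in (0, 1] controls the trade-off between the margin term and the
    # weighted slack penalties in the primal objective above.
    clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1)
    clf.fit(X, sample_weight=w)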
The dual problem is
@@ -723,25 +724,25 @@ The dual problem is
\min_\alpha \frac12 \alpha^T Q\alpha\,\\

- \textrm{subject to } & 0 \leq \alpha_i \leq \frac{1}{\nu l}, i=1, \ldots, l\,,\\
- & e^T\alpha = 1\,,
+ \textrm{subject to } & 0 \leq \alpha_i \leq w_i\,,\, i=1, \ldots, l\,,\\
+ & e^T\alpha = \nu W\,,

where :math:`e \in \mathbb{R}^{l\times 1}` is the vector of ones and
:math:`Q_{ij} = K(x_i, x_j)` is the kernel Gram matrix.
- The decision function is given by:
+ The optimal decision function is given by:
- .. math:: \operatorname{sgn}(\sum_{i=1}^l \alpha_i K(x_i, x) - \rho)\,,
+ .. math:: x \mapsto \operatorname{sgn}(\sum_{i=1}^l \alpha_i K(x_i, x) - \rho)\,,

where :math:`+1` indicates an inlier and :math:`-1` an outlier.
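A minimal sketch (not part of this commit) of how this decision function can be reconstructed from the fitted estimator's attributes; it assumes the usual scikit-learn convention that ``dual_coef_`` holds the (scaled) :math:`\alpha_i` of the support vectors and ``intercept_`` holds :math:`-\rho`::

    import numpy as np
    from sklearn.svm import OneClassSVM
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)
    gamma = 0.5

    clf = OneClassSVM(kernel="rbf", gamma=gamma, nu=0.2).fit(X)

    X_new = rng.randn(5, 2)
    K = rbf_kernel(X_new, clf.support_vectors_, gamma=gamma)
    # sum_i alpha_i K(x_i, x) - rho, restricted to the support vectors
    manual = K @ clf.dual_coef_.ravel() + clf.intercept_

    print(np.allclose(manual, clf.decision_function(X_new)))  # expected: True
    print(np.sign(manual))  # +1 for inliers, -1 for outliers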
The parameter :math:`\nu \in (0, 1]` determines the fraction of outliers
in the training dataset (an empirical check follows the list below).
More technically, :math:`\nu` is:
- * an upper bound on the fraction training points outside
- the estimated region;
+ * an upper bound on the fraction of the training points lying outside
+ the estimated region;
- * a lower bound on the fraction of support vectors.
+ * a lower bound on the fraction of support vectors.
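These two bounds are easy to check empirically; the following sketch (not part of this commit) fits ``OneClassSVM`` for several values of :math:`\nu` and reports the observed fractions::

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 2)

    for nu in (0.05, 0.1, 0.25, 0.5):
        clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)
        frac_out = np.mean(clf.predict(X) == -1)   # should not exceed nu (up to numerics)
        frac_sv = len(clf.support_) / len(X)       # should be at least nu
        print(f"nu={nu:.2f}  outliers={frac_out:.3f}  support vectors={frac_sv:.3f}")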
.. topic:: References:
@@ -755,52 +756,51 @@ in the training dataset. More technically :math:`\nu` is:
SVDD
----
- Support Vector Data Description (SVDD-L1) for Unsupervised Outlier
- Detection.
-
- This model, proposed by Tax and Duin (2004), aims at finding spherically
- shaped boundary around a data set. The original formulation suffered from
- non-convexity issues related to optimality of the attained solution for
- certain values of regularization parameter `C > 0`. Chang, Lee, and Lin
- (2013) suggested a reformulation of the SVDD was proposed which had provably
- unique global solution for any `C`.
-
- Scikit's SVDD model is a modified version of the 2013 SVDD formulation,
- cf. problem (7) of Chang et al. (2013). The major change is that observations
- can have different penalty costs :math:`(C_i)_{i=1}^l` instead of a single
- cost :math:`C`. The theorems 2-4 of Chang et al. (2013) can be extended to
- the case of vector of penalty costs :math:`C` with :math:`\sum_{i=1}^l C_i > 1`
- being the condition, distinguishing the case of :math:`R>0` (theorem 4 case 1)
- from :math:`R=0` (theorem 4 case 2).
-
- The second change is concerned with the model parametrization. In the original
- and 2013 formulations the parameter `C` was difficult to interpret directly:
- its reciprocal determines the approximate number of support vectors of the
- hypersphere. To improve interpretability and provide unified interface to the
- One-Class SVM model, the Scikit's SVDD is parametrized with :math:`\nu\in (0, 1]`,
- which determines the fraction of outliers in the training sample. The
- reparametrization in question is:
-
- .. math:: C_i = \frac{w_i}{\nu \sum_{i=1}^l w_i} \,,
-
- where :math:`(w_i)_{i=1}^l` are non-negative sample weights. Note that in a
- typical run of the SVDD model the weights are set to :math:`w_i = 1`, which
- is equivalent to the original 2013 SVDD problem for :math:`C = \frac{1}{\nu l}`.
-
- In the case of a stationary kernel :math:`K(x,y)=K(x-y)` the SVDD model
- and the One-Class SVM models are provably equivalent
- (see :ref:`outlier_detection_ocsvm_vs_svdd`).
+
+ Support Vector Data Description (SVDD), proposed by Tax and Duin (2004),
+ aims at finding a spherically shaped boundary around a data set. The original
+ formulation suffered from non-convexity issues related to optimality of
+ the attained solution for certain values of the regularization parameter :math:`C`.
+ Chang, Lee, and Lin (2013) suggested a reformulation of the SVDD model
+ which had a well-defined and provably unique global solution for any :math:`C>0`.
+
+ Scikit's SVDD model is a modified version of the 2013 SVDD formulation.
+ The following changes were made to problem (7) in Chang et al. (2013):
+
+ * **sample weights**: instead of a uniform penalty :math:`C>0`, sample
+ observations are allowed to have different costs :math:`(C_i)_{i=1}^l`,
+ :math:`\sum_{i=1}^l C_i > 0`;
+
+ * :math:`\nu`-**parametrization**: the penalties are determined by
+ :math:`C_i = \frac{w_i}{\nu \sum_{i=1}^l w_i}`, where :math:`\nu\in (0, 1]`
+ and :math:`(w_i)_{i=1}^l` are non-negative sample weights.
+
+ A straightforward extension of theorems 2-4 of Chang et al. (2013) to the case
+ of non-uniform penalties yields :math:`\sum_{i=1}^l C_i > 1`, or equivalently
+ :math:`\nu < 1`, as the condition distinguishing the case of :math:`R>0`
+ (theorem 4, case 1) from :math:`R=0` (theorem 4, case 2).
+
+ The main benefit of the :math:`\nu`-parametrization is a clearer interpretation
+ and a unified interface with the :ref:`svm_one_class_svm` model. Under the
+ :math:`C`-parametrization the value :math:`\frac{1}{C}` determines the
+ approximate average number of support vectors of the hypersphere, whereas
+ under the :math:`\nu`-parametrization the parameter determines the fraction
+ of outliers in the training sample (similarly to the :ref:`svm_one_class_svm`).
+
+ Note that in a typical run of the SVDD model the weights are set to :math:`w_i = 1`,
+ which is equivalent to the original 2013 SVDD formulation for :math:`C = \frac{1}{\nu l}`.
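A small sketch (not part of this commit) of the reparametrization described above; ``svdd_penalties`` is a hypothetical helper used only for illustration::

    import numpy as np

    def svdd_penalties(sample_weight, nu):
        """Per-observation penalties C_i = w_i / (nu * sum_j w_j)."""
        w = np.asarray(sample_weight, dtype=float)
        return w / (nu * w.sum())

    w = np.ones(50)                              # the typical unweighted case
    C = svdd_penalties(w, nu=0.1)
    print(np.allclose(C, 1.0 / (0.1 * 50)))      # True: equivalent to C = 1 / (nu * l)
    print(C.sum() > 1)                           # True whenever nu < 1 (the R > 0 case)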
The primal problem of Scikit's version of SVDD for the training sample
- :math:`(x_i)_{i=1}^l` with weights :math:`(w_i)_{i=1}^l` is:
+ :math:`(x_i)_{i=1}^l` with weights :math:`(w_i)_{i=1}^l`,
+ :math:`\sum_{i=1}^l w_i > 0`, is:

.. math::

\min_{R,\xi,a} R + \frac{1}{\nu W} \sum_{i=1}^l w_i \xi_i\,,\\

\textrm{subject to } & \|\phi(x_i) - a\|^2 \leq R + \xi_i\,,\\
- & \xi_i \geq 0, i=1, \ldots, l\,,\\
+ & \xi_i \geq 0\,,\, i=1, \ldots, l\,,\\
& R \geq 0\,,
@@ -816,29 +816,38 @@ reduces to an unconstrained convex optimization problem independent of
Note that in this case every observation is an outlier.
In the case when :math:`\nu < 1` the constraint :math:`R\geq 0` is redundant,
- and the dual problem has the form:
+ strong duality holds, and the dual problem has the form:
.. math::

\min_\alpha \frac12 \alpha^T Q\alpha - \frac{\nu W}{2} \sum_{i=1}^l \alpha_i Q_{ii}\,,\\
- \textrm{subject to } & 0 \leq \alpha_i \leq w_i, \ldots, l\,,\\
+ \textrm{subject to } & 0 \leq \alpha_i \leq w_i\,,\, i=1, \ldots, l\,,\\
& e^T \alpha = \nu W\,,
where :math:`e \in \mathbb{R}^{l\times 1}` is the vector of ones and
:math:`Q_{ij} = K(x_i, x_j)` is the kernel Gram matrix.
- The decision function of SVDD is given by:
+ The decision function of the SVDD is given by:
- .. math:: \operatorname{sgn}(R - \|\phi(x) - a\|^2)\,,
+ .. math:: x \mapsto \operatorname{sgn}(R - \|\phi(x) - a\|^2)\,,
where :math:`+1` indicates an inlier and :math:`-1` an outlier. The
distances in the feature space and :math:`R` are computed implicitly through
the coefficients and the optimal value of the objective of the corresponding
dual problem.
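A usage sketch (not part of this commit), assuming the estimator introduced by this change is exposed as ``sklearn.svm.SVDD`` with an interface mirroring ``OneClassSVM``; the class name and parameters are an assumption, not a confirmed API::

    import numpy as np
    from sklearn.svm import SVDD  # assumed name of the estimator added here

    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)

    clf = SVDD(kernel="rbf", gamma=0.5, nu=0.1).fit(X)

    scores = clf.decision_function(X)  # implicitly R - ||phi(x) - a||^2
    labels = clf.predict(X)            # +1 for inliers, -1 for outliers
    print(np.mean(labels == -1))       # roughly a fraction nu of outliers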
+ It is worth noting that in the case of a stationary kernel :math:`K(x,y)=K(x-y)`
+ the SVDD and One-Class SVM models are provably equivalent. Indeed, the values
+ :math:`Q_{ii} = K(x_i, x_i)` in the last term in the dual of the SVDD are all
+ equal to :math:`K(0)`, which makes the whole term independent of :math:`\alpha`.
+ Therefore the objective functions of the dual problems of the One-Class SVM
+ and the SVDD are equivalent up to a constant. This, however, **does not imply**
+ that one model generalizes the other: their solutions just happen to coincide
+ for a particular family of kernels (see :ref:`outlier_detection_ocsvm_vs_svdd`).
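The claimed coincidence is easy to probe numerically; the sketch below (not part of this commit) again assumes a hypothetical ``sklearn.svm.SVDD`` mirroring ``OneClassSVM``, and compares the labels produced by the two models under an RBF (stationary) kernel::

    import numpy as np
    from sklearn.svm import OneClassSVM, SVDD  # SVDD name assumed

    rng = np.random.RandomState(0)
    X = rng.randn(300, 2)
    params = dict(kernel="rbf", gamma=0.5, nu=0.1)

    ocsvm = OneClassSVM(**params).fit(X)
    svdd = SVDD(**params).fit(X)

    X_new = rng.randn(50, 2)
    agree = np.mean(ocsvm.predict(X_new) == svdd.predict(X_new))
    print(agree)  # expected: 1.0 for a stationary kernel such as RBF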
+
.. topic:: References:
* `Support vector data description