DOC: Reworked documentation related to SVDD · scikit-learn/scikit-learn@b67596b

Commit b67596b

Author: Nikolay Mayorov (committed)
DOC: Reworked documentation related to SVDD
1 parent fa716ff commit b67596b

3 files changed, +106 -79 lines changed

doc/modules/outlier_detection.rst

Lines changed: 87 additions & 59 deletions
@@ -60,31 +60,18 @@ There are two SVM-based approaches for that purpose:
 2. :class:`svm.SVDD` finds a sphere with a minimum radius which encloses
    the data.

-Both methods can implicitly work in transformed high-dimensional space using
-the kernel trick, the RBF kernel is used by default. :class:`svm.OneClassSVM`
-provides :math:`\nu` parameter for controlling the trade off between the
-margin and the number of outliers during training, namely it is an upper bound
-on the fraction of outliers in a training set or probability of finding a
-new, but regular, observation outside the frontier. :clss:`svm.SVDD` provides a
-similar parameter :math:`C = 1 / (\nu l)`, where :math:`l` is the number of
-samples, such that :math:`1/C` approximately equals the number of outliers in
-a training set.
-
-.. topic:: References:
-
-    * Bernhard Schölkopf et al, `Estimating the support of a high-dimensional
-      distribution <http://dl.acm.org/citation.cfm?id=1119749>`_, Neural
-      computation 13.7 (2001): 1443-1471.
-    * David M. J. Tax and Robert P. W. Duin, `Support vector data description
-      <http://dl.acm.org/citation.cfm?id=960109>`_, Machine Learning,
-      54(1):45-66, 2004.
-
-.. topic:: Examples:
-
-    * See :ref:`example_svm_plot_oneclass.py` for visualizing the
-      frontier learned around some data by :class:`svm.OneClassSVM`.
-    * See :ref:`example_svm_plot_oneclass_vs_svdd.py` to get the idea about
-      the difference between the two approaches.
+Both methods can implicitly work in a transformed high-dimensional space using
+the kernel trick. :class:`svm.OneClassSVM` provides :math:`\nu` parameter for
+controlling the trade off between the margin and the number of outliers during
+training, namely it is an upper bound on the fraction of outliers in a training
+set or probability of finding a new, but regular, observation outside the
+frontier. :clss:`svm.SVDD` provides a similar parameter
+:math:`C = 1 / (\nu l)`, where :math:`l` is the number of samples, such that
+:math:`1/C` approximately equals the number of outliers in a training set.
+
+Both methods are equivalent if a) the kernel used depends only on the
+difference between two vectors, one example is RBF kernel, and
+b) :math:`C = 1 / (\nu l)`.

 .. figure:: ../auto_examples/svm/images/plot_oneclass_001.png
    :target: ../auto_examples/svm/plot_oneclasse.html

@@ -96,6 +83,22 @@ a training set.
    :align: center
    :scale: 75

+.. topic:: Examples:
+
+    * See :ref:`example_svm_plot_oneclass.py` for visualizing the
+      frontier learned around some data by :class:`svm.OneClassSVM`.
+    * See :ref:`example_svm_plot_oneclass_vs_svdd.py` to get the idea about
+      the difference between the two approaches.
+
+.. topic:: References:
+
+    * Bernhard Schölkopf et al, `Estimating the Support of a High-Dimensional
+      Distribution <http://dl.acm.org/citation.cfm?id=1119749>`_, Neural
+      computation 13.7 (2001): 1443-1471.
+    * David M. J. Tax and Robert P. W. Duin, `Support Vector Data Description
+      <http://dl.acm.org/citation.cfm?id=960109>`_, Machine Learning,
+      54(1):45-66, 2004.
+

 Outlier Detection
 =================
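The :math:`\nu` / :math:`C` relation spelled out in the first hunk above can be
checked directly. A minimal sketch (editor's illustration, not part of the diff),
assuming the ``svm.SVDD`` class from this branch (it is not available in released
scikit-learn) and an RBF kernel; with ``C = 1 / (nu * n_samples)`` the two
estimators should give essentially the same predictions::

    import numpy as np
    from sklearn import svm

    rng = np.random.RandomState(42)
    X = 0.3 * rng.randn(100, 2)      # 100 regular training samples

    nu = 0.1                         # upper bound on the fraction of outliers
    C = 1.0 / (nu * X.shape[0])      # the relation C = 1 / (nu * l) from the docs

    oc_svm = svm.OneClassSVM(nu=nu, kernel="rbf", gamma=0.1).fit(X)
    svdd = svm.SVDD(C=C, kernel="rbf", gamma=0.1).fit(X)  # SVDD exists only on this branch

    X_test = rng.uniform(-1, 1, size=(20, 2))
    # With a translation-invariant (RBF) kernel and C = 1 / (nu * l), both
    # models should label the test points identically (+1 inlier, -1 outlier).
    print(np.mean(oc_svm.predict(X_test) == svdd.predict(X_test)))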
@@ -190,48 +193,73 @@ This strategy is illustrated below.
       Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on.

-Comparison of different approaches
-----------------------------------
+One-class SVM versus Elliptic Envelope versus Isolation Forest
+--------------------------------------------------------------

-Strictly-speaking, the SVM-based methods are not designed for outlier
-detection, but rather for novelty detection: its training set should not be
-contaminated by outliers as it may fit them. That said, outlier detection in
-high-dimension, or without any assumptions on the distribution of the inlying
-data is very challenging, and a SVM-based methods give useful results in these
-situations.
+Strictly-speaking, the One-class SVM is not an outlier-detection method,
+but a novelty-detection method: its training set should not be
+contaminated by outliers as it may fit them. That said, outlier detection
+in high-dimension, or without any assumptions on the distribution of the
+inlying data is very challenging, and a One-class SVM gives useful
+results in these situations.

 The examples below illustrate how the performance of the
-:class:`covariance.EllipticEnvelope` degrades as the data is less and less
-unimodal, and other methods become more beneficial. Note, that the parameters
-of :class:`svm.OneClassSVM` and :class:`svm.SVDD` are set to achieve their
-equivalence, i. e. :math:`C = 1 / (\nu l)`.
+:class:`covariance.EllipticEnvelope` degrades as the data is less and
+less unimodal. The :class:`svm.OneClassSVM` works better on data with
+multiple modes and :class:`ensemble.IsolationForest` performs well in all
+cases.

-|
+:class:`svm.SVDD` is not presented in comparison as it works the same as
+:class:`svm.OneClassSVM` when using RBF kernel.

-- For a inlier mode well-centered and elliptic all methods give approximately
-  equally good results.
-
-.. figure:: ../auto_examples/covariance/images/plot_outlier_detection_001.png
+.. |outlier1| image:: ../auto_examples/covariance/images/plot_outlier_detection_001.png
    :target: ../auto_examples/covariance/plot_outlier_detection.html
-   :align: center
-   :scale: 75%
+   :scale: 50%

-- As the inlier distribution becomes bimodal,
-  :class:`covariance.EllipticEnvelope` does not fit well the inliers. However,
-  we can see that other methods also have difficulties to detect the two modes,
-  but generally perform equally well.
+.. |outlier2| image:: ../auto_examples/covariance/images/plot_outlier_detection_002.png
+   :target: ../auto_examples/covariance/plot_outlier_detection.html
+   :scale: 50%

-.. figure:: ../auto_examples/covariance/images/plot_outlier_detection_002.png
+.. |outlier3| image:: ../auto_examples/covariance/images/plot_outlier_detection_003.png
    :target: ../auto_examples/covariance/plot_outlier_detection.html
-   :align: center
-   :scale: 75%
+   :scale: 50%
+
+.. list-table:: **Comparing One-class SVM approach, and elliptic envelope**
+   :widths: 40 60
+
+   *
+      - For a inlier mode well-centered and elliptic, the
+        :class:`svm.OneClassSVM` is not able to benefit from the
+        rotational symmetry of the inlier population. In addition, it
+        fits a bit the outliers present in the training set. On the
+        opposite, the decision rule based on fitting an
+        :class:`covariance.EllipticEnvelope` learns an ellipse, which
+        fits well the inlier distribution. The :class:`ensemble.IsolationForest`
+        performs as well.
+      - |outlier1|
+
+   *
+      - As the inlier distribution becomes bimodal, the
+        :class:`covariance.EllipticEnvelope` does not fit well the
+        inliers. However, we can see that both :class:`ensemble.IsolationForest`
+        and :class:`svm.OneClassSVM` have difficulties to detect the two modes,
+        and that the :class:`svm.OneClassSVM`
+        tends to overfit: because it has not model of inliers, it
+        interprets a region where, by chance some outliers are
+        clustered, as inliers.
+      - |outlier2|
+
+   *
+      - If the inlier distribution is strongly non Gaussian, the
+        :class:`svm.OneClassSVM` is able to recover a reasonable
+        approximation as well as :class:`ensemble.IsolationForest`,
+        whereas the :class:`covariance.EllipticEnvelope` completely fails.
+      - |outlier3|

-- As the inlier distribution gets strongly non-Gaussian,
-  :class:`covariance.EllipticEnvelope` starts to perform inadequate. Other
-  methods give a reasonable representation, with
-  :class:`ensemble.IsolationForest` having the least amount of errors.
+.. topic:: Examples:

-.. figure:: ../auto_examples/covariance/images/plot_outlier_detection_003.png
-   :target: ../auto_examples/covariance/plot_outlier_detection.html
-   :align: center
-   :scale: 75%
+    * See :ref:`example_covariance_plot_outlier_detection.py` for a
+      comparison of the :class:`svm.OneClassSVM` (tuned to perform like
+      an outlier detection method), the :class:`ensemble.IsolationForest`
+      and a covariance-based outlier
+      detection with :class:`covariance.MinCovDet`.
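To make the comparison above concrete, here is an editor's sketch (not part of
the diff) that fits the three estimators discussed in this section on a small
contaminated, bimodal data set and counts labelling errors; the parameter
choices mirror the example script changed further down::

    import numpy as np
    from sklearn import svm
    from sklearn.covariance import EllipticEnvelope
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(0)
    n_inliers, n_outliers = 200, 50
    outliers_fraction = n_outliers / float(n_inliers + n_outliers)

    # Bimodal inlier distribution plus uniformly scattered outliers.
    X_in = np.r_[0.3 * rng.randn(n_inliers // 2, 2) - 2,
                 0.3 * rng.randn(n_inliers // 2, 2) + 2]
    X_out = rng.uniform(low=-6, high=6, size=(n_outliers, 2))
    X = np.r_[X_in, X_out]
    y_true = np.r_[np.ones(n_inliers), -np.ones(n_outliers)]

    detectors = {
        "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
                                         kernel="rbf", gamma=0.1),
        "Robust Covariance Estimator": EllipticEnvelope(
            contamination=outliers_fraction),
        "Isolation Forest": IsolationForest(max_samples=X.shape[0],
                                            random_state=rng),
    }

    for name, clf in detectors.items():
        y_pred = clf.fit(X).predict(X)   # +1 for inliers, -1 for outliers
        print("%s: %d labelling errors" % (name, (y_pred != y_true).sum()))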

doc/modules/svm.rst

Lines changed: 7 additions & 6 deletions
@@ -327,10 +327,10 @@ floating point values instead of integer values::

 .. _svm_outlier_detection:

-Outlier and novelty detection
+Novelty and outlier detection
 =============================

-Support vector machines can be used for detecting novely and outliers in
+Support vector machines can be used for detecting novelty and outliers in
 unlabeled data sets. That is, given a set of samples, detect the soft boundary
 of that set so as to classify new points as belonging to that set or not.

@@ -359,21 +359,22 @@ See section :ref:`outlier_detection` for more details on their usage.
 .. topic:: Examples:

     * See :ref:`example_svm_plot_oneclass.py` for visualizing the
-      frontier learned around some data by :class:`svm.OneClassSVM`.
+      frontier learned around some data by :class:`OneClassSVM`.
     * See :ref:`example_svm_plot_oneclass_vs_svdd.py` to get the idea about
       the difference between the two approaches.
     * :ref:`example_applications_plot_species_distribution_modeling.py`

 .. topic:: References:

-    * Bernhard Schölkopf et al, `Estimating the support of a high-dimensional
-      distribution <http://dl.acm.org/citation.cfm?id=1119749>`_, Neural
+    * Bernhard Schölkopf et al, `Estimating the Support of a High-Dimensional
+      Distribution <http://dl.acm.org/citation.cfm?id=1119749>`_, Neural
       computation 13.7 (2001): 1443-1471.
-    * David M. J. Tax and Robert P. W. Duin, `Support vector data description
+    * David M. J. Tax and Robert P. W. Duin, `Support Vector Data Description
       <http://dl.acm.org/citation.cfm?id=960109>`_, Machine Learning,
       54(1):45-66, 2004.

+
 Complexity
 ==========
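The "soft boundary" idea in the first hunk of this file amounts to fitting on
clean data and then classifying new points as inside or outside the learned
frontier. A minimal sketch (editor's illustration) using only the released
``OneClassSVM`` API::

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.RandomState(0)
    X_train = 0.3 * rng.randn(100, 2) + 2             # regular observations only
    X_new = np.r_[0.3 * rng.randn(10, 2) + 2,         # more regular observations
                  rng.uniform(-4, 4, size=(10, 2))]   # plus some abnormal ones

    clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1).fit(X_train)
    print(clf.predict(X_new))   # +1 = inside the learned frontier, -1 = outside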

examples/covariance/plot_outlier_detection.py

Lines changed: 12 additions & 14 deletions
@@ -1,7 +1,7 @@
 """
-======================================
-Outlier detection with several methods
-======================================
+==========================================
+Outlier detection with several methods.
+==========================================

 When the amount of contamination is known, this example illustrates three
 different ways of performing :ref:`outlier_detection`:

@@ -45,15 +45,14 @@
 outliers_fraction = 0.25
 clusters_separation = [0, 1, 2]

-nu = 1.25 * outliers_fraction
-C = 1 / (nu * n_samples)
-
 # define two outlier detection tools to be compared
 classifiers = {
-    "One-Class SVM": svm.OneClassSVM(nu=nu, kernel="rbf", gamma=0.1),
-    "SVDD": svm.SVDD(C=C, kernel='rbf', gamma=0.1),
-    "robust covariance estimator": EllipticEnvelope(contamination=.1),
-    "Isolation Forest": IsolationForest(max_samples=n_samples, random_state=rng)}
+    "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
+                                     kernel="rbf", gamma=0.1),
+    "Robust Covariance Estimator": EllipticEnvelope(contamination=0.1),
+    "Isolation Forest": IsolationForest(max_samples=n_samples,
+                                        random_state=rng)
+    }

 # Compare given classifiers under given settings
 xx, yy = np.meshgrid(np.linspace(-7, 7, 500), np.linspace(-7, 7, 500))

@@ -85,7 +84,7 @@
     # plot the levels lines and the points
     Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
     Z = Z.reshape(xx.shape)
-    subplot = plt.subplot(1, 4, i + 1)
+    subplot = plt.subplot(1, 3, i + 1)
     subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
                      cmap=plt.cm.Blues_r)
     a = subplot.contour(xx, yy, Z, levels=[threshold],

@@ -99,11 +98,10 @@
         [a.collections[0], b, c],
         ['Decision function', 'True inliers', 'True outliers'],
         prop=matplotlib.font_manager.FontProperties(size=11))
-    subplot.set_xlabel("%s\n(errors: %d)" % (clf_name, n_errors))
+    subplot.set_xlabel("%s (errors: %d)" % (clf_name, n_errors))
     subplot.set_xlim((-7, 7))
     subplot.set_ylim((-7, 7))
 plt.suptitle("Outlier detection")
-plt.subplots_adjust(left=0.04, bottom=0.15, right=0.96, top=0.94,
-                    wspace=0.1, hspace=0.26)
+plt.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)

 plt.show()
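The hunks above reference a ``threshold`` variable that is computed earlier in
the script (that part is not shown in this diff). As a hedged illustration of
the idea, assuming the threshold is taken at the percentile of the decision
function given by the known contamination, and noting that the
``nu=0.95 * outliers_fraction + 0.05`` heuristic simply keeps ``nu`` inside its
valid ``(0, 1]`` range::

    import numpy as np
    from scipy import stats
    from sklearn.svm import OneClassSVM

    rng = np.random.RandomState(42)
    outliers_fraction = 0.25
    # 90 regular samples plus 30 uniformly scattered outliers (25% contamination).
    X = np.r_[0.3 * rng.randn(90, 2), rng.uniform(-4, 4, size=(30, 2))]

    clf = OneClassSVM(nu=0.95 * outliers_fraction + 0.05, kernel="rbf", gamma=0.1)
    clf.fit(X)
    scores = clf.decision_function(X).ravel()

    # Cut the decision function so that `outliers_fraction` of the training
    # points fall below the threshold and are flagged as outliers.
    threshold = stats.scoreatpercentile(scores, 100 * outliers_fraction)
    n_flagged = (scores < threshold).sum()
    print("flagged as outliers: %d out of %d" % (n_flagged, len(X)))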
