Locally Constrained Support Vector Clustering

Dragomir Yankov, Eamonn Keogh, Kin Fai Kan


Computer Science & Engineering Department
University of California, Riverside, USA
{dyankov, eamonn, kkan}@cs.ucr.edu

Abstract

Support vector clustering transforms the data into a high dimensional feature space, where a decision function is computed. In the original space, the function outlines the boundaries of higher density regions, naturally splitting the data into individual clusters. The method, however, though theoretically sound, has certain drawbacks which make it less appealing to the practitioner. Namely, it is unstable in the presence of outliers and it is hard to control the number of clusters that it identifies. Parametrizing the algorithm incorrectly in noisy settings can either disguise some objectively present clusters in the data, or can identify a large number of small and nonintuitive clusters.

Here, we explore the properties of the data in small regions by building a mixture of factor analyzers. The obtained information is used to regularize the complexity of the outlined cluster boundaries, by assigning a suitable weighting to each example. The approach is demonstrated to be less susceptible to noise and to outline better interpretable clusters than support vector clustering alone.

1 Introduction

One-class support vector machine (SVM) is an efficient approach for estimating the density of a population [15, 17]. It works by applying a transformation Φ : X → Φ(X) from the input space to a high dimensional feature space, such that points within denser neighborhoods are projected further from the origin of the coordinate system. The support vectors in the feature space are then used to outline closed contours around the dense regions in the input space, defining a binary decision function which is positive inside the contours and negative elsewhere. The method has been demonstrated to be applicable for tasks such as novelty and fault detection, context change detection, learning in image retrieval, etc.

One can easily extend one-class classification to a clustering scheme by labeling each closed contour as a different cluster. Elements not enclosed by any contour correspond to regions that are estimated to have lower density support in the high dimensional feature space. Such elements can be assigned the label of their closest contour in the original space. This extension, called support vector clustering (SVC), was initially proposed in [3].

Despite its theoretical soundness, the SVC method has remained relatively unpopular among practitioners. There are several specific characteristics of SVC that diminish its appeal. For instance, the map Φ requires a parametrized kernel to be provided as an input from the user. The radial basis function $k(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$ has been recognized as a preferred kernel function because of its ability to form closed contours [3, 18]. This means that the user needs to provide a suitable kernel width γ. However, small values of γ (i.e. large kernel widths) may disguise or merge some of the clusters, while very large γ may create multiple closed contours which outline some rather nonintuitive clusters. The effect of multiple emerging clusters is especially strong in the presence of noise. This becomes an issue in many practical applications where the examples lie near the surface of a lower dimensional nonlinear manifold. For example, such noisy manifolds may be defined by a sample of facial images [13, 14, 19], or by the walking motions of a human [10]. Though a soft margin can be introduced to alleviate the impact of the outliers, there is again the issue of how to specify the correct parameter ν, which controls the tradeoff between the generalization performance of the learner and its tolerance to the noisy examples.
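To make the role of γ concrete, the following sketch (ours, not from the paper; plain NumPy, with the toy sample and the helper name being illustrative) computes the Gaussian kernel matrix used by SVC and shows how the pairwise similarities decay as γ grows, which is what eventually fragments the contours:

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.clip(sq_dists, 0.0, None))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))            # toy two-dimensional sample
    for gamma in (0.1, 1.0, 10.0):
        K = rbf_kernel(X, gamma)
        # larger gamma -> smaller kernel width -> off-diagonal similarities decay
        print(gamma, K[0, 1].round(4))
```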
To improve the performance of SVC in the case of Gaussian distributed noise and to obtain better control over the number of detected clusters, we explore the density variability of the data in very small regions. For the purpose, a Mixture of Factor Analyzers (MFA) [9] is used. The mixture model, when learned with a large number of analyzers, implicitly detects points that deviate from the main trajectory of the data. The information about those locally deviating points is used to determine the soft margin tradeoff between the outliers and the accuracy of the one-class SVM learner, as well as to regularize the complexity of the induced decision boundary. The regularization results in smoother contours, which are shrunk towards the dense regions in the data, rather than trying to accommodate all outliers. The subsequent clustering often allows for easier interpretation too. Because of the local dimensionality reduction performed by MFA and the nonlinear feature map Φ, the "locally constrained" SVC method is further demonstrated to correctly identify the topological structure of the data when the clusters reside on a lower dimensional nonlinear manifold.

2 Related Work

A number of clustering algorithms have been demonstrated to be particularly suitable for learning non-convex formations, e.g. spectral clustering [12], spectral graph partitioning [8], or kernel K-means [16]. A close relation between all of these approaches has been pointed out before [4]. We focus on one of these algorithms, spectral clustering. Interestingly, the algorithm shares a lot of commonalities with SVC. They both start by computing a Gaussian kernel matrix, emulating the high dimensional nonlinear feature map Φ. From here on, however, spectral clustering performs an eigen decomposition of the data in the feature space. The projected examples are then clustered, again in the feature space, using K-means clustering. Instead, SVC computes the optimal plane that separates the projected data from the origin in the feature space. In this way a simpler problem is solved, as only the higher density regions are isolated. This comes at the price of not knowing the actual clusters in the data, so a subsequent labeling and assignment step is carried out by SVC.

A different set of unsupervised learning approaches try to infer the nonlinear structure of the data by considering small regions around each example. Some popular methods following this paradigm are, for example, the Laplacian eigenmaps [2] and Isomap [19]. The general idea behind these algorithms is to compute a neighborhood graph G, where each example x_i is connected only to examples in its close proximity. The graph is then augmented to a full affinity matrix by propagating the neighboring distances, e.g. by solving an all pairs shortest path problem (Isomap) or by applying a Laplacian operator (Laplacian eigenmaps). Both methods proceed by computing an eigen decomposition and projecting the data using a small subset of the eigenvectors. As they preserve the convexity of the data, the algorithms can easily be extended for clustering by using a partitioning scheme such as K-means or Expectation Maximization (EM) for a mixture of Gaussians. While local reduction methods have been demonstrated to be unstable in the presence of noise [1], they remain the preferred tool for unsupervised learning from nonlinear manifolds.

In the proposed approach we combine the best features that can be obtained from global methods, such as SVC, and local approximations such as the ones discussed above. The underlying idea is that a global view of the data can be inferred by looking at the overall density distribution. The density estimate alone, however, provides for a very coarse reconstruction of the underlying sample space. Local methods, on the other hand, can smoothen this estimate by looking at the data statistics in some small regions. This is especially important if density fluctuations are observed in the data and yet an obvious clustering is available. In this sense, the proposed method is closest in spirit to the manifold reconstruction method proposed by Roweis et al. [14]. They use a mixture of factor analyzers to infer the local structure of the underlying manifold, but then a global constraint is imposed, so that all local models are aligned to follow a consistent trajectory.
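As a concrete illustration of the neighborhood-graph construction described above (our sketch, not code from the paper; it assumes scikit-learn and SciPy are available), the following builds a k-nearest-neighbor graph and propagates the local distances into a full matrix via all-pairs shortest paths, in the spirit of Isomap:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

# Toy sample: a noisy one-dimensional curve embedded in 2D.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 3 * np.pi, 200))
X = np.column_stack([t * np.cos(t), t * np.sin(t)]) + 0.05 * rng.normal(size=(200, 2))

# Neighborhood graph G: each point connected to its 6 nearest neighbors.
G = kneighbors_graph(X, n_neighbors=6, mode="distance")

# Isomap-style propagation: all-pairs shortest paths approximate geodesic distances.
geodesic = shortest_path(G, method="D", directed=False)
print(geodesic.shape)   # (200, 200) full distance/affinity matrix
```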
3 Support Vector Density Estimation

3.1 One-Class Classification

Let us have a set of n independent and identically distributed observations: X = {x_i}, i = 1..n. The problem addressed by one-class classification is to find a minimal region R which encloses the data (Figure 1). Assuming that the data are generated from the same distribution p, an additional requirement to the minimization of R is that future test examples generated by p should also fall with high probability within R. Therefore, apart from being minimal, R should also generalize well on unseen data, which implies that it should have a non-complex boundary.

Following similar reasoning as in support vector classification, rather than exploring the nonlinear boundary in the original space, one could describe it as a hyper plane in the high dimensional feature space defined by Φ(X). All examples which in the original space are enclosed within R are going to be projected in the same half-space with respect to the hyper plane. If w · Φ(x) = b is the equation of the plane, this is equivalent to the requirement that for all examples x_i the inequality w · Φ(x_i) ≥ b should hold. The two parameters that define the plane uniquely, w and b, are its normal vector and its displacement from the origin respectively. Finally, the plane that corresponds to the smoothest boundary in the original space is the one with the smallest norm of the normal vector w [16]. The equation of this plane is given by the solution of the optimization problem:

$$\min_{w}\ \frac{1}{2}\|w\|^2 \qquad \text{subject to}\quad w \cdot \Phi(x_i) \ge b,\ i = 1..n \tag{1}$$

Figure 1: One-class SVMs detect a region R in the data with higher density support. Points inside the region are projected in the same half-space defined by the separation hyper plane (w, b).

It may be useful to restrict R to enclose only a subregion of X that has higher support for the probability density function. This will be the case, for example, if we are not interested in the noisy points on the periphery of the distribution (see Figure 1). In the feature space, the points that fall outside of R will satisfy w · Φ(x_i) < b. To account for such points, the constraints for them in (1) should be changed to w · Φ(x_i) ≥ b − ξ_i, where we have additionally introduced the slack variables ξ_i ≥ 0. The regularization term that guarantees the smoothness of the boundary also changes, yielding the new formulation:

$$\min_{w,b,\xi}\ q(w,b,\xi) = \frac{1}{2}\|w\|^2 + \frac{1}{n\nu}\sum_{i=1}^{n}\xi_i - b \qquad \text{subject to}\quad w \cdot \Phi(x_i) \ge b - \xi_i,\ \xi_i \ge 0,\ i = 1..n \tag{2}$$

Formulations (1) and (2) produce the so called hard and soft margin decision planes respectively. The penalty parameter ν in (2) controls the tradeoff between the allowed slack for some of the examples and the complexity of the region boundary. It takes values in the interval (0, 1], with ν → 1 allowing for a lot of examples to lie outside the region R, and ν → 0 penalizing the slack variables significantly, converting the problem effectively into a hard margin decision problem. The latter case leads to a very tight and complex boundary for the density region R.

Minimizing the quadratic function q(w, b, ξ) in problem (2) is hard because of the available constraints. Instead, if we write all constraints in the form q_i(w, b, ξ) ≤ 0, the solution is obtained by minimizing the Lagrangian L(w, b, ξ, α) = q(w, b, ξ) + Σ_i α_i q_i(w, b, ξ). To minimize L, one sets the derivatives of L with respect to w, b and ξ to zero, which allows for expressing them as a function solely of the introduced Lagrangian multipliers α_i (α_i ≥ 0, Σ_i α_i = 1) and the data in the feature space Φ(x_i). Substituting the values back in the Lagrangian, we obtain the dual optimization problem of problem (2):

$$\min_{\alpha}\ \frac{1}{2}\sum_{ij}\alpha_i\alpha_j\,\Phi(x_i)\cdot\Phi(x_j) \qquad \text{subject to}\quad \sum_{i=1}^{n}\alpha_i = 1,\ 0 \le \alpha_i \le \frac{1}{n\nu} \tag{3}$$
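In practice the dual (3) does not have to be solved by hand. As an illustration (ours, assuming scikit-learn is available), the standard one-class SVM solver exposes exactly the two parameters discussed in this section, the kernel parameter γ and the tradeoff ν:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (150, 2)),        # dense inner cloud
               rng.normal(0, 2.0, (30, 2))])        # scattered outliers

# One-class SVM with the Gaussian kernel: gamma is the kernel parameter,
# nu upper-bounds the fraction of bounded support vectors (outliers).
ocsvm = OneClassSVM(kernel="rbf", gamma=8.0, nu=0.1).fit(X)

f = ocsvm.decision_function(X)   # analogous to f(x) in (4): > 0 inside R, < 0 outside
print("support vectors:", len(ocsvm.support_))
print("points estimated outside R:", int(np.sum(f < 0)))
```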
The class of feature mappings Φ(X) that linearly separate the data from the origin is not available in parametric form, yet it is selected so that the dot products in the feature space correspond to a computable kernel function in the input space, i.e. k(x_i, x_j) = Φ(x_i) · Φ(x_j). In SVC the Gaussian kernel $k(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$ is used, as it defines smooth closed contours [3, 18]. All multipliers α_i > 0 in the solution of (3) correspond to the support vectors, i.e. the examples which in the feature space lie on the separating hyper-plane (see Figure 1). For the rest of the points x_i the corresponding α_i is equal to zero. To test on which side of the hyper plane such examples are projected, one needs to substitute them in the equation of the plane as defined by the computed support vectors:

$$f(x) = \mathrm{sgn}\Big[\sum_{x_i \in SVs} \alpha_i\, k(x_i, x) - b\Big] \tag{4}$$

Positive f(x) implies that x falls within the dense subspace R, whereas negative values of the decision function imply a sparsely populated region. Observations x_i ∈ X for which f(x_i) < 0 are called bounded support vectors. The value of the displacement b can be computed using the fact that any support vector x_s lies on the separation plane, and thus it satisfies the equality w · Φ(x_s) = b, which can also be expressed in terms of the kernel function as Σ_{x_i ∈ SVs} α_i k(x_i, x_s) = b.

The formalization defined so far is not the only way for computing high dimensional density support. For instance, instead of looking for the optimal separation plane, Ben-Hur et al. [3] study the class of spheres in the feature space that enclose the projected examples. They derive an alternative formulation of problem (3), which instead minimizes the volume of the enclosing hyper sphere. An equivalence of the two formulations has been demonstrated in [16]. In the current work, the density estimation step is carried out as in the original one-class SVM formulation.

3.2 Support Vector Clustering

The one-class density estimation method can easily be extended to a clustering scheme by computing a matrix A for the data, where A_ij = 1 if x_i and x_j are enclosed within the same contour and 0 otherwise. Whether x_i and x_j lie within the same contour can be determined by computing the SVM decision function (4) for all points on the line that connects them. In the original SVC formulation (and also in our implementation) 20 regularly spaced points between x_i and x_j are tested. An always positive decision function guarantees that x_i and x_j are part of the same dense region. The opposite, however, is not necessarily true. For some points on the line between two examples, f may be negative, but the examples may still be within the same contour. This is often the case if the contours are too complex. Therefore, one needs to detect the connected components in the graph induced by A. This determines the number of clusters in the data as well as the labels for each example that is enclosed by a contour. Finally, the bounded support vectors (i.e. the examples outside the contours) are assigned to their closest cluster (see Figure 2).
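A compact sketch of this labeling step (ours, not the authors' implementation; it reuses scikit-learn's one-class SVM as the decision function f and SciPy for the graph step, and the helper name is illustrative) could look as follows:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.svm import OneClassSVM

def svc_labels(X, gamma, nu, n_line_pts=20):
    """Cluster labeling step of SVC: two enclosed points share a label if the
    decision function stays non-negative on the segment connecting them."""
    f = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(X).decision_function
    inside = f(X) >= 0                       # points enclosed by some contour
    idx = np.where(inside)[0]
    t = np.linspace(0.0, 1.0, n_line_pts)[:, None]
    A = np.zeros((len(X), len(X)), dtype=bool)
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            segment = (1 - t) * X[i] + t * X[j]      # 20 regularly spaced points
            if np.all(f(segment) >= 0):
                A[i, j] = A[j, i] = True
    # connected components of the graph induced by A give the clusters
    _, labels = connected_components(csr_matrix(A), directed=False)
    labels[~inside] = -1        # bounded support vectors, assigned to clusters later
    return labels

labels = svc_labels(np.random.default_rng(0).normal(size=(60, 2)), gamma=2.0, nu=0.1)
```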
Figure 2: One-class SVMs can be extended to a clustering scheme by assigning the same label to all points enclosed within the same contour. For example, x_i and x_j are within the same contour if for any point x on the line between them the decision function f(x) is non-negative.

While precise parametrization is not so essential when only density estimation is required, it becomes of crucial importance in the case of clustering. Consider, for example, Figure 2. Selecting a large kernel width (i.e. small γ) would disguise the fact that there is a large fluctuation between the density of the inner and the outer circles. Large values of γ or too small tradeoff terms ν, on the other hand, can produce a decision boundary of a very high capacity, which leads to multiple tight contours in the original space. Apart from obtaining too many small and nondescriptive clusters, the complex decision function impedes the proper labeling even of elements that are within the same contour. For some examples x_p, all lines connecting them to other examples x_q within the same contour would pass through regions where the decision function has a negative value. Such examples will be assumed to belong to a different cluster.

The lack of control over the number of clusters produced by different parametrizations is a significant drawback of the scheme. A common requirement in clustering is that the users provide the number of clusters that they want to be detected in their data. Such a requirement is easily handled by partition clustering (e.g. K-means), agglomerative clustering and even kernel based algorithms such as spectral clustering. Unfortunately, there is no clear unsupervised strategy for how such a user imposed constraint can be incorporated in SVC. One reasonable way to emulate such behavior would be to start exploring kernels with monotonically decreasing widths until at least as many clusters as the users require emerge from the data. Such an iterative approach is followed for example in [11]. As will be shown in the experimental evaluation, this strategy, though pretty robust in the case of well separated and dense clusters, can cause the occurrence of some rather uninformative formations when the clusters are sparse and noise is present in the data.

Next, we introduce a modification of the SVC approach, which improves on its stability in the presence of noise. The method is further demonstrated to be less sensitive to slight changes in the parametrization.

4 Locally Constrained SVC

The intuition followed in the current work is that both global density estimation methods such as SVC, and local reconstruction methods such as Isomap [19] or LLE [13], introduce information about the data which is somewhat complementary. For example, support vector clustering provides some very important information about the overall structure of the data, namely, an estimate of its density. A local method can complement this with additional region boundary smoothing and can evaluate locally which points are likely to deviate from the unknown distribution that has generated the data. The method that we utilize here to obtain such local statistics is based on the Mixture of Factor Analyzers framework introduced by Ghahramani et al. in [9]. We term the algorithm derived in this section Locally Constrained Support Vector Clustering (LSVC).

4.1 Mixture of Factor Analyzers

Factor analysis (FA) is a technique for linearly projecting the data X ⊂ R^D into a lower dimensional space R^d [7]. Ghahramani et al. [9] derive an EM procedure for learning the projecting dimensions z. They make the simplifying assumption that the dimensions z are normally distributed with zero mean and unit variance, i.e. z ∼ N(0, I) (here I marks the identity matrix). Furthermore, each example is allowed to have some residual noise u, which is also assumed to be normally distributed with covariance Ψ, i.e. u ∼ N(0, Ψ). The following relation is now enforced: x = Λz + u, where Λ is the so called factor loading matrix, and the noise covariance matrix Ψ is required to be diagonal. The common factors z are used as latent variables to iteratively obtain an improved likelihood estimate for the observed data x (E-step of the algorithm), recomputing on each iteration more optimal values for the matrices Ψ and Λ (M-step of the algorithm).
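As a concrete reading of the generative model x = Λz + u, the following minimal NumPy sketch (ours, not the authors' code) samples from a single factor analyzer and checks the covariance structure ΛΛ' + Ψ that the model implies, the same quantity reused later in the Mahalanobis distance (5):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 5, 2, 20000                       # observed dim, latent dim, sample size

Lambda = rng.normal(size=(D, d))            # factor loading matrix
Psi = np.diag(rng.uniform(0.1, 0.5, D))     # diagonal noise covariance

z = rng.normal(size=(n, d))                             # z ~ N(0, I)
u = rng.multivariate_normal(np.zeros(D), Psi, size=n)   # u ~ N(0, Psi)
x = z @ Lambda.T + u                                    # x = Lambda z + u

# The model implies cov(x) = Lambda Lambda' + Psi; compare empirically.
print(np.abs(np.cov(x, rowvar=False) - (Lambda @ Lambda.T + Psi)).max())  # small for large n
```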
Ghahramani et al. [9] also suggest that one could have a mixture of factor analyzers, rather than a single one, where every component in the mixture can have a different mean µ_j and loading matrix Λ_j. The noise term in the mixture is preserved the same across all factor analyzers, i.e. z_j ∼ N(µ_j, Ψ). The goal now becomes to find a maximum likelihood estimate for the observed data x, using the latent variables z_j and the probability that x has been projected using the j-th factor analyzer (E-step of the mixture model). On every iteration the MFA algorithm, apart from computing more optimal estimates of the matrices Ψ and Λ_j, also improves on the estimate for the means of the analyzers µ_j (M-step of the mixture model).

Figure 3 illustrates the MFA algorithm when applied with twenty components. Apart from clustering the data, MFA also estimates the optimal lower dimensional representation for the examples in each cluster. This is an essential characteristic when the data follow the structure of a lower dimensional manifold embedded in the original space R^D. The locally constrained SVC method suggested here exploits this property.

Figure 3: The topology of the data is closely approximated with a mixture of 20 analyzers. The ellipses outline two standard deviations from the center of the analyzers. The mixture can be used to detect "local" outliers (e.g. P2) that bridge the existing clusters.

4.2 Regularizing the One-Class SVMs

In the proposed approach we are going to use the fact that MFA can single out the majority of the outliers, which fall outside the main trajectory followed by the data. In Figure 3 the ellipses outline a two standard deviations region around the mean of the corresponding local clusters. Points, such as P1 and P2, that are too distant from their cluster centers are indeed among the noisy points bridging the two global concentric clusters. Cleaning the data set from these points can significantly improve the performance of the SVC method. Note also that using only the MFA method for reconstructing the underlying distribution will not provide a good enough solution either. Applied as a local method, similarly to Isomap and LLE, MFA can be unstable because of the noise [1]. For instance, the two analyzers that bridge the two clusters in Figure 3 will impede the proper identification of the present formations. This illustrates the importance of having an additional input from the global density method too.

Before we show how the information obtained through MFA can improve the one-class SVMs, it would be useful to understand how the outliers impact the detected contours. In the soft margin formulation (2), every example is allowed to cross the decision boundary with a penalty controlled by the slack variables ξ. This makes the decision function less complex, at the price of some misclassified examples x_i, which in this case means that the function underestimates the density around these examples. Misclassification of all such x_i is penalized proportionally to their distance to the separation plane (ξ_i), but with the same weighting factor 1/(nν). Assuming that there is additional, possibly uncertain, knowledge about which examples are actually outliers, the procedure might instead be changed to use different weighting factors. The idea is similar to weighted SVM classification, which has been demonstrated to be suitable in the case of imbalanced classes [6], with the difference being that the weights now should be determined based on the confidence that a certain example is an outlier.

A confidence estimate of the importance of each example can be obtained by measuring the example's deviation from the mean of the factor analyzer that it belongs to. If $z^j = (z_1^j, z_2^j, \ldots, z_{r_j}^j)'$ are the projections of the examples that are assigned to the j-th mixture component, then the deviation of each example projection $z_i^j$ can be expressed through the Mahalanobis distance:

$$d_{ij} = \big[(z_i^j - \mu_j)'\, C_j^{-1}\, (z_i^j - \mu_j)\big]^{1/2} \tag{5}$$

In the above, the covariance of the j-th factor analyzer is estimated as $C_j = \Lambda_j \Lambda_j' + \Psi$ (see [9]). Now we adjust the penalty for misclassifying examples that are believed to be outliers (i.e. examples with a large distance d_ij to their corresponding center µ_j) to be small, so that the decision function is not so influenced by them. This will smooth the separation boundary inferred by function (4), and hence will decrease the chance of having multiple small contours around sparser neighborhoods. To achieve that, each individual penalty term is modified to be inversely proportional to its Mahalanobis distance d_i. Now (2) is written as follows:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + \frac{1}{n\nu}\sum_{i=1}^{n}\frac{1}{d_i}\xi_i - b \qquad \text{subject to}\quad w \cdot \Phi(x_i) \ge b - \xi_i,\ \xi_i \ge 0,\ i = 1..n \tag{6}$$
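A few lines of NumPy suffice to turn component statistics into the weights 1/d_i that enter (6). This is our illustrative sketch (the helper name is ours, and the distance is taken in the observed space for dimensional simplicity):

```python
import numpy as np

def outlier_weights(X, mu, Lambda, Psi):
    """Per-example weights 1/d_i from the Mahalanobis distance to the mean of the
    factor analyzer the examples are assigned to, with C = Lambda Lambda' + Psi.
    Computed here in the observed space for dimensional simplicity."""
    C = Lambda @ Lambda.T + Psi
    diff = X - mu
    # d_i = sqrt(diff_i' C^{-1} diff_i), solved without explicitly inverting C
    d = np.sqrt(np.einsum("ij,ij->i", diff, np.linalg.solve(C, diff.T).T))
    return 1.0 / np.maximum(d, 1e-12)       # large deviation -> small penalty weight

rng = np.random.default_rng(0)
Lambda = rng.normal(size=(3, 1))
Psi = 0.1 * np.eye(3)
X = rng.multivariate_normal(np.zeros(3), Lambda @ Lambda.T + Psi, size=100)
w = outlier_weights(X, np.zeros(3), Lambda, Psi)   # weights enter (6) as 1/d_i
```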
For brevity of notation in (6), we have omitted the indicator showing which factor analyzer the projection of an example x_i belongs to, yet it should be kept in mind that the distances d_i are computed based on the individual mixture components. Note that the feature map Φ is applied on the original variables x_i rather than the projections z_i. The latter is done because the projecting dimensions for every factor analyzer are different. As density estimation in higher dimensional spaces has degrading effectiveness, it may still be necessary to perform a dimensionality reduction of the space X before solving the optimization problem (6). For that purpose, one could detect a global coordination for all factor analyzers [14], or just use a linear reduction such as PCA, as suggested by Ben-Hur et al. [3]. Here we use the second approach, which does not diminish the importance of MFA in the overall scheme, as the example weights have been computed based on the intrinsic dimensionality inferred by the method.

The Lagrangian now has the form:

$$L = \frac{1}{2}\|w\|^2 + \frac{1}{n\nu}\sum_{i=1}^{n}\frac{1}{d_i}\xi_i - b - \sum_{i=1}^{n}\alpha_i\,\big(w \cdot \Phi(x_i) - b + \xi_i\big) - \sum_{i=1}^{n}\beta_i\,\xi_i \tag{7}$$

Taking the derivatives with respect to the primal variables w, b, and ξ_i and substituting in (7), we obtain the dual optimization problem, which we now try to maximize with respect to the dual variables α_i. This yields the constrained optimization problem:

$$\min_{\alpha}\ \frac{1}{2}\sum_{ij}\alpha_i\alpha_j\,\Phi(z_i)\cdot\Phi(z_j) \qquad \text{subject to}\quad \sum_{i=1}^{n}\alpha_i = 1,\ 0 \le \alpha_i \le \frac{1}{d_i n\nu} \tag{8}$$

In [16] the one-class SVM optimization problem is demonstrated to be solvable with a fast iterative technique called sequential minimal optimization (SMO). What makes the method applicable is the special form of the objective function and the linear equality constraint Σ_{i=1}^{n} α_i = 1. Both the function and the equality constraint in (8) are similar to the ones in problem (3), which means that we can perform the optimization using SMO again. Formulations (3) and (8) differ only in the constraints imposed on α_i, which are now allowed to be upper-bounded by different values. That upper bound is determined based on the confidence for the corresponding examples to be outliers.

It may be argued that the described process will also identify as noisy some points that are not necessarily outliers, for instance, the points P3 and P4 in Figure 3. They are part of denser regions, yet they deviate from their component centers too. In this sense we say that the feedback obtained from MFA is uncertain, yet this will not necessarily have a detrimental effect, as the collaboration with the density estimation procedure again comes into play. The decision function evaluated for the denser region where P3 resides will be positive for a large set of kernel widths, and the optimal slack variable for this point will most likely be zero, regardless of what constraint is imposed on its weight.

The number of mixture components that we use in the evaluation procedure is set to be larger than the number of clusters that we would like to be detected in the data. In general, we find it good practice to use at least several analyzers for each cluster that we want to detect. This ensures that if there are non-convex clusters present, each cluster may be covered with more than one component on average, which would better outline the cluster's topology. This may seem like a very loose specification, yet we observe that even when providing a relatively large number of components, the LSVC algorithm still correctly detects as bounded support vectors points that are indeed outliers. We could also specify the number of analyzers as a fraction of the total number of examples. In this mode MFA would roughly approximate methods such as Isomap or LLE, which use neighborhoods of a certain size to reconstruct the underlying structure. For example, if we set the number of analyzers to be equal to n/10, then most components in the mixture will on average have ten elements and will resemble the neighborhoods constructed by the local methods.

Before we conclude this section, we note another interesting estimate that can be obtained through the MFA algorithm, namely, that of the tradeoff parameter ν. [15] demonstrates that the optimal ν to be specified in the one-class optimization problem (2) should be an upper bound on the fraction of outliers that are assumed to be present in the data. This fact by itself is not very helpful, as the number of outliers is unknown in advance. Using the factor analyzers, however, such an estimate can be obtained, for example, by counting the elements which deviate significantly from the mean of their mixture component. For the purpose, we compute the empirical standard deviation of the Mahalanobis distances d_ij within each analyzer. Then we set ν = Σ_j s_j / n, where s_j is the number of examples that are more than two standard deviations away from the mean of the j-th analyzer.
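Under one reading of this rule (our sketch; grouping the distances by analyzer and the exact two-standard-deviation cutoff on the distances are assumptions on our part), the estimate of ν can be computed as:

```python
import numpy as np

def estimate_nu(distances, assignments):
    """nu = sum_j s_j / n, where s_j counts the examples whose Mahalanobis distance
    d_ij is more than two empirical standard deviations above the mean distance
    within the j-th analyzer (one possible reading of the rule above)."""
    n = len(distances)
    s_total = 0
    for j in np.unique(assignments):
        d_j = distances[assignments == j]
        s_total += int(np.sum(d_j > d_j.mean() + 2.0 * d_j.std()))
    return s_total / n

# toy usage: distances d_ij and component assignments coming from the MFA step
rng = np.random.default_rng(0)
d = np.abs(rng.normal(size=500))
comp = rng.integers(0, 10, size=500)
print(estimate_nu(d, comp))     # prints the estimated fraction of outliers
```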
5 Discussion

Using an example, we will elaborate on the effect that the introduced weighting scheme has on the detected contours. We run the two algorithms, SVC and LSVC, on the synthetic "target data set" from Figure 4 (see Section 6 for details about its generation). The parameters used for both algorithms are γ = 8 and ν = 0.1. Ten factor analyzers were used in the weight computing step for LSVC.

The black diamonds on the graphs represent bounded support vectors or support vectors which were found to form no connected components with any of the other examples (i.e. they form a one point cluster). As Figure 4 left shows, SVC tries to learn a decision boundary that accounts for almost all of the examples. This results in bridging the two concentric clusters present in the data. For the same parameters, LSVC (see Figure 4 right) forms contours that are shrunk towards the means of the data distribution. Multiple points with lower density around them are identified as bounded support vectors. Such points are identified as noise during the MFA step, and their weights in building the decision function have been decreased. The central circle is now detected as a separate cluster, while the outer circle has approximately as many clusters as in the SVC case.

Figure 4: γ = 8 and ν = 0.1. Left: SVC tries to accommodate all examples, building complex contours and incorrectly bridging the two concentric clusters. Right: LSVC, the method proposed here, detects most outliers. The contours shrink towards the dense regions and the two main clusters are separated correctly.

It could be argued that we give an advantage to the LSVC algorithm by allowing the penalty to vary due to the different weights, while for SVC it is fixed with the constant ν. It is true that if we relax the penalty for all examples (i.e. increase ν), some of the noisy points will be identified as bounded support vectors by SVC too. Yet, there is the problem of how exactly ν should be determined to improve the performance of SVC. In this case the value ν = 0.1 was automatically computed using the previously described procedure of counting the deviating points for the ten factor analyzers. Furthermore, a suitable value for ν may not exist for the currently selected γ. For example, increasing ν twice produces almost identical results as ν = 0.1. Increasing it four times leads to the graph in Figure 5 left.

SVC detects the internal circle as a separate cluster now, but the outer circle is split into multiple nonintuitive clusters. Another alternative to isolating the noisy points would be to keep ν unchanged and decrease the kernel width instead. However, there is again the issue of what kernel width would be more accurate. Furthermore, decreasing the width increases the complexity of the boundary, forming some very tight contours (see Figure 5 right) that at some point may also split into multiple clusters.

Figure 5: Left: SVC for γ = 8 and ν = 0.4. Many outliers are now correctly identified, but the rest of the points are split into multiple uninformative clusters. Right: SVC for γ = 9 and ν = 0.1. Increasing γ also cannot achieve the LSVC effect. The contours become very tight and complex and start splitting into multiple clusters.

6 Experimental Evaluation

To demonstrate the performance of the proposed method we employ the following unsupervised procedure, which we run with both algorithms, SVC and LSVC. For every data set we specify the number of clusters k that we would like the algorithm to detect. For all experiments the number of factor analyzers in LSVC is set to 10. The value of ν is determined as the fraction of outliers detected in the MFA step. The same value of ν is used in parameterizing SVC too. We vary log γ within the interval [−16, 16], starting with −16 and incrementing it with a step of 1 at a time. This gradually increases γ (i.e. decreases the kernel width) and causes more clusters to emerge. We stop the procedure when the number of clusters k̂ detected by the algorithm surpasses k (i.e. k̂ ≥ k). The procedure is suitable for comparing the robustness of the two algorithms, as the rate with which the clusters emerge when slowly decreasing the kernel width is highly correlated with the stability of the density estimation procedure in the presence of noise.

Though SVC and LSVC are primarily density estimation methods, rather than clustering algorithms for the detection of a fixed number of classes, we also check which would be the k clusters that the algorithms will return to the users. For the purpose, if k̂ is larger than k, we start appending smaller clusters to the k largest clusters. The merging is done based on the minimal pairwise distance between the different clusters. Though not formal enough, and prone to certain errors, this merging step is suitable for detecting whether the clusters identified by the algorithms are well separated or there are dense regions that bridge them. The bounded support vectors are also assigned to their closest cluster.
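A minimal version of this merging step could look as follows (our sketch; the paper gives no code, and details such as processing order and tie breaking are assumptions):

```python
import numpy as np

def merge_to_k_clusters(X, labels, k):
    """Append every cluster outside the k largest (and every bounded support
    vector, labeled -1) to the kept cluster with the minimal pairwise
    point-to-point distance, as in the evaluation procedure described above."""
    labels = labels.copy()
    ids, counts = np.unique(labels[labels >= 0], return_counts=True)
    keep = ids[np.argsort(counts)[::-1][:k]]

    def nearest_kept(points):
        dists = [np.min(np.linalg.norm(points[:, None, :] - X[labels == kc][None, :, :], axis=2))
                 for kc in keep]
        return keep[int(np.argmin(dists))]

    for c in ids:
        if c not in keep:
            labels[labels == c] = nearest_kept(X[labels == c])
    for i in np.where(labels == -1)[0]:          # bounded support vectors
        labels[i] = nearest_kept(X[i:i + 1])
    return labels
```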
6.1 Synthetic Data Sets

We first study the performance of SVC and LSVC on the synthetic data set used throughout this exposition. The data represent two concentric circles (see Figure 6) and are generated similarly to one of the data sets used by Ben-Hur et al. in [3]. The inner concentric circle contains 150 points from a Gaussian distribution. The outer circle is composed of 300 points from a radial Gaussian distribution and a uniform angular distribution.

Figure 6: Top: the proposed LSVC algorithm; left: the contours and the clusters identified by the automatic procedure (the black diamonds indicate the bounded support vectors detected as noise); right: merging to obtain only two clusters. Bottom: the SVC algorithm; left: identified contours and clusters; right: merging to obtain only two clusters.

We set k = 2 and run the described automatic procedure. The ν value is computed to be 0.1. For log γ < 2 both SVC and LSVC detect only one cluster. For log γ = 2 LSVC and SVC detect four clusters (see Figure 6 left) and, as k̂ > k, the procedure terminates. LSVC identifies 62 bounded support vectors (the black diamonds on the graph) against only 2 for SVC. The merging of the detected clusters results in 99% accuracy for LSVC and only 54% for SVC (see Figure 6 right). Manually probing among a larger set of (γ, ν)-pairs, we managed to identify values for SVC that also produced high accuracy after the merging procedure, but for those values there were multiple nonintuitive clusters detected by the algorithm and some rather complex contour boundaries.

The Swiss roll data set is a standard benchmark for evaluating local unsupervised techniques for clustering and dimensionality reduction [13, 19]. We have removed some of the examples from the original data set to obtain two disconnected non-convex clusters (see Figure 7). The data is three dimensional and contains 900 examples, to which we have additionally added some Gaussian noise.

Figure 7: Top: the proposed LSVC algorithm; left: clusters identified by the automatic procedure; right: merging to obtain only two clusters. Bottom: the SVC algorithm; left: 5 small nonrepresentative clusters are identified with the automatic procedure; right: using supervision we detect parameters that lead to better clustering, which still fails to isolate the noise.

For this experiment, the MFA step of the LSVC algorithm is set to use a two dimensional projection z. The number of required clusters is set to k = 2. The tradeoff term is computed as ν = 0.07. log γ = −1 is the first value for which the LSVC method detects more than one cluster (k̂ = 9). The number of bounded support vectors is 65 (see Figure 7 top left). Note that the bounded support vectors are positioned on the periphery of the two clusters, detecting much of the bridging noise that could degrade the clustering approach. Applying the merging procedure yields the clustering presented in Figure 7 top right. The accuracy is again approximately 99%.

The SVC algorithm detects k̂ = 5 clusters for log γ = −2, and thus the automatic procedure terminates. Four of the clusters, however, correspond to some small dense neighborhoods and do not detect the two large point formations in the data (see Figure 7 bottom left). Only one bounded support vector was found, underestimating significantly the amount of noise present. The accuracy after merging is 78%, with most points from the smaller cluster being assigned to the larger one. We again manually probe for other possible parameters that can produce a more accurate merging step for SVC. We find that the pair (log γ = −1, ν = 0.07) identifies 14 clusters and 16 bounded support vectors (see Figure 7 bottom right), which after merging do lead to high accuracy as in the LSVC algorithm. Again, in this case, the detection of the suitable values required additional supervision and still produced a larger number of not very representative small clusters.
6.2 Face Data Set

The Frey face images have been demonstrated by Roweis et al. [14] to reside on a smooth two dimensional manifold. Several examples of the images are presented in Figure 8, top right. The position of the examples on the manifold is determined by the expression of the face and the rotation of the head. Those are the features that separate the data into the two dense clouds seen in the figure. Every example is recorded as a 560 dimensional vector (the images are 20x28 pixels), where the dimensions correspond to the greyscale intensities of each pixel. In the evaluation here we randomly select 1000 examples from the original data set.

Figure 8: Top: the proposed LSVC algorithm; left: clusters identified by the automatic procedure; right: merging to obtain only two clusters. Bottom: the SVC algorithm; left: 1 large and 1 small nonrepresentative cluster are identified with the automatic procedure; right: using supervision we detect parameters that lead to better separation, but still with some nonrepresentative clusters.

The data are very high dimensional and, as previously mentioned, the density estimation approach in this case may not lead directly to reasonable results. Therefore, we first reduce the dimensionality using PCA and we work instead with the three dimensional projection along the top three eigenvectors. We further require that k = 2, aiming to detect the two dense formations that can be observed on the PCA projection in Figure 8.

The MFA step is again set to use two dimensional projections z. The tradeoff ν is computed to be 0.07 and log γ = −14 is the first γ for which LSVC detects more than one cluster. The algorithm identifies exactly k̂ = 2 clusters and 129 bounded support vectors, which again outline correctly the bridging noise between the two distributions (see Figure 8 top left). Assigning the bounded support vectors to the closest dense region results in the clustering demonstrated in Figure 8 top right.

For the SVC algorithm log γ = −13 yields the kernel width that first detects more than one cluster (k̂ = 2). One of the clusters, however, is a small region of just a few elements (see Figure 8 bottom left). The merging step does not change this result either. Increasing log γ twice did lead us to a better cluster assignment (see Figure 8 bottom right), which after merging the multiple clusters produced two clusters similar to the ones identified with LSVC. However, the value required additional supervision and also detected multiple non-representative clusters. Moreover, very few of the scattered examples between the two dense formations were detected as noise (i.e. bounded support vectors).

Figure 9: Arrowheads data set. 2D MDS projection with representative examples for the six classes present in the data.

6.3 Arrowheads Data Set

The Arrowheads data set contains time series extracted from the shape contours of 600 projectile images. There are six classes of projectiles labeled in the collection. The time series were formed by computing the distance from every point of the shape's contour to its centroid [5]. To allow for rotation and scale invariance, we have further aligned and resampled all time series in the data set, representing them with 340 dimensional vectors. The data is then projected using the two largest eigenvectors (see Figure 9).
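The centroid-distance transform can be sketched as follows (ours, not the authors' code; the scale normalization shown is an assumption, and the rotation-alignment step mentioned above is omitted):

```python
import numpy as np

def contour_to_series(contour, length=340):
    """Convert a closed 2D shape contour (n x 2 array of boundary points) into a
    time series of distances from each boundary point to the centroid, resampled
    to a fixed `length` so all shapes are represented with equal-dimension vectors."""
    centroid = contour.mean(axis=0)
    d = np.linalg.norm(contour - centroid, axis=1)
    d = d / d.max()                          # scale invariance (assumed normalization)
    old = np.linspace(0.0, 1.0, len(d))
    new = np.linspace(0.0, 1.0, length)
    return np.interp(new, old, d)

# toy usage: a noisy ellipse-like contour
t = np.linspace(0, 2 * np.pi, 500, endpoint=False)
contour = np.column_stack([2 * np.cos(t), np.sin(t)])
contour += 0.01 * np.random.default_rng(0).normal(size=(500, 2))
series = contour_to_series(contour)          # 340-dimensional vector as in the data set
```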
The data set is rather difficult to discriminate, with many bridging elements between the available classes and with some classes (leaf and lanceolate) significantly overlapping. We run SVC and LSVC with k = 6. The MFA projection z is again two dimensional. The value for ν is computed to be 0.09. The contours detected by the two methods and the clusters after the merging procedure are presented in Figure 10.

Both methods detect fewer than six clusters for log γ < 1. For log γ = 1, LSVC finds 19 clusters and isolates 60 bounded support vectors (see Figure 10 top left). After the merging procedure, we map the six clusters that we identify to the original labels that yield the highest accuracy. The result is presented in Figure 10 top right. The accuracy of the method is ∼73%. In summary, the LSVC method performs well and succeeds in capturing the objectively dense regions in the data.

The SVC approach fails to separate the stemmed class, and hence the worse accuracy of the clustering, ∼60% (see Figure 10 bottom right). The number of clusters detected by the method is 18 and the number of bounded support vectors is six (see Figure 10 bottom left). SVC also identifies some objectively dense regions in the data, but the contours are again more complex and tend to accommodate most of the bridging elements between the different classes.

Figure 10: Top: the proposed LSVC algorithm; left: contours and clusters identified by the automatic procedure (colors are assigned agnostically); right: merging to obtain six clusters. Bottom: the SVC algorithm; left: identified contours and clusters (colors are assigned agnostically). The method tries to accommodate much of the noise, building more complex boundaries; right: merging to obtain six clusters. The accuracy is significantly lower compared to the LSVC algorithm: 60% vs 73%.

7 Conclusions and Future Work

We presented a method for improving the stability of the support vector clustering (SVC) algorithm in the presence of noise and bridging elements between the available clusters. The introduced algorithm uses a mixture of factor analyzers (MFA) to learn a weighting representing the confidence that a certain example is an outlier. The weights are later used to regularize the complexity of the decision function computed for the clustering. On synthetic and real data sets, we demonstrated that our method produces better results than SVC alone. The results also indicate that combining the best features of local and global clustering approaches can provide a powerful tool for learning clusters sampled from nonlinear manifolds.

Though the algorithm is fairly robust to a not very precise specification of the number of factor analyzers, it would be useful to have an automatic procedure that removes the need for specifying this parameter. Dirichlet processes have been demonstrated to be a suitable means for inferring the number of components in mixture models. We are currently exploring their applicability in the settings of the LSVC algorithm.

References

[1] M. Balasubramanian and E. Schwartz. The Isomap algorithm and topological stability. Science, 295(5552):7, 2002.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373–1396, 2003.
[3] A. Ben-Hur, D. Horn, H. Siegelmann, and V. Vapnik. Support vector clustering. J. Mach. Learn. Res., 2:125–137, 2002.
[4] M. Brand and K. Huang. A unifying theorem for spectral embedding and clustering. Proc. of the 9th International Workshop on Artificial Intelligence and Statistics, 2003.
[5] C. Chang, S. Hwang, and D. Buehrer. A shape recognition scheme based on relative distances of feature points from the centroid. Pattern Recognition, 24(11):1053–1063, 1991.
[6] S. Du and S. Chen. Weighted support vector machine for classification. In Proc. International Conference on Systems, Man and Cybernetics, volume 4, pages 3866–3871, 2005.
[7] B. Everitt. An Introduction to Latent Variable Models (Monographs on Statistics and Applied Probability). Chapman & Hall, 1984.
[8] M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(98):298–305, 1973.
[9] Z. Ghahramani and G. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, 1996.
[10] C. Lee and A. Elgammal. Simultaneous inference of view and body pose using torus manifolds. In Proc. of the 18th International Conference on Pattern Recognition (ICPR), pages 489–494, 2006.
[11] S. Lee and K. Daniels. Gaussian kernel width generator for support vector clustering. In Proc. of the International Conference on Bioinformatics and its Applications, Series in Mathematical Biology and Medicine, volume 8, pages 151–162, 2005.
[12] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NIPS), volume 14, 2001.
[13] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[14] S. Roweis, L. Saul, and G. Hinton. Global coordination of local linear models. In Advances in Neural Information Processing Systems (NIPS), volume 14, 2002.
[15] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, and R. Williamson. Estimating the support of a high-dimensional distribution. Neural Comput., 13(7):1443–1471, 2001.
[16] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
[17] B. Schölkopf, R. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt. Support vector method for novelty detection. Advances in Neural Information Processing Systems (NIPS), pages 582–588, 2000.
[18] D. Tax and R. Duin. Support vector data description. Mach. Learn., 54(1):45–66, 2004.
[19] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
