Clustering for Bioinformatics via Matrix Optimization

2011, Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine - BCB '11

Suely Oliveira
Department of Computer Science, University of Iowa, Iowa City, Iowa 52242, USA
oliveira@cs.uiowa.edu

David E. Stewart
Department of Mathematics, University of Iowa, Iowa City, Iowa 52242, USA
dstewart@math.uiowa.edu

ABSTRACT

A new matrix-based clustering method, related to semi-definite programming techniques, is presented that is able to handle medium to large data sets. The proposed method involves solving a non-convex optimization problem. The problem of local minima that are far from global minima, however, does not appear to be a great difficulty. The method was applied to a well-known biological clustering problem, and appears to produce consistent clusterings that are close to the original claimed clustering.

Categories and Subject Descriptors

J.3 [Life and Medical Sciences]: Bioinformatics

Keywords

clustering, optimization, semi-definite programming

1. INTRODUCTION

Clustering and classifying data items are common tasks in mining data sets [4, 12, 14, 1], both for text data mining and for mining biological data sets. Clustering tasks are typically separated from classification tasks in that classification starts with a known collection of groups, and the task is to assign each data item to a group. This is often called "supervised clustering" because the group or cluster of certain data items is pre-specified. In this way, the assignment of data items to clusters is "supervised" by the user of the classification method. Clustering, on the other hand, does not start with a pre-specified collection of groups or clusters with certain data items already assigned to a cluster. Instead, the task is to identify what groups should exist, and even how many groups there should be. With this understanding, clustering is much more appropriate for exploratory data analysis, while classification is more appropriate where there is already considerable information available about the structure of the data.

In truth, there is a spectrum of tasks between these two extremes, depending on how many of the data items in the data set are pre-assigned. If most of the data items are pre-assigned to certain clusters, then the assignment of the remaining data items can be determined by looking at the "neighboring" data items. This is the basis of the k-nearest neighbors (kNN) algorithm [6, 21]. However, if few of the data items are pre-assigned to clusters, then many data items will probably be far from these few exemplars, and the boundaries of the clusters must be determined from the structure of the collection of unassigned data items. This makes the classification task look more like an unsupervised clustering problem.

In this paper we focus on the unsupervised clustering problem. The methods described here can be extended to the supervised classification problem; it is expected that these algorithms will perform well for classification problems where only a few data items are pre-assigned to a cluster.

The starting point for a clustering problem is typically a set of data points (or items) $x_i \in \mathbb{R}^m$, $i = 1, 2, \ldots, N$. In certain problems we have the issue of missing data, where each value of a data point may be a known number or listed as missing; that is, $x_i \in (\mathbb{R} \cup \{\bot\})^m$, where "$\bot$" indicates a missing value. In this paper we concentrate mainly on data sets with complete data. An alternative starting point for clustering is a collection of distance or dissimilarity values $d_{ij}$, where both $i$ and $j$ range over the data items. Typically $d_{ij} = \phi(\|x_i - x_j\|)$ for a suitable monotonic function $\phi$; the most common choices are $\phi(s) = s$ and $\phi(s) = s^2$.
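As a concrete illustration of this starting point, the following short sketch builds a dissimilarity matrix $d_{ij} = \phi(\|x_i - x_j\|)$ with $\phi(s) = s^2$ from a small set of data points. The paper's own implementation is in MATLAB; this Python/NumPy sketch, including the helper name `dissimilarity_matrix`, is only an illustrative assumption, not the authors' code.

```python
import numpy as np

def dissimilarity_matrix(X, phi=lambda s: s**2):
    """Pairwise dissimilarities d_ij = phi(||x_i - x_j||) for rows x_i of X.

    Illustrative sketch of the paper's setup (phi(s) = s^2 by default);
    not the authors' implementation.
    """
    # Squared Euclidean distances via ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i.x_j
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)      # guard against round-off
    return phi(np.sqrt(sq_dists))

# Example: N = 5 points in R^3
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
D = dissimilarity_matrix(X)
assert np.allclose(np.diag(D), 0.0)           # d_ii = 0, as assumed later
```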
However the data are provided, clustering with an unspecified number of clusters is an inherently ill-defined task. Just as the classification of living things can be broken into finer or coarser groups according to the level of refinement and detail desired, any cluster can be broken down into smaller parts. In doing so, small differences can be used to separate items, even though they are likely to be merely "noise". The question is a statistical one and can be addressed, at least partially, in terms of information theory. These issues will not be addressed in detail in this paper.

1.1 Clustering via optimization

A common approach to clustering is to represent the clustering problem as an optimization problem. This approach goes back to Rao [17]. The advantage of these methods is that they are fairly simple to understand: one formulates an objective function which represents the "quality" of the clustering, and then looks for algorithms to find the minimizer of this objective function. The main difficulty with this approach to clustering is that the problems are discrete optimization problems and are typically combinatorially hard [7]. For example, the k-means clustering approach for $K$ clusters is to
$$\text{minimize} \quad \sum_{i=1}^{N} \| x_i - c_{a(i)} \|^2$$
over all "centers" $c_j \in \mathbb{R}^m$, $j = 1, 2, \ldots, K$, and assignments $a\colon \{1, 2, \ldots, N\} \to \{1, 2, \ldots, K\}$. The standard k-means algorithm (also known as Lloyd's algorithm [10]) is an iteration that first updates the centers $c_j$ to be the mean of all $x_i$ where $a(i) = j$, and then updates the assignment so that $a(i) = j$ where $c_j$ is the closest center to $x_i$. The minimizing centers $c_j$ and assignments $a$ are a fixed point of this iteration. Alternative methods that have been explored include local, stochastic and evolutionary search algorithms such as assignment swapping, simulated annealing, genetic algorithms, and related heuristics [8, 25].

1.2 Semi-definite programs

Recently there have been a number of developments in the area of optimization that have the potential to revolutionize the approach to solving the combinatorially hard optimization problems that arise in clustering. These have centered around the use of matrices as the primary variables in certain structured optimization problems. The best known of these problems are semi-definite programming (SDP) problems [19, 23]. These problems have the form
$$\min_X \; C \bullet X \qquad (1.1)$$
subject to
$$A_i \bullet X = b_i, \quad i = 1, 2, \ldots, m, \qquad (1.2)$$
$$X \succeq 0, \qquad (1.3)$$
where the minimum is taken over all $X$ that are symmetric $n \times n$ matrices. The expression "$A \bullet B$" is the Frobenius inner product of the matrices $A$ and $B$: $A \bullet B = \operatorname{trace}(A^T B) = \sum_{i,j} a_{ij} b_{ij}$. In general, the inequality "$A \succeq B$" means "$z^T A z \ge z^T B z$ for all $z$"; equivalently, "$A \succeq B$" means that $A - B$ is positive semi-definite. Thus (1.1–1.3) is the problem of minimizing a linear function of $X$ subject to linear constraints and the requirement that $X$ be positive semi-definite, that is, $z^T X z \ge 0$ for all $z$. Also, the set $\{\, X \mid X \succeq 0 \,\}$ is a convex set: if $A, B \succeq 0$ then $\theta A + (1-\theta) B \succeq 0$ for $0 \le \theta \le 1$. Thus (1.1–1.3) is a convex optimization problem; that is, the objective function is convex, and the set of points satisfying the constraints is also convex. Convex optimization problems are easier to work with since every local minimizer (or indeed, critical point) is a global minimizer. In particular, the problem of many irrelevant local minima does not occur with convex optimization problems. Semi-definite programs can be solved in essentially polynomial time by practical algorithms [2, 9, 19]. Semi-definite programs also provide a means of approximating hard combinatorial problems [13, 15, 22, 23, 24], including clustering problems ([16] for k-means).

2. AN OPTIMIZATION FORMULATION

An optimization formulation for clustering first proposed by Rao [17] is a matrix-based approach to clustering. The basic data for this approach is not the data set $x_i$, $i = 1, 2, \ldots, N$, but rather the distance or dissimilarity data $d_{ij}$ representing the distance between $x_i$ and $x_j$. In Rao's formulation of the clustering problem the main variable is an $N \times K$ matrix $Z$; $z_{il} = 1$ means that data item $i$ belongs to cluster $l$, and $z_{il} = 0$ otherwise. The task is to find the minimizer of
$$\min_Z \; \sum_{i,j=1}^{N} d_{ij} \sum_{l=1}^{K} z_{il} z_{jl} \qquad (2.1)$$
subject to
$$\sum_{l=1}^{K} z_{il} = 1 \quad \text{for all } i, \qquad (2.2)$$
$$\sum_{i=1}^{N} z_{il} \ge 1 \quad \text{for all } l, \qquad (2.3)$$
$$z_{il} \in \{0, 1\} \quad \text{for all } i, l. \qquad (2.4)$$
Note that $\sum_{l=1}^{K} z_{il} z_{jl} = 1$ if data items $i$ and $j$ belong to the same cluster, and zero otherwise. The constraint $\sum_{l=1}^{K} z_{il} = 1$ means that data item $i$ is assigned to exactly one cluster. The constraint $\sum_{i=1}^{N} z_{il} \ge 1$ says that each cluster has at least one data item. As given, this is a binary quadratic optimization problem. We assume that $d_{ii} = 0$; that is, the distance or dissimilarity between any object and itself is zero.
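To make the combinatorial formulation (2.1–2.4) concrete, the sketch below (a hypothetical Python/NumPy illustration, not the authors' code; the helper names are mine) builds the membership matrix $Z$ from an assignment vector, checks the constraints, and verifies the identity $\sum_{i,j} d_{ij} \sum_l z_{il} z_{jl} = D \bullet ZZ^T = \operatorname{trace}(Z^T D Z)$ used in the next section.

```python
import numpy as np

def membership_matrix(assign, K):
    """Binary N x K matrix Z with z_il = 1 iff item i is assigned to cluster l."""
    N = len(assign)
    Z = np.zeros((N, K))
    Z[np.arange(N), assign] = 1.0
    return Z

def rao_objective(D, Z):
    """Objective (2.1): sum of d_ij over pairs (i, j) in the same cluster."""
    return np.trace(Z.T @ D @ Z)        # equals the Frobenius product D . (Z Z^T)

# Hypothetical example: 6 items, K = 2 clusters.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
D = ((X[:, None, :] - X[None, :, :])**2).sum(axis=-1)   # d_ij = ||x_i - x_j||^2
assign = np.array([0, 0, 0, 1, 1, 1])
Z = membership_matrix(assign, K=2)

# Constraints (2.2)-(2.4) hold for this Z:
assert np.all(Z.sum(axis=1) == 1) and np.all(Z.sum(axis=0) >= 1)
# The two ways of writing the objective agree:
assert np.isclose(rao_objective(D, Z), np.sum(D * (Z @ Z.T)))
```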
2.1 Relaxations and simplifications

Solving (2.1–2.4) directly is likely to be difficult. Instead we relax some of the constraints to obtain a simpler and easier-to-solve problem. First we remove the constraint $\sum_{i=1}^{N} z_{il} \ge 1$ that each cluster has at least one data item. This allows clusters with no elements; essentially it allows any number of clusters up to $K$. If $K \ge N$, then the objective function can be set to zero by having $N$ clusters, each containing exactly one data item.

The most important constraints are that $z_{il} \in \{0, 1\}$ for all $i, l$, and that $\sum_{l=1}^{K} z_{il} = 1$ for all $i$. If we convexify these constraints we get $z_{il} \ge 0$ for all $i, l$ and $\sum_{l=1}^{K} z_{il} = 1$ for all $i$. This is equivalent to requiring that each row of $Z$ belongs to a unit simplex.

The objective function can be conveniently written in terms of standard matrix operations: $\sum_{l=1}^{K} z_{il} z_{jl} = (ZZ^T)_{ij}$, so the objective function is
$$\sum_{i,j=1}^{N} d_{ij} \sum_{l=1}^{K} z_{il} z_{jl} = D \bullet ZZ^T = \operatorname{trace}(Z^T D Z).$$
This is not a convex function unless $D$ is a positive semi-definite matrix. In clustering problems, since $d_{ii} = 0$ and $d_{ij} > 0$ for $i \ne j$, $D$ is not a positive semi-definite matrix.

These relaxations of the constraints lead to the following optimization problem:
$$\min_Z \; D \bullet ZZ^T \qquad (2.5)$$
subject to
$$Z e^{(K)} = e^{(N)}, \qquad (2.6)$$
$$Z \ge 0 \quad \text{(componentwise)}, \qquad (2.7)$$
where $e^{(m)}$ is the $m$-dimensional vector of ones. Since we obtain this optimization problem by relaxing (or removing) constraints of the original problem (2.1–2.4), we say that (2.5–2.7) is a relaxation of (2.1–2.4).

The optimization problem (2.5–2.7) is still not a convex optimization problem, since $Z \mapsto D \bullet ZZ^T$ is not a convex function. We can make the objective function convex by writing it as $D \bullet Y$ with $Y = ZZ^T$, since linear functions are convex. The problem with this is that the constraint $Y = ZZ^T$ is not linear, so that again the problem is not convex. However, if we carry out a further relaxation of "$Y = ZZ^T$", we can obtain a convex relaxation: "$Y \succeq ZZ^T$" is equivalent to the condition that
$$\begin{bmatrix} Y & Z \\ Z^T & I \end{bmatrix} \succeq 0. \qquad (2.8)$$
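The equivalence between "$Y \succeq ZZ^T$" and the block condition (2.8) is the Schur complement lemma. The short numerical check below (an illustrative Python/NumPy sketch, not part of the paper) compares the smallest eigenvalue of $Y - ZZ^T$ with that of the block matrix in (2.8) for random data; the two tests succeed or fail together.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 6, 3
Z = rng.random((N, K))

def block_matrix(Y, Z):
    """The (N+K) x (N+K) matrix [[Y, Z], [Z^T, I]] appearing in (2.8)."""
    K = Z.shape[1]
    return np.block([[Y, Z], [Z.T, np.eye(K)]])

def min_eig(A):
    return np.linalg.eigvalsh(A).min()

# Case 1: Y = Z Z^T + small PSD term, so Y >= Z Z^T; block matrix is PSD.
Y1 = Z @ Z.T + 0.1 * np.eye(N)
print(min_eig(Y1 - Z @ Z.T) >= -1e-10, min_eig(block_matrix(Y1, Z)) >= -1e-10)

# Case 2: Y slightly "smaller" than Z Z^T; both tests fail together.
Y2 = Z @ Z.T - 0.1 * np.eye(N)
print(min_eig(Y2 - Z @ Z.T) >= -1e-10, min_eig(block_matrix(Y2, Z)) >= -1e-10)
```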
Since the set of positive semi-definite matrices is a convex set, the set of pairs $(Y, Z)$ satisfying (2.8) is a convex set. Combining the relaxations gives the following SDP:
$$\min_{Z, Y} \; D \bullet Y \qquad (2.9)$$
subject to
$$\begin{bmatrix} Y & Z \\ Z^T & I \end{bmatrix} \succeq 0, \qquad (2.10)$$
$$Z e^{(K)} = e^{(N)}, \qquad (2.11)$$
$$Z \ge 0. \qquad (2.12)$$
It is also possible to add the constraint
$$Y \ge 0 \quad \text{(componentwise)}. \qquad (2.13)$$
Since the entries of $D$ are distances, $D \ge 0$ componentwise; (2.13) then ensures that $D \bullet Y \ge 0$. A relaxation could also include the constraint $Y \ge ZZ^T$ componentwise, but this is not a convex constraint.

2.2 Symmetry properties

Both the original problem (2.1–2.4) and the relaxation (2.9–2.13) have a symmetry property: if $P$ is a $K \times K$ permutation matrix and $Z' = ZP$, then
$$D \bullet (Z' Z'^T) = D \bullet (ZP P^T Z^T) = D \bullet (ZZ^T),$$
and $(Y, Z')$ satisfies all the constraints if and only if $(Y, Z)$ satisfies all the constraints: $Z \ge 0$ componentwise if and only if $ZP \ge 0$; $Z' e^{(K)} = ZP e^{(K)} = Z e^{(K)} = e^{(N)}$; and
$$\begin{bmatrix} Y & Z' \\ Z'^T & I \end{bmatrix} = \begin{bmatrix} I & 0 \\ 0 & P^T \end{bmatrix} \begin{bmatrix} Y & Z \\ Z^T & I \end{bmatrix} \begin{bmatrix} I & 0 \\ 0 & P \end{bmatrix} \succeq 0.$$
Intuitively, this amounts to saying that permuting the cluster numbers does not change the problem. For supervised classification problems this is no longer true, since some data items are pre-assigned a cluster or group number. However, if the number of pre-assigned data items is small, it would be approximately true even then.

Because the convexified problem (2.9–2.13) has this property, the solution set also has this symmetry. For convex optimization problems, the solution set is a convex set; for strictly convex optimization problems (where the objective function is strictly convex, satisfying the strict inequality $\phi(\theta x + (1-\theta) y) < \theta \phi(x) + (1-\theta) \phi(y)$ whenever $x \ne y$ and $0 < \theta < 1$), the solution is unique. Strictly convex functions include $Z \mapsto Z \bullet Z$, and multiplying a strictly convex function by a positive number gives a strictly convex function, so minimizing $D \bullet Y + \varepsilon\,(Z \bullet Z + Y \bullet Y)$ over $Z$ and $Y$ subject to the constraints (2.10–2.13) for any $\varepsilon > 0$ has a unique solution. As
$$(ZP) \bullet (ZP) = \operatorname{trace}(P^T Z^T Z P) = \operatorname{trace}(Z^T Z) = Z \bullet Z$$
is also invariant under the symmetry, the unique solution of the strictly convex approximation also has this symmetry. Interior point methods for convex optimization problems typically converge to the centroid of the solution set, which for (2.9–2.13) must also have this symmetry. Thus any reasonable method for the convex relaxation (2.9–2.13) will converge to a unique point which satisfies the symmetry $Z = ZP$. Unfortunately, this means that each row $z^T$ of $Z$ satisfies the symmetry $z^T = z^T P$. Since $P$ can be any permutation matrix, this indicates that each entry of $z^T$ must be the same. Thus $z^T = (e^{(K)})^T / K$. Since this is true for each row of $Z$, we have the optimum $Z = e^{(N)} (e^{(K)})^T / K$. Since $Z$ was meant to be the main decision variable in this formulation, it is disappointing that the optimal $Z$ gives no information about the optimal clustering.

2.3 A non-convex relaxation

Instead of the fully convex relaxation, we consider the partial relaxation (2.5–2.7). This partial relaxation is a non-convex optimization problem, and so can have multiple local minima. It is also invariant under the transformation $Z \mapsto ZP$ where $P$ is a permutation matrix. Thus we expect at least $K!$ local minima, each local minimizer being a relabeling of the others. There may be other local minima, but hopefully not many.
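The relabeling symmetry of Sections 2.2–2.3 is easy to check numerically. The sketch below (illustrative Python/NumPy, not from the paper) verifies that both the objective $D \bullet ZZ^T$ and the feasibility conditions (2.6)–(2.7) are unchanged when the columns of $Z$ are permuted.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 8, 3

# A random feasible Z: nonnegative rows summing to one (rows on the unit simplex).
Z = rng.random((N, K))
Z /= Z.sum(axis=1, keepdims=True)

# A symmetric dissimilarity matrix with zero diagonal.
A = rng.random((N, N))
D = A + A.T
np.fill_diagonal(D, 0.0)

# A random K x K permutation matrix P and the relabeled Z' = Z P.
P = np.eye(K)[rng.permutation(K)]
Zp = Z @ P

# Objective and constraints are invariant under the relabeling.
obj = lambda Z: np.sum(D * (Z @ Z.T))             # D . (Z Z^T)
assert np.isclose(obj(Z), obj(Zp))
assert np.allclose(Zp.sum(axis=1), 1.0) and np.all(Zp >= 0)
```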
Another non-convex but partially convexified approximate problem can be obtained by replacing the objective in (2.5) by
$$\min_Z \; D \bullet ZZ^T + \alpha \|Z\|_{\mathrm{nuc}}, \qquad (2.14)$$
where
$$\|Z\|_{\mathrm{nuc}} = \max_{U, V \ \text{orthogonal}} \operatorname{trace}(U^T Z V) = \sum_{i=1}^{\min(K,N)} \sigma_i(Z),$$
with $\sigma_i(A)$ the $i$th singular value of $A$. The norm $\|Z\|_{\mathrm{nuc}}$ is called the nuclear norm of $Z$. The addition of $\alpha \|Z\|_{\mathrm{nuc}}$ to the objective function makes the objective "more convex", but not necessarily convex. The use of the nuclear norm has particular relevance as it is closely associated with minimizing the rank of $Z$ [18], which corresponds to the number of clusters. As $\alpha \ge 0$ increases from zero, the number of clusters represented by $Z$ should decrease. If $\alpha \to \infty$, then eventually there is only one cluster.

3. IMPLEMENTATION AND COMPUTATIONAL RESULTS

3.1 Implementation

The optimization task was implemented by a gradient-projection method. For a step length $s$, steps are taken in the directions of the negative gradients of $Z \mapsto D \bullet ZZ^T$ and $Z \mapsto \alpha \|Z\|_{\mathrm{nuc}}$, and then the result is projected onto the nearest point in the feasible set $\{\, Z \mid Z \ge 0 \text{ and } Z e^{(K)} = e^{(N)} \,\}$. The distance measure used was the Frobenius norm $\|Z\|_F = \sqrt{Z \bullet Z} = \big(\sum_{i,j} z_{ij}^2\big)^{1/2}$. The gradient direction of the nuclear norm $Z \mapsto \|Z\|_{\mathrm{nuc}}$ can be computed by means of a reduced singular value decomposition (SVD). If $Z = U \Sigma V^T$ is the reduced SVD ($U$ is $N \times K$ and $V$ is $K \times K$, both with orthonormal columns, and $\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_K)$), then the gradient step is $\delta Z = -s\,\alpha\, U V^T$, which is equivalent to setting $\sigma_i \leftarrow \sigma_i - \alpha s$. In order to avoid negative singular values, they are replaced by zero: $\sigma_i \leftarrow \max(\sigma_i - \alpha s, 0)$ for $i = 1, 2, \ldots, K$. The projection onto the feasible set amounts to projecting each row of $Z$ onto the unit simplex $\{\, x \mid x \ge 0 \text{ and } e^T x = 1 \,\}$. This can be done using the algorithm in [11]. These gradient and projection steps are repeated for a user-specified number of steps.

To compare two clusterings with the same number of clusters, we need to identify an optimal matching between the clusters of each clustering. So for clusterings $C = \{C_1, C_2, \ldots, C_K\}$ and $D = \{D_1, D_2, \ldots, D_K\}$, where the clusters $C_i$ and $D_j$ are subsets of the data items $\{1, 2, \ldots, N\}$, we need to find a permutation $\pi$ so that cluster $C_i$ is matched to cluster $D_j$ with $j = \pi(i)$. To do this in an optimal way, we first create a matrix $B = (b_{ij})$ where $b_{ij}$ is the number of data items in $C_i \cap D_j$. We then want to find a permutation matrix $P$ that maximizes $\operatorname{trace}(BP)$, which is the sum of the diagonal entries of $BP$. This task is equivalent to a linear programming problem [3] because the objective function is linear and the convex hull of the $N \times N$ permutation matrices is the set of $N \times N$ doubly stochastic matrices $\{\, P \in \mathbb{R}^{N \times N} \mid P \ge 0,\ P e^{(N)} = e^{(N)},\ P^T e^{(N)} = e^{(N)} \,\}$. Thus this task can be accomplished by a simplex-type algorithm which swaps pairs of columns. The method was implemented using MATLAB 7.6.1.

3.2 Computational results

Computational results were obtained for a rat central nervous system study (CNS-Rat) [20]. The algorithm was run on the CNS-Rat data set with $d_{ij} = \|v_i - v_j\|_2^2$ using the normalized values (maximum levels of gene expression set to one for each gene), $\alpha = 10$, and a step size $s = 5 \times 10^{-3}$. Significantly larger values of $s$, such as $s = 10^{-2}$, resulted in the objective function increasing rather than decreasing. Several initial values of $Z$ were generated, and 1000 gradient-projection steps were carried out.
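The gradient-projection iteration described in Section 3.1 can be sketched as follows. This is a hypothetical Python/NumPy rendering of one iteration under stated assumptions (a symmetric $D$, so that the gradient of $Z \mapsto D \bullet ZZ^T$ is $2DZ$; a standard sort-based simplex projection is used in place of the finite algorithm of [11], which computes the same point); it is not the authors' MATLAB code, and the helper names are mine.

```python
import numpy as np

def project_row_to_simplex(y):
    """Euclidean projection of a vector onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(y)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(y + theta, 0.0)

def gradient_projection_step(Z, D, s, alpha):
    """One step of the gradient-projection scheme sketched in Section 3.1."""
    # Negative gradient step for Z -> D . (Z Z^T); gradient is 2 D Z for symmetric D.
    Z = Z - s * 2.0 * D @ Z
    # Nuclear-norm step via the reduced SVD: sigma_i <- max(sigma_i - alpha*s, 0).
    U, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
    Z = U @ np.diag(np.maximum(sigma - alpha * s, 0.0)) @ Vt
    # Project each row of Z back onto the unit simplex.
    return np.apply_along_axis(project_row_to_simplex, 1, Z)

# Hypothetical usage with random data (N = 20 items, K = 4 clusters).
rng = np.random.default_rng(4)
X = rng.random((20, 5))
D = ((X[:, None, :] - X[None, :, :])**2).sum(axis=-1)
Z = rng.random((20, 4))
Z = np.apply_along_axis(project_row_to_simplex, 1, Z)   # feasible starting point
for _ in range(100):
    Z = gradient_projection_step(Z, D, s=5e-3, alpha=10.0)
```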
The computed $Z$ were then "quantized" by setting $\hat{z}_{ij} = 1$ if $z_{ij} = \max_k z_{ik}$, and $\hat{z}_{ik} = 0$ for $k \ne j$, for each $i$. The clusters so computed were then compared with the published clusters given in [20]. These published clusters were obtained using the Fitch–Margoliash method [5] for creating phylogenetic trees using a least-squares technique.

Twenty runs were carried out for this data set with different randomized starting values for $Z$, the entries being pseudo-random numbers in the range $[0, 1]$. Each row of these matrices was projected onto the unit simplex to ensure feasibility. Objective function values were recorded for the computed $Z$ matrices and for the "quantized" $Z$ matrices. Each run took about 30 to 40 seconds.

3.2.1 Multiple local minima

From the diagnostic output it appeared that the gradient-projection method had approached a local minimum or had stagnated in each run. The computed matrices $Z$ typically had all but about six to twelve entries equal to zero or one; some of the remaining entries were clearly close to zero or one. A serious concern about this algorithm is that the computed matrices $Z$ might converge to points that are local minima but far from a global minimum. The runs show evidence that there are significant local minima that are not global minima, as can be seen in Table 1. In spite of the multiple local minima, only a few runs are required to obtain consistent values. The computed $Z$ giving the best computed objective value is denoted $Z^*$; the quantized version of $Z^*$ is $\hat{Z}^*$. As we might hope, the objective function values for the quantized matrices $\hat{Z}^*$ are only slightly above those for the computed $Z^*$.

    Run   phi(Z*)   phi(Zhat*)  |  Run   phi(Z*)   phi(Zhat*)
     1    1531.84   1532.57     |   11   1540.73   1542.23
     2    1540.65   1542.17     |   12   1531.84   1532.57
     3    1534.43   1535.63     |   13   1534.49   1536.11
     4    1566.70   1567.32     |   14   1534.64   1534.81
     5    1532.14   1532.86     |   15   1569.72   1572.10
     6    1531.84   1532.57     |   16   1531.84   1532.57
     7    1532.14   1532.86     |   17   1531.84   1532.57
     8    1533.89   1535.92     |   18   1532.14   1532.86
     9    1533.26   1533.52     |   19   1532.14   1532.86
    10    1532.14   1532.86     |   20   1531.84   1532.57

Table 1: Objective function values $\phi(Z) := \operatorname{trace}(Z^T W Z) + \alpha \|Z\|_{\mathrm{nuc}}$ for runs with different starting points, for the computed matrices $Z^*$ and the quantized matrices $\hat{Z}^*$. The smallest computed value is 1531.84 (quantized: 1532.57); the second smallest is 1532.14 (quantized: 1532.86).

    Run   best computed   published  |  Run   best computed   published
     1          0             27     |   11         6             29
     2          7             28     |   12         0             27
     3          4             29     |   13         3             27
     4         13             28     |   14         8             32
     5          2             28     |   15        19             25
     6          0             27     |   16         0             27
     7          2             28     |   17         0             27
     8          1             27     |   18         2             28
     9          3             29     |   19         2             28
    10          2             28     |   20         0             27

Table 2: Mismatches between each computed clustering and (a) the best computed clustering and (b) the published clustering.

3.2.2 Comparison between the computed clusterings and the published clustering

The objective function values given in Table 1 are well below the objective function value for the published clustering: $\phi(Z_{\mathrm{pub}}) \approx 2130.82$. Even if we ignore the nuclear norm part of the objective function, we see that $\operatorname{trace}(Z_{\mathrm{pub}}^T W Z_{\mathrm{pub}}) \approx 1876.76$, which is still much higher than the values computed for $\phi(\hat{Z}^*)$, which include this term. The computed quantized matrices tend to be very consistent in terms of the clusterings they produce, as can be seen in the mismatches between the different computed clusterings (the number of data items clustered differently) shown in Table 2. The disagreements between the computed and "exact" clusterings appear to be fairly consistent, at least in terms of numbers of mismatches.
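The quantization and mismatch counts reported in Tables 1 and 2 can be reproduced in outline by the following sketch (hypothetical Python, not the paper's MATLAB implementation; the helper names are mine). It quantizes a computed $Z$ row by row and then counts mismatches against a reference clustering after optimally matching cluster labels, i.e. maximizing $\operatorname{trace}(BP)$ over permutation matrices $P$; the Hungarian algorithm from SciPy is used here instead of the simplex-type column-swapping scheme mentioned in Section 3.1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def quantize(Z):
    """Round Z to a 0/1 membership matrix: one 1 per row, at the largest entry."""
    Zhat = np.zeros_like(Z)
    Zhat[np.arange(Z.shape[0]), Z.argmax(axis=1)] = 1.0
    return Zhat

def mismatches(labels_a, labels_b, K):
    """Number of items clustered differently, after the best label matching."""
    # B[i, j] = number of items in cluster i of labels_a and cluster j of labels_b.
    B = np.zeros((K, K), dtype=int)
    for a, b in zip(labels_a, labels_b):
        B[a, b] += 1
    # Permutation maximizing trace(B P), via the Hungarian algorithm on -B.
    rows, cols = linear_sum_assignment(-B)
    matched = B[rows, cols].sum()
    return len(labels_a) - matched

# Hypothetical example: compare a quantized Z with a reference labeling.
rng = np.random.default_rng(5)
Z = rng.random((30, 4))
labels_computed = quantize(Z).argmax(axis=1)
labels_reference = rng.integers(0, 4, size=30)
print(mismatches(labels_computed, labels_reference, K=4))
```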
Another claimed clustering provided with the data had 62 mismatches with the "exact" clustering; this clustering had 60 mismatches with the best computed clustering.

4. CONCLUSIONS

This paper presents a matrix-based optimization method for clustering data that is suitable for data sets of biological interest. The method is practical and appears capable of accurately clustering data sets. A number of refinements are possible to improve computation times. The step length can be adjusted during the computation, which should aid convergence, and the code can be modified to improve the execution time. Implementation in a compiled rather than an interpreted language can also dramatically reduce the execution time.

5. REFERENCES

[1] Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering. In ACM SIGIR Conference on Research and Development in Information Retrieval, pages 113–120, 2002.
[2] Steven J. Benson, Yinyu Ye, and Xiong Zhang. Solving large-scale sparse semidefinite programs for combinatorial optimization. SIAM J. Optim., 10(2):443–461, 2000.
[3] G. B. Dantzig. Linear Programming and Extensions. Princeton University Press, Princeton, NJ, 1963.
[4] V. Faber. Clustering and the continuous k-means algorithm. Technical Report LA_Science_22n15, Los Alamos National Laboratory, November 1994.
[5] Walter M. Fitch and Emanuel Margoliash. Construction of phylogenetic trees. Science, 155(3760):279–284, 1967.
[6] Anil K. Ghosh. On optimum choice of k in nearest neighbor classification. Comput. Statist. Data Anal., 50(11):3113–3123, 2006.
[7] Pierre Hansen and Brigitte Jaumard. Cluster analysis and mathematical programming. Math. Programming, 79(1–3, Ser. B):191–215, 1997. Lectures on mathematical programming (ISMP97) (Lausanne, 1997).
[8] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2–3):89–112, 2004.
[9] Masakazu Kojima, Susumu Shindoh, and Shinji Hara. Interior-point methods for the monotone semidefinite linear complementarity problem in symmetric matrices. SIAM J. Optim., 7(1):86–125, 1997.
[10] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inform. Theory, 28(2):129–137, 1982.
[11] C. Michelot. A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. J. Optim. Theory Appl., 50(1):195–200, 1986.
[12] S. T. Milagre, C. D. Maciel, A. A. Shinoda, M. Hungria, and J. R. B. Almeida. Multidimensional cluster stability analysis from a Brazilian Bradyrhizobium sp. RFLP/PCR data set. J. Comput. Appl. Math., 227(2):308–319, 2009.
[13] S. Oliveira, T. Soma, and D. Stewart. Semidefinite programming for graph partitioning with preference in data distribution. In High Performance Computing for Computational Science — VECPAR 2002, volume 2565 of Lecture Notes in Computer Science, pages 307–321. Springer, Heidelberg, 2003.
[14] James Joseph Palmersheim. Nearest Neighbor Classification Rules: Small Sample Performance and Comparison with the Linear Discriminant Function and the Optimum Rule. PhD thesis, University of California, Los Angeles. ProQuest LLC, Ann Arbor, MI, 1970.
[15] Javier Peña, Juan Vera, and Luis F. Zuluaga. Computing the stability number of a graph via linear and semidefinite programming. SIAM J. Optim., 18(1):87–105, 2007.
[16] Jiming Peng and Yu Wei. Approximating K-means-type clustering via semidefinite programming. SIAM J. Optim., 18(1):186–205, 2007.
[17] M. R. Rao. Cluster analysis and mathematical programming. Journal of the American Statistical Association, 66(335):622–626, 1971.
[18] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev., 52(3):471–501, 2010.
[19] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38(1):49–95, 1996.
[20] X. Wen, S. Fuhrman, G. S. Michaels, D. B. Carr, S. Smith, J. L. Barker, and R. Somogyi. Large-scale temporal gene expression mapping of central nervous system development. Proceedings of the National Academy of Sciences of the United States of America, 95(1):334–339, 1998.
[21] Gordon Wilfong. Nearest neighbor problems. Internat. J. Comput. Geom. Appl., 2(4):383–416, 1992.
[22] Henry Wolkowicz. Semidefinite and Lagrangian relaxations for hard combinatorial problems. In System Modelling and Optimization (Cambridge, 1999), pages 269–309. Kluwer Academic Publishers, Boston, MA, 2000.
[23] Henry Wolkowicz and Miguel F. Anjos. Semidefinite programming for discrete optimization and matrix completion problems. Discrete Appl. Math., 123(1–3):513–577, 2002. Workshop on Discrete Optimization, DO'99 (Piscataway, NJ).
[24] Henry Wolkowicz and Qing Zhao. Semidefinite programming relaxations for the graph partitioning problem. Discrete Appl. Math., 96/97:461–479, 1999.
[25] Xiang Yin and Alex Tay Leng Phuan. Genetic algorithm based K-means fast learning artificial neural network. In AI 2004: Advances in Artificial Intelligence, volume 3339 of Lecture Notes in Comput. Sci., pages 828–839. Springer, Berlin, 2004.