Abstract
DNA micro-arrays now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active, hyperactive or silent in normal or cancerous tissue. Because these new micro-array devices generate bewildering amounts of raw data, new analytical methods must be developed to determine whether cancer tissues have gene expression signatures that distinguish them from normal tissues or from other types of cancer tissue.
In this paper, we address the problem of selecting a small subset of genes from broad patterns of gene expression data recorded on DNA micro-arrays. Using available training examples from cancer and normal patients, we build a classifier suitable for genetic diagnosis, as well as drug discovery. Previous attempts to address this problem select genes with correlation techniques. We propose a new method of gene selection utilizing Support Vector Machine methods based on Recursive Feature Elimination (RFE). We demonstrate experimentally that the genes selected by our techniques yield better classification performance and are biologically relevant to cancer.
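The idea of recursive feature elimination with a linear SVM can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything below is an assumption made for illustration: the function names `train_linear_svm` and `svm_rfe` are hypothetical, the trainer is a simple Pegasos-style subgradient method standing in for a proper SVM solver, one feature is removed per iteration, features are ranked by squared weight (the usual criterion for linear SVM-RFE), and the data are synthetic rather than micro-array measurements.

```python
import random

def train_linear_svm(X, y, features, epochs=200, lam=0.01, seed=0):
    """Pegasos-style subgradient descent on the regularized hinge loss.

    Stand-in trainer (assumption): linear, no bias term, so it expects
    roughly centered data. Returns {feature index: weight}.
    """
    rng = random.Random(seed)
    w = {j: 0.0 for j in features}
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(w[j] * X[i][j] for j in features)
            for j in features:
                w[j] *= 1.0 - eta * lam      # shrink step from the L2 penalty
                if margin < 1.0:             # hinge loss is active for sample i
                    w[j] += eta * y[i] * X[i][j]
    return w

def svm_rfe(X, y):
    """Rank feature indices from most to least important.

    Each round retrains on the surviving features and removes the one
    with the smallest squared weight.
    """
    surviving = list(range(len(X[0])))
    eliminated = []
    while surviving:
        w = train_linear_svm(X, y, surviving)
        worst = min(surviving, key=lambda j: w[j] ** 2)
        surviving.remove(worst)
        eliminated.append(worst)
    return eliminated[::-1]  # last eliminated = most important

# Synthetic data (assumption): feature 0 carries the class signal,
# features 1-4 are pure noise.
data_rng = random.Random(42)
y = [1 if i % 2 == 0 else -1 for i in range(40)]
X = [[2.0 * label + data_rng.gauss(0.0, 0.5)]
     + [data_rng.gauss(0.0, 1.0) for _ in range(4)]
     for label in y]

ranking = svm_rfe(X, y)
print("features ranked from most to least important:", ranking)
```

Because the classifier is retrained after every removal, the weight of a feature is always judged in the context of the features that remain, which is what distinguishes this approach from scoring each gene independently with a correlation measure.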
In contrast with the baseline method, our method eliminates gene redundancy automatically and yields better and more compact gene subsets. In patients with leukemia, our method discovered 2 genes that yield zero leave-one-out error, whereas the baseline method needs 64 genes to reach its best result (one leave-one-out error). In the colon cancer database, using only 4 genes, our method is 98% accurate, while the baseline method is only 86% accurate.
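The leave-one-out error quoted above is the number of samples misclassified when each sample in turn is held out, the classifier is trained on the rest, and the held-out sample is predicted. A minimal sketch follows; the nearest-centroid classifier is a deliberately simple stand-in (an assumption, not the SVM used in the paper), the function names are hypothetical, and the six two-gene samples are made up for illustration.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose mean vector is closest to x (squared Euclidean distance)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [row for row, lab in zip(train_X, train_y) if lab == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_errors(X, y):
    """Count held-out samples that are misclassified."""
    errors = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]   # all samples except the i-th
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) != y[i]:
            errors += 1
    return errors

# Toy expression values for two genes (made-up numbers, assumption).
X = [[0.9, 0.1], [1.1, 0.3], [1.0, -0.2],
     [-1.0, 0.2], [-0.8, -0.1], [-1.2, 0.0]]
y = [1, 1, 1, -1, -1, -1]

loo_errors = leave_one_out_errors(X, y)
print("leave-one-out errors:", loo_errors, "out of", len(X))
```

With only a few dozen patients available, this procedure uses nearly all of the data for training in every round, which is why it is a common way to report error on small micro-array datasets.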
Cite this article
Guyon, I., Weston, J., Barnhill, S. et al. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning 46, 389–422 (2002). https://doi.org/10.1023/A:1012487302797
DOI: https://doi.org/10.1023/A:1012487302797