Abstract
Microarray gene expression data typically contain a large number of genes but only a small number of samples. Moreover, only a few of these genes are relevant to cancer, which makes gene selection challenging. We therefore propose a two-stage gene selection approach, XGBoost-MOGA, that combines extreme gradient boosting (XGBoost) with a multi-objective optimization genetic algorithm for cancer classification on microarray datasets. In the first stage, genes are ranked by an ensemble-based feature selection method built on XGBoost. This stage can effectively remove irrelevant genes and yields a group of the genes most relevant to the class. In the second stage, XGBoost-MOGA searches this group for an optimal gene subset using a multi-objective optimization genetic algorithm. We performed comprehensive experiments comparing XGBoost-MOGA with other state-of-the-art feature selection methods, using two well-known learning classifiers on 14 publicly available microarray expression datasets. The experimental results show that XGBoost-MOGA yields significantly better results than previous state-of-the-art algorithms on several evaluation criteria, including accuracy, F-score, precision, and recall.
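The two-stage idea described above (rank genes, then search the top-ranked group with a multi-objective genetic algorithm) can be illustrated with a minimal sketch that uses only the Python standard library. It is not the authors' implementation. Several pieces are stand-ins and should be read as assumptions: the data are synthetic, stage 1 ranks genes by the absolute difference of class means rather than by XGBoost importance, accuracy comes from a leave-one-out nearest-centroid classifier, and the genetic algorithm uses a simple Pareto-rank selection. All function names and parameter values are hypothetical.

```python
import random

def make_data(n_samples=40, n_genes=60, n_informative=4, seed=0):
    """Synthetic stand-in for a microarray matrix: few samples, many genes."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in range(n_informative):          # only the first genes carry signal
            row[g] += 2.0 * label
        X.append(row)
        y.append(label)
    return X, y

def rank_genes(X, y, top_k):
    """Stage 1 stand-in: score genes by |class-mean difference|, keep the top_k.
    (The paper ranks genes with XGBoost instead; this is an assumption.)"""
    scores = []
    for g in range(len(X[0])):
        pos = [row[g] for row, lab in zip(X, y) if lab == 1]
        neg = [row[g] for row, lab in zip(X, y) if lab == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:top_k]

def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier on `genes`."""
    correct = 0
    for i in range(len(X)):
        dist = {}
        for lab in (0, 1):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == lab]
            centroid = [sum(r[g] for r in rows) / len(rows) for g in genes]
            dist[lab] = sum((X[i][g] - c) ** 2 for g, c in zip(genes, centroid))
        correct += min(dist, key=dist.get) == y[i]
    return correct / len(X)

def dominates(a, b):
    """a, b are (accuracy, n_genes): higher accuracy and fewer genes are better."""
    return a[0] >= b[0] and a[1] <= b[1] and a != b

def moga(X, y, candidates, pop_size=16, generations=10, seed=1):
    """Stage 2 sketch: Pareto-rank genetic algorithm over bit masks of `candidates`."""
    rng = random.Random(seed)
    n = len(candidates)
    cache = {}

    def evaluate(mask):
        if mask not in cache:
            genes = [g for g, bit in zip(candidates, mask) if bit]
            # An empty subset cannot classify; penalise it so it is always dominated.
            cache[mask] = (loo_accuracy(X, y, genes), len(genes)) if genes else (0.0, n + 1)
        return cache[mask]

    pop = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = rng.sample(pop, 2)
            cut = rng.randrange(1, n)                                 # one-point crossover
            child = p1[:cut] + p2[cut:]
            children.append(tuple(b ^ (rng.random() < 1.0 / n) for b in child))  # bit-flip mutation
        merged = list(set(pop + children))
        fits = {m: evaluate(m) for m in merged}
        # Pareto rank = number of individuals that dominate this one (0 = non-dominated).
        rank = {m: sum(dominates(fits[o], fits[m]) for o in merged) for m in merged}
        merged.sort(key=lambda m: (rank[m], -fits[m][0], fits[m][1]))
        pop = merged[:pop_size]
    front = [m for m in pop if not any(dominates(evaluate(o), evaluate(m)) for o in pop)]
    return [([g for g, bit in zip(candidates, m) if bit], evaluate(m)) for m in front]

X, y = make_data()
top = rank_genes(X, y, top_k=10)        # stage 1: keep the 10 highest-ranked genes
front = moga(X, y, top)                 # stage 2: search subsets of those genes
for genes, (acc, size) in sorted(front, key=lambda t: t[1][1]):
    print(f"{size} genes {sorted(genes)} -> leave-one-out accuracy {acc:.2f}")
```

Each printed line is one non-dominated trade-off between subset size and accuracy. Note that this sketch ranks genes on the full dataset before the leave-one-out evaluation, which biases the accuracy estimate upward; a careful study would repeat the ranking inside each validation fold.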
Acknowledgements
The authors would like to thank the anonymous referees for their comments. Research on this work was partially supported by grants from the National Natural Science Foundation of China (No. 621660028).
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Deng, X., Li, M., Deng, S. et al. Hybrid gene selection approach using XGBoost and multi-objective genetic algorithm for cancer classification. Med Biol Eng Comput 60, 663–681 (2022). https://doi.org/10.1007/s11517-021-02476-x
DOI: https://doi.org/10.1007/s11517-021-02476-x