Abstract
Complex insertion and deletion (complex indel) is a rare category of genomic structural variation. A complex indel presents as one or more DNA fragments inserted at the genomic location where a deletion occurs. Several studies emphasize the importance of complex indels, and several state-of-the-art approaches have been proposed to detect them from sequencing data. However, genotyping complex indel calls remains a challenging computational problem, because some features commonly used to genotype indel calls from sequencing data can be invalid due to the components of complex indels. In this article, we therefore propose a machine learning approach, CIGenotyper, to estimate the genotypes of complex indel calls. CIGenotyper adopts a relevance vector machine (RVM) framework. For each candidate call, it first extracts a set of features from the candidate region, which typically include the read depth, the variant allelic frequency for aligned contigs, the numbers of split reads and discordant paired-end reads, etc. The features of a complex indel call are then given to a trained RVM, which outputs the genotype with the highest likelihood. An algorithm is also proposed to train the RVM. We compare our approach to two popular approaches, Gindel and Pindel, on multiple groups of artificial datasets. Our model outperforms both in average success rate in most cases as we vary the coverage of the given data, the read length, and the length distribution of the pre-set complex indels.
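The prediction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the kernel, the multi-class formulation, or the feature scaling, so the RBF kernel, the softmax over three genotype classes, the genotype labels, the assumption that features are pre-scaled to comparable ranges, and every name and number below (`CallFeatures`, `call_genotype`, the toy relevance vectors and weights) are hypothetical.

```python
import math
from dataclasses import dataclass

# Assumed genotype labels: homozygous reference, heterozygous, homozygous variant.
GENOTYPES = ("0/0", "0/1", "1/1")


@dataclass
class CallFeatures:
    """Features named in the abstract, assumed here to be pre-scaled."""
    read_depth: float
    contig_vaf: float        # variant allelic frequency for aligned contigs
    split_reads: float
    discordant_pairs: float

    def as_vector(self):
        return [self.read_depth, self.contig_vaf,
                self.split_reads, self.discordant_pairs]


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel; the kernel choice is an assumption."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def genotype_probabilities(x, relevance_vectors, weights, biases):
    """Sparse kernel expansion per class, normalized with a softmax."""
    scores = [
        biases[k] + sum(w * rbf_kernel(x, rv)
                        for w, rv in zip(weights[k], relevance_vectors))
        for k in range(len(GENOTYPES))
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def call_genotype(features, relevance_vectors, weights, biases):
    """Return the genotype with the highest probability, plus all probabilities."""
    probs = genotype_probabilities(features.as_vector(),
                                   relevance_vectors, weights, biases)
    best = max(range(len(GENOTYPES)), key=probs.__getitem__)
    return GENOTYPES[best], probs


# Toy "trained" model: one made-up relevance vector per genotype class.
toy_relevance_vectors = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.5, 0.5, 0.5],
    [1.0, 1.0, 1.0, 1.0],
]
toy_weights = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
toy_biases = [0.0, 0.0, 0.0]

candidate = CallFeatures(read_depth=1.0, contig_vaf=0.48,
                         split_reads=0.5, discordant_pairs=0.45)
genotype, probabilities = call_genotype(candidate, toy_relevance_vectors,
                                        toy_weights, toy_biases)
print(genotype, [round(p, 3) for p in probabilities])
```

In a trained RVM the relevance vectors and weights would come from the training algorithm mentioned in the abstract; here they are hand-picked so that a call with roughly half of its evidence supporting the variant lands closest to the heterozygous class.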
Acknowledgement
This work is supported by the National Science Foundation of China (Grant No: 31701150) and the Fundamental Research Funds for the Central Universities (CXTD2017003).
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Zheng, T. et al. (2018). CIGenotyper: A Machine Learning Approach for Genotyping Complex Indel Calls. In: Rojas, I., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2018. Lecture Notes in Computer Science, vol 10813. Springer, Cham. https://doi.org/10.1007/978-3-319-78723-7_41
DOI: https://doi.org/10.1007/978-3-319-78723-7_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-78722-0
Online ISBN: 978-3-319-78723-7
eBook Packages: Computer Science, Computer Science (R0)