Abstract
Single nucleotide polymorphisms (SNPs) in metagenomics are used to quantify population structure, track strains and identify genetic determinants of microbial phenotypes. However, existing alignment-based approaches for metagenomic SNP detection require high-performance computing and enough read coverage to distinguish SNPs from sequencing errors. To address these issues, we developed the GenoTyper for Prokaryotes (GT-Pro), a suite of methods to catalog SNPs from genomes and use unique k-mers to rapidly genotype these SNPs from metagenomes. Compared to methods that use read alignment, GT-Pro is more accurate and two orders of magnitude faster. Using high-quality genomes, we constructed a catalog of 104 million SNPs in 909 human gut species and used unique k-mers targeting this catalog to characterize the global population structure of gut microbes from 7,459 samples. GT-Pro enables fast and memory-efficient metagenotyping of millions of SNPs on a personal computer.
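The idea of genotyping known SNPs with unique k-mers can be illustrated with a toy sketch. This is not the GT-Pro implementation. The function names, the 31-base reference, the two SNPs and the choice of k = 5 are all hypothetical, and only the forward strand is considered. For each allele of each catalogued SNP, the sketch collects the k-mers that overlap the SNP site. It keeps only those that identify a single (SNP, allele) pair and do not occur elsewhere in the reference, then counts exact matches in reads.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_snp_kmer_index(reference, snps, k):
    """Map each unique SNP-covering k-mer to its (position, allele) pair.

    snps: dict of 0-based position -> (ref_allele, alt_allele).
    """
    candidates = defaultdict(set)
    for pos, alleles in snps.items():
        for allele in alleles:
            # Place this allele on the reference background.
            hap = reference[:pos] + allele + reference[pos + 1:]
            # Only windows that overlap the SNP site are informative.
            lo = max(0, pos - k + 1)
            hi = min(len(hap), pos + k)
            for kmer in kmers(hap[lo:hi], k):
                candidates[kmer].add((pos, allele))

    background = Counter(kmers(reference, k))
    index = {}
    for kmer, targets in candidates.items():
        if len(targets) != 1:
            continue  # ambiguous: matches more than one (SNP, allele) pair
        (pos, allele), = targets
        # A reference-allele k-mer occurs once in the reference (its own
        # locus); an alternate-allele k-mer should not occur at all.
        expected = 1 if allele == reference[pos] else 0
        if background[kmer] == expected:
            index[kmer] = (pos, allele)
    return index


def genotype(reads, index, k):
    """Count exact k-mer matches per SNP allele across all reads."""
    counts = defaultdict(Counter)
    for read in reads:
        for kmer in kmers(read, k):
            hit = index.get(kmer)
            if hit is not None:
                pos, allele = hit
                counts[pos][allele] += 1
    return {pos: dict(c) for pos, c in counts.items()}


# Toy data (hypothetical): two biallelic SNPs on a short reference.
K = 5
reference = "ACGTTGCATGGACCTAGGCTTAACGGATCCA"
snps = {8: ("T", "C"), 22: ("A", "G")}
reads = [
    "TGCACGGACCTA",  # carries the alternate allele C at position 8
    "GCTTAACGGATC",  # carries the reference allele A at position 22
]

index = build_snp_kmer_index(reference, snps, K)
result = genotype(reads, index, K)
print(result)  # {8: {'C': 4}, 22: {'A': 4}}
```

In this toy case the k-mer ACGGA is discarded because it arises both from the alternate allele at position 8 and from the reference allele at position 22, so it cannot distinguish the two. Because matching is exact and needs no alignment, the per-read work reduces to dictionary lookups.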
Data availability
All datasets described here are publicly available through the corresponding repositories. The genome assemblies used to build GT-Pro were downloaded from the UHGG database and are available at MGnify (http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes). The 1,171 C. difficile genomes are available at NCBI RefSeq (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Clostridioides_difficile/), and the accession numbers of the 114 high-quality nonredundant C. difficile genomes are listed in Supplementary Table 15. All metagenomic samples are available at NCBI SRA (https://www.ncbi.nlm.nih.gov/sra), with accession numbers in the supplementary tables: 25,133 human microbiome samples (Supplementary Table 8), Tanzania (Supplementary Table 9), North America (Supplementary Table 10), Madagascar (Supplementary Table 11), the North American IBD cohort (Supplementary Table 12) and global biogeography samples (Supplementary Table 13). The GT-Pro SNP databases and the genotype profiles of the 25,133 human microbiome samples generated in this study are available on a publicly accessible cloud server (https://fileshare.czbiohub.org/s/waXQzQ9PRZPwTdk) and can also be accessed through GitHub (https://github.com/zjshi/gt-pro).
Code availability
The implementation and documentation of GT-Pro are available on GitHub (https://github.com/zjshi/gt-pro). GT-Pro is written in C++ with Python scripts and is released as open-source software under the MIT license.
Acknowledgements
This study was funded by NSF (grant #1563159), Chan Zuckerberg Biohub, Chan Zuckerberg Initiative, and Gladstone Institutes.
Author information
Contributions
K.S.P. and S.N. conceived the project. K.S.P., S.N. and Z.J.S. designed experiments and drafted the manuscript. Z.J.S. conducted experiments, analyzed data, made figures and wrote software. B.D. wrote software and contributed to analysis of software performance. C.Z. contributed to analysis of structural variation imputation and tested software. K.S.P. supervised the project, provided computational resources and funding. K.S.P. and S.N. provided feedback. All authors read, edited and reviewed the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Biotechnology thanks Yun William Yu, Falk Hildebrand and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–42.
Supplementary Tables
Supplementary Tables 1–15.
About this article
Cite this article
Shi, Z.J., Dimitrov, B., Zhao, C. et al. Fast and accurate metagenotyping of the human gut microbiome with GT-Pro. Nat Biotechnol 40, 507–516 (2022). https://doi.org/10.1038/s41587-021-01102-3
DOI: https://doi.org/10.1038/s41587-021-01102-3