Fast and accurate metagenotyping of the human gut microbiome with GT-Pro


Single nucleotide polymorphisms (SNPs) in metagenomics are used to quantify population structure, track strains and identify genetic determinants of microbial phenotypes. However, existing alignment-based approaches for metagenomic SNP detection require high-performance computing and enough read coverage to distinguish SNPs from sequencing errors. To address these issues, we developed the GenoTyper for Prokaryotes (GT-Pro), a suite of methods to catalog SNPs from genomes and use unique k-mers to rapidly genotype these SNPs from metagenomes. Compared to methods that use read alignment, GT-Pro is more accurate and two orders of magnitude faster. Using high-quality genomes, we constructed a catalog of 104 million SNPs in 909 human gut species and used unique k-mers targeting this catalog to characterize the global population structure of gut microbes from 7,459 samples. GT-Pro enables fast and memory-efficient metagenotyping of millions of SNPs on a personal computer.
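The core idea of genotyping SNPs by exact k-mer matching rather than read alignment can be illustrated with a minimal Python sketch. This is not GT-Pro's implementation: the function names `snp_kmers` and `genotype`, the toy sequences and the small value of k are all hypothetical, and the sketch omits reverse-complement handling and the optimized data structures a real tool would need. It assumes a single biallelic SNP at a known position in one reference sequence, and it treats a k-mer as "unique" if it is absent from a caller-supplied background set.

```python
from collections import Counter


def snp_kmers(ref, pos, alt, k, background=frozenset()):
    """Map every k-mer covering the SNP at `pos` to the allele it carries.

    k-mers that also occur in `background` (for example, k-mers found
    elsewhere in other genomes) are dropped, so only unique probes remain.
    """
    probes = {}
    for allele, base in (("ref", ref[pos]), ("alt", alt)):
        seq = ref[:pos] + base + ref[pos + 1:]
        first = max(0, pos - k + 1)          # leftmost window covering pos
        last = min(pos, len(seq) - k)        # rightmost window covering pos
        for start in range(first, last + 1):
            kmer = seq[start:start + k]
            if kmer not in background:
                probes[kmer] = allele
    return probes


def genotype(reads, probes, k):
    """Count exact probe matches per allele across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            allele = probes.get(read[i:i + k])
            if allele is not None:
                counts[allele] += 1
    return counts


# Toy example (hypothetical data): one SNP, A -> G, at position 7.
reference = "ACGTTGCATGCCAGT"
probes = snp_kmers(reference, pos=7, alt="G", k=5)
reads = ["GTTGCATGC", "TTGCGTGCC"]
allele_counts = genotype(reads, probes, k=5)
print(dict(allele_counts))
```

Because each read is scanned once with constant-time dictionary lookups, the cost grows linearly with the amount of sequence, which gives an intuition for why exact k-mer matching can be much faster than alignment.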

Fig. 1: In silico metagenotyping framework.
Fig. 2: Genetic landscape of 909 human gut species.
Fig. 3: Computational performance evaluation of GT-Pro.
Fig. 4: Metagenotyping accuracy evaluation of GT-Pro using simulations.
Fig. 5: Metagenotyping and gene imputation from gut metagenomes.
Fig. 6: Global genetic structure in 7,459 human gut metagenomes.

Data availability

All described datasets are publicly available through the corresponding repositories. Genome assemblies used to build GT-Pro in this study were downloaded from the UHGG database and are available at MGnify (http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes). The 1,171 C. difficile genomes are available at NCBI RefSeq (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Clostridioides_difficile/), and the accession numbers of 114 high-quality nonredundant C. difficile genomes are in Supplementary Table 15. All metagenomic samples are available at NCBI SRA (https://www.ncbi.nlm.nih.gov/sra) with accession numbers in supplementary tables: 25,133 human microbiome samples (Supplementary Table 8), Tanzania (Supplementary Table 9), North America (Supplementary Table 10), Madagascar (Supplementary Table 11), the North American IBD cohort (Supplementary Table 12) and global biogeography samples (Supplementary Table 13). The GT-Pro SNP databases and genotype profiles of 25,133 human microbiome samples generated in this study are available on a publicly accessible cloud server (https://fileshare.czbiohub.org/s/waXQzQ9PRZPwTdk) and can be accessed through GitHub (https://github.com/zjshi/gt-pro).

Code availability

The implementation and documentation of GT-Pro are available on GitHub (https://github.com/zjshi/gt-pro). GT-Pro is written in C++ with Python scripts and is released as open-source software under the MIT license.


Author contributions

K.S.P. and S.N. conceived the project. K.S.P., S.N. and Z.J.S. designed experiments and drafted the manuscript. Z.J.S. conducted experiments, analyzed data, made figures and wrote software. B.D. wrote software and contributed to analysis of software performance. C.Z. contributed to analysis of structural variation imputation and tested software. K.S.P. supervised the project and provided computational resources and funding. K.S.P. and S.N. provided feedback. All authors read, edited and reviewed the paper.

Corresponding authors

Correspondence to Stephen Nayfach or Katherine S. Pollard.

Ethics declarations

Competing interests

The authors declare no competing interests.

Supplementary information

Supplementary Information

Supplementary Figs. 1–42.

Reporting Summary.

Supplementary Tables

Supplementary Tables 1–15.

