Sequence classification is valuable for reducing the complexity of metagenomes and providing a fundamental understanding of the composition of metagenomic samples. Binary metagenomic classifiers offer an insufficient solution because metagenomes of most natural environments are typically derived from multiple sequence sources including prokaryotes, eukaryotes and the viruses of both. Here we introduce a deep-learning based (not reference-based) sequence classifier, DeepMicrobeFinder, that classifies metagenomic contigs into five sequence classes, i.e., viruses infecting prokaryotic or eukaryotic hosts, eukaryotic or prokaryotic chromosomes, and prokaryotic plasmids. At different sequence lengths, DeepMicrobeFinder achieved area under the receiver operating characteristic curve (AUC) scores >0.9 for most sequence classes, the exception being distinguishing prokaryotic chromosomes from plasmids. By benchmarking on 20 test datasets with variable sequence class composition, we showed...
Little is known about the species composition and variability of natural bacterial communities, mostly because conventional identification requires pure cultures, but less than 1% of active natural bacteria are cultivable. This problem was circumvented by comparing species compositions via hybridization of total DNA of natural bacterioplankton communities for the estimation of the fraction of DNA in common between two samples (similarity). DNA probes that were labeled with ³⁵S by nick translation were hybridized to filter-bound DNA in a reciprocal fashion; similarities (in percent) were calculated by normalizing the values to self-hybridizations. In tests with DNA mixtures of pure cultures, the experimentally observed similarities agreed with expectations. However, reciprocal similarities (probe and target reversed) were often asymmetric, unlike those of DNA from single strains. This was due to the relative complexity and G + C content of DNA, which provided a means to interpret the...
We compared several currently discussed methods for the assessment of bacterial numbers and activity in marine waters, using samples from a variety of marine environments, from aged offshore seawater to rich harbor water. Samples were simultaneously tested for binding to a fluorescently labeled universal 16S rRNA probe; ³H-labeled amino acid uptake via autoradiography; nucleoid-containing bacterial numbers by modified DAPI (4′,6-diamidino-2-phenylindole) staining; staining with 5-cyano-2,3-ditolyl tetrazolium chloride (CTC), a compound supposed to indicate oxidative cell metabolism; and total bacterial counts (classical DAPI staining), taken as a reference. For the universal-probe counts, we used an image intensifying and processing system coupled to the epifluorescence microscope. All of the above-mentioned methods yielded lower cell counts than DAPI total counts. Universal-probe counts averaged about half of the corresponding DAPI count and were highly correlated to auto...
Our growing awareness of the microbial world's importance and diversity contrasts starkly with our limited understanding of its fundamental structure. Despite recent advances in DNA sequencing, a lack of standardized protocols and common analytical frameworks impedes comparisons among studies, hindering the development of global inferences about microbial life on Earth. Here we present a meta-analysis of microbial community samples collected by hundreds of researchers for the Earth Microbiome Project. Coordinated protocols and new analytical methods, particularly the use of exact sequences instead of clustered operational taxonomic units, enable bacterial and archaeal ribosomal RNA gene sequences to be followed across multiple studies and allow us to explore patterns of diversity at an unprecedented scale. The result is both a reference database giving global context to DNA sequence data and a framework for incorporating data from future studies, fostering increasingly complete ...
Marine Thaumarchaeota are abundant ammonia-oxidizers but have few representative laboratory-cultured strains. We report the cultivation of Candidatus Nitrosomarinus catalina SPOT01, a novel strain that is less warm-temperature tolerant than other cultivated Thaumarchaeota. Based on metagenomic recruitment, strain SPOT01 comprises a major portion of Thaumarchaeota (4-54%) in temperate Pacific waters. Its complete 1.36 Mbp genome possesses several distinguishing features: putative phosphorothioation (PT) DNA modification genes; a region containing probable viral genes; and putative urea utilization genes. The PT modification genes and an adjacent putative restriction enzyme (RE) operon likely form a restriction modification (RM) system for defence from foreign DNA. PacBio sequencing showed >98% methylation at two motifs, and inferred PT guanine modification of 19% of possible TGCA sites. Metagenomic recruitment also reveals the putative virus region and PT modification and RE genes ar...
Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, $d_2^*$ and $d_2^S$ are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report a standalone software package, aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28 alignment-free dissimilarity measures. CAFE allows for both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in a standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures, including dendrograms, heatmaps, principal coordinate analysis and network display. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes, and is freely availabl...
The advent of next-generation sequencing (NGS) technologies enables researchers to sequence complex microbial communities directly from the environment. Since assembly typically produces only genome fragments, also known as contigs, instead of entire genomes, it is crucial to group them into operational taxonomic units (OTUs) for further taxonomic profiling and downstream functional analysis. OTU clustering is also referred to as binning. We present COCACOLA, a general framework that automatically bins contigs into OTUs based upon sequence composition and coverage across multiple samples. The effectiveness of COCACOLA is demonstrated in both simulated and real datasets in comparison to state-of-the-art binning approaches such as CONCOCT, GroopM, MaxBin and MetaBAT. The superior performance of COCACOLA relies on two aspects. One is employing L1 distance instead of Euclidean distance for better taxonomic identification during initialization. More importantly, COCACOLA takes advantage of both har...
Papers by Jed Fuhrman