Microbial natural products are a major source of bioactive compounds for drug discovery. Among these molecules, nonribosomal peptides (NRPs) represent a diverse class of natural products that include antibiotics, immunosuppressants, and anticancer agents. Recent breakthroughs in natural product discovery have revealed the chemical structure of several thousand NRPs. However, biosynthetic gene clusters (BGCs) encoding them are known only for a few hundred compounds. Here, we developed Nerpa, a computational method for the high-throughput discovery of novel BGCs responsible for producing known NRPs. After searching 13,399 representative bacterial genomes from the RefSeq repository against 8368 known NRPs, Nerpa linked 117 BGCs to their products. We further experimentally validated the predicted BGC of ngercheumicin from Photobacterium galatheae via mass spectrometry. Nerpa supports searching new genomes against thousands of known NRP structures, and novel molecular structures against ...
The COVID-19 pandemic has ignited broad scientific interest in coronavirus research. The identification of Coronaviridae species in natural reservoirs often requires de novo assembly. However, existing transcriptome assemblers are often unable to assemble coronaviruses into a single contig. We developed coronaSPAdes, a new module of the SPAdes assembler for coronavirus species recovery. coronaSPAdes uses knowledge of the Coronaviridae genome structure to improve assembly. We have shown that coronaSPAdes outperforms existing SPAdes modes and other popular short-read assemblers in the recovery of full-length coronavirus genomes. This should allow a better understanding of Coronaviridae spread and diversity.
This repository contains benchmarking datasets and scripts for the manuscript "SPAligner: alignment of long error-prone reads to assembly graphs". Graph representation of genome assemblies has recently been used in different applications, from gene finding to haplotype separation. While many of these applications are based on aligning DNA and protein sequences to assembly graphs, existing software tools for finding such alignments have important limitations. We present SPAligner (Saint Petersburg Aligner), a novel tool for aligning long reads to assembly graphs, and demonstrate that it generates accurate alignments.
Motivation: A recently published article in BMC Genomics by Fuentes-Trillo et al. (2021) compares approaches to assembling several noroviral samples using different tools and preprocessing strategies. Unfortunately, the study used outdated versions of tools as well as tools that were not designed for the viral assembly task. To improve the suboptimal assemblies, the authors suggested several sophisticated preprocessing strategies that seem to make only minor contributions to the results. We redid the analysis using state-of-the-art tools designed for viral assembly. Results: Here we demonstrate that tools from the SPAdes toolkit (rnaviralSPAdes and coronaSPAdes) allow one to assemble the samples from the original study into a single contig without any additional preprocessing.
Despite recent advances in high-throughput sequencing, analysis of the metagenome of a whole microbial population remains a challenge. In particular, metagenome-assembled genomes (MAGs) are often fragmented due to interspecies repeats, uneven coverage and vastly different strain abundance. MAGs are usually constructed via a dedicated binning process that uses different features of the input data to cluster contigs that might belong to the same species. This process has limitations, and binners therefore usually discard input contigs that are shorter than several kilobases. As a result, binning of even simple metagenome assemblies can miss a sizable fraction of contigs, and the resulting MAGs often lack important conserved sequences that might be of great interest to researchers. In this work we present BinSPreader, a novel binning refiner tool that exploits the assembly graph topology and other connectivity information to refine the existing binning...
The gut microbiome in critically ill patients shows profound dysbiosis. The most vulnerable is the subgroup of chronically critically ill (CCI) patients, those suffering from long-term dependence on support systems in intensive care units. It is important to investigate their microbiome as a potential reservoir of opportunistic taxa causing co-infections and as a morbidity factor. We explored the dynamics of microbiome composition in CCI patients by combining “shotgun” metagenomics with chromosome conformation capture (Hi-C). Stool samples were collected at 2 time points from 2 patients with severe brain injury with different outcomes within a 1–2-week interval. The metagenome-assembled genomes (MAGs) were reconstructed based on the Hi-C data using a novel hicSPAdes method (along with the bin3c method for comparison), as well as independently of the Hi-C data using MetaBAT2. The resistomes of the samples were derived using a novel assembly graph-based approach. Links of bacteria to antibiotic ...
In Chap. 2, the use of SSA for analyzing one-dimensional data is thoroughly examined. In this chapter, the use of models is minimal, so the main techniques can be considered non-parametric and descriptive. Relations with algorithms of space rotation and many other methods aimed at achieving better separability of signal from noise are outlined. The common problems of smoothing, filtration, and splitting of a time series into identifiable components such as trend, seasonality, and noise are thoroughly discussed and illustrated with case studies on real data. The important issue of automating the SSA methods is also considered in detail.
In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were profici...
Background: Illumina paired-end reads are often used for 16S analysis in metagenomic studies. Since the DNA fragment size is usually smaller than the sum of the lengths of the paired reads, reads can be merged for downstream analysis. Despite the development of several tools for merging paired-end reads, poor quality at the 3′ ends within the overlapping region prevents accurate merging of a significant portion of read pairs. Recently, CD-HIT-OTU-Miseq was presented as a new approach for 16S analysis using paired-end reads; it avoids the read-merging process entirely by clustering the paired reads separately. CD-HIT-OTU-Miseq is a set of tools that are meant to be launched successively by auxiliary shell scripts. This launch mode is not suitable for processing the large amounts of data generated in modern omics experiments. To solve this issue we created CDSnake, a Snakemake pipeline utilizing CD-HIT tools for easier consecutive launch of CD-HIT-OTU-Miseq tools for complete process...
The Siberian Journal of Clinical and Experimental Medicine
The identification of new SARS-CoV-2 and human protein and gene targets, which may be markers of the severity and outcome of the disease, is extremely important during the COVID-19 pandemic. The goal of this study was to carry out genetic analysis of SARS-CoV-2 RNA samples to elucidate correlations of genetic parameters (SNPs) with clinical data and severity of COVID-19 infection. Material and Methods. The study included viral RNA samples isolated from 56 patients with COVID-19 infection who received treatment at the City Hospital No. 40 of St. Petersburg from 04/18/2020 to 04/18/2021. Patients underwent physical examination with assessment of hemodynamic and respiratory parameters, clinical risk according to the National Early Warning Score (NEWS), computed tomography (CT) of the chest, and laboratory studies including clinical blood analysis, assessment of ferritin, C-reactive protein (CRP), interleukin-6 (IL-6), lactate dehydrogenase (LDH), D-dimer, creatinine, and glucose level...
Peptidic Natural Products (PNPs) are highly sought-after bioactive compounds that include many antibiotic, antiviral and antitumor agents, immunosuppressors and toxins. Even though recent advancements in mass spectrometry have led to the development of accurate sequencing methods for non-linear (cyclic and branch-cyclic) peptides, requiring only picograms of input material, the identification of PNPs via a database search of mass spectra remains problematic. This holds particularly true when evaluating the statistical significance of Peptide Spectrum Matches (PSMs), especially for non-linear peptides, which often contain non-standard amino acids and modifications and have an overall complex structure. In this paper we describe a new way of estimating the statistical significance of a PSM, defined by any peptide (linear or non-linear), using state-of-the-art Markov Chain Monte Carlo methods. In addition to the estimate itself our method also provides an ...
In Chap. 4, the problem of simultaneous decomposition, reconstruction, and forecasting for a collection of time series is considered from the viewpoint of SSA; note that individual time series can have different lengths. The main method of this chapter is usually called either Multichannel SSA or Multivariate SSA, MSSA for short. The principal idea of MSSA is the same as in the case of a one-dimensional time series, the difference lying in the way the trajectory matrix is constructed. The aim of MSSA is to take into consideration the combined structure of a multivariate series to obtain more accurate results.
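To make the remark about the trajectory matrix concrete, here is a minimal sketch of one common MSSA convention: build a Hankel trajectory matrix for each series with a shared window length and stack the blocks side by side. The function names `hankel_trajectory` and `mssa_trajectory` are hypothetical, and horizontal stacking is an assumption for illustration; the chapter itself may use a different layout.

```python
def hankel_trajectory(series, window):
    """Return the L x K trajectory (Hankel) matrix of one series, K = N - L + 1."""
    n = len(series)
    if not 1 <= window <= n:
        raise ValueError("window must satisfy 1 <= window <= len(series)")
    k = n - window + 1
    # Row i holds the values series[i], series[i + 1], ..., series[i + K - 1].
    return [[series[i + j] for j in range(k)] for i in range(window)]


def mssa_trajectory(collection, window):
    """Stack the per-series trajectory matrices horizontally (assumed convention)."""
    blocks = [hankel_trajectory(s, window) for s in collection]
    # Concatenate row i of every block; blocks may have different widths.
    return [sum((block[i] for block in blocks), []) for i in range(window)]


if __name__ == "__main__":
    # Two series of different lengths share the same window length.
    for row in mssa_trajectory([[1, 2, 3, 4, 5], [10, 20, 30, 40]], window=3):
        print(row)
```

Because every block has the same number of rows (the window length), series of different lengths simply contribute blocks of different widths.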
Chapter 3 is devoted to applications of SSA to one-dimensional series: forecasting, gap filling, low-rank approximation, parameter estimation, and change-point detection. The SSA analysis of time series in Chap. 2 is model-free. The methods of Chap. 3, by contrast, are model-based. The model is constructed on the basis of the approximating subspace built while performing the SSA analysis of Chap. 2. The main parametric model is a linear recurrence relation that the signal should approximately satisfy. Application of the methods is illustrated on real-life data.
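As a small illustration of what it means for a signal to satisfy a linear recurrence relation, the sketch below continues a series with x[n] = a_1 x[n-1] + ... + a_d x[n-d]. The function name `lrr_forecast` is hypothetical, and the coefficients are assumed to be given; in SSA they would be derived from the estimated subspace, which is not shown here.

```python
def lrr_forecast(series, coeffs, steps):
    """Continue `series` using x[n] = sum(coeffs[k] * x[n - 1 - k])."""
    if len(series) < len(coeffs):
        raise ValueError("need at least len(coeffs) initial values")
    out = list(series)
    for _ in range(steps):
        # coeffs[0] multiplies the most recent value, coeffs[1] the one before, ...
        out.append(sum(c * out[-1 - k] for k, c in enumerate(coeffs)))
    return out[len(series):]


if __name__ == "__main__":
    # A linear trend satisfies x[n] = 2 * x[n-1] - x[n-2].
    print(lrr_forecast([1, 2, 3], [2, -1], steps=2))
```

A linear trend is used in the example because it satisfies a recurrence of order two exactly, so the continuation can be checked by hand.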
Microbial communities in many environments include distinct lineages of closely related organisms which have proved challenging to separate in metagenomic assembly, preventing generation of complete metagenome-assembled genomes (MAGs). The advent of long and accurate HiFi reads presents a possible means to address this challenge by generating complete MAGs for nearly all sufficiently abundant bacterial genomes in a microbial community. We present a metagenomic HiFi assembly of a complex microbial community from sheep fecal material that resulted in 428 high-quality MAGs from a single sample, the highest resolution achieved with metagenomic deconvolution to date. We applied a computational approach to separate distinct haplotype lineages and identified haplotypes of hundreds of variants across hundreds of kilobases of genomic sequence. Analysis of these haplotypes revealed 220 lineage-resolved complete MAGs, including 44 in single circular contigs, and demonstrated improvement in ove...
The lack of control over the usage of antibiotics leads to the propagation of microbial strains that are resistant to many antimicrobial substances. This situation is an emerging threat to public health, and the development of approaches to infer the presence of resistant strains is therefore a topic of high importance. Resistome construction for an isolated microbial species could be considered a solved task, with many state-of-the-art tools available. However, when it comes to analyzing the resistome of a microbial community (metagenome), many challenges affect the accuracy and precision of the predictions. For example, the prediction sensitivity of the existing tools suffers from fragmented metagenomic assemblies due to interspecies repeats: usually it is impossible to recover conserved parts of antibiotic resistance genes that belong to different species, which occur due to, e.g., horizontal gene transfer or residing on a plasmid. The recent adv...