WO2014028771A1 - Iterative genome assembler - Google Patents
- Publication number
- WO2014028771A1 (PCT/US2013/055197)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- contigs
- read
- contig
- clusters
- reads
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
- G16B30/20—Sequence assembly
Definitions
- the field relates to de novo genome assembly of short reads.
- NGS next-generation sequencing
- This massive amount of sequencing data has provided the basis to better understand the tree of life and to identify molecular signatures of human variation and disease mechanisms.
- the key computational task is to de novo assemble raw reads from NGS technologies into complete or near-complete genomes.
- the enormous amount of data creates an inevitable barrier to the assembly process in terms of memory usage.
- the lower quality and limited read length produced by NGS, as compared to traditional Sanger sequencing, make it extremely difficult to assemble reads into long scaffolds, which are essential to facilitate the analyses of large-scale genome rearrangements.
- NGS-based de novo genome assemblers adopt the de Bruijn Graph (DBG) data structure to handle extremely high coverage data [1-3].
- Several assemblers have specifically been developed with some success to assemble large genomes.
- In SOAPdenovo [4] and ALLPATHS-LG [5], a DBG was constructed in a large shared memory and the assembly process was done in parallel within multiple threads. However, both of them required hundreds of gigabytes (GB) of memory to assemble large genomes, such as those from human and other mammalian species.
- ABySS [6], YAGA [7] and PASHA [8] developed distributed DBG algorithms that split the DBG and parallelize the assembly process on a cluster on the basis of message passing interface (MPI).
- this imposed considerable communication among servers because many adjacent vertices in the DBG could be located on different servers.
- the amount of communication among servers increases nonlinearly when the number of servers increases, causing scalability issues.
- Some assemblers made modifications to DBG in order to reduce memory usage.
- Goassembler [9-10] used a compressed bitmap representation of the DBG, resulting in a memory usage that could be close to the theoretical minimum value of a DBG.
- Cortex [11] utilized colored DBG to detect the variations among 10 human genomes with less than 256 GB of memory.
- SparseAssembler2 [12] reduced memory usage dramatically by storing only a small fraction of k-mers.
- SGA [13] used a compressed index of reads and it could assemble a human genome under 60 GB of memory.
- the methods include receiving a plurality of input reads representing nucleotide sequences of DNA or RNA fragments; partitioning the input reads into respective read tiles; assembling sets of contigs for respective of the read tiles; clustering the contigs into contig clusters, wherein the clustering is based on overlap between the contigs; clustering the input reads into read clusters, wherein the clustering is based on similarity between the input reads and the contig clusters; and assembling sets of contigs for respective of the read clusters.
- (d) - (f) are iteratively repeated until an end condition is met.
- the method includes (a) receiving a plurality of input reads representing nucleotide sequences of DNA or RNA fragments; (b) partitioning the input reads into respective read tiles; (c) coordinating assembly of the read tiles into respective sets of contigs; (d) clustering the contigs into contig clusters, wherein the clustering is based on overlap between the contigs; (e) coordinating clustering the input reads into read clusters, wherein the clustering is based on similarity between the input reads and the contig clusters; and (f) assembling sets of contigs for respective of the read clusters.
- the method includes receiving a plurality of contigs assembled from a plurality of input reads representing nucleotide sequences of DNA or RNA fragments; clustering the contigs into contig clusters, wherein the clustering is based on overlap between the contigs; clustering the input reads into read clusters, wherein the clustering is based on similarity between the input reads and the contig clusters; and assembling sets of contigs for respective of the read clusters.
- Also provided are systems which can include a plurality of stored input reads representing sequencing reads generated from a sequencing method applied to a sample; a partitioner configured to partition the stored input reads into a plurality of read tiles; an assembler configured to assemble input reads in a given read tile into a set of contigs; a contig cluster tool configured to organize the contigs into a graph and partition the graph into contig clusters; and an alignment tool configured to accept the contig clusters and the input reads and output read clusters partitioning the input reads according to alignment against the contig clusters; wherein the assembler is further configured to assemble input reads in a given read cluster into a new set of contigs.
- Figure 1 is a block diagram of an exemplary system implementing iterative genome assembly.
- Figure 2 is a flowchart of an exemplary method implementing iterative genome assembly.
- Figure 3 is a block diagram of an exemplary system clustering contigs based on overlap.
- Figure 4 is a flowchart of an exemplary method of clustering contigs based on overlap.
- Figure 5 is a block diagram of an exemplary graph representing contig overlap.
- Figure 6 is a flowchart of an exemplary method of clustering contigs via a word map.
- Figure 7 is a block diagram of an exemplary system performing read tile assembly into contigs in parallel on different machines.
- Figure 8 is a block diagram of an exemplary system performing read clustering based on contig clusters in parallel on different machines.
- Figure 9 is a schematic view of an iterative framework for genome assembly.
- Figure 10 is a schematic view of exemplary contig clustering.
- Figures 11A-B are bar graphs showing a Tiger and Velvet runtime comparison using data from human chromosome 14.
- Figure 12 is a diagram of an exemplary computing system in which described embodiments can be implemented.
- Figure 13 is an exemplary cloud-support environment that can be used in conjunction with the technologies described herein.
- When a range is provided, for example, a temperature range, a time range, a size range, a length range, a cardinality range, or a composition or concentration range, all intermediate ranges and subranges, as well as all individual values included in the ranges given, are included in the disclosure.
- the specific embodiments provided herein are examples of useful embodiments of the present disclosure and it will be apparent to one skilled in the art that the methods and systems provided herein can be carried out using a large number of variations of the devices, device components, software components, methods, and steps set forth in the present description. Methods and devices useful for the present disclosure can include a large number of optional composition and processing elements and steps.
- Target nucleic acid sequence or molecule: A defined region or particular portion of a nucleic acid molecule, for example, a portion of a genome (such as a chromosome of mammalian or plant genomic DNA of interest).
- a target can be defined by its position on a chromosome (e.g., in a normal cell), for example, according to cytogenetic nomenclature by reference to a particular location on a chromosome; by reference to its location on a genetic map; by reference to a hypothetical or assembled contig; by its specific sequence or function; by its gene or protein name; or by any other means that uniquely identifies it from among other genetic sequences of a genome.
- the target nucleic acid sequence is a genomic sequence (such as a mammalian genomic sequence, for example human genomic sequence, a plant genomic sequence, a bacterial genomic sequence, a viral genomic sequence, or a fungal genomic sequence).
- alterations of a target nucleic acid sequence are "associated with" a disease or condition. That is, detection of the target nucleic acid sequence can be used to infer the status of a sample with respect to the disease or condition.
- the target nucleic acid sequence can exist in two (or more) distinguishable forms, such that a first form correlates with absence of a disease or condition and a second (or different) form correlates with the presence of the disease or condition. The two different forms can be
- a genome can include billions of nucleotides.
- Current sequencing methods divide these very long sequences into shorter sequences, for example, by cutting the nucleic acid molecule into shorter pieces.
- It can be difficult to efficiently put the chromosome or genome back together after it has been split up (a process referred to as de novo genome assembly).
- De novo assemblers require an enormous amount of computational resources, which are not accessible to most research groups and medical personnel.
- a novel de novo assembly framework including a software program referred to herein as Tiger (Tiled Iterative GEnome assembleR), which adapts to available computing resources by iteratively decomposing the assembly problem into sub-problems.
- the disclosed methods are also flexible to embed different assemblers for various types of target genomes.
- The NGS genome assembly problem, with its memory usage and scalability challenges, can be solved if the large computational tasks are decomposed into modest-sized independent sub-problems, which then fit into smaller memories and can be solved in parallel. This can effectively move large-scale de novo assembly tasks onto commodity PC networks.
- this new approach provides improved assembly quality compared to the current state-of-the-art assemblers.
- a highly effective framework for decomposing the problem of genome assembly from NGS reads is provided.
- the decomposed sub-problems can be solved in a sequential manner using significantly less memory or solved simultaneously if more computing nodes are available.
- the method partitions and clusters reads so that the memory usage for assembly is inversely proportional to the number of partitions. This fundamentally resolves the issue of high memory requirement in sequence assembly and can accelerate the research progress of this field.
- the disclosed methods and systems can leverage existing well-built assemblers and iteratively improve the results.
- the disclosed methods and systems can also start from an already assembled result and further improve it, using the original read set. In situations where traditional assemblers require more memory than is present in available computers, the disclosed methods and systems provide the only known viable path.
- the results provided herein demonstrate the feasibility of getting better quality results with low memory footprint and the scalability of using distributed commodity computers.
- the disclosed de novo genome assembly methods and systems can be used in cancer research.
- the target to be assembled is from a particular cancer, such as from a cancer from a specific patient. Such sequencing will permit identification of the specific and unique changes present in the tumor from the patient. Based on these changes a personalized therapeutic strategy can be undertaken.
- the target to be assembled is from a particular cancer, for example to identify new mutations associated with that cancer.
- genome sequencing and assembly using the disclosed methods and systems permits personalized treatment for a variety of diseases, and permits identification of new classes of therapeutics, for example, to eliminate side effects of current therapeutics, identify correct patient dosage, and improve diagnostic tests.
- the disclosed methods and systems are useful for understanding the diversity and importance in transfer of genetic material among different organisms, such as bacterial species.
- the disclosed methods and systems are useful for identifying biomarkers of plant species under critical conditions such as temperature, disease and drought. It can also be used to study factors affecting growth, for example, in identifying essential functional genomic mechanisms of synthesis of plant material grown under stress.
- Example 1 Exemplary System Implementing Iterative Genome Assembly
- FIG. 1 is a block diagram of an exemplary system 100 implementing iterative genome assembly as described herein.
- a partitioner 115 accepts stored input reads 110 as input and outputs a plurality of read tiles 120.
- the stored input reads 110 can represent sequencing reads generated from a sequencing method applied to a sample as described herein.
- the partitioner 115 can be configured to partition the stored input reads into a plurality of different read tiles.
- Various techniques can be used, such as randomly partitioning the stored input reads. For example, a round-robin approach can be used where a first stored input read is placed into a first read tile, a second stored input read is placed into a second read tile, etc.
- the number of read tiles can be varied as described herein.
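As an illustrative sketch only (the function name `partition_reads` is not from the disclosure), the round-robin partitioning described above can be expressed in Python as:

```python
def partition_reads(reads, num_tiles):
    """Round-robin partition: read i goes to tile i mod num_tiles,
    yielding substantially equal-sized read tiles regardless of order."""
    tiles = [[] for _ in range(num_tiles)]
    for i, read in enumerate(reads):
        tiles[i % num_tiles].append(read)
    return tiles
```

Because read i lands in tile i mod r, the tiles differ in size by at most one read.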
- the assembler 125 can be configured to assemble input reads in a given read tile into a set of contigs (e.g., for the given read tile).
- r contig sets 130 can be generated (e.g., one contig set per read tile).
- assembly can be distributed among a plurality of different machines.
- a contig cluster tool 140 can be configured to organize the contigs (e.g., the collective contigs from the different contig sets 130) from the contig sets 130 into a graph 145 and partition the graph into contig clusters 150 (e.g., based on overlap as described herein). As described herein, processing for the contig cluster tool 140 can be distributed among a plurality of different machines. As iterations progress, the contig clusters 150 can become well-clustered contigs.
- An alignment tool 160 can accept the contig clusters 150 and the input reads 110 and output read clusters 165. As described herein, processing for the alignment tool 160 can be distributed among a plurality of different machines. As iterations progress, the read clusters 165 can become well-clustered read clusters.
- the assembler 125 is further configured to assemble input reads in a given read cluster 165 into a new set of contigs (e.g., like those in 130, but improving with successive iterations). As described herein, iterations can stop when there ceases to be observed improvement.
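The overall iterate-until-no-improvement loop of system 100 can be sketched as follows. This is a hypothetical skeleton, not the disclosed implementation: the partitioner, assembler, clustering steps, and quality metric are supplied as callables, and the stopping rule is a simple "no observed improvement" check.

```python
def iterative_assembly(reads, partition, assemble, cluster_contigs,
                       cluster_reads, quality, max_iters=10):
    """Skeleton of the iterative framework: partition into read tiles,
    assemble each tile, then repeatedly cluster contigs by overlap,
    re-cluster reads against the contig clusters, and reassemble,
    stopping when assembly quality stops improving."""
    # Initial pass: partition into read tiles and assemble each tile.
    contigs = [c for tile in partition(reads) for c in assemble(tile)]
    best = quality(contigs)
    for _ in range(max_iters):
        contig_clusters = cluster_contigs(contigs)
        read_clusters = cluster_reads(reads, contig_clusters)
        contigs = [c for rc in read_clusters for c in assemble(rc)]
        q = quality(contigs)
        if q <= best:  # end condition: no observed improvement
            break
        best = q
    return contigs
```

Any per-step implementation (e.g., an embedded third-party assembler) can be dropped into the corresponding callable.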
- system 100 can be more complicated, with additional functionality, tools, and the like.
- the system 100 and any of the other systems described herein can be implemented in conjunction with any of the hardware components described herein, such as the computing systems or mobile devices described below (e.g., comprising one or more processors, memory, and the like).
- the inputs, input reads, contigs, outputs, graphs, and tools can be stored in one or more computer-readable storage media or computer-readable storage devices.
- the technologies described herein can be generic to the specifics of operating systems or hardware and can be applied in any variety of environments to take advantage of the described features.
- FIG. 2 is a flowchart of an exemplary method 200 of implementing iterative genome assembly and can be implemented, for example, in the system shown in FIG. 1.
- a plurality of input reads are received.
- the input reads are initially partitioned into respective read tiles.
- the result is a set of read tiles with each given read tile comprising a subset of the input reads.
- random (e.g., round robin or the like) partitioning can be supported.
- the input reads can thus be substantially equally divided among the read tiles.
- the number of read tiles can vary.
- the number of read tiles can also vary from iteration to iteration.
- sets of contigs are assembled for respective of the read tiles.
- the input reads for each read tile are assembled into a respective set of contigs.
- a different contig set can be assembled for each read tile. Assembly thus can take place for input reads within each read tile.
- the inputs in a given read tile can be submitted as input to an assembler, and the output is a set of contigs assembled for the read tile.
- different assemblers can be used.
- a read tile can be treated as a complete set of inputs that can be processed separately (e.g., in parallel, sequentially, on a different machine, or the like) from another read tile.
- assembly of the read tiles can be coordinated (e.g., scheduled for execution on another machine or system) rather than explicitly performed.
- Guidance can be provided as input to an assembler in the form of a specified k-mer size. Responsive to the input, the assembler can attempt to assemble contigs of the k-mer size.
- An automatic k-mer can be chosen as described herein. Other guidance can be provided based on characteristics of the input reads (e.g., length, coverage, errors, and the like) if known.
- the resulting contigs are (e.g., collectively) clustered into contig clusters.
- clustering can be based on overlap between the contigs as described herein.
- a graph of the contigs can be built to model overlap within the contigs, and the graph can be partitioned into clusters of contigs.
- the sets of contigs can be combined (e.g., into a collective set for inclusion in the graph) before they are clustered. Scaffolding can also be performed.
- the input reads can be clustered into read clusters, based on similarity between the input reads and the contig clusters. For example, the percentage of base-pairs (or nucleotides) that are in a common pattern between the two (e.g., an input read and a contig in a cluster) can be determined via an alignment tool. As described herein, an input read can be assigned to more than one read cluster. As described herein, clustering can be performed by distributed processing. Thus, clustering can be coordinated (e.g., scheduled for execution on another machine or system) rather than explicitly performed.
- sets of contigs for respective of the read clusters can be assembled.
- the read clusters can then be processed as read tiles. For example, similar to the assembly of 230, a different contig set can be assembled for each read cluster. The input reads for a given read cluster are thus assembled into a set of contigs.
- 210-230 can be performed as preprocessing (e.g., by another technology), and the method 200 can comprise 240-260.
- the method 200 and any of the other methods described herein can be performed by computer-executable instructions (e.g., causing a computing system to perform the method) stored in one or more computer-readable media (e.g., storage or other tangible media) or stored in one or more computer-readable storage devices.
- input reads can represent nucleotide sequences of DNA or RNA.
- the input reads can be a result of sequencing reads generated from a sequencing method applied to a sample. Such input reads can correspond to subsequences of an entire genome and can range up to the billions of reads. As described herein, the lengths of the input reads can vary.
- input reads can take the form of sequences from publicly available databases as described herein, assembled contigs, assembled scaffolds, or the like.
- input to the technologies described herein are not necessarily limited to raw reads from a sequencer.
- an input read takes the form of a digital representation of a string of nucleotides (e.g., DNA or RNA fragments) as sequenced by the sequencing method or assembled by any of a variety of arbitrary assembly methods.
- a read tile can take the form of a subset of input reads.
- the input reads are partitioned into a plurality of read tiles as described herein.
- the term "read cluster” is synonymous with “read tile” in that it is a subset of input reads.
- the term “read cluster” is typically used for a set of read tiles that have been partitioned based on contig clusters as described herein.
- the read tiles can become well-clustered read clusters in that they can be better clustered according to where in the genome they appear, even if their location need not be explicitly known or represented.
- a contig can take the form of a digital representation of a string of nucleotides (e.g., DNA or RNA) as assembled from the input reads.
- the contigs will be of larger size than the input reads and will eventually become of such a size that they are useful for a wide variety of purposes as described herein.
- a contig cluster can take the form of a subset of contigs that have been determined to have a degree of overlap as described herein. As iterations progress, the contig clusters are said to be "well-clustered" because the contigs contribute to a continuous region of the target genome.
- assembly of some input reads can be attempted with at least one of the assemblers via a plurality of different k-mer sizes. Performance of assembly under the different k-mer sizes can be evaluated. A superior k-mer size can then be chosen for use across the reads based on the evaluation. Finer levels of granularity can be supported (e.g., different k-mer sizes for different read tiles).
- Performance of assembly can be based at least on the number of used reads in the assembly, length of the contigs (e.g., N50 metrics, total length, or the like), errors (e.g., with reference to a reference genome, if any) or the like.
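A minimal sketch of such a k-mer sweep, assuming an assembler callable and using total contig length as a stand-in for the richer evaluation criteria listed above (the name `choose_k` is illustrative):

```python
def choose_k(reads, assemble_with_k, k_values):
    """Try each candidate k-mer size, score the resulting assembly by
    total contig length (a stand-in for metrics such as N50 or the
    number of used reads), and return the best-scoring k."""
    def score(contigs):
        return sum(len(c) for c in contigs)
    return max(k_values, key=lambda k: score(assemble_with_k(reads, k)))
```

The same sweep can be run per read tile to support finer granularity.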
- iteration can continue until an end condition is met.
- A variety of end conditions can be used. For example, a lack of improvement (or improvement below a threshold amount) in the sets of assembled contigs can indicate an end condition.
- Improvement can be measured in terms of quality improvement.
- improvement can be measured in terms of the length of the contigs, or the like.
- the distribution of contig length can be measured and compared against a threshold.
- An end condition can also be stated in terms of average, median, or mode contig length; minimum contig length; or the like. Metrics such as N50 can be used.
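For concreteness, the N50 metric mentioned above can be computed as sketched below (N50 is the length L such that contigs of length at least L account for at least half of the total assembled bases):

```python
def n50(contig_lengths):
    """N50: walk contig lengths from longest to shortest and return the
    length at which the running total reaches half the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```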
- FIG. 3 is a block diagram of an exemplary system 300 clustering contigs based on overlap.
- combined contigs 330 are accepted by a graph builder 340, which outputs a contig connectivity graph 345 based on overlap in the combined contigs 330.
- the combined contigs 330 can be the collective contigs in the contig sets (e.g., 130 of FIG. 1) generated from assembly of read tiles or read clusters (e.g., the read tiles 120 or the read clusters 165 of FIG. 1).
- the graph partitioner 347 accepts the contig connectivity graph 345 as input and outputs a plurality of contig clusters 350A-N, which are clustered based on overlap as represented in the graph 345.
- the clusters 350A-N can be used as contig clusters for the technologies described herein (e.g., the contig clusters 150 of FIG. 1).
- the graph partitioner 347 can be provided with a target number of contig clusters 350A-N and divide the graph 345 accordingly.
- the number of clusters can change (e.g., increase, decrease) with subsequent iterations.
- the target number of contig clusters can be chosen in a variety of ways. For example, the target genome length (if known in advance), available memory size on machines doing the processing, limited runtime (e.g., more clusters typically result in more runtime and better results), similarity to previous experience, or the like.
- Example 10 Exemplary Method Clustering Contigs Based on Overlap
- FIG. 4 is a flowchart of an exemplary method 400 of clustering contigs based on overlap and can be implemented, for example, in the system shown in FIG. 3.
- the method 400 can also be used to implement contig clustering 240 shown in FIG. 2.
- the method 400 can include initially receiving collective contigs.
- a contig connectivity graph is built based on overlap between the combined contigs.
- Contigs can be represented as respective nodes in the graph.
- Edges are based on (e.g., placed based on, weighted based on, or both) detected overlap between the contigs.
- clustering based on overlap can result in grouping the contigs based on how likely the contigs or portions thereof will be next to each other in the genome.
- the graph is partitioned into contig clusters. Because nodes can represent respective contigs, partitioning the graph effectively partitions the corresponding contigs. Partitioning of the graph can be accomplished using a graph partitioning tool. For example, vertex weights, edges, or both can be considered when partitioning the graph into sub-graphs based on connectivity. A variety of partition techniques, such as circuit or graph partitioning can be used. A goal can be to partition the graph into parts with the least number of edges between groups while having substantially balanced group sizes.
- vertex weights can be given less importance than edge weights to result in more edge-oriented subgraphs.
- vertex weights can still be considered to address situations where many short contigs have little connectivity.
- Such short contigs can thus be distributed evenly among the clusters to reduce their influence on the rest of the clustered contigs as the technique progresses. Because overlapping contigs are clustered together, they can subsequently be rebuilt as one long complete contig (e.g., when sets of contigs are subsequently assembled from read clusters based on the contig clusters).
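One simplified way to realize this partitioning, given an already-built weighted overlap graph, is to drop edges below an overlap threshold and take connected components as contig clusters. This is only an illustrative sketch: production graph partitioners (e.g., METIS) additionally balance cluster sizes while minimizing cut edges.

```python
from collections import defaultdict

def cluster_contigs(edges, contig_ids, min_overlap=1):
    """Cluster contigs from a weighted overlap graph.

    edges: {(id_a, id_b): overlap_weight}. Edges below min_overlap are
    dropped, then connected components of the remaining graph are taken
    as contig clusters."""
    adj = defaultdict(set)
    for (a, b), w in edges.items():
        if w >= min_overlap:
            adj[a].add(b)
            adj[b].add(a)
    seen, clusters = set(), []
    for node in contig_ids:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:  # depth-first traversal of one component
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        clusters.append(sorted(comp))
    return clusters
```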
- the contig clusters can be used to cluster the input reads as described herein.
- FIG. 5 is a block diagram of an exemplary graph 500 representing contig overlap.
- the example shows three nodes 510A-C.
- the graph 500 will have many more nodes and edges.
- a node 510A, 510B, 510C represents a respective contig.
- Nodes can be weighted based on the length of the contig represented (e.g., in mers or the like).
- edges 520A, 520B can be placed based on detected overlap between nodes.
- the edges 520A, 520B can further be weighted based on the degree of overlap between a pair of contigs represented by respective nodes. So, for example, the weight of the edge 520A can indicate the degree of overlap between a contig represented by the node 510A and a contig represented by the node 510B.
- edges in the graph represent overlap between contigs connected by the edges. For example, as described herein, a number of base pairs (e.g., nucleotides) in common between contigs connected by an edge can represent overlap and the number of base pairs indicates the edge weight between them.
- the graph 500 can be represented in a variety of ways. Data structures representing the contigs need not be stored in the nodes themselves. For example, a reference to the contig data can be stored.
- Other information can be included to assist in the partition process.
- the graph 500 represents contigs and connects the contigs according to overlap, it is sometimes called a "contig connectivity graph.”
- connectivity of the graph can vary. For example, there may be nodes that have no edges or the like.
- Example 12 Exemplary Method Clustering Contigs Using Word Map
- FIG. 6 is a flowchart of an exemplary method 600 of clustering contigs via a word map and can be implemented, for example, using the graph shown in FIG. 5.
- the method 600 can be used in place of or to supplement the method 400 of FIG. 4.
- the method 600 can include initially receiving collective contigs.
- words are extracted from the contigs.
- the words can be consecutive base pair sequences (e.g., of equal length, substantially equal length, or the like) from the contigs.
- a word map is built from the words and the contigs.
- the map can map the words to the contigs containing them. Building the word map can be accomplished by distributing processing among machines (e.g., with machines receiving respective sets of contigs and determining which of the contigs have which words in them).
- edge weights can be placed and set to the degree of overlap (e.g., number of base pairs in common) between contigs represented by vertices.
- the graph can then be partitioned at 650, for example, using a graph partitioning tool suitable for processing millions of vertices (e.g., METIS or the like).
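A compact sketch of the word-map construction described above, in which the edge weight between two contigs is the number of words (consecutive base-pair sequences of fixed length) they share (the function name is illustrative):

```python
from collections import defaultdict
from itertools import combinations

def overlap_edges(contigs, word_len):
    """Build edge weights from a word map: extract every consecutive
    word of length word_len from each contig, map word -> contigs
    containing it, then count shared words per contig pair."""
    word_map = defaultdict(set)
    for cid, seq in contigs.items():
        for i in range(len(seq) - word_len + 1):
            word_map[seq[i:i + word_len]].add(cid)
    edges = defaultdict(int)
    for cids in word_map.values():
        for a, b in combinations(sorted(cids), 2):
            edges[(a, b)] += 1
    return dict(edges)
```

Building the word map distributes naturally: each machine can process its own subset of contigs.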
- contigs can be clustered based on overlap.
- a graph can be built to model overlap among the contigs and then partitioned into subgraphs. Nodes in different subgraphs can be considered as being in different respective contig clusters.
- the graph can model contig connectivity intensity.
- a threshold for determining overlap can be empirically determined. To conserve resources, a higher threshold can be set (e.g., to reduce edges in a large graph). The overlap threshold can be set in terms of base pairs, number of words, or some other metric that measures common regions between two contigs.
- overlap also tends to indicate conceptual connectedness, such as how likely the contigs are in the same region (e.g., next to each other) of the genome.
- read clustering can be done to cluster the input reads into a plurality of read clusters. Such clustering can be based on the contig clusters that are determined as described herein. For example, an input read can be included in a read cluster associated with a particular contig cluster if sufficient alignment between the input read and a contig in the contig cluster is detected. Alignment between an input read and a contig can be partial (e.g., based on partial words extracted from both).
- Alignment can be performed using read alignment software, such as the Bowtie ultrafast, memory-efficient short read aligner developed at the University of Maryland; the MUMer bioinformatics software system maintained by the Center for Computational Biology at Johns Hopkins University; SOAP Aligner available from BGI; custom aligners; and others.
- the read clusters effectively re-sort the input reads based on the assembly of contigs that was performed on the input reads (e.g., as partitioned into read tiles or clusters).
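The read clustering just described can be sketched with a toy exact-match stand-in for a real aligner such as Bowtie. The `min_match` prefix test is an illustrative assumption; a production system would use full alignment.

```python
def cluster_reads(reads, contig_clusters, min_match=8):
    """Assign each input read to every contig cluster it aligns to.

    Toy aligner stand-in: a read "aligns" here if its first min_match
    bases occur exactly inside some contig of the cluster. A read may
    land in more than one cluster (e.g., due to repeats)."""
    read_clusters = {name: [] for name in contig_clusters}
    for read in reads:
        probe = read[:min_match]
        for name, contigs in contig_clusters.items():
            if any(probe in c for c in contigs):
                read_clusters[name].append(read)
    return read_clusters

clusters = {
    "cluster1": ["ACGTACGTACGTAA"],
    "cluster2": ["TTTTGGGGCCCCAAAA"],
}
reads = ["ACGTACGTAC", "GGGGCCCCAA", "NNNNNNNNNN"]  # last read aligns nowhere
rc = cluster_reads(reads, clusters)
```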
- Example 15 Exemplary System Performing Read Tile Assembly into Contigs in Parallel
- FIG. 7 is a block diagram of an exemplary system 700 performing read tile assembly into contigs in parallel on different machines 790A-N.
- read cluster (or read tile) assembly can be performed in parallel by distributing different read clusters to different machines for assembly.
- the read clusters can be distributed as sub-problems to different machines 790A-N as shown.
- the different read tiles (or clusters) 710A, 710B, 710N are accessible to respective ones of the machines 790A, 790B, 790N.
- An assembler 725A, 725B, 725N can execute at each of the machines to assemble the read tiles (or clusters) into respective contig sets 730A, 730B, 730N. As described herein, the same or different assemblers can be used.
- assembly can be performed as a sub-problem requiring no additional information regarding other read tiles (or clusters).
- assembly at one machine may finish before the assembly at another machine.
- the resulting contig sets 730A-N can be collectively partitioned into contig clusters as described herein.
- contig assembly can be performed by sending sets of read clusters (or tiles) to one or more other machines, and subsequently receiving contig sets assembled from respective of the read clusters (or tiles).
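The distribute-and-collect pattern above can be sketched as follows. Threads stand in for the separate machines 790A-N, and `toy_assemble` is a deliberately simplified stand-in for an off-the-shelf assembler such as Velvet; its 4-base-pair overlap rule is an illustrative assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_assemble(read_tile, overlap=4):
    """Assembler stand-in: greedily merge reads whose ends share
    `overlap` base pairs into longer contigs."""
    contigs = list(read_tile)
    merged = True
    while merged:
        merged = False
        for i in range(len(contigs)):
            for j in range(len(contigs)):
                if i != j and contigs[i][-overlap:] == contigs[j][:overlap]:
                    contigs[i] += contigs[j][overlap:]  # extend contig i
                    del contigs[j]
                    merged = True
                    break
            if merged:
                break
    return contigs

def assemble_tiles_in_parallel(read_tiles):
    """Each read tile is an independent sub-problem, so tiles can be
    dispatched to parallel workers (separate machines in FIG. 7)."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(toy_assemble, read_tiles))

tiles = [
    ["ACGTAAAA", "AAAATTTT"],   # 4-bp overlap -> merges into one contig
    ["GGGGCCCC", "TTTTACGT"],   # no overlap -> stays as two contigs
]
contig_sets = assemble_tiles_in_parallel(tiles)
```

Because no communication is needed between tiles, one worker finishing early never blocks the others; only the final collection of contig sets is synchronized.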
- Example 16 Exemplary System Performing Read Clustering in Parallel
- FIG. 8 is a block diagram of an exemplary system 800 performing read clustering based on contig clusters in parallel on different machines 890A-N.
- read clustering can be performed in parallel by distributing different contig clusters to different machines for alignment against the input reads.
- the contig clusters can be distributed as sub-problems to different machines 890A-N as shown.
- the different contig clusters 850A, 850B, 850N are accessible to respective ones of the machines 890A, 890B, 890N.
- the input reads 810 (e.g., the same input reads) are also accessible to the machines.
- An alignment tool 860A, 860B, 860N can execute at each of the machines to cluster the input reads 810 into read clusters 865A, 865B, 865N based on respective of the contig clusters 850A, 850B, 850N.
- the same or different alignment tools can be used.
- contig cluster alignment against the input reads can be performed as a sub-problem requiring no additional information regarding other contig clusters.
- clustering at one machine may finish before the clustering at another machine.
- the resulting read clusters 865A-N can be assembled into respective contig sets as described herein.
- read clustering can be performed by sending contig clusters to one or more other machines, and subsequently receiving read clusters clustered based on respective of the contig clusters and the input reads.
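A minimal sketch of this pattern, assuming a toy exact-match test in place of a real alignment tool: each worker handles one contig cluster and sees the full input read set, mirroring the broadcast of input reads 810 to every machine.

```python
from concurrent.futures import ThreadPoolExecutor

INPUT_READS = ["ACGTACGTAC", "GGGGCCCCAA"]  # the same reads go to every worker

def cluster_for(contig_cluster, min_match=8):
    """One sub-problem: align the full input read set against a single
    contig cluster (toy stand-in for a tool such as Bowtie)."""
    return [r for r in INPUT_READS
            if any(r[:min_match] in c for c in contig_cluster)]

contig_clusters = [["ACGTACGTACGTAA"], ["TTTTGGGGCCCCAAAA"]]
with ThreadPoolExecutor() as pool:
    read_clusters = list(pool.map(cluster_for, contig_clusters))
```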
- the sequence used to generate input reads for the disclosed methods can be obtained from any organism of interest, such as a microbe (such as a virus, bacteria, fungi, or protozoa), plant, or animal (such as a mammal, for example, a human or mouse or an amphibian such as a frog or salamander).
- input reads are obtained from a virus, such as a positive-strand RNA virus or a negative-strand RNA virus.
- exemplary positive-strand RNA viruses include, but are not limited to: Picornaviruses (such as Aphthoviridae [for example, foot-and-mouth-disease virus (FMDV)], Cardioviridae, Enteroviridae [such as Coxsackie viruses, Echoviruses, Enteroviruses, and Polioviruses], Rhinoviridae [Rhinoviruses], and Hepataviridae [Hepatitis A viruses]); Togaviruses (examples of which include rubella and alphaviruses [such as Western equine encephalitis virus, Eastern equine encephalitis virus, and Venezuelan equine encephalitis virus]); Flaviviruses (examples of which include Dengue virus, West Nile virus, and Japanese encephalitis virus); Caliciviridae (which includes Norovirus and Sapovirus); and Coronaviruses (examples of which include SARS coronaviruses, such as the Urbani strain).
- Exemplary negative-strand RNA viruses include, but are not limited to: Orthomyxoviruses (such as the influenza virus), Rhabdoviruses (such as Rabies virus), and Paramyxoviruses (examples of which include measles virus, respiratory syncytial virus, and parainfluenza viruses).
- DNA viruses include, but are not limited to: Herpesviruses (such as Varicella-zoster virus, for example, the Oka strain; cytomegalovirus; and Herpes simplex virus (HSV) types 1 and 2); Adenoviruses (such as Adenovirus type 1 and Adenovirus type 41); Poxviruses (such as Vaccinia virus); and Parvoviruses (such as Parvovirus B19).
- retroviruses include, but are not limited to: human immunodeficiency virus type 1 (HIV-1), such as subtype C; HIV-2; equine infectious anemia virus; feline immunodeficiency virus (FIV); feline leukemia viruses (FeLV); simian immunodeficiency virus (SIV); and avian sarcoma virus.
- input reads are obtained from one or more of the following: HIV; Hepatitis A virus; Hepatitis B (HB) virus; Hepatitis C (HC) virus; Hepatitis D (HD) virus; Hepatitis E virus; a respiratory virus (such as influenza A & B, respiratory syncytial virus, human parainfluenza virus, or human metapneumovirus), or West Nile Virus.
- input reads are obtained from bacteria.
- Bacteria can be classified as gram- negative or gram-positive.
- Exemplary gram-negative bacteria include, but are not limited to:
- Exemplary gram-positive bacteria include, but are not limited to: Bacillus anthracis, Staphylococcus aureus, Listeria, pneumococcus, gonococcus, and streptococcal meningitis.
- input reads are obtained from one or more of the following: Group A Streptococcus; Group B Streptococcus; Helicobacter pylori; Methicillin-resistant Staphylococcus aureus; Vancomycin-resistant enterococci; Clostridium difficile; E. coli (e.g., Shiga toxin producing strains); Listeria; Salmonella; Campylobacter; B. anthracis; Chlamydia trachomatis; and Neisseria gonorrhoeae.
- Protozoa, nematodes, and fungi are also types of pathogens.
- input reads are obtained from Plasmodium (e.g., Plasmodium falciparum to diagnose malaria), Leishmania, Acanthamoeba, Giardia, Entamoeba, Cryptosporidium, Isospora, Balantidium, Trichomonas, Trypanosoma (e.g., Trypanosoma brucei), Naegleria, or Toxoplasma.
- input reads are obtained from Coccidiodes immitis or Blastomyces dermatitidis.
- input reads are obtained from an organism having a large genome, such as a plant (for example, a flowering plant), an amphibian, a fish, or a mammal.
- Plant genomes that can be analyzed with the disclosed methods and systems include angiosperms, gymnosperms, monocotyledons and dicotyledons. Examples of monocotyledonous plants include asparagus, field and sweet corn, barley, wheat, rice, sorghum, onion, pearl millet, rye and oats.
- dicotyledonous plants include tomato, tobacco, cotton, potato, rapeseed, field beans, soybeans, peppers, lettuce, peas, alfalfa, clover, cole crops or Brassica oleracea (e.g., cabbage, broccoli, cauliflower, brussels sprouts), radish, carrot, beets, eggplant, spinach, cucumber, squash, melons, cantaloupe, sunflowers and various ornamentals.
- Woody species include poplar, pine, sequoia, cedar, oak, and the like.
- the plant is a highly inbred crop, such as rice, soybean, corn, cotton, wheat, oat, barley, or sorghum.
- plants include fruit plants (such as strawberry), fruit trees (such as a citrus tree, e.g., an orange, lime, lemon or grapefruit tree), as well as other fruit trees (e.g., cherry, papaya or plum tree), flowering plants, and grasses.
- the plant is a crop plant, such as soybean, corn, canola, tobacco, cotton and the like.
- exemplary plants that can be used with the methods provided herein include rice, maize, wheat, barley, sorghum, millet, grass, oats, tomato, corn, potato, banana, kiwi fruit, avocado, melon, mango, cane, sugar beet, tobacco, papaya, peach, strawberry, raspberry, blackberry, blueberry, lettuce, cabbage, cauliflower, onion, broccoli, brussels sprouts, cotton, canola, grape, soybean, oil seed rape, asparagus, beans, carrots, cucumbers, eggplant, melons, okra, parsnips, peanuts, peppers, pineapples, squash, sweet potatoes, rye, cantaloupes, peas, pumpkins, sunflowers, spinach, apples, cherries, cranberries, grapefruit, lemons, limes, nectarines, oranges, pears, tangelos, tangerines, lily, carnation, chrysanthemum, petunia, rose, geranium, violet, gladioli, orchid
- the disclosed methods use input reads containing sequence information for a target to be sequenced using the disclosed de novo genome assembly methods.
- Methods for generating input reads are routine in the art.
- the disclosed methods or systems include a step of generating input reads, for example, by sequencing a target nucleic acid of interest.
- input reads can be obtained from publicly available databases such as GenBank, EMBL (the European Molecular Biology Laboratory), the NCI DTP, CellMiner, and Ingenuity websites, as well as DDBJ (the DNA Databank of Japan).
- a target genome of interest (such as an entire genome or portion thereof, such as a particular gene or chromosome) can be sequenced using routine methods to produce input reads.
- the disclosure is not limited to particular sequencing methods, as numerous different methods are known in the art.
- one or more target nucleic acids in a sample are sequenced to produce input reads. For example, at least 2, at least 3, at least 4, at least 5, at least 10, at least 100, at least 1000, or at least 10,000 different target nucleic acids (such as 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 100, 500, 1000, 5000, 10,000, 30,000 or 50,000 different target nucleic acids) can be sequenced simultaneously or contemporaneously to produce input reads.
- an entire target genome or entire chromosome is sequenced.
- at least 2, at least 3, at least 4, at least 5, at least 10, at least 12, at least 15, at least 20, at least 23, at least 30, at least 40, or at least 50 different chromosomes are sequenced or assembled using the disclosed methods or systems, and thus can be used to generate input reads.
- genes can be used to produce input reads and be assembled using the disclosed methods.
- genes contain at least 500 nucleotides (nt), at least 1000 nt, at least 10,000 nt, at least 100,000 nt, at least 500,000 nt, at least 1,000,000 nt, or at least 10,000,000 nt (such as 1,000 to 10,000 nt; 1,000 to 1,000,000 nt or 1,000 to 25,000 nt).
- chromosomes can be used to produce input reads and can be assembled using the disclosed methods.
- chromosomes contain at least 100,000 nucleotides (nt), at least 500,000 nt, at least 1,000,000 nt, at least 10,000,000 nt, at least 100,000,000 nt, at least 1,000,000,000 nt, at least 2,000,000,000 nt, or at least 3,750,000,000 nt (such as 10,000 to
- genomes can be used to produce input reads and can be assembled using the disclosed methods.
- genomes are at least 1,000 nt, at least 5,000 nt, at least 10,000 nt, at least 100,000 nt, at least 500,000 nt, at least 1,000,000 nt, at least 10,000,000 nt, at least 100,000,000 nt, at least 1,000,000,000 nt, at least 2,000,000,000 nt, at least 3,000,000,000 nt, at least 10,000,000,000 nt, at least 50,000,000,000 nt, at least 100,000,000,000 nt, at least 130,000,000,000 nt, or at least 150,000,000,000 nt (such as 1,000 to 10,000,000 nt; 1,000,000 to 200,000,000,000 nt or 50,000,000 to 150,000,000,000 nt).
- Nucleic acid molecules such as DNA or RNA or both, can be extracted from a sample using routine methods, prior to sequencing.
- General methods for nucleic acid extraction and isolation are well known in the art and are disclosed in standard textbooks of molecular biology, including Ausubel et al., Current Protocols in Molecular Biology, John Wiley and Sons (1997). Methods for nucleic acid extraction from paraffin-embedded tissues are disclosed, for example, by Rupp and Locker (Lab. Invest., 56:A67, 1987) and DeAndres et al. (BioTechniques, 18:42044, 1995).
- nucleic acid extraction can be performed using a purification kit, buffer set and protease obtained from commercial manufacturers, such as Qiagen, according to the manufacturer's instructions.
- Other commercially available nucleic acid isolation kits include MASTERPURETM Complete DNA and RNA Purification Kit (EPICENTRETM
- nucleic acids are isolated from an FFPE sample, which may be sectioned.
- one or more chromosomes from a sample are isolated using routine methods.
- the disclosed methods include the step of obtaining a sample, preparing the sample for analysis (for example fixing the sample, isolating nucleic acids, or combinations thereof), or both.
- Methods of obtaining a biological sample from a subject are known in the art.
- tissue such as tissue from a human or other mammalian subject, as well as from other organisms such as a microbe or plant, are known.
- the sample is obtained from a microbe (e.g., fungus, bacteria, or parasite; specific examples listed above), plant, or other organism of interest.
- the sample can be fresh, frozen, or fixed.
- samples are processed post- collection by fixation and in some examples are wax- (e.g., paraffin-) embedded.
- Fixatives for mounted cell and tissue preparations are well known in the art and include, without limitation, formalin fixative; 95% alcoholic Bouin's fixative; 95% alcohol fixative; B5 fixative; Bouin's fixative; Karnovsky's fixative (glutaraldehyde); Hartman's fixative; Hollande's fixative; Orth's solution (dichromate fixative); and Zenker's fixative (see, e.g., Carson, Histotechnology: A
- the sample is a fixed, wax-embedded tissue sample.
- the sample includes a plant or part thereof, such as a leaf, flower, pollen, or ovule.
- the sample is a tumor sample, which in some examples is obtained by surgical excision of all or part of the tumor, by collecting a fine needle aspirate from the tumor, as well as other methods known in the art.
- hematological tumors include leukemias, including acute leukemias (such as acute lymphocytic leukemia, acute myelocytic leukemia, acute myelogenous leukemia and myeloblastic, promyelocytic, myelomonocytic, monocytic and erythroleukemia), chronic leukemias (such as chronic myelocytic (granulocytic) leukemia, chronic myelogenous leukemia, and chronic lymphocytic leukemia), polycythemia vera, lymphoma, Hodgkin's disease, non-Hodgkin's lymphoma (indolent and high grade forms), multiple myeloma, Waldenstrom's macroglobulinemia, heavy chain disease, mye
- solid tumors such as sarcomas and carcinomas
- solid tumors include fibrosarcoma, myxosarcoma, liposarcoma, chondrosarcoma, osteogenic sarcoma, and other sarcomas, synovioma, mesothelioma, Ewing's tumor, leiomyosarcoma, rhabdomyosarcoma, colon carcinoma, lymphoid malignancy, pancreatic cancer, breast cancer, lung cancers (such as non-small cell lung cancer), ovarian cancer, prostate cancer, hepatocellular carcinoma, squamous cell carcinoma, basal cell carcinoma, adenocarcinoma, sweat gland carcinoma, medullary thyroid carcinoma, papillary thyroid carcinoma, pheochromocytomas sebaceous gland carcinoma, papillary carcinoma, papillary adenocarcinomas, medullary carcinoma, bronchogenic carcinoma, renal cell carcinoma, hepatoma, bile duct carcinoma,
- Exemplary sequencing methods that can be used to generate input reads for the methods provided herein include, but are not limited to: Maxam-Gilbert sequencing, chain termination methods (e.g., Sanger sequencing), shotgun sequencing, bridge PCR, single-molecule real-time sequencing (e.g., from Pacific Bio), ion semiconductor (e.g., Ion Torrent sequencing),
- pyrosequencing e.g., 454 from Life Sciences
- sequencing by synthesis e.g., from Illumina
- sequencing by ligation e.g., SOLiD sequencing
- polony sequencing heliscope single molecule sequencing
- massively parallel signature sequencing MPSS
- an individual input read is at least about 25 nucleotides (nt) in length, such as at least about 30 nt, at least 40 nt, at least 50 nt, at least 60 nt, at least 70 nt, at least 80 nt, at least 90 nt, at least 100 nt, at least 110 nt, at least 120 nt, at least 130 nt, or at least 150 nt, such as 25 to 1000 nt, 25 to 500 nt, 25 to 150 nt, 25 to 120 nt, or 50 to 120 nt.
- an individual input read is longer, such as at least about 1000 nt in length, such as at least about 2000 nt, at least 3000 nt, at least 4000 nt, at least 5000 nt, at least 6000 nt, at least 7000 nt, at least 8000 nt, at least 9000 nt, at least 10,000 nt, at least 50,000 nt, at least 100,000 nt, or at least 1,000,000 nt, such as 1000 to 10,000,000 nt, 1000 to 1,000,000 nt, 1000 to 100,000 nt, 1000 to 10,000 nt, or 1000 to 5000 nt.
- an average length of an input read in a population of input reads is at least about 25 nt in length, such as at least about 30 nt, at least 40 nt, at least 50 nt, at least 60 nt, at least 70 nt, at least 80 nt, at least 90 nt, at least 100 nt, at least 110 nt, at least 120 nt, at least 130 nt, or at least 150 nt, such as 25 to 1000 nt, 25 to 500 nt, 25 to 150 nt, 25 to 120 nt, or 50 to 120 nt.
- an average length of an input read in a population of input reads is longer, such as at least about 1000 nt in length, such as at least about 2000 nt, at least 3000 nt, at least 4000 nt, at least 5000 nt, at least 6000 nt, at least 7000 nt, at least 8000 nt, at least 9000 nt, at least 10,000 nt, at least 50,000 nt, at least 100,000 nt, or at least 1,000,000 nt, such as 1000 to 10,000,000 nt, 1000 to 1,000,000 nt, 1000 to 100,000 nt, 1000 to 10,000 nt, or 1000 to 5000 nt.
- a population of input reads can have numerous individual input reads, such as at least 1000 different input reads, at least 5000, at least 10,000, at least 50,000, at least 100,000, at least 500,000, at least 1,000,000, at least 5,000,000, at least 10,000,000, at least 50,000,000, at least 100,000,000, at least 500,000,000, at least 1,000,000,000, at least 5,000,000,000, at least
- Each of the input reads is placed or partitioned into a read tile or "bucket".
- the number of read tiles used can be selected by the user, but there are at least two different read tiles. In some examples, there are at least 10, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 500, at least 600, at least 750, or at least 1000 different read tiles, such as 10 to 1000, 10 to 600, 10 to 500, 10 to 100, 50 to 100, 50 to 1000, 50 to 500, or 50 to 100 different read tiles.
- the assignment of a particular input read to a particular read tile is random.
- the first input read can be assigned to read tile 1
- the second input read can be assigned to read tile 2, and so forth.
- input reads 1, 51, 101 etc. can be assigned to read tile 1
- input reads 2, 52, 102 etc. can be assigned to read tile 2, and so forth.
- the input reads are assigned to the read tiles in groups. For example, if there are 50 read tiles, the first 1/50th of the input reads can be assigned to read tile 1, the second 1/50th of the input reads can be assigned to read tile 2, and so forth.
- the input reads are non-randomly assigned to a read tile. For example, if it is known that particular reads are near one another on the target, these related input reads can be assigned to the same read tile.
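The three assignment schemes described above (round-robin, block-wise, and fully random) can be sketched as follows; the tile counts and seed are illustrative.

```python
import random

def assign_round_robin(reads, n_tiles):
    """Read i goes to tile i mod n_tiles, so reads 1, 51, 101, ...
    share a tile when n_tiles is 50."""
    tiles = [[] for _ in range(n_tiles)]
    for i, read in enumerate(reads):
        tiles[i % n_tiles].append(read)
    return tiles

def assign_blocks(reads, n_tiles):
    """The first 1/n of the reads go to tile 1, the next 1/n to tile 2, ..."""
    size = -(-len(reads) // n_tiles)  # ceiling division
    return [reads[i:i + size] for i in range(0, len(reads), size)]

def assign_random(reads, n_tiles, seed=0):
    """Fully random assignment of each read to a tile."""
    rng = random.Random(seed)
    tiles = [[] for _ in range(n_tiles)]
    for read in reads:
        tiles[rng.randrange(n_tiles)].append(read)
    return tiles

reads = [f"read{i}" for i in range(6)]
rr = assign_round_robin(reads, 3)
blk = assign_blocks(reads, 3)
```

Non-random assignment (placing reads known to be near one another on the target in the same tile) would replace the index arithmetic with that a priori grouping.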
- Each of the individual read tiles containing a plurality of input reads is assembled, for example using a known assembler. Assemblers align the individual input reads in each read tile based on the ends of the sequences of the input reads. An empirically determined threshold (e.g., n nt) can be used to confirm that one input read is related to another. Examples of known assemblers include Velvet available from the European Bioinformatics Institute, SOAPdenovo available from BGI, ALLPATHS-LG, ABySS, IDBA, and the like.
- the assembly of the read tiles can be done independently in serial or in parallel on a shared or distributed memory computer cluster. At this stage, there is no communication between the assemblies of the different read tiles.
- the result of assembling the input reads in each read tile is the production of contigs in each read tile. As a result of this assembly, the number of sequences (input reads merged into contigs) decreases, but the number of read tiles (now referred to as contig sets) stays the same. All of the contigs generated from all of the read tiles are combined into a single contig set.
- the contig set generated from assembling the input reads in each read tile is used to generate a graph, referred to as a contig connectivity graph, showing the relatedness of the individual contigs in the set (e.g., as shown in Figure 10 as described below).
- the contig connectivity graph is generated based on generating graph vertices for the contigs, wherein the weight of the vertex is the contig length.
- the edge weights can be defined based on the degree of overlap between one contig and another.
- the resulting contig connectivity graph is then partitioned into contig clusters, for example using a graph-partitioning tool, such as METIS, KaHIP, kMetis, Scotch from the Laboratoire Bordelais de Recherche en Informatique, or the like.
- the contig lengths are given less importance than the contig overlapping degrees (edge weights) in the graph partitioning process. However, if there are many short contigs with little connectivity in-between, the vertex weights can be given greater importance (for example this may be true during the first few iterations, until the contig length increases).
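For illustration, the weighted contig connectivity graph can be serialized in the METIS input format before partitioning. The fmt code 011 indicates that both vertex weights and edge weights are present (per the METIS manual; worth verifying against the METIS version in use); vertices are 1-indexed and each edge appears in both endpoints' lines.

```python
def write_metis_graph(vertex_weights, edge_weights, path):
    """Write a graph with vertex weights (contig lengths) and edge weights
    (overlap degrees) in METIS input format (fmt=011, one weight per vertex)."""
    verts = sorted(vertex_weights)                 # stable vertex order
    index = {v: i + 1 for i, v in enumerate(verts)}  # 1-based indices
    adj = {v: [] for v in verts}
    for (a, b), w in edge_weights.items():
        adj[a].append((index[b], w))               # edge listed on both ends
        adj[b].append((index[a], w))
    lines = [f"{len(verts)} {len(edge_weights)} 011 1"]
    for v in verts:
        parts = [str(vertex_weights[v])]           # vertex weight first
        for nbr, w in sorted(adj[v]):
            parts += [str(nbr), str(w)]            # neighbor, edge weight pairs
        lines.append(" ".join(parts))
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")

vw = {"c1": 14, "c2": 14, "c3": 16}   # contig lengths as vertex weights
ew = {("c1", "c2"): 5}                # one overlap edge of weight 5
write_metis_graph(vw, ew, "contigs.graph")
```

The resulting file can then be handed to `gpmetis contigs.graph <n_clusters>`; the relative emphasis of vertex versus edge weights is controlled by METIS's balancing options rather than by the file itself.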
- the number of contig clusters can be determined by a user.
- At least 10, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 10,000, or more than 10,000 different contig clusters, such as 10 to 1000, 10 to 500, 10 to 100, 50 to 1000, 50 to 500, 50 to 100, or over 10,000 different contig clusters, are generated.
- the number of clusters can be chosen based on the given computation resources, the genome's characteristics, and the like. Dramatic variation within the same species is supported.
- the resulting contig clusters are then used as part of the input in the next step.
- the entire input read set (i.e., the initial sequence data partitioned into the read tiles) is then aligned to the contig clusters generated above.
- Each input read having similarity to a contig cluster is collected.
- a single input read may have similarity to more than one contig cluster, and thus be assigned to more than one contig cluster (e.g., due to repeats).
- the input read-to-contig cluster alignment can be performed using an alignment tool, such as Bowtie.
- the result is the production of a plurality of read clusters, which can be longer and/or have more bridges than the contig clusters.
- the resulting read clusters are assembled as described above for "Assembly of Read Tiles into Contigs". This produces a plurality of contigs, which are longer but fewer in number than those generated above.
- Merged contigs can be assembled in second phase assembly to promote elongation. In some examples, such as when the assembly of the merged contigs is not improving, the merged contig assembly can be skipped in later iterations to save time. In some examples, merged contig assembly is only performed in the first few iterations, if at all.
- the resulting contigs can be analyzed to see if there was improvement (e.g., length of contigs increased, number of bridges between contigs increased, or combinations thereof). If there was improvement, the process can be repeated with one or more additional iterations of the contig clustering, read clustering, and read assembly steps. If there was no improvement, the process can be terminated, and the resulting contigs outputted.
- the steps of contig clustering, read clustering, read assembly are iterated at least 2 times, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 15, at least 20, at least 25, at least 50, at least 75, at least 100, at least 150, at least 200, at least 500, at least 1000, at least 2500, at least 5000, or at least 10,000 times, such as 50 to 100, 50 to 500, or 100 to 1000 times.
- the contig clusters can be provided to a user.
- the resulting contigs can be analyzed, for example to identify functional regions or domains (e.g., annotate the contigs), to compare the sequence of the contigs to other species, or to examine genetic variations.
- the resulting contigs are assembled into a genome or chromosome.
- a treatment plan is administered to a patient based on the sequencing results, for example the presence or absence of particular mutations in the target assembled.
- Example Al is illustrated by the disclosed non-limiting Examples.
- This example describes the Tiger implementation used to generate the results described in the examples below.
- assemblers can deal with small genomes (such as E. coli) very well using a small amount of computational resources and time. For very large genomes (such as mammalian-size genomes), most assemblers either cannot produce good results or require a tremendous amount of resources and/or time. In addition, assemblers usually have their own design characteristics targeting specific types of genomes [31].
- the disclosed approach substantially reduces the computational complexity and resources needed for large genome assembly. The method and system more effectively divides the genome assembly problem into smaller sub-problems and assembles them individually (for example, without inter-node communication) with the flexibility of using various off-the-shelf assemblers. In some examples, an iterative refinement approach is used to gradually improve the quality of problem partitioning and the overall solution.
- ideally, each assembled contig comes from a specific region in the target genome.
- most assemblers extract the k-mers from all the input reads and mix them together when constructing the DBG.
- the k-mers whose source reads do not contribute to a specific region in the target genome may still be used in the DBG construction process.
- Such k-mers are termed ambiguous k-mers. For genomes that are less repetitive, the ambiguous k-mers could be few. But for genomes that are highly repetitive, they can be significant enough to confuse the assembly process.
- a well-clustered read tile should contribute to a continuous region of the target genome.
- the region can be composed of one long contig or a set of contigs covering the whole region without gaps in-between.
- a set of such contigs is called well-clustered and can be obtained by sorting or clustering closely related contigs together. Therefore, by aligning the input reads against a well-clustered contig set, the reads having high similarity with sub-sections in the contigs can be collected as a well-clustered read tile. This process is called read clustering. The collected reads can then be assembled to produce a similar set of the contigs but with improved contig lengths and quality.
- a target genome can be considered as a combination of multiple continuous regions (the minimum number is the number of chromosomes in the target genome), where each region can be contributed completely by one or many contigs. Therefore, ultimately there would be multiple well-clustered contig sets corresponding to multiple regions in the target genome.
- the contigs from assembly were treated as the intermediate reference genome and the contigs were arranged in multiple clustered contig sets.
- the reads in tiles are used, the reads in each tile assembled, and all contig sets merged into one as the intermediate reference.
- the reads in an initial, randomly partitioned tile will correspond to random regions in the target genome.
- the initial contig sets that serve as the intermediate reference will likely be fragmented and will have errors.
- the disclosed approach iteratively improves the clustering and thus the quality of the intermediate reference genome. In the end, the intermediate reference genome converges to the final target genome.
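The overall iterative scheme can be sketched end-to-end with toy stand-ins for the assembler, contig clustering, and read clustering. The 4-bp overlap/k-mer values and the longest-contig improvement test are illustrative assumptions, not the disclosed criteria (Tiger uses off-the-shelf assemblers, graph partitioning, and aligner-based clustering, as described in the surrounding steps).

```python
def greedy_assemble(tile, k=4):
    """Toy assembler stand-in: merge reads whose ends share k base pairs."""
    contigs = list(tile)
    changed = True
    while changed:
        changed = False
        for i in range(len(contigs)):
            for j in range(len(contigs)):
                if i != j and contigs[i][-k:] == contigs[j][:k]:
                    contigs[i] += contigs[j][k:]
                    del contigs[j]
                    changed = True
                    break
            if changed:
                break
    return contigs

def cluster_contigs(contigs, k=4):
    """Toy clustering stand-in: contigs sharing any k-mer go together
    (the disclosure instead partitions a weighted connectivity graph)."""
    clusters = []
    for c in contigs:
        kmers = {c[i:i + k] for i in range(len(c) - k + 1)}
        for cl in clusters:
            if kmers & cl["kmers"]:
                cl["contigs"].append(c)
                cl["kmers"] |= kmers
                break
        else:
            clusters.append({"contigs": [c], "kmers": kmers})
    return [cl["contigs"] for cl in clusters]

def cluster_reads(reads, contig_clusters, k=4):
    """Toy read clustering stand-in: a read joins the first cluster holding
    a contig that contains the read's k-bp prefix."""
    tiles = [[] for _ in contig_clusters]
    for r in reads:
        for tile, cluster in zip(tiles, contig_clusters):
            if any(r[:k] in c for c in cluster):
                tile.append(r)
                break
    return tiles

def tiger_loop(reads, n_tiles, max_iters=10):
    """Iterate assembly -> contig clustering -> read clustering until the
    longest contig stops growing (an illustrative stopping criterion)."""
    tiles = [reads[i::n_tiles] for i in range(n_tiles)]  # round-robin start
    best_contigs, best = [], 0
    for _ in range(max_iters):
        merged = [c for t in tiles for c in greedy_assemble(t)]
        longest = max(len(c) for c in merged)
        if longest <= best:          # no improvement: terminate
            break
        best, best_contigs = longest, merged
        tiles = cluster_reads(reads, cluster_contigs(merged))
    return best_contigs

reads = ["ACGTAAAA", "AAAATTTT", "GGGGCCCC", "CCCCTTAA"]
result = tiger_loop(reads, 2)
```

In the demo, the initial round-robin tiles separate the overlapping read pairs, so the first pass produces only short contigs; one round of re-clustering co-locates the related reads, and the second pass merges them, after which the loop detects no further improvement and stops.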
- Step 1 Read partitioning: The input reads were first partitioned into multiple read tiles. In one example, the input reads were randomly partitioned evenly into N subsets, where N can be determined by users based on the available resources and the total size of the input reads. In some examples, a non-random initial assignment of reads can be used, for instance in cases where a priori knowledge about the reads can provide clustering information.
- Step 2 Read assembly: Read tiles were assembled individually, using an off-the-shelf assembler, such as Velvet, embedded into Tiger. Depending on the available system memory, the assembly of read tiles can be done independently in serial or in parallel on a shared or distributed memory computer cluster. There is no communication between the assemblies of different read tiles.
- k-mer sizes can be decided either manually by users or automatically through the auto-k-mer scheme in Tiger.
- the same k-mer size is used in all read tile assemblies for all Tiger iterations.
- the auto-k-mer scheme randomly picks k-mer sizes within a given range and records the best k-mer size in the assembly results.
- the best k-mer size and the randomly picked ones will be considered in the subsequent assemblies.
- A user-specified k-mer size can be introduced into this k-mer history database but may not be used again if the first attempt is not good.
- the number of used reads in the assembly, the total length of the contigs, and the resultant N50s are used to evaluate whether a k-mer size can help produce the best result without knowing the target genome. This avoids the problem of picking a contig set with high N50 and low coverage and enables Tiger to find a good direction in the iterative process and to converge to high quality results.
- Alternative embodiments could optionally determine k-mer size to maximize N50 or other metrics, rather than coverage.
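A sketch of the evaluation just described. The N50 definition is standard, but combining used reads, total contig length, and N50 lexicographically is one plausible reading of the criteria, not the exact weighting used by Tiger.

```python
def n50(contig_lengths):
    """N50: the largest length L such that contigs of length >= L together
    cover at least half of the total assembled length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def assembly_key(used_reads, contig_lengths):
    """Rank candidate k-mer sizes by number of used reads, total contig
    length, and N50, in that order (hypothetical combination)."""
    return (used_reads, sum(contig_lengths), n50(contig_lengths))

# A high-N50 but low-coverage assembly should not automatically win:
fragmented = assembly_key(used_reads=900, contig_lengths=[100, 80, 60, 40, 20])
high_n50 = assembly_key(used_reads=300, contig_lengths=[150])
```

Ranking `fragmented` above `high_n50` here illustrates how coverage-aware evaluation avoids the trap of picking a contig set with high N50 and low coverage.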
- When Step 2 assembles the initial read tiles for the first time, the contigs can be short and may cause long running times in the later iterations. In some examples this is addressed by merging the contig sets and feeding the merged contig set to Velvet with the LONGSEQUENCE flag enabled. Velvet may further elongate the contigs by treating the input contigs as long reads. The new contig set is used when it is better than the merged contig set.
- the output contig set is scaffolded by SSPACE [24].
- the scaffolded contig set is the input to Step 3.
- the purpose of this scaffolding process is to leverage paired-end information to bridge contigs which may be from different assemblies. This is beneficial for better clustered contigs at Step 3.
- the scaffolding process also helps resolve duplicated contigs from different assemblies.
- Step 3 Contig clustering: An overview of the contig clustering algorithm used is depicted in Figure 10.
- words are extracted from contigs.
- the number of common words (e.g., base pair sets, base pairs, or the like) between contigs is used to weight the graph edges.
- Contig lengths are modeled as vertex weights.
- the contig connectivity graph is thus built, followed by the METIS partitioning process.
- the partitioned sub-graphs are clustered contig sets.
- a graph that models the contig connectivity intensity is built from the merged contig set. This graph is called the "contig connectivity graph.”
- Graph vertices are the contigs.
- Vertex weights are contig lengths.
- Edge weights are defined based on the degree to which contigs overlap each other.
- a threshold (e.g., 10 or the like) can be applied to the edge weights.
- the contig connectivity graph is much smaller than the DBG so it uses a much smaller amount of memory.
- a graph-partitioning tool, METIS [25], was used to partition the graph into contig clusters. METIS is known to be fast and memory-efficient in processing millions of graph vertices.
- the contig lengths (vertex weights) were given less importance than the contig overlapping degrees (edge weights) in the graph partitioning process. This is because it was desired that the partitioned contig connectivity sub-graphs be more edge-oriented instead of being vertex-oriented. But the vertex weights need to be considered for the situations where there exist many short contigs with little connectivity in-between. This is very common for the assembly results in the first few iterations on assembling randomly partitioned read tiles. These short contigs can be distributed to all clusters evenly. This not only preserves their existence in the following Tiger iterations but also reduces their influence on the rest of the clustered contigs.
- Building a contig connectivity graph can be time-consuming if a traditional sequence alignment method is used, such as the Smith-Waterman algorithm [26]. Since the degree of overlap between contigs need not be determined exactly for the methods and systems herein, a heuristic algorithm was used based on the image recognition algorithm using vocabulary trees in [27], with inverse document frequency scoring dropped. Consecutive sequences of equal length (called "words") are extracted from each of the contigs in the set. The extracted words are used to build a map (or inverted file) from the words to the contigs containing them. A contig connectivity graph is then built from the map, with the edge weights set to the number of base pairs in common between the vertices (contigs) connected by that edge.
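The word-map heuristic can be sketched as below. Each edge weight here counts the words two contigs share; the source weights edges by base pairs in common, so this counting rule and the word length of 8 are simplifying assumptions:

```python
from collections import defaultdict
from itertools import combinations

def connectivity_graph(contigs, word_len=8):
    """Build a contig connectivity graph from shared fixed-length words.

    Each edge weight counts the words two contigs share (the source
    weights edges by base pairs in common; counting shared words is a
    simplification, and word_len=8 is an illustrative choice).
    """
    # Inverted file: word -> set of contigs (by index) containing it.
    index = defaultdict(set)
    for ci, seq in enumerate(contigs):
        for i in range(len(seq) - word_len + 1):
            index[seq[i:i + word_len]].add(ci)
    # One unit of edge weight per word shared by a pair of contigs.
    edges = defaultdict(int)
    for members in index.values():
        for a, b in combinations(sorted(members), 2):
            edges[(a, b)] += 1
    return dict(edges)

g = connectivity_graph(["ACGTACGTACGT", "GTACGTACGTAA", "TTTTTTTTTTTT"])
assert g == {(0, 1): 4}  # the first two contigs overlap; the third is isolated
```

A real implementation would then hand this weighted graph, with contig lengths as vertex weights, to METIS for partitioning.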
- Since the connectivity graph stores only the contig connectivity information, the memory usage of this step is much lower than that at the read assembly step.
- building the connectivity graph can dominate this whole step. Building the word map can be done in parallel if desired. Overall, the runtime is still much shorter than that of Steps 4 and 5.
- Step 4 Read clustering: The entire input read set is aligned to the contig sets from Step 3. The reads having high similarity to each contig set are collected as one read cluster. Each read is collected once for each cluster. This means a read can appear in multiple read clusters if similar contigs are not clustered together. This process guarantees any read potentially contributing to a contig set will be collected.
- the read-to-contig alignment was done using Bowtie [28], but one skilled in the art will appreciate that other methods can be used.
- For paired-end reads, if one read of a pair aligns to the given contig cluster, both reads are collected. This step provides the opportunity to extend and/or bridge the contigs. This clustering process can be done in parallel or in serial on a shared or distributed memory computer cluster. In some examples, no communication is needed between the clustering tasks.
- the required memory is also proportional to the size of a read tile.
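Step 4 can be sketched as below. Exact-substring matching stands in for a real aligner such as Bowtie, and the pair representation is an assumption; the point is the collection rule, not the alignment itself:

```python
def cluster_reads(read_pairs, contig_clusters):
    """Collect each read pair into every contig cluster either mate hits.

    Exact-substring matching stands in for a real aligner such as
    Bowtie.  A pair can land in several clusters (if similar contigs
    were not clustered together) and is collected at most once per
    cluster, so no read that could contribute to a contig set is lost.
    """
    def aligns(read, cluster):
        return any(read in contig for contig in cluster)

    return [[pair for pair in read_pairs
             if aligns(pair[0], cluster) or aligns(pair[1], cluster)]
            for cluster in contig_clusters]

clusters = cluster_reads(
    [("AAA", "TTT"), ("CCC", "AAA"), ("GGG", "TTT")],
    [["AAACCC"], ["GGGTTT"]],
)
assert clusters[0] == [("AAA", "TTT"), ("CCC", "AAA")]
assert clusters[1] == [("AAA", "TTT"), ("GGG", "TTT")]  # first pair is in both clusters
```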
- Step 5 Read assembly: The read tiles were assembled in the same way as described for Step 2, except that the assembly of the merged contigs from all read tile assemblies may optionally not be performed. If the assembly of the merged contigs stops improving, it is skipped in later iterations to save time. It was observed that this additional assembly is useful in the first few iterations.
- Step 6 Post processing: In some examples, the methods exit upon reaching the given number of iterations. Iterations which do not exit return to Step 3; Steps 3, 4, and 5 thus form an iterative process.
- systems and methods can be implemented which include additional pre- or post-processing steps or which modify certain iterations, for example optionally performing the read tile assembly sub-step according to Step 5.
- Two types of evaluation were carried out: type R (random) and type I (improved), labeled as Tiger-Velvet/Soap-R/I.
- Type I started from the assembly result generated by Velvet/SOAPdenovo (instead of random partitioning), followed by Tiger-Velvet/Soap, respectively, to improve the result. This was to demonstrate that Tiger could also improve the single-tile assembly by its embedded assembler.
- Both types of evaluation used 150-tile assembly with auto-k-mers.
- the machine for these evaluations was installed with an Intel Core i7 950 CPU (4 physical cores at a 3.07 GHz clock rate), 24 GB of system memory, and 2 TB of disk space. Five such machines were used.
- the human chromosome 14 data set in the GAGE assembly competition [20, 21] was used to assess Tiger.
- This chromosome is 88 Mbp excluding Ns (e.g., low quality values or reads, such as those that could not be positively identified as A, T, C, or G by the sequencer).
- the reads were corrected by Quake [22] before assembly.
- the other data set was the 4.6 Mbp long E. coli genome (Illumina paired-end reads, accession no. SRR001665) with 36 bp read length, generated from a 200 bp insert length, E. coli K- 12 MG1655 library (accession no. NC_000913).
- the assembly results were analyzed by the evaluation script from [21], using the MUMMer package [23], with 200 bp as the minimum contig length.
- the same analysis metrics in [21] are reused in Table 2 and Table 5.
- the NG50 value is the smallest contig size such that 50% of the reference genome is contained in contigs of size NG50 or larger.
- the error-corrected NG50 is calculated by splitting contigs at every misjoin and at every indel that is longer than 5 bp.
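The NG50 definition above can be computed directly from the contig lengths and the reference size:

```python
def ng50(contig_lengths, genome_size):
    """NG50: the smallest contig length L such that contigs of length
    >= L together cover at least half of the reference genome size."""
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= genome_size / 2:
            return length
    return 0  # contigs cover less than half the genome

assert ng50([50, 40, 30, 20, 10], 150) == 40  # 50 + 40 = 90 >= 75
```

The error-corrected variant would first split each contig at every misjoin and at every indel longer than 5 bp, then apply the same calculation to the split lengths.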
- SNPs are single nucleotide differences.
- Inversions are reversed sequences in strands.
- Relocations are sequence rearrangements.
- "Unaligned ref." is the number of bases in the reference that were not aligned to any contig.
- "100% - Unaligned ref." is the genome coverage.
- "Duplicated ref." refers to the frequencies with which reference sequences occur in contigs.
- Tables 2-7 summarize the evaluation results using Tiger.
- the Tiger results had better NG50s and genome coverage as compared to the best Velvet and SOAPdenovo results using k-mer sizes 61 and 55, respectively, indicating that Tiger can iteratively improve the assembly results.
- Table 2. The human chromosome 14 assembly results in terms of continuity, accuracy, and statistics. The columns include the number of contigs, NG50 size and its error-corrected size, the number of single nucleotide polymorphisms (SNPs), the number of indels and misjoins in contigs, total assembly length, genome coverage (100 - Unaligned ref.), and duplications.
- K-mer 61 and 55 are the best k-mer sizes for Velvet and SOAPdenovo, respectively. "#k” stands for the applied k-mer size. “#i” stands for the iteration number.
- the human chromosome 14 assembly results in terms of contiguity and accuracy.
- the columns from the second are the number of contigs, NG50 size and its error-corrected size, the number of single nucleotide polymorphisms (SNPs), and the number of indels and misjoins in contigs.
- K-mer 61 and 55 are the best k-mer sizes for Velvet and SOAPdenovo (rows with bold border), respectively. "#i" stands for the iteration number. Since each experimental result takes the GAGE evaluation scheme about 4 to 8 hours to finish, we only picked the experimental result for evaluation based on the reported higher NG50 and lower contig number.
- E. coli (SRR001665) 24-tile assembly results in terms of contiguity, accuracy, and statistics.
- the columns include the number of contigs, NG50 size and its error-corrected size, the number of single nucleotide polymorphisms (SNPs), the number of indels and misjoins in contigs, total assembly length, genome coverage (100 - Unaligned ref.), and duplications.
- K-mer 25 and 27 are the best k-mer sizes for Velvet and SOAPdenovo, respectively.
- “#k” stands for the applied k-mer size.
- “#i” stands for the iteration number.
- Both Tiger-Velvet-I and Tiger-Soap-I evaluations use the best results from Velvet and SOAPdenovo as input, respectively.
- E. coli (SRR001665) 24-tile assembly results in terms of contiguity and accuracy.
- the columns from the second are the number of contigs, NG50 size and its error-corrected size, the number of single nucleotide polymorphisms (SNPs), and the number of indels and misjoins in contigs.
- K-mer 25 and 27 are the best k-mer sizes for Velvet and SOAPdenovo (rows with bold border), respectively.
- "#i" stands for the iteration number.
- Both Tiger-Velvet-I and Tiger-Soap-I experiments use the best results from Velvet and SOAPdenovo as input, respectively.
- E. coli (SRR001665) 24-tile assembly statistics. The columns from the second are the total assembly lengths, genome coverage (100 - Unaligned ref.), etc. K-mer 25 and 27 are the best k-mer sizes for Velvet and SOAPdenovo (rows with bold border), respectively. "#i" stands for the iteration number.
- Tiger-Velvet-I used the Velvet best result as input
- Tiger-Soap-I used the SOAPdenovo result with k-mer size 61. It is noted that although Tiger applies an iterative assembly process, given the same parameter settings and inputs, Tiger can reproduce the same results.
- the NG50s and coverages of the type R show continuous improvement from iteration to iteration.
- the best NG50s by Tiger-Velvet/Soap-R reach 11.6 kbp and 3.6 kbp (or 2.2x and 1.2x improvement), respectively, as compared to the best Velvet/SOAPdenovo results.
- the type I results also show continuous NG50 improvement.
- Regarding the coverages, although the Tiger-Velvet-I results have an improving trend, no such trend is clear in the Tiger-Soap-I results.
- the best NG50s by Tiger-Velvet/Soap-I reach 10.9 kbp and 3.8 kbp (or 2.1x and 1.3x improvement), respectively.
- Table 8 lists the runtime and memory usage results observed for the read assembly, demonstrating the low memory usage of tiled assembly using multiple auto-k-mers.
- Tiger-Velvet consumes the least memory, as low as 0.16 GB.
- Tiger-Soap still consumes 1.8 GB even though the read tile file size is only around 10 MB, whereas the 1-tile read file size is 4.7 GB.
- Comparing the runtime between the Velvet/SOAPdenovo and Tiger-Velvet/Soap evaluations shows that Velvet and SOAPdenovo run much faster when the read tile size is small. For instance, the runtime for the 150-tile Tiger-Velvet-R assembly with 8 auto-k-mers is less than twice that of the 1-tile Velvet assembly.
- Table 8. The runtime and memory usage of the assemblies on the human chromosome 14 genome. All evaluations are done using 1 thread. "#k" stands for the applied k-mer size. "#i" stands for the iteration number. Note: The runtime and memory usage for Tiger is on the read assembly (Step 5) only.
- Table 9 further lists the detailed computational resource usage using different numbers of threads across computers by Tiger and its counterparts.
- the runtime and memory usage include the whole Tiger assembly process from Step 3 to 5. Since the resource usage of a Tiger iteration can be very different especially for the type R tests, we used the first iteration of the type I because it is stabilized and consumes more resources than the type R iterations.
- the peak memory usage observed for Tiger using one thread was 1.87 GB, and the runtime was 4.69 hours.
- the 1.87 GB of memory comes from the contig clustering (Step 3) because the current implementation targets 4 GB memory machines.
- the memory usage of 4-thread execution was 2.44 GB. This demonstrates Tiger's capability of running on commodity computers.
- Step 5* does not include SSPACE result since it does not execute across computers.
- Step 4 took up to 81.86% of the time across Steps 3, 4, and 5, and Step 5 took 15.30%. Further optimization may decrease the amount of time consumed.
- Step 4 performs read-to-contig alignments, where the runtimes of the alignment tasks are similar to one another. This fits the bulk-synchronous-parallel computation model well, so the speedups were 3.41x, 10.74x, and 15.26x, close to the ideal linear speedups of 4x, 12x, and 20x, respectively.
- In Step 5, the bulk-synchronous-parallel computation model is also used. The last contig scaffolding task was parallelizable only within one computer, so while the scaffolding task was in progress, the other computers were idle.
- the choice of the number of read tiles can affect not only the processing time needed to reach good results but also the quality of results the assembler can reach.
- Low memory footprint of each read tile assembly: In some examples, the focus of Tiger is on decomposing the input reads into sub-assembly problems, instead of decomposing the de Bruijn graph data structure.
- the required memory for an assembly is inversely proportional to the number of read tiles. When the read coverage exceeds the number of read tiles, the assembly of each tile may need a large amount of memory in the beginning iterations due to the random read partitioning, because each read tile may contain a considerable proportion of all k-mers for the whole target genome assembly. The read tile number can be increased to make the memory of each read tile assembly acceptable.
- Assembler embedding It is known that every assembler has its characteristics for specific types of genomes [31].
- the input to the disclosed framework can be an assembly result from assembler A. For example, Tiger can embed assembler A to further improve it, as was done in the type I evaluations.
- a system of the disclosure can also embed another assembler B to improve the results produced by A.
- Scalable parallel assembly process In Tiger, the most time consuming steps are read clustering and read assembly. Both are highly parallel and do not need communication between threads. This makes the framework provided herein suitable for either distributed or shared memory computer clusters.
- More duplications were observed in Tiger assemblies. For a single-tile assembly, assemblers usually can detect duplications and resolve some of them. Duplications are more likely to take place in Tiger because the assemblies for tiles are done independently. When the duplications are at the contig ends, scaffolding tools usually can resolve them by merging the contigs together; for example, the scaffolding tool SSPACE can help eliminate contig-end duplications. When the duplications are in the middle of two contigs, a post-processing step can be used to resolve them.
- Theorem 2: The N50 of the legitimately correct contig set C (built from the input reads as a whole) is as good as that of the contig set C_iter that is constructed through iteratively adding the input reads R to the assembly process. This means the contigs in C_iter can be elongated through iterative read clustering, which provides the exact set of reads for the assembly, until they repeat themselves or reach their ends.
- Lemma 3 A contig c can be elongated legitimately correctly until it repeats itself or reaches its end if all reads, r, that can contribute to it are added to the assembly process.
- Lemma 4 A contig c can be adjusted in the elongation process when more reads are collected.
- FIG. 12 illustrates a generalized example of a suitable computing system or environment
- the computing system 1200 is not intended to suggest any limitation as to scope of use or functionality, as the innovations may be implemented in diverse general-purpose or special-purpose computing systems.
- a communication device as described herein can take the form of the described computing system 1200.
- the computing system 1200 includes one or more processing units 1210, 1215 and memory 1220, 1225.
- the processing units 1210, 1215 execute computer-executable instructions.
- a processing unit can be a general-purpose central processing unit (CPU), processor in an application-specific integrated circuit (ASIC) or any other type of processor.
- FIG. 12 shows a central processing unit 1210 as well as a graphics processing unit or co-processing unit 1215.
- the tangible memory 1220, 1225 may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory, etc.), or some combination of the two, accessible by the processing unit(s).
- the memory 1220, 1225 stores software 1280 implementing one or more innovations described herein, in the form of computer- executable instructions suitable for execution by the processing unit(s).
- a computing system may have additional features.
- the computing system 1200 includes storage 1240, one or more input devices 1250, one or more output devices 1260, and one or more communication connections 1270.
- An interconnection mechanism (not shown) such as a bus, controller, or network interconnects the components of the computing system 1200.
- operating system software provides an operating environment for other software executing in the computing system 1200, and coordinates activities of the components of the computing system 1200.
- the tangible storage 1240 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, DVDs, or any other medium which can be used to store information in a non-transitory way and which can be accessed within the computing system 1200.
- the storage 1240 stores instructions for the software 1280 implementing one or more innovations described herein.
- the input device(s) 1250 may be a touch input device such as a keyboard, mouse, pen, or trackball, a voice input device, a scanning device, or another device that provides input to the computing system 1200.
- the input device(s) 1250 may be a camera, video card, TV tuner card, or similar device that accepts video input in analog or digital form, or a CD-ROM or CD-RW that reads video samples into the computing system 1200.
- the output device(s) 1260 may be a display, printer, speaker, CD-writer, or another device that provides output from the computing system 1200.
- the communication connection(s) 1270 enable communication over a communication medium to another computing entity.
- the communication medium conveys information such as computer-executable instructions, audio or video input or output, or other data in a modulated data signal.
- a modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
- communication media can use an electrical, optical, RF, or other carrier.
- Computer-readable media are any available tangible media that can be accessed within a computing environment.
- computer-readable media include memory 1220, 1225, storage 1240, and combinations of any of the above.
- program modules include routines, programs, libraries, objects, classes, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
- the functionality of the program modules may be combined or split between program modules as desired in various embodiments.
- Computer-executable instructions for program modules may be executed within a local or distributed computing system.
- system and “device” are used interchangeably herein. Unless the context clearly indicates otherwise, neither term implies any limitation on a type of computing system or computing device. In general, a computing system or computing device can be local or distributed, and can include any combination of special-purpose hardware and/or general-purpose hardware with software implementing the functionality described herein.
- the cloud 1310 provides services for connected devices
- Connected device 1330 represents a device with a computer screen 1335 (e.g., a mid-size screen).
- connected device 1330 could be a personal computer such as desktop computer, laptop, notebook, netbook, or the like.
- Connected device 1340 represents a device with a mobile device screen 1345 (e.g., a small size screen).
- connected device 1340 could be a mobile phone, smart phone, personal digital assistant, tablet computer, and the like.
- Connected device 1350 represents a device with a large screen 1355.
- connected device 1350 could be a television screen (e.g., a smart television) or another device connected to a television (e.g., a set-top box or gaming console) or the like.
- One or more of the connected devices 1330, 1340, 1350 can include touch screen capabilities. Touchscreens can accept input in different ways.
- capacitive touchscreens detect touch input when an object (e.g., a fingertip or stylus) distorts or interrupts an electrical current running across the surface.
- touchscreens can use optical sensors to detect touch input when beams from the optical sensors are interrupted. Physical contact with the surface of the screen is not necessary for input to be detected by some touchscreens.
- Devices without screen capabilities also can be used in example environment 1300.
- the cloud 1310 can provide services for one or more computers (e.g., server computers) without displays.
- Services can be provided by the cloud 1310 through service providers 1320, or through other providers of online services (not depicted).
- cloud services can be customized to the screen size, display capability, and/or touch screen capability of a particular connected device (e.g., connected devices 1330, 1340, 1350).
- the cloud 1310 provides the technologies and solutions described herein to the various connected devices 1330, 1340, 1350 using, at least in part, the service providers 1320.
- the service providers 1320 can provide a centralized solution for various cloud-based services.
- the service providers 1320 can manage service subscriptions for users and/or devices (e.g., for the connected devices 1330, 1340, 1350 and/or their respective users).
- Any of the disclosed methods can be implemented as computer-executable instructions stored on one or more computer-readable storage media (e.g., non-transitory computer-readable media, such as one or more optical media discs, volatile memory components (such as DRAM or SRAM), or nonvolatile memory components (such as hard drives)) and executed on a computer (e.g., any commercially available computer, including smart phones or other mobile devices that include computing hardware).
- Any of the computer-executable instructions for implementing the disclosed techniques as well as any data created and used during implementation of the disclosed embodiments can be stored on one or more computer-readable media (e.g., non-transitory computer-readable media).
- the computer-executable instructions can be part of, for example, a dedicated software application or a software application that is accessed or downloaded via a web browser or other software application (such as a remote computing application).
- Such software can be executed, for example, on a single local computer (e.g., any suitable commercially available computer) or in a network environment (e.g., via the Internet, a wide-area network, a local-area network, a client-server network (such as a cloud computing network), or other such network) using one or more network computers.
- any of the software-based embodiments can be uploaded, downloaded, or remotely accessed through a suitable communication means.
- suitable communication means include, for example, the Internet, the World Wide Web, an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.
- Any of the computer-readable media herein can be non-transitory (e.g., memory, magnetic storage, optical storage, or the like).
- Any of the storing actions described herein can be implemented by storing in one or more computer-readable media (e.g., computer-readable storage media or other tangible media).
- Any of the things described as stored can be stored in one or more computer-readable media (e.g., computer-readable storage media or other tangible media).
- Any of the methods described herein can be implemented by computer-executable instructions in (e.g., encoded on) one or more computer-readable media (e.g., computer-readable storage media or other tangible media). Such instructions can cause a computer to perform the method.
- the technologies described herein can be implemented in a variety of programming languages.
- Any of the methods described herein can be implemented by computer-executable instructions stored in one or more computer-readable storage devices (e.g., memory, magnetic storage, optical storage, or the like). Such instructions can cause a computer to perform the method.
- ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Research 2008, 18(5):810-820.
- Narzisi G, Mishra B: Comparing De Novo Genome Assembly: The Long and Short of It. PLoS ONE 2011, 6(4):e19175.
Description
ITERATIVE GENOME ASSEMBLER
CROSS REFERENCE TO RELATED APPLICATION
This application claims the benefit of U.S. Provisional Application No. 61/683,358 filed August 15, 2012, herein incorporated by reference.
FIELD
The field relates to de novo genome assembly of short reads.
ACKNOWLEDGMENT OF GOVERNMENT SUPPORT
This invention was made in part with government support under contract number
NIH/NHGRI Grant 1R21HG006464 awarded by the National Institutes of Health, and under NSF Career Award #1054309 awarded by the National Science Foundation. The government has certain rights in the invention.
BACKGROUND
Among scientific disciplines, genomics has one of the fastest growing bodies of data today. This is largely due to the recent advances in next-generation sequencing (NGS) technologies, which have tremendously reduced DNA sequencing costs. This massive amount of sequencing data has provided the basis to better understand the tree of life and to identify molecular signatures of human variation and disease mechanisms. To make such analyses possible, the key computational task is to de novo assemble raw reads from NGS technologies into complete or near-complete genomes. However, the enormous amount of data creates an inevitable barrier to the assembly process in terms of memory usage. In addition, the lower quality and limited read length produced by NGS, as compared to the traditional Sanger sequencing, make it extremely difficult to assemble reads into long scaffolds, which are essential to facilitate the analyses of large-scale genome rearrangements.
Most of the modern NGS-based de novo genome assemblers adopt the de Bruijn Graph (DBG) data structure to handle extremely high coverage data [1-3]. Several assemblers have specifically been developed with some success to assemble large genomes. In SOAPdenovo [4] and ALLPATHS-LG [5], a DBG was constructed in a large shared memory and the assembly process was done in parallel within multiple threads. However, all of them required hundreds of gigabytes (GB) of memory to assemble large genomes, such as those from human and other mammalian species.
To tackle this problem, ABySS [6], YAGA [7] and PASHA [8] developed distributed DBG algorithms that split the DBG and parallelize the assembly process on a cluster on the basis of message passing interface (MPI). However, this imposed considerable communication among servers because many adjacent vertices in the DBG could be located on different servers. The amount of communication among servers increases nonlinearly when the number of servers increases, causing scalability issues.
Some assemblers made modifications to DBG in order to reduce memory usage.
Goassembler [9-10] used a compressed bitmap representation of the DBG, resulting in a memory usage that could be close to the theoretical minimum value of a DBG. Cortex [11] utilized colored DBG to detect the variations among 10 human genomes with less than 256 GB of memory.
- SparseAssembler2 [12] reduced memory usage dramatically by storing only a small fraction of k-mers. SGA [13] used a compressed index of reads and could assemble a human genome under 60 GB of memory.
- Despite these developments, the memory usage of these tools is still too large for current commodity multi-core systems, limiting the scope of de novo assembly for large genomes to research groups that own large computer clusters. There are other deficiencies in current tools under various scenarios (such as the inability to generate long, high-quality contigs for large sequenced genomes containing repeats). Therefore, a new computational framework for scalable de novo genome assembly is needed.
SUMMARY
- Provided herein are methods implemented at least in part by a computing system, which can for example be used for de novo genome assembly. In some examples the methods include (a) receiving a plurality of input reads representing nucleotide sequences of DNA or RNA fragments; (b) partitioning the input reads into respective read tiles; (c) assembling sets of contigs for respective of the read tiles; (d) clustering the contigs into contig clusters, wherein the clustering is based on overlap between the contigs; (e) clustering the input reads into read clusters, wherein the clustering is based on similarity between the input reads and the contig clusters; and (f) assembling sets of contigs for respective of the read clusters. In some examples, (d)-(f) are iteratively repeated until an end condition is met.
In some examples, the method includes (a) receiving a plurality of input reads representing nucleotide sequences of DNA or RNA fragments; (b) partitioning the input reads into respective read tiles; (c) coordinating assembly of the read tiles into respective sets of contigs; (d) clustering the contigs into contig clusters, wherein the clustering is based on overlap between the contigs; (e)
coordinating clustering the input reads into read clusters, wherein the clustering is based on similarity between the input reads and the contig clusters; and (f) assembling sets of contigs for respective of the read clusters.
In some examples, the method includes receiving a plurality of contigs assembled from a plurality of input reads representing nucleotide sequences of DNA or RNA fragments; clustering the contigs into contig clusters, wherein the clustering is based on overlap between the contigs; clustering the input reads into read clusters, wherein the clustering is based on similarity between the input reads and the contig clusters; and assembling sets of contigs for respective of the read clusters.
Also provided are one or more computer-readable storage media that include computer-executable instructions causing a computer to perform the methods provided herein.
Also provided are systems, which can include a plurality of stored input reads representing sequencing reads generated from a sequencing method applied to a sample; a partitioner configured to partition the stored input reads into a plurality of read tiles; an assembler configured to assemble input reads in a given read tile into a set of contigs; a contig cluster tool configured to organize the contigs into a graph and partition the graph into contig clusters; and an alignment tool configured to accept the contig clusters and the input reads and output read clusters partitioning the input reads according to alignment against the contig clusters; wherein the assembler is further configured to assemble input reads in a given read cluster into a new set of contigs.
The foregoing and other objects and features of the disclosure will become more apparent from the following detailed description, which proceeds with reference to the accompanying figures.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram of an exemplary system implementing iterative genome assembly.
Figure 2 is a flowchart of an exemplary method implementing iterative genome assembly.
Figure 3 is a block diagram of an exemplary system clustering contigs based on overlap.
Figure 4 is a flowchart of an exemplary method of clustering contigs based on overlap.
Figure 5 is a block diagram of an exemplary graph representing contig overlap.
Figure 6 is a flowchart of an exemplary method of clustering contigs via a word map.
Figure 7 is a block diagram of an exemplary system performing read tile assembly into contigs in parallel on different machines.
Figure 8 is a block diagram of an exemplary system performing read clustering based on contig clusters in parallel on different machines.
Figure 9 is a schematic view of an iterative framework for genome assembly.
Figure 10 is a schematic view of exemplary contig clustering.
Figures 11A-B are bar graphs showing a Tiger-Velvet runtime comparison using data from human chromosome 14.
Figure 12 is a diagram of an exemplary computing system in which described embodiments can be implemented.
Figure 13 is a diagram of an exemplary cloud-support environment that can be used in conjunction with the technologies described herein.
DETAILED DESCRIPTION
The following explanations of terms and methods are provided to better describe the present disclosure and to guide those of ordinary skill in the art in the practice of the present disclosure. The singular forms "a," "an," and "the" refer to one or more than one, unless the context clearly dictates otherwise. For example, the term "comprising a cell" includes single or plural cells and is considered equivalent to the phrase "comprising at least one cell." The term "or" refers to a single element of stated alternative elements or a combination of two or more elements, unless the context clearly indicates otherwise. As used herein, "comprises" means "includes." Thus, "comprising A or B," means "including A, B, or A and B," without excluding additional elements. All references cited herein are incorporated by reference.
Whenever a range is given in the specification, for example, a temperature range, a time range, a size range, a length range, a cardinality range, or a composition or concentration range, all intermediate ranges and subranges, as well as all individual values included in the ranges given are included in the disclosure. The specific embodiments provided herein are examples of useful embodiments of the present disclosure and it will be apparent to one skilled in the art that the methods and systems provided herein can be carried out using a large number of variations of the devices, device components, software components, methods, and steps set forth in the present description. Methods and devices useful for the present disclosure can include a large number of optional composition and processing elements and steps.
Unless explained otherwise, all technical and scientific terms used herein have the same meaning as commonly understood to one of ordinary skill in the art to which this disclosure belongs. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present disclosure, suitable methods and materials are
described below. The materials, methods, and examples are illustrative only and not intended to be limiting.
In order to facilitate review of the various embodiments of the disclosure, the following explanations of specific terms are provided:
Target nucleic acid sequence or molecule: A defined region or particular portion of a nucleic acid molecule, for example, a portion of a genome (such as a chromosome of mammalian or plant genomic DNA of interest). In an example where the target nucleic acid sequence is a target genomic sequence, such a target can be defined by its position on a chromosome (e.g., in a normal cell), for example, according to cytogenetic nomenclature by reference to a particular location on a chromosome; by reference to its location on a genetic map; by reference to a hypothetical or assembled contig; by its specific sequence or function; by its gene or protein name; or by any other means that uniquely identifies it from among other genetic sequences of a genome. In some examples, the target nucleic acid sequence is a genomic sequence (such as a mammalian genomic sequence, for example human genomic sequence, a plant genomic sequence, a bacterial genomic sequence, a viral genomic sequence, or a fungal genomic sequence).
In some examples, alterations of a target nucleic acid sequence (e.g., genomic nucleic acid sequence) are "associated with" a disease or condition. That is, detection of the target nucleic acid sequence can be used to infer the status of a sample with respect to the disease or condition. For example, the target nucleic acid sequence can exist in two (or more) distinguishable forms, such that a first form correlates with absence of a disease or condition and a second (or different) form correlates with the presence of the disease or condition. The two different forms can be
qualitatively distinguishable, such as by polynucleotide polymorphisms, and/or the two different forms can be quantitatively distinguishable, such as by the number of copies of the target nucleic acid sequence that are present in a cell.
Overview
With the cost reduction of next-generation sequencing (NGS) technologies, genomics provides an unprecedented opportunity to understand fundamental questions in biology and elucidate human diseases. One of the most important steps in reconstructing the sequenced genome is de novo genome assembly. For example, a typical chromosome can include millions of nucleotides, and a genome can include billions of nucleotides. Current sequencing methods divide these very long sequences into shorter sequences, for example, by cutting the nucleic acid molecule into shorter pieces. However, it can be difficult to efficiently put the chromosome or genome back together after it has been split up (referred to as de novo genome assembly). In addition, most de novo assemblers require enormous amounts of computational resources, which are not accessible to most research groups and medical personnel. Provided herein is a novel de novo assembly framework, including a software program referred to herein as Tiger (Tiled Iterative GEnome assembleR), which adapts to available computing resources by iteratively decomposing the assembly problem into sub-problems. The disclosed methods are also flexible enough to embed different assemblers for various types of target genomes.
State-of-the-art assemblers that can achieve relatively high assembly quality need excessive amounts of computing resources (e.g., memory) that are not available to most researchers. The methods and systems disclosed herein provide the only known viable path to utilize NGS de novo assemblers that require more memory than what is present in available commodity computers. The data provided herein demonstrate the feasibility of obtaining better quality results with a low memory footprint, along with the scalability and flexibility of using distributed commodity computers.
It is shown herein that the NGS genome assembly problem, in terms of memory usage and scalability, can be solved if the large computational tasks are decomposed into modest-sized independent sub-problems, which then fit into smaller memories and can be solved in parallel. This can effectively move large-scale de novo assembly tasks onto commodity PC networks. In addition, this new approach provides improved assembly quality compared to the current state-of-the-art assemblers. A highly effective framework for decomposing the problem of genome assembly from NGS reads is provided. The decomposed sub-problems can be solved in a sequential manner using significantly less memory, or solved simultaneously if more computing nodes are available.
Besides the limitation on computing resources, others have compared NGS de novo assemblers [14-20] and acknowledged that no assembler is the best across all applications and datasets. To deal with this issue effectively, the disclosed framework is designed in such a way that it can seamlessly embed different assemblers into the framework to take advantage of unique strengths of each assembler. No existing assemblers are known to do this. These embedded assemblers work on decomposed sub-problems mentioned above efficiently. Through an iterative improvement approach facilitated by this framework, higher assembly quality was achieved as compared to the original assemblers themselves.
Further, current technologies lack an ability to generate long, high-quality contigs for large sequenced genomes with repeats.
Thus, provided herein is a novel method and system for sequence assembly, which provides comparable or better quality results while adapting to available memory resources. The method
partitions and clusters reads so that the memory usage for assembly is inversely proportional to the number of partitions. This fundamentally resolves the issue of high memory requirements in sequence assembly and can accelerate the research progress of this field. The disclosed methods and systems can leverage existing well-built assemblers and iteratively improve the results. The disclosed methods and systems can also start from an already assembled result and further improve it, using the original read set. In situations where traditional assemblers require more memory than is present in available computers, the disclosed methods and systems provide the only known viable path. The results provided herein demonstrate the feasibility of getting better quality results with a low memory footprint and the scalability of using distributed commodity computers.
As described herein, iterative construction of partitioned, more focused input tiles can enable assemblers to achieve better results for large sequenced genomes with repeats.
The disclosed methods and systems are useful in a wide variety of technology areas. For example, the disclosed de novo genome assembly methods and systems can be used in cancer research. In some examples, the target to be assembled is from a particular cancer, such as from a cancer from a specific patient. Such sequencing will permit identification of the specific and unique changes present in the tumor from the patient. Based on these changes a personalized therapeutic strategy can be undertaken. In some examples, the target to be assembled is from a particular cancer, for example to identify new mutations associated with that cancer.
Thus, such methods and systems are useful for personalized medicine. In addition to cancer, genome sequencing and assembly using the disclosed methods and systems permits personalized treatment for a variety of diseases, and permits identification of new classes of therapeutics, for example, to eliminate side effects of current therapeutics, identify correct patient dosage, and improve diagnostic tests.
The disclosed methods and systems are useful for understanding the diversity and importance in transfer of genetic material among different organisms, such as bacterial species. The disclosed methods and systems are useful for identifying biomarkers of plant species under critical conditions such as temperature, disease and drought. It can also be used to study factors affecting growth, for example, in identifying essential functional genomic mechanisms of synthesis of plant material grown under stress.
One skilled in the art will appreciate that the disclosed methods and systems are useful for other purposes as well.
Example 1 - Exemplary System Implementing Iterative Genome Assembly
FIG. 1 is a block diagram of an exemplary system 100 implementing iterative genome assembly as described herein.
In the example, a partitioner 115 accepts stored input reads 110 as input and outputs a plurality of read tiles 120. The stored input reads 110 can represent sequencing reads generated from a sequencing method applied to a sample as described herein.
As described herein, the partitioner 115 can be configured to partition the stored input reads into a plurality of different read tiles. Various techniques can be used, such as randomly partitioning the stored input reads. For example, a round-robin approach can be used where a first stored input read is placed into a first read tile, a second stored input read is placed into a second read tile, etc. The number of read tiles can be varied as described herein.
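The round-robin scheme described above can be sketched in a few lines (a hedged illustration; the function name and the list-of-lists tile representation are ours, not the partitioner 115's actual interface):

```python
def partition_reads(reads, num_tiles):
    """Round-robin partition: read i goes to tile i mod num_tiles,
    so the tiles end up substantially equally sized."""
    tiles = [[] for _ in range(num_tiles)]
    for i, read in enumerate(reads):
        tiles[i % num_tiles].append(read)
    return tiles

# e.g. with 2 tiles, reads 0, 2, 4 land in the first tile
# and reads 1, 3 in the second.
```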
The assembler 125 can be configured to assemble input reads in a given read tile into a set of contigs (e.g., for the given read tile). Thus, given r read tiles 120, r contig sets 130 can be generated (e.g., one contig set per read tile). As described herein, assembly can be distributed among a plurality of different machines.
A contig cluster tool 140 can be configured to organize the contigs (e.g., the collective contigs from the different contig sets 130) from the contig sets 130 into a graph 145 and partition the graph into contig clusters 150 (e.g., based on overlap as described herein). As described herein, processing for the contig cluster tool 140 can be distributed among a plurality of different machines. As iterations progress, the contig clusters 150 can become well-clustered contigs.
An alignment tool 160 can accept the contig clusters 150 and the input reads 110 and output read clusters 165. As described herein, processing for the alignment tool 160 can be distributed among a plurality of different machines. As iterations progress, the read clusters 165 can become well-clustered read clusters.
The assembler 125 is further configured to assemble input reads in a given read cluster 165 into a new set of contigs (e.g., like those in 130, but improving with successive iterations). As described herein, iterations can stop when there ceases to be observed improvement.
In practice, the systems shown herein, such as system 100, can be more complicated, with additional functionality, tools, and the like.
The system 100 and any of the other systems described herein can be implemented in conjunction with any of the hardware components described herein, such as the computing systems or mobile devices described below (e.g., comprising one or more processors, memory, and the like). In any of the examples herein, the inputs, input reads, contigs, outputs, graphs, and tools can be stored in one or more computer-readable storage media or computer-readable storage devices.
The technologies described herein can be generic to the specifics of operating systems or hardware and can be applied in any variety of environments to take advantage of the described features.
Example 2 - Exemplary Method Implementing Iterative Genome Assembly
FIG. 2 is a flowchart of an exemplary method 200 of implementing iterative genome assembly and can be implemented, for example, in the system shown in FIG. 1.
At 210, a plurality of input reads are received.
At 220, the input reads are initially partitioned into respective read tiles. The result is a set of read tiles with each given read tile comprising a subset of the input reads. As described herein, random (e.g., round robin or the like) partitioning can be supported. The input reads can thus be substantially equally divided among the read tiles. As described herein, the number of read tiles can vary. The number of read tiles can also vary from iteration to iteration.
At 230, sets of contigs are assembled for respective of the read tiles. Thus, the input reads for each read tile are assembled into a respective set of contigs. For example, a different contig set can be assembled for each read tile. Assembly thus can take place for input reads within each read tile. The input reads in a given read tile can be submitted as input to an assembler, and the output is a set of contigs assembled for the read tile. As described herein, different assemblers can be used. Also, as described herein, a read tile can be treated as a complete set of inputs that can be processed separately (e.g., in parallel, sequentially, on a different machine, or the like) from another read tile. Thus, assembly of the read tiles can be coordinated (e.g., scheduled for execution on another machine or system) rather than explicitly performed.
Guidance can be provided as input to an assembler in the form of a specified k-mer size. Responsive to the input, the assembler can attempt to assemble contigs using the k-mer size. An automatic k-mer size can be chosen as described herein. Other guidance can be provided based on characteristics of the input reads (e.g., length, coverage, errors, and the like) if known.
As iterations progress, it is expected that the overall number of contigs will decrease.
At 240, the resulting contigs are (e.g., collectively) clustered into contig clusters. Such clustering can be based on overlap between the contigs as described herein. As described herein, a graph of the contigs can be built to model overlap within the contigs, and the graph can be partitioned into clusters of contigs. The sets of contigs can be combined (e.g., into a collective set for inclusion in the graph) before they are clustered. Scaffolding can also be performed.
At 250, the input reads can be clustered into read clusters, based on similarity between the input reads and the contig clusters. For example, the percentage of base-pairs (or nucleotides) that are in a common pattern between the two (e.g., an input read and a contig in a cluster) can be
determined via an alignment tool. As described herein, an input read can be assigned to more than one read cluster. As described herein, clustering can be performed by distributed processing. Thus, clustering can be coordinated (e.g., scheduled for execution on another machine or system) rather than explicitly performed.
At 260, sets of contigs for respective of the read clusters can be assembled. The read clusters can then be processed as read tiles. For example, similar to the assembly of 230, a different contig set can be assembled for each read cluster. The input reads for a given read cluster are thus assembled into a set of contigs.
240 - 260 can be iteratively repeated until an end condition is met.
In practice, 210-230 can be performed as preprocessing (e.g., by another technology), and the method 200 can comprise 240 - 260.
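The overall flow of method 200 can be sketched as a driver loop (a hedged illustration: `assemble`, `cluster_contigs`, and `cluster_reads` are placeholder callables standing in for stages 230-250, and mean contig length serves here as the quality metric for the end condition; an actual implementation could use N50 or another length metric):

```python
def mean_len(contigs):
    """Mean contig length, used here as a simple quality metric."""
    return sum(len(c) for c in contigs) / len(contigs)

def iterative_assembly(reads, assemble, cluster_contigs, cluster_reads,
                       num_tiles=2, max_iters=10, min_gain=0.01):
    """Driver loop for 220-260: partition, assemble per tile, then
    iterate clustering and reassembly until quality stops improving
    by at least `min_gain` (the end condition)."""
    # 220: round-robin partition into read tiles.
    tiles = [reads[i::num_tiles] for i in range(num_tiles)]
    # 230: assemble each tile, then combine the contig sets.
    contigs = [c for tile in tiles for c in assemble(tile)]
    best = mean_len(contigs)
    for _ in range(max_iters):
        contig_clusters = cluster_contigs(contigs)             # 240
        read_clusters = cluster_reads(reads, contig_clusters)  # 250
        contigs = [c for rc in read_clusters for c in assemble(rc)]  # 260
        score = mean_len(contigs)
        if score < best * (1 + min_gain):
            break  # end condition: no (sufficient) improvement
        best = score
    return contigs
```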
The method 200 and any of the other methods described herein can be performed by computer-executable instructions (e.g., causing a computing system to perform the method) stored in one or more computer-readable media (e.g., storage or other tangible media) or stored in one or more computer-readable storage devices.
Example 3 - Exemplary Input Reads
In any of the examples herein, input reads can represent nucleotide sequences of DNA or
RNA fragments as described herein. The input reads can be a result of sequencing reads generated from a sequencing method applied to a sample. Such input reads can correspond to subsequences of an entire genome and can range up to the billions of reads. As described herein, the lengths of the input reads can vary.
Although some examples use reads taken directly from a sequencing method, some preassembly can be performed. For example, input reads can take the form of sequences from publicly available databases as described herein, assembled contigs, assembled scaffolds, or the like. Thus, the input to the technologies described herein is not necessarily limited to raw reads from a sequencer.
In practice, an input read takes the form of a digital representation of a string of nucleotides (e.g., DNA or RNA fragments) as sequenced by the sequencing method or assembled by any of a variety of arbitrary assembly methods.
Example 4 - Exemplary Read Tiles
In any of the examples herein, a read tile can take the form of a subset of input reads. In practice, the input reads are partitioned into a plurality of read tiles as described herein. The term
"read cluster" is synonymous with "read tile" in that it is a subset of input reads. The term "read cluster" is typically used for a set of read tiles that have been partitioned based on contig clusters as described herein.
As iterations progress, the read tiles can become well-clustered read clusters in that they can be better clustered according to where in the genome they appear, even if their location need not be explicitly known or represented.
Example 5 - Exemplary Contigs
In any of the examples herein, a contig can take the form of a digital representation of a string of nucleotides (e.g., DNA or RNA) as assembled from the input reads. In practice, the contigs will be of larger size than the input reads and will eventually become of such a size that they are useful for a wide variety of purposes as described herein.
Example 6 - Exemplary Contig Clusters
In any of the examples herein, a contig cluster can take the form of a subset of contigs that have been determined to have a degree of overlap as described herein. As iterations progress, the contig clusters are said to be "well-clustered" because the contigs contribute to a continuous region of the target genome.
Example 7 - Exemplary Automatic k-mer Choice
In any of the examples herein, assembly of some input reads can be attempted with at least one of the assemblers via a plurality of different k-mer sizes. Performance of assembly under the different k-mer sizes can be evaluated. A superior k-mer size can then be chosen for use across the reads based on the evaluation. Finer levels of granularity can be supported (e.g., different k-mer sizes for different read tiles).
Performance of assembly can be based at least on the number of used reads in the assembly, length of the contigs (e.g., N50 metrics, total length, or the like), errors (e.g., with reference to a reference genome, if any) or the like.
The number of used reads in the assembly need not be evaluated on a higher-the-better basis. As long as the number of used reads is over a threshold, the criterion can be considered fulfilled, and the assembly result may be rated as the best.
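The threshold-then-quality selection described above can be sketched as follows (a hedged illustration; `assemble_with_k` is a placeholder assumed to return a tuple of contigs and the number of reads used, and longest-contig length stands in for whichever length metric is preferred, such as N50 or total length):

```python
def choose_k(reads, assemble_with_k, k_values, min_used_frac=0.8):
    """Trial-assemble with each candidate k-mer size and keep the best.

    The used-read count is a pass/fail threshold, as described above;
    among passing k values, the one yielding the longest contig wins.
    """
    best_k, best_score = None, -1
    for k in k_values:
        contigs, used = assemble_with_k(reads, k)
        if used < min_used_frac * len(reads):
            continue  # too few reads used; criterion not fulfilled
        score = max((len(c) for c in contigs), default=0)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```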
Example 8 - Exemplary End Condition
In any of the examples herein, iteration can continue until an end condition is met. A variety of end conditions can be used. For example, a lack of improvement (e.g., or threshold amount of improvement) in the sets of assembled contigs can indicate an end condition.
Improvement can be measured in terms of quality improvement. For example, improvement can be measured in terms of the length of the contigs, or the like. For example, the distribution of contig length can be measured and compared against a threshold.
An end condition can also be stated in terms of average, median, or mode contig length; minimum contig length; or the like. Metrics such as N50 can be used.
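For reference, the N50 metric mentioned above can be computed with a short sketch: sort contig lengths in descending order and return the length at which the running total first covers half of the assembled bases.

```python
def n50(contigs):
    """N50: the contig length L such that contigs of length >= L
    together cover at least half of the total assembled bases."""
    lengths = sorted((len(c) for c in contigs), reverse=True)
    half = sum(lengths) / 2.0
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0

# e.g. lengths 100, 50, 30, 20: total 200, half 100 -> N50 = 100
```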
Example 9 - Exemplary System Clustering Contigs Based on Overlap
FIG. 3 is a block diagram of an exemplary system 300 clustering contigs based on overlap. In the example, combined contigs 330 are accepted by a graph builder 340, which outputs a contig connectivity graph 345 based on overlap in the combined contigs 330.
As described herein, the combined contigs 330 can be the collective contigs in the contig sets (e.g., 130 of FIG. 1) generated from assembly of read tiles or read clusters (e.g., the read tiles 120 or the read clusters 165 of FIG. 1).
The graph partitioner 347 accepts the contig connectivity graph 345 as input and outputs a plurality of contig clusters 350A-N, which are clustered based on overlap as represented in the graph 345. The clusters 350A-N can be used as contig clusters for the technologies described herein (e.g., the contig clusters 150 of FIG. 1).
In practice, the graph partitioner 347 can be provided with a target number of contig clusters 350A-N and divide the graph 345 accordingly. The number of clusters can change (e.g., increase, decrease) with subsequent iterations.
The target number of contig clusters can be chosen in a variety of ways, for example, based on the target genome length (if known in advance), the available memory size on the machines doing the processing, runtime limits (e.g., more clusters typically result in more runtime and better results), similarity to previous experience, or the like.
Example 10 - Exemplary Method Clustering Contigs Based on Overlap
FIG. 4 is a flowchart of an exemplary method 400 of clustering contigs based on overlap and can be implemented, for example, in the system shown in FIG. 3. The method 400 can also be
used to implement contig clustering 240 shown in FIG. 2. Although not explicitly shown, the method 400 can include initially receiving collective contigs.
At 420, a contig connectivity graph is built based on overlap between the combined contigs. Contigs can be represented as respective nodes in the graph. Edges are based on (e.g., placed based on, weighted based on, or both) detected overlap between the contigs. As described herein, clustering based on overlap can result in grouping the contigs based on how likely the contigs or portions thereof will be next to each other in the genome.
At 430, the graph is partitioned into contig clusters. Because nodes can represent respective contigs, partitioning the graph effectively partitions the corresponding contigs. Partitioning of the graph can be accomplished using a graph partitioning tool. For example, vertex weights, edges, or both can be considered when partitioning the graph into sub-graphs based on connectivity. A variety of partitioning techniques, such as circuit or graph partitioning, can be used. A goal can be to partition the graph into parts with the fewest edges between groups while keeping group sizes substantially balanced.
In practice, vertex weights can be given less importance than edge weights to result in more edge-oriented subgraphs. However, vertex weights can still be considered to address situations where many short contigs have little connectivity. Such short contigs can thus be distributed evenly among the clusters to reduce their influence on the rest of the clustered contigs as the technique progresses. Because overlapping contigs are clustered together, they can subsequently be rebuilt as one long complete contig (e.g., when sets of contigs are subsequently assembled from read clusters based on the contig clusters).
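A minimal sketch of this clustering step, assuming exact end-overlap detection and using connected components as a toy stand-in for a balanced graph partitioner (a real implementation would also weight vertices by contig length and balance cluster sizes; all function names here are illustrative):

```python
def overlap_len(a, b, min_olap=3):
    """Length of the longest suffix of contig a matching a prefix of
    contig b (at least min_olap base pairs), else 0."""
    for n in range(min(len(a), len(b)), min_olap - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def build_contig_graph(contigs, min_olap=3):
    """Edges keyed by frozenset of contig indices; the weight is the
    largest detected end-overlap between the two contigs."""
    edges = {}
    for i, a in enumerate(contigs):
        for j, b in enumerate(contigs):
            if i == j:
                continue
            w = overlap_len(a, b, min_olap)
            if w:
                key = frozenset((i, j))
                edges[key] = max(w, edges.get(key, 0))
    return edges

def connected_clusters(n, edges):
    """Union-find over the overlap edges; each component is a cluster."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for e in edges:
        i, j = tuple(e)
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```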
Subsequently, at 440, the contig clusters can be used to cluster the input reads as described herein.
Example 11 - Exemplary Graph Representing Contig Overlap
FIG. 5 is a block diagram of an exemplary graph 500 representing contig overlap. For ease of description, the example shows three nodes 510A-C. In practice, the graph 500 will have many more nodes and edges.
In the example, each of the nodes 510A, 510B, 510C represents a respective contig. Nodes can be weighted based on the length of the contig represented (e.g., in mers or the like).
The edges 520A, 520B can be placed based on detected overlap between nodes. The edges 520A, 520B can further be weighted based on the degree of overlap between a pair of contigs represented by respective nodes.
So, for example, the weight of the edge 520A can indicate the degree of overlap between a contig represented by the node 510A and a contig represented by the node 510B. Thus, edges in the graph represent overlap between contigs connected by the edges. For example, as described herein, a number of base pairs (e.g., nucleotides) in common between contigs connected by an edge can represent overlap, and the number of base pairs indicates the edge weight between them.
Instead of base pairs, words can be used as described herein. Other factors that can be incorporated for consideration during partitioning include the number of other nodes each node connects to.
In practice, the graph 500 can be represented in a variety of ways. Data structures representing the contigs need not be stored in the nodes themselves. For example, a reference to the contig data can be stored.
Other information (e.g., a confidence score or the like) can be included to assist in the partition process.
Because the graph 500 represents contigs and connects the contigs according to overlap, it is sometimes called a "contig connectivity graph." In practice, connectivity of the graph can vary. For example, there may be nodes that have no edges or the like.
Example 12 - Exemplary Method Clustering Contigs Using Word Map
FIG. 6 is a flowchart of an exemplary method 600 of clustering contigs via a word map and can be implemented, for example, using the graph shown in FIG. 5. The method 600 can be used in place of or to supplement the method 400 of FIG. 4. Although not explicitly shown, the method 600 can include initially receiving collective contigs.
At 620, words are extracted from the contigs. The words can be consecutive base pair sequences (e.g., of equal length, substantially equal length, or the like) from the contigs.
At 630, a word map is built from the words and the contigs. The map can map the words to the contigs containing them. Building the word map can be accomplished by distributing processing among machines (e.g., with machines receiving respective sets of contigs and determining which of the contigs have which words in them).
At 640, a graph is built based on the word map. As described herein, edge weights can be placed and set to the degree of overlap (e.g., number of base pairs in common) between contigs represented by vertices.
The graph can then be partitioned at 650. A graph partitioning tool suitable for processing millions of vertices (e.g., METIS or the like) can be used.
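A minimal sketch of the word-map construction (630) and the derived edge weights (640), assuming fixed-length words and shared-word counts as weights; the word length and helper names here are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def build_word_map(contigs, word_len=4):
    """Map each length-`word_len` word to the indices of the contigs
    containing it (step 630)."""
    word_map = defaultdict(set)
    for idx, contig in enumerate(contigs):
        for pos in range(len(contig) - word_len + 1):
            word_map[contig[pos:pos + word_len]].add(idx)
    return word_map

def edge_weights_from_words(word_map):
    """Edge weight between two contigs = number of distinct words they
    share (step 640); pairs sharing no words get no edge."""
    weights = defaultdict(int)
    for contig_ids in word_map.values():
        for pair in combinations(sorted(contig_ids), 2):
            weights[pair] += 1
    return dict(weights)
```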
Example 13 - Exemplary Overlap
In any of the examples herein, contigs can be clustered based on overlap. As described herein, a graph can be built to model overlap among the contigs and then partitioned into subgraphs. Nodes in different subgraphs can be considered as being in different respective contig clusters. The graph can model contig connectivity intensity.
In practice, the more base pairs that are in common at the overlapping end of two contigs, the more they overlap, and thus the heavier the weight of the edge that connects them in the graph.
A threshold for determining overlap can be empirically determined. To conserve resources, a higher threshold can be set (e.g., to reduce edges in a large graph). The overlap threshold can be set in terms of base pairs, number of words, or some other metric to measure common regions between two contigs.
Although not necessarily explicitly determined, overlap also tends to indicate conceptual connectedness, such as how likely the contigs are in the same region (e.g., next to each other) of the genome.
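The end-overlap measure described above can be sketched as a simple suffix/prefix comparison. This is an illustrative sketch only; the `min_overlap` threshold is a hypothetical value standing in for the empirically determined threshold:

```python
def end_overlap(a, b, min_overlap=4):
    """Length of the longest suffix of contig `a` that matches a prefix
    of contig `b`, or 0 if shorter than `min_overlap` (an illustrative
    threshold). The returned length could serve as an edge weight."""
    max_len = min(len(a), len(b))
    for n in range(max_len, min_overlap - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0
```

For example, `end_overlap("AAAACGT", "ACGTTTT")` finds the four common base pairs "ACGT", so an edge of weight 4 would connect the two contigs.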
Example 14 - Exemplary Read Clustering
In any of the examples herein, read clustering can be done to cluster the input reads into a plurality of read clusters. Such clustering can be based on the contig clusters that are determined as described herein. For example, an input read can be included in a read cluster associated with a particular contig cluster if sufficient alignment between the input read and a contig in the contig cluster is detected. Alignment between an input read and a contig can be partial (e.g., based on partial words extracted from both).
Alignment can be performed using read alignment software, such as the Bowtie ultrafast, memory-efficient short read aligner developed at the University of Maryland; the MUMmer bioinformatics software system maintained by the Center for Computational Biology at Johns Hopkins University; SOAP Aligner available from BGI; custom aligners; and others.
In conceptual terms, the read clusters re-sort the input reads based on assembly of contigs that was done for the input reads (e.g., as partitioned into read tiles or clusters).
Example 15 - Exemplary System Performing Read Tile Assembly into Contigs in Parallel
FIG. 7 is a block diagram of an exemplary system 700 performing read tile assembly into contigs in parallel on different machines 790A-N.
In any of the examples herein, read cluster (or read tile) assembly can be performed in parallel by distributing different read clusters to different machines for assembly. The read clusters
can be distributed as sub-problems to different machines 790A-N as shown. In the example, the different read tiles (or clusters) 710A, 710B, 710N are accessible to respective of the machines 790A, 790B, 790N.
An assembler 725A, 725B, 725N can execute at each of the machines to assemble the read tiles (or clusters) into respective contigs sets 730A, 730B, 730N. As described herein, the same or different assemblers can be used.
As shown, assembly can be performed as a sub-problem requiring no additional
information regarding other read clusters.
In practice, assembly at one machine may finish before the assembly at another machine. The resulting contig sets 730A-N can be collectively partitioned into contig clusters as described herein.
As a result, in any of the examples herein, contig assembly can be performed by sending sets of read clusters (or tiles) to one or more other machines, and subsequently receiving contig sets assembled from respective of the read clusters (or tiles).
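The distribution of read tiles to independent workers can be sketched as follows. This is an illustrative sketch only: `assemble_tile` is a hypothetical stand-in for invoking an off-the-shelf assembler (e.g., Velvet or SOAPdenovo), and thread workers stand in for the separate machines 790A-N:

```python
from concurrent.futures import ThreadPoolExecutor

def assemble_tile(reads):
    # Stand-in for an off-the-shelf assembler run on one read tile; a
    # real implementation would invoke the assembler as an external
    # process. Here it just deduplicates and sorts the reads.
    return sorted(set(reads))

def assemble_tiles_in_parallel(read_tiles, max_workers=4):
    # Each tile is an independent sub-problem, so the workers need no
    # communication; in the system above each worker would be a machine.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(assemble_tile, read_tiles))
```

Because the sub-problems are independent, results can simply be collected in tile order as each worker finishes.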
Example 16 - Exemplary System Performing Read Clustering in Parallel
FIG. 8 is a block diagram of an exemplary system 800 performing read clustering based on contig clusters in parallel on different machines 890A-N.
In any of the examples herein, read clustering can be performed in parallel by distributing different contig clusters to different machines for alignment against the input reads. The contig clusters can be distributed as sub-problems to different machines 890A-N as shown. In the example, the different contig clusters 850A, 850B, 850N are accessible to respective of the machines 890A, 890B, 890N. In addition, the input reads 810 (e.g., the same input reads) are also accessible to the machines.
An alignment tool 860A, 860B, 860N can execute at each of the machines to cluster the input reads 810 into read clusters 865A, 865B, 865N based on respective of the contig clusters 850A, 850B, 850N. The same or different alignment tools can be used.
As shown, contig cluster alignment against the input reads can be performed as a sub-problem requiring no additional information regarding other contig clusters.
In practice, clustering at one machine may finish before the clustering at another machine. The resulting read clusters 865A-N can be assembled into respective contig sets as described herein.
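The per-cluster alignment just described can be sketched as follows. This is an illustrative sketch only: exact substring containment is a hypothetical stand-in for a real alignment tool such as Bowtie, and thread workers stand in for the machines 890A-N:

```python
from concurrent.futures import ThreadPoolExecutor

def cluster_reads_parallel(input_reads, contig_clusters, max_workers=4):
    # Each worker aligns the full input read set against one contig
    # cluster; exact substring containment stands in for an aligner.
    def align(cluster):
        return [r for r in input_reads
                if any(r in contig for contig in cluster)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(align, contig_clusters))
```

Each worker needs only its own contig cluster plus the shared input reads, so no inter-worker communication is required.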
As a result, in any of the examples herein, read clustering can be performed by sending contig clusters to one or more other machines, and subsequently receiving read clusters clustered based on respective of the contig clusters and the input reads.

Methods of Iterative Sequence Assembly
A. Exemplary Sources of Input Reads
The sequence used to generate input reads for the disclosed methods can be obtained from any organism of interest, such as a microbe (such as a virus, bacteria, fungi, or protozoa), plant, or animal (such as a mammal, for example, a human or mouse or an amphibian such as a frog or salamander).
In one example, input reads are obtained from a virus, such as a positive-strand RNA virus or a negative-strand RNA virus. Exemplary positive-strand RNA viruses include, but are not limited to: Picornaviruses (such as Aphthoviridae [for example, foot-and-mouth-disease virus (FMDV)], Cardioviridae, Enteroviridae (such as Coxsackie viruses, Echoviruses, Enteroviruses, and Polioviruses), Rhinoviridae (Rhinoviruses), and Hepataviridae (Hepatitis A viruses)); Togaviruses (examples of which include rubella and alphaviruses (such as Western equine encephalitis virus, Eastern equine encephalitis virus, and Venezuelan equine encephalitis virus)); Flaviviruses (examples of which include Dengue virus, West Nile virus, and Japanese encephalitis virus);
Caliciviridae (which includes Norovirus and Sapovirus); and Coronaviruses (examples of which include SARS coronaviruses, such as the Urbani strain). Exemplary negative-strand RNA viruses include, but are not limited to: Orthomyxoviruses (such as the influenza virus), Rhabdoviruses (such as Rabies virus), and Paramyxoviruses (examples of which include measles virus, respiratory syncytial virus, and parainfluenza viruses).
In one example, input reads are obtained from one or more DNA viruses. DNA viruses include, but are not limited to: Herpesviruses (such as Varicella-zoster virus, for example, the Oka strain; cytomegalovirus; and Herpes simplex virus (HSV) types 1 and 2); Adenoviruses (such as Adenovirus type 1 and Adenovirus type 41); Poxviruses (such as Vaccinia virus); and Parvoviruses (such as Parvovirus B 19).
In one example, input reads are obtained from a Retrovirus. Examples of retroviruses include, but are not limited to: human immunodeficiency virus type 1 (HIV-1), such as subtype C; HIV-2; equine infectious anemia virus; feline immunodeficiency virus (FIV); feline leukemia viruses (FeLV); simian immunodeficiency virus (SIV); and avian sarcoma virus.
In one example, input reads are obtained from one or more of the following: HIV; Hepatitis A virus; Hepatitis B (HB) virus; Hepatitis C (HC) virus; Hepatitis D (HD) virus; Hepatitis E virus;
a respiratory virus (such as influenza A & B, respiratory syncytial virus, human parainfluenza virus, or human metapneumovirus), or West Nile Virus.
In one example, input reads are obtained from bacteria. Bacteria can be classified as gram- negative or gram-positive. Exemplary gram-negative bacteria include, but are not limited to:
Escherichia coli (e.g., K-12 and O157:H7), Shigella dysenteriae, and Vibrio cholerae. Exemplary gram-positive bacteria include, but are not limited to: Bacillus anthracis, Staphylococcus aureus, Listeria, pneumococcus, gonococcus, and streptococcal meningitis. In one example, input reads are obtained from one or more of the following: Group A Streptococcus; Group B Streptococcus; Helicobacter pylori; Methicillin-resistant Staphylococcus aureus; Vancomycin-resistant enterococci; Clostridium difficile; E. coli (e.g., Shiga toxin producing strains); Listeria; Salmonella; Campylobacter; B. anthracis; Chlamydia trachomatis; and Neisseria gonorrhoeae.
Protozoa, nematodes, and fungi are also types of pathogens. In one example, input reads are obtained from Plasmodium (e.g., Plasmodium falciparum to diagnose malaria), Leishmania, Acanthamoeba, Giardia, Entamoeba, Cryptosporidium, Isospora, Balantidium, Trichomonas, Trypanosoma (e.g., Trypanosoma brucei), Naegleria, or Toxoplasma. In one example, input reads are obtained from Coccidioides immitis or Blastomyces dermatitidis.
In one example, input reads are obtained from an organism having a large genome, such as a plant genome, for example, a flowering plant genome, an amphibian, a fish, or a mammal. Plant genomes that can be analyzed with the disclosed methods and systems include angiosperms, gymnosperms, monocotyledons and dicotyledons. Examples of monocotyledonous plants include asparagus, field and sweet corn, barley, wheat, rice, sorghum, onion, pearl millet, rye and oats. Examples of dicotyledonous plants include tomato, tobacco, cotton, potato, rapeseed, field beans, soybeans, peppers, lettuce, peas, alfalfa, clover, cole crops or Brassica oleracea (e.g., cabbage, broccoli, cauliflower, brussels sprouts), radish, carrot, beets, eggplant, spinach, cucumber, squash, melons, cantaloupe, sunflowers and various ornamentals. Woody species include poplar, pine, sequoia, cedar, oak, and the like. In one example the plant is a highly inbred crop, such as rice, soybean, corn, cotton, wheat, oat, barley, or sorghum. Other particular types of plants include fruit plants (such as strawberry), fruit trees (such as a citrus tree, e.g., orange, lime, lemon or grapefruit tree), as well as other fruit trees (e.g., cherry, papaya or plum tree), flowering plants, and grasses. In one example, the plant is a crop plant, such as soybean, corn, canola, tobacco, cotton and the like. Other exemplary plants that can be used with the methods provided herein include rice, maize, wheat, barley, sorghum, millet, grass, oats, tomato, corn, potato, banana, kiwi fruit, avocado, melon, mango, cane, sugar beet, tobacco, papaya, peach, strawberry, raspberry, blackberry, blueberry, lettuce, cabbage, cauliflower, onion, broccoli, brussels sprouts, cotton, canola, grape,
soybean, oil seed rape, asparagus, beans, carrots, cucumbers, eggplant, melons, okra, parsnips, peanuts, peppers, pineapples, squash, sweet potatoes, rye, cantaloupes, peas, pumpkins, sunflowers, spinach, apples, cherries, cranberries, grapefruit, lemons, limes, nectarines, oranges, pears, tangelos, tangerines, lily, carnation, chrysanthemum, petunia, rose, geranium, violet, gladioli, orchid, lilac, crabapple, sweetgum, maple, poinsettia, locust, ash, linden tree, fern, Psilotum nudum, Picea abies, and Arabidopsis thaliana.
B. Generation of Input Reads
The disclosed methods use input reads containing sequence information for a target to be sequenced using the disclosed de novo genome assembly methods. Methods for generating input reads are routine in the art. In some examples, the disclosed methods or systems include a step of generating input reads, for example, by sequencing a target nucleic acid of interest.
In one example, input reads can be obtained from publicly available databases such as GenBank, EMBL (the European Molecular Biology Laboratory), the NCI DTP, CellMiner, and Ingenuity websites, as well as DDBJ (the DNA Databank of Japan).
In some examples, a target genome of interest (such as an entire genome or portion thereof, such as a particular gene or chromosome) can be sequenced using routine methods to produce input reads. The disclosure is not limited to particular sequencing methods, as numerous different methods are known in the art. In some examples, one or more target nucleic acids in a sample are sequenced to produce input reads. For example, at least 2, at least 3, at least 4, at least 5, at least 10, at least 100, at least 1000, or at least 10,000 different target nucleic acids (such as 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, 100, 500, 1000, 5000, 10,000, 30,000 or 50,000 different target nucleic acids) can be sequenced simultaneously or contemporaneously to produce input reads.
In some examples, an entire target genome or entire chromosome is sequenced. In some examples, at least 2, at least 3, at least 4, at least 5, at least 10, at least 12, at least 15, at least 20, at least 23, at least 30, at least 40, or at least 50 different chromosomes (such as 1 to 23, 2 to 100, 2 to 23, or 5 to 50 different chromosomes) are sequenced or assembled using the disclosed methods or systems, and thus can be used to generate input reads.
The disclosure is not limited to target nucleic acid molecules of any particular length. For example, genes can be used to produce input reads and can be assembled using the disclosed methods. In some examples, genes contain at least 500 nucleotides (nt), at least 1000 nt, at least 10,000 nt, at least 100,000 nt, at least 500,000 nt, at least 1,000,000 nt, or at least 10,000,000 nt (such as 1,000 to 10,000 nt; 1,000 to 1,000,000 nt or 1,000 to 25,000 nt). For example,
chromosomes can be used to produce input reads and can be assembled using the disclosed
methods. In some examples, chromosomes contain at least 100,000 nucleotides (nt), at least 500,000 nt, at least 1,000,000 nt, at least 10,000,000 nt, at least 100,000,000 nt, at least
1,000,000,000 nt, at least 2,000,000,000 nt or at least 3,750,000,000 nt (such as 10,000 to
10,000,000,000 nt; 1,000,000 to 5,000,000,000 nt or 50,000,000 to 250,000,000 nt). For example, genomes can be used to produce input reads and can be assembled using the disclosed methods. In some examples, genomes are at least 1,000 nt, at least 5,000 nt, at least 10,000 nt, at least 100,000 nt, at least 500,000 nt, at least 1,000,000 nt, at least 10,000,000 nt, at least 100,000,000 nt, at least 1,000,000,000 nt, at least 2,000,000,000 nt, at least 3,000,000,000 nt, at least 10,000,000,000 nt, at least 50,000,000,000 nt, at least 100,000,000,000 nt, at least 130,000,000,000 nt or at least
150,000,000,000 nt (such as 1,000 to 10,000,000 nt; 1,000,000 to 200,000,000,000 nt or 50,000,000 to 150,000,000,000 nt).
1. Isolation of nucleic acid molecules
Nucleic acid molecules, such as DNA or RNA or both, can be extracted from a sample using routine methods, prior to sequencing. General methods for nucleic acid extraction and isolation are well known in the art and are disclosed in standard textbooks of molecular biology, including Ausubel et al., Current Protocols of Molecular Biology, John Wiley and Sons (1997). Methods for nucleic acid extraction from paraffin-embedded tissues are disclosed, for example, by Rupp and Locker (Lab. Invest., 56:A67, 1987) and DeAndres et al. (BioTechniques, 18:42044, 1995). In particular examples, nucleic acid extraction can be performed using a purification kit, buffer set and protease obtained from commercial manufacturers, such as Qiagen, according to the manufacturer's instructions. Other commercially available nucleic acid isolation kits include MASTERPURE™ Complete DNA and RNA Purification Kit (EPICENTRE™
Biotechnologies) and Paraffin Block RNA Isolation Kit (Ambion, Inc.). In some examples, the nucleic acids are isolated from an FFPE sample, which may be sectioned.
In some examples, one or more chromosomes from a sample are isolated using routine methods.
2. Exemplary samples
In some examples, the disclosed methods include the step of obtaining a sample, preparing the sample for analysis (for example fixing the sample, isolating nucleic acids, or combinations thereof), or both. Methods of obtaining a biological sample from a subject are known in the art. For example, methods of obtaining tissue, such as tissue from a human or other mammalian subject, as well as from other organisms such as a microbe or plant, are known. In some examples,
the sample is obtained from a microbe (e.g., fungus, bacteria, or parasite; specific examples listed above), plant, or other organism of interest.
The sample can be fresh, frozen, or fixed. In some examples, samples are processed post-collection by fixation and in some examples are wax- (e.g., paraffin-) embedded. Fixatives for mounted cell and tissue preparations are well known in the art and include, without limitation, formalin fixative, 95% alcoholic Bouin's fixative; 95% alcohol fixative; B5 fixative, Bouin's fixative, Karnovsky's fixative (glutaraldehyde), Hartman's fixative, Hollande's fixative, Orth's solution (dichromate fixative), and Zenker's fixative (see, e.g., Carson, Histotechnology: A Self-Instructional Text, Chicago: ASCP Press, 1997). In some examples, the sample is a fixed, wax-embedded tissue sample.
In some examples the sample includes a plant or part thereof, such as a leaf, flower, pollen, or ovule.
In some examples, the sample is a tumor sample, which in some examples is obtained by surgical excision of all or part of the tumor, by collecting a fine needle aspirate from the tumor, as well as other methods known in the art. Examples of hematological tumors include leukemias, including acute leukemias (such as acute lymphocytic leukemia, acute myelocytic leukemia, acute myelogenous leukemia and myeloblastic, promyelocytic, myelomonocytic, monocytic and erythroleukemia), chronic leukemias (such as chronic myelocytic (granulocytic) leukemia, chronic myelogenous leukemia, and chronic lymphocytic leukemia), polycythemia vera, lymphoma, Hodgkin's disease, non-Hodgkin's lymphoma (indolent and high grade forms), multiple myeloma, Waldenstrom's macroglobulinemia, heavy chain disease, myelodysplasia syndrome, hairy cell leukemia and myelodysplasia. Examples of solid tumors, such as sarcomas and carcinomas, include fibrosarcoma, myxosarcoma, liposarcoma, chondrosarcoma, osteogenic sarcoma, and other sarcomas, synovioma, mesothelioma, Ewing's tumor, leiomyosarcoma, rhabdomyosarcoma, colon carcinoma, lymphoid malignancy, pancreatic cancer, breast cancer, lung cancers (such as non-small cell lung cancer), ovarian cancer, prostate cancer, hepatocellular carcinoma, squamous cell carcinoma, basal cell carcinoma, adenocarcinoma, sweat gland carcinoma, medullary thyroid carcinoma, papillary thyroid carcinoma, pheochromocytomas, sebaceous gland carcinoma, papillary carcinoma, papillary adenocarcinomas, medullary carcinoma, bronchogenic carcinoma, renal cell carcinoma, hepatoma, bile duct carcinoma, choriocarcinoma, Wilms' tumor, cervical cancer, testicular tumor, seminoma, bladder carcinoma, and CNS tumors (such as a glioma, astrocytoma, medulloblastoma, craniopharyngioma, ependymoma, pinealoma, hemangioblastoma, acoustic neuroma, oligodendroglioma, meningioma, melanoma, neuroblastoma and retinoblastoma). Thus, a sample can be obtained from any such tumor.
3. Sequencing Methods
Exemplary sequencing methods that can be used to generate input reads for the methods provided herein include, but are not limited to: Maxam-Gilbert sequencing, chain termination methods (e.g., Sanger sequencing), shotgun sequencing, bridge PCR, single-molecule real-time sequencing (e.g., from Pacific Bio), ion semiconductor (e.g., Ion Torrent sequencing),
pyrosequencing (e.g., from 454 Life Sciences), sequencing by synthesis (e.g., from Illumina), sequencing by ligation (e.g., SOLiD sequencing), polony sequencing, HeliScope single-molecule sequencing, massively parallel signature sequencing (MPSS) (Brenner et al., Nature Biotechnology, 18:630-634, 2000), and exome sequencing (for example see Dolled-Filhart et al., Scientific World Journal, 2013:730210, 2013 and Do et al., Hum Mol Genet. 21(R1):R1-9, 2012).
4. Length of Input Reads
The resulting sequencing reads generated from a sequencing method are referred to as input reads. In some examples, an individual input read is at least about 25 nucleotides (nt) in length, such as at least about 30 nt, at least 40 nt, at least 50 nt, at least 60 nt, at least 70 nt, at least 80 nt, at least 90 nt, at least 100 nt, at least 110 nt, at least 120 nt, at least 130 nt, or at least 150 nt, such as 25 to 1000 nt, 25 to 500 nt, 25 to 150 nt, 25 to 120 nt, or 50 to 120 nt. In some examples an individual input read is longer, such as at least about 1000 nt in length, such as at least about 2000 nt, at least 3000 nt, at least 4000 nt, at least 5000 nt, at least 6000 nt, at least 7000 nt, at least 8000 nt, at least 9000 nt, at least 10,000 nt, at least 50,000 nt, at least 100,000 nt, or at least 1,000,000 nt, such as 1000 to 10,000,000 nt, 1000 to 1,000,000 nt, 1000 to 100,000 nt, 1000 to 10,000 nt, or 1000 to 5000 nt.
In some examples, an average length of an input read in a population of input reads is at least about 25 nt in length, such as at least about 30 nt, at least 40 nt, at least 50 nt, at least 60 nt, at least 70 nt, at least 80 nt, at least 90 nt, at least 100 nt, at least 110 nt, at least 120 nt, at least 130 nt, or at least 150 nt, such as 25 to 1000 nt, 25 to 500 nt, 25 to 150 nt, 25 to 120 nt, or 50 to 120 nt. In some examples, an average length of an input read in a population of input reads is longer, such as at least about 1000 nt in length, such as at least about 2000 nt, at least 3000 nt, at least 4000 nt, at least 5000 nt, at least 6000 nt, at least 7000 nt, at least 8000 nt, at least 9000 nt, at least 10,000 nt, at least 50,000 nt, at least 100,000 nt, or at least 1,000,000 nt, such as 1000 to 10,000,000 nt, 1000 to 1,000,000 nt, 1000 to 100,000 nt, 1000 to 10,000 nt, or 1000 to 5000 nt.
A population of input reads can have numerous individual input reads, such as at least 1000 different input reads, at least 5000, at least 10,000, at least 50,000, at least 100,000, at least
500,000, at least 1,000,000, at least 5,000,000, at least 10,000,000, at least 50,000,000, at least 100,000,000, at least 500,000,000, at least 1,000,000,000, at least 5,000,000,000, at least
10,000,000,000, at least 50,000,000,000, at least 100,000,000,000, at least 500,000,000,000, or more different input reads.
C. Partitioning Input Reads into Read Tiles
Each of the input reads is placed or partitioned into a read tile or "bucket". The number of read tiles used can be selected by the user, but there are at least two different read tiles. In some examples, there are at least 10, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 500, at least 600, at least 750, or at least
1000 different read tiles, such as 10 to 1000, 10 to 600, 10 to 500, 10 to 100, 50 to 1000, 50 to 500, or 50 to 100 different read tiles.
In one example, the assignment of a particular input read to a particular read tile is random. For example, the first input read can be assigned to read tile 1, the second input read can be assigned to read tile 2, and so forth. For example, if there are 50 read tiles, input reads 1, 51, 101 etc. can be assigned to read tile 1, input reads 2, 52, 102 etc. can be assigned to read tile 2, and so forth. In another example, the input reads are assigned to the read tiles in groups. For example, if there are 50 read tiles, the first 1/50th of the input reads can be assigned to read tile 1, the second 1/50th of the input reads can be assigned to read tile 2, and so forth.
In some examples, the input reads are non-randomly assigned to a read tile. For example, if it is known that particular reads are near one another on the target, these related input reads can be assigned to the same read tile.
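The round-robin scheme described above can be sketched in a few lines. This is an illustrative Python sketch; the function name and list-of-lists representation are illustrative choices:

```python
def partition_reads(reads, num_tiles):
    """Round-robin partitioning: input read i goes to tile
    (i mod num_tiles), matching the "reads 1, 51, 101 ... to tile 1"
    scheme described for 50 tiles."""
    tiles = [[] for _ in range(num_tiles)]
    for i, read in enumerate(reads):
        tiles[i % num_tiles].append(read)
    return tiles
```

The group-based alternative would instead slice the read list into `num_tiles` contiguous blocks; a non-random variant would route reads known to be near one another to the same tile.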
D. Assembly of Read Tiles into Contigs
Each of the individual read tiles containing a plurality of input reads is assembled, for example using a known assembler. Assemblers align the individual input reads in each read tile based on the ends of the sequences of the input reads. An empirically determined threshold (e.g., n nt) can be used to confirm that one input read is related to another. Examples of known assemblers include Velvet available from the European Bioinformatics Institute, SOAPdenovo available from BGI, ALLPATHS-LG, ABySS, IDBA, and the like.
The assembly of the read tiles can be done independently in serial or in parallel on a shared or distributed memory computer cluster. At this stage, there is no communication between the assemblies of the different read tiles.
The result of assembling the input reads in each read tile is the production of contigs in each read tile. As a result of this assembly, the number of input reads decreases, but the number of read tiles (now referred to as contig sets) stays the same. All of the contigs generated from all of the read tiles are combined into a single contig set.
E. Contig Clustering
The contig set generated from assembling the input reads in each read tile is used to generate a graph, referred to as a contig connectivity graph, showing the relatedness of the individual contigs in the set (e.g., as shown in Figure 10 as described below).
In one example, the contig connectivity graph is generated by creating a graph vertex for each contig, wherein the weight of the vertex is the contig length. The edge weights can be defined based on the degree of overlap between one contig and another contig.
The resulting contig connectivity graph is then partitioned into contig clusters, for example using a graph-partitioning tool, such as METIS, KaHIP, kMetric, Scotch of the Laboratoire Bordelais de Recherche en Informatique, or the like. In some examples, the contig lengths (vertex weights) are given less importance than the contig overlapping degrees (edge weights) in the graph partitioning process. However, if there are many short contigs with little connectivity in-between, the vertex weights can be given greater importance (for example, this may be true during the first few iterations, until the contig length increases). The number of contig clusters can be determined by a user. In some examples, at least 10, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 10,000, or more than 10,000 different contig clusters, such as 10 to 1000, 10 to 500, 10 to 100, 50 to 1000, 50 to 500, or 50 to 100 different contig clusters, are generated. The number of clusters can be chosen based on the given computation resources, the genome's characteristics, and the like. Dramatic variation within the same species is supported.
The resulting contig clusters are then used as part of the input in the next step.
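The weighted graph that a partitioner would consume can be sketched as follows. This is an illustrative sketch only: the inline suffix/prefix comparison is a hypothetical stand-in for a real overlap detector, and `min_overlap` is an illustrative threshold; an external tool such as METIS would then cut the weighted graph into clusters:

```python
def build_contig_graph(contigs, min_overlap=4):
    """Vertex weight = contig length; edge weight = longest suffix/prefix
    overlap between two contigs, considered in both orders."""
    def overlap(a, b):
        for n in range(min(len(a), len(b)), min_overlap - 1, -1):
            if a[-n:] == b[:n]:
                return n
        return 0
    vertex_weights = [len(c) for c in contigs]
    edges = {}
    for i in range(len(contigs)):
        for j in range(i + 1, len(contigs)):
            w = max(overlap(contigs[i], contigs[j]),
                    overlap(contigs[j], contigs[i]))
            if w:
                edges[(i, j)] = w
    return vertex_weights, edges
```

The all-pairs loop is quadratic; in practice the word map described earlier restricts the comparison to contig pairs that share at least one word.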
F. Read Clustering
The entire input read set (i.e., the initial sequence data partitioned into the read tiles) is then aligned to the contig clusters generated above. Each input read having similarity to a contig cluster is collected. In some examples, a single input read may have similarity to more than one contig cluster, and thus be assigned to more than one contig cluster (e.g., due to repeats).
The input read-to-contig cluster alignment can be performed using an alignment tool, such as Bowtie.
The result is the production of a plurality of read clusters, which can be longer and/or have more bridges than the contig clusters.
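The collection of reads into per-cluster read clusters can be sketched as follows. This is an illustrative sketch only; exact substring containment is a hypothetical stand-in for an alignment tool such as Bowtie:

```python
def cluster_reads(input_reads, contig_clusters):
    """Collect each input read into every contig cluster it aligns to;
    a read from a repeated region may therefore appear in more than one
    read cluster, as described above."""
    read_clusters = [[] for _ in contig_clusters]
    for read in input_reads:
        for k, cluster in enumerate(contig_clusters):
            if any(read in contig for contig in cluster):
                read_clusters[k].append(read)
    return read_clusters
```

For example, a read matching contigs in two clusters is emitted into both read clusters, mirroring the repeat case noted above.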
G. Read Assembly
The resulting read clusters are assembled as described above for "Assembly of Read Tiles into Contigs". This produces a plurality of contigs, which are longer but fewer in number than those generated above.
Merged contigs can be assembled in a second-phase assembly to promote elongation. In some examples, such as when the assembly of the merged contigs is not improving, the merged contig assembly can be skipped in later iterations to save time. In some examples, merged contig assembly is only performed in the first few iterations, if at all.
H. Post Processing
The resulting contigs can be analyzed to see if there was improvement (e.g., length of contigs increased, number of bridges between contigs increased, or combinations thereof). If there was improvement, the process can be repeated with one or more additional iterations of the steps of contig clustering, read clustering, and read assembly. If there was no improvement, the process can be terminated, and the resulting contigs outputted. In some examples, the steps of contig clustering, read clustering, and read assembly are iterated at least 2 times, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 15, at least 20, at least 25, at least 50, at least 75, at least 100, at least 150, at least 200, at least 500, at least 1000, at least 2500, at least 5000, or at least 10,000 times, such as 50 to 100, 50 to 500, or 100 to 1000 times.
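The overall iterate-until-no-improvement loop can be sketched as follows. This is an illustrative sketch only: the three callables are hypothetical stand-ins for the assembly, contig-clustering, and read-clustering steps, and total contig length is just one possible improvement metric:

```python
def iterative_assembly(read_tiles, assemble, cluster_contigs, cluster_reads,
                       max_iters=10):
    """Top-level refinement loop: assemble tiles, then repeatedly
    re-cluster contigs and reads and re-assemble, stopping when the
    improvement metric (here, total contig length) stops increasing."""
    contigs = [c for tile in read_tiles for c in assemble(tile)]
    best = sum(len(c) for c in contigs)
    for _ in range(max_iters):
        contig_clusters = cluster_contigs(contigs)
        read_clusters = cluster_reads(contig_clusters)
        new_contigs = [c for rc in read_clusters for c in assemble(rc)]
        total = sum(len(c) for c in new_contigs)
        if total <= best:   # no improvement: terminate, keep best so far
            break
        contigs, best = new_contigs, total
    return contigs
```

The loop bound and the length-based metric are illustrative; a bridge-count metric or a combination could be substituted.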
I. Optional subsequent steps
The contig clusters can be provided to a user. The resulting contigs can be analyzed, for example to identify functional regions or domains (e.g., annotate the contigs), to compare the sequence of the contigs to other species, or to examine genetic variations. In some examples, the resulting contigs are assembled into a genome or chromosome.
In some examples, a treatment plan is administered to a patient based on the sequencing results, for example the presence or absence of particular mutations in the target assembled.
The present disclosure is illustrated by the disclosed non-limiting Examples.
Example Al
Tiled Genome Assembly (Tiger) Implementation
This example describes the Tiger implementation used to generate the results described in the examples below.
Most assemblers can deal with small genomes (such as E. coli) very well using a small amount of computation resources and time. For very large genomes (such as mammalian-size genomes), most assemblers either cannot produce good results or require a tremendous amount of resources and/or time. In addition, assemblers usually have their own design characteristics targeting specific types of genomes [31]. The disclosed approach substantially reduces the computational complexity and resources needed for large genome assembly. The method and system more effectively divide the genome assembly problem into smaller sub-problems and assemble them individually (for example, without inter-node communication) with the flexibility of using various off-the-shelf assemblers. In some examples, an iterative refinement approach is used to gradually improve the quality of problem partitioning and the overall solution.
Tiled genome assembly
The rationale of the disclosed methods and systems follows the logic that genome assembly could be done part-by-part instead of as a whole; namely, the input reads can be divided into multiple tiles (or clusters) and the assembly results of all tiles can be merged as the final assembly. This approach is termed tiled genome assembly. It was observed that if all related information (e.g., reads) were available for only a short fragment of a target genome, most assemblers would get excellent results and require much less memory. Including more reads that correspond to larger regions increases the memory requirement and potentially makes assembly results worse. The main reason is that de novo assemblers cannot tell which part of the genome the reads belong to. However, if the reads are partitioned in an effective way, assemblers can produce better results while requiring much less memory.
Taking the DBG-based assemblers as an example, ideally a contig (from a specific region in the target genome) is built using only the k-mers extracted from the reads contributing to that region. However, most assemblers extract the k-mers from all the input reads and mix them together when constructing the DBG. More specifically, the k-mers whose source reads are not contributing to a specific region in the target genome may still be used in the DBG construction process. Such k-mers are termed ambiguous k-mers. For genomes that are less repetitive, the ambiguous k-mers could be few. But for genomes that are highly repetitive, they can be significant enough to confuse the assembly process.
Therefore, a new approach was designed to partition the input reads into multiple tiles. The goal was to have each tile contain only those reads contributing to a specific region of the target genome. The reads in such read tiles are called well-clustered reads. Thus the effect of ambiguous k-mers can be dramatically reduced. Since each read tile has all the necessary information, no communication is needed among the assemblies of different read tiles. Regions in the target genome can therefore be assembled independently, and as a result, each assembly needs less memory to complete.
Read clustering based on clustered contigs
A well-clustered read tile should contribute to a continuous region of the target genome. The region can be composed of one long contig or a set of contigs covering the whole region without gaps in-between. A set of such contigs is called well-clustered and can be obtained by sorting or clustering closely related contigs together. Therefore, by aligning the input reads against a well-clustered contig set, the reads having high similarity with sub-sections in the contigs can be collected as a well-clustered read tile. This process is called read clustering. The collected reads can then be assembled to produce a similar set of the contigs but with improved contig lengths and quality.
Intermediate reference genome
A target genome can be considered as a combination of multiple continuous regions (the minimum number is the number of chromosomes in the target genome), where each region can be contributed completely by one or many contigs. Therefore, ultimately there would be multiple well-clustered contig sets corresponding to multiple regions in the target genome. In the disclosed approach, the contigs from assembly were treated as the intermediate reference genome and the contigs were arranged in multiple clustered contig sets.
In some examples, for de novo assembly, random partitions of the reads in tiles are used, the reads in each tile assembled, and all contig sets merged into one as the intermediate reference. In this case, the reads in an initial, randomly partitioned tile will correspond to random regions in the target genome. As a result, the initial contig sets that serve as the intermediate reference will likely be fragmented and will have errors. The disclosed approach iteratively improves the clustering and
thus the quality of the intermediate reference genome. In the end, the intermediate reference genome converges to the final target genome.
Iterative assembly
The transformation from reads to contigs and from contigs to reads forms a cycle. Thus the whole assembly flow can be iterative. As more transformation iterations are performed, contigs become longer with higher quality since read clustering improves, and each tile contains less irrelevant information that may confuse the assembly process. Thus, in some embodiments, the improvement on contig quality of the current iteration can be carried over to the next iteration through more accurate read clustering.
The Tiger Program
A software program called "Tiled Iterative GEnome assembleR," or Tiger, was developed implementing the technologies described herein. In the description of the program, "tile" is a synonym of "set" or "cluster," representing the tiled computation nature of the assembly process. The conceptual flow is illustrated in Figure 9.
Step 1. Reads partitioning: The input reads were first partitioned into multiple read tiles. In one example, the input reads were randomly partitioned evenly into N subsets, where N can be determined by users based on the available resources and the total size of the input reads. In some examples, a non-random initial assignment of reads can be used, for instance in cases where a priori knowledge about the reads can provide clustering information.
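The even random partitioning of Step 1 can be sketched as follows (a minimal illustration; the function name and round-robin scheme are assumptions, not Tiger's actual implementation):

```python
import random

def partition_reads(reads, n_tiles, seed=0):
    """Randomly and evenly partition the input reads into n_tiles subsets."""
    rng = random.Random(seed)            # fixed seed keeps runs reproducible
    shuffled = list(reads)
    rng.shuffle(shuffled)
    tiles = [[] for _ in range(n_tiles)]
    for i, read in enumerate(shuffled):
        tiles[i % n_tiles].append(read)  # round-robin gives near-even sizes
    return tiles

tiles = partition_reads([f"read{i}" for i in range(10)], 3)
print([len(t) for t in tiles])  # -> [4, 3, 3]
```

For paired-end libraries, the same scheme would be applied to read pairs rather than individual mates so pairing is preserved within a tile.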
Step 2. Read assembly: Read tiles were assembled individually, using an off-the-shelf assembler, such as Velvet, embedded into Tiger. Depending on the available system memory, the assembly of read tiles can be done independently in serial or in parallel on a shared or distributed memory computer cluster. There is no communication between the assemblies of different read tiles.
For any embedded assembler requiring a specified k-mer size, k-mer sizes can be decided either manually by users or automatically through the auto-k-mer scheme in Tiger. For the manual k-mer designation, one k-mer size is used in all read tile assemblies for all Tiger iterations.
Otherwise, the auto-k-mer scheme randomly picks k-mer sizes within a given range and records the best k-mer size in the assembly results. The best k-mer size and the randomly picked ones will be considered in the subsequent assemblies. A user-specified k-mer size can be introduced into this k-mer history database but may not be used again if the first attempt is not good. The number of used reads in the assembly, the total length of the contigs, and the resultant N50s are used to evaluate whether a k-mer size can help produce the best result without knowing the target genome. This avoids the problem of picking a contig set with high N50 and low coverage and enables Tiger to find a good direction in the iterative process and to converge to high-quality results. Alternative embodiments could optionally determine k-mer size to maximize N50 or other metrics, rather than coverage.
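The ranking idea behind the auto-k-mer scheme can be sketched as a comparator over per-k-mer assembly summaries (the field names and the lexicographic weighting below are illustrative assumptions, not Tiger's exact scoring):

```python
def better_kmer_result(a, b):
    """Prefer the assembly result whose k-mer size used more reads and
    produced more total contig length before considering N50.  This
    ordering avoids picking a contig set with a high N50 but low coverage.
    Each result is a dict with keys: used_reads, total_length, n50."""
    key = lambda r: (r["used_reads"], r["total_length"], r["n50"])
    return a if key(a) >= key(b) else b

high_coverage = {"used_reads": 100, "total_length": 5000, "n50": 10}
high_n50_only = {"used_reads": 90, "total_length": 6000, "n50": 50}
print(better_kmer_result(high_coverage, high_n50_only) is high_coverage)  # -> True
```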
Since Step 2 is the first time the initial read tiles are assembled, the contigs can be short and may cause long running times in the later iterations. In some examples this is addressed by merging the contig sets and feeding the merged contig set to Velvet with the LONGSEQUENCE flag enabled. Velvet may further elongate the contigs by treating the input contigs as long reads. The new contig set is used when it is better than the merged contig set. The output contig set is scaffolded by SSPACE [24]. The scaffolded contig set is the input to Step 3. The purpose of this scaffolding process is to leverage paired-end information to bridge contigs which may be from different assemblies. This is beneficial for producing better-clustered contigs at Step 3. The scaffolding process also helps resolve duplicated contigs from different assemblies.
Step 3. Contig clustering: An overview of the contig clustering algorithm used is depicted in Figure 10. In Figure 10, words are extracted from contigs. The number of common words (e.g., base pair sets, base pairs, or the like) between two contigs is used as the edge weight in the graph. Contig lengths are modeled as vertex weights. The contig connectivity graph is thus built, followed by the METIS partitioning process. The partitioned sub-graphs are clustered contig sets.
A graph that models the contig connectivity intensity is built from the merged contig set. This graph is called the "contig connectivity graph." Graph vertices are the contigs. Vertex weights are contig lengths. Edge weights are defined based on the degree to which contigs overlap with each other. A threshold (e.g., 10 or the like) can be set, below which an overlap is not counted.
Different thresholds can be picked depending on available resources or the like. Alternative weighting schemes can be used as understood by one skilled in the art. The contig connectivity graph is much smaller than the DBG, so it uses a much smaller amount of memory. A graph-partitioning tool, METIS [25], was used to partition the graph into contig clusters. METIS is known to be fast and memory-efficient in processing millions of graph vertices.
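As a simplified stand-in for the partitioning step, contigs can be grouped via connected components of the connectivity graph (a sketch only; Tiger uses METIS, which additionally balances partition sizes using the vertex and edge weights, something plain components do not do):

```python
def cluster_contigs_by_components(edges, n_contigs):
    """Group contig ids into clusters using union-find over the edges of
    the contig connectivity graph."""
    parent = list(range(n_contigs))      # union-find forest
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    clusters = {}
    for c in range(n_contigs):
        clusters.setdefault(find(c), []).append(c)
    return list(clusters.values())

# Contigs 0-1 overlap, 2-3 overlap, 4 is isolated.
print(cluster_contigs_by_components([(0, 1), (2, 3)], 5))  # -> [[0, 1], [2, 3], [4]]
```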
The contig lengths (vertex weights) were given less importance than the contig overlapping degrees (edge weights) in the graph partitioning process. This is because it was desired that the partitioned contig connectivity sub-graphs be more edge-oriented instead of being vertex-oriented. But the vertex weights need to be considered for the situations where there exist many short contigs with little connectivity in-between. This is very common for the assembly results in the first few
iterations on assembling randomly partitioned read tiles. These short contigs can be distributed to all clusters evenly. This not only preserves their existence in the following Tiger iterations but also reduces their influence on the rest of the clustered contigs.
By focusing graph partitioning on edge intensity, overlapping contigs will be grouped together and would be rebuilt as one long complete contig later at Step 5. These contigs are used to produce well-clustered reads in the read clustering process at Step 4. That is, this contig clustering step makes a crucial contribution to the quality of results of the later steps.
Building a contig connectivity graph can be time-consuming if a traditional sequence alignment method is used, such as the Smith-Waterman algorithm [26]. Since the degree of overlap between contigs need not be determined exactly for the methods and systems herein, a heuristic algorithm was used based on the image recognition algorithm using vocabulary trees in [27], with inverse document frequency scoring dropped. Consecutive sequences of equal length (called "words") are extracted from each of the contigs in the set. The extracted words are used to build a map (or inverted file) from the words to the contigs containing them. A contig connectivity graph is then built from the map with the edge weights being set to the number of base pairs in common between the vertices (contigs) connected by that edge. Since the connectivity graph stores only the contig connectivity information, the memory usage of this step is much lower than that of the read assembly step. Regarding runtime, building the connectivity graph can dominate the whole step. Building the word map can be done in parallel if desired. Overall, the runtime is still much shorter than that of Steps 4 and 5.
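The word-based heuristic above can be sketched as follows (the function name and the scoring, distinct shared words times word length as a proxy for shared base pairs, are illustrative assumptions rather than Tiger's exact implementation):

```python
from collections import defaultdict
from itertools import combinations

def connectivity_edges(contigs, word_len=8):
    """Build contig-connectivity edge weights from an inverted file that
    maps fixed-length words to the contigs containing them."""
    word_map = defaultdict(set)                    # word -> contig ids
    for cid, seq in enumerate(contigs):
        for i in range(len(seq) - word_len + 1):
            word_map[seq[i:i + word_len]].add(cid)
    edges = defaultdict(int)                       # (cid_a, cid_b) -> weight
    for cids in word_map.values():
        for a, b in combinations(sorted(cids), 2):
            edges[(a, b)] += word_len              # proxy for shared base pairs
    return dict(edges)

# Two contigs sharing the single 8 bp word "AAAAAAAA".
print(connectivity_edges(["AAAAAAAACC", "GGAAAAAAAA"]))  # -> {(0, 1): 8}
```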
Step 4. Read clustering: The entire input read set is aligned to the contig sets from Step 3. The reads having high similarity to each contig set are collected as one read cluster. Each read is collected once for each cluster. This means a read can appear in multiple read clusters if similar contigs are not clustered together. This process guarantees any read potentially contributing to a contig set will be collected. The read-to-contig alignment was done using Bowtie [28], but one skilled in the art will appreciate that other methods can be used. For paired-end reads, if one of a read pair aligns to the given contig cluster, both reads are collected. This step provides the opportunity to extend and/or bridge the contigs. This clustering process can be done in parallel or in serial on a shared or distributed memory computer cluster. In some examples, no
communication is needed between read tiles. The required memory is also proportional to the size of a read tile.
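The collection rule of Step 4 can be sketched as follows; `aligns(read, contig_set)` is a hypothetical stand-in for a real aligner such as Bowtie, and the toy exact-substring matcher below is for illustration only:

```python
def cluster_read_pairs(read_pairs, contig_sets, aligns):
    """Collect one read cluster per contig set.  If either mate of a pair
    aligns, both mates are collected; a pair may appear in several
    clusters, but at most once per cluster."""
    clusters = []
    for contig_set in contig_sets:
        cluster = []
        for r1, r2 in read_pairs:
            if aligns(r1, contig_set) or aligns(r2, contig_set):
                cluster.append((r1, r2))
        clusters.append(cluster)
    return clusters

# Toy stand-in aligner: exact substring match against any contig in the set.
aligns = lambda read, cset: any(read in contig for contig in cset)
clusters = cluster_read_pairs([("ACG", "TTT"), ("GGG", "CCC")],
                              [["AACGA"], ["GGGTT"]], aligns)
print([len(c) for c in clusters])  # -> [1, 1]
```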
Step 5. Read assembly: The read tiles were assembled as described for Step 2. However, the assembly of the merged contigs from all read tile assemblies may optionally not be performed. If the assembly of the merged contigs is not improving, it is skipped in later iterations to save time. It was observed that it is useful to have this additional assembly in the first few iterations.
Step 6. Post processing: In some examples, the methods will exit upon reaching the given number of iterations. In these examples, iterations which do not exit will return to Step 3. Steps 3, 4, and 5 thus form an iterative process. One skilled in the art will appreciate that systems and methods can be implemented which include additional pre- or post-processing steps or which modify certain iterations, for example optionally performing the read tile assembly sub-step according to Step 5.
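The overall flow of Steps 1 through 6 can be sketched as a driver loop; `assemble`, `cluster_contigs`, and `cluster_reads` are hypothetical callables standing in for Steps 2/5, 3, and 4, respectively:

```python
def tiger_iterate(reads, initial_tiles, assemble, cluster_contigs,
                  cluster_reads, n_iterations):
    """Driver for the iterative tiled-assembly flow (sketch)."""
    tiles = initial_tiles                               # Step 1 output
    contig_sets = [assemble(t) for t in tiles]          # Step 2
    for _ in range(n_iterations):
        contig_sets = cluster_contigs(contig_sets)      # Step 3
        tiles = cluster_reads(reads, contig_sets)       # Step 4
        contig_sets = [assemble(t) for t in tiles]      # Step 5
    return contig_sets                                  # Step 6: exit
```

Each `assemble` call is independent of the others, which is what allows the per-tile assemblies of Steps 2 and 5 to run serially or in parallel without inter-node communication.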
In summary, the rationale behind the disclosed methods and systems is that the
improvement on contig quality of the current iteration can be carried over to the next iteration through more accurate read clustering. An optimal clustering solution will be achieved if only reads contributing to a contig are clustered for assembling the contig. This approach differentiates the disclosed algorithm from the previous work and provides a framework of improving an existing contig set further.
Example A2
Results Using Tiger
This example describes the results of using Tiger to demonstrate its significant benefits. It is understood that the particular choice of assemblers and parameters provides an illustrative embodiment of the disclosure, and does not limit the scope of the disclosure to the described implementation.
Evaluation environment setup
Two well-known assemblers were embedded into Tiger for this evaluation, Velvet [3] (version 1.2.03, compiled with max k-mer 96 and 4 categories) and SOAPdenovo [4] (version 1.05), named Tiger-Velvet and Tiger-Soap, respectively.
Two types of evaluation were carried out: type R and type I, labeled as Tiger-Velvet/Soap-R/I. Type R (Random) started from the randomly partitioned multiple read tiles followed by Tiger-Velvet/Soap, which demonstrated that Tiger can manage the randomly partitioned read tiles and gradually improve the result to achieve better NG50 than the corresponding single-tile assembly (i.e., the solution provided by the original assembler itself). Type I (Improved) started from the assembly result generated by Velvet/SOAPdenovo (instead of random partitioning), followed by Tiger-Velvet/Soap, respectively, to improve the result. This demonstrated that Tiger could also improve the single-tile assembly produced by its embedded assembler. Both types of evaluation used 150-tile assembly with auto-k-mers.
The machine for these evaluations was equipped with an Intel Core i7 CPU 950 (4 physical cores at a clock rate of 3.07 GHz), 24 GB system memory, and 2 TB disk space. Five such machines were used.
Data used
The human chromosome 14 data set in the GAGE assembly competition [20, 21] was used to assess Tiger. This chromosome is 88 Mbp excluding Ns (e.g., low-quality values or reads, such as those that could not be positively identified as A, T, C, or G by the sequencer). The data set details are summarized in Table 1.
Table 1. Details of the human chromosome 14 read libraries.
As in [20], the reads were corrected by Quake [22] before assembly. The other data set was the 4.6 Mbp long E. coli genome (Illumina paired-end reads, accession no. SRR001665) with 36 bp read length, generated from a 200 bp insert length E. coli K-12 MG1655 library (accession no. NC_000913). The assembly results were analyzed by the evaluation script from [21], using the MUMmer package [23], with 200 bp as the minimum contig length.
The same analysis metrics as in [21] are reused in Table 2 and Table 5. The NG50 value is the smallest contig size such that 50% of the reference genome is contained in contigs of size NG50 or larger. The error-corrected NG50 is calculated by splitting contigs at every misjoin and at every indel that is longer than 5 bp. SNPs are the single nucleotide differences. Inversions are the reversed sequences in strands. Relocations are the sequence rearrangements. "Unaligned ref." is the number of bases in the reference that were not aligned to any contig. "100% - Unaligned ref." is the genome coverage. "Duplicated ref." is the sequence occurrence frequency in contigs.
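The NG50 definition above can be made concrete with a short computation (a sketch; real evaluations additionally apply the error-correction splitting described in the text):

```python
def ng50(contig_lengths, reference_length):
    """Smallest contig length L such that contigs of length >= L together
    cover at least 50% of the reference genome."""
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= reference_length / 2:
            return length
    return 0  # contigs cover less than half the reference

# Contigs of 10, 8, 5, 2 kbp against a 40 kbp reference:
# cumulative sums 10, 18, 23; 23 >= 20, so NG50 = 5 kbp.
print(ng50([10, 8, 5, 2], 40))  # -> 5
```

Unlike N50, the denominator is the reference length rather than the total assembly length, which is why a fragmented, low-coverage assembly cannot inflate its NG50.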
Evaluation results
Tables 2-7 summarize the evaluation results using Tiger. For the human chr14 data (Tables 2-4), the Tiger results had better NG50s and genome coverage as compared to the best Velvet and SOAPdenovo results using k-mer sizes 61 and 55, respectively, indicating that Tiger can iteratively improve the assembly results.
Table 2. The human chromosome 14 assembly results in terms of continuity, accuracy, and statistics. The columns include the number of contigs, NG50 size and its error-corrected size, the number of single nucleotide polymorphisms (SNPs), the number of indels and misjoins in contigs, total assembly length, genome coverage (100 - Unaligned ref.), and duplications. K-mer 61 and 55 are the best k-mer sizes for Velvet and SOAPdenovo, respectively. "#k" stands for the applied k-mer size. "#i" stands for the iteration number.
Table 3. The human chromosome 14 assembly results in terms of contiguity and accuracy. The columns from the second onward are the number of contigs, NG50 size and its error-corrected size, the number of single nucleotide polymorphisms (SNPs), and the number of indels and misjoins in contigs. K-mer 61 and 55 are the best k-mer sizes for Velvet and SOAPdenovo (rows with bold border), respectively. "#i" stands for the iteration number. Since each experimental result takes the GAGE evaluation scheme about 4 to 8 hours to finish, we picked only the experimental results for evaluation based on the reported higher NG50 and lower contig number.
                      Continuity                 Accuracy (ea)
Experiments           Num.    NG50   NG50 corr.  SNP     Indels          Misjoins
                      (ea)    (kbp)  (kbp)               > 5     ≤ 5     Inv.  Reloc.
Velvet 61k            28,974  5.2    4.7         82,235  1,410   16,345  335   266
Tiger-Velvet-R 13i    20,703  8.4    7.1         74,923  1,475   16,490  391   311
Tiger-Velvet-R 20i    19,045  10.3   8.5         78,818  1,660   18,009  336   223
Tiger-Velvet-R 29i    19,159  10.8   8.7         81,905  1,863   18,960  296   247
Tiger-Velvet-R 39i    19,363  10.9   9.0         80,722  1,727   18,736  305   255
Tiger-Velvet-R 51i    20,397  11.2   9.2         82,874  1,775   19,251  282   195
Tiger-Velvet-R 64i    20,043  11.2   9.2         82,735  1,907   19,327  304   204
Tiger-Velvet-R 69i    19,929  11.4   9.3         82,563  1,816   19,498  297   222
Tiger-Velvet-R 125i   20,189  11.6   9.3         84,577  1,963   19,884  293   240
Tiger-Velvet-I 3i     22,871  10.4   8.5         85,017  1,802   19,671  491   330
Tiger-Velvet-I 5i     22,012  10.8   8.8         85,233  1,840   19,673  405   302
Tiger-Velvet-I 7i     21,623  10.9   8.9         84,811  1,816   19,654  357   297
Tiger-Velvet-I 9i     21,128  11.0   9.0         84,395  1,851   19,583  374   300
SOAPdenovo 55k        50,094  3.0    3.0         67,956  736     11,130  19    17
SOAPdenovo 61k        54,255  2.8    2.8         68,317  632     10,069  34    33
Tiger-Soap-R 10i      56,709  2.1    2.0         59,528  676     9,381   86    64
Tiger-Soap-R 15i      57,028  3.1    2.9         65,006  850     10,909  91    91
Tiger-Soap-R 26i      59,346  3.5    3.3         67,080  891     11,550  99    72
Tiger-Soap-R 36i      59,324  3.6    3.4         67,473  934     11,895  100   75
Tiger-Soap-R 45i      60,287  3.4    3.2         67,957  885     11,514  118   54
Tiger-Soap-R 53i      61,720  3.5    3.3         67,458  911     11,876  85    69
Tiger-Soap-R 120i     60,134  3.6    3.4         68,881  927     11,912  108   77
Tiger-Soap-I 3i       56,396  3.5    3.3         69,209  949     12,018  116   84
Tiger-Soap-I 5i       56,706  3.7    3.5         68,862  959     12,237  107   74
Tiger-Soap-I 7i       55,173  3.8    3.6         69,215  999     12,391  119   86
Table 4. The human chromosome 14 assembly statistics. The columns from the second onward are the total assembly lengths, genome coverage, etc. K-mer 61 and 55 are the best k-mer sizes for Velvet and SOAPdenovo (rows with bold border), respectively. "#i" stands for the iteration number. "100 - Unaligned ref." is the genome coverage.
Table 5. E. coli (SRR001665) 24-tile assembly results in terms of contiguity, accuracy, and statistics. The columns include the number of contigs, NG50 size and its error-corrected size, the number of single nucleotide polymorphisms (SNPs), the number of indels and misjoins in contigs, total assembly length, genome coverage (100 - Unaligned ref.), and duplications. K-mer 25 and 27 are the best k-mer sizes for
Velvet and SOAPdenovo, respectively. "#k" stands for the applied k-mer size. "#i" stands for the iteration number. Both Tiger-Velvet-I and Tiger-Soap-I evaluations use the best results from Velvet and SOAPdenovo as input, respectively.
Table 6. E. coli (SRR001665) 24-tile assembly results in terms of contiguity and accuracy. The columns from the second onward are the number of contigs, NG50 size and its error-corrected size, the number of single nucleotide polymorphisms (SNPs), and the number of indels and misjoins in contigs. K-mer 25 and 27 are the best k-mer sizes for Velvet and SOAPdenovo (rows with bold border), respectively. "#i" stands for the iteration number. Both the Tiger-Velvet-I and Tiger-Soap-I experiments use the best results from Velvet and SOAPdenovo as input, respectively.
Table 7. E. coli (SRR001665) 24-tile assembly statistics. The columns from the second onward are the total assembly lengths, genome coverage (100 - Unaligned ref.), etc. K-mer 25 and 27 are the best k-mer sizes for Velvet and SOAPdenovo (rows with bold border), respectively. "#i" stands for the iteration number.
To demonstrate that Tiger can improve not only the best single-tile assembly result but also a common one produced by its embedded assemblers, Tiger-Velvet-I used the best Velvet result as input and Tiger-Soap-I used the SOAPdenovo result with k-mer size 61. It is noted that although Tiger applies an iterative assembly process, with the same parameter settings and inputs, Tiger reproduces the same results.
The NG50s and coverages of type R show continuous improvement from iteration to iteration. The best NG50s by Tiger-Velvet/Soap-R reach 11.6 kbp and 3.6 kbp (or 2.2x and 1.2x improvement), respectively, as compared to the best Velvet/SOAPdenovo results. The type I results also show continuous NG50 improvement. Regarding the coverages, although the Tiger-Velvet-I results have an improving trend, such a trend is not clear in the Tiger-Soap-I results. The best NG50s by Tiger-Velvet/Soap-I reach 10.9 kbp and 3.8 kbp (or 2.1x and 1.3x improvement), respectively.
As for accuracy, the best Tiger-Velvet/Soap results of both R and I flows had more SNP and indel errors. The misjoin errors by the best Tiger-Velvet-R result were fewer. But the best results of Tiger-Velvet-I and Tiger-Soap-I/R had more misjoin errors. This may be because the read clustering step collected some irrelevant reads due to unresolved duplications.
Additional steps or elements to address the resolution of duplications can be added to the method. Note that, in the E. coli results, both Tiger-Velvet/Soap produced misjoin errors similar to their counterparts. This indicates that the higher error rate in Tiger is also related to the read characteristics.
Table 8 lists the runtime and memory usage results observed on the read assembly to demonstrate the low memory usage of tiled assembly using multiple auto-k-mers. Tiger-Velvet consumes the least amount of memory, as low as 0.16 GB. On the other hand, Tiger-Soap still consumes 1.8 GB even though the read tile file size is only around 10 MB, whereas the 1-tile read file size is 4.7 GB. The runtime between the evaluations by Velvet/SOAPdenovo and Tiger-Velvet/Soap shows that Velvet and SOAPdenovo run much faster when the read tile size is small. For instance, the runtime for the 150-tile Tiger-Velvet-R assembly with 8 auto-k-mers is less than twice that of the 1-tile Velvet assembly. The runtime between the Tiger-Velvet/Soap evaluations with 3 and 8 auto-k-mers shows that embedded Velvet and SOAPdenovo take more time and memory for better-clustered read tiles because the contigs in a DBG can be assembled further by the clustered reads. However, for less-clustered read tiles, the contigs are shorter in a DBG with smaller memory and the assembly ends earlier.
Table 8. The runtime and memory usage of the assemblies on the human chromosome 14 genome. All evaluations are done using 1 thread. "#k" stands for the applied k-mer size. "#i" stands for the iteration number. Note: The runtime and memory usage for Tiger is on the read assembly (Step 5) only.
Table 9 further lists the detailed computational resource usage of Tiger and its counterparts using different numbers of threads across computers. The runtime and memory usage include the whole Tiger assembly process from Steps 3 to 5. Since the resource usage of a Tiger iteration can be very different, especially for the type R tests, we used the first iteration of type I because it is stabilized and consumes more resources than the type R iterations. The peak memory usage observed using Tiger with one thread was 1.87 GB and the runtime was 4.69 hours. The 1.87 GB memory is from the contig clustering (Step 3) because the current implementation targets a 4 GB memory machine. The memory usage of the 4-thread execution was 2.44 GB. This demonstrates Tiger's capability of running on commodity computers.
Table 9. Comparison of the runtime and memory usage on the human chromosome 14 assembly. "#k" stands for the applied k-mer size. "#i" stands for the iteration number.
+ The memory usage across machines could not be measured in our environment.
When more threads across computers were given, the runtime speedups were 2.98x, 5.69x, and 7.16x, which are not proportional to the linear speedups of 4x, 12x, and 20x, respectively, with the given thread numbers (4, 12, and 20). Since there were unparallelized parts, Tiger was dissected into steps with individual timing information, as shown in Figures 11A-11B. In Figure 11A, the steps are shown in the bars and the legend from top to bottom (e.g., the top portion of a bar is for Step 5; the middle portion of a bar is for Step 4; and the bottom portion of a bar is for Step 3). Different numbers of threads across machines are used.
In Figure 11B, the steps in the bars run from right to left, and the steps in the legend run from top to bottom (e.g., the leftmost portion of a set of bars is for Linear, and the rightmost portion of a set of bars is for Steps 3-5). The speedup baseline is labeled as 1x for the other corresponding columns. The k-mer size 61 is used in all tests to avoid varying runtime caused by different k-mer sizes. Step 5* does not include the SSPACE result since it does not execute across computers.
For the 1-thread evaluation, Step 4 took up to 81.86% of the combined time of Steps 3, 4, and 5. Further optimization may decrease the amount of time consumed. Step 5 took 15.30%. Step 4 performs read-to-contig alignments, where the runtimes of the alignment tasks are similar to one another. This fits the bulk-synchronous-parallel computation model best, so the speedup numbers were 3.41x, 10.74x, and 15.26x, close to the linear speedups of 4x, 12x, and 20x, respectively. At Step 5, the bulk-synchronous-parallel computation model is also used. The last contig scaffolding task was parallelizable only within one computer, so when the scaffolding task was in progress, the other computers were idle. However, although the rest of the tasks were mostly parallelizable, the runtime speedup was still not linear. This is because the assembly time of each read tile is very different from one another, such that unbalanced load takes place often, meaning many threads were idle, waiting for the last one to finish. This suggests the bulk-synchronous-parallel model may not work well for Step 5 on parallel read assemblies.
For the Velvet and SOAPdenovo evaluations, the memory usage did not change much when more threads were added. When 4 threads were used, the runtime speedups for Velvet and SOAPdenovo were 2.02x and 1.72x, respectively. No tests on multiple computers were performed since neither assembler could execute across computers with distributed memory. Although the assemblies using the best k-mers consumed about 8.5 GB, locating the best k-mers required enumerating all possible k-mers, which actually required more than the 24 GB of memory on the machine used, e.g., k-mer 37 for Velvet. A computer cluster with 2 TB memory was used to overcome the memory explosion in these assemblies. On the other hand, Tiger did not have this problem since in our evaluations each read tile size was about 1/150 of the input reads. This shows an exemplary advantage of Tiger when a 2 TB memory machine is not attainable.
Example A3
Discussion of Results
Choice and effect of the number of read tiles
The choice of the number of read tiles can affect not only the processing time needed for good results but also the quality of results the assembler can reach. The greater the number of read tiles, the longer the processing time and the better the quality of results the assembler can reach. Since in the beginning iterations contigs are shorter and have less overlap with one another, the transformation between reads and contigs needs more time to converge. When read tiles reach a well-clustered state, the assembler can focus on a smaller set of the read information and produce better quality results.
Choice of k-mers for read tiles
The choice of k-mer sizes for DBG-based assemblers is an unresolved issue. This has a noticeable impact on the assembly results. Related works [2, 29, 30] either explicitly iterate over all possible k-mer sizes or implicitly find suitable ones at different levels of assembly granularity for best results. In some examples, when using Tiger, the input reads are arranged in multiple relatively small clustered read tiles such that these k-mer size searching algorithms can provide better results. For the assemblers that require specifying k-mer sizes, the auto-k-mer scheme in Tiger picks the best k-mer sizes for each read tile. It was observed that the best k-mer sizes selected by the auto-k-mer scheme for each read tile are actually not the best ones used by the best single-tile assemblies. For example, in Tiger-Velvet-R, the selected top three k-mer sizes were actually 55, 57, and 59 instead of the best k-mer size 61 used by Velvet.
Novel features in Tiger
Iterative improvement of assembly results: No existing sequencing method provides this functionality. The iterative transformation between contigs and reads gradually improves the read clustering quality, and thus assemblers need to manage only a smaller portion of the original read information. In the iterative process disclosed herein, the scale and complexity of the assembly problems are reduced.
Low memory footprint of each read tile assembly: In some examples, the focus of Tiger is on decomposing the input reads into sub-assembly problems, instead of decomposing the de Bruijn graph data structure. The required memory for an assembly is inversely proportional to the number of read tiles. When the read coverage exceeds the number of read tiles, the assembly of each tile may need a large amount of memory in the beginning iterations due to the random reads partitioning. This is because each read tile may contain a considerable proportion of all k-mers for the whole target genome assembly. The read tile number can be increased to make the memory of each read tile assembly acceptable.
Assembler embedding: It is known that every assembler has its characteristics for specific types of genomes [31]. The input to the disclosed framework can be an assembly result from assembler A. For example, Tiger can embed assembler A to further improve it, as was demonstrated in Example A2. In addition, a system of the disclosure can also embed another assembler B to improve the results produced by A.
Scalable parallel assembly process: In Tiger, the most time-consuming steps are read clustering and read assembly. Both are highly parallel and do not need communication between threads. This makes the framework provided herein suitable for either distributed or shared memory computer clusters.
Example A4
Observations
More duplications: More duplications were observed in Tiger assemblies. For a single-tile assembly, assemblers usually can detect duplications and resolve some of them. Duplications are more likely to take place in Tiger because assemblies for tiles are done independently. When the duplications are at the contig ends, scaffolding tools usually can resolve them by merging contigs together. For example, the scaffolding tool SSPACE can help eliminate contig-end duplications. When the duplications are in the middle of two contigs, a post-processing step can be used to resolve them.
Shorter scaffolds: The scaffold results are not provided because, in the example, at the contig clustering step contigs are clustered only based on their degree of overlap with one another, but whether the contigs can be scaffolded is not taken into consideration. A scaffolding phase can be added.
Example A5
Theorems
Formally, a sequence assembly process can be represented in terms of A = (C, B, R), where C is a set of legitimately correct contigs, B is a set of legitimately correct base pairs, and R is a set of input reads, where C is composed of B, and B is contributed by R, or C = B ← R. This is referred to as the assembly using the input reads as a whole. Then, the following theorems can be expressed.
Theorem 1: Assemblies of well-clustered multiple read sets can reach the same genome coverage as the assembly of the reads as a whole. This means the multiple subsets of contigs C0, ..., CC can be contributed by multiple subsets of reads R0, ..., RR, where C = {C0, ..., CC} and R = {R0, ..., RR}; or Ci ← Ri.
Here, one contig Ci = {B0, ..., BB}, and one base pair set Bi can be contributed by a read set Ri, or Bi ← Ri, where Bi = {b0, ..., bb} and Ri = {r0, ..., rr}. A well-clustered read set is an exact set of reads, Ri, that contributes to a legitimately correct contig, Ci, if and only if Ci can only be contributed by Ri. (Note that the well-clustered read set is difficult to achieve in practice due to duplications and repetitive elements in the genome. What is described here is the ideal case.)
To prove this theorem, the following lemmas are needed.
Lemma 1: A base pair b is said to be legitimately correct if all reads, rb = {rb0, ..., rbn}, that can contribute to it are considered in the assembly process. This can be expressed as an all-to-one mapping, b ← rb = {rb0, ..., rbn}. Then rb is said to be an exact set of reads for b.
This is obvious: if a base pair is deemed legitimately correct without considering all reads, the base pair may not in fact be correct.
Lemma 2: A contig c is said to be legitimately correct if all reads, rc, that can contribute to it are considered in the assembly process. This can be expressed as an all-to-one mapping, c ← rc = {rc0, ..., rcn}. Then rc is said to be an exact set of reads for c.
Based on Lemma 1, since c = {b0, ..., bn} and b ← {rb0, ..., rbn}, it is concluded that c ← {{rb00, ..., rb0n}, ..., {rbn0, ..., rbnn}} = rc. Based on Lemmas 1 and 2, Bi ← Ri can be proved, and since Ci = Bi, it can be concluded that Ci ← Ri.
Theorem 2: The N50 of the legitimately correct contig set C (built from the input reads as a whole) is as good as that of the contig set Citer that is constructed by iteratively adding the input reads R to the assembly process. This means the contigs in Citer can be elongated through iterative read clustering, which provides the exact set of reads for the assembly, until they repeat themselves or reach their ends.
This theorem can be proved by Lemmas 3 and 4.
Lemma 3: A contig c can be elongated legitimately correctly until it repeats itself or reaches its end if all reads, r, that can contribute to it are added to the assembly process.
This is true by Lemma 2 since the elongation process has all exact reads considered.
Lemma 4: A contig c can be adjusted in the elongation process when more reads are collected.
When non-exact reads are considered in the assembly process, tips can be generated. On the other hand, a tip can be resolved when it is assembled using the exact reads added later. (This adjustment is used in existing assemblers, such as Velvet.)
Lemmas 3 and 4 enumerate the only two situations in a contig elongation process. Hence, the length of a contig in Citer is as good as that in C, and it follows that Theorem 2 is true.
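Read compactly, the chain of Lemmas 1–2 and Theorem 1 can be restated in standard set notation (an editorial summary using the subscript conventions above, not language from the original disclosure):

```latex
% Lemma 1: a base pair is legitimately correct when contributed by
% its exact read set.
b \leftarrow r_b = \{r_{b0}, \dots, r_{bn}\}

% Lemma 2: a contig inherits the union of its base pairs' exact
% read sets.
c = \{b_0, \dots, b_n\}
  \;\Rightarrow\;
c \leftarrow \bigcup_{i=0}^{n} r_{b_i} = r_c

% Theorem 1: with well-clustered read sets, each contig subset is
% contributed by its own read subset, so genome coverage equals
% that of the whole-input assembly.
C = \{C_0, \dots, C_C\}, \quad R = \{R_0, \dots, R_R\}, \quad
C_i = B_i \leftarrow R_i
  \;\Rightarrow\;
C \leftarrow R
```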
Example - Exemplary Computing Systems
FIG. 12 illustrates a generalized example of a suitable computing system or environment
1200 in which several of the described innovations may be implemented. The computing system 1200 is not intended to suggest any limitation as to scope of use or functionality, as the innovations may be implemented in diverse general-purpose or special-purpose computing systems. A communication device as described herein can take the form of the described computing system 1200.
With reference to FIG. 12, the computing system 1200 includes one or more processing units 1210, 1215 and memory 1220, 1225. In FIG. 12, this basic configuration 1230 is included within a dashed line. The processing units 1210, 1215 execute computer-executable instructions. A processing unit can be a general-purpose central processing unit (CPU), a processor in an application-specific integrated circuit (ASIC), or any other type of processor. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power. For example, FIG. 12 shows a central processing unit 1210 as well as a graphics processing unit or co-processing unit 1215. The tangible memory 1220, 1225 may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory, etc.), or some combination of the two, accessible by the processing unit(s). The memory 1220, 1225 stores software 1280 implementing one or more innovations described herein, in the form of computer-executable instructions suitable for execution by the processing unit(s).
A computing system may have additional features. For example, the computing system 1200 includes storage 1240, one or more input devices 1250, one or more output devices 1260, and one or more communication connections 1270. An interconnection mechanism (not shown) such as a bus, controller, or network interconnects the components of the computing system 1200.
Typically, operating system software (not shown) provides an operating environment for other software executing in the computing system 1200, and coordinates activities of the components of the computing system 1200.
The tangible storage 1240 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, DVDs, or any other medium which can be used to store information in a non-transitory way and which can be accessed within the computing system 1200. The storage 1240 stores instructions for the software 1280 implementing one or more innovations described herein.
The input device(s) 1250 may be a touch input device such as a keyboard, mouse, pen, or trackball, a voice input device, a scanning device, or another device that provides input to the computing system 1200. For video encoding, the input device(s) 1250 may be a camera, video card, TV tuner card, or similar device that accepts video input in analog or digital form, or a CD-ROM or CD-RW that reads video samples into the computing system 1200. The output device(s) 1260 may be a display, printer, speaker, CD-writer, or another device that provides output from the computing system 1200.
The communication connection(s) 1270 enable communication over a communication medium to another computing entity. The communication medium conveys information such as computer-executable instructions, audio or video input or output, or other data in a modulated data signal. A modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media can use an electrical, optical, RF, or other carrier.
The innovations can be described in the general context of computer-readable media.
Computer-readable media are any available tangible media that can be accessed within a computing environment. By way of example, and not limitation, with the computing system 1200, computer-readable media include memory 1220, 1225, storage 1240, and combinations of any of the above.
The innovations can be described in the general context of computer-executable
instructions, such as those included in program modules, being executed in a computing system on a target real or virtual processor (e.g., which is ultimately executed in hardware). Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or split between program modules as desired in various embodiments. Computer-executable instructions for program modules may be executed within a local or distributed computing system.
The terms "system" and "device" are used interchangeably herein. Unless the context clearly indicates otherwise, neither term implies any limitation on a type of computing system or computing device. In general, a computing system or computing device can be local or distributed, and can include any combination of special-purpose hardware and/or general-purpose hardware with software implementing the functionality described herein.
For the sake of presentation, the detailed description uses terms like "determine" and "use" to describe computer operations in a computing system. These terms are high-level abstractions for operations performed by a computer, and should not be confused with acts performed by a human
being. The actual computer operations corresponding to these terms vary depending on implementation.
Example - Exemplary Cloud-Supported Environment
In example environment 1300, the cloud 1310 provides services for connected devices
1330, 1340, 1350 with a variety of screen capabilities. Connected device 1330 represents a device with a computer screen 1335 (e.g., a mid-size screen). For example, connected device 1330 could be a personal computer such as desktop computer, laptop, notebook, netbook, or the like.
Connected device 1340 represents a device with a mobile device screen 1345 (e.g., a small size screen). For example, connected device 1340 could be a mobile phone, smart phone, personal digital assistant, tablet computer, and the like. Connected device 1350 represents a device with a large screen 1355. For example, connected device 1350 could be a television screen (e.g., a smart television) or another device connected to a television (e.g., a set-top box or gaming console) or the like. One or more of the connected devices 1330, 1340, 1350 can include touch screen capabilities. Touchscreens can accept input in different ways. For example, capacitive touchscreens detect touch input when an object (e.g., a fingertip or stylus) distorts or interrupts an electrical current running across the surface. As another example, touchscreens can use optical sensors to detect touch input when beams from the optical sensors are interrupted. Physical contact with the surface of the screen is not necessary for input to be detected by some touchscreens. Devices without screen capabilities also can be used in example environment 1300. For example, the cloud 1310 can provide services for one or more computers (e.g., server computers) without displays.
Services can be provided by the cloud 1310 through service providers 1320, or through other providers of online services (not depicted). For example, cloud services can be customized to the screen size, display capability, and/or touch screen capability of a particular connected device (e.g., connected devices 1330, 1340, 1350).
In example environment 1300, the cloud 1310 provides the technologies and solutions described herein to the various connected devices 1330, 1340, 1350 using, at least in part, the service providers 1320. For example, the service providers 1320 can provide a centralized solution for various cloud-based services. The service providers 1320 can manage service subscriptions for users and/or devices (e.g., for the connected devices 1330, 1340, 1350 and/or their respective users).
Example - Exemplary Implementations
Although the operations of some of the disclosed methods are described in a particular, sequential order for convenient presentation, it should be understood that this manner of description encompasses rearrangement, unless a particular ordering is required by specific language set forth below. For example, operations described sequentially may in some cases be rearranged or performed concurrently. Moreover, for the sake of simplicity, the attached figures may not show the various ways in which the disclosed methods can be used in conjunction with other methods.
Any of the disclosed methods can be implemented as computer-executable instructions stored on one or more computer-readable storage media (e.g., non-transitory computer-readable media, such as one or more optical media discs, volatile memory components (such as DRAM or SRAM), or nonvolatile memory components (such as hard drives)) and executed on a computer (e.g., any commercially available computer, including smart phones or other mobile devices that include computing hardware). Any of the computer-executable instructions for implementing the disclosed techniques as well as any data created and used during implementation of the disclosed embodiments can be stored on one or more computer-readable media (e.g., non-transitory computer-readable media). The computer-executable instructions can be part of, for example, a dedicated software application or a software application that is accessed or downloaded via a web browser or other software application (such as a remote computing application). Such software can be executed, for example, on a single local computer (e.g., any suitable commercially available computer) or in a network environment (e.g., via the Internet, a wide-area network, a local-area network, a client-server network (such as a cloud computing network), or other such network) using one or more network computers.
For clarity, only certain selected aspects of the software-based implementations are described. Other details that are well known in the art are omitted. For example, it should be understood that the disclosed technology is not limited to any specific computer language or program. For instance, the disclosed technology can be implemented by software written in C++, Java, Perl, JavaScript, Adobe Flash, or any other suitable programming language. Likewise, the disclosed technology is not limited to any particular computer or type of hardware. Certain details of suitable computers and hardware are well known and need not be set forth in detail in this disclosure.
Furthermore, any of the software-based embodiments (comprising, for example, computer-executable instructions for causing a computer to perform any of the disclosed methods) can be uploaded, downloaded, or remotely accessed through a suitable communication means. Such
suitable communication means include, for example, the Internet, the World Wide Web, an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, and infrared communications), electronic communications, or other such communication means.
The disclosed methods, apparatus, and systems should not be construed as limiting in any way. Instead, the present disclosure is directed toward all novel and nonobvious features and aspects of the various disclosed embodiments, alone and in various combinations and
subcombinations with one another. The disclosed methods, apparatus, and systems are not limited to any specific aspect or feature or combination thereof, nor do the disclosed embodiments require that any one or more specific advantages be present or problems be solved.
Non-Transitory Computer-Readable Media
Any of the computer-readable media herein can be non-transitory (e.g., memory, magnetic storage, optical storage, or the like).
Storing in Computer-Readable Media
Any of the storing actions described herein can be implemented by storing in one or more computer-readable media (e.g., computer-readable storage media or other tangible media).
Any of the things described as stored can be stored in one or more computer-readable media (e.g., computer-readable storage media or other tangible media).
Methods in Computer-Readable Media
Any of the methods described herein can be implemented by computer-executable instructions in (e.g., encoded on) one or more computer-readable media (e.g., computer-readable storage media or other tangible media). Such instructions can cause a computer to perform the method. The technologies described herein can be implemented in a variety of programming languages.
Methods in Computer-Readable Storage Devices
Any of the methods described herein can be implemented by computer-executable instructions stored in one or more computer-readable storage devices (e.g., memory, magnetic
storage, optical storage, or the like). Such instructions can cause a computer to perform the method.
References
1. Chaisson M, Brinza D, Pevzner P: De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Research 2009, 19(2):336-346.
2. Butler J, MacCallum I, Kleber M, Shlyakhter I, Belmonte M, Lander E, Nusbaum C, Jaffe D:
ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Research 2008, 18(5):810-820.
3. Zerbino D, Birney E: Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 2008, 18(5):821-829.
4. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K et al.: De novo assembly of human genomes with massively parallel short read sequencing. Genome
Research 2010, 20(2):265-272.
5. Gnerre S, MacCallum I, Przybylski D, Ribeiro F, Burton J, Walker B, Sharpe T, Hall G, Shea T, Sykes S et al.: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences 2011, 108(4):1513-1518.
6. Simpson J, Wong K, Jackman S, Schein J, Jones S, Birol I: ABySS: A parallel assembler for short read sequence data. Genome Research 2009, 19(6):1117-1123.
7. Jackson BG, Regennitter M, Yang X, Schnable PS, Aluru S: Parallel de novo assembly of large genomes from high-throughput short reads. In: Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on. 1-10.
8. Liu Y, Schmidt B, Maskell D: Parallelized short read assembly of large genomes using de Bruijn graphs. BMC Bioinformatics 2011, 12(1):354.
9. Conway T, Bromage A: Succinct data structures for assembling large genomes.
Bioinformatics 2011, 27(4):479-486.
10. Conway T, Wazny J, Bromage A, Zobel J, Beresford-Smith B: Gossamer - A Resource Efficient de novo Assembler. Bioinformatics 2012.
11. Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G: De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature Genetics 2012, 44(2):226-232.
12. Ye C, Cannon C, Ma Z, Yu D, Pop M: SparseAssembler2: Sparse k-mer Graph for Memory Efficient Genome Assembly. 2011.
13. Simpson J, Durbin R: Efficient de novo assembly of large genomes using compressed data structures. Genome Research 2011, 22(3):549-556.
14. Zhang W, Chen J, Yang Y, Tang Y, Shang J, Shen B: A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. PLoS ONE 2011, 6(3):e17915.
15. Haiminen N, Kuhn D, Parida L, Rigoutsos I: Evaluation of Methods for De Novo Genome Assembly from High-Throughput Sequencing Reads Reveals Dependencies That Affect the Quality of the Results. PLoS ONE 2011, 6(9):e24182.
16. Earl D, Bradnam K, St. John J, Darling A, Lin D, Faas J, Hung On Ken Y, Vince B, Zerbino D, Diekhans M et al.: Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research 2011, 21(12):2224-2241.
17. Lin Y, Li J, Shen H, Zhang L, Papasian C, Deng HW: Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 2011, 27(15):2031-2037.
18. Miller J, Koren S, Sutton G: Assembly algorithms for next-generation sequencing data. Genomics 2010, 95(6):315-327.
19. Narzisi G, Mishra B: Comparing De Novo Genome Assembly: The Long and Short of It. PLoS ONE 2011, 6(4):e19175.
20. Salzberg S, Phillippy A, Zimin A, Puiu D, Magoc T, Koren S, Treangen T, Schatz M, Delcher A, Roberts M et al.: GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research 2011, 22(3):557-567.
21. GAGE [http://gage.cbcb.umd.edu]
22. Kelley D, Schatz M, Salzberg S: Quake: quality-aware detection and correction of sequencing errors. Genome Biology 2010, 11(11):R116.
23. Kurtz S, Phillippy A, Delcher A, Smoot M, Shumway M, Antonescu C, Salzberg S:
Versatile and open software for comparing large genomes. Genome Biology 2004, 5(2):R12.
24. Boetzer M, Henkel C, Jansen H, Butler D, Pirovano W: Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 2010, 27(4):578-579.
25. Karypis G, Kumar V: A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM J Sci Comput 1998, 20(1):359-392.
26. Smith T and Waterman M: Identification of Common Molecular Subsequences. J Molecular Biology 1981, 147:195-197.
27. Nister D, Stewenius H: Scalable Recognition with a Vocabulary Tree. In: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on: 2006. 2161-2168.
28. Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10(3):R25.
29. Peng Y, Leung H, Yiu SM, Chin F: IDBA: a practical iterative de Bruijn graph de novo assembler. In: Proceedings of the 14th Annual International Conference on Research in Computational Molecular Biology: 2010; Lisbon, Portugal. Springer-Verlag: 426-440.
30. VelvetOptimizer [bioinformatics.net.au/software.velvetoptimiser.shtml]
31. Baker M: De novo genome assembly: what every biologist should know. Nature Methods 2012, 9(4):333-337.
Alternatives
The technologies from any example can be combined with the technologies described in any one or more of the other examples. Where the word "exemplary" is used, it is intended to indicate an example and not necessarily an ideal embodiment. In view of the many possible embodiments to which the principles of the disclosure may be applied, it should be recognized that the illustrated embodiments are only examples of the disclosed technology and should not be taken as a limitation on the scope of the disclosed technology. Rather, the scope of the disclosed technology includes what is covered by the following claims. We therefore claim as our invention all that comes within the scope and spirit of the claims.
Claims
1. A method implemented at least in part by a computing system, the method comprising:
(a) receiving a plurality of input reads representing nucleotide sequences of DNA or RNA fragments;
(b) partitioning the input reads into respective read tiles;
(c) assembling sets of contigs for respective of the read tiles;
(d) clustering the contigs into contig clusters, wherein the clustering is based on overlap between the contigs;
(e) clustering the input reads into read clusters, wherein the clustering is based on similarity between the input reads and the contig clusters; and
(f) assembling sets of contigs for respective of the read clusters.
2. One or more computer-readable storage media comprising computer-executable instructions causing a computer to perform the method of claim 1.
3. The method of claim 1 further comprising:
iteratively repeating (d) - (f) until an end condition is met.
4. The method of claim 1 wherein:
the end condition comprises lack of improvement in the sets of contigs.
5. The method of claim 1 wherein:
the input reads are a result of sequencing reads generated from a sequencing method applied to a sample.
6. The method of claim 1 wherein:
the input reads are initially randomly partitioned into the read tiles.
7. The method of claim 1 wherein:
assembling the sets of contigs is performed by one or more assemblers.
8. The method of claim 7 further comprising:
attempting assembly with at least one of the assemblers via a plurality of different k-mer sizes;
evaluating performance of assembly under the different k-mer sizes; and
choosing a superior k-mer size based on the evaluating.
9. The method of claim 8 wherein:
performance of assembly is based at least on length of the contigs.
10. The method of claim 1 further comprising:
combining the sets of contigs before clustering the contigs.
11. The method of claim 1 wherein:
overlap is based on connectivity intensity between the contigs.
12. The method of claim 1 wherein clustering the contigs into contig clusters comprises:
building a graph of the contigs modeling overlap; and
partitioning the graph into clusters of contigs.
13. The method of claim 12 wherein:
nodes in the graph represent contigs; and
edges in the graph represent overlap between contigs connected by the edges.
14. The method of claim 13 wherein:
overlap is represented by a number of base pairs in common between contigs connected by the edges.
15. The method of claim 13 wherein:
the nodes further represent respective lengths of represented contigs.
16. The method of claim 1 wherein:
clustering the input reads into read clusters places at least one input read into a plurality of read clusters.
17. The method of claim 1 wherein:
similarity between input reads and contig clusters is based at least on base-pairs in common between a given input read and a contig in a given contig cluster.
18. The method of claim 1 further comprising:
distributing different read clusters to different machines for assembly.
19. The method of claim 18 wherein:
read cluster assembly is performed as a sub-problem requiring no additional information regarding other read clusters.
20. The method of claim 1 further comprising:
distributing different contig clusters to different machines for alignment against the input reads.
21. The method of claim 20 wherein:
contig cluster alignment against the input reads is performed as a sub-problem requiring no additional information regarding other contig clusters.
22. A system comprising:
a plurality of stored input reads representing sequencing reads generated from a sequencing method applied to a sample;
a partitioner configured to partition the stored input reads into a plurality of read tiles;
an assembler configured to assemble input reads in a given read tile into a set of contigs;
a contig cluster tool configured to organize the contigs into a graph and partition the graph into contig clusters; and
an alignment tool configured to accept the contig clusters and the input reads and output read clusters partitioning the input reads according to alignment against the contig clusters;
wherein the assembler is further configured to assemble input reads in a given read cluster into a new set of contigs.
23. A method implemented at least in part by a computing system, the method comprising:
(a) receiving a plurality of input reads representing nucleotide sequences of DNA or RNA fragments;
(b) partitioning the input reads into respective read tiles;
(c) coordinating assembly of the read tiles into respective sets of contigs;
(d) clustering the contigs into contig clusters, wherein the clustering is based on overlap between the contigs;
(e) coordinating clustering the input reads into read clusters, wherein the clustering is based on similarity between the input reads and the contig clusters; and
(f) assembling sets of contigs for respective of the read clusters.
24. A method implemented at least in part by a computing system, the method comprising:
receiving a plurality of contigs assembled from a plurality of input reads representing nucleotide sequences of DNA or RNA fragments;
clustering the contigs into contig clusters, wherein the clustering is based on overlap between the contigs;
clustering the input reads into read clusters, wherein the clustering is based on similarity between the input reads and the contig clusters; and
assembling sets of contigs for respective of the read clusters.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201261683358P | 2012-08-15 | 2012-08-15 | |
| US61/683,358 | 2012-08-15 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2014028771A1 true WO2014028771A1 (en) | 2014-02-20 |
Family
ID=50101513
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2013/055197 Ceased WO2014028771A1 (en) | 2012-08-15 | 2013-08-15 | Iterative genome assembler |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2014028771A1 (en) |
Non-Patent Citations (3)
| Title |
|---|
| CHARUVAKA, A ET AL.: "Evaluation Of Short Read Metagenomic Assembly.", BMC GENOMICS., vol. 12, no. 2:S8, 27 July 2011 (2011-07-27), pages 1 - 13 * |
| PERTEA, G ET AL.: "TIGR Gene Indices Clustering Tools (TGICL): A Software System For Fast Clustering Of Large EST Datasets.", BIOINFORMATICS., vol. 19, no. 5, 22 March 2003 (2003-03-22), pages 651 - 652 * |
| ZERBINO, D ET AL.: "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs.", GENOME RESEARCH, vol. 18, 18 March 2008 (2008-03-18), pages 821 - 829 * |
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12183436B2 (en) | 2013-12-18 | 2024-12-31 | Pacific Biosciences Of California, Inc. | String graph assembly for polyploid genomes |
| US10783984B2 (en) | 2014-12-18 | 2020-09-22 | Pacific Biosciences Of California, Inc. | De novo diploid genome assembly and haplotype sequence reconstruction |
| WO2020047553A1 (en) * | 2018-08-31 | 2020-03-05 | Guardant Health, Inc. | Genetic variant detection based on merged and unmerged reads |
| JP2021536612A (en) * | 2018-08-31 | 2021-12-27 | ガーダント ヘルス, インコーポレイテッド | Detection of genetic variants based on merged and unmerged reads |
| JP7535998B2 (en) | 2018-08-31 | 2024-08-19 | ガーダント ヘルス, インコーポレイテッド | Detection of genetic variants based on merged and unmerged reads |
| US20200286589A1 (en) * | 2019-03-07 | 2020-09-10 | International Business Machines Corporation | Methods of optimizing genome assembly parameters |
| US11830581B2 (en) | 2019-03-07 | 2023-11-28 | International Business Machines Corporation | Methods of optimizing genome assembly parameters |
| CN117995283A (en) * | 2024-04-03 | 2024-05-07 | 吉林大学 | Single-sample metagenome clustering method, system, terminal and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 13829550 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 13829550 Country of ref document: EP Kind code of ref document: A1 |