WO2025184234A1 - A personalized haplotype database for improved mapping and alignment of nucleotide reads and improved genotype calling - Google Patents
A personalized haplotype database for improved mapping and alignment of nucleotide reads and improved genotype calling
- Publication number
- WO2025184234A1 (PCT/US2025/017424)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- haplotype
- personalized
- population
- haplotypes
- nucleotide
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
Definitions
- nucleobase sequencing platforms determine individual nucleobases within sequences from genomic samples’ cells by using conventional Sanger sequencing or by using sequencing-by-synthesis (SBS) methods.
- SBS sequencing-by-synthesis
- existing platforms can monitor millions to billions of nucleic acid polymers being synthesized in parallel to predict nucleobase calls from a larger base call dataset.
- a camera in many SBS platforms captures images of irradiated fluorescent tags incorporated into oligonucleotides for determining the nucleobase calls.
- existing sequencing platforms send base call data (or image-based data) to a computing device to apply sequencing data analysis software that determines a nucleobase sequence for a genomic sample or other nucleic acid polymer. For instance, such software (i) maps and aligns nucleotide reads determined by the sequencing platform for a genomic sample with (ii) a reference genome comprising at least a primary contiguous sequence. Based on differences between the aligned nucleotide reads and the reference genome, existing data analysis software can further utilize a variant caller to identify genotype and/or variants within a genomic sample, such as single nucleotide polymorphisms (SNPs), insertions or deletions (indels), or structural variants.
- SNPs single nucleotide polymorphisms
- indels insertions or deletions
- existing sequencing systems often utilize reference genomes that misrepresent certain populations and foment inaccurate read mapping and alignment and mistaken variant calling.
- some existing sequencing systems use a linear reference genome that purportedly represents a consensus or example of genes and other nucleotide sequences of an organism.
- GRCh38 from the Genome Reference Consortium
- some existing sequencing systems generate or use a graph reference genome.
- some graph reference genomes include both a linear reference genome and graph augmentations with multi-nucleobase codes representing SNPs and/or indels and alternate contiguous sequences representing various alternative population haplotypes at given genomic regions.
- graph reference genomes stack and index numerous alternate contiguous sequences that can respectively stretch relatively long nucleobase distances (e.g., hundreds to thousands of base pairs in length) and, consequently, include redundant reference nucleobases overlapping a same region.
- the one-size-fits-all approach to graph reference genomes can accordingly consume excessive amounts of memory encoding alternate contiguous sequences.
- the human reference genome assemblies utilized by many existing sequencing systems comprise a haploid representation of different reference genomic samples.
- by using a haploid reference genome for mapping and alignment of diploid genomic samples, existing sequencing systems frequently align nucleotide reads and determine variant calls that are negatively influenced by a reference bias in favor of aligning such reads with alleles represented by the reference genome and to the detriment of alternative alleles, despite allele-variant differences between the sample diplotype and the haploid reference genome.
- FPs false positives
- FNs false negatives
- This disclosure describes embodiments of methods, non-transitory computer-readable media, and systems that (i) generate a personalized haplotype database for a genomic sample and (ii) utilize the personalized haplotype database to determine personalized alignments of nucleotide reads from the genomic sample.
- the disclosed systems can generate a haplotype database that is customized or personalized for a specific genomic sample based on a comparison of nucleotide reads from a genomic sample with candidate population haplotypes from a population haplotype database.
- the disclosed systems can generate a personalized diploid reference database with a customized set of haplotype pairs for diploid genomic regions of a reference genome.
- the disclosed systems can utilize the personalized haplotype database to determine personalized alignments of nucleotide reads from the genomic sample with respective genomic regions of a reference genome.
- the disclosed systems can identify, within a set of reference spans of a reference genome, a set of nucleotide reads from a genomic sample and candidate population haplotypes from a population haplotype database. For each reference span of the set of reference spans, the disclosed systems can generate haplotype set scores for sets of the candidate population haplotypes based on comparing the nucleotide reads and the candidate population haplotypes. Based on the haplotype set scores, the disclosed systems can generate a personalized haplotype database comprising a subset of population haplotypes from the population haplotype database. By utilizing a personalized haplotype database generated according to the disclosed methods, the disclosed systems can determine personalized alignments of a set of nucleotide reads from a genomic sample with respective genomic regions of a reference genome.
- FIG. 1 illustrates an environment in which a personalized sequencing system can operate in accordance with one or more embodiments of the present disclosure.
- FIG. 2 illustrates the personalized sequencing system identifying nucleotide reads from a genomic sample within reference spans of a reference genome in accordance with one or more embodiments of the present disclosure.
- FIG. 3A illustrates the personalized sequencing system determining initial alignments of a set of nucleotide reads from a genomic sample and generating an alignment data file in accordance with one or more embodiments of the present disclosure.
- FIG. 3B illustrates the personalized sequencing system generating a personalized haplotype database and a personalized alignment data file for a set of nucleotide reads from a genomic sample in accordance with one or more embodiments of the present disclosure.
- FIG. 4 further illustrates the personalized sequencing system generating a personalized haplotype database for a set of nucleotide reads from a genomic sample in accordance with one or more embodiments of the present disclosure.
- FIG. 5 illustrates the personalized sequencing system selecting haplotype sets for a set of reference spans of a reference genome and generating a personalized haplotype database in accordance with one or more embodiments of the present disclosure.
- FIG. 6A illustrates a set of base-level bins of a population haplotype database in accordance with one or more embodiments of the present disclosure.
- FIG. 6B illustrates a set of base-level bins of a personalized haplotype database in accordance with one or more embodiments of the present disclosure.
- FIG. 7 illustrates base-level bins and successive higher-level bins of a personalized haplotype database in accordance with one or more embodiments of the present disclosure.
- FIG. 8 illustrates comparative experimental results of determining variant calls from nucleotide reads that are (i) mapped and aligned with a reference genome using existing sequencing systems and (ii) mapped and aligned to a reference genome using a personalized haplotype database generated by the personalized sequencing system in accordance with various embodiments of the present disclosure.
- FIG. 9 illustrates comparative experimental results of mapping and aligning nucleotide reads from a genomic sample utilizing (i) an existing sequencing system and (ii) the personalized sequencing system in accordance with one or more embodiments of the present disclosure.
- FIG. 10 illustrates a flowchart of a series of acts for generating a personalized haplotype database and determining personalized alignments of a set of nucleotide reads from a genomic sample in accordance with one or more embodiments of the present disclosure.
- FIG. 11 illustrates a block diagram of an example computing device for implementing one or more embodiments of the present disclosure.
- This disclosure describes embodiments of a personalized sequencing system that can generate and utilize a personalized haplotype database to determine personalized alignments of nucleotide reads from a genomic sample.
- the personalized sequencing system can evaluate and select a customized set of population haplotypes based on a comparison of nucleotide reads from a genomic sample with candidate population haplotypes to generate a personalized haplotype database for the genomic sample.
- the personalized sequencing system initially maps and aligns a set of nucleotide reads from a genomic sample using a population haplotype database (e.g., a panel of 256 haplotypes) to generate an initial alignment data file (e.g., a rescored binary alignment map (BAM) file). Having determined an initial alignment of the set of nucleotide reads, the personalized sequencing system, in some embodiments, compares the set of nucleotide reads with haplotypes from the population haplotype database to select a subset of haplotypes to include within a personalized haplotype database (e.g., a personalized haplotype panel Tab Separated Values (TSV) file) for the genomic sample.
- a population haplotype database e.g., a panel of 256 haplotypes
- an initial alignment data file e.g., a rescored binary alignment map (BAM) file.
- BAM binary alignment map
- the personalized sequencing system utilizes an imputation model to evaluate sets of candidate population haplotypes across a plurality of genomic regions of the reference genome to generate haplotype set scores and select the customized set of population haplotypes based on the generated haplotype set scores.
- the personalized sequencing system generates a personalized haplotype database comprising customized pairs of population haplotypes representing a predicted diplotype for a respective genomic sample within one or more diploid genomic regions of a corresponding reference genome. Having generated a personalized haplotype database, the personalized sequencing system can utilize the personalized haplotype database for a genomic sample to determine personalized alignments of nucleotide reads from a corresponding genomic sample with respective genomic regions of a reference genome.
- the personalized sequencing system identifies, within a set of reference spans of a reference genome, a set of nucleotide reads from a genomic sample and candidate population haplotypes from a population haplotype database. Based on comparing the set of nucleotide reads and the candidate population haplotypes within each reference span, the personalized sequencing system can generate haplotype set scores for sets of candidate population haplotypes. Based on the haplotype set scores, the personalized sequencing system can generate a personalized haplotype database comprising a subset of population haplotypes selected from the population haplotype database within the set of reference spans. As mentioned, the personalized sequencing system can utilize the personalized haplotype database to determine one or more personalized alignments of the set of nucleotide reads with respective genomic regions of the reference genome.
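The selection step described above can be pictured with a minimal Python sketch. It assumes haplotype set scores are already available as log-scale numbers per reference span and simply keeps the top-scoring set for each span. The function names `build_personalized_panel` and `panel_to_tsv`, the span tuple layout, and the TSV columns are hypothetical illustrations, not the format used by the disclosed system.

```python
def build_personalized_panel(span_scores):
    """Keep the best-scoring haplotype set for each reference span.

    span_scores maps (chrom, start, end) -> {haplotype_set_tuple: score},
    where a higher score means a better fit to the sample's reads.
    """
    return {span: max(scores, key=scores.get) for span, scores in span_scores.items()}


def panel_to_tsv(panel):
    """Serialize the panel as tab-separated text (hypothetical column layout)."""
    lines = ["chrom\tstart\tend\thaplotypes"]
    for (chrom, start, end), haps in sorted(panel.items()):
        lines.append(f"{chrom}\t{start}\t{end}\t{','.join(haps)}")
    return "\n".join(lines)
```

In this sketch a diploid span carries a pair of haplotype identifiers and a haploid span carries a single identifier, so the same structure covers both cases.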
- the personalized sequencing system can utilize an imputation model to evaluate sets of candidate population haplotypes across a plurality of reference spans of the reference genome to generate haplotype set scores and select the customized set of population haplotypes based on the generated haplotype set scores.
- the personalized sequencing system generates haplotype set scores comprising either (i) haplotype pair scores for pairs of population haplotypes in reference spans covering diploid regions (e.g., regions corresponding to somatic chromosomes) of a reference genome or (ii) individual haplotype scores for population haplotypes in reference spans covering haploid regions (e.g., regions corresponding to sex chromosomes) of the reference genome.
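One simple way to score haplotype sets within a single reference span is sketched below in Python. It is an assumption-laden illustration, not the disclosed scoring method: each read is reduced to the alleles it shows at variant sites, a fixed per-site error rate is assumed, and each read is treated as equally likely to come from any haplotype in the set. With `ploidy=2` the keys are haplotype pairs (diploid spans); with `ploidy=1` they are individual haplotypes (haploid spans). All names and parameter values are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def read_likelihood(read_alleles, haplotype, error_rate=0.01):
    """P(read | haplotype) as a product over the variant sites the read covers."""
    p = 1.0
    for site, allele in read_alleles.items():
        p *= (1.0 - error_rate) if haplotype[site] == allele else error_rate
    return p


def haplotype_set_scores(reads, haplotypes, ploidy=2, error_rate=0.01):
    """Log-likelihood of the reads for every haplotype set of the given ploidy.

    reads: list of {site_index: allele} dicts for reads aligned to the span.
    haplotypes: {haplotype_id: list of alleles indexed by site}.
    """
    scores = {}
    for combo in combinations_with_replacement(sorted(haplotypes), ploidy):
        total = 0.0
        for read in reads:
            # Each read is assumed equally likely to originate from any member.
            mix = sum(read_likelihood(read, haplotypes[h], error_rate) for h in combo) / ploidy
            total += math.log(mix)
        scores[combo] = total
    return scores
```

For example, with haplotypes `{"H1": [0, 0, 1], "H2": [1, 0, 0], "H3": [0, 1, 1]}` and reads split evenly between the H1 and H2 allele patterns, the pair `("H1", "H2")` receives the highest score.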
- the personalized sequencing system can generate haplotype set scores for sets of more than two population haplotypes (e.g., to customize for polyploid genomic samples having more than two chromosomes per set or to account for uncertainty in selecting custom haplotype sets representing a genomic sample).
- the personalized sequencing system can utilize a hidden Markov model (HMM) algorithm as the imputation model.
- HMM hidden Markov model
- the personalized sequencing system utilizes an HMM algorithm to impute haplotype set posterior probabilities for sets of candidate population haplotypes across adjacent reference spans of a set of reference spans of a reference genome associated with a genomic sample.
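To make the idea of imputing posterior probabilities across adjacent reference spans concrete, here is a minimal forward-backward sketch in Python. It assumes each hidden state is one candidate haplotype set, that per-span emission likelihoods are already computed, that the prior over states is uniform, and that transitions between adjacent spans use a single hypothetical `stay_prob` with the remaining probability spread evenly over other states. The actual HMM design in the disclosure is not specified at this level of detail.

```python
def span_posteriors(emissions, stay_prob=0.9):
    """Forward-backward posteriors over haplotype-set states for each span.

    emissions[t][s] = P(reads in span t | haplotype set s), all values > 0.
    Returns posteriors[t][s], with each row summing to 1.
    """
    n_spans, n_states = len(emissions), len(emissions[0])
    switch = (1.0 - stay_prob) / (n_states - 1) if n_states > 1 else 0.0

    def trans(i, j):
        return stay_prob if i == j else switch

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass with a uniform prior, rescaled at each span for stability.
    forward = [normalize(emissions[0])]
    for t in range(1, n_spans):
        prev = forward[-1]
        forward.append(normalize([
            emissions[t][j] * sum(prev[i] * trans(i, j) for i in range(n_states))
            for j in range(n_states)
        ]))

    # Backward pass, also rescaled at each span.
    backward = [[1.0] * n_states for _ in range(n_spans)]
    for t in range(n_spans - 2, -1, -1):
        backward[t] = normalize([
            sum(trans(i, j) * emissions[t + 1][j] * backward[t + 1][j] for j in range(n_states))
            for i in range(n_states)
        ])

    return [normalize([f * b for f, b in zip(forward[t], backward[t])])
            for t in range(n_spans)]
```

The practical effect is that a span with few or ambiguous reads borrows evidence from its neighbors: a state that is well supported in adjacent spans also gains posterior weight in the ambiguous span.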
- the personalized sequencing system can generate haplotype set scores for respective sets of candidate population haplotypes and, based on the haplotype set scores, generate a personalized haplotype database for the genomic sample.
- the personalized sequencing system can utilize a variety of methods of scoring sets of candidate population haplotypes.
- the personalized sequencing system utilizes a Variational Bayesian model implementing an iterative algorithm to generate haplotype set scores for the respective sets of candidate population haplotypes.
- the personalized sequencing system categorizes nucleotide reads as inherited from respective first and second parents, generates individual haplotype scores for the categorized reads, and generates haplotype set scores for pairs of candidate population haplotypes by combining individual haplotype scores from nucleotide reads inherited from the first parent with individual haplotype scores from nucleotide reads inherited from the second parent.
- the personalized sequencing system determines initial alignments of the set of nucleotide reads with respective genomic regions of the reference genome. By identifying subsets of nucleotide reads that, according to the initial alignments, align to respective reference spans of the set of reference spans, the personalized sequencing system can generate the haplotype set scores for each respective reference span within the set of reference spans.
- the personalized sequencing system determines the initial alignments by (1) identifying, as indicated within the population haplotype database, allele-variant differences between population haplotypes and a primary contiguous sequence at respective genomic regions of the reference genome, and (2) rescoring candidate reference alignments of the set of nucleotide reads with the primary contiguous sequence according to the identified allele-variant differences.
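The two-step idea of aligning against the primary contiguous sequence and then rescoring according to allele-variant differences can be illustrated with a deliberately simplified Python sketch. It assumes an ungapped candidate alignment, single-nucleotide variants only, 0-based positions, and hypothetical match and mismatch values; indels and the system's real scoring scheme are outside its scope.

```python
def rescore_candidate(read, ref_start, primary, haplotype_snvs, match=1, mismatch=-4):
    """Rescore one ungapped candidate alignment against population haplotypes.

    primary: primary contiguous sequence as a string (0-based positions).
    haplotype_snvs: {haplotype_id: {position: alternate_base}}.
    Returns (best_score, haplotype_id), where haplotype_id is None when the
    primary contiguous sequence itself gives the best score.
    """
    def ungapped_score(alt_bases):
        total = 0
        for offset, base in enumerate(read):
            pos = ref_start + offset
            # Use the haplotype's allele where it differs from the primary sequence.
            expected = alt_bases.get(pos, primary[pos])
            total += match if base == expected else mismatch
        return total

    best_score, best_hap = ungapped_score({}), None
    for hap_id, snvs in haplotype_snvs.items():
        score = ungapped_score(snvs)
        if score > best_score:
            best_score, best_hap = score, hap_id
    return best_score, best_hap
```

Because only the positions that differ between a haplotype and the primary sequence change the score, a practical implementation could adjust the original score at those positions rather than rescanning the read; the full rescan here is kept for clarity.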
- having determined, via the aforementioned alignment method or by another method, initial alignments for a set of nucleotide reads, the personalized sequencing system, in one or more embodiments, generates an alignment data file including information for generating a personalized haplotype database.
- an alignment data file e.g., a personalized Binary Alignment Map (BAM) file
- BAM Binary Alignment Map
- the personalized sequencing system utilizes the alignment data file to evaluate and select population haplotypes for a personalized haplotype database.
- the personalized sequencing system can determine one or more personalized alignments of nucleotide reads from the set of nucleotide reads and output a personalized alignment data file with the personalized alignments and corresponding alignment scores.
- the personalized sequencing system provides several technical advantages, benefits, and/or improvements over existing sequencing systems and methods.
- the personalized sequencing system improves the accuracy of read alignments and subsequent genomic analysis by utilizing a personalized haplotype database for a genomic sample.
- the personalized sequencing system generates a personalized haplotype database including a subset of population haplotypes that is selected for a genomic sample and from a population haplotype database.
- the personalized sequencing system can more accurately align nucleotide reads with a corresponding reference genome — especially in more complex or “difficult” genomic regions (e.g., regions comprising lower confidence base calls in general) — than existing sequencing systems that utilize reference genomes augmented by an unfiltered or unnecessarily large set of population haplotypes (e.g., 15-20 haplotypes per region). Due to the improved alignment with the reference genome, the personalized sequencing system can also determine more accurate genotype calls and/or variant calls with a higher confidence that such calls match or differ from the reference base of a reference genome compared to existing sequencing systems. This disclosure describes and depicts examples of such improved genotype and/or variant calls below in relation to FIGS. 8-9.
- the personalized sequencing system further improves the accuracy of mapping and alignment and subsequent variant calling in part by avoiding reference bias, which is a type of bias in favor of aligning nucleotide reads with alleles of the corresponding reference genome.
- reference bias is a type of bias in favor of aligning nucleotide reads with alleles of the corresponding reference genome.
- existing sequencing systems often encounter reference bias when mapping nucleotide reads from a diploid sample to a haploid reference genome.
- the personalized sequencing system, in at least some implementations, provides a personalized haplotype database that avoids reference bias by comprising a customized diploid reference genome for mapping nucleotide reads in diploid regions of a corresponding reference genome.
- the personalized sequencing system can further improve variant calling accuracy relative to existing sequencing systems by generating a personalized haplotype database for personalized mapping and alignment of nucleotide reads from a variety of polyploid genomic samples.
- the personalized sequencing system improves computational efficiency over existing sequencing systems.
- the personalized sequencing system can accurately determine personalized read alignments for nucleotide reads with improved computational speed and less memory relative to existing sequencing systems.
- existing sequencing systems often determine read alignments by attempting to align and score nucleotide reads with a robust graph genome augmented by numerous alternative contiguous sequences representing an unfiltered set of population haplotypes.
- the personalized sequencing system utilizes a personalized haplotype database comprising a discrete subset of population haplotypes particularly selected for a given genomic sample, resulting in improved alignment accuracy, reduced memory consumption, and increased processing speeds.
- the personalized sequencing system can accurately predict initial read alignments and/or personalized read alignments, according to the disclosed methods, while improving the computing speed and memory usage relative to existing sequencing systems.
- existing sequencing systems use graph reference genomes with generic graph augmentations including numerous and redundant alternate contiguous sequences that consume memory with the repeated sequences from overlapping portions of alternate contiguous sequences and slow down computer processing by scoring alignments between reads and such overlapping portions of alternate contiguous sequences.
- the personalized sequencing system can expedite mapping and alignment of nucleotide reads at least by: (i) determining candidate reference alignments of nucleotide reads with a primary contiguous sequence at respective genomic regions of a reference genome and (ii) determining initial alignments or personalized alignments of the nucleotide reads by rescoring the candidate reference alignments according to allele-variant differences between the primary contiguous sequence and population haplotypes at the respective genomic regions of the reference genome.
- the personalized sequencing system can generate a personalized haplotype database for determining personalized alignments with greater flexibility compared to existing sequencing systems.
- the personalized sequencing system generates a personalized haplotype database for a genomic sample based on comparing nucleotide reads from a genomic sample with candidate population haplotypes. Accordingly, the personalized sequencing system can generate alignments of nucleotide reads with improved accuracy and efficiency, without requiring additional information regarding the genomic sample, such as parental genomic data or demographic data associated with the sample specimen.
- the personalized sequencing system further increases flexibility by implementing a diploid personalized haplotype database that more flexibly facilitates accurate mapping and alignment and/or variant calling relative to existing sequencing systems that utilize haploid reference genomes for mapping and alignment of diploid genomic samples.
- genomic sample refers to a target genome or portion of a genome undergoing an assay or sequencing.
- a genomic sample includes one or more sequences of nucleotides isolated or extracted from a sample organism (or a copy of such an isolated or extracted sequence).
- a genomic sample includes a full genome that is isolated or extracted (in whole or in part) from a sample organism and composed of nitrogenous heterocyclic bases.
- a genomic sample can include a segment of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or other polymeric forms of nucleic acids or chimeric or hybrid forms of nucleic acids noted below.
- the genomic sample is found in a sample prepared or isolated by a kit and received by a sequencing device.
- nucleotide read refers to an inferred or predicted sequence of one or more nucleotide bases (or nucleobase pairs) from all or part of a sample genomic sequence (e.g., a sample genomic sequence, complementary DNA).
- a nucleotide read includes a determined or predicted sequence of nucleobase calls for a nucleotide fragment (or group of monoclonal nucleotide fragments) from a sequencing library corresponding to a genomic sample.
- the personalized sequencing system determines a nucleotide read by generating nucleobase calls for nucleobases passed through a nanopore of a nucleotide-sample slide, determined via fluorescent tagging, or determined from a well in a flow cell.
- a nucleotide read can refer to a particular type of read, such as a nucleotide read synthesized from sample library fragments that are shorter than a threshold number of nucleobases (e.g., SBS reads).
- nucleotide read can refer to (i) assembled nucleotide reads that have been assembled from shorter nucleotide reads to form a contiguous sequence (e.g., assembled nucleotide reads) satisfying a threshold number of nucleobases, (ii) circular consensus sequencing (CCS) reads satisfying the threshold number of nucleobases, or (iii) nanopore long reads satisfying the threshold number of nucleobases.
- CCS circular consensus sequencing
- genomic coordinate refers to a particular location or position of a nucleobase within a genome (e.g., an organism’s genome or a reference genome).
- a genomic coordinate includes an identifier for a particular chromosome of a genome and an identifier for a position of a nucleobase within the particular chromosome.
- a genomic coordinate or coordinates may include a number, name, or other identifier for a somatic or sex chromosome (e.g., chr1 or chrX) and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570 or chr1:1234570-1234870).
- a genomic coordinate refers to a genomic coordinate on a sex chromosome (e.g., chrX or chrY). Consequently, the personalized sequencing system can determine genotype probabilities for a genotype call (e.g., a variant call) for a genomic coordinate on a sex chromosome.
- a genomic coordinate refers to a source of a reference genome (e.g., mt for a mitochondrial DNA reference genome or SARS-CoV-2 for a reference genome for the SARS-CoV-2 virus) and a position of a nucleobase within the source for the reference genome (e.g., mt:16568 or SARS-CoV-2:29001).
- a genomic coordinate refers to a position of a nucleobase within a reference genome without reference to a chromosome or source (e.g., 29727).
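The coordinate notations listed above (chromosome plus position, chromosome plus range, named source plus position, and a bare position) can all be handled by one small parser. The Python sketch below is an illustration under assumptions: the `GenomicRegion` type and `parse_region` function are hypothetical names, a single coordinate is stored as a range whose start equals its end, and the source identifier is split off at the last colon so that names containing hyphens, such as SARS-CoV-2, parse correctly.

```python
from typing import NamedTuple, Optional


class GenomicRegion(NamedTuple):
    source: Optional[str]  # chromosome or reference source; None if omitted
    start: int
    end: int               # equals start for a single coordinate


def parse_region(text):
    """Parse strings such as 'chr1:1234570-1234870', 'mt:16568', or '29727'."""
    # Split at the LAST colon so hyphens inside the source name are preserved.
    source, _, positions = text.strip().rpartition(":")
    start_text, _, end_text = positions.partition("-")
    start = int(start_text)
    end = int(end_text) if end_text else start
    if end < start:
        raise ValueError(f"region end precedes start: {text!r}")
    return GenomicRegion(source or None, start, end)
```

Usage: `parse_region("chr1:1234570-1234870")` yields a region on `chr1` spanning both positions, while `parse_region("29727")` yields a region with no source.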
- genomic region refers to a range of genomic coordinates. Like genomic coordinates, in certain implementations, a genomic region may be identified by an identifier for a chromosome and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570-1234870). In various implementations, a genomic coordinate includes a position within a reference genome. In some cases, a genomic coordinate is specific to a particular reference genome.
- reference span refers to a span of nucleobase positions within a linear reference genome. In other words, a reference span includes a span of nucleobases between two respective genomic coordinates of the linear reference genome.
- a genomic coordinate includes a position within a reference genome. Such a position may be within a particular reference genome.
- the term “reference genome” refers to a digital nucleic acid sequence assembled as a representative example (or representative examples) of genes and other genetic sequences of an organism. Regardless of the sequence length, in some cases, a reference genome represents an example set of genes or a set of nucleic acid sequences in a digital nucleic acid sequence determined by scientists as representative of an organism of a particular species.
- a linear human reference genome may be GRCh38 or other versions of reference genomes from the Genome Reference Consortium.
- a reference genome includes multi-base codes.
- a reference genome may include a graph reference genome that includes both a linear reference genome and paths representing nucleic acid sequences from ancestral haplotypes, such as Illumina DRAGEN Graph Reference Genome hg19.
- primary contiguous sequence refers to a contiguous sequence representing a reference haplotype of the reference genome.
- a primary contiguous sequence digitally represents a reference haplotype of a reference genome but can include additional information from a primary assembly of the linear reference genome, such as indications of population variants in certain genomic regions to aid in identifying candidate alignments of nucleotide reads.
- alternative contiguous sequence refers to a contiguous sequence representing a population haplotype at particular genomic coordinates of a reference genome.
- a graph reference genome includes alternate contiguous sequences mapped to genomic coordinates of a primary contiguous sequence for a linear reference genome.
- a hash table for a graph reference genome includes identifiers that associate alternate contiguous sequences representing population haplotypes at genomic coordinates relative to a linear reference genome.
- a personalized haplotype database includes data corresponding to a limited sampling of population haplotypes selected for a particular genomic sample.
- allele-variant difference refers to differences between respective nucleobases of two or more given nucleotide sequences.
- allele-variant differences are differences between the primary contiguous sequence and at least one population haplotype (e.g., as represented by an alternative contiguous sequence).
- allele-variant differences within a given genomic region can include single nucleotide variants, multiple base differences, and/or insertions and deletions (indels) of population haplotypes relative to a primary contiguous sequence.
- allele-variant differences can refer to differences between a first population haplotype and a second population haplotype.
- the term “locally distinct population haplotype” or “locally distinct haplotype” refers to a haplotype comprising a set of at least one allele-variant difference, where the set is unique relative to other haplotypes within a respective genomic region of a reference genome.
- Each genomic region or reference span of a reference genome can include one or more locally distinct population haplotypes having a unique set of one or more allele-variant differences relative to other population haplotypes within the respective genomic region or reference span.
- a given set of one or more allele-variant differences within a genomic region corresponding to a candidate read alignment can represent multiple haplotypes due to a complete overlap of variants within the genomic region. Accordingly, in certain cases, multiple haplotypes comprising or consisting of identical nucleobases within a given genomic region can be represented by a single locally distinct population haplotype.
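The collapsing of haplotypes that are identical within a region into one locally distinct population haplotype can be sketched as a grouping step. The Python below is an illustrative assumption rather than the disclosed data structure: each haplotype is represented as a dictionary of position to alternate allele, and `locally_distinct` is a hypothetical helper name.

```python
from collections import defaultdict


def locally_distinct(haplotype_variants, region_start, region_end):
    """Group haplotypes that carry the same variants inside a region.

    haplotype_variants: {haplotype_id: {position: alternate_allele}}.
    Returns {frozenset of (position, allele): [haplotype_ids]}, so each key
    is one locally distinct haplotype and its value lists the population
    haplotypes it represents within [region_start, region_end].
    """
    groups = defaultdict(list)
    for hap_id, variants in haplotype_variants.items():
        local = frozenset(
            (pos, alt) for pos, alt in variants.items()
            if region_start <= pos <= region_end
        )
        groups[local].append(hap_id)
    return dict(groups)
```

In this representation an empty key corresponds to haplotypes that match the primary contiguous sequence throughout the region, and a read alignment only needs to be scored once per key rather than once per population haplotype.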
- an alignment score refers to a numeric score, metric, or other quantitative measurement evaluating an accuracy of an alignment between one or more nucleotide reads or a fragment of a nucleotide read and another nucleotide sequence from a reference genome.
- an alignment score includes a metric indicating a degree to which the nucleobases of one or more nucleotide reads (or a fragment thereof) match or are similar to a reference sequence or an alternate contiguous sequence from a reference genome.
- an alignment score takes the form of a Smith-Waterman score or a variation or version of a Smith-Waterman score for local alignment, such as various settings or configurations used by DRAGEN by Illumina, Inc. for Smith-Waterman scoring.
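For readers unfamiliar with Smith-Waterman scoring, the following Python sketch computes the best local alignment score between two sequences using the textbook recurrence. The match, mismatch, and linear gap values are arbitrary assumptions chosen for illustration and are not the settings used by DRAGEN or any other product.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, which is what makes the alignment local.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best
```

Only two rows of the dynamic-programming matrix are kept because the score alone is needed; recovering the aligned positions would require the full matrix or traceback pointers.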
- mapping-quality score refers to a metric or other measurement quantifying a quality or certainty of an alignment of nucleotide reads (or other nucleotide sequences or subsequences) with a reference genome.
- a mapping-quality score includes mapping-quality (MAPQ) scores for nucleobase calls at genomic coordinates, where a MAPQ score represents $-10 \log_{10} \Pr\{\text{mapping position is wrong}\}$, rounded to the nearest integer.
- a mapping-quality score includes a full distribution of mapping qualities for all nucleotide reads aligning with a reference genome at a genomic coordinate.
- MAPQ scores are partitioned into sequential bins (e.g., Q-score bins denoted by “MAPQ0,” “MAPQ10,” “MAPQ20,” and so forth) representing different MAPQ scores for each bin.
- any MAPQ scores above a predetermined threshold score are associated with a maximum value bin (e.g., up to “MAPQ40”).
- the term “extended mapping-quality score” or “extended MAPQ score” refers to MAPQ scores that are partitioned into bins including a higher maximum value bin (e.g., up to “MAPQ60”) than conventionally implemented.
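The MAPQ definition and binning described above can be worked through in a short Python sketch. It assumes bins are 10 units wide, which is inferred from the bin labels rather than stated outright, and it treats the conventional maximum bin of 40 and the extended maximum bin of 60 as a configurable parameter. The function names and the cap applied to very small error probabilities are hypothetical.

```python
import math


def error_prob_to_mapq(p_wrong, cap=60):
    """MAPQ = -10 * log10(P(mapping position is wrong)), rounded and capped."""
    if p_wrong <= 0.0:
        return cap  # a zero error probability would otherwise be unbounded
    return max(0, min(cap, round(-10.0 * math.log10(p_wrong))))


def mapq_bin(mapq, max_bin=40, width=10):
    """Label the bin for a MAPQ score; scores above max_bin share the top bin."""
    return f"MAPQ{min((mapq // width) * width, max_bin)}"
```

For instance, an error probability of 0.001 corresponds to MAPQ 30. A score of 55 falls into `MAPQ40` under the conventional maximum but into `MAPQ50` when `max_bin=60`, which shows how extended bins preserve distinctions among high-confidence alignments.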
- the personalized sequencing system implements extended MAPQ scores when performing an initial mapping and alignment of a given set of nucleotide reads and utilizes the extended MAPQ scores to compare the nucleotide reads with candidate population haplotypes to generate haplotype set scores (e.g., as described below in relation to FIGS. 3A-3B and 4).
- genotype call refers to a determination or prediction of a particular genotype of a genomic sample or a sample nucleotide sequence at a genomic locus.
- a genotype call can include a prediction of a particular genotype of a genomic sample with respect to a reference genome or a reference sequence at a genomic coordinate or a genomic region.
- a genotype call includes a determination or prediction that a genomic sample comprises both a nucleobase and a complementary nucleobase at a genomic coordinate that is either homozygous or heterozygous for a reference base or a variant (e.g., homozygous reference bases represented as 0
- a genotype call can include a prediction of a variant or reference base for one or more alleles of a genomic sample and indicate zygosity with respect to a variant or reference base.
- a genotype call is often determined for a genomic coordinate or genomic region at which an SNP, insertion, deletion, or other variant has been identified for a population of organisms.
- nucleobase call refers to a determination or prediction of a particular nucleobase (or nucleobase pair) for an oligonucleotide (e.g., nucleotide read) during a sequencing cycle or for a genomic coordinate of a genomic sample.
- a nucleobase call can indicate a determination or prediction of the type of nucleobase that has been incorporated within an oligonucleotide on a nucleotide-sample slide (e.g., read-based nucleobase calls).
- a nucleobase call includes a determination or a prediction of a nucleobase based on intensity values resulting from fluorescent-tagged nucleotides added to an oligonucleotide of a nucleotide-sample slide (e.g., in a cluster of a flow cell).
- a single nucleobase call can be an adenine (A) call, a cytosine (C) call, a guanine (G) call, a thymine (T) call, or a uracil (U) call.
- variant refers to a nucleobase or multiple nucleobases that do not align with, differ from, or vary from a corresponding nucleobase (or nucleobases) in a reference sequence or a reference genome.
- a variant includes a SNP, an indel, or a structural variant that indicates nucleobases in a sample nucleotide sequence that differ from nucleobases in corresponding genomic coordinates of a reference sequence or a reference genome.
- a “variant call” refers to a nucleobase call comprising a mutation or a variant at a particular genomic coordinate or genomic region with respect to a reference.
- a variant call includes a determination or prediction that a genomic sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that differs from a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome.
- a “reference call” refers to a nucleobase call comprising a non-variant or a reference nucleobase at a genomic coordinate or a genomic region with respect to a reference.
- a non-variant or reference call includes a determination or prediction that a genomic sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that matches a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome.
- an alignment data file refers to a digital file that indicates mapping and alignment information for nucleotide reads of a sample nucleotide sequence.
- an alignment data file can include a binary alignment map (BAM) file, a compressed reference-oriented alignment map (CRAM) file, or another file indicating nucleotide reads of a sample nucleotide sequence.
- an alignment data file can include further information regarding nucleotide reads, mapping and alignment results, population haplotype data, and so forth.
- the term “population haplotype” refers to nucleotide sequences that are present in an organism (or present in organisms from a population) and inherited from one or more ancestors.
- a population haplotype can include alleles or other nucleotide sequences that are present in organisms of a population and inherited together by such organisms respectively from a single parent.
- population haplotypes include a set of SNPs or other variants on the same chromosome that tend to be inherited together.
- data representing a population haplotype, or a set of different population haplotypes, are stored or otherwise accessible on a population haplotype database.
- the personalized sequencing system also generates a personalized haplotype database comprising a customized selection of the population haplotypes imported from a particular population haplotype database.
- a population haplotype database refers to a database encoding variant data for population haplotypes of a sample organism.
- a population haplotype database refers to an unfiltered compilation of population haplotypes or, in other words, a complete compilation of population haplotypes prior to personalization according to the methods disclosed herein.
- a population haplotype database includes complete or partially complete nucleotide sequences (e.g., alternate contiguous sequences) for population haplotypes of a sample organism.
- a population haplotype database encodes variant data for population haplotypes having allele-variant differences from locally distinct population haplotypes within respective genomic regions of a corresponding reference genome.
- the population haplotype database comprises a haplotype data structure comprising a hierarchical partitioning of different genomic regions of the reference genome into a collection of bins covering respective spans of a linear reference genome (e.g., as represented by a primary contiguous sequence).
- a personalized haplotype database refers to a haplotype database comprising a personalized (or customized) subset of population haplotypes for a given genomic sample.
- a personalized (or customized) haplotype database can include one or more population haplotypes selected for a sample genome from a population haplotype database as described above.
- a personalized (or customized) haplotype database can include two population haplotypes for each of one or more genomic regions of the corresponding reference genome identified as diploid.
- a personalized (or customized) haplotype database can include a single population haplotype. Indeed, any number of population haplotypes can be included within a personalized (or customized) haplotype database according to the methods disclosed herein.
- haplotype set score refers to a metric or other measurement quantifying a likelihood or probability that nucleotide reads from a given genomic sample correspond to a respective set of one or more population haplotypes within a particular genomic region.
- haplotype set scores quantify similarities between haplotype sets comprising pairs of population haplotypes and nucleotide reads within a diploid region of a corresponding reference genome.
- haplotype set scores quantify similarities between nucleotide reads and individual haplotypes within a haploid genomic region (e.g., sex chromosomes) of a corresponding reference genome.
- haplotype set scores quantify similarities between nucleotide reads and sets of multiple haplotypes within a polyploid genomic region of a corresponding reference genome (e.g., such that each set of multiple haplotypes comprises a same quantity of haplotypes as a quantity of chromosomes sets in the polyploid genomic region of the genomic sample).
- a haplotype set score can include a haplotype set likelihood or a haplotype set posterior probability, such as described in relation to FIG. 5.
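As a rough illustration of how haplotype sets scale with ploidy, the sketch below enumerates candidate sets whose size equals the ploidy of a region (one haplotype for haploid regions, pairs for diploid regions, and so on). The function name is hypothetical, and allowing the same haplotype to appear more than once in a set (a homozygous set) is an assumption.

```python
from itertools import combinations_with_replacement


def candidate_haplotype_sets(haplotype_ids, ploidy):
    """Enumerate unordered haplotype sets of size `ploidy`.

    Repeats are allowed (an assumption) so that homozygous sets such as
    ("h1", "h1") are represented.
    """
    if ploidy < 1:
        raise ValueError("ploidy must be at least 1")
    return list(combinations_with_replacement(sorted(haplotype_ids), ploidy))


if __name__ == "__main__":
    haplotypes = ["h1", "h2", "h3"]
    print(candidate_haplotype_sets(haplotypes, 1))  # haploid region
    print(candidate_haplotype_sets(haplotypes, 2))  # diploid region
```

The number of sets grows quickly with the number of candidates, which is one reason a system might restrict the candidates before scoring sets.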
- the term “personalized alignment,” interchangeable with “customized alignment,” refers to a read alignment generated utilizing a personalized (or customized) haplotype database as described herein.
- a personalized (or customized) haplotype database generated for a particular genomic sample according to the methods disclosed herein, can be utilized in place of an unfiltered population haplotype database to generate one or more personalized (or customized) alignments of nucleotide reads from the particular genomic sample.
- FIG. 1 illustrates a schematic diagram of a computing system 100 in which a personalized sequencing system 106 operates in accordance with one or more embodiments.
- the computing system 100 includes a sequencing device 102 connected to a local device 108 (e.g., a local server device), one or more server device(s) 110, and a client device 114.
- the network 118 comprises any suitable network over which computing devices can communicate. Example networks are discussed in additional detail below with respect to FIG. 11. While FIG. 1 shows an embodiment of the personalized sequencing system 106, this disclosure describes alternative embodiments and configurations below.
- the sequencing device 102 comprises a computing device and a sequencing device system 104 for sequencing a genomic sample or other nucleic-acid polymer.
- the sequencing device 102 analyzes nucleotide fragments or oligonucleotides extracted from genomic samples to generate nucleotide reads or other data utilizing computer implemented methods and systems either directly or indirectly on the sequencing device 102. More particularly, the sequencing device 102 receives nucleotide-sample slides (e.g., flow cells) comprising nucleotide fragments extracted from samples and further copies and determines the nucleobase sequence of such extracted nucleotide fragments.
- the sequencing device 102 utilizes sequencing-by-synthesis (SBS) techniques to sequence nucleotide fragments into nucleotide reads and determine nucleobase calls for the nucleotide reads.
- the sequencing device 102 bypasses the network 118 and communicates directly with the local device 108 or the client device 114.
- the sequencing device 102 can further store the nucleobase calls as part of base-call data that is formatted as a binary base call (BCL) file and send the BCL file to the local device 108 and/or the server device(s) 110.
- the local device 108 is located at or near a same physical location of the sequencing device 102. Indeed, in some embodiments, the local device 108 and the sequencing device 102 are integrated into a same computing device.
- the local device 108 may run the personalized sequencing system 106 to generate, receive, analyze, store, and transmit digital data, such as by receiving base-call data or determining variant calls based on analyzing such base-call data.
- the sequencing device 102 may send (and the local device 108 may receive) base-call data generated during a sequencing run of the sequencing device 102.
- the local device 108 may align nucleotide reads with a reference genome utilizing a personalized haplotype database 112 and determine genetic variants based on the aligned nucleotide reads.
- the local device 108 may also communicate with the client device 114.
- the local device 108 can send data to the client device 114, including a binary alignment map (BAM) file, a variant call format (VCF) file, or other information indicating nucleobase calls, sequencing metrics, error data, or other metrics.
- the server device(s) 110 are located remotely from the local device 108 and the sequencing device 102. Similar to the local device 108, in some embodiments, the server device(s) 110 include a version of (or are otherwise able to access or implement) the personalized sequencing system 106. Accordingly, the server device(s) 110 may generate, receive, analyze, store, and transmit digital data, such as by receiving base-call data or determining variant calls based on analyzing such base-call data. As indicated above, the sequencing device 102 may send (and the server device(s) 110 may receive) base-call data from the sequencing device 102. The server device(s) 110 may also communicate with the client device 114. In particular, the server device(s) 110 can send data to the client device 114, including BAM files, VCF files, or other sequencing related information.
- the server device(s) 110 comprise a distributed collection of servers where the server device(s) 110 include a number of server devices distributed across the network 118 and located in the same or different physical locations. Further, the server device(s) 110 can comprise a content server, an application server, a communication server, a web-hosting server, or another type of server. Moreover, as shown in FIG. 1, the server device(s) 110 are in communication, either directly or via the network 118, with a population haplotype database 120 storing population haplotypes to be evaluated by the personalized sequencing system 106 when generating the personalized haplotype database 112 for a genomic sample.
- the personalized sequencing system 106 can generate, encode, and/or implement the personalized haplotype database 112 to determine personalized alignments of nucleotide reads from a genomic sample with a reference genome.
- the personalized sequencing system 106 can generate the personalized haplotype database 112 for a genomic sample (e.g., based on a set of nucleotide reads from the genomic sample) and utilize the generated personalized haplotype database 112 to determine one or more personalized alignments of nucleotide reads from the genomic sample, as described in greater detail below in relation to the subsequent figures.
- the client device 114 can generate, store, receive, and send digital data.
- the client device 114 can receive sequencing data from the local device 108 or receive call files (e.g., BCL) and sequencing metrics from the sequencing device 102.
- the client device 114 may communicate with the local device 108 or the server device(s) 110 to receive a VCF file comprising genotype or variant calls and/or other metrics, such as base-call-quality metrics or pass-filter metrics.
- the client device 114 can accordingly present or display information pertaining to variant calls or other genotype calls within a graphical user interface of the sequencing application 116 to a user associated with the client device 114.
- the client device 114 can present genotype calls, variant calls, and/or sequencing metrics for a sequenced genomic sample within a graphical user interface of the sequencing application 116.
- FIG. 1 depicts the client device 114 as a desktop or laptop computer
- the client device 114 may comprise various types of client devices.
- the client device 114 includes non-mobile devices, such as desktop computers or servers, or other types of client devices.
- the client device 114 includes mobile devices, such as laptops, tablets, mobile telephones, or smartphones. Additional details regarding the client device 114 are discussed below with respect to FIG. 11.
- the client device 114 includes the sequencing application 116.
- the sequencing application 116 may be a web application or a native application stored and executed on the client device 114 (e.g., a mobile application, desktop application).
- the sequencing application 116 can include instructions that (when executed) cause the client device 114 to receive data from the personalized sequencing system 106 and present, for display at the client device 114, base-call data or data from an alignment data file or VCF.
- the sequencing application 116 can instruct the client device 114 to display summaries for multiple sequencing runs.
- a version of the personalized sequencing system 106 may be located and/or implemented (e.g., entirely or in part) on the client device 114 or the sequencing device 102.
- the personalized sequencing system 106 is implemented by one or more other components of the computing system 100, such as the local device 108.
- the personalized sequencing system 106 can be implemented in a variety of different ways across the sequencing device 102, the local device 108, the server device(s) 110, and the client device 114.
- the personalized sequencing system 106 can be downloaded from the server device(s) 110 to the sequencing device 102 and/or the local device 108 where all or part of the functionality of the personalized sequencing system 106 is performed at each respective device within the computing system 100.
- the personalized sequencing system 106 evaluates, for a set of reference spans of a reference genome for a genomic sample, sets of candidate population haplotypes for inclusion within a personalized haplotype database for the genomic sample.
- FIG. 2 depicts an example of a set of nucleotide reads 202 and a set of reference spans 204a, 204b, through 204n of a reference genome.
- FIG. 2 illustrates the personalized sequencing system 106 identifying nucleotide reads of the set of nucleotide reads 202 within each reference span of the set of reference spans 204a-204n.
- the personalized sequencing system 106 compares nucleotide reads, such as the set of nucleotide reads 202, with sets of candidate population haplotypes within one or more genomic regions of a reference genome, such as the set of reference spans 204a-204n, to determine a likelihood that each set of candidate haplotypes represents the genomic sample within each respective genomic region.
- the personalized sequencing system 106 can partition a reference genome into a variety of genomic regions, such as but not limited to reference spans corresponding to chromosomes of the genomic sample, reference spans covering a predetermined number of nucleobases of a primary contiguous sequence (e.g., 1,000 nucleobases; 10,000 nucleobases; 1 million nucleobases), or a single reference span covering some or all of a genomic sample.
- the personalized sequencing system 106 can partition a reference genome (or a portion thereof) into a set of reference spans, such as the reference spans 204a-204n, compare sets of population haplotypes with nucleotide reads aligned to each respective reference span, and select a set of population haplotypes for each respective reference span to include within a personalized haplotype database (e.g., as described below in relation to FIGS. 3A-3B and 4-5).
- the personalized sequencing system 106 identifies nucleotide reads within each reference span of the set of reference spans 204a-204n. In some embodiments, for example, the personalized sequencing system 106 determines initial alignments of the set of nucleotide reads 202 with respective genomic regions of the reference genome (e.g., as further described below in relation to FIG. 3A).
- the personalized sequencing system 106 can identify one or more nucleotide reads of the initially aligned set of nucleotide reads 202 that at least partially align with respective reference spans of the set of reference spans 204a-204n and compare the identified nucleotide reads (or portions thereof) with sets of candidate population haplotypes within each respective reference span of the set of reference spans 204a-204n.
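The partitioning and read identification described above can be sketched as follows, assuming fixed-width spans over a single contig and reads represented by half-open [start, end) alignment intervals. The function names and the interval representation are illustrative assumptions, not details taken from the disclosure.

```python
def partition_into_spans(contig_length, span_length):
    """Split a contig into consecutive half-open [start, end) reference spans."""
    if span_length < 1:
        raise ValueError("span_length must be positive")
    return [
        (start, min(start + span_length, contig_length))
        for start in range(0, contig_length, span_length)
    ]


def reads_overlapping_span(read_intervals, span):
    """Return names of reads whose initial alignment at least partially overlaps the span."""
    span_start, span_end = span
    return [
        name
        for name, (start, end) in read_intervals.items()
        if start < span_end and end > span_start
    ]


if __name__ == "__main__":
    spans = partition_into_spans(contig_length=2500, span_length=1000)
    reads = {"r1": (950, 1100), "r2": (1200, 1350), "r3": (0, 150)}
    for span in spans:
        print(span, reads_overlapping_span(reads, span))
```

A read that straddles a span boundary, such as `r1` here, is reported for both spans, which matches the "at least partially align" wording.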
- the personalized sequencing system 106 identifies one or more distinct k-mers within the set of nucleotide reads 202 for each candidate population haplotype within a given reference span of the set of reference spans 204a-204n.
- the personalized sequencing system 106 can identify distinct nucleotide k-mers (or partially distinct k-mers) of a length k within one or more candidate population haplotypes and compare the identified distinct k-mers of the candidate population haplotypes with the set of nucleotide reads 202 within the given reference span of the set of reference spans 204a-204n.
- the personalized sequencing system 106 can utilize various methods of identifying nucleotide reads within respective reference spans of a reference genome, not limited to the methods described herein.
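One way to picture the k-mer comparison is shown below: k-mers that occur in exactly one candidate haplotype are treated as distinct, and each read is checked for the presence of those k-mers. This is a simplified sketch under stated assumptions: haplotypes are given as plain sequence strings for a single reference span, "distinct" is taken to mean unique among the candidates, and all function names are hypothetical.

```python
from collections import Counter


def kmers(sequence, k):
    """Return the set of all k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def distinct_kmers(haplotype_sequences, k):
    """Map each haplotype to the k-mers found in that haplotype and no other."""
    per_haplotype = {name: kmers(seq, k) for name, seq in haplotype_sequences.items()}
    counts = Counter(kmer for ks in per_haplotype.values() for kmer in ks)
    return {
        name: {kmer for kmer in ks if counts[kmer] == 1}
        for name, ks in per_haplotype.items()
    }


def kmer_support(reads, distinct, k):
    """Count reads containing at least one distinct k-mer of each haplotype."""
    support = {name: 0 for name in distinct}
    for read in reads:
        read_kmers = kmers(read, k)
        for name, ks in distinct.items():
            if read_kmers & ks:
                support[name] += 1
    return support


if __name__ == "__main__":
    haplotypes = {"hapA": "ACGTACGA", "hapB": "ACGTTCGA"}
    distinct = distinct_kmers(haplotypes, k=4)
    print(kmer_support(["CGTACG", "GTTCGA", "ACGT"], distinct, k=4))
```

Reads that contain only shared k-mers (such as `ACGT` in this toy input) contribute no support to either haplotype.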
- the personalized sequencing system 106 generates a personalized haplotype database for a genomic sample based on comparing nucleotide reads with population haplotypes from a population haplotype database and, utilizing the personalized haplotype database, determines personalized alignments of the nucleotide reads from a genomic sample. For example, FIGS. 3A-3B illustrate the personalized sequencing system 106 performing an initial read mapping and alignment 308a of nucleotide reads 302 with respective genomic regions of a reference genome 304, generating a personalized haplotype database 314 based on the initial alignments of the nucleotide reads 302, and utilizing the personalized haplotype database 314 to perform a personalized read mapping and alignment 308b of the nucleotide reads 302.
- the personalized sequencing system 106 implements the read mapping and alignment 308a to determine initial alignments of the nucleotide reads 302 with respective genomic regions of the reference genome 304.
- the personalized sequencing system 106 utilizes a population haplotype database 306 associated with the reference genome 304 to generate initial read alignments for the nucleotide reads 302.
- the population haplotype database 306 comprises a graph reference genome comprising a primary contiguous sequence augmented by a plurality of alternate contiguous sequences representing population haplotypes associated with the reference genome 304.
- the population haplotype database 306 comprises a data structure encoding allele-variant differences between the population haplotypes and a primary contiguous sequence for the reference genome 304.
- the personalized sequencing system 106 determines the initial alignments of the nucleotide reads 302 by (i) determining one or more candidate reference alignments of each nucleotide read with a primary contiguous sequence of the reference genome 304 and (ii) determining an initial alignment of each nucleotide read with a respective genomic region of the reference genome 304 by rescoring the one or more candidate reference alignments based on comparing each nucleotide read with the allele-variant differences indicated within the population haplotype database 306.
- having determined initial alignments of the nucleotide reads 302 by the initial read mapping and alignment 308a, the personalized sequencing system 106 generates an alignment data file 310 (e.g., an initial alignment data file, such as a BAM file or other file type) indicating the initial alignments and, in some embodiments, additional information associated with the nucleotide reads 302, the initial alignments, and/or the population haplotypes to which one or more of the nucleotide reads 302 align according to the initial alignments.
- the alignment data file 310 generated via the initial read mapping and alignment 308a comprises the nucleotide reads 302 (e.g., nucleobases thereof), the initial alignments of the nucleotide reads 302 with respective genomic regions of the reference genome 304, alignment scores associated with the initial alignments (e.g., MAPQ or extended MAPQ scores), and/or haplotype data (e.g., imported from the population haplotype database 306) associated with the initial alignments of the nucleotide reads 302.
- haplotype data stored within the alignment data file 310 can include an indication of one or more candidate population haplotypes for respective reference spans of the reference genome 304.
- the personalized sequencing system 106 can identify one or more candidate population haplotypes, which the personalized sequencing system 106 analyzes for inclusion when generating the personalized haplotype database 314, within reference spans of the reference genome 304 based on similarities between the nucleotide reads 302 and population haplotypes within the respective reference spans.
- the personalized sequencing system 106 evaluates all population haplotypes within the population haplotype database 306 as candidate population haplotypes when selecting haplotypes for inclusion within the personalized haplotype database 314 (e.g., as further described below in relation to FIGS. 4-5).
- the personalized sequencing system 106 utilizes the alignment data file 310 (e.g., the initial alignments and other information associated therewith) in a haplotype selection process 312 to generate the personalized haplotype database 314.
- the personalized sequencing system 106 generates haplotype set scores for sets of candidate population haplotypes based on comparing the nucleotide reads 302 with the candidate population haplotypes within one or more reference spans of the reference genome 304.
- the haplotype selection process 312 comprises (i) comparing the nucleotide reads 302 with candidate population haplotypes from the population haplotype database 306 according to the initial read alignments indicated by the alignment data file 310 and, based on the comparison, (ii) selecting a subset of population haplotypes from the population haplotype database 306 for inclusion within the personalized haplotype database 314.
- the personalized haplotype database 314 comprises a subset of population haplotypes selected from the population haplotype database 306 for one or more respective reference spans (or other partitions) of the reference genome 304.
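The selection step can be pictured as choosing, for each reference span, one haplotype set from the scored candidates. The sketch below assumes the rule is simply "keep the highest-scoring set per span"; the disclosure does not state that rule in this passage, so the rule, the function name, and the dictionary layout are all assumptions.

```python
def select_personalized_haplotypes(haplotype_set_scores):
    """For each reference span, keep the haplotype set with the highest score.

    `haplotype_set_scores` maps a span label to a dict of
    {haplotype_set (tuple of haplotype IDs): score}. Spans with no scored
    sets are omitted from the result.
    """
    return {
        span: max(scores, key=scores.get)
        for span, scores in haplotype_set_scores.items()
        if scores
    }


if __name__ == "__main__":
    scores = {
        "span_1": {("h1", "h2"): -12.5, ("h1", "h1"): -30.0},  # diploid span
        "span_2": {("h3",): -4.0, ("h4",): -9.5},              # haploid span
    }
    print(select_personalized_haplotypes(scores))
```

The resulting mapping from spans to selected haplotypes is the kind of subset that a personalized database would then encode in place of the full population database.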
- the personalized sequencing system 106 implements the personalized read mapping and alignment 308b.
- the personalized sequencing system 106 determines personalized alignments of the nucleotide reads 302 with respective genomic regions of the reference genome 304 in similar fashion as the initial read mapping and alignment 308a, albeit by utilizing the personalized haplotype database 314 in place of the population haplotype database 306.
- the personalized haplotype database 314 comprises a data structure encoding allele-variant differences between the selected population haplotypes and a primary contiguous sequence for the reference genome 304 in each of the one or more reference spans of the reference genome 304.
- the personalized sequencing system 106 determines the personalized alignments of the nucleotide reads 302 by (i) determining one or more candidate reference alignments of each nucleotide read with a primary contiguous sequence of the reference genome 304 and (ii) determining a personalized alignment of each nucleotide read with a respective genomic region of the reference genome 304 by rescoring the one or more candidate reference alignments based on comparing each nucleotide read with the allele-variant differences indicated within the personalized haplotype database 314.
- the personalized haplotype database 314 comprises a graph reference genome.
- Such a graph reference genome may comprise a primary contiguous sequence augmented by a plurality of alternate contiguous sequences representing the population haplotypes selected from the population haplotype database 306 via the haplotype selection process 312.
- the personalized sequencing system 106 generates a personalized alignment data file 316 (e.g., an alignment data file, such as a BAM file or related file type) indicating the personalized alignments.
- the personalized alignment data file 316 includes additional information associated with the nucleotide reads 302, the personalized alignments, and/or the population haplotypes to which one or more of the nucleotide reads 302 align according to the personalized alignments.
- the personalized alignment data file 316 generated via the personalized read mapping and alignment 308b comprises the nucleotide reads 302 (e.g., nucleobases thereof), the personalized alignments of the nucleotide reads 302 with respective genomic regions of the reference genome 304, alignment scores associated with the personalized alignments (e.g., MAPQ or extended MAPQ scores), and/or haplotype data (e.g., imported from the personalized haplotype database 314) associated with the personalized alignments of the nucleotide reads 302.
- the personalized sequencing system 106 generates haplotype set scores for sets of candidate population haplotypes based on comparing nucleotide reads from a genomic sample with the sets of candidate population haplotypes within respective genomic regions of the genomic sample.
- FIG. 4 illustrates the personalized sequencing system 106 comparing nucleotide reads 404 from a genomic sample with sets of candidate population haplotypes (e.g., candidate population haplotypes 408) from a population haplotype database 406 to generate haplotype set scores 410. Based on the haplotype set scores 410, the personalized sequencing system 106 further determines the selected population haplotypes 414 for inclusion within a personalized haplotype database 412.
- the personalized sequencing system 106 can identify nucleotide reads for comparison with candidate population haplotypes within a given genomic region of a genomic sample (e.g., a reference span of a reference genome). As shown in FIG. 4, for example, the personalized sequencing system 106 identifies initial alignments 402 of a set of nucleotide reads (e.g., as determined by an initial mapping and alignment process, such as described above in relation to FIG. 3A) and identifies the nucleotide reads 404 from the set of nucleotide reads that, according to the initial alignments 402, at least partially align to a given genomic region. Alternatively, the personalized sequencing system 106 can identify the nucleotide reads 404 for comparison with the candidate population haplotypes 408 without determining or utilizing the initial alignments 402 (e.g., such as described above in relation to FIG. 2).
- the personalized sequencing system 106 compares sets of the candidate population haplotypes 408 with the nucleotide reads 404 within a given genomic region to determine the haplotype set scores 410.
- the candidate population haplotypes 408 include all population haplotypes within the population haplotype database 406 having variations from a respective reference genome (e.g., variants relative to a primary contiguous sequence) within the given genomic region.
- the personalized sequencing system 106 identifies a limited number of the candidate population haplotypes 408 from the population haplotype database 406 for comparison with the nucleotide reads 404 within the given genomic region.
- the personalized sequencing system 106 limits the candidate population haplotypes 408 to haplotypes having locally distinct variations from the reference genome within the given genomic region (e.g., population haplotypes comprising a unique set of one or more allele-variant differences relative to other population haplotypes within the given genomic region or reference span).
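Restricting candidates to locally distinct haplotypes amounts to grouping haplotypes whose allele-variant differences are identical inside the region of interest. The sketch below assumes each haplotype is described by a set of (position, reference allele, alternate allele) tuples and that the region is a half-open interval; both the representation and the function name are illustrative assumptions.

```python
def locally_distinct_haplotypes(haplotype_variants, region):
    """Group haplotypes by their set of variants falling inside `region`.

    Each key of the returned dict is one locally distinct variant set; the
    value lists every haplotype that shares it within the region.
    """
    start, end = region
    groups = {}
    for name, variants in haplotype_variants.items():
        local = frozenset(v for v in variants if start <= v[0] < end)
        groups.setdefault(local, []).append(name)
    return groups


if __name__ == "__main__":
    haplotypes = {
        "h1": {(105, "A", "G"), (5000, "C", "T")},
        "h2": {(105, "A", "G")},
        "h3": {(120, "T", "C")},
        "h4": {(9000, "G", "A")},
    }
    for local, names in locally_distinct_haplotypes(haplotypes, (100, 200)).items():
        print(sorted(local), names)
```

In this toy input `h1` and `h2` collapse into one locally distinct haplotype, and `h4` has no variants in the region, so it is indistinguishable from the reference there.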
- the personalized sequencing system 106 determines haplotype likelihoods for the candidate population haplotypes 408 and/or limits the candidate population haplotypes 408 to haplotypes identified during initial mapping and alignment as having a relatively higher likelihood of matching the aligned nucleotide reads 404.
- Such individual haplotype likelihoods can be used as “soft” scores or multi-bit values that provide a basis or input for the haplotype set scores 410 that the personalized sequencing system 106 subsequently determines for haplotype sets (e.g., pairs of population haplotypes).
- the personalized sequencing system 106 determines an individual haplotype likelihood as a Boolean value, multi-bit value, or other value (e.g., normalized Boolean value, normalized multi-bit value) quantifying the number of variants (e.g., SNPs) that differ between mapped and aligned nucleotide reads and population haplotypes from the population haplotype database 406.
- the personalized sequencing system 106 determines (or receives) a set of alignment scores corresponding to the population haplotypes within the population haplotype database 406 (or a subset thereof) in relation to the nucleotide reads 404 aligned to a given genomic region according to the initial alignments 402.
- the personalized sequencing system 106 can identify an individual population haplotype exhibiting a highest alignment score of the set of alignment scores and assign (or set) the highest alignment score as a baseline (e.g., by setting the highest alignment score to a baseline value of zero) for purposes of determining the individual haplotype likelihoods for the remaining population haplotypes from the population haplotype database 406. By using the highest alignment score as a baseline, the personalized sequencing system 106 can determine a normalized haplotype likelihood for the remaining population haplotypes mapped and aligned to a set of nucleotide reads.
- Such an individual haplotype likelihood can take the form of a Boolean value, multi-bit value, or other quantification of the number of additional or existing variants (e.g., SNPs) that differ between the nucleotide reads and the remaining population haplotypes — relative to the population haplotype exhibiting the highest alignment score.
- the personalized sequencing system 106 selects the candidate population haplotypes 408 from the population haplotype database 406 based on their respective normalized haplotype likelihoods.
- the personalized sequencing system 106 selects population haplotypes having corresponding normalized haplotype likelihoods within a threshold value (e.g., population haplotypes matching by a threshold number of nucleotides within the nucleotide reads 404 within the given genomic region) as the candidate population haplotypes 408. Additionally or alternatively, in one or more embodiments, the personalized sequencing system 106 utilizes the normalized haplotype likelihoods (or a subset thereof if less than all population haplotypes are considered candidates) in determining the haplotype set scores 410 (e.g., to determine initial likelihoods as described below in relation to FIG. 5).
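- The baseline normalization and threshold selection described above can be sketched as follows. This is a minimal illustration only: the function names, the example scores, and the `max_deficit` parameter are hypothetical, and the sketch assumes that higher alignment scores indicate better matches.

```python
def normalize_scores(alignment_scores):
    """Shift scores so the best-scoring haplotype sits at a baseline of zero."""
    best = max(alignment_scores.values())
    return {hap: score - best for hap, score in alignment_scores.items()}


def select_candidates(alignment_scores, max_deficit):
    """Keep haplotypes whose normalized score is within max_deficit of the baseline."""
    normalized = normalize_scores(alignment_scores)
    return sorted(hap for hap, delta in normalized.items() if delta >= -max_deficit)


# Hypothetical alignment scores for four population haplotypes in one genomic region.
scores = {"hapA": 120, "hapB": 117, "hapC": 95, "hapD": 120}
candidates = select_candidates(scores, max_deficit=5)
```

- In this sketch the normalized values, rather than the raw alignment scores, would be carried forward as inputs to the haplotype set scores.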
- the personalized sequencing system 106 can assign or determine individual haplotype likelihoods to the candidate population haplotypes 408 for purposes of determining the haplotype set scores 410.
- Based on the haplotype likelihoods and/or other comparison of the nucleotide reads 404 with sets of the candidate population haplotypes 408 within the given genomic region, the personalized sequencing system 106 generates the haplotype set scores 410 for the respective sets of the candidate population haplotypes 408.
- the haplotype set scores 410 represent a likelihood or probability that a respective set of the candidate population haplotypes 408 represents the genomic sample of the nucleotide reads within the given genomic region.
- the personalized sequencing system 106 determines the selected population haplotypes 414 with the highest relative haplotype set score of the haplotype set scores 410 within the given genomic region and generates the personalized haplotype database 412 comprising the selected population haplotypes 414 for determining personalized alignments of the nucleotide reads 404 and/or additional nucleotide reads from the genomic sample.
- In various embodiments, the personalized sequencing system 106 generates the haplotype set scores 410 for sets of candidate population haplotypes comprising differing numbers of the candidate population haplotypes 408 per set.
- the personalized sequencing system 106 identifies a genomic region (e.g., a reference span) corresponding to a diploid region of a reference genome and, in response, generates the haplotype set scores 410 for sets comprising pairs of the candidate population haplotypes 408.
- the personalized sequencing system 106 identifies a genomic region (e.g., a reference span) corresponding to a haploid region of a reference genome, such as an X or Y chromosome.
- the personalized sequencing system 106 generates the haplotype set scores 410 for sets comprising individual haplotypes of the candidate population haplotypes 408.
- the personalized sequencing system 106 can determine the haplotype set scores 410 for sets comprising more than two of the candidate population haplotypes 408.
- the personalized sequencing system 106 generates haplotype set scores 410 for every combination of the candidate population haplotypes 408 (e.g., a haplotype set score for each possible pairing of the candidate population haplotypes 408). Also, in one or more embodiments, the personalized sequencing system 106 adjusts each haplotype set score of the haplotype set scores 410 for a given reference span in consideration of every nucleotide read of the nucleotide reads 404 identified within the given reference span.
- the personalized sequencing system 106 adjusts each haplotype set score of the haplotype set scores 410 for a given reference span in consideration of each adjacently mapped nucleotide read of the nucleotide reads 404 within the given reference span.
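- One way to score every pairing of candidate haplotypes against every read is sketched below. The scoring formula is an assumption rather than the disclosed method: it uses a common diploid mixture model in which each read is equally likely to derive from either haplotype of the pair. All names and the example likelihoods are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def pair_log_score(read_likelihoods, hap_a, hap_b):
    """Sum, over reads, the log of the equal-weight mixture of the two haplotype likelihoods."""
    total = 0.0
    for per_hap in read_likelihoods:
        total += math.log(0.5 * per_hap[hap_a] + 0.5 * per_hap[hap_b])
    return total


def score_all_pairs(read_likelihoods, haplotypes):
    """Score every unordered pair, including homozygous pairs such as (h1, h1)."""
    return {
        (a, b): pair_log_score(read_likelihoods, a, b)
        for a, b in combinations_with_replacement(haplotypes, 2)
    }


def best_pair(read_likelihoods, haplotypes):
    """Return the pair with the highest log score."""
    scores = score_all_pairs(read_likelihoods, haplotypes)
    return max(scores, key=scores.get)


# Hypothetical per-read likelihoods P(read | haplotype) within one reference span.
reads = [
    {"h1": 0.9, "h2": 0.1, "h3": 0.1},
    {"h1": 0.9, "h2": 0.1, "h3": 0.1},
    {"h1": 0.1, "h2": 0.9, "h3": 0.1},
    {"h1": 0.1, "h2": 0.9, "h3": 0.1},
]
winner = best_pair(reads, ["h1", "h2", "h3"])
```

- Because the number of pairs grows quadratically with the number of candidates, exhaustive scoring of this kind becomes costly for large candidate sets.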
- the personalized sequencing system 106 implements an iterative probabilistic algorithm, such as a Variational Bayesian model, to partition reads within each reference span and generate the haplotype set scores 410 based on the partitioned reads within each respective reference span.
- the personalized sequencing system 106 utilizes a Variational Bayesian model to partition the respective reads based on a categorization of each of the nucleotide reads 404 as inherited from a respective first or second parent.
- the Variational Bayesian model comprises an iterative probabilistic algorithm that utilizes variational inference to predict the posterior distribution of latent variables to determine respective likelihoods of nucleotide reads being inherited from respective first or second parents.
- Such respective likelihoods for nucleotide reads could be any likelihood (e.g., 0.50 for a first parent and 0.50 for a second parent; 0.25 for a first parent and 0.75 for a second parent).
- the personalized sequencing system 106 can generate individual haplotype scores for the categorized reads in relation to each of the candidate population haplotypes 408, rather than generating the haplotype set scores 410 for each set of population haplotypes on a set-by-set basis. Accordingly, the personalized sequencing system 106 can generate the haplotype set scores 410 for sets of the candidate population haplotypes 408 by combining individual haplotype scores from the nucleotide reads inherited from the first parent with individual haplotype scores from the nucleotide reads inherited from the second parent. Thus, by partitioning the nucleotide reads 404 according to parental inheritance, the personalized sequencing system 106 can significantly reduce the number of unique sets of the candidate population haplotypes 408 to be scored, thereby avoiding scoring every pair of candidate population haplotypes.
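- The saving from partitioning reads can be illustrated with a simplified sketch. It does not implement a Variational Bayesian model; it assumes the per-read parental weights have already been inferred and only shows how individual haplotype scores for each parent can be combined without scoring every pair. The names and numbers are hypothetical.

```python
def per_parent_scores(read_log_liks, parent1_weights):
    """Accumulate weighted log-likelihoods per haplotype for each parental partition."""
    first = {hap: 0.0 for hap in read_log_liks[0]}
    second = {hap: 0.0 for hap in read_log_liks[0]}
    for log_liks, weight in zip(read_log_liks, parent1_weights):
        for hap, log_lik in log_liks.items():
            first[hap] += weight * log_lik            # share attributed to the first parent
            second[hap] += (1.0 - weight) * log_lik   # share attributed to the second parent
    return first, second


def best_pair_from_partition(read_log_liks, parent1_weights):
    """Pick the best haplotype per parent; work is linear in the number of haplotypes."""
    first, second = per_parent_scores(read_log_liks, parent1_weights)
    hap1 = max(first, key=first.get)
    hap2 = max(second, key=second.get)
    return hap1, hap2, first[hap1] + second[hap2]


# Hypothetical log-likelihoods of four reads against three candidate haplotypes.
log_liks = [
    {"h1": -0.1, "h2": -2.3, "h3": -2.3},
    {"h1": -0.1, "h2": -2.3, "h3": -2.3},
    {"h1": -2.3, "h2": -0.1, "h3": -2.3},
    {"h1": -2.3, "h2": -0.1, "h3": -2.3},
]
# Assumed probabilities that each read was inherited from the first parent.
weights = [1.0, 1.0, 0.0, 0.0]
result = best_pair_from_partition(log_liks, weights)
```

- With H candidate haplotypes, this sketch evaluates 2H individual scores instead of roughly H²/2 pair scores.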
- the personalized sequencing system 106 utilizes an imputation model to generate haplotype set scores for sets of candidate population haplotypes across adjacent reference spans of a reference genome.
- FIG. 5 illustrates the personalized sequencing system 106 utilizing a hidden Markov model (HMM) algorithm 508 to generate a set of haplotype set scores for sets of candidate population haplotypes across a set of reference spans 504a, 504b, through 504n of a reference genome.
- the personalized sequencing system 106 generates haplotype set posterior probabilities 510a, 510b, through 510n for sets of haplotypes within the set of reference spans 504a, 504b, through 504n. Based on the generated haplotype set posterior probabilities 510a-510n, the personalized sequencing system 106 selects haplotype sets 512a, 512b, through 512n for the respective reference spans 504a, 504b, through 504n to include in a personalized haplotype database 514.
- the personalized sequencing system 106 identifies nucleotide reads 502a, 502b, through 502n from a genomic sample within the respective reference spans 504a-504n of a reference genome associated with the genomic sample.
- the set of reference spans 504a-504n comprises a partitioning of a reference genome (or a portion thereof) into bins spanning a selected number of base positions, such as, but not limited to, 1,000 base positions per reference span; 4,000 base positions per reference span; or 16,000 base positions per reference span. Any number of base positions, however, can be used for a reference span.
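- Partitioning a reference into fixed-length spans and assigning reads to every span they at least partially overlap can be sketched as follows. The 0-based, half-open coordinates, the read tuples, and the function names are assumptions made for illustration.

```python
def spans_for_read(start, end, span_length=1000):
    """Return indices of all spans overlapped by the half-open interval [start, end)."""
    return list(range(start // span_length, (end - 1) // span_length + 1))


def group_reads_by_span(reads, span_length=1000):
    """Map each span index to the names of reads that at least partially align to it."""
    groups = {}
    for name, start, end in reads:
        for index in spans_for_read(start, end, span_length):
            groups.setdefault(index, []).append(name)
    return groups


# Hypothetical reads as (name, start, end) tuples on one contig.
reads = [("r1", 950, 1100), ("r2", 1000, 1150)]
by_span = group_reads_by_span(reads)
```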
- the personalized sequencing system 106 identifies nucleotide reads 502a, 502b, through 502n for comparison with various sets of candidate population haplotypes to determine respective haplotype set likelihoods 506a, 506b, through 506n.
- the personalized sequencing system 106 can compare the nucleotide reads 502a-502n with candidate population haplotypes by mapping and aligning the nucleotide reads 502a-502n with candidate population haplotypes, determining alignment scores, and determining haplotype likelihoods based in part on the corresponding read alignments.
- the personalized sequencing system 106 further generates haplotype set likelihoods (e.g., haplotype set likelihoods 506a) of the haplotype set likelihoods 506a-506n for a given reference span (e.g., reference span 504a) of the reference spans 504a-504n.
- the personalized sequencing system 106 scores each set of candidate population haplotypes according to a comparison of nucleobases of the respective nucleotide reads (e.g., nucleotide reads 502a) within the given reference span and nucleobases of each set of candidate population haplotypes within the given reference span. Accordingly, as shown in FIG. 5, the personalized sequencing system 106 generates a plurality of haplotype set likelihoods for each respective reference span of the set of reference spans 504a-504n.
- the personalized sequencing system 106 generates haplotype set scores for the respective sets of candidate population haplotypes in each of the reference spans 504a-504n based on the haplotype set likelihoods 506a-506n.
- the personalized sequencing system 106 utilizes an imputation model, such as the HMM algorithm 508, to generate haplotype set posterior probabilities 510a, 510b, through 510n for sets of candidate population haplotypes within the respective reference spans 504a-504n based on respective haplotype set likelihoods of the haplotype set likelihoods 506a-506n for adjacent reference spans of the set of reference spans 504a-504n.
- the personalized sequencing system 106 utilizes the haplotype set likelihoods 506a-506n to generate a set of forward probabilities for each reference span of the set of reference spans 504a-504n.
- a forward pass of the HMM algorithm 508 starts with the haplotype set likelihoods 506a of the reference span 504a and ends with the haplotype set likelihoods 506n of the reference span 504n, thus generating forward probabilities for each of the reference spans 504a-504n with input from respectively adjacent reference spans (e.g., prior reference spans during a forward pass of the HMM algorithm 508).
- the personalized sequencing system 106 utilizes the forward probabilities for each set of candidate population haplotypes within the respective reference spans 504a-504n to generate updated probabilities for the respective sets of candidate population haplotypes. Based on the updated probabilities, the personalized sequencing system 106 generates the haplotype set posterior probabilities 510a-510n for the respective sets of candidate population haplotypes within the reference spans 504a-504n.
- the personalized sequencing system 106 generates a plurality of haplotype set posterior probabilities for each respective reference span of the set of reference spans 504a-504n and, in some embodiments, generates haplotype set scores for the respective sets of candidate population haplotypes from the haplotype set posterior probabilities 510a-510n from each respective reference span of the set of reference spans 504a-504n.
- the personalized sequencing system 106 selects the haplotype sets 512a-512n for the respective reference spans 504a-504n (e.g., one haplotype set for each reference span) based on the respective haplotype set posterior probabilities 510a-510n, or haplotype set scores derived therefrom, to generate the personalized haplotype database 514. Accordingly, in some cases, the personalized sequencing system 106 can select different sets of candidate population haplotypes across the set of reference spans 504a-504n for inclusion within the personalized haplotype database 514, such that the haplotype sets 512a-512n vary across the reference spans 504a-504n.
- the personalized sequencing system 106 utilizes a recombination parameter of the HMM algorithm to reduce variation in the haplotype set posterior probabilities 510a-510n across the set of reference spans 504a-504n, thus reducing, in some cases, variation in the selected haplotype sets 512a-512n included within the personalized haplotype database 514 across the set of reference spans 504a-504n.
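- The flow from per-span haplotype set likelihoods to posterior probabilities can be sketched with a standard forward-backward pass. This is a minimal illustration under stated assumptions, not the HMM algorithm 508 itself: it assumes the same candidate sets in every span, a uniform prior, and a single hypothetical `recombination` parameter giving the probability of switching sets between adjacent spans. The backward pass is an assumption about how the "updated probabilities" are obtained.

```python
def transition(prev_state, next_state, n_states, recombination):
    """Stay in the same haplotype set with high probability; otherwise switch uniformly."""
    if n_states == 1:
        return 1.0
    if prev_state == next_state:
        return 1.0 - recombination
    return recombination / (n_states - 1)


def normalize(values):
    total = sum(values)
    return [value / total for value in values]


def posteriors(likelihoods, recombination=0.01):
    """likelihoods[t][k] is the likelihood of haplotype set k within reference span t."""
    n_spans, n_states = len(likelihoods), len(likelihoods[0])
    states = range(n_states)
    # Forward pass: first span to last span, carrying information from prior spans.
    forward = [normalize(likelihoods[0])]
    for t in range(1, n_spans):
        row = [
            likelihoods[t][k]
            * sum(forward[t - 1][j] * transition(j, k, n_states, recombination) for j in states)
            for k in states
        ]
        forward.append(normalize(row))
    # Backward pass: last span to first span, carrying information from later spans.
    backward = [[1.0] * n_states for _ in range(n_spans)]
    for t in range(n_spans - 2, -1, -1):
        row = [
            sum(
                transition(j, k, n_states, recombination)
                * likelihoods[t + 1][k]
                * backward[t + 1][k]
                for k in states
            )
            for j in states
        ]
        backward[t] = normalize(row)
    # Combine both passes into one posterior distribution per span.
    return [normalize([f * b for f, b in zip(forward[t], backward[t])]) for t in range(n_spans)]


def select_sets(likelihoods, recombination=0.01):
    """Pick the index of the highest-posterior haplotype set in each span."""
    return [
        max(range(len(row)), key=row.__getitem__)
        for row in posteriors(likelihoods, recombination)
    ]


# Hypothetical likelihoods for two candidate sets across three adjacent spans.
example = [[0.9, 0.1], [0.45, 0.55], [0.9, 0.1]]
linked = select_sets(example, recombination=0.01)
unlinked = select_sets(example, recombination=0.5)
```

- In this example a low recombination value lets the strong evidence in the neighboring spans outweigh the weak evidence in the middle span, so the same set is selected throughout; with a value of 0.5 the two-state model treats spans independently and the middle span selects a different set.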
- the personalized sequencing system 106 determines haplotype set scores based on a comparison of nucleotide reads with allele-variant differences between candidate population haplotypes and a reference genome, as indicated within a population haplotype database. Also, in some embodiments, the personalized sequencing system 106 generates, for a given genomic sample, a personalized haplotype database comprising allele-variant differences between a limited selection of population haplotypes and a reference genome across a set of reference spans covering respective genomic regions of the reference genome.
- the personalized sequencing system 106 utilizes a population haplotype database and/or generates a personalized haplotype database encoded as a haplotype data structure as described by Michael Ruehle, Enhanced Mapping and Alignment of Nucleotide Reads Utilizing an Improved Haplotype Data Structure with Allele-Variant Differences, U.S. Provisional Application No. 63/613,574 (filed December 21, 2023) (hereinafter, Ruehle), which is hereby incorporated by reference in its entirety.
- FIGS. 6A-6B illustrate embodiments of haplotype data structures encoding allele-variant differences between population haplotypes and a reference genome.
- FIG. 6A illustrates a set of base-level bins 602a, 602b, through 602n of a population haplotype database 600 encoding allele-variant differences of population haplotypes within respective reference spans 604a, 604b, through 604n of a reference genome.
- FIG. 6B illustrates a set of base-level bins 612a, 612b, through 612n of a personalized haplotype database 610 encoding allele-variant differences of selected population haplotypes within the respective reference spans 604a, 604b, through 604n of the reference genome. While the following paragraphs describe various bins of either the population haplotype database 600 or the personalized haplotype database 610, the bins of each of the population haplotype database 600 or the personalized haplotype database 610 can be encoded or otherwise be represented in a particular file type, such as a TSV file or Comma Separated Values (CSV) file.
- the population haplotype database 600 includes at least a base level comprising the set of base-level bins 602a-602n that partition genomic regions of a reference genome into the respective set of base-level reference spans 604a-604n.
- each base-level reference span of the set of base-level reference spans 604a-604n comprises a genomic region of a first length between respective genomic coordinates of the reference genome, thus partitioning genomic regions of the reference genome into multiple bins spanning an equal portion/length of the reference genome.
- the length of the base-level reference spans can approximate, for example, the average or maximum length of nucleotide reads provided to the personalized sequencing system 106 for mapping and alignment.
- the base-level reference spans can otherwise be selected to span a predetermined number of nucleobases from genomic coordinates or regions of a linear reference sequence, such as, but not limited to, 100 base pairs or 1,000 base pairs per base-level bin.
- the set of base-level bins 602a-602n of the population haplotype database 600 comprise encoded variant data for nucleotide variants from respective sets of locally distinct population haplotype(s) 606a-606n.
- each locally distinct population haplotype within a given base-level bin comprises a unique set of one or more allele-variant differences relative to other population haplotypes also having variations within the genomic region of the respective base-level reference span of the given base-level bin.
- each row of the locally distinct population haplotype(s) 606a comprises a unique set of allele-variant differences (denoted as single letters representing particular nucleotides) relative to other rows, such that no two rows are identical — although there can be limited overlap between allele-variant differences, as indicated by the top two rows of base-level bin 602a.
- population haplotypes having identical nucleotide variants within a given base-level bin are encoded as one locally distinct population haplotype within the given base-level bin.
- each base-level bin of the population haplotype database 600 can include differing quantities of locally distinct population haplotypes. As shown in FIG.
- the set of locally distinct population haplotype(s) 606a included within the base-level bin 602a includes four locally distinct population haplotypes (as indicated by the four rows of the portrayed matrix), the set of locally distinct population haplotype(s) 606b included within the base-level bin 602b includes five locally distinct population haplotypes, and the set of locally distinct population haplotype(s) 606n included within the base-level bin 602n includes three locally distinct population haplotypes.
- each base-level bin of the population haplotype database 600 can include any number of locally distinct population haplotypes, including as many as every population haplotype in a data set or no population haplotypes (e.g., in cases where there are no population haplotypes having allele-variant differences in a genomic region corresponding to a given bin).
- the base-level bins 602a, 602b, through 602n include respective allele-variant differences 608a, 608b, through 608n for each locally distinct population haplotype of the respective sets of locally distinct population haplotypes 606a, 606b, through 606n.
- variant data encoded within the base-level bin 602a includes a set of locally distinct population haplotype(s) 606a comprising one or more locally distinct population haplotypes for which allele-variant differences 608a are included for each respective locally distinct population haplotype.
- each base-level bin (e.g., of the set of base-level bins 602a-602n) comprises a matrix including corresponding variant data representing allele-variant differences from locally distinct population haplotypes (e.g., of the respective sets of locally distinct population haplotypes 606a-606n) and variant positions for the allele-variant differences.
- the variant data within each base-level bin includes data indications (e.g., the allele-variant differences 608a-608n) of single-nucleotide polymorphisms (SNPs) and/or insertions or deletions (indels) at respective genomic coordinates of the reference genome (e.g., of the primary contiguous sequence).
- each of the base-level bins 602a, 602b, and 602n further includes (or can reference data including) population frequencies 609a, 609b, and 609n (e.g., relative frequencies of haplotype alleles within a corresponding reference population) for the respective sets of locally distinct population haplotypes 606a-606n.
- the personalized sequencing system 106 adjusts alignment scores based on the population frequencies 609a-609n when determining initial read alignments for nucleotide reads using the population haplotype database 600 (e.g., as described above in relation to FIG. 3A).
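- The encoding of locally distinct population haplotypes within one base-level bin can be sketched as follows. The representation of a variant as a (position, alternate allele) tuple, the function name, and the choice to sum the population frequencies of merged haplotypes are all assumptions made for illustration.

```python
from collections import defaultdict


def locally_distinct(haplotype_variants, frequencies):
    """Collapse haplotypes with identical variants in a bin into one entry each."""
    merged = defaultdict(float)
    for name, variants in haplotype_variants.items():
        key = tuple(sorted(variants))
        if not key:
            continue  # no allele-variant differences in this bin, so nothing to encode
        merged[key] += frequencies[name]  # assumed: merged entries pool their frequencies
    return dict(merged)


# Hypothetical haplotypes within one bin; hap1 and hap2 carry the same variants.
bin_variants = {
    "hap1": [(105, "A"), (230, "T")],
    "hap2": [(230, "T"), (105, "A")],
    "hap3": [(105, "A")],
    "hap4": [],
}
bin_frequencies = {"hap1": 0.25, "hap2": 0.25, "hap3": 0.25, "hap4": 0.25}
distinct = locally_distinct(bin_variants, bin_frequencies)
```

- Each key of the result corresponds to one row of a bin's matrix: a unique set of allele-variant differences together with its variant positions.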
- the personalized haplotype database 610 generated by the personalized sequencing system 106 includes variant data for a customized set of population haplotypes within each of the base-level bins 612a-612n.
- FIG. 6B depicts each base-level bin with two population haplotypes, thus representing an inferred diplotype of a genomic sample corresponding to the personalized haplotype database 610 (e.g., a genomic sample for which the personalized sequencing system 106 generated the personalized haplotype database 610).
- the personalized sequencing system 106 generates a personalized haplotype database having differing numbers of selected haplotypes per reference span (e.g., one, two, or more haplotypes per selected set of haplotypes).
- the personalized haplotype database 610 includes at least a base level comprising the set of base-level bins 612a-612n that partition genomic regions of the reference genome into the respective set of base-level reference spans 604a-604n. Further, the set of base-level bins 612a, 612b, through 612n of the personalized haplotype database 610 comprise encoded variant data for nucleotide variants from the respective selected sets of population haplotypes 616a, 616b, through 616n.
- each base-level bin of the set of base-level bins 612a-612n comprises a matrix including corresponding variant data representing allele-variant differences from selected population haplotypes (e.g., of the respective selected sets of population haplotypes 616a-616n) and variant positions for the allele-variant differences.
- the variant data within each base-level bin includes data indications (e.g., the allele-variant differences 618a, 618b, through 618n) of single-nucleotide polymorphisms (SNPs) and/or insertions or deletions (indels) at respective genomic coordinates of the reference genome (e.g., of the primary contiguous sequence).
- population frequencies 619a, 619b, through 619n for the respective selected sets of population haplotypes 616a, 616b, through 616n are adjusted relative to the corresponding population frequencies 609a, 609b, through 609n (see FIG. 6A) to reflect the predicted diplotype within each of the base-level bins 612a, 612b, through 612n.
- the personalized sequencing system 106 utilizes a haplotype data structure (e.g., to implement a population haplotype database and/or a personalized haplotype database) with a hierarchical partitioning of genomic regions of a reference genome into multiple levels of bins corresponding to spans of nucleobases within the reference genome.
- FIG. 7 illustrates a haplotype data structure 700 having a base level 702 comprising a set of base-level bins 704 and multiple successive levels 706a, 706b, 706c, through 706n of higher-level bins spanning successively larger spans of nucleobases of a reference genome.
- the haplotype data structure 700 comprises the base level 702 comprising the set of base-level bins 704 jointly spanning a primary contiguous sequence of the reference genome and the multiple successive levels 706a-706n of higher-level bins 708a, 708b, 708c, through 708n and offset higher-level bins 709a, 709b, 709c, and so forth also spanning the primary contiguous sequence of the reference genome.
- the successive level 706n comprises a higher-level bin 708n and a corresponding offset higher-level bin. But FIG. 7 does not depict the corresponding offset higher-level bin for the successive level 706n and the higher-level bin 708n due to constraints on figure space.
- the personalized sequencing system 106 can identify, determine, generate, or utilize more base-level bins, successive levels, higher-level bins, and/or offset higher-level bins than those depicted in FIG. 7. While the haplotype data structure 700 in FIG. 7 depicts a data structure for a personalized haplotype database, a similar data structure can be used for a population haplotype database as described by Ruehle. As further indicated above, the haplotype data structure 700 can be encoded or otherwise be represented in a particular file type, such as a TSV file or CSV file.
- the base level 702 of the haplotype data structure 700 includes the set of base-level bins 704 corresponding to a respective set of base-level reference spans of the primary contiguous sequence for the reference genome.
- Each reference span of the set of base-level reference spans corresponds to a genomic region of a first length between respective genomic coordinates of the reference genome.
- each reference span of the set of base-level reference spans includes 1,000 base pairs (1 kbp) of the primary contiguous sequence for the reference genome.
- the first length of the base-level reference spans can be less than or greater than 1 kbp, such as, but not limited to, 250 bp, 500 bp, 1500 bp, 5 kbp, 10 kbp, and so forth. Accordingly, in various embodiments, the set of base-level bins 704 collectively span either the entire primary contiguous sequence or a genomic region of interest, such as but not limited to an entire chromosome.
- the set of base-level bins 704 of the base level 702 comprise variant data for nucleotide variants from respective sets of population haplotypes.
- each base-level bin of the base level 702 includes variant data for each locally distinct population haplotype.
- each locally distinct population haplotype comprises a unique set of one or more allele-variant differences relative to other population haplotypes within a respective base-level reference span of a given base-level bin of the set of base-level bins 704.
- each base-level bin of the base level 702 includes variant data for population haplotypes selected by the personalized sequencing system 106 (e.g., as described in relation to FIG. 5).
- the set of base-level bins 704 comprise respective pairs of population haplotypes selected by the personalized sequencing system 106, as indicated by the numbers (“2(0..1)”) associated with each base-level reference span of the set of base-level bins 704.
- the haplotype data structure 700 comprises the multiple successive levels 706a-706n of higher-level bins 708a-708n.
- a first successive level 706a, for instance, comprises a first set of higher-level bins 708a corresponding to a first set of higher-level reference spans of the primary contiguous sequence for the reference genome.
- Each reference span of the first set of higher-level reference spans corresponds to an expanded genomic region of a second length between respective genomic coordinates of the reference genome, wherein the expanded genomic regions are expanded relative to the genomic regions represented by the set of base-level reference spans such that the second length (of the respective first set of higher-level reference spans) is longer than the first length (of the set of base-level reference spans).
- each higher-level bin of the first set of higher-level bins 708a of the first successive level 706a corresponds to a consecutive pair of base-level bins from the set of base-level bins 704 of the base level 702 of the haplotype data structure 700.
- the multiple successive levels 706a-706c of the haplotype data structure 700 comprise respective sets of offset higher-level bins 709a-709c and the successive level 706n of the haplotype data structure 700 comprises the higher-level bin 708n and a corresponding offset higher-level bin.
- the first successive level 706a includes a set of offset higher-level bins 709a corresponding to a first set of offset higher-level reference spans of the primary contiguous sequence for the reference genome.
- Each reference span of the first set of offset higher-level reference spans corresponds to an offset expanded genomic region of the second length (i.e., the same length as the reference spans of the first set of successive reference spans) between respective genomic coordinates of the reference genome.
- the first set of offset higher-level bins 709a correspond to respective consecutive pairs of base-level bins from the set of base-level bins 704 of the base level 702 of the haplotype data structure 700.
- the respective reference spans of the first set of offset higher-level bins 709a are offset relative to the reference spans of the first set of higher-level bins 708a, such that each consecutive pair of base-level bins from the set of base-level bins 704 is represented by either a higher-level bin or an offset higher-level bin from the first successive level 706a.
- each additional successive level 706b-706n of the haplotype data structure 700 comprises additional higher-level bins 708b-708n corresponding to respective additional higher-level reference spans corresponding to further expanded genomic regions between genomic coordinates of the primary contiguous sequence for the reference genome.
- each higher-level bin (or offset higher-level bin) of a given successive level of the haplotype data structure 700 spans a combined genomic region of a pair of consecutive bins of a prior level of the haplotype data structure 700 (e.g., as indicated by the arrows linking various bins in FIG. 7).
- the first illustrated bin of the set of higher-level bins 708c spans the same genomic region represented by the first two illustrated bins of the set of higher-level bins 708b.
- the first illustrated bin of the set of higher-level bins 708b spans the same genomic region represented by the first two illustrated bins of the set of higher-level bins 708a.
- each successive level comprises higher-level bins corresponding to a pair of consecutive bins from the previous level of the haplotype data structure 700.
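- The geometry of the hierarchical bins can be sketched with a few index calculations. The sketch assumes a 1,000-base base-level span, a doubling of span length at each successive level, and offset bins shifted by half a span; the half-span shift at levels above the first is an assumption, and all function names are hypothetical.

```python
BASE_SPAN = 1000  # assumed base-level span length in base positions


def span_length(level, base_span=BASE_SPAN):
    """Level 0 is the base level; each successive level doubles the span length."""
    return base_span * (2 ** level)


def bin_span(level, index, offset=False, base_span=BASE_SPAN):
    """Return the half-open (start, end) coordinates of a bin or offset bin."""
    length = span_length(level, base_span)
    shift = length // 2 if offset else 0
    return (shift + index * length, shift + (index + 1) * length)


def child_indices(level, index, offset=False):
    """Return the indices of the two consecutive prior-level bins that a bin combines."""
    if level == 0:
        raise ValueError("base-level bins have no children")
    first = 2 * index + (1 if offset else 0)
    return (first, first + 1)


# At the first successive level, regular bins pair base bins (0, 1), (2, 3), ...
# while offset bins pair base bins (1, 2), (3, 4), ...
regular_children = child_indices(1, 0)
offset_children = child_indices(1, 0, offset=True)
```

- Under these assumptions, every consecutive pair of base-level bins is covered by exactly one regular or offset bin at the first successive level.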
- the respective higher-level bins of each successive level of the haplotype data structure 700 comprise variant-data indices referencing combinations of the variant data from corresponding base-level bins of the base level 702.
- each higher-level bin and offset higher-level bin of the multiple sets of higher-level bins 708a-708c and offset higher-level bins 709a-709c, respectively (and each of the higher-level bin 708n and a corresponding offset higher-level bin of the successive level 706n) comprise variant-data indices referencing combinations of variant data from corresponding base-level bins of the set of base-level bins 704.
- the variant-data indices include indications of locally distinct population haplotypes within each respective higher-level bin or offset higher-level bin.
- one of the offset higher-level bins 709b of the successive level 706b indicates two locally distinct population haplotypes (indicated by “2 Haplotypes (0..1)”).
- two bins of the higher-level bins 708a from the previous successive level indicate two locally distinct population haplotypes (indicated by “2(0..1)”) and one locally distinct population haplotype (indicated by “1(0)”), respectively.
- the higher-level bins of each successive level comprise variant-data indices indicating locally distinct population haplotypes and linking the higher-level bins to variant data within the corresponding base-level bins without including the variant data from the respective base-level bins, thus avoiding redundant encoding of variant data within the haplotype data structure 700.
- the aforementioned bin of offset higher-level bins 709b indicating four locally distinct population haplotypes can include variant-data indices referencing how the locally distinct population haplotypes of the corresponding higher-level bins (of the higher-level bins 708a) from the previous successive level (the first successive level 706a) combine to form the four locally distinct population haplotypes of the aforementioned bin.
- each of the corresponding higher-level bins 708a can include variant-data indices referencing the population haplotypes (and the variant data thereof) indicated within the corresponding base-level bins (of the set of base-level bins 704) from the base level 702.
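The index-only layering described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the class names `BaseBin` and `HigherBin`, the string encoding of variants, and the per-haplotype input layout are all assumptions made for illustration. Base-level bins store the variant data once; each higher-level bin stores only pairs of indices into its two child bins, and the offset level is modeled simply as pairing that starts at the second bin.

```python
from dataclasses import dataclass


@dataclass
class BaseBin:
    # Locally distinct haplotypes; each is a tuple of variant strings (hypothetical encoding).
    haplotypes: list


@dataclass
class HigherBin:
    left: object   # child bin from the previous level
    right: object  # the next consecutive child bin
    # Each locally distinct haplotype is an (index into left, index into right) pair.
    combos: list


def distinct_index(items):
    """Return (distinct items in first-seen order, index of each input item)."""
    seen, order, idx = {}, [], []
    for item in items:
        if item not in seen:
            seen[item] = len(order)
            order.append(item)
        idx.append(seen[item])
    return order, idx


def build_base_level(per_haplotype):
    """per_haplotype[h][b] is the tuple of variants haplotype h carries in base bin b."""
    bins, assign = [], []
    for b in range(len(per_haplotype[0])):
        order, idx = distinct_index([tuple(h[b]) for h in per_haplotype])
        bins.append(BaseBin(order))
        assign.append(idx)  # assign[b][h] = locally distinct index of haplotype h
    return bins, assign


def build_higher_level(prev_bins, prev_assign, offset=0):
    """Pair consecutive bins of the previous level; offset=1 gives the offset bins."""
    bins, assign = [], []
    for i in range(offset, len(prev_bins) - 1, 2):
        pairs = list(zip(prev_assign[i], prev_assign[i + 1]))
        order, idx = distinct_index(pairs)
        bins.append(HigherBin(prev_bins[i], prev_bins[i + 1], order))
        assign.append(idx)
    return bins, assign


def resolve(bin_, k):
    """Follow indices down to the base level to recover the variants of haplotype k."""
    if isinstance(bin_, BaseBin):
        return bin_.haplotypes[k]
    left_k, right_k = bin_.combos[k]
    return resolve(bin_.left, left_k) + resolve(bin_.right, right_k)
```

Because a higher-level bin holds only index pairs, the variant data is encoded once at the base level, and `resolve` recovers it on demand by walking back down the levels.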
- the personalized sequencing system 106 can alternatively evaluate and select haplotype pairs across reference spans of a respective successive level of bins.
- the personalized sequencing system 106 evaluates candidate population haplotypes within reference spans corresponding to the higher-level bins 708b, then populates the haplotype data structure 700 with haplotype data from the selected candidate population haplotypes, such as described above.
- the personalized sequencing system 106 implements mapping and alignment of nucleotide reads from a genomic sample with genomic regions of a reference genome with increased accuracy.
- FIGS. 8-9 show experimental results of the personalized sequencing system 106 generating and utilizing a personalized haplotype database, in accordance with some of the disclosed embodiments, to determine personalized alignments of nucleotide reads.
- FIGS. 8-9 illustrate comparative results of identifying single nucleotide polymorphisms (SNPs) based on read alignments generated according to one or more embodiments.
- FIG. 8 includes a table of experimental results of identifying SNPs in nucleotide reads aligned by existing sequencing systems and the personalized sequencing system 106. As shown, the columns of the table respectively correspond to false positives (FP), false negatives (FN), incorrect heterozygous or homozygous genotype calls (Hethom), and combined false positives and false negatives (FP+FN). Further, as depicted in FIG. 8, the first two rows (excluding the header row) of the table of experimental results correspond to results generated by an existing sequencing system using (i) a population haplotype database comprising 128 global population haplotypes (“Hap DB Global 128 Samples”) and (ii) a population haplotype database comprising 16 ancestry-specific population haplotypes selected based on the ancestry of the genomic sample tested (“HapDB Euro 16 Samples”), respectively.
- the final three rows of the table of experimental results correspond to results generated by the personalized sequencing system 106 using (i) a first personalized haplotype database generated, for the corresponding genomic sample, utilizing a first probabilistic model based on an exhaustive scoring of all possible pairs of candidate haplotypes for each reference span (“Personalized (exhaustive 1)”), (ii) a second personalized haplotype database generated, for the corresponding genomic sample, utilizing a second probabilistic model based on an exhaustive scoring of all possible pairs of candidate haplotypes for each reference span (“Personalized (exhaustive 2)”), and (iii) a third personalized haplotype database generated, for the corresponding genomic sample, utilizing a Variational Bayesian model to partition nucleotide reads by parental inheritance to enable individual scoring of candidate haplotypes (“Personalized (Variational Bayes)”), respectively.
- each of the three portrayed example embodiments of the personalized sequencing system 106 exhibits improved overall accuracy relative to the portrayed existing sequencing systems in identifying SNPs within nucleotide reads (e.g., in terms of FPs, FNs, Hethom accuracy, and combined FP+FN metrics).
- personalization by matching ancestry of a genomic sample can further improve genotype calling accuracy (see, e.g., “HapDB Euro 16 Samples” results)
- the personalized sequencing system 106 can produce similar or improved results without contextual information for the given genomic sample, thus providing increased accuracy over many existing systems with increased flexibility and efficiency.
- FIG. 9 includes an additional illustration of experimental results of identifying SNPs in nucleotide reads aligned by existing sequencing systems and the personalized sequencing system 106.
- FIG. 9 depicts read pileups 906b and 906c (pileups of nucleotide reads mapped to and aligned with respective genomic regions of a reference genome 902) generated by an existing sequencing system and by the personalized sequencing system 106, respectively, in comparison with a ground truth pileup 906a comprising at least one known single nucleotide variant, such as an identified known single nucleotide variant 904a.
- the read pileup 906c of the personalized sequencing system 106 includes a true positive call 904c for the identified known single nucleotide variant 904a due to multiple reads aligning to the respective genomic region with a relatively high mapping-quality (MAPQ) score (as indicated by the stack of filled rectangles).
- the personalized sequencing system 106 can generate and utilize a personalized haplotype database to efficiently determine read alignments for nucleotide reads from a genomic sample with improved accuracy in identifying variants relative to existing sequencing systems, as indicated by the comparative number of false positives, false negatives, and incorrect heterozygous or homozygous genotype calls identified within the provided experimental results.
- FIG. 10 illustrates an example flowchart of a series of acts for generating and utilizing a personalized haplotype database to determine personalized alignments of a set of nucleotide reads from a genomic sample in accordance with one or more embodiments. While FIG. 10 illustrates acts according to particular embodiments, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown in FIG. 10.
- the acts of FIG. 10 can be performed as part of a method.
- a non-transitory computer readable storage medium can comprise instructions that, when executed by one or more processors, cause a computing device to perform the acts depicted in FIG. 10.
- a system can comprise at least one processor and a non-transitory computer readable medium comprising instructions that, when executed by the at least one processor, cause the system to perform the acts of FIG. 10.
- the series of acts 1000 includes an act 1002 of identifying nucleotide reads and candidate population haplotypes within a set of reference spans of a reference genome, an act 1004 of generating haplotype set scores for sets of the candidate population haplotypes, an act 1006 of generating a personalized haplotype database based on the haplotype set scores, and an act 1008 of determining personalized alignments of the set of nucleotide reads utilizing the personalized haplotype database.
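As a rough illustration of acts 1002 through 1006 within a single reference span, the sketch below scores every pair of candidate haplotypes against the k-mers of the reads and keeps the best pair. The scoring rule is a toy heuristic chosen for this example (read k-mer occurrences explained by the pair, minus haplotype k-mers never observed in the reads), not the probabilistic models of the disclosure; the function names, the default `k`, and the example sequences are all hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement


def kmer_set(seq, k):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def score_haplotype_pairs(reads, haplotypes, k=4):
    """Score every unordered pair of candidate haplotypes within one reference span.

    reads: list of read sequences aligned to the span.
    haplotypes: dict mapping a haplotype name to its sequence over the span.
    """
    hap_kmers = {name: kmer_set(seq, k) for name, seq in haplotypes.items()}
    read_kmers = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            read_kmers[read[i:i + k]] += 1

    scores = {}
    for a, b in combinations_with_replacement(sorted(haplotypes), 2):
        pair_kmers = hap_kmers[a] | hap_kmers[b]
        # Reward read k-mer occurrences that the pair can explain.
        explained = sum(n for kmer, n in read_kmers.items() if kmer in pair_kmers)
        # Penalize haplotype k-mers that no read supports.
        unsupported = sum(1 for kmer in pair_kmers if kmer not in read_kmers)
        scores[(a, b)] = explained - unsupported
    return scores


def select_best_pair(scores):
    """Pick the highest-scoring pair for inclusion in a personalized database."""
    return max(scores, key=scores.get)


# Toy span: the sample carries hapA and hapB; hapC is an unrelated candidate.
candidates = {
    "hapA": "ACGTACGGTCAA",
    "hapB": "ACGTATGGTCAA",
    "hapC": "ACGTAGGGTCAA",
}
sample_reads = ["ACGTACGGTCAA", "ACGTATGGTCAA"]
span_scores = score_haplotype_pairs(sample_reads, candidates)
personalized_db = {"span_0": select_best_pair(span_scores)}
```

Repeating the selection for each reference span would yield a per-span subset of population haplotypes, which is the shape of the personalized haplotype database that the final alignment step consumes.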
- the series of acts 1000 can include acts to perform any of the operations described in the following clauses:
- CLAUSE 1 A computer-implemented method comprising: identifying, within a set of reference spans of a reference genome, a set of nucleotide reads from a genomic sample and candidate population haplotypes from a population haplotype database; generating, for the set of reference spans of the reference genome, haplotype set scores for sets of the candidate population haplotypes based on comparing the set of nucleotide reads and the candidate population haplotypes; generating, for the genomic sample and based on the haplotype set scores, a personalized haplotype database comprising a subset of population haplotypes from the population haplotype database within the set of reference spans; and determining, utilizing the personalized haplotype database, one or more personalized alignments of the set of nucleotide reads with respective genomic regions of the reference genome.
- CLAUSE 2 The computer-implemented method of clause 1, further comprising: identifying one or more distinct k-mers for each candidate population haplotype within a given reference span of the set of reference spans; and generating, for the given reference span, respective haplotype set scores of the haplotype set scores for respective candidate population haplotypes of the sets of the candidate population haplotypes within the given reference span based on comparing k-mers of the set of nucleotide reads with the one or more distinct k-mers of the respective candidate population haplotypes.
- CLAUSE 3 The computer-implemented method of clause 1, further comprising: determining initial alignments of the set of nucleotide reads with respective genomic regions of the reference genome; identifying subsets of nucleotide reads of the set of nucleotide reads that, according to the initial alignments, align to respective reference spans of the set of reference spans; and generating the haplotype set scores for the set of reference spans based on comparing the subsets of nucleotide reads and the candidate population haplotypes within the respective reference spans of the set of reference spans.
- CLAUSE 4 The computer-implemented method of clause 3, further comprising: identifying, as indicated within the population haplotype database, allele-variant differences between population haplotypes and a primary contiguous sequence at respective genomic regions of the reference genome; and determining one or more of the initial alignments of the set of nucleotide reads by rescoring candidate reference alignments of the set of nucleotide reads with the primary contiguous sequence according to the allele-variant differences.
- CLAUSE 5 The computer-implemented method of any of clauses 3-4, further comprising: identifying one or more population haplotypes for each nucleotide read of the set of nucleotide reads based on comparing the set of nucleotide reads with haplotype variants within the respective genomic regions of the initial alignments; and generating an alignment data file comprising: the initial alignments of the set of nucleotide reads; alignment scores corresponding to the initial alignments; and the identified one or more population haplotypes for each nucleotide read of the set of nucleotide reads.
- CLAUSE 6 The computer-implemented method of any of clauses 1-5, further comprising: determining haplotype set likelihoods for the set of reference spans by comparing the set of nucleotide reads and pairs of the candidate population haplotypes within respective reference spans of the set of reference spans; and generating the haplotype set scores for the set of reference spans based on the haplotype set likelihoods.
- CLAUSE 8 The computer-implemented method of any of clauses 1-7, further comprising including within the subset of population haplotypes of the personalized haplotype database, a respective subset of the candidate population haplotypes for each reference span of the set of reference spans based on the haplotype set scores.
- CLAUSE 9 The computer-implemented method of clause 8, further comprising: generating haplotype set likelihoods for the set of reference spans by comparing the set of nucleotide reads and the candidate population haplotypes; generating, utilizing a hidden Markov model (HMM) algorithm, a set of haplotype set posterior probabilities for each reference span of the set of reference spans based on respective haplotype set likelihoods for adjacent reference spans of the set of reference spans; and generating, based on the set of haplotype set posterior probabilities for each reference span of the set of reference spans, respective sets of the haplotype set scores for the set of reference spans.
- CLAUSE 10 The computer-implemented method of clause 9, further comprising: generating, utilizing the haplotype set likelihoods in a forward pass of the HMM algorithm across the set of reference spans, a set of forward probabilities for each reference span of the set of reference spans; generating, utilizing a backward pass of the HMM algorithm to update the set of forward probabilities for each reference span of the set of reference spans, a set of updated probabilities for each reference span of the set of reference spans; and generating, based on the set of updated probabilities for each reference span of the set of reference spans, the set of haplotype set posterior probabilities for each reference span of the set of reference spans.
- CLAUSE 11 The computer-implemented method of any of clauses 9-10, further comprising utilizing a recombination parameter of the HMM algorithm to reduce variation in haplotype set posterior probabilities across the set of reference spans.
- CLAUSE 12 The computer-implemented method of any of clauses 1-11, further comprising: categorizing, utilizing a Variational Bayesian model, each nucleotide read of the set of nucleotide reads within each respective reference span of the set of reference spans as inherited from a first parent or a second parent; generating, based on each categorized nucleotide read of the set of nucleotide reads, individual haplotype scores for population haplotypes selected from the population haplotype database; and generating the haplotype set scores for pairs of the population haplotypes selected from the population haplotype database by combining individual haplotype scores from nucleotide reads inherited from the first parent with individual haplotype scores from nucleotide reads inherited from the second parent.
- CLAUSE 13 The computer-implemented method of any of clauses 1-12, further comprising generating, for a given reference span of the set of reference spans, the haplotype set scores for respective pairs of locally distinct population haplotypes within the given reference span, each locally distinct population haplotype comprising a unique set of one or more allele-variant differences relative to other population haplotypes within the given reference span.
- CLAUSE 14 The computer-implemented method of any of clauses 1-13, further comprising: identifying, as indicated within the personalized haplotype database, allele-variant differences between population haplotypes of the subset of population haplotypes and a primary contiguous sequence at the respective genomic regions of the reference genome; and determining one or more of the one or more personalized alignments of the set of nucleotide reads by rescoring candidate reference alignments of the set of nucleotide reads with the primary contiguous sequence according to allele-variant differences.
- CLAUSE 15 The computer-implemented method of any of clauses 1-14, further comprising: generating alignment scores for initial alignments of the set of nucleotide reads with respective genomic regions of the reference genome based on comparing the set of nucleotide reads with population haplotypes from the population haplotype database; generating an initial alignment data file comprising the initial alignments and corresponding alignment scores; generating personalized alignment scores for the one or more personalized alignments based on comparing the set of nucleotide reads with the subset of population haplotypes within the personalized haplotype database; and generating a personalized alignment data file comprising the one or more personalized alignments and corresponding personalized alignment scores.
- CLAUSE 16 The computer-implemented method of any of clauses 1-15, further comprising determining genotype calls for the genomic sample based on the one or more personalized alignments.
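The forward and backward passes recited in clauses 9 through 11 can be sketched with a generic scaled forward-backward routine. This is a simplified illustration under stated assumptions: every reference span is assumed to share the same set of candidate haplotype sets (states), the prior is uniform, and a single hypothetical `switch` probability stands in for the recombination parameter. The disclosure may use different state spaces and transition models.

```python
def forward_backward(likelihoods, switch=0.05):
    """Posterior probability of each haplotype set in each reference span.

    likelihoods[t][s]: likelihood of the reads in span t given haplotype set s.
    switch: probability of changing haplotype set between adjacent spans
            (an illustrative stand-in for a recombination parameter).
    """
    n = len(likelihoods[0])

    def trans(i, j):
        if n == 1:
            return 1.0
        return 1.0 - switch if i == j else switch / (n - 1)

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass: uniform prior on the first span, rescaled at every step.
    fwd = [normalize([lik / n for lik in likelihoods[0]])]
    for obs in likelihoods[1:]:
        prev = fwd[-1]
        fwd.append(normalize([
            obs[j] * sum(prev[i] * trans(i, j) for i in range(n))
            for j in range(n)
        ]))

    # Backward pass: start from all ones at the last span and move left.
    bwd = [[1.0] * n]
    for obs in reversed(likelihoods[1:]):
        nxt = bwd[0]
        bwd.insert(0, normalize([
            sum(trans(i, j) * obs[j] * nxt[j] for j in range(n))
            for i in range(n)
        ]))

    # Combine both passes and renormalize to get per-span posteriors.
    return [normalize([f * b for f, b in zip(f_row, b_row)])
            for f_row, b_row in zip(fwd, bwd)]
```

With a small `switch`, an uninformative span flanked by two spans that strongly favor the same haplotype set inherits that preference, which is the smoothing effect attributed to the recombination parameter in clause 11.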
- The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleobase type from another are particularly applicable.
- the process to determine the nucleotide sequence of a target nucleic acid i.e., a nucleic acid polymer
- Preferred embodiments include sequencing-by-synthesis (SBS) techniques.
- SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand.
- a single nucleotide monomer may be provided to a target nucleic acid in the presence of a polymerase in each delivery.
- more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.
- SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties.
- Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using γ-phosphate-labeled nucleotides, as set forth in further detail below.
- the number of nucleotides added in each cycle is generally variable and dependent upon the template sequence and the mode of nucleotide delivery.
- the terminator can be effectively irreversible under the sequencing conditions used as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible as is the case for sequencing methods developed by Solexa (now Illumina, Inc.).
- SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a byproduct of incorporation of the nucleotide, such as release of pyrophosphate; or the like.
- the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used.
- the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by
- Preferred embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) "Real-time DNA sequencing using detection of pyrophosphate release.” Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) "Pyrosequencing sheds light on DNA sequencing.” Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P.
- the nucleic acids to be sequenced can be attached to features in an array and the array can be imaged to capture the chemiluminescent signals that are produced due to incorporation of nucleotides at the features of the array.
- An image can be obtained after the array is treated with a particular nucleotide type (e.g., A, T, C or G). Images obtained after addition of each nucleotide type will differ with regard to which features in the array are detected. These differences in the image reflect the different sequence content of the features on the array. However, the relative locations of each feature will remain unchanged in the images.
- the images can be stored, processed and analyzed using the methods set forth herein. For example, images obtained after treatment of the array with each different nucleotide type can be handled in the same way as exemplified herein for images obtained from different detection channels for reversible terminator-based sequencing methods.
- cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in WO 04/018497 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference.
- This approach is being commercialized by Solexa (now Illumina Inc.), and is also described in WO 91/06678 and WO 07/123,744, each of which is incorporated herein by reference.
- the availability of fluorescently labeled terminators in which both the termination can be reversed and the fluorescent label cleaved facilitates efficient cyclic reversible termination (CRT) sequencing.
- Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.
- the labels do not substantially inhibit extension under SBS reaction conditions.
- the detection labels can be removable, for example, by cleavage or degradation. Images can be captured following incorporation of labels into arrayed nucleic acid features.
- each cycle involves simultaneous delivery of four different nucleotide types to the array and each nucleotide type has a spectrally distinct label. Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. Alternatively, different nucleotide types can be added sequentially, and an image of the array can be obtained between each addition step.
- each image will show nucleic acid features that have incorporated nucleotides of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature. However, the relative position of the features will remain unchanged in the images. Images obtained from such reversible terminator-SBS methods can be stored, processed and analyzed as set forth herein. Following the image capture step, labels can be removed, and reversible terminator moieties can be removed for subsequent cycles of nucleotide addition and detection. Removal of the labels after they have been detected in a particular cycle and prior to a subsequent cycle can provide the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are set forth below.
- nucleotide monomers can include reversible terminators.
- reversible terminators/cleavable fluors can include fluor linked to the ribose moiety via a 3' ester linkage (Metzker, Genome Res. 15:1767-1776 (2005), which is incorporated herein by reference).
- Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety).
- Ruparel et al. described the development of reversible terminators that used a small 3' allyl group to block extension but could easily be deblocked by a short treatment with a palladium catalyst.
- the fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light.
- disulfide reduction or photocleavage can be used as a cleavable linker.
- Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP.
- the presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and/or electrostatic hindrance.
- Some embodiments can utilize detection of four different nucleotides using fewer than four different labels.
- SBS can be performed utilizing methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232.
- a pair of nucleotide types can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g. via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair.
- nucleotide types can be detected under particular conditions while a fourth nucleotide type lacks a label that is detectable under those conditions, or is minimally detected under those conditions (e.g., minimal detection due to background fluorescence, etc.). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on presence of their respective signals and incorporation of the fourth nucleotide type into the nucleic acid can be determined based on absence or minimal detection of any signal.
- one nucleotide type can include label(s) that are detected in two different channels, whereas other nucleotide types are detected in no more than one of the channels.
- An exemplary embodiment that combines all three examples is a fluorescent-based SBS method that uses a first nucleotide type that is detected in a first channel (e.g. dATP having a label that is detected in the first channel when excited by a first excitation wavelength), a second nucleotide type that is detected in a second channel (e.g. dCTP having a label that is detected in the second channel when excited by a second excitation wavelength), a third nucleotide type that is detected in both the first and the second channel (e.g. dTTP having at least one label that is detected in both channels when excited by the first and/or second excitation wavelength), and a fourth nucleotide type that lacks a label and so is not, or is only minimally, detected in either channel (e.g. dGTP having no label).
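The two-channel labeling scheme in this example maps each on/off combination of the two channels to one nucleotide type. The sketch below shows that mapping only; the function name, the normalized intensity scale, and the fixed `threshold` are assumptions for illustration, and real base callers model intensities far more carefully.

```python
def call_base_two_channel(ch1, ch2, threshold=0.5):
    """Decode one base from two channel intensities (hypothetical normalized units).

    Mapping follows the example: dATP in channel 1 only, dCTP in channel 2 only,
    dTTP in both channels, and unlabeled dGTP in neither.
    """
    on1 = ch1 >= threshold
    on2 = ch2 >= threshold
    decode = {
        (True, False): "A",
        (False, True): "C",
        (True, True): "T",
        (False, False): "G",
    }
    return decode[(on1, on2)]


def call_read(intensities, threshold=0.5):
    """Decode a sequence of (ch1, ch2) intensity pairs, one pair per cycle."""
    return "".join(call_base_two_channel(c1, c2, threshold) for c1, c2 in intensities)
```

For instance, `call_read([(0.9, 0.1), (0.1, 0.8), (0.9, 0.9), (0.05, 0.02)])` decodes four cycles using only two images per cycle.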
- sequencing data can be obtained using a single channel.
- the first nucleotide type is labeled but the label is removed after the first image is generated, and the second nucleotide type is labeled only after a first image is generated.
- the third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.
- Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides.
- the oligonucleotides typically have different labels that are correlated with the identity of a particular nucleotide in a sequence to which the oligonucleotides hybridize.
- images can be obtained following treatment of an array of nucleic acid features with the labeled sequencing reagents. Each image will show nucleic acid features that have incorporated labels of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature, but the relative position of the features will remain unchanged in the images. Images obtained from ligation-based sequencing methods can be stored, processed and analyzed as set forth herein. Exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Pat. No. 6,969,488, U.S. Pat. No. 6,172,218, and U.S. Pat. No. 6,306,597, the disclosures of which are incorporated herein by reference in their entireties.
- Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M. "Nanopores and nucleic acids: prospects for ultrarapid sequencing.” Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, “Characterization of nucleic acids by nanopore analysis”. Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, "DNA molecules and configurations in a solid-state nanopore microscope” Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties).
- the target nucleic acid passes through a nanopore.
- the nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin.
- each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore.
- Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity.
- Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. No. 7,329,492 and U.S. Pat. No. 7,211,414 (each of which is incorporated herein by reference) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No.
- the illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. "Zero-mode waveguides for single-molecule analysis at high concentrations.” Science 299, 682-686 (2003); Lundquist, P. M. et al.
- Some SBS embodiments include detection of a proton released upon incorporation of a nucleotide into an extension product.
- sequencing based on detection of released protons can use an electrical detector and associated techniques that are commercially available from Ion Torrent (Guilford, CT, a Life Technologies subsidiary) or sequencing methods and systems described in US 2009/0026082 A1; US 2009/0127589 A1; US 2010/0137143 A1; or US 2010/0282617 A1, each of which is incorporated herein by reference.
- Methods set forth herein for amplifying target nucleic acids using kinetic exclusion can be readily applied to substrates used for detecting protons. More specifically, methods set forth herein can be used to produce clonal populations of amplicons that are used to detect protons.
- the above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously.
- different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner.
- the target nucleic acids can be in an array format. In an array format, the target nucleic acids can be typically bound to a surface in a spatially distinguishable manner.
- the target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle or binding to a polymerase or other molecule that is attached to the surface.
- the array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR as described in further detail below.
- the methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features/cm², 100 features/cm², 500 features/cm², 1,000 features/cm², 5,000 features/cm², 10,000 features/cm², 50,000 features/cm², 100,000 features/cm², 1,000,000 features/cm², 5,000,000 features/cm², or higher.
- an advantage of the methods set forth herein is that they provide for rapid and efficient detection of a plurality of target nucleic acids in parallel. Accordingly, the present disclosure provides integrated systems capable of preparing and detecting nucleic acids using techniques known in the art such as those exemplified above.
- an integrated system of the present disclosure can include fluidic components capable of delivering amplification reagents and/or sequencing reagents to one or more immobilized DNA fragments, the system comprising components such as pumps, valves, reservoirs, fluidic lines and the like.
- a flow cell can be configured and/or used in an integrated system for detection of target nucleic acids. Exemplary flow cells are described, for example, in US 2010/0111768 A1 and US Ser. No.
- one or more of the fluidic components of an integrated system can be used for an amplification method and for a detection method.
- one or more of the fluidic components of an integrated system can be used for an amplification method set forth herein and for the delivery of sequencing reagents in a sequencing method such as those exemplified above.
- an integrated system can include separate fluidic systems to carry out amplification methods and to carry out detection methods.
- Examples of integrated sequencing systems that are capable of creating amplified nucleic acids and also determining the sequence of the nucleic acids include, without limitation, the MiSeq™ platform (Illumina, Inc., San Diego, CA) and devices described in US Ser. No. 13/273,666, which is incorporated herein by reference.
- sample and its derivatives, is used in its broadest sense and includes any specimen, culture and the like that is suspected of including a target.
- the sample comprises DNA, RNA, PNA, LNA, chimeric or hybrid forms of nucleic acids.
- the sample can include any biological, clinical, surgical, agricultural, atmospheric or aquatic-based specimen containing one or more nucleic acids.
- the term also includes any isolated nucleic acid sample such as genomic DNA, fresh-frozen or formalin-fixed paraffin-embedded nucleic acid specimen.
- the sample can be from a single individual, a collection of nucleic acid samples from genetically related members, nucleic acid samples from genetically unrelated members, nucleic acid samples (matched) from a single individual such as a tumor sample and normal tissue sample, or sample from a single source that contains two distinct forms of genetic material such as maternal and fetal DNA obtained from a maternal subject, or the presence of contaminating bacterial DNA in a sample that contains plant or animal DNA.
- the source of nucleic acid material can include nucleic acids obtained from a newborn, for example as typically used for newborn screening.
- the nucleic acid sample can include high molecular weight material such as genomic DNA (gDNA).
- the sample can include low molecular weight material such as nucleic acid molecules obtained from FFPE or archived DNA samples.
- low molecular weight material includes enzymatically or mechanically fragmented DNA.
- the sample can include cell-free circulating DNA.
- the sample can include nucleic acid molecules obtained from biopsies, tumors, scrapings, swabs, blood, mucus, urine, plasma, semen, hair, laser capture micro-dissections, surgical resections, and other clinical or laboratory obtained samples.
- the sample can be an epidemiological, agricultural, forensic or pathogenic sample.
- the sample can include nucleic acid molecules obtained from an animal such as a human or mammalian source.
- the sample can include nucleic acid molecules obtained from a non-mammalian source such as a plant, bacteria, virus or fungus.
- the source of the nucleic acid molecules may be an archived or extinct sample or species.
- forensic samples can include nucleic acids obtained from a crime scene, nucleic acids obtained from a missing persons DNA database, nucleic acids obtained from a laboratory associated with a forensic investigation or include forensic samples obtained by law enforcement agencies, one or more military services or any such personnel.
- the nucleic acid sample may be a purified sample or a crude DNA containing lysate, for example derived from a buccal swab, paper, fabric or other substrate that may be impregnated with saliva, blood, or other bodily fluids.
- the nucleic acid sample may comprise low amounts of, or fragmented portions of DNA, such as genomic DNA.
- target sequences can be present in one or more bodily fluids including but not limited to, blood, sputum, plasma, semen, urine and serum.
- target sequences can be obtained from hair, skin, tissue samples, autopsy or remains of a victim.
- nucleic acids including one or more target sequences can be obtained from a deceased animal or human.
- target sequences can include nucleic acids obtained from non-human DNA such as microbial, plant or entomological DNA.
- target sequences or amplified target sequences are directed to purposes of human identification.
- the disclosure relates generally to methods for identifying characteristics of a forensic sample.
- the disclosure relates generally to human identification methods using one or more target specific primers disclosed herein or one or more target specific primers designed using the primer design criteria outlined herein.
- a forensic or human identification sample containing at least one target sequence can be amplified using any one or more of the target-specific primers disclosed herein or using the primer criteria outlined herein.
- the components of the personalized sequencing system 106 can include software, hardware, or both.
- the components of the personalized sequencing system 106 can include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices (e.g., the client device 114, the local device 108, or the server device(s) 110).
- the computer-executable instructions of the personalized sequencing system 106 can cause the computing devices to perform the methods described herein.
- the components of the personalized sequencing system 106 can comprise hardware, such as special purpose processing devices to perform a certain function or group of functions.
- the components of the personalized sequencing system 106 can include a combination of computer-executable instructions and hardware.
- the components of the personalized sequencing system 106 performing the functions described herein with respect to the personalized sequencing system 106 may, for example, be implemented as part of a stand-alone application, as a module of an application, as a plug-in for applications, as a library function or functions that may be called by other applications, and/or as a cloud-computing model.
- components of the personalized sequencing system 106 may be implemented as part of a stand-alone application on a personal computing device or a mobile device.
- the components of the personalized sequencing system 106 may be implemented in any application that provides sequencing services including, but not limited to Illumina BaseSpace, Illumina DRAGEN, or Illumina TruSight software. “Illumina,” “BaseSpace,” “DRAGEN,” and “TruSight,” are either registered trademarks or trademarks of Illumina, Inc. in the United States and/or other countries.
- Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below.
- Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures.
- one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein).
- a processor receives instructions from a non-transitory computer-readable medium (e.g., a memory, etc.) and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.
- Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system.
- Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices).
- Computer-readable media that carry computer-executable instructions are transmission media.
- embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.
- Non-transitory computer-readable storage media includes RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) (e.g., based on RAM), Flash memory, phase-change memory (PCM), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
- a “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices.
- a network or another communications connection can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
- program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa).
- computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a NIC), and then eventually transferred to computer system RAM and/or to less volatile computer storage media (devices) at a computer system.
- non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.
- Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions.
- computer-executable instructions are executed on a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure.
- the computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code.
- the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like.
- the disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks.
- program modules may be located in both local and remote memory storage devices.
- Embodiments of the present disclosure can also be implemented in cloud computing environments.
- “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources.
- cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources.
- the shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
- FIG. 11 illustrates a block diagram of a computing device 1100 that may be configured to perform one or more of the processes described above.
- the computing device 1100 may implement the personalized sequencing system 106 and the sequencing device system 104.
- the computing device 1100 can comprise a processor 1102, a memory 1104, a storage device 1106, an I/O interface 1108, and a communication interface 1110, which may be communicatively coupled by way of a communication infrastructure 1112.
- the computing device 1100 can include fewer or more components than those shown in FIG. 11. The following paragraphs describe components of the computing device 1100 shown in FIG. 11 in additional detail.
- the processor 1102 includes hardware for executing instructions, such as those making up a computer program.
- the processor 1102 may retrieve (or fetch) the instructions from an internal register, an internal cache, the memory 1104, or the storage device 1106 and decode and execute them.
- the memory 1104 may be a volatile or nonvolatile memory used for storing data, metadata, and programs for execution by the processor(s).
- the storage device 1106 includes storage, such as a hard disk, flash disk drive, or other digital storage device, for storing data or instructions for performing the methods described herein.
- the I/O interface 1108 allows a user to provide input to, receive output from, and otherwise transfer data to and receive data from computing device 1100.
- the I/O interface 1108 may include a mouse, a keypad or a keyboard, a touch screen, a camera, an optical scanner, network interface, modem, other known I/O devices or a combination of such I/O interfaces.
- the I/O interface 1108 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers.
- the I/O interface 1108 is configured to provide graphical data to a display for presentation to a user.
- the graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
- the communication interface 1110 can include hardware, software, or both. In any event, the communication interface 1110 can provide one or more interfaces for communication (such as, for example, packet-based communication) between the computing device 1100 and one or more other computing devices or networks. As an example, and not by way of limitation, the communication interface 1110 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network.
- the communication interface 1110 may facilitate communications with various types of wired or wireless networks.
- the communication interface 1110 may also facilitate communications using various communication protocols.
- the communication infrastructure 1112 may also include hardware, software, or both that couples components of the computing device 1100 to each other.
- the communication interface 1110 may use one or more networks and/or protocols to enable a plurality of computing devices connected by a particular infrastructure to communicate with each other to perform one or more aspects of the processes described herein.
- the sequencing process can allow a plurality of devices (e.g., a client device, sequencing device, and server device(s)) to exchange information such as sequencing data and error notifications.
Abstract
This disclosure describes methods, non-transitory computer readable media, and systems that implement improved mapping and alignment of nucleotide reads with genomic regions of a reference genome. For instance, the disclosed systems can identify nucleotide reads and candidate population haplotypes within a set of reference spans of a reference genome and generate haplotype set scores for sets of the candidate population haplotypes within the set of reference spans. Based on the haplotype set scores, the disclosed systems can generate a personalized haplotype database comprising a subset of population haplotypes from a population haplotype database and, utilizing the personalized haplotype database, determine one or more personalized alignments of the set of nucleotide reads from the genomic sample.
Description
A PERSONALIZED HAPLOTYPE DATABASE FOR IMPROVED MAPPING AND ALIGNMENT OF NUCLEOTIDE READS AND IMPROVED GENOTYPE CALLING
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/558,754, entitled “A PERSONALIZED HAPLOTYPE DATABASE FOR IMPROVED MAPPING AND ALIGNMENT OF NUCLEOTIDE READS AND IMPROVED GENOTYPE CALLING,” filed on February 28, 2024 (IP-2677-PRV), the entirety of which is hereby incorporated by reference.
BACKGROUND
[0002] In recent years, biotechnology firms and research institutions have improved hardware and software for sequencing nucleotides and determining variant calls for genomic samples. For instance, some existing nucleobase sequencing platforms determine individual nucleobases within sequences from genomic samples’ cells by using conventional Sanger sequencing or by using sequencing-by-synthesis (SBS) methods. When using SBS, existing platforms can monitor millions to billions of nucleic acid polymers being synthesized in parallel to predict nucleobase calls from a larger base call dataset. For instance, a camera in many SBS platforms captures images of irradiated fluorescent tags incorporated into oligonucleotides for determining the nucleobase calls. After capturing such images, existing sequencing platforms send base call data (or image-based data) to a computing device to apply sequencing data analysis software that determines a nucleobase sequence for a genomic sample or other nucleic acid polymer. For instance, such software (i) maps and aligns nucleotide reads determined by the sequencing platform for a genomic sample with (ii) a reference genome comprising at least a primary contiguous sequence. Based on differences between the aligned nucleotide reads and the reference genome, existing data analysis software can further utilize a variant caller to identify genotype and/or variants within a genomic sample, such as single nucleotide polymorphisms (SNPs), insertions or deletions (indels), or structural variants.
[0003] Despite these recent advances, existing nucleobase sequencing platforms and sequencing data analysis software (together and hereinafter, “existing sequencing systems”) often utilize reference genomes that misrepresent certain populations and lead to inaccurate read mapping and alignment and mistaken variant calling. For example, some existing sequencing systems use a linear reference genome that purportedly represents a consensus or example of genes and other nucleotide sequences of an organism. But about 93% of the primary assembly for the most common linear human reference genome, GRCh38 from the Genome Reference Consortium, is based on libraries from only 11 individuals, with 70% of the linear human reference genome coming from 1 individual. Accordingly, many existing systems use a linear reference genome that does not represent certain populations, common variants, or common population haplotypes.
[0004] To address this lack of genetic representation in linear reference genomes, some existing sequencing systems generate or use a graph reference genome. For example, some graph reference genomes include both a linear reference genome and graph augmentations with multi-nucleobase codes representing SNPs and/or indels and alternate contiguous sequences representing various alternative population haplotypes at given genomic regions. In some cases, such graph reference genomes stack and index numerous alternate contiguous sequences that can respectively stretch relatively long nucleobase distances (e.g., hundreds to thousands of base pairs in length) and, consequently, include redundant reference nucleobases overlapping a same region. The one-size-fits-all approach to graph reference genomes can accordingly consume excessive amounts of memory encoding alternate contiguous sequences.
[0005] While such graph reference genomes better account for some populations’ genetics, the expanded representation of existing graph reference genomes often includes an exorbitant number of alternative paths for alleles that are similar to other genomic regions and paths in the graph reference genome. Consequently, existing sequencing systems can significantly increase the difficulty of predicting accurate degradations from alternative paths by undermining the distinctness and usefulness of a genomic region for mapping and alignment of nucleotide reads and by increasing confusion between multiple look-alike genomic regions. Indeed, the one-size-fits-all approach to graph reference genomes can decrease quality scores for mapping reads and undermine the expected utility of a more diverse set of alternate contiguous sequences to which reads can be mapped.
[0006] Indeed, these generic graph reference genomes, with an excessive number of alternative paths representing alternative contiguous sequences, frequently cause existing sequencing systems to misalign, incorrectly match, or miscall variants for a large number of samples as well as increase the chances of mismatched alignments with reads from a genomic sample. Because multiple look-alike population haplotypes lift over a given genomic region of a primary contiguous sequence, and mapping quality diminishes (e.g., to MAPQ 0) as such population haplotypes increase in number for the given genomic region, existing sequencing systems have often failed to scale up candidate population haplotypes in a graph reference genome without detrimental effects on the accuracy of mapping and alignment and consequent reductions in variant-calling accuracy.
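The link between look-alike candidates and low mapping quality can be illustrated with the Phred-scaled convention in which MAPQ is -10·log10 of the probability that the reported mapping position is wrong. The following is a minimal sketch under a simplifying assumption that is not taken from this disclosure: a read matches n candidate locations equally well, so the chance of picking the wrong one is 1 - 1/n. The function name `simplified_mapq` and the cap of 60 are hypothetical choices for illustration; production mappers use more elaborate scoring.

```python
from math import log10


def simplified_mapq(num_equally_good_hits: int, cap: int = 60) -> int:
    """Phred-scaled mapping quality under a toy 'equally good hits' model.

    Assumption (illustrative only): every candidate location explains the
    read equally well, so P(wrong) = 1 - 1/n. The cap is a hypothetical
    ceiling used when the read has a single candidate location.
    """
    if num_equally_good_hits < 1:
        raise ValueError("a mapped read needs at least one candidate location")
    p_wrong = 1.0 - 1.0 / num_equally_good_hits
    if p_wrong == 0.0:
        # A unique hit has no competing location in this toy model.
        return cap
    return min(cap, round(-10.0 * log10(p_wrong)))


# Mapping quality for 1, 2, 4, and 10 indistinguishable candidate locations.
mapq_by_hits = {n: simplified_mapq(n) for n in (1, 2, 4, 10)}
```

Under this toy model, a second indistinguishable candidate already collapses the quality from the cap to a single-digit value, and ten candidates round down to 0, which is consistent with the MAPQ 0 behavior described above.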
[0007] Although the human genome is largely diploid, the human reference genome assemblies utilized by many existing sequencing systems comprise a haploid representation of different reference genomic samples. By using a haploid reference genome for mapping and alignment of diploid genomic samples, existing sequencing systems frequently align nucleotide reads and determine variant calls that are negatively influenced by a reference bias in favor of alignment of such reads with alleles represented by the reference genome and to the detriment of alternative alleles, despite allele-variant differences between the sample diplotype and the haploid reference genome. Such a reference bias often leads to false positives (FPs) and false negatives (FNs) in variant calls, thus reducing the accuracy of existing sequencing systems in generating variant calls for a diploid genomic sample.
[0008] These, along with additional problems and issues, exist in existing sequencing systems.
SUMMARY
[0009] This disclosure describes embodiments of methods, non-transitory computer-readable media, and systems that (i) generate a personalized haplotype database for a genomic sample and (ii) utilize the personalized haplotype database to determine personalized alignments of nucleotide reads from the genomic sample. In particular, the disclosed systems can generate a haplotype database that is customized or personalized for a specific genomic sample based on a comparison of nucleotide reads from a genomic sample with candidate population haplotypes from a population haplotype database. In certain implementations, for example, the disclosed systems can generate a personalized diploid reference database with a customized set of haplotype pairs for diploid genomic regions of a reference genome. The disclosed systems can utilize the personalized haplotype database to determine personalized alignments of nucleotide reads from the genomic sample with respective genomic regions of a reference genome.
[0010] For example, the disclosed systems can identify, within a set of reference spans of a reference genome, a set of nucleotide reads from a genomic sample and candidate population haplotypes from a population haplotype database. For each reference span of the set of reference spans, the disclosed systems can generate haplotype set scores for sets of the candidate population haplotypes based on comparing the nucleotide reads and the candidate population haplotypes. Based on the haplotype set scores, the disclosed systems can generate a personalized haplotype database comprising a subset of population haplotypes from the population haplotype database. By utilizing a personalized haplotype database generated according to the disclosed methods, the disclosed systems can determine personalized alignments of a set of nucleotide reads from a genomic sample with respective genomic regions of a reference genome.
[0011] Additional features and advantages of one or more embodiments of the present disclosure will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of such example embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] The detailed description refers to the drawings briefly described below.
[0013] FIG. 1 illustrates an environment in which a personalized sequencing system can operate in accordance with one or more embodiments of the present disclosure.
[0014] FIG. 2 illustrates the personalized sequencing system identifying nucleotide reads from a genomic sample within reference spans of a reference genome in accordance with one or more embodiments of the present disclosure.
[0015] FIG. 3A illustrates the personalized sequencing system determining initial alignments of a set of nucleotide reads from a genomic sample and generating an alignment data file in accordance with one or more embodiments of the present disclosure.
[0016] FIG. 3B illustrates the personalized sequencing system generating a personalized haplotype database and a personalized alignment data file for a set of nucleotide reads from a genomic sample in accordance with one or more embodiments of the present disclosure.
[0017] FIG. 4 further illustrates the personalized sequencing system generating a personalized haplotype database for a set of nucleotide reads from a genomic sample in accordance with one or more embodiments of the present disclosure.
[0018] FIG. 5 illustrates the personalized sequencing system selecting haplotype sets for a set of reference spans of a reference genome and generating a personalized haplotype database in accordance with one or more embodiments of the present disclosure.
[0019] FIG. 6A illustrates a set of base-level bins of a population haplotype database in accordance with one or more embodiments of the present disclosure.
[0020] FIG. 6B illustrates a set of base-level bins of a personalized haplotype database in accordance with one or more embodiments of the present disclosure.
[0021] FIG. 7 illustrates base-level bins and successive higher-level bins of a personalized haplotype database in accordance with one or more embodiments of the present disclosure.
[0022] FIG. 8 illustrates comparative experimental results of determining variant calls from nucleotide reads that are (i) mapped and aligned with a reference genome using existing sequencing systems and (ii) mapped and aligned to a reference genome using a personalized haplotype database generated by the personalized sequencing system in accordance with various embodiments of the present disclosure.
[0023] FIG. 9 illustrates comparative experimental results of mapping and aligning nucleotide reads from a genomic sample utilizing (i) an existing sequencing system and (ii) the personalized sequencing system in accordance with one or more embodiments of the present disclosure.
[0024] FIG. 10 illustrates a flowchart of a series of acts for generating a personalized haplotype database and determining personalized alignments of a set of nucleotide reads from a genomic sample in accordance with one or more embodiments of the present disclosure.
[0025] FIG. 11 illustrates a block diagram of an example computing device for implementing one or more embodiments of the present disclosure.
DETAILED DESCRIPTION
[0026] This disclosure describes embodiments of a personalized sequencing system that can generate and utilize a personalized haplotype database to determine personalized alignments of nucleotide reads from a genomic sample. In particular, the personalized sequencing system can evaluate and select a customized set of population haplotypes based on a comparison of nucleotide reads from a genomic sample with candidate population haplotypes to generate a personalized haplotype database for the genomic sample. In some embodiments, for example, the personalized sequencing system initially maps and aligns a set of nucleotide reads from a genomic sample using a population haplotype database (e.g., a panel of 256 haplotypes) to generate an initial alignment data file (e.g., a rescored binary alignment map (BAM) file). Having determined an initial alignment of the set of nucleotide reads, the personalized sequencing system, in some embodiments, compares the set of nucleotide reads with haplotypes from the population haplotype database to select a subset of haplotypes to include within a personalized haplotype database (e.g., a personalized haplotype panel Tab Separated Values (TSV) file) for the genomic sample.
[0027] Moreover, in some embodiments, the personalized sequencing system utilizes an imputation model to evaluate sets of candidate population haplotypes across a plurality of genomic regions of the reference genome to generate haplotype set scores and select the customized set of population haplotypes based on the generated haplotype set scores. In certain implementations, the personalized sequencing system generates a personalized haplotype database comprising customized pairs of population haplotypes representing a predicted diplotype for a respective genomic sample within one or more diploid genomic regions of a corresponding reference genome. Having generated a personalized haplotype database, the personalized sequencing system can utilize the personalized haplotype database for a genomic sample to determine personalized alignments of nucleotide reads from a corresponding genomic sample with respective genomic regions of a reference genome.
[0028] Before determining such personalized alignments, in one or more embodiments, the personalized sequencing system identifies, within a set of reference spans of a reference genome, a set of nucleotide reads from a genomic sample and candidate population haplotypes from a population haplotype database. Based on comparing the set of nucleotide reads and the candidate population haplotypes within each reference span, the personalized sequencing system can generate haplotype set scores for sets of candidate population haplotypes. Based on the haplotype set scores, the personalized sequencing system can generate a personalized haplotype database comprising a subset of population haplotypes selected from the population haplotype database within the set of reference spans. As mentioned, the personalized sequencing system can utilize the personalized haplotype database to determine one or more personalized alignments of the set of nucleotide reads with respective genomic regions of the reference genome.
[0029] As also mentioned, the personalized sequencing system can utilize an imputation model to evaluate sets of candidate population haplotypes across a plurality of reference spans of the reference genome to generate haplotype set scores and select the customized set of population haplotypes based on the generated haplotype set scores. In one or more embodiments, for example, the personalized sequencing system generates haplotype set scores comprising either (i) haplotype pair scores for pairs of population haplotypes in reference spans covering diploid regions (e.g., regions corresponding to somatic chromosomes) of a reference genome or (ii) individual haplotype scores for population haplotypes in reference spans covering haploid regions (e.g., regions corresponding to sex chromosomes) of the reference genome. Alternatively or additionally, the personalized sequencing system can generate haplotype set scores for sets of more than two population haplotypes (e.g., to customize for polyploid genomic samples having more than two chromosomes per set or to account for uncertainty in selecting custom haplotype sets representing a genomic sample).
[0030] When utilizing an imputation model to evaluate sets of candidate population haplotypes across reference spans, the personalized sequencing system can utilize a hidden Markov model (HMM) algorithm as the imputation model. For example, in some embodiments, the personalized sequencing system utilizes an HMM algorithm to impute haplotype set posterior probabilities for sets of candidate population haplotypes across adjacent reference spans of a set of reference spans of a reference genome associated with a genomic sample. Based on the haplotype set posterior probabilities, the personalized sequencing system can generate haplotype set scores for respective sets of candidate population haplotypes and, based on the haplotype set scores, generate a personalized haplotype database for the genomic sample.
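As a non-limiting illustration of imputing haplotype set posterior probabilities across adjacent reference spans, the following minimal Python sketch runs a forward-backward pass over a hidden Markov model whose hidden states are candidate haplotype sets. The function name `haplotype_set_posteriors`, the input layout, the uniform prior, and the single `switch_prob` transition parameter are assumptions made for illustration and are not taken from this disclosure.

```python
def haplotype_set_posteriors(emissions, switch_prob=0.01):
    """Forward-backward posteriors over candidate haplotype sets.

    emissions[t][k] is an assumed, strictly positive likelihood of the reads
    in reference span t given candidate haplotype set k.
    switch_prob is an assumed probability of changing haplotype set between
    adjacent spans (a simple stand-in for recombination).
    """
    n_spans = len(emissions)
    n_states = len(emissions[0])
    if n_states == 1:
        return [[1.0] for _ in range(n_spans)]

    stay = 1.0 - switch_prob
    move = switch_prob / (n_states - 1)

    def transition(vec, k):
        # Symmetric transition: stay in set k, or arrive from any other set.
        return stay * vec[k] + move * (sum(vec) - vec[k])

    def normalize(vec):
        total = sum(vec)
        return [x / total for x in vec]

    # Forward pass with a uniform prior over haplotype sets.
    fwd = [normalize(list(emissions[0]))]
    for t in range(1, n_spans):
        fwd.append(normalize(
            [emissions[t][k] * transition(fwd[-1], k) for k in range(n_states)]
        ))

    # Backward pass; rescaling each vector leaves the posteriors unchanged.
    bwd = [[1.0] * n_states for _ in range(n_spans)]
    for t in range(n_spans - 2, -1, -1):
        weighted = [emissions[t + 1][k] * bwd[t + 1][k] for k in range(n_states)]
        bwd[t] = normalize([transition(weighted, k) for k in range(n_states)])

    # Posterior for each span is proportional to forward times backward.
    return [
        normalize([f * b for f, b in zip(fwd[t], bwd[t])])
        for t in range(n_spans)
    ]


if __name__ == "__main__":
    # The middle span is uninformative on its own; its neighbors resolve it.
    spans = [[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]]
    for row in haplotype_set_posteriors(spans):
        print([round(p, 3) for p in row])
```

In this sketch, a reference span with ambiguous read evidence borrows information from adjacent spans through the transition term, which is the property that makes an HMM a natural fit for imputation across spans.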
[0031] Furthermore, in various embodiments, the personalized sequencing system can utilize a variety of methods of scoring sets of candidate population haplotypes. In some embodiments, for example, the personalized sequencing system utilizes a Variational Bayesian model implementing an iterative algorithm to generate haplotype set scores for the respective sets of candidate population haplotypes. Also, in one or more embodiments, the personalized sequencing system categorizes nucleotide reads as inherited from respective first and second parents, generates individual haplotype scores for the categorized reads, and generates haplotype set scores for pairs of candidate population haplotypes by combining individual haplotype scores from nucleotide reads inherited from the first parent with individual haplotype scores from nucleotide reads inherited from the second parent.
[0032] To facilitate determining haplotype set scores for sets of candidate population haplotypes, in some embodiments, the personalized sequencing system determines initial alignments of the set of nucleotide reads with respective genomic regions of the reference genome. By identifying subsets of nucleotide reads that, according to the initial alignments, align to respective reference spans of the set of reference spans, the personalized sequencing system can generate the haplotype set scores for each respective reference span within the set of reference spans. Moreover, in one or more embodiments, the personalized sequencing system determines the initial alignments by (1) identifying, as indicated within the population haplotype database, allele-variant differences between population haplotypes and a primary contiguous sequence at respective genomic regions of the reference genome, and (2) rescoring candidate reference alignments of the set of nucleotide reads with the primary contiguous sequence according to the identified allele-variant differences.
[0033] Having determined, via the aforementioned alignment method or by another method, initial alignments for a set of nucleotide reads, in one or more embodiments, the personalized sequencing system generates an alignment data file including information for generating a personalized haplotype database. For example, such an alignment data file (e.g., a personalized Binary Alignment Map (BAM) file) can include the initial alignments of the set of nucleotide reads, alignment scores corresponding to the initial alignments, and one or more population haplotypes identified for each nucleotide read of the set of nucleotide reads. Subsequently, in some embodiments, the personalized sequencing system utilizes the alignment data file to evaluate and select population haplotypes for a personalized haplotype database. By utilizing the personalized haplotype database, the personalized sequencing system can determine one or more personalized alignments of nucleotide reads from the set of nucleotide reads and output a personalized alignment data file with the personalized alignments and corresponding alignment scores.
[0034] As suggested above, the personalized sequencing system provides several technical advantages, benefits, and/or improvements over existing sequencing systems and methods. For example, the personalized sequencing system improves the accuracy of read alignments and subsequent genomic analysis by utilizing a personalized haplotype database for a genomic sample. More specifically, in some embodiments, the personalized sequencing system generates a personalized haplotype database including a subset of population haplotypes that is selected for a genomic sample from a population haplotype database. By utilizing the personalized haplotype database in determining personalized alignments of nucleotide reads, the personalized sequencing system can more accurately align nucleotide reads with a corresponding reference genome, especially in more complex or “difficult” genomic regions (e.g., regions comprising lower confidence base calls in general), than existing sequencing systems that utilize reference genomes augmented by an unfiltered or unnecessarily large set of population haplotypes (e.g., 15-20 haplotypes per region). Due to the improved alignment with the reference genome, the personalized sequencing system can also determine more accurate genotype calls and/or variant calls with a higher confidence that such calls match or differ from the reference base of a reference genome compared to existing sequencing systems. This disclosure describes and depicts examples of such improved genotype and/or variant calls below in relation to FIGS. 8-9.
[0035] Moreover, in some implementations, the personalized sequencing system further improves the accuracy of mapping and alignment and subsequent variant calling in part by avoiding reference bias, which is a type of bias in favor of aligning nucleotide reads with alleles of the corresponding reference genome. As mentioned above, for example, existing sequencing systems often encounter reference bias when mapping nucleotide reads from a diploid sample to a haploid reference genome. By contrast, the personalized sequencing system, in at least some implementations, provides a personalized haplotype database that avoids reference bias by comprising a customized diploid reference genome for mapping nucleotide reads in diploid regions of a corresponding reference genome. Indeed, the personalized sequencing system can further improve variant calling accuracy relative to existing sequencing systems by generating a personalized haplotype database for personalized mapping and alignment of nucleotide reads from a variety of polyploid genomic samples.
[0036] In addition to improving the accuracy of alignment and related sequencing analysis, the personalized sequencing system improves computational efficiency over existing sequencing systems. By utilizing a personalized haplotype database for a particular genomic sample, for example, the personalized sequencing system can accurately determine personalized read alignments for nucleotide reads with improved computational speed and less memory relative to existing sequencing systems. In particular, as mentioned above, existing sequencing systems often determine read alignments by attempting to align and score nucleotide reads with a robust graph genome augmented by numerous alternative contiguous sequences representing an unfiltered set of population haplotypes. In contrast, the personalized sequencing system utilizes a personalized haplotype database comprising a discrete subset of population haplotypes particularly selected for a given genomic sample, resulting in improved alignment accuracy, reduced memory consumption, and increased processing speeds.
[0037] Moreover, in at least some implementations, the personalized sequencing system can accurately predict initial read alignments and/or personalized read alignments, according to the disclosed methods, while improving the computing speed and memory usage relative to existing sequencing systems. As noted above, existing sequencing systems use graph reference genomes with generic graph augmentations including numerous and redundant alternate contiguous sequences that consume memory with the repeated sequences from overlapping portions of alternate contiguous sequences and slow down computer processing by scoring alignments between reads and such overlapping portions of alternate contiguous sequences. In contrast to such existing systems, the personalized sequencing system can expedite mapping and alignment of nucleotide reads at least by: (i) determining candidate reference alignments of nucleotide reads with a primary contiguous sequence at respective genomic regions of a reference genome and (ii) determining initial alignments or personalized alignments of the nucleotide reads by rescoring the candidate reference alignments according to allele-variant differences between the primary contiguous sequence and population haplotypes at the respective genomic regions of the reference genome.
[0038] In addition or in the alternative to increased computational efficiency, the personalized sequencing system can generate a personalized haplotype database for determining personalized alignments with greater flexibility compared to existing sequencing systems. To illustrate, in some embodiments, the personalized sequencing system generates a personalized haplotype database for a genomic sample based on comparing nucleotide reads from a genomic sample with candidate population haplotypes. Accordingly, the personalized sequencing system can generate alignments of nucleotide reads with improved accuracy and efficiency, without requiring additional information regarding the genomic sample, such as parental genomic data or demographic data associated with the sample specimen. In certain implementations, for example, the personalized sequencing system further increases flexibility by implementing a diploid personalized haplotype database that more flexibly facilitates accurate mapping and alignment and/or variant calling relative to existing sequencing systems that utilize haploid reference genomes for mapping and alignment of diploid genomic samples.
[0039] As suggested by the foregoing discussion, this disclosure utilizes a variety of terms to describe features and benefits of the personalized sequencing system. Additional detail is hereafter provided regarding the meaning of these terms as used in this disclosure. As used in this disclosure, for instance, the term “genomic sample” refers to a target genome or portion of a genome undergoing an assay or sequencing. For example, a genomic sample includes one or more sequences of nucleotides isolated or extracted from a sample organism (or a copy of such an isolated or extracted sequence). In particular, a genomic sample includes a full genome that is isolated or extracted (in whole or in part) from a sample organism and composed of nitrogenous heterocyclic bases. A genomic sample can include a segment of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or other polymeric forms of nucleic acids or chimeric or hybrid forms of nucleic acids noted below. In some cases, the genomic sample is found in a sample prepared or isolated by a kit and received by a sequencing device.
[0040] Also, as used herein, the term “nucleotide read” (or simply “read”) refers to an inferred or predicted sequence of one or more nucleotide bases (or nucleobase pairs) from all or part of a sample genomic sequence (e.g., a sample genomic sequence, complementary DNA). In particular, a nucleotide read includes a determined or predicted sequence of nucleobase calls for a nucleotide fragment (or group of monoclonal nucleotide fragments) from a sequencing library corresponding to a genomic sample. For example, in some embodiments, the personalized sequencing system determines a nucleotide read by generating nucleobase calls for nucleobases passed through a nanopore of a nucleotide-sample slide, determined via fluorescent tagging, or determined from a well in a flow cell. In some cases, a nucleotide read can refer to a particular type of read, such as a nucleotide read synthesized from sample library fragments that are shorter than a threshold number of nucleobases (e.g., SBS reads). In these or other cases, another type of nucleotide read can refer to (i) assembled nucleotide reads that have been assembled from shorter nucleotide reads to form a contiguous sequence (e.g., assembled nucleotide reads) satisfying a threshold number of nucleobases, (ii) circular consensus sequencing (CCS) reads satisfying the threshold number of nucleobases, or (iii) nanopore long reads satisfying the threshold number of nucleobases.
[0041] As further used herein, the term “genomic coordinate” (or sometimes simply “coordinate”) refers to a particular location or position of a nucleobase within a genome (e.g., an organism’s genome or a reference genome). In some cases, a genomic coordinate includes an identifier for a particular chromosome of a genome and an identifier for a position of a nucleobase within the particular chromosome. For instance, a genomic coordinate or coordinates may include a number, name, or other identifier for a somatic or sex chromosome (e.g., chr1 or chrX) and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570 or chr1:1234570-1234870). In some cases, a genomic coordinate refers to a genomic coordinate on a sex chromosome (e.g., chrX or chrY). Consequently, the personalized sequencing system can determine genotype probabilities for a genotype call (e.g., a variant call) for a genomic coordinate on a sex chromosome. Further, in certain implementations, a genomic coordinate refers to a source of a reference genome (e.g., mt for a mitochondrial DNA reference genome or SARS-CoV-2 for a reference genome for the SARS-CoV-2 virus) and a position of a nucleobase within the source for the reference genome (e.g., mt:16568 or SARS-CoV-2:29001). By contrast, in certain cases, a genomic coordinate refers to a position of a nucleobase within a reference genome without reference to a chromosome or source (e.g., 29727).
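By way of illustration only, the coordinate forms described above (a chromosome or source identifier followed by a position or a range of positions, or a bare position) can be parsed with a short Python sketch. The `GenomicCoordinate` type, the `parse_coordinate` function, and the regular expression are hypothetical names and choices made for this example rather than part of the disclosed system.

```python
import re
from typing import NamedTuple, Optional


class GenomicCoordinate(NamedTuple):
    source: Optional[str]  # chromosome or reference source, e.g. "chr1" or "mt"
    start: int             # first position
    end: int               # last position; equals start for a single position


# Optional "<source>:" prefix, then a position, then an optional "-<end>".
_COORD_RE = re.compile(r"^(?:(?P<source>.+):)?(?P<start>\d+)(?:-(?P<end>\d+))?$")


def parse_coordinate(text: str) -> GenomicCoordinate:
    """Parse strings such as 'chr1:1234570-1234870', 'mt:16568', or '29727'."""
    match = _COORD_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized genomic coordinate: {text!r}")
    start = int(match.group("start"))
    end = int(match.group("end")) if match.group("end") else start
    if end < start:
        raise ValueError(f"range end precedes start: {text!r}")
    return GenomicCoordinate(match.group("source"), start, end)


if __name__ == "__main__":
    for example in ("chr1:1234570", "chr1:1234570-1234870",
                    "mt:16568", "SARS-CoV-2:29001", "29727"):
        print(example, "->", parse_coordinate(example))
```

Because the source identifier may itself contain hyphens (e.g., SARS-CoV-2), the sketch splits on the last colon and only treats a hyphen after the colon as a range separator.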
[0042] As used herein, a “genomic region” refers to a range of genomic coordinates. Like genomic coordinates, in certain implementations, a genomic region may be identified by an identifier for a chromosome and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570-1234870). In various implementations, a genomic coordinate includes a position within a reference genome. In some cases, a genomic coordinate is specific to a particular reference genome. Relatedly, as used herein, the term “reference span” refers to a span of nucleobase positions within a linear reference genome. In other words, a reference span includes a span of nucleobases between two respective genomic coordinates of the linear reference genome.
[0043] As noted above, a genomic coordinate includes a position within a reference genome. Such a position may be within a particular reference genome. As used herein, the term “reference genome” refers to a digital nucleic acid sequence assembled as a representative example (or representative examples) of genes and other genetic sequences of an organism. Regardless of the sequence length, in some cases, a reference genome represents an example set of genes or a set of nucleic acid sequences in a digital nucleic acid sequence determined by scientists as representative of an organism of a particular species. For example, a linear human reference genome may be GRCh38 or other versions of reference genomes from the Genome Reference Consortium. As noted above, in some cases, a reference genome includes multi-base codes. As a further example, a reference genome may include a graph reference genome that includes both a linear reference genome and paths representing nucleic acid sequences from ancestral haplotypes, such as Illumina DRAGEN Graph Reference Genome hg19.
[0044] As used herein, the term “primary contiguous sequence” (or simply “primary contig”) refers to a contiguous sequence representing a reference haplotype of the reference genome. In some embodiments, a primary contiguous sequence digitally represents a reference haplotype of a reference genome but can include additional information from a primary assembly of the linear reference genome, such as indications of population variants in certain genomic regions to aid in identifying candidate alignments of nucleotide reads.
[0045] By contrast, the term “alternate contiguous sequence” (or simply “alt contig”) refers to a contiguous sequence representing a population haplotype at particular genomic coordinates of a reference genome. For example, in some sequencing systems, a graph reference genome includes alternate contiguous sequences mapped to genomic coordinates of a primary contiguous sequence for a linear reference genome. In some cases, a hash table for a graph reference genome includes identifiers that associate alternate contiguous sequences representing population haplotypes at genomic coordinates relative to a linear reference genome. Critically, as explained and depicted in this disclosure, a personalized haplotype database includes data corresponding to a limited sampling of population haplotypes selected for a particular genomic sample.
[0046] Relatedly, as used herein, the term “allele-variant difference” refers to differences between respective nucleobases of two or more given nucleotide sequences. In some cases, for example, allele-variant differences are differences between the primary contiguous sequence and at least one population haplotype (e.g., as represented by an alternate contiguous sequence). In some embodiments, for example, allele-variant differences within a given genomic region can include single nucleotide variants, multiple base differences, and/or insertions and deletions (indels) of population haplotypes relative to a primary contiguous sequence. Also, allele-variant differences can refer to differences between a first population haplotype and a second population haplotype.
[0047] As used herein, the term “locally distinct population haplotype” or “locally distinct haplotype” refers to a haplotype comprising a set of at least one allele-variant difference, where the set is unique relative to other haplotypes within a respective genomic region of a reference genome. Each genomic region or reference span of a reference genome, for example, can include one or more locally distinct population haplotypes having a unique set of one or more allele-variant differences relative to other population haplotypes within the respective genomic region or reference span. Also, in some embodiments, a given set of one or more allele-variant differences within a genomic region corresponding to a candidate read alignment can represent multiple haplotypes due to a complete overlap of variants within the genomic region. Accordingly, in certain cases, multiple haplotypes comprising or consisting of identical nucleobases within a given genomic region can be represented by a single locally distinct population haplotype.
[0048] Moreover, as used herein, the term “alignment score” refers to a numeric score, metric, or other quantitative measurement evaluating an accuracy of an alignment between one or more nucleotide reads or a fragment of a nucleotide read and another nucleotide sequence from a reference genome. In particular, an alignment score includes a metric indicating a degree to which the nucleobases of one or more nucleotide reads (or a fragment thereof) match or are similar to a reference sequence or an alternate contiguous sequence from a reference genome. In certain implementations, an alignment score takes the form of a Smith-Waterman score or a variation or version of a Smith-Waterman score for local alignment, such as various settings or configurations used by DRAGEN by Illumina, Inc. for Smith-Waterman scoring.
[0049] Relatedly, as used herein, the term “mapping-quality score” refers to a metric or other measurement quantifying a quality or certainty of an alignment of nucleotide reads (or other nucleotide sequences or subsequences) with a reference genome. In some embodiments, for example, a mapping-quality score includes mapping-quality (MAPQ) scores for nucleobase calls at genomic coordinates, where a MAPQ score represents -10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. In the alternative to a mean or median mapping quality, in some implementations, a mapping-quality score includes a full distribution of mapping qualities for all nucleotide reads aligning with a reference genome at a genomic coordinate. In some embodiments, MAPQ scores are partitioned into sequential bins (e.g., Q-score bins denoted by “MAPQ0,” “MAPQ10,” “MAPQ20,” and so forth) representing different MAPQ scores for each bin. Also, in some embodiments, any MAPQ scores above a predetermined threshold score are associated with a maximum value bin (e.g., up to “MAPQ40”). Relatedly, as used herein, the term “extended mapping-quality score” or “extended MAPQ score” refers to MAPQ scores that are partitioned into bins including a higher maximum value bin (e.g., up to “MAPQ60”) than conventionally implemented. In some embodiments, for example, the personalized sequencing system implements extended MAPQ scores when performing an initial mapping and alignment of a given set of nucleotide reads and utilizes the extended MAPQ scores to compare the nucleotide reads with candidate population haplotypes to generate haplotype set scores (e.g., as described below in relation to FIGS. 3A-3B and 4).
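For illustration, the MAPQ relationship and the binning described above can be sketched in Python as follows. The function names `mapq` and `mapq_bin`, the bin width of 10, and the handling of a zero error probability are assumptions made for this example; only the formula and the example bin labels come from the text above.

```python
import math


def mapq(p_wrong: float, max_score: int = 40) -> int:
    """MAPQ = -10 * log10(Pr{mapping position is wrong}), rounded and capped.

    max_score=40 mirrors the conventional maximum bin mentioned above; pass
    max_score=60 to mimic an extended MAPQ range.
    """
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("p_wrong must be a probability in [0, 1]")
    if p_wrong == 0.0:
        # Assumed convention: a zero error probability maps to the cap.
        return max_score
    return min(max_score, int(round(-10.0 * math.log10(p_wrong))))


def mapq_bin(score: int, width: int = 10, max_bin: int = 40) -> str:
    """Label the sequential bin holding a MAPQ score (assumed width of 10)."""
    floor = (score // width) * width
    return "MAPQ{}".format(min(max_bin, floor))


if __name__ == "__main__":
    for p in (0.5, 0.01, 0.001, 1e-9):
        conventional = mapq(p)
        extended = mapq(p, max_score=60)
        print(p, conventional, mapq_bin(conventional),
              extended, mapq_bin(extended, max_bin=60))
```

Under these assumptions, a very confident alignment (e.g., an error probability of 1e-9) saturates at the conventional maximum bin, whereas the extended range preserves a distinction between confident and very confident alignments.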
[0050] As further used herein, the term “genotype call” refers to a determination or prediction of a particular genotype of a genomic sample or a sample nucleotide sequence at a genomic locus. In particular, a genotype call can include a prediction of a particular genotype of a genomic sample with respect to a reference genome or a reference sequence at a genomic coordinate or a genomic region. For instance, in some cases, a genotype call includes a determination or prediction that a genomic sample comprises both a nucleobase and a complementary nucleobase at a genomic coordinate that is either homozygous or heterozygous for a reference base or a variant (e.g., homozygous reference bases represented as 0|0 or heterozygous for a variant on a particular strand represented as 0|1). Accordingly, a genotype call can include a prediction of a variant or reference base for one or more alleles of a genomic sample and indicate zygosity with respect to a variant or reference base. A genotype call is often determined for a genomic coordinate or genomic region at which an SNP, insertion, deletion, or other variant has been identified for a population of organisms.
[0051] As further used herein, the term “nucleobase call” (or simply “base call”) refers to a determination or prediction of a particular nucleobase (or nucleobase pair) for an oligonucleotide (e.g., nucleotide read) during a sequencing cycle or for a genomic coordinate of a genomic sample. In particular, a nucleobase call can indicate a determination or prediction of the type of nucleobase that has been incorporated within an oligonucleotide on a nucleotide-sample slide (e.g., read-based nucleobase calls). In some cases, for a nucleotide read, a nucleobase call includes a determination or a prediction of a nucleobase based on intensity values resulting from fluorescent-tagged nucleotides added to an oligonucleotide of a nucleotide-sample slide (e.g., in a cluster of a flow cell). As suggested above, a single nucleobase call can be an adenine (A) call, a cytosine (C) call, a guanine (G) call, a thymine (T) call, or a uracil (U) call.
[0052] As used herein, the term “variant” refers to a nucleobase or multiple nucleobases that do not align with, differ from, or vary from a corresponding nucleobase (or nucleobases) in a reference sequence or a reference genome. For example, a variant includes an SNP, an indel, or a structural variant that indicates nucleobases in a sample nucleotide sequence that differ from nucleobases in corresponding genomic coordinates of a reference sequence or a reference genome.
[0053] Along these lines, a “variant call” (or “variant nucleobase call”) refers to a nucleobase call comprising a mutation or a variant at a particular genomic coordinate or genomic region with respect to a reference. In particular, a variant call includes a determination or prediction that a genomic sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that differs from a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome. Conversely, a “reference call” (or “non-variant nucleobase call” or “non-variant call”) refers to a nucleobase call comprising a non-variant or a reference nucleobase at a genomic coordinate or a genomic region with respect to a reference. In particular, a non-variant or reference call includes a determination or prediction that a genomic sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that matches a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome.
[0054] As used herein, the term “alignment data file” refers to a digital file that indicates mapping and alignment information for nucleotide reads of a sample nucleotide sequence. For example, an alignment data file can include a binary alignment map (BAM) file, a compressed reference-oriented alignment map (CRAM) file, or another file indicating nucleotide reads of a sample nucleotide sequence. Also, as discussed below (e.g., in relation to FIGS. 3A-3B), an alignment data file can include further information regarding nucleotide reads, mapping and alignment results, population haplotype data, and so forth.
[0055] As further used herein, the term “population haplotype” refers to nucleotide sequences that are present in an organism (or present in organisms from a population) and inherited from one or more ancestors. In particular, a population haplotype can include alleles or other nucleotide sequences that are present in organisms of a population and inherited together by such organisms respectively from a single parent. In one or more embodiments, population haplotypes include a set of SNPs or other variants on the same chromosome that tend to be inherited together. In some cases, data representing a population haplotype, or a set of different population haplotypes, are stored or otherwise accessible on a population haplotype database. As mentioned, in some embodiments, the personalized sequencing system also generates a personalized haplotype database comprising a customized selection of the population haplotypes imported from a particular population haplotype database.
[0056] Relatedly, as used herein, the term “population haplotype database” refers to a database encoding variant data for population haplotypes of a sample organism. In particular, a population haplotype database refers to an unfiltered compilation of population haplotypes or, in other words, a complete compilation of population haplotypes prior to personalization according to the methods disclosed herein. In one or more embodiments, a population haplotype database includes complete or partially complete nucleotide sequences (e.g., alternate contiguous sequences) for population haplotypes of a sample organism. Alternatively, in some embodiments, a population haplotype database encodes variant data for population haplotypes having allele-variant differences from locally distinct population haplotypes within respective genomic regions of a corresponding reference genome. For example, in some embodiments, the population haplotype database comprises a haplotype data structure comprising a hierarchical partitioning of different genomic regions of the reference genome into a collection of bins covering respective spans of a linear reference genome (e.g., as represented by a primary contiguous sequence).
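As one hypothetical way to picture a hierarchical partitioning of a linear reference genome into bins, the Python sketch below assigns a genomic region to the smallest bin that fully contains it, with bin widths growing by a fixed factor at each level. The function name `assign_bin`, the base width of 1,024 positions, the factor of 8, and the number of levels are illustrative assumptions; this passage does not specify the actual layout of the haplotype data structure.

```python
def assign_bin(start, end, base_width=1024, factor=8, levels=4):
    """Return (level, index) of the smallest bin containing [start, end).

    Level 0 holds the narrowest bins (base_width positions each); each higher
    level widens the bins by `factor`. A region that crosses a bin boundary at
    every level falls into a single catch-all bin at level `levels`.
    All parameters are assumed values chosen for illustration.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    width = base_width
    for level in range(levels):
        # The region fits if its first and last positions share a bin.
        if start // width == (end - 1) // width:
            return level, start // width
        width *= factor
    return levels, 0


if __name__ == "__main__":
    print(assign_bin(100, 200))      # fits inside one narrow bin
    print(assign_bin(1000, 1100))    # straddles a narrow-bin boundary
    print(assign_bin(5000, 5100))    # fits inside a later narrow bin
```

A layout of this kind lets a lookup for a given reference span visit only the bins that can overlap the span, one or a few per level, rather than scanning every stored haplotype record.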
[0057] Relatedly, as used herein, the term “personalized haplotype database,” interchangeable with “customized haplotype database,” refers to a haplotype database comprising a personalized (or customized) subset of population haplotypes for a given genomic sample. For instance, a personalized (or customized) haplotype database can include one or more population haplotypes selected for a sample genome from a population haplotype database as described above. In some embodiments, for example, a personalized (or customized) haplotype database can include two population haplotypes for each of one or more genomic regions of the corresponding reference genome identified as diploid. Similarly, in some embodiments, for genomic regions corresponding to sex chromosomes of a sample genome, a personalized (or customized) haplotype database can include a single population haplotype. Indeed, any number of population haplotypes can be included within a personalized (or customized) haplotype database according to the methods disclosed herein.
[0058] As also used herein, the term “haplotype set score” refers to a metric or other measurement quantifying a likelihood or probability that nucleotide reads from a given genomic sample correspond to a respective set of one or more population haplotypes within a particular genomic region. In some embodiments, for example, haplotype set scores quantify similarities between haplotype sets comprising pairs of population haplotypes and nucleotide reads within a diploid region of a corresponding reference genome. Similarly, in some embodiments, haplotype set scores quantify similarities between nucleotide reads and individual haplotypes within a haploid genomic region (e.g., sex chromosomes) of a corresponding reference genome. Furthermore, in
some embodiments, haplotype set scores quantify similarities between nucleotide reads and sets of multiple haplotypes within a polyploid genomic region of a corresponding reference genome (e.g., such that each set of multiple haplotypes comprises a same quantity of haplotypes as a quantity of chromosome sets in the polyploid genomic region of the genomic sample). Also, in some embodiments, a haplotype set score can include a haplotype set likelihood or a haplotype set posterior probability, such as described in relation to FIG. 5.
[0059] Also, as used herein, the term “personalized alignment,” interchangeable with “customized alignment,” refers to a read alignment generated utilizing a personalized (or customized) haplotype database as described herein. For example, a personalized (or customized) haplotype database, generated for a particular genomic sample according to the methods disclosed herein, can be utilized in place of an unfiltered population haplotype database to generate one or more personalized (or customized) alignments of nucleotide reads from the particular genomic sample.
[0060] The following paragraphs describe the personalized sequencing system with respect to illustrative figures that portray example embodiments and implementations. For example, FIG. 1 illustrates a schematic diagram of a computing system 100 in which a personalized sequencing system 106 operates in accordance with one or more embodiments. As illustrated, the computing system 100 includes a sequencing device 102 connected to a local device 108 (e.g., a local server device), one or more server device(s) 110, and a client device 114. As shown in FIG. 1, the sequencing device 102, the local device 108, the server device(s) 110, and the client device 114 can communicate with each other via a network 118. The network 118 comprises any suitable network over which computing devices can communicate. Example networks are discussed in additional detail below with respect to FIG. 11. While FIG. 1 shows an embodiment of the personalized sequencing system 106, this disclosure describes alternative embodiments and configurations below.
[0061] As indicated by FIG. 1, the sequencing device 102 comprises a computing device and a sequencing device system 104 for sequencing a genomic sample or other nucleic-acid polymer. In some embodiments, by executing the sequencing device system 104 using a processor, the sequencing device 102 analyzes nucleotide fragments or oligonucleotides extracted from genomic samples to generate nucleotide reads or other data utilizing computer-implemented methods and systems either directly or indirectly on the sequencing device 102. More particularly, the sequencing device 102 receives nucleotide-sample slides (e.g., flow cells) comprising nucleotide fragments extracted from samples and further copies and determines the nucleobase sequence of such extracted nucleotide fragments.
[0062] In one or more embodiments, the sequencing device 102 utilizes sequencing-by-synthesis (SBS) techniques to sequence nucleotide fragments into nucleotide reads and determine nucleobase calls for the nucleotide reads. In addition or in the alternative to communicating across the network 118, in some embodiments, the sequencing device 102 bypasses the network 118 and communicates directly with the local device 108 or the client device 114. By executing the sequencing device system 104, the sequencing device 102 can further store the nucleobase calls as part of base-call data that is formatted as a binary base call (BCL) file and send the BCL file to the local device 108 and/or the server device(s) 110.
[0063] As further indicated by FIG. 1, the local device 108 is located at or near the same physical location as the sequencing device 102. Indeed, in some embodiments, the local device 108 and the sequencing device 102 are integrated into a same computing device. The local device 108 may run the personalized sequencing system 106 to generate, receive, analyze, store, and transmit digital data, such as by receiving base-call data or determining variant calls based on analyzing such base-call data. As shown in FIG. 1, the sequencing device 102 may send (and the local device 108 may receive) base-call data generated during a sequencing run of the sequencing device 102. By executing software in the form of the personalized sequencing system 106, the local device 108 may align nucleotide reads with a reference genome utilizing a personalized haplotype database 112 and determine genetic variants based on the aligned nucleotide reads. The local device 108 may also communicate with the client device 114. In particular, the local device 108 can send data to the client device 114, including a binary alignment map (BAM) file, a variant call format (VCF) file, or other information indicating nucleobase calls, sequencing metrics, error data, or other metrics.
[0064] As further indicated by FIG. 1, the server device(s) 110 are located remotely from the local device 108 and the sequencing device 102. Similar to the local device 108, in some embodiments, the server device(s) 110 include a version of (or are otherwise able to access or implement) the personalized sequencing system 106. Accordingly, the server device(s) 110 may generate, receive, analyze, store, and transmit digital data, such as by receiving base-call data or determining variant calls based on analyzing such base-call data. As indicated above, the sequencing device 102 may send (and the server device(s) 110 may receive) base-call data generated by the sequencing device 102. The server device(s) 110 may also communicate with the client device 114. In particular, the server device(s) 110 can send data to the client device 114, including BAM files, VCF files, or other sequencing-related information.
[0065] In some embodiments, the server device(s) 110 comprise a distributed collection of servers where the server device(s) 110 include a number of server devices distributed across the network 118 and located in the same or different physical locations. Further, the server device(s)
110 can comprise a content server, an application server, a communication server, a web-hosting server, or another type of server. Moreover, as shown in FIG. 1, the server device(s) 110 are in communication, either directly or via the network 118, with a population haplotype database 120 storing population haplotypes to be evaluated by the personalized sequencing system 106 when generating the personalized haplotype database 112 for a genomic sample.
[0066] As indicated above, as part of the server device(s) 110 or the local device 108, the personalized sequencing system 106 can generate, encode, and/or implement the personalized haplotype database 112 to determine personalized alignments of nucleotide reads from a genomic sample with a reference genome. In some embodiments, for example, the personalized sequencing system 106 can generate the personalized haplotype database 112 for a genomic sample (e.g., based on a set of nucleotide reads from the genomic sample) and utilize the generated personalized haplotype database 112 to determine one or more personalized alignments of nucleotide reads from the genomic sample, as described in greater detail below in relation to the subsequent figures.
[0067] As further illustrated and indicated in FIG. 1, by executing a sequencing application 116, the client device 114 can generate, store, receive, and send digital data. In particular, the client device 114 can receive sequencing data from the local device 108 or receive call files (e.g., BCL) and sequencing metrics from the sequencing device 102. Furthermore, the client device 114 may communicate with the local device 108 or the server device(s) 110 to receive a VCF comprising genotype or variant calls and/or other metrics, such as base-call-quality metrics or pass-filter metrics. The client device 114 can accordingly present or display information pertaining to variant calls or other genotype calls within a graphical user interface of the sequencing application 116 to a user associated with the client device 114. For example, the client device 114 can present genotype calls, variant calls, and/or sequencing metrics for a sequenced genomic sample within a graphical user interface of the sequencing application 116.
[0068] Although FIG. 1 depicts the client device 114 as a desktop or laptop computer, the client device 114 may comprise various types of client devices. For example, in some embodiments, the client device 114 includes non-mobile devices, such as desktop computers or servers, or other types of client devices. In yet other embodiments, the client device 114 includes mobile devices, such as laptops, tablets, mobile telephones, or smartphones. Additional details regarding the client device 114 are discussed below with respect to FIG. 11.
[0069] As further illustrated in FIG. 1, the client device 114 includes the sequencing application 116. The sequencing application 116 may be a web application or a native application stored and executed on the client device 114 (e.g., a mobile application, desktop application). The sequencing application 116 can include instructions that (when executed) cause the client device
114 to receive data from the personalized sequencing system 106 and present, for display at the client device 114, base-call data or data from an alignment data file or VCF. Furthermore, the sequencing application 116 can instruct the client device 114 to display summaries for multiple sequencing runs.
[0070] As further illustrated in FIG. 1, a version of the personalized sequencing system 106 may be located and/or implemented (e.g., entirely or in part) on the client device 114 or the sequencing device 102. In yet other embodiments, the personalized sequencing system 106 is implemented by one or more other components of the computing system 100, such as the local device 108. In particular, the personalized sequencing system 106 can be implemented in a variety of different ways across the sequencing device 102, the local device 108, the server device(s) 110, and the client device 114. For example, the personalized sequencing system 106 can be downloaded from the server device(s) 110 to the personalized sequencing system 106 and/or the local device 108 where all or part of the functionality of the personalized sequencing system 106 is performed at each respective device within the computing system 100.
[0071] As previously mentioned, in some embodiments, the personalized sequencing system 106 evaluates, for a set of reference spans of a reference genome for a genomic sample, sets of candidate population haplotypes for inclusion within a personalized haplotype database for the genomic sample. To illustrate, FIG. 2 depicts an example of a set of nucleotide reads 202 and a set of reference spans 204a, 204b, through 204n of a reference genome. In particular, FIG. 2 illustrates the personalized sequencing system 106 identifying nucleotide reads of the set of nucleotide reads 202 within each reference span of the set of reference spans 204a-204n.
[0072] In one or more embodiments, for example, the personalized sequencing system 106 compares nucleotide reads, such as the set of nucleotide reads 202, with sets of candidate population haplotypes within one or more genomic regions of a reference genome, such as the set of reference spans 204a-204n, to determine a likelihood that each set of candidate haplotypes represents the genomic sample within each respective genomic region. In various embodiments, the personalized sequencing system 106 can partition a reference genome into a variety of genomic regions, such as but not limited to reference spans corresponding to chromosomes of the genomic sample, reference spans covering a predetermined number of nucleobases of a primary contiguous sequence (e.g., 1,000 nucleobases; 10,000 nucleobases; 1 million nucleobases), or a single reference span covering some or all of a genomic sample. Accordingly, the personalized sequencing system 106 can partition a reference genome (or a portion thereof) into a set of reference spans, such as the reference spans 204a-204n, compare sets of population haplotypes with nucleotide reads aligned to each respective reference span, and select a set of population haplotypes
for each respective reference span to include within a personalized haplotype database (e.g., as described below in relation to FIGS. 3A-3B and 4-5).
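As a non-limiting illustration, the following Python sketch shows one way a linear reference could be partitioned into fixed-length reference spans and initially aligned nucleotide reads grouped by the spans they at least partially overlap. The function names `partition_reference` and `reads_per_span`, the half-open coordinate convention, and the example coordinates are assumptions made for illustration only and are not taken from this disclosure.

```python
from collections import defaultdict


def partition_reference(reference_length, span_length):
    """Split [0, reference_length) into consecutive half-open spans."""
    return [
        (start, min(start + span_length, reference_length))
        for start in range(0, reference_length, span_length)
    ]


def reads_per_span(alignments, spans):
    """Group read ids by every span that their aligned interval overlaps.

    alignments: iterable of (read_id, start, end) in half-open coordinates.
    spans: output of partition_reference (equal length except the last span).
    """
    span_length = spans[0][1] - spans[0][0]
    grouped = defaultdict(list)
    for read_id, start, end in alignments:
        first = start // span_length
        last = min((end - 1) // span_length, len(spans) - 1)
        for index in range(first, last + 1):
            grouped[index].append(read_id)
    return dict(grouped)


spans = partition_reference(reference_length=25_000, span_length=10_000)
groups = reads_per_span(
    [("r1", 100, 250), ("r2", 9_950, 10_100), ("r3", 24_900, 25_000)],
    spans,
)
```

In this sketch, a read that straddles a span boundary (here `r2`) is listed under both spans, consistent with reads that "at least partially align" with a given reference span.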
[0073] As shown in FIG. 2, the personalized sequencing system 106 identifies nucleotide reads within each reference span of the set of reference spans 204a-204n. In some embodiments, for example, the personalized sequencing system 106 determines initial alignments of the set of nucleotide reads 202 with respective genomic regions of the reference genome (e.g., as further described below in relation to FIG. 3A). Subsequently, the personalized sequencing system 106 can identify one or more nucleotide reads of the initially aligned set of nucleotide reads 202 that at least partially align with respective reference spans of the set of reference spans 204a-204n and compare the identified nucleotide reads (or portions thereof) with sets of candidate population haplotypes within each respective reference span of the set of reference spans 204a-204n.
[0074] Alternatively, in some embodiments, the personalized sequencing system 106 identifies one or more distinct k-mers within the set of nucleotide reads 202 for each candidate population haplotype within a given reference span of the set of reference spans 204a-204n. For example, the personalized sequencing system 106 can identify distinct nucleotide k-mers (or partially distinct k-mers) of a length k within one or more candidate population haplotypes and compare the identified distinct k-mers of the candidate population haplotypes with the set of nucleotide reads 202 within the given reference span of the set of reference spans 204a-204n. Indeed, the personalized sequencing system 106 can utilize various methods of identifying nucleotide reads within respective reference spans of a reference genome, not limited to the methods described herein.
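As a non-limiting illustration, the following Python sketch shows one possible alignment-free comparison based on distinct k-mers. It treats a k-mer as distinct when it occurs in exactly one candidate haplotype within the span, and it counts how many distinct k-mers of each haplotype appear in each read. The function names, the exact-match criterion, and the toy sequences are assumptions for illustration; the disclosure does not specify how distinctness or matching is computed.

```python
from collections import Counter


def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def distinct_kmers(haplotypes, k):
    """Map each haplotype name to the k-mers found in no other haplotype."""
    per_hap = {name: kmers(seq, k) for name, seq in haplotypes.items()}
    counts = Counter(kmer for ks in per_hap.values() for kmer in ks)
    return {
        name: {kmer for kmer in ks if counts[kmer] == 1}
        for name, ks in per_hap.items()
    }


def kmer_support(reads, distinct, k):
    """Sum, over reads, the number of a haplotype's distinct k-mers in each read."""
    support = {name: 0 for name in distinct}
    for read in reads:
        read_kmers = kmers(read, k)
        for name, ks in distinct.items():
            support[name] += len(read_kmers & ks)
    return support


haplotypes = {"hapA": "ACGTACGA", "hapB": "ACGTTCGA"}  # toy sequences
distinct = distinct_kmers(haplotypes, k=4)
support = kmer_support(["CGTACG", "GTACGA", "ACGTTC"], distinct, k=4)
```

Here the shared k-mer `ACGT` contributes to neither haplotype, so only k-mers that discriminate between the candidates affect the support counts.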
[0075] As previously mentioned, in some embodiments, the personalized sequencing system 106 generates a personalized haplotype database for a genomic sample based on comparing nucleotide reads with population haplotypes from a population haplotype database and, utilizing the personalized haplotype database, determines personalized alignments of the nucleotide reads from a genomic sample. For example, FIGS. 3A-3B illustrate the personalized sequencing system 106 performing an initial read mapping and alignment 308a of nucleotide reads 302 with respective genomic regions of a reference genome 304, generating a personalized haplotype database 314 based on the initial alignments of the nucleotide reads 302, and utilizing the personalized haplotype database 314 to perform a personalized read mapping and alignment 308b of the nucleotide reads 302.
[0076] As shown in FIG. 3A, in some embodiments, the personalized sequencing system 106 implements the read mapping and alignment 308a to determine initial alignments of the nucleotide reads 302 with respective genomic regions of the reference genome 304. For instance, the personalized sequencing system 106 utilizes a population haplotype database 306 associated with the reference genome 304 to generate initial read alignments for the nucleotide reads 302. In some
embodiments, for example, the population haplotype database 306 comprises a graph reference genome comprising a primary contiguous sequence augmented by a plurality of alternate contiguous sequences representing population haplotypes associated with the reference genome 304.
[0077] Alternatively, in some embodiments, the population haplotype database 306 comprises a data structure encoding allele-variant differences between the population haplotypes and a primary contiguous sequence for the reference genome 304. In such embodiments, to implement the initial read mapping and alignment 308a, the personalized sequencing system 106 determines the initial alignments of the nucleotide reads 302 by (i) determining one or more candidate reference alignments of each nucleotide read with a primary contiguous sequence of the reference genome 304 and (ii) determining an initial alignment of each nucleotide read with a respective genomic region of the reference genome 304 by rescoring the one or more candidate reference alignments based on comparing each nucleotide read with the allele-variant differences indicated within the population haplotype database 306.
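As a non-limiting illustration, the following Python sketch shows the two-step idea of (i) taking candidate reference alignments against a primary contiguous sequence and (ii) rescoring them using allele-variant differences. It is deliberately simplified: alignments are ungapped, variants are single-nucleotide substitutions, alleles are treated independently rather than as phased haplotypes, and the scoring constants are hypothetical. None of these choices is taken from this disclosure.

```python
MATCH_SCORE = 1        # hypothetical scoring constant
MISMATCH_PENALTY = 4   # hypothetical scoring constant


def rescore(read, primary, candidate_starts, known_alleles):
    """Return (best_start, best_score) after crediting known population alleles.

    candidate_starts: 0-based ungapped start positions on the primary sequence.
    known_alleles: dict mapping a position to the set of alternate bases that
        population haplotypes carry at that position.
    """
    best = None
    for start in candidate_starts:
        score = 0
        for offset, base in enumerate(read):
            pos = start + offset
            if base == primary[pos]:
                score += MATCH_SCORE
            elif base in known_alleles.get(pos, ()):
                # Mismatch explained by a population haplotype allele.
                score += MATCH_SCORE
            else:
                score -= MISMATCH_PENALTY
        if best is None or score > best[1]:
            best = (start, score)
    return best


primary = "AACCGGTT" + "AACCGCTT"  # two near-identical regions
read = "AACCGATT"
without_db = rescore(read, primary, [0, 8], {})
with_db = rescore(read, primary, [0, 8], {13: {"A"}})
```

In this toy case the two candidate alignments tie when only the primary sequence is considered, and the variant recorded at position 13 resolves the tie in favor of the second region.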
[0078] Moreover, as shown in FIG. 3A, having determined initial alignments of the nucleotide reads 302 by the initial read mapping and alignment 308a, the personalized sequencing system 106 generates an alignment data file 310 (e.g., an initial alignment data file, such as a BAM file or other file type) indicating the initial alignments and, in some embodiments, additional information associated with the nucleotide reads 302, the initial alignments, and/or the population haplotypes to which one or more of the nucleotide reads 302 align according to the initial alignments. In one or more embodiments, for example, the alignment data file 310 generated via the initial read mapping and alignment 308a comprises the nucleotide reads 302 (e.g., nucleobases thereof), the initial alignments of the nucleotide reads 302 with respective genomic regions of the reference genome 304, alignment scores associated with the initial alignments (e.g., MAPQ or extended MAPQ scores), and/or haplotype data (e.g., imported from the population haplotype database 306) associated with the initial alignments of the nucleotide reads 302.
[0079] In some embodiments, for instance, haplotype data stored within the alignment data file 310 can include an indication of one or more candidate population haplotypes for respective reference spans of the reference genome 304. In such embodiments, the personalized sequencing system 106 can identify one or more candidate population haplotypes, which the personalized sequencing system 106 analyzes for inclusion when generating the personalized haplotype database 314, within reference spans of the reference genome 304 based on similarities between the nucleotide reads 302 and population haplotypes within the respective reference spans. Alternatively, in some embodiments, the personalized sequencing system 106 evaluates all population haplotypes within the population haplotype database 306 as candidate population
haplotypes when selecting haplotypes for inclusion within the personalized haplotype database 314 (e.g., as further described below in relation to FIGS. 4-5).
[0080] As shown in FIG. 3B, in some embodiments, the personalized sequencing system 106 utilizes the alignment data file 310 (e.g., the initial alignments and other information associated therewith) in a haplotype selection process 312 to generate the personalized haplotype database 314. In some embodiments, for example, the personalized sequencing system 106 generates haplotype set scores for sets of candidate population haplotypes based on comparing the nucleotide reads 302 with the candidate population haplotypes within one or more reference spans of the reference genome 304. Thus, as shown in FIG. 3B, the haplotype selection process 312 comprises (i) comparing the nucleotide reads 302 with candidate population haplotypes from the population haplotype database 306 according to the initial read alignments indicated by the alignment data file 310 and, based on the comparison, (ii) selecting a subset of population haplotypes from the population haplotype database 306 for inclusion within the personalized haplotype database 314. Accordingly, in one or more implementations, the personalized haplotype database 314 comprises a subset of population haplotypes selected from the population haplotype database 306 for one or more respective reference spans (or other partitions) of the reference genome 304.
[0081] As also shown in FIG. 3B, having generated the personalized haplotype database 314 for the genomic sample of the nucleotide reads 302, the personalized sequencing system 106 implements the personalized read mapping and alignment 308b. In one or more embodiments, for instance, the personalized sequencing system 106 determines personalized alignments of the nucleotide reads 302 with respective genomic regions of the reference genome 304 in similar fashion as the initial read mapping and alignment 308a, albeit by utilizing the personalized haplotype database 314 in place of the population haplotype database 306. In some embodiments, for example, the personalized haplotype database 314 comprises a data structure encoding allele-variant differences between the selected population haplotypes and a primary contiguous sequence for the reference genome 304 in each of the one or more reference spans of the reference genome 304. In such embodiments, to implement the personalized read mapping and alignment 308b, the personalized sequencing system 106 determines the personalized alignments of the nucleotide reads 302 by (i) determining one or more candidate reference alignments of each nucleotide read with a primary contiguous sequence of the reference genome 304 and (ii) determining a personalized alignment of each nucleotide read with a respective genomic region of the reference genome 304 by rescoring the one or more candidate reference alignments based on comparing each nucleotide read with the allele-variant differences indicated within the personalized haplotype database 314. Alternatively, in some embodiments, the personalized haplotype database 314 comprises a graph reference genome. Such a graph reference genome may comprise a primary
contiguous sequence augmented by a plurality of alternate contiguous sequences representing the population haplotypes selected from the population haplotype database 306 via the haplotype selection process 312.
[0082] Furthermore, as shown in FIG. 3B, having determined personalized alignments of the nucleotide reads 302 by the personalized read mapping and alignment 308b, the personalized sequencing system 106 generates a personalized alignment data file 316 (e.g., an alignment data file, such as a BAM file or related file type) indicating the personalized alignments. In addition to data for personalized alignments, in some embodiments, the personalized alignment data file 316 includes additional information associated with the nucleotide reads 302, the personalized alignments, and/or the population haplotypes to which one or more of the nucleotide reads 302 align according to the personalized alignments. In one or more embodiments, for example, the personalized alignment data file 316 generated via the personalized read mapping and alignment 308b comprises the nucleotide reads 302 (e.g., nucleobases thereof), the personalized alignments of the nucleotide reads 302 with respective genomic regions of the reference genome 304, alignment scores associated with the personalized alignments (e.g., MAPQ or extended MAPQ scores), and/or haplotype data (e.g., imported from the personalized haplotype database 314) associated with the personalized alignments of the nucleotide reads 302.
[0083] As previously mentioned, in some embodiments, the personalized sequencing system 106 generates haplotype set scores for sets of candidate population haplotypes based on comparing nucleotide reads from a genomic sample with the sets of candidate population haplotypes within respective genomic regions of the genomic sample. For example, FIG. 4 illustrates the personalized sequencing system 106 comparing nucleotide reads 404 from a genomic sample with sets of candidate population haplotypes (e.g., candidate population haplotypes 408) from a population haplotype database 406 to generate haplotype set scores 410. Based on the haplotype set scores 410, the personalized sequencing system 106 further determines the selected population haplotypes 414 for inclusion within a personalized haplotype database 412.
[0084] As mentioned, the personalized sequencing system 106 can identify nucleotide reads for comparison with candidate population haplotypes within a given genomic region of a genomic sample (e.g., a reference span of a reference genome). As shown in FIG. 4, for example, the personalized sequencing system 106 receives or identifies initial alignments 402 of a set of nucleotide reads (e.g., as determined by an initial mapping and alignment process, such as described above in relation to FIG. 3A) and identifies the nucleotide reads 404 from the set of nucleotide reads that, according to the initial alignments 402, at least partially align to a given genomic region. Alternatively, the personalized sequencing system 106 can identify the nucleotide reads 404 for comparison with the candidate population haplotypes 408 without determining or utilizing the initial alignments 402 (e.g., such as described above in relation to FIG. 2).
[0085] As further depicted in FIG. 4, the personalized sequencing system 106 compares sets of the candidate population haplotypes 408 with the nucleotide reads 404 within a given genomic region to determine the haplotype set scores 410. In some embodiments, for example, the candidate population haplotypes 408 include all population haplotypes within the population haplotype database 406 having variations from a respective reference genome (e.g., variants relative to a primary contiguous sequence) within the given genomic region. Alternatively, in one or more embodiments, the personalized sequencing system 106 identifies a limited number of the candidate population haplotypes 408 from the population haplotype database 406 for comparison with the nucleotide reads 404 within the given genomic region. In some implementations, for example, the personalized sequencing system 106 limits the candidate population haplotypes 408 to haplotypes having locally distinct variations from the reference genome within the given genomic region (e.g., population haplotypes comprising a unique set of one or more allele-variant differences relative to other population haplotypes within the given genomic region or reference span).
[0086] Alternatively, in some embodiments, the personalized sequencing system 106 determines haplotype likelihoods for the candidate population haplotypes 408 and/or limits the candidate population haplotypes 408 to haplotypes identified during initial mapping and alignment as having a relatively higher likelihood of matching the aligned nucleotide reads 404. Such individual haplotype likelihoods can be used as “soft” scores or multi-bit values that provide a basis or input for the haplotype set scores 410 that the personalized sequencing system 106 subsequently determines for haplotype sets (e.g., pairs of population haplotypes). In some cases, the personalized sequencing system 106 determines an individual haplotype likelihood as a Boolean value, multi-bit value, or other value (e.g., normalized Boolean value, normalized multi-bit value) quantifying the number of variants (e.g., SNPs) that differ between mapped and aligned nucleotide reads and population haplotypes from the population haplotype database 406.
[0087] To determine such individual haplotype likelihoods, in one or more embodiments, for example, the personalized sequencing system 106 determines (or receives) a set of alignment scores corresponding to the population haplotypes within the population haplotype database 406 (or a subset thereof) in relation to the nucleotide reads 404 aligned to a given genomic region according to the initial alignments 402. In some such embodiments, the personalized sequencing system 106 can identify an individual population haplotype exhibiting a highest alignment score of the set of alignment scores and assign (or set) the highest alignment score as a baseline (e.g., by setting the highest alignment score to a baseline value of zero) for purposes of determining the individual haplotype likelihoods for the remaining population haplotypes from the population haplotype
database 406. By using the highest alignment score as a baseline, the personalized sequencing system 106 can determine a normalized haplotype likelihood for the remaining population haplotypes mapped and aligned to a set of nucleotide reads. Such an individual haplotype likelihood can take the form of a Boolean value, multi-bit value, or other quantification of the number of additional or existing variants (e.g., SNPs) that differ between the nucleotide reads and the remaining population haplotypes — relative to the population haplotype exhibiting the highest alignment score. Having determined such normalized and individual haplotype likelihoods, in some cases, the personalized sequencing system 106 selects the candidate population haplotypes 408 from the population haplotype database 406 based on their respective normalized haplotype likelihoods.
[0088] To further illustrate, in some implementations, the personalized sequencing system 106 selects population haplotypes having corresponding normalized haplotype likelihoods within a threshold value (e.g., population haplotypes matching by a threshold number of nucleotides within the nucleotide reads 404 within the given genomic region) as the candidate population haplotypes 408. Additionally or alternatively, in one or more embodiments, the personalized sequencing system 106 utilizes the normalized haplotype likelihoods (or a subset thereof if less than all population haplotypes are considered candidates) in determining the haplotype set scores 410 (e.g., to determine initial likelihoods as described below in relation to FIG. 5). Accordingly, regardless of whether the personalized sequencing system 106 selects the candidate population haplotypes 408 from the population haplotype database 406 based on their respective normalized haplotype likelihoods, the personalized sequencing system 106 can assign or determine individual haplotype likelihoods to the candidate population haplotypes 408 for purposes of determining the haplotype set scores 410.
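As a non-limiting illustration, the following Python sketch shows the baseline normalization and threshold-based candidate selection described above. The highest alignment score is set to a baseline of zero, every other haplotype receives a non-positive offset from that baseline, and haplotypes within a threshold of the baseline are retained as candidates. The function names, the example scores, and the threshold value are hypothetical.

```python
def normalized_likelihoods(alignment_scores):
    """Shift scores so the best-scoring haplotype sits at a baseline of 0."""
    best = max(alignment_scores.values())
    return {name: score - best for name, score in alignment_scores.items()}


def select_candidates(normalized, threshold):
    """Keep haplotypes whose normalized value is within `threshold` of the baseline."""
    return sorted(name for name, value in normalized.items() if value >= -threshold)


scores = {"hap1": 95, "hap2": 98, "hap3": 80}  # hypothetical alignment scores
normalized = normalized_likelihoods(scores)
candidates = select_candidates(normalized, threshold=5)
```

In this sketch, the normalized values can serve both purposes described above: filtering the candidate list and providing per-haplotype inputs to a later set-scoring step.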
[0089] As just indicated, based on the haplotype likelihoods and/or other comparison of the nucleotide reads 404 with sets of the candidate population haplotypes 408 within the given genomic region, the personalized sequencing system 106 generates the haplotype set scores 410 for the respective sets of the candidate population haplotypes 408. In one or more embodiments, for example, the haplotype set scores 410 represent a likelihood or probability that a respective set of the candidate population haplotypes 408 represents the genomic sample of the nucleotide reads within the given genomic region. Accordingly, the personalized sequencing system 106 determines the selected population haplotypes 414 with the highest relative haplotype set score of the haplotype set scores 410 within the given genomic region and generates the personalized haplotype database 412 comprising the selected population haplotypes 414 for determining personalized alignments of the nucleotide reads 404 and/or additional nucleotide reads from the genomic sample.
[0090] In various embodiments, the personalized sequencing system 106 generates the haplotype set scores 410 for sets of candidate population haplotypes comprising differing numbers of the candidate population haplotypes 408 per set. In some embodiments, for example, the personalized sequencing system 106 identifies a genomic region (e.g., a reference span) corresponding to a diploid region of a reference genome and, in response, generates the haplotype set scores 410 for sets comprising pairs of the candidate population haplotypes 408. By contrast, in some embodiments, the personalized sequencing system 106 identifies a genomic region (e.g., a reference span) corresponding to a haploid region of a reference genome, such as an X or Y chromosome. In response to identifying a haploid genomic region, the personalized sequencing system 106 generates the haplotype set scores 410 for sets comprising individual haplotypes of the candidate population haplotypes 408. Furthermore, the personalized sequencing system 106 can determine the haplotype set scores 410 for sets comprising more than two of the candidate population haplotypes 408.
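As a rough sketch of how set size can follow ploidy, the example below enumerates unordered sets of candidate haplotypes for a diploid region (pairs) and a haploid region (single haplotypes). Allowing a haplotype to pair with itself (a homozygous set) is an assumption of this example, as are the names `candidate_sets` and the haplotype IDs.

```python
from itertools import combinations_with_replacement


def candidate_sets(candidates, ploidy):
    """Enumerate unordered haplotype sets of size `ploidy`.

    Repeats are allowed so that a homozygous pair such as (h0, h0) is
    included; that choice is an assumption of this sketch.
    """
    return list(combinations_with_replacement(candidates, ploidy))


candidates = ["h0", "h1", "h2"]
diploid_sets = candidate_sets(candidates, ploidy=2)  # pairs
haploid_sets = candidate_sets(candidates, ploidy=1)  # single haplotypes
```

For H candidates, the diploid case yields H(H+1)/2 sets under this convention, which is why exhaustive scoring grows quadratically with the number of candidates.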
[0091] As mentioned, in some embodiments, the personalized sequencing system 106 generates haplotype set scores 410 for every combination of the candidate population haplotypes 408 (e.g., a haplotype set score for each possible pairing of the candidate population haplotypes 408). Also, in one or more embodiments, the personalized sequencing system 106 adjusts each haplotype set score of the haplotype set scores 410 for a given reference span in consideration of every nucleotide read of the nucleotide reads 404 identified within the given reference span. Alternatively, in one or more embodiments, the personalized sequencing system 106 adjusts each haplotype set score of the haplotype set scores 410 for a given reference span in consideration of each adjacently mapped nucleotide read of the nucleotide reads 404 within the given reference span.
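One way to picture exhaustive pair scoring is the sketch below. It assumes a common diploid mixture model in which each read is equally likely to originate from either haplotype of a pair; that model, the function name `score_pairs`, and the toy likelihood values are assumptions for illustration rather than the scoring described above.

```python
import math
from itertools import combinations_with_replacement


def score_pairs(read_likelihoods):
    """Score every unordered haplotype pair in one reference span.

    read_likelihoods[r][h] is the likelihood of read r given haplotype h.
    Each read contributes log(0.5 * L[r][i] + 0.5 * L[r][j]) to pair (i, j),
    an assumed equal-mixture model.
    """
    n_haps = len(read_likelihoods[0])
    scores = {}
    for i, j in combinations_with_replacement(range(n_haps), 2):
        scores[(i, j)] = sum(
            math.log(0.5 * row[i] + 0.5 * row[j]) for row in read_likelihoods
        )
    return scores


# Toy data: two reads fit haplotype 0 and two reads fit haplotype 2.
read_likelihoods = [
    [0.9, 0.1, 0.1],
    [0.9, 0.1, 0.1],
    [0.1, 0.1, 0.9],
    [0.1, 0.1, 0.9],
]
pair_scores = score_pairs(read_likelihoods)
best_pair = max(pair_scores, key=pair_scores.get)
```

With these toy values the heterozygous pair of haplotypes 0 and 2 scores highest, because it explains both groups of reads.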
[0092] As a further alternative, in some embodiments, the personalized sequencing system 106 implements an iterative probabilistic algorithm, such as a Variational Bayesian model, to partition reads within each reference span and generate the haplotype set scores 410 based on the partitioned reads within each respective reference span. In some embodiments, for example, the personalized sequencing system 106 utilizes a Variational Bayesian model to partition the respective reads based on a categorization of each of the nucleotide reads 404 as inherited from a respective first or second parent. In some such embodiments, the Variational Bayesian model comprises an iterative probabilistic algorithm that utilizes variational inference to predict the posterior distribution of latent variables to determine respective likelihoods of nucleotide reads being inherited from respective first or second parents. Such respective likelihoods for nucleotide reads could be any likelihood (e.g., 0.50 for a first parent and 0.50 for a second parent; 0.25 for a first parent and 0.75 for a second parent).
[0093] Having partitioned the nucleotide reads 404 accordingly, the personalized sequencing system 106 can generate individual haplotype scores for the categorized reads in relation to each of the candidate population haplotypes 408, rather than generating the haplotype set scores 410 for each set of population haplotypes on a set-by-set basis. Accordingly, the personalized sequencing system 106 can generate the haplotype set scores 410 for sets of the candidate population haplotypes 408 by combining individual haplotype scores from the nucleotide reads inherited from the first parent with individual haplotype scores from the nucleotide reads inherited from the second parent. Thus, by partitioning the nucleotide reads 404 according to parental inheritance, the personalized sequencing system 106 can significantly reduce the number of unique sets of the candidate population haplotypes 408 to be scored, thereby avoiding scoring every pair of candidate population haplotypes.
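The saving from partitioning reads can be sketched as follows. The example assumes the per-read parental weights have already been produced (it does not perform variational inference) and combines weighted log-likelihoods per haplotype; the names `individual_scores`, `parent1_weights`, and the toy values are hypothetical.

```python
import math


def individual_scores(read_likelihoods, weights):
    """Weighted log-likelihood of each haplotype for one parental partition.

    weights[r] is the (assumed, precomputed) probability that read r was
    inherited from the parent in question.
    """
    n_haps = len(read_likelihoods[0])
    return [
        sum(w * math.log(row[h]) for row, w in zip(read_likelihoods, weights))
        for h in range(n_haps)
    ]


read_likelihoods = [
    [0.9, 0.1, 0.1],
    [0.9, 0.1, 0.1],
    [0.1, 0.1, 0.9],
    [0.1, 0.1, 0.9],
]
parent1_weights = [0.95, 0.9, 0.1, 0.05]
parent2_weights = [1.0 - w for w in parent1_weights]

scores1 = individual_scores(read_likelihoods, parent1_weights)
scores2 = individual_scores(read_likelihoods, parent2_weights)

# Each parent's best haplotype is chosen independently, so only
# 2 * H individual scores are computed instead of one score per pair.
best1 = max(range(len(scores1)), key=scores1.__getitem__)
best2 = max(range(len(scores2)), key=scores2.__getitem__)
set_score = scores1[best1] + scores2[best2]
```

Under these assumptions the work scales linearly with the number of candidate haplotypes, whereas scoring every pair scales quadratically.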
[0094] As previously mentioned, in some embodiments, the personalized sequencing system 106 utilizes an imputation model to generate haplotype set scores for sets of candidate population haplotypes across adjacent reference spans of a reference genome. In accordance with one or more embodiments, for example, FIG. 5 illustrates the personalized sequencing system 106 utilizing a hidden Markov model (HMM) algorithm 508 to generate a set of haplotype set scores for sets of candidate population haplotypes across a set of reference spans 504a, 504b, through 504n of a reference genome. In particular, FIG. 5 illustrates the personalized sequencing system 106 generating haplotype set posterior probabilities 510a, 510b, through 510n for sets of haplotypes within the set of reference spans 504a, 504b, through 504n. Based on the generated haplotype set posterior probabilities 510a-510n, the personalized sequencing system 106 selects haplotype sets 512a, 512b, through 512n for the respective reference spans 504a, 504b, through 504n to include in a personalized haplotype database 514.
[0095] As shown in FIG. 5, the personalized sequencing system 106 identifies nucleotide reads 502a, 502b, through 502n from a genomic sample within the respective reference spans 504a-504n of a reference genome associated with the genomic sample. In one or more embodiments, for example, the set of reference spans 504a-504n comprises a partitioning of a reference genome (or a portion thereof) into bins spanning a selected number of base positions, such as, but not limited to, 1,000 base positions per reference span; 4,000 base positions per reference span; or 16,000 base positions per reference span. Any number of base positions, however, can be used for a reference span.
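A minimal sketch of partitioning into fixed-length reference spans is shown below. It assigns each read to a span by its 0-based mapped start position, which is an assumption of this example; reads that straddle a span boundary are not handled, and the names `bin_reads` and `read_starts` are hypothetical.

```python
from collections import defaultdict


def bin_reads(read_starts, span_length=1000):
    """Group read IDs by the index of the reference span containing their start.

    read_starts maps a read ID to its 0-based mapped start position.
    """
    spans = defaultdict(list)
    for read_id, start in read_starts.items():
        spans[start // span_length].append(read_id)
    return dict(spans)


read_starts = {"r1": 150, "r2": 999, "r3": 1000, "r4": 4321}
reads_by_span = bin_reads(read_starts, span_length=1000)
# reads_by_span == {0: ["r1", "r2"], 1: ["r3"], 4: ["r4"]}
```

Changing `span_length` to 4,000 or 16,000 gives the other example span sizes mentioned above.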
[0096] Moreover, within the reference spans 504a-504n, the personalized sequencing system 106 identifies nucleotide reads 502a, 502b, through 502n for comparison with various sets of candidate population haplotypes to determine respective haplotype set likelihoods 506a, 506b, through 506n. As indicated above with respect to FIG. 4, for example, the personalized sequencing system 106 can compare the nucleotide reads 502a-502n with candidate population haplotypes by mapping and aligning the nucleotide reads 502a-502n with candidate population haplotypes, determining alignment scores, and determining haplotype likelihoods based in part on the corresponding read alignments. In some embodiments, for example, the personalized sequencing system 106 further generates haplotype set likelihoods (e.g., haplotype set likelihoods 506a) of the haplotype set likelihoods 506a-506n for a given reference span (e.g., reference span 504a) of the reference spans 504a-504n. In some such embodiments, the personalized sequencing system 106 scores each set of candidate population haplotypes according to a comparison of nucleobases of the respective nucleotide reads (e.g., nucleotide reads 502a) within the given reference span and nucleobases of each set of candidate population haplotypes within the given reference span. Accordingly, as shown in FIG. 5, the personalized sequencing system 106 generates a plurality of haplotype set likelihoods for each respective reference span of the set of reference spans 504a-504n.
[0097] Further, in one or more embodiments, the personalized sequencing system 106 generates haplotype set scores for the respective sets of candidate population haplotypes in each of the reference spans 504a-504n based on the haplotype set likelihoods 506a-506n. As also shown in FIG. 5, for example, the personalized sequencing system 106 utilizes an imputation model, such as the HMM algorithm 508, to generate haplotype set posterior probabilities 510a, 510b, through 510n for sets of candidate population haplotypes within the respective reference spans 504a-504n based on respective haplotype set likelihoods of the haplotype set likelihoods 506a-506n for adjacent reference spans of the set of reference spans 504a-504n.
[0098] As indicated by FIG. 5, in a forward pass of the HMM algorithm 508 across the set of reference spans 504a-504n, for instance, the personalized sequencing system 106 utilizes the haplotype set likelihoods 506a-506n to generate a set of forward probabilities for each reference span of the set of reference spans 504a-504n. In some embodiments, for example, a forward pass of the HMM algorithm 508 starts with the haplotype set likelihoods 506a of the reference span 504a and ends with the haplotype set likelihoods 506n of the reference span 504n, thus generating forward probabilities for each of the reference spans 504a-504n with input from respectively adjacent reference spans (e.g., prior reference spans during a forward pass of the HMM algorithm 508).
[0099] In a backward pass of the HMM algorithm across the set of reference spans 504a-504n, the personalized sequencing system 106 utilizes the forward probabilities for each set of candidate population haplotypes within the respective reference spans 504a-504n to generate updated probabilities for the respective sets of candidate population haplotypes. Based on the updated probabilities, the personalized sequencing system 106 generates the haplotype set posterior probabilities 510a-510n for the respective sets of candidate population haplotypes within the reference spans 504a-504n. Accordingly, as mentioned in relation to the haplotype set likelihoods 506a-506n, the personalized sequencing system 106 generates a plurality of haplotype set posterior probabilities for each respective reference span of the set of reference spans 504a-504n and, in some embodiments, generates haplotype set scores for the respective sets of candidate population haplotypes from the haplotype set posterior probabilities 510a-510n from each respective reference span of the set of reference spans 504a-504n.
[0100] As further depicted in FIG. 5, the personalized sequencing system 106 selects the haplotype sets 512a-512n for the respective reference spans 504a-504n (e.g., one haplotype set for each reference span) based on the respective haplotype set posterior probabilities 510a-510n, or haplotype set scores derived therefrom, to generate the personalized haplotype database 514. Accordingly, in some cases, the personalized sequencing system 106 can select different sets of candidate population haplotypes across the set of reference spans 504a-504n for inclusion within the personalized haplotype database 514, such that the haplotype sets 512a-512n vary across the reference spans 504a-504n. Moreover, in some embodiments, the personalized sequencing system 106 utilizes a recombination parameter of the HMM algorithm to reduce variation in the haplotype set posterior probabilities 510a-510n across the set of reference spans 504a-504n, thus reducing, in some cases, variation in the selected haplotype sets 512a-512n included within the personalized haplotype database 514 across the set of reference spans 504a-504n.
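The forward pass, backward pass, and per-span selection described for FIG. 5 can be approximated with a textbook forward-backward computation, sketched below. Several details are assumptions of this example and not taken from the description: hidden states are haplotype sets, the initial distribution is uniform, and a single `recombination` value gives the probability of switching to any other set between adjacent spans, spread uniformly over the other sets.

```python
def forward_backward(likelihoods, recombination=0.1):
    """Posterior probability of each haplotype set in each reference span.

    likelihoods[t][k] is the haplotype set likelihood of set k in span t.
    """
    n_spans, n_states = len(likelihoods), len(likelihoods[0])
    stay = 1.0 - recombination
    switch = recombination / (n_states - 1) if n_states > 1 else 0.0

    def trans(i, j):
        return stay if i == j else switch

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass: first span to last span.
    fwd = [normalize([lk / n_states for lk in likelihoods[0]])]
    for t in range(1, n_spans):
        prev = fwd[-1]
        fwd.append(normalize([
            likelihoods[t][j] * sum(prev[i] * trans(i, j) for i in range(n_states))
            for j in range(n_states)
        ]))

    # Backward pass: last span to first span.
    bwd = [[1.0] * n_states for _ in range(n_spans)]
    for t in range(n_spans - 2, -1, -1):
        bwd[t] = normalize([
            sum(trans(i, j) * likelihoods[t + 1][j] * bwd[t + 1][j]
                for j in range(n_states))
            for i in range(n_states)
        ])

    # Combine both passes into per-span posteriors.
    return [normalize([f * b for f, b in zip(fwd[t], bwd[t])])
            for t in range(n_spans)]


# Three spans, two candidate haplotype sets; the middle span is ambiguous.
span_likelihoods = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]]
posteriors = forward_backward(span_likelihoods, recombination=0.1)
selected = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
```

In this toy case the ambiguous middle span inherits support from its neighbors, so the same set is selected in all three spans. A smaller `recombination` value strengthens that smoothing, which mirrors the reduced variation across spans noted above.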
[0101] As mentioned above, in some embodiments, the personalized sequencing system 106 determines haplotype set scores based on a comparison of nucleotide reads with allele-variant differences between candidate population haplotypes and a reference genome, as indicated within a population haplotype database. Also, in some embodiments, the personalized sequencing system 106 generates, for a given genomic sample, a personalized haplotype database comprising allele-variant differences between a limited selection of population haplotypes and a reference genome across a set of reference spans covering respective genomic regions of the reference genome. In some embodiments, for example, the personalized sequencing system 106 utilizes a population haplotype database and/or generates a personalized haplotype database encoded as a haplotype data structure as described by Michael Ruehle, Enhanced Mapping and Alignment of Nucleotide Reads Utilizing an Improved Haplotype Data Structure with Allele-Variant Differences, U.S. Provisional Application No. 63/613,574 (filed December 21, 2023) (hereinafter, Ruehle), which is hereby incorporated by reference in its entirety.
[0102] To further illustrate, FIGS. 6A-6B illustrate embodiments of haplotype data structures encoding allele-variant differences between population haplotypes and a reference genome. In particular, FIG. 6A illustrates a set of base-level bins 602a, 602b, through 602n of a population haplotype database 600 encoding allele-variant differences of population haplotypes within respective reference spans 604a, 604b, through 604n of a reference genome, whereas FIG. 6B illustrates a set of base-level bins 612a, 612b, through 612n of a personalized haplotype database 610 encoding allele-variant differences of selected population haplotypes within the respective reference spans 604a, 604b, through 604n of the reference genome. While the following paragraphs describe various bins of either the population haplotype database 600 or the personalized haplotype database 610, the bins of each of the population haplotype database 600 or the personalized haplotype database 610 can be encoded or otherwise be represented in a particular file type, such as a TSV file or Comma Separated Values (CSV) file.
[0103] As shown in FIG. 6A, the population haplotype database 600 includes at least a base level comprising the set of base-level bins 602a-602n that partition genomic regions of a reference genome into the respective set of base-level reference spans 604a-604n. In one or more embodiments, each base-level reference span of the set of base-level reference spans 604a-604n comprises a genomic region of a first length between respective genomic coordinates of the reference genome, thus partitioning genomic regions of the reference genome into multiple bins spanning an equal portion/length of the reference genome. In various implementations, the length of the base-level reference spans can approximate, for example, the average or maximum length of nucleotide reads provided to the personalized sequencing system 106 for mapping and alignment. Alternatively, the base-level reference spans can otherwise be selected to span a predetermined number of nucleobases from genomic coordinates or regions of a linear reference sequence, such as, but not limited to, 100 base pairs or 1,000 base pairs per base-level bin.
[0104] As further illustrated in FIG. 6A, the set of base-level bins 602a-602n of the population haplotype database 600 comprise encoded variant data for nucleotide variants from respective sets of locally distinct population haplotype(s) 606a-606n. In particular, each locally distinct population haplotype within a given base-level bin comprises a unique set of one or more allele-variant differences relative to other population haplotypes also having variations within the genomic region of the respective base-level reference span of the given base-level bin. As shown in FIG. 6A, for example, each row of the locally distinct population haplotype(s) 606a comprises a unique set of allele-variant differences (denoted as single letters representing particular nucleotides) relative to other rows, such that no two rows are identical, although there can be limited overlap between allele-variant differences, as indicated by the top two rows of base-level bin 602a. Accordingly, in one or more embodiments, population haplotypes having identical nucleotide variants within a given base-level bin are encoded as one locally distinct population haplotype within the given base-level bin.
[0105] In various embodiments, each base-level bin of the population haplotype database 600 can include differing quantities of locally distinct population haplotypes. As shown in FIG. 6A, for example, the set of locally distinct population haplotype(s) 606a included within the base-level bin 602a includes four locally distinct population haplotypes (as indicated by the four rows of the portrayed matrix), the set of locally distinct population haplotype(s) 606b included within the base-level bin 602b includes five locally distinct population haplotypes, and the set of locally distinct population haplotype(s) 606n included within the base-level bin 602n includes three locally distinct population haplotypes. Indeed, each base-level bin of the population haplotype database 600 can include any number of locally distinct population haplotypes, including as many as every population haplotype in a data set or no population haplotypes (e.g., in cases where there are no population haplotypes having allele-variant differences in a genomic region corresponding to a given bin).
[0106] As further shown in FIG. 6A, the base-level bins 602a, 602b, through 602n include respective allele-variant differences 608a, 608b, through 608n for each locally distinct population haplotype of the respective sets of locally distinct population haplotypes 606a, 606b, through 606n. For example, variant data encoded within the base-level bin 602a includes a set of locally distinct population haplotype(s) 606a comprising one or more locally distinct population haplotypes for which allele-variant differences 608a are included for each respective locally distinct population haplotype. In some embodiments, for example, each base-level bin (e.g., of the set of base-level bins 602a-602n) comprises a matrix including corresponding variant data representing allele-variant differences from locally distinct population haplotypes (e.g., of the respective sets of locally distinct population haplotypes 606a-606n) and variant positions for the allele-variant differences. In various embodiments, the variant data within each base-level bin includes data indications (e.g., the allele-variant differences 608a-608n) of single-nucleotide polymorphisms (SNPs) and/or insertions or deletions (indels) at respective genomic coordinates of the reference genome (e.g., of the primary contiguous sequence).
[0107] As further shown in FIG. 6A, in some embodiments, each of the base-level bins 602a, 602b, and 602n further includes (or can reference data including) population frequencies 609a, 609b, and 609n (e.g., relative frequencies of haplotype alleles within a corresponding reference population) for the respective sets of locally distinct population haplotypes 606a-606n. In certain embodiments, for example, the personalized sequencing system 106 adjusts alignment scores based on the population frequencies 609a-609n when determining initial read alignments for nucleotide reads using the population haplotype database 600 (e.g., as described above in relation to FIG. 3A).
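The collapsing of identical haplotypes into locally distinct rows, together with a population frequency per row, can be sketched as below. The representation of a variant as a `(position, alternate_allele)` tuple, the function name `locally_distinct`, and the choice to keep a reference-matching (empty) row are assumptions of this example.

```python
from collections import Counter


def locally_distinct(haplotype_variants):
    """Collapse haplotypes with identical variants in one base-level bin.

    haplotype_variants maps a haplotype ID to its list of
    (position, alternate_allele) differences within the bin.
    Returns (sorted variant list, relative frequency) rows.
    """
    counts = Counter(frozenset(v) for v in haplotype_variants.values())
    total = sum(counts.values())
    return [(sorted(variants), n / total) for variants, n in counts.items()]


# Hypothetical bin: s1 and s2 carry the same single variant.
bin_haplotypes = {
    "s1": [(120, "A")],
    "s2": [(120, "A")],
    "s3": [(120, "A"), (455, "T")],
    "s4": [],
}
rows = locally_distinct(bin_haplotypes)
# rows[0] == ([(120, "A")], 0.5)
```

Note that rows may share individual variants, as with position 120 here, while still being distinct as whole rows.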
[0108] As shown in FIG. 6B, by contrast, the personalized haplotype database 610 generated by the personalized sequencing system 106 includes variant data for a customized set of population haplotypes within each of the base-level bins 612a-612n. In particular, FIG. 6B depicts each base-level bin with two population haplotypes, thus representing an inferred diplotype of a genomic sample corresponding to the personalized haplotype database 610 (e.g., a genomic sample for which the personalized sequencing system 106 generated the personalized haplotype database 610). As mentioned previously, in various embodiments the personalized sequencing system 106 generates a personalized haplotype database having differing numbers of selected haplotypes per reference span (e.g., one, two, or more haplotypes per selected set of haplotypes).
[0109] As also shown in FIG. 6B, the personalized haplotype database 610 includes at least a base level comprising the set of base-level bins 612a-612n that partition genomic regions of the reference genome into the respective set of base-level reference spans 604a-604n. Further, the set of base-level bins 612a, 612b, through 612n of the personalized haplotype database 610 comprise encoded variant data for nucleotide variants from the respective selected sets of population haplotypes 616a, 616b, through 616n. In some embodiments, for example, each base-level bin of the set of base-level bins 612a-612n comprises a matrix including corresponding variant data representing allele-variant differences from selected population haplotypes (e.g., of the respective selected sets of population haplotypes 616a-616n) and variant positions for the allele-variant differences. In various embodiments, the variant data within each base-level bin includes data indications (e.g., the allele-variant differences 618a, 618b, through 618n) of single-nucleotide polymorphisms (SNPs) and/or insertions or deletions (indels) at respective genomic coordinates of the reference genome (e.g., of the primary contiguous sequence). Additionally, as shown in FIG. 6B, population frequencies 619a, 619b, through 619n for the respective selected sets of population haplotypes 616a, 616b, through 616n are adjusted relative to the corresponding population frequencies 609a, 609b, through 609n (see FIG. 6A) to reflect the predicted diplotype within each of the base-level bins 612a, 612b, through 612n.
[0110] As mentioned above, in some embodiments, the personalized sequencing system 106 utilizes a haplotype data structure (e.g., to implement a population haplotype database and/or a personalized haplotype database) with a hierarchical partitioning of genomic regions of a reference genome into multiple levels of bins corresponding to spans of nucleobases within the reference genome. For example, FIG. 7 illustrates a haplotype data structure 700 having a base level 702 comprising a set of base-level bins 704 and multiple successive levels 706a, 706b, 706c, through 706n of higher-level bins spanning successively larger spans of nucleobases of a reference genome. Specifically, the haplotype data structure 700 comprises the base level 702 comprising the set of base-level bins 704 jointly spanning a primary contiguous sequence of the reference genome and the multiple successive levels 706a-706n of higher-level bins 708a, 708b, 708c, through 708n and offset higher-level bins 709a, 709b, 709c, and so forth also spanning the primary contiguous sequence of the reference genome. As indicated by FIG. 7, the successive level 706n comprises a higher-level bin 708n and a corresponding offset higher-level bin. But FIG. 7 does not depict the corresponding offset higher-level bin for the successive level 706n and the higher-level bin 708n due to constraints on figure space. As further indicated by the ellipsis (or dots) in FIG. 7, the personalized sequencing system 106 can identify, determine, generate, or utilize more base-level bins, successive levels, higher-level bins, and/or offset higher-level bins than those depicted in FIG. 7. While the haplotype data structure 700 in FIG. 7 depicts a data structure for a personalized haplotype database, a similar data structure can be used for a population haplotype database as described by Ruehle. As further indicated above, the haplotype data structure 700 can be encoded or otherwise be represented in a particular file type, such as a TSV file or CSV file.
[0111] As illustrated in FIG. 7, the base level 702 of the haplotype data structure 700 includes the set of base-level bins 704 corresponding to a respective set of base-level reference spans of the primary contiguous sequence for the reference genome. Each reference span of the set of base-level reference spans corresponds to a genomic region of a first length between respective genomic coordinates of the reference genome. In one or more embodiments, for example, each reference span of the set of base-level reference spans includes 1,000 base pairs (1 kbp) of the primary contiguous sequence for the reference genome. Alternatively, the first length of the base-level reference spans can be less than or greater than 1 kbp, such as, but not limited to, 250 bp, 500 bp, 1500 bp, 5 kbp, 10 kbp, and so forth. Accordingly, in various embodiments, the set of base-level bins 704 collectively span either the entire primary contiguous sequence or a genomic region of interest, such as but not limited to an entire chromosome.
[0112] As further indicated by FIG. 7, the set of base-level bins 704 of the base level 702 comprise variant data for nucleotide variants from respective sets of population haplotypes. In implementations of a population haplotype database, for example, each base-level bin of the base level 702 includes variant data for each locally distinct population haplotype. As mentioned, each locally distinct population haplotype comprises a unique set of one or more allele-variant differences relative to other population haplotypes within a respective base-level reference span of a given base-level bin of the set of base-level bins 704. In implementations of a personalized haplotype database, however, each base-level bin of the base level 702 includes variant data for population haplotypes selected by the personalized sequencing system 106 (e.g., as described in relation to FIG. 5). As shown in FIG. 7, for example, the set of base-level bins 704 comprise respective pairs of population haplotypes selected by the personalized sequencing system 106, as indicated by the numbers (“2(0..1)”) associated with each base-level reference span of the set of base-level bins 704.
[0113] As also shown in FIG. 7, the haplotype data structure 700 comprises the multiple successive levels 706a-706n of higher-level bins 708a-708n. A first successive level 706a, for instance, comprises a first set of higher-level bins 708a corresponding to a first set of higher-level reference spans of the primary contiguous sequence for the reference genome. Each reference span of the first set of higher-level reference spans corresponds to an expanded genomic region of a second length between respective genomic coordinates of the reference genome, wherein the expanded genomic regions are expanded relative to the genomic regions represented by the set of base-level reference spans such that the second length (of the respective first set of higher-level reference spans) is longer than the first length (of the set of base-level reference spans). More specifically, as illustrated in FIG. 7, each higher-level bin of the first set of higher-level bins 708a of the first successive level 706a corresponds to a consecutive pair of base-level bins from the set of base-level bins 704 of the base level 702 of the haplotype data structure 700.
[0114] Furthermore, as shown in FIG. 7, the multiple successive levels 706a-706c of the haplotype data structure 700 comprise respective sets of offset higher-level bins 709a-709c and the successive level 706n of the haplotype data structure 700 comprises the higher-level bin 708n and a corresponding offset higher-level bin. For instance, the first successive level 706a includes a set of offset higher-level bins 709a corresponding to a first set of offset higher-level reference spans of the primary contiguous sequence for the reference genome. Each reference span of the first set of offset higher-level reference spans corresponds to an offset expanded genomic region of the second length (i.e., the same length as the reference spans of the first set of successive reference spans) between respective genomic coordinates of the reference genome. In like manner as the first set of higher-level bins 708a, the first set of offset higher-level bins 709a correspond to respective consecutive pairs of base-level bins from the set of base-level bins 704 of the base level 702 of the haplotype data structure 700. Further, as illustrated, the respective reference spans of the first set of offset higher-level bins 709a are offset relative to the reference spans of the first set of higher-level bins 708a, such that each consecutive pair of base-level bins from the set of base-level bins 704 is represented by either a higher-level bin or an offset higher-level bin from the first successive level 706a.
[0115] Moreover, each additional successive level 706b-706n of the haplotype data structure 700 comprises additional higher-level bins 708b-708n corresponding to respective additional higher-level reference spans corresponding to further expanded genomic regions between genomic coordinates of the primary contiguous sequence for the reference genome. In particular, as shown in FIG. 7, each higher-level bin (or offset higher-level bin) of a given successive level of the haplotype data structure 700 spans a combined genomic region of a pair of consecutive bins of a prior level of the haplotype data structure 700 (e.g., as indicated by the arrows linking various bins in FIG. 7). For example, the first illustrated bin of the set of higher-level bins 708c spans the same genomic region represented by the first two illustrated bins of the set of higher-level bins 708b. Likewise, the first illustrated bin of the set of higher-level bins 708b spans the same genomic region represented by the first two illustrated bins of the set of higher-level bins 708a. Indeed, each successive level comprises higher-level bins corresponding to a pair of consecutive bins from the previous level of the haplotype data structure 700.
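The doubling of bin spans from level to level, with offset bins staggered against the regular bins, can be sketched as below. Spans are expressed as half-open ranges of base-level bin indices. The half-width offset at every level is an assumption of this example (the description states the offset arrangement explicitly only for the first successive level), and `level_bins` is a hypothetical name.

```python
def level_bins(n_base_bins, level):
    """Regular and offset bins for one successive level.

    Each bin covers 2**level consecutive base-level bins, returned as
    half-open (start, end) ranges of base-level bin indices. Offset bins
    are shifted by half a bin width (an assumed convention).
    """
    width = 2 ** level
    half = width // 2
    last_start = n_base_bins - width
    regular = [(s, s + width) for s in range(0, last_start + 1, width)]
    offset = [(s, s + width) for s in range(half, last_start + 1, width)]
    return regular, offset


level1_regular, level1_offset = level_bins(8, level=1)
# level1_regular == [(0, 2), (2, 4), (4, 6), (6, 8)]
# level1_offset  == [(1, 3), (3, 5), (5, 7)]
level2_regular, level2_offset = level_bins(8, level=2)
# level2_regular == [(0, 4), (4, 8)]
# level2_offset  == [(2, 6)]
```

At level 1, every consecutive pair of base-level bins falls in either a regular bin or an offset bin, so a read overlapping a boundary between two base-level bins is still covered by a single bin at that level.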
[0116] Moreover, in some embodiments, the respective higher-level bins of each successive level of the haplotype data structure 700 comprise variant-data indices referencing combinations of the variant data from corresponding base-level bins of the base level 702. In particular, each higher-level bin and offset higher-level bin of the multiple sets of higher-level bins 708a-708c and offset higher-level bins 709a-709c, respectively (and each of the higher-level bin 708n and a corresponding offset higher-level bin of the successive level 706n) comprise variant-data indices referencing combinations of variant data from corresponding base-level bins of the set of base-level bins 704. Furthermore, the variant-data indices include indications of locally distinct population haplotypes within each respective higher-level bin or offset higher-level bin. As illustrated in FIG. 7, for example, one of the offset higher-level bins 709b of the successive level 706b indicates two locally distinct population haplotypes (indicated by “2 Haplotypes (0..1)”). As also illustrated, two bins of the higher-level bins 708a from the previous successive level (the first successive level 706a) indicate two locally distinct population haplotypes (indicated by “2(0..1)”) and one locally distinct population haplotype (indicated by “1(0)”), respectively.
[0117] In one or more embodiments, the higher-level bins of each successive level comprise variant-data indices indicating locally distinct population haplotypes and linking the higher-level bins to variant data within the corresponding base-level bins without including the variant data from the respective base-level bins, thus avoiding redundant encoding of variant data within the haplotype data structure 700. Referring to the successive level 706b, for example, the aforementioned bin of offset higher-level bins 709b indicating four locally distinct population haplotypes can include variant-data indices referencing how the locally distinct population haplotypes of the corresponding higher-level bins (of the higher-level bins 708a) from the previous successive level (the first successive level 706a) combine to form the four locally distinct population haplotypes of the aforementioned bin. Further, each of the corresponding higher-level bins 708a can include variant-data indices referencing the population haplotypes (and the variant data thereof) indicated within the corresponding base-level bins (of the set of base-level bins 704) from the base level 702.
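A minimal sketch of index-based linking is shown below: the higher-level bin stores only pairs of indices into its two child bins, and variant data is resolved from the base-level bins on demand. The dictionary layout, the field names `children` and `haplotypes`, and the function `resolve` are hypothetical and are not drawn from the data structure described above or from Ruehle.

```python
# Base-level bins hold the actual variant data: one list of
# (position, alternate_allele) differences per selected haplotype.
base_bins = {
    0: [[(120, "A")], [(455, "T")]],
    1: [[(1300, "G")], []],
}

# A higher-level bin spanning base-level bins 0 and 1. Each haplotype is a
# pair of indices into the child bins, so no variant data is duplicated.
higher_bin = {"children": (0, 1), "haplotypes": [(0, 0), (1, 1)]}


def resolve(higher_bin, base_bins):
    """Expand index pairs into full variant lists across both child bins."""
    left, right = higher_bin["children"]
    return [
        base_bins[left][i] + base_bins[right][j]
        for i, j in higher_bin["haplotypes"]
    ]


resolved = resolve(higher_bin, base_bins)
# resolved == [[(120, "A"), (1300, "G")], [(455, "T")]]
```

The same idea could be applied recursively at higher levels, with each level indexing into the level below rather than directly into base-level bins.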
[0118] Moreover, while FIG. 7 depicts the haplotype data structure 700 having pairs of haplotypes selected with respect to reference spans of respective base-level bins of the set of base-level bins 704, the personalized sequencing system 106 can alternatively evaluate and select haplotype pairs across reference spans of a respective successive level of bins. In some embodiments, for example, the personalized sequencing system 106 evaluates candidate population haplotypes within reference spans corresponding to the higher-level bins 708b, then populates the haplotype data structure 700 with haplotype data from the selected candidate population haplotypes, such as described above.
[0119] As mentioned above, in certain described embodiments, the personalized sequencing system 106 implements mapping and alignment of nucleotide reads from a genomic sample with genomic regions of a reference genome with increased accuracy. To illustrate, FIGS. 8-9 show experimental results of the personalized sequencing system 106 generating and utilizing a personalized haplotype database, in accordance with some of the disclosed embodiments, to determine personalized alignments of nucleotide reads. In particular, FIGS. 8-9 illustrate comparative results of identifying single nucleotide polymorphisms (SNPs) based on read alignments generated according to one or more embodiments.
[0120] Specifically, FIG. 8 includes a table of experimental results of identifying SNPs in nucleotide reads aligned by existing sequencing systems and the personalized sequencing system 106. As shown, the columns of the table respectively correspond to false positives (FP), false negatives (FN), incorrect heterozygous or homozygous genotype calls (Hethom), and combined false positives and false negatives (FP+FN). Further, as depicted in FIG. 8, the first two rows (excluding the header row) of the table of experimental results correspond to results generated by an existing sequencing system using (i) a population haplotype database comprising 128 global population haplotypes (“Hap DB Global 128 Samples”) and (ii) a population haplotype database comprising 16 ancestry-specific population haplotypes selected based on the ancestry of the genomic sample tested (“HapDB Euro 16 Samples”), respectively.
[0121] By contrast, the final three rows of the table of experimental results correspond to results generated by the personalized sequencing system 106 using (i) a first personalized haplotype database generated, for the corresponding genomic sample, utilizing a first probabilistic model based on an exhaustive scoring of all possible pairs of candidate haplotypes for each reference span (“Personalized (exhaustive 1)”), (ii) a second personalized haplotype database generated, for the corresponding genomic sample, utilizing a second probabilistic model based on an exhaustive scoring of all possible pairs of candidate haplotypes for each reference span (“Personalized (exhaustive 2)”), and (iii) a third personalized haplotype database generated, for the corresponding genomic sample, utilizing a Variational Bayesian model to partition nucleotide reads by parental inheritance to enable individual scoring of candidate haplotypes (“Personalized (Variational Bayes)”), respectively.
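As a non-limiting sketch of what an exhaustive scoring of all pairs of candidate haplotypes within one reference span might look like, the following Python example scores every unordered pair under a deliberately simple model. The model is an assumption made for illustration: each read is taken to originate from either haplotype of the pair with equal probability, bases are taken to be independent with a fixed error rate `ERROR_RATE`, and reads are given as pre-placed `(start, sequence)` tuples. The function names are hypothetical, and the sketch does not reproduce either probabilistic model evaluated in FIG. 8.

```python
import math
from itertools import combinations_with_replacement
from typing import Dict, List, Tuple

ERROR_RATE = 0.01  # assumed per-base sequencing error rate (illustrative)


def read_likelihood(read: str, start: int, haplotype: str) -> float:
    """P(read | haplotype) under an independent per-base error model."""
    likelihood = 1.0
    for offset, base in enumerate(read):
        if haplotype[start + offset] == base:
            likelihood *= 1.0 - ERROR_RATE
        else:
            likelihood *= ERROR_RATE / 3.0
    return likelihood


def score_pairs(
    reads: List[Tuple[int, str]], haplotypes: Dict[str, str]
) -> Dict[Tuple[str, str], float]:
    """Log-likelihood of the reads for every unordered pair of haplotypes."""
    scores: Dict[Tuple[str, str], float] = {}
    for a, b in combinations_with_replacement(sorted(haplotypes), 2):
        log_likelihood = 0.0
        for start, read in reads:
            # A read is assumed equally likely to come from either haplotype.
            p = 0.5 * read_likelihood(read, start, haplotypes[a])
            p += 0.5 * read_likelihood(read, start, haplotypes[b])
            log_likelihood += math.log(p)
        scores[(a, b)] = log_likelihood
    return scores


# Toy reference span with three candidate haplotypes (sequences are invented).
HAPLOTYPES = {"h0": "ACGTACGT", "h1": "ACGAACGT", "h2": "ACGTACCT"}
READS = [(0, "ACGT"), (2, "GAAC"), (4, "ACGT"), (1, "CGAA")]

pair_scores = score_pairs(READS, HAPLOTYPES)
best_pair = max(pair_scores, key=pair_scores.get)
print(best_pair)
```

Because the number of pairs grows quadratically with the number of candidate haplotypes in a span, a sketch of this kind is practical only when the candidates have been reduced to a modest set, for example to locally distinct haplotypes.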
[0122] Indeed, as shown in FIG. 8, each of the three portrayed example embodiments of the personalized sequencing system 106 exhibits improved overall accuracy relative to the portrayed existing sequencing systems in identifying SNPs within nucleotide reads (e.g., in terms of FPs, FNs, Hethom accuracy, and combined FP+FN metrics). Moreover, while in certain cases personalization by matching ancestry of a genomic sample can further improve genotype calling accuracy (see, e.g., “HapDB Euro 16 Samples” results), the personalized sequencing system 106 can produce similar or improved results without contextual information for the given genomic sample, thus providing increased accuracy over many existing systems with increased flexibility and efficiency.
[0123] Moreover, FIG. 9 includes an additional illustration of experimental results of identifying SNPs in nucleotide reads aligned by existing sequencing systems and the personalized sequencing system 106. Specifically, FIG. 9 depicts read pileups 906b and 906c (pileups of nucleotide reads mapped to and aligned with respective genomic regions of a reference genome 902) generated by an existing sequencing system and by the personalized sequencing system 106, respectively, in comparison with a ground truth pileup 906a comprising at least one known single nucleotide variant, such as an identified known single nucleotide variant 904a. FIG. 9 further depicts reads aligned with relatively low confidence (e.g., having a low mapping-quality (MAPQ) score) by using an unfilled rectangle to represent such relatively low-confidence-aligned reads and reads aligned with relatively high confidence by using a filled rectangle (e.g., all of the reads within the ground truth pileup 906a) to represent such relatively high-confidence-aligned reads. As shown, the read pileup 906b of the existing sequencing system results in a false negative call 904b for the identified known single nucleotide variant 904a due to multiple reads aligning to the respective genomic region with a zero or near-zero mapping-quality (MAPQ) score (as indicated by the stack of unfilled rectangles). By contrast, the read pileup 906c of the personalized sequencing system 106 includes a true positive call 904c for the identified known single nucleotide variant 904a due to multiple reads aligning to the respective genomic region with a relatively high mapping-quality (MAPQ) score (as indicated by the stack of filled rectangles).
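To illustrate why a read with two equally plausible placements receives a near-zero MAPQ score, the following Python sketch applies the conventional definition of mapping quality as -10 log10 of the probability that the reported placement is wrong. The way that probability is derived here (the best candidate likelihood divided by the sum over all candidates) and the cap `MAX_MAPQ` are simplifying assumptions for illustration; production aligners use their own heuristics, and the function name is hypothetical.

```python
import math
from typing import List

MAX_MAPQ = 60  # assumed cap, chosen for illustration


def mapq_from_likelihoods(likelihoods: List[float]) -> int:
    """Phred-scaled confidence that the best candidate placement is correct."""
    total = sum(likelihoods)
    best = max(likelihoods)
    p_wrong = 1.0 - best / total
    if p_wrong <= 0.0:
        return MAX_MAPQ
    return min(MAX_MAPQ, round(-10.0 * math.log10(p_wrong)))


# Two equally good placements: the read is ambiguous.
ambiguous = mapq_from_likelihoods([1.0, 1.0])

# One placement is strongly favored, e.g. because a haplotype in the database
# explains a mismatch at that placement (the likelihood values are invented).
resolved = mapq_from_likelihoods([1.0, 0.001])

print(ambiguous, resolved)
```

Under these assumptions, any evidence that makes one candidate placement fit a read better than its competitors, such as a matching haplotype variant, raises the MAPQ score of that placement.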
[0124] Indeed, as illustrated in FIGS. 8-9, the personalized sequencing system 106 can generate and utilize a personalized haplotype database to efficiently determine read alignments for nucleotide reads from a genomic sample with improved accuracy in identifying variants relative to existing sequencing systems, as indicated by the comparative number of false positives, false negatives, and incorrect heterozygous or homozygous genotype calls identified within the provided experimental results.
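For clarity, the metrics reported in FIG. 8 can be computed along the following lines. This Python sketch is an assumption about the counting conventions, which differ between benchmarking tools: a false positive is a called site absent from the truth set, a false negative is a truth site that was not called, and a Hethom error is a site present in both sets whose heterozygous or homozygous state disagrees. The site tuple layout, the `"het"`/`"hom"` labels, the function name, and the example data are invented for illustration.

```python
from typing import Dict, Tuple

# Illustrative site key (an assumption): (chromosome, position, ref, alt).
Site = Tuple[str, int, str, str]


def compare_snp_calls(
    truth: Dict[Site, str], calls: Dict[Site, str]
) -> Dict[str, int]:
    """Count FP, FN, Hethom, and FP+FN for a call set against a truth set."""
    fp = sum(1 for site in calls if site not in truth)
    fn = sum(1 for site in truth if site not in calls)
    hethom = sum(
        1 for site, genotype in calls.items()
        if site in truth and truth[site] != genotype
    )
    return {"FP": fp, "FN": fn, "Hethom": hethom, "FP+FN": fp + fn}


TRUTH = {
    ("chr1", 100, "A", "G"): "het",
    ("chr1", 200, "C", "T"): "hom",
    ("chr1", 300, "G", "A"): "het",
}
CALLS = {
    ("chr1", 100, "A", "G"): "hom",  # right site, wrong zygosity
    ("chr1", 200, "C", "T"): "hom",  # correct
    ("chr1", 400, "T", "C"): "het",  # not in truth
}

metrics = compare_snp_calls(TRUTH, CALLS)
print(metrics)
```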
[0125] Turning now to FIG. 10, this figure illustrates an example flowchart of a series of acts for generating and utilizing a personalized haplotype database to determine personalized alignments of a set of nucleotide reads from a genomic sample in accordance with one or more embodiments. While FIG. 10 illustrates acts according to particular embodiments, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown in FIG. 10. The acts of FIG. 10 can be performed as part of a method. Alternatively, a non-transitory computer readable storage medium can comprise instructions that, when executed by one or more processors, cause a computing device to perform the acts depicted in FIG. 10. In still further embodiments, a system comprises at least one processor and a non-transitory computer readable medium comprising instructions that, when executed by the at least one processor, cause the system to perform the acts of FIG. 10.
[0126] As shown in FIG. 10, the series of acts 1000 includes an act 1002 of identifying nucleotide reads and candidate population haplotypes within a set of reference spans of a reference genome, an act 1004 of generating haplotype set scores for sets of the candidate population haplotypes, an act 1006 of generating a personalized haplotype database based on the haplotype set scores, and an act 1008 of determining personalized alignments of the set of nucleotide reads utilizing the personalized haplotype database.
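As one hedged illustration of the act 1004, the following Python sketch scores individual candidate population haplotypes within a single reference span by counting read k-mers that match k-mers distinct to each candidate. The interpretation of "distinct" as "occurring in exactly one candidate haplotype of the span", the raw-count score, the function names, and the toy sequences are all assumptions made for this sketch rather than details of the disclosed embodiments.

```python
from collections import Counter
from typing import Dict, List, Set


def kmers(sequence: str, k: int) -> Set[str]:
    """All k-mers occurring in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def distinct_kmers(haplotypes: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """K-mers that occur in exactly one candidate haplotype of the span."""
    per_haplotype = {name: kmers(seq, k) for name, seq in haplotypes.items()}
    owners: Counter = Counter()
    for kmer_set in per_haplotype.values():
        owners.update(kmer_set)
    return {
        name: {kmer for kmer in kmer_set if owners[kmer] == 1}
        for name, kmer_set in per_haplotype.items()
    }


def kmer_support(reads: List[str], haplotypes: Dict[str, str], k: int) -> Dict[str, int]:
    """Count read k-mers that hit each haplotype's distinct k-mers."""
    read_counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            read_counts[read[i:i + k]] += 1
    return {
        name: sum(read_counts[kmer] for kmer in kmer_set)
        for name, kmer_set in distinct_kmers(haplotypes, k).items()
    }


# Toy span: two candidate haplotypes differing at one position.
CANDIDATES = {"h0": "ACGTACGT", "h1": "ACGAACGT"}
SPAN_READS = ["CGAAC", "ACGT", "GAACG"]

support = kmer_support(SPAN_READS, CANDIDATES, k=4)
print(support)
```

A k-mer comparison of this kind does not require the reads to be aligned first, which is one reason it may be attractive as an alternative to alignment-based scoring.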
[0127] For example, the series of acts 1000 can include acts to perform any of the operations described in the following clauses:
CLAUSE 1. A computer-implemented method comprising: identifying, within a set of reference spans of a reference genome, a set of nucleotide reads from a genomic sample and candidate population haplotypes from a population haplotype database; generating, for the set of reference spans of the reference genome, haplotype set scores for sets of the candidate population haplotypes based on comparing the set of nucleotide reads and the candidate population haplotypes; generating, for the genomic sample and based on the haplotype set scores, a personalized haplotype database comprising a subset of population haplotypes from the population haplotype database within the set of reference spans; and determining, utilizing the personalized haplotype database, one or more personalized alignments of the set of nucleotide reads with respective genomic regions of the reference genome.
CLAUSE 2. The computer-implemented method of clause 1, further comprising: identifying one or more distinct k-mers for each candidate population haplotype within a given reference span of the set of reference spans; and generating, for the given reference span, respective haplotype set scores of the haplotype set scores for respective candidate population haplotypes of the sets of the candidate population haplotypes within the given reference span based on comparing k-mers of the set of nucleotide reads with the one or more distinct k-mers of the respective candidate population haplotypes.
CLAUSE 3. The computer-implemented method of clause 1, further comprising: determining initial alignments of the set of nucleotide reads with respective genomic regions of the reference genome; identifying subsets of nucleotide reads of the set of nucleotide reads that, according to the initial alignments, align to respective reference spans of the set of reference spans; and generating the haplotype set scores for the set of reference spans based on comparing the subsets of nucleotide reads and the candidate population haplotypes within the respective reference spans of the set of reference spans.
CLAUSE 4. The computer-implemented method of clause 3, further comprising: identifying, as indicated within the population haplotype database, allele-variant differences between population haplotypes and a primary contiguous sequence at respective genomic regions of the reference genome; and determining one or more of the initial alignments of the set of nucleotide reads by rescoring candidate reference alignments of the set of nucleotide reads with the primary contiguous sequence according to the allele-variant differences.
CLAUSE 5. The computer-implemented method of any of clauses 3-4, further comprising: identifying one or more population haplotypes for each nucleotide read of the set of nucleotide reads based on comparing the set of nucleotide reads with haplotype variants within the respective genomic regions of the initial alignments; and generating an alignment data file comprising: the initial alignments of the set of nucleotide reads; alignment scores corresponding to the initial alignments; and the identified one or more population haplotypes for each nucleotide read of the set of nucleotide reads.
CLAUSE 6. The computer-implemented method of any of clauses 1-5, further comprising: determining haplotype set likelihoods for the set of reference spans by comparing the set of nucleotide reads and pairs of the candidate population haplotypes within respective reference spans of the set of reference spans; and generating the haplotype set scores for the set of reference spans based on the haplotype set likelihoods.
CLAUSE 7. The computer-implemented method of any of clauses 1-6, further comprising: identifying at least one reference span of the set of reference spans corresponding to a haploid region of the reference genome; and generating, for the at least one reference span, haplotype set scores for individual candidate population haplotypes based on comparing the set of nucleotide reads and the candidate population haplotypes within the at least one reference span.
CLAUSE 8. The computer-implemented method of any of clauses 1-7, further comprising including within the subset of population haplotypes of the personalized haplotype database, a respective subset of the candidate population haplotypes for each reference span of the set of reference spans based on the haplotype set scores.
CLAUSE 9. The computer-implemented method of clause 8, further comprising: generating haplotype set likelihoods for the set of reference spans by comparing the set of nucleotide reads and the candidate population haplotypes; generating, utilizing a hidden Markov model (HMM) algorithm, a set of haplotype set posterior probabilities for each reference span of the set of reference spans based on respective haplotype set likelihoods for adjacent reference spans of the set of reference spans; and generating, based on the set of haplotype set posterior probabilities for each reference span of the set of reference spans, respective sets of the haplotype set scores for the set of reference spans.
CLAUSE 10. The computer-implemented method of clause 9, further comprising: generating, utilizing the haplotype set likelihoods in a forward pass of the HMM algorithm across the set of reference spans, a set of forward probabilities for each reference span of the set of reference spans; generating, utilizing a backward pass of the HMM algorithm to update the set of forward probabilities for each reference span of the set of reference spans, a set of updated probabilities for each reference span of the set of reference spans; and generating, based on the set of updated probabilities for each reference span of the set of reference spans, the set of haplotype set posterior probabilities for each reference span of the set of reference spans.
CLAUSE 11. The computer-implemented method of any of clauses 9-10, further comprising utilizing a recombination parameter of the HMM algorithm to reduce variation in haplotype set posterior probabilities across the set of reference spans.
CLAUSE 12. The computer-implemented method of any of clauses 1-11, further comprising: categorizing, utilizing a Variational Bayesian model, each nucleotide read of the set of nucleotide reads within each respective reference span of the set of reference spans as inherited from a first parent or a second parent; generating, based on each categorized nucleotide read of the set of nucleotide reads, individual haplotype scores for population haplotypes selected from the population haplotype database; and generating the haplotype set scores for pairs of the population haplotypes selected from the population haplotype database by combining individual haplotype scores from nucleotide reads inherited from the first parent with individual haplotype scores from nucleotide reads inherited from the second parent.
CLAUSE 13. The computer-implemented method of any of clauses 1-12, further comprising generating, for a given reference span of the set of reference spans, the haplotype set scores for respective pairs of locally distinct population haplotypes within the given reference span, each locally distinct population haplotype comprising a unique set of one or more allele-variant differences relative to other population haplotypes within the given reference span.
CLAUSE 14. The computer-implemented method of any of clauses 1-13, further comprising: identifying, as indicated within the personalized haplotype database, allele-variant differences between population haplotypes of the subset of population haplotypes and a primary contiguous sequence at the respective genomic regions of the reference genome; and determining one or more of the one or more personalized alignments of the set of nucleotide reads by rescoring candidate reference alignments of the set of nucleotide reads with the primary contiguous sequence according to allele-variant differences.
CLAUSE 15. The computer-implemented method of any of clauses 1-14, further comprising: generating alignment scores for initial alignments of the set of nucleotide reads with respective genomic regions of the reference genome based on comparing the set of nucleotide reads with population haplotypes from the population haplotype database; generating an initial alignment data file comprising the initial alignments and corresponding alignment scores; generating personalized alignment scores for the one or more personalized alignments based on comparing the set of nucleotide reads with the subset of population haplotypes within the personalized haplotype database; and generating a personalized alignment data file comprising the one or more personalized alignments and corresponding personalized alignment scores.
CLAUSE 16. The computer-implemented method of any of clauses 1-15, further comprising determining genotype calls for the genomic sample based on the one or more personalized alignments.
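The HMM-based smoothing recited in clauses 9-11 can be illustrated with a standard scaled forward-backward pass across reference spans. The following Python sketch is an assumption-laden illustration and not the claimed algorithm: each hidden state stands for one candidate haplotype set, the initial distribution is uniform, and a single `recombination` parameter gives the probability of switching to any other state between adjacent spans. The function name and the likelihood values are invented.

```python
from typing import List


def forward_backward(
    likelihoods: List[List[float]], recombination: float
) -> List[List[float]]:
    """Posterior probability of each state in each span.

    likelihoods[t][s] is the likelihood of the reads in span t under state s.
    """
    n_spans = len(likelihoods)
    n_states = len(likelihoods[0])

    def transition(i: int, j: int) -> float:
        if n_states == 1:
            return 1.0
        return 1.0 - recombination if i == j else recombination / (n_states - 1)

    def normalize(values: List[float]) -> List[float]:
        total = sum(values)
        return [v / total for v in values]

    # Forward pass, rescaled at every span to avoid underflow.
    forward = [normalize(list(likelihoods[0]))]
    for t in range(1, n_spans):
        prev = forward[-1]
        forward.append(normalize([
            likelihoods[t][j] * sum(prev[i] * transition(i, j) for i in range(n_states))
            for j in range(n_states)
        ]))

    # Backward pass, also rescaled.
    backward = [[1.0] * n_states for _ in range(n_spans)]
    for t in range(n_spans - 2, -1, -1):
        backward[t] = normalize([
            sum(transition(i, j) * likelihoods[t + 1][j] * backward[t + 1][j]
                for j in range(n_states))
            for i in range(n_states)
        ])

    return [
        normalize([f * b for f, b in zip(forward[t], backward[t])])
        for t in range(n_spans)
    ]


# Three adjacent spans, two candidate states; the middle span weakly favors state 1.
SPAN_LIKELIHOODS = [[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]

smoothed = forward_backward(SPAN_LIKELIHOODS, recombination=0.01)
unlinked = forward_backward(SPAN_LIKELIHOODS, recombination=0.5)
print(smoothed[1], unlinked[1])
```

With a small recombination parameter, the weak preference of the middle span is overridden by its neighbors, which is the sense in which such a parameter reduces variation in posterior probabilities across spans. With two states and a recombination value of 0.5, adjacent spans are effectively independent and the middle span keeps its own preference.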
[0128] The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleobase type from another are particularly applicable. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid (i.e., a nucleic acid polymer) can be an automated process. Preferred embodiments include sequencing-by-synthesis (SBS) techniques.
[0129] SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand. In traditional methods of SBS, a single nucleotide monomer may be provided to a target nucleotide in the presence of a polymerase in each delivery. However, in the methods described herein, more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.
[0130] SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties. Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using γ-phosphate-labeled nucleotides, as set forth in further detail below. In methods using nucleotide monomers lacking terminators, the number of nucleotides added in each cycle is generally variable and dependent upon the template sequence and the mode of nucleotide delivery. For SBS techniques that utilize nucleotide monomers having a terminator moiety, the terminator can be effectively irreversible under the sequencing conditions used as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible as is the case for sequencing methods developed by Solexa (now Illumina, Inc.).
[0131] SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a byproduct of incorporation of the nucleotide, such as release of pyrophosphate; or the like. In embodiments where two or more different nucleotides are present in a sequencing reagent, the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used. For example, the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by the sequencing methods developed by Solexa (now Illumina, Inc.).
[0132] Preferred embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) "Real-time DNA sequencing using detection of pyrophosphate release." Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) "Pyrosequencing sheds light on DNA sequencing." Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P. (1998) “A sequencing method based on real-time pyrophosphate.” Science 281(5375), 363; U.S. Pat. No. 6,210,891; U.S. Pat. No. 6,258,568 and U.S. Pat. No. 6,274,320, the disclosures of which are incorporated herein by reference in their entireties). In pyrosequencing, released PPi can be detected by being immediately converted to adenosine triphosphate (ATP) by ATP sulfurylase, and the level of ATP generated is detected via luciferase-produced photons. The nucleic acids to be sequenced can be attached to features in an array and the array can be imaged to capture the chemiluminescent signals that are produced due to incorporation of nucleotides at the features of the array. An image can be obtained after the array is treated with a particular nucleotide type (e.g., A, T, C or G). Images obtained after addition of each nucleotide type will differ with regard to which features in the array are detected. These differences in the image reflect the different sequence content of the features on the array. However, the relative locations of each feature will remain unchanged in the images. The images can be stored, processed and analyzed using the methods set forth herein. For example, images obtained after treatment of the array with each different nucleotide type can be handled in the same way as exemplified herein for images obtained from different detection channels for reversible terminator-based sequencing methods.
[0133] In another exemplary type of SBS, cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in WO 04/018497 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference. This approach is being commercialized by Solexa (now Illumina Inc.), and is also described in WO 91/06678 and WO 07/123,744, each of which is incorporated herein by reference. The availability of fluorescently labeled terminators in which both the termination can be reversed and the fluorescent label cleaved facilitates efficient cyclic reversible termination (CRT) sequencing. Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.
[0134] Preferably in reversible terminator-based sequencing embodiments, the labels do not substantially inhibit extension under SBS reaction conditions. However, the detection labels can be removable, for example, by cleavage or degradation. Images can be captured following incorporation of labels into arrayed nucleic acid features. In particular embodiments, each cycle involves simultaneous delivery of four different nucleotide types to the array and each nucleotide type has a spectrally distinct label. Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. Alternatively, different nucleotide types can be added sequentially, and an image of the array can be obtained between each addition step. In such embodiments, each image will show nucleic acid features that have incorporated nucleotides of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature. However, the relative position of the features will remain unchanged in the images. Images obtained from such reversible terminator-SBS methods can be stored, processed and analyzed as set forth herein. Following the image capture step, labels can be removed, and reversible terminator moieties can be removed for subsequent cycles of nucleotide addition and detection. Removal of the labels after they have been detected in a particular cycle and prior to a subsequent cycle can provide the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are set forth below.
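As a simplified illustration of the four-image embodiment described above, the following Python sketch assigns each arrayed feature the nucleotide type whose detection channel is brightest in one cycle. The channel order in `CHANNEL_TO_BASE`, the nested-list image layout, and the intensity values are assumptions for this sketch; practical base callers additionally correct for crosstalk, phasing, and background, none of which is modeled here.

```python
from typing import List

# Assumed mapping from detection channel index to nucleotide type.
CHANNEL_TO_BASE = ("A", "C", "G", "T")


def call_bases(images: List[List[float]]) -> str:
    """Call one base per feature for a single cycle.

    images[c][f] is the intensity of feature f in detection channel c.
    Because feature positions do not move, index f refers to the same
    feature in all four images.
    """
    n_features = len(images[0])
    calls = []
    for f in range(n_features):
        intensities = [images[c][f] for c in range(4)]
        brightest = max(range(4), key=lambda c: intensities[c])
        calls.append(CHANNEL_TO_BASE[brightest])
    return "".join(calls)


# One cycle, three features, four channel images (values are invented).
CYCLE_IMAGES = [
    [0.90, 0.10, 0.00],  # channel 0
    [0.10, 0.80, 0.10],  # channel 1
    [0.00, 0.05, 0.10],  # channel 2
    [0.00, 0.05, 0.90],  # channel 3
]

cycle_calls = call_bases(CYCLE_IMAGES)
print(cycle_calls)
```

The returned string holds one call per feature for the cycle; the read for a given feature would be assembled by collecting that feature's call across successive cycles.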
[0135] In particular embodiments some or all of the nucleotide monomers can include reversible terminators. In such embodiments, reversible terminators/cleavable fluors can include fluor linked to the ribose moiety via a 3' ester linkage (Metzker, Genome Res. 15:1767-1776 (2005), which is incorporated herein by reference). Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety). Ruparel et al. described the development of reversible terminators that used a small 3' allyl group to block extension but could easily be deblocked by a short treatment with a palladium catalyst. The fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light. Thus, either disulfide reduction or photocleavage can be used as a cleavable linker. Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP. The presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and/or electrostatic hindrance. The presence of one incorporation event prevents further incorporations unless the dye is removed. Cleavage of the dye removes the fluor and effectively reverses the termination. Examples of modified nucleotides are also described in U.S. Pat. No. 7,427,673, and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference in their entireties.
[0136] Additional exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Patent Application Publication No. 2007/0166705, U.S. Patent Application Publication No. 2006/0188901, U.S. Pat. No. 7,057,026, U.S. Patent Application Publication No. 2006/0240439, U.S. Patent Application Publication No. 2006/0281109, PCT Publication No. WO 05/065814, U.S. Patent Application Publication No. 2005/0100900, PCT Publication No. WO 06/064199, PCT Publication No. WO 07/010,251, U.S. Patent Application Publication No. 2012/0270305 and U.S. Patent Application Publication No. 2013/0260372, the disclosures of which are incorporated herein by reference in their entireties.
[0137] Some embodiments can utilize detection of four different nucleotides using fewer than four different labels. For example, SBS can be performed utilizing methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232. As a first example, a pair of nucleotide types can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g. via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair. As a second example, three of four different nucleotide types can be detected under particular conditions while a fourth nucleotide type lacks a label that is detectable under those conditions, or is minimally detected under those conditions (e.g., minimal detection due to background fluorescence, etc.). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on presence of their respective signals and incorporation of the fourth nucleotide type into the nucleic acid can be determined based on absence or minimal detection of any signal. As a third example, one nucleotide type can include label(s) that are detected in two different channels, whereas other nucleotide types are detected in no more than one of the channels. The aforementioned three exemplary configurations are not considered mutually exclusive and can be used in various combinations. An exemplary embodiment that combines all three examples is a fluorescent-based SBS method that uses a first nucleotide type that is detected in a first channel (e.g. dATP having a label that is detected in the first channel when excited by a first excitation wavelength), a second nucleotide type that is detected in a second channel (e.g. dCTP having a label that is detected in the second channel when excited by a second excitation wavelength), a third nucleotide type that is detected in both the first and the second channel (e.g. dTTP having at least one label that is detected in both channels when excited by the first and/or second excitation wavelength) and a fourth nucleotide type that lacks a label and is not, or is only minimally, detected in either channel (e.g. dGTP having no label).
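The exemplary two-channel configuration described above reduces to a four-entry lookup, sketched here in Python. The base assignments follow the example given in the paragraph (dATP in the first channel, dCTP in the second, dTTP in both, dGTP in neither); the function name, the fixed `threshold`, and the intensity values are assumptions made for illustration only.

```python
def decode_two_channel(ch1: float, ch2: float, threshold: float = 0.5) -> str:
    """Decode one feature from two channel intensities.

    A channel counts as 'detected' when its intensity exceeds the threshold.
    """
    detected = (ch1 > threshold, ch2 > threshold)
    table = {
        (True, False): "A",   # first channel only
        (False, True): "C",   # second channel only
        (True, True): "T",    # both channels
        (False, False): "G",  # no label, or only minimal signal
    }
    return table[detected]


# Four features observed in one cycle (intensity pairs are invented).
OBSERVATIONS = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.9), (0.05, 0.02)]
decoded = "".join(decode_two_channel(a, b) for a, b in OBSERVATIONS)
print(decoded)
```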
[0138] Further, as described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232, sequencing data can be obtained using a single channel. In such so-called one-dye sequencing approaches, the first nucleotide type is labeled but the label is removed after the first image is generated, and the second nucleotide type is labeled only after a first image is generated. The third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.
[0139] Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides. The oligonucleotides typically have different labels that are correlated with the identity of a particular nucleotide in a sequence to which the oligonucleotides hybridize. As with other SBS methods, images can be obtained following treatment of an array of nucleic acid features with the labeled sequencing reagents. Each image will show nucleic acid features that have incorporated labels of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature, but the relative position of the features will remain unchanged in the images. Images obtained from ligation-based sequencing methods can be stored, processed and analyzed as set forth herein. Exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Pat. No. 6,969,488, U.S. Pat. No. 6,172,218, and U.S. Pat. No. 6,306,597, the disclosures of which are incorporated herein by reference in their entireties.
[0140] Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M. "Nanopores and nucleic acids: prospects for ultrarapid sequencing." Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, "Characterization of nucleic acids by nanopore analysis". Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, "DNA molecules and configurations in a solid-state nanopore microscope" Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties). In such embodiments, the target nucleic acid passes through a nanopore. The nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin. As the target nucleic acid passes through the nanopore, each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore. (U.S. Pat. No. 7,001,792; Soni, G. V. & Meller, A. "Progress toward ultrafast DNA sequencing using solid-state nanopores." Clin. Chem. 53, 1996-2001 (2007); Healy, K. "Nanopore-based single-molecule DNA analysis." Nanomed. 2, 459-481 (2007); Cockroft, S. L., Chu, J., Amorin, M. & Ghadiri, M. R. "A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution." J. Am. Chem. Soc. 130, 818-820 (2008), the disclosures of which are incorporated herein by reference in their entireties). Data obtained from nanopore sequencing can be stored, processed and analyzed as set forth herein. In particular, the data can be treated as an image in accordance with the exemplary treatment of optical images and other images that is set forth herein.
[0141] Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. No. 7,329,492 and U.S. Pat. No. 7,211,414 (each of which is incorporated herein by reference) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No. 7,315,019 (which is incorporated herein by reference) and using fluorescent nucleotide analogs and engineered polymerases as described, for example, in U.S. Pat. No. 7,405,281 and U.S. Patent Application Publication No. 2008/0108082 (each of which is incorporated herein by reference). The illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. "Zero-mode waveguides for single-molecule analysis at high concentrations." Science 299, 682-686 (2003); Lundquist, P. M. et al. "Parallel confocal detection of single molecules in real time." Opt. Lett. 33, 1026-1028 (2008); Korlach, J. et al. "Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures." Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), the disclosures of which are incorporated herein by reference in their entireties). Images obtained from such methods can be stored, processed and analyzed as set forth herein.
[0142] Some SBS embodiments include detection of a proton released upon incorporation of a nucleotide into an extension product. For example, sequencing based on detection of released protons can use an electrical detector and associated techniques that are commercially available from Ion Torrent (Guilford, CT, a Life Technologies subsidiary) or sequencing methods and systems described in US 2009/0026082 A1; US 2009/0127589 A1; US 2010/0137143 A1; or US 2010/0282617 A1, each of which is incorporated herein by reference. Methods set forth herein for amplifying target nucleic acids using kinetic exclusion can be readily applied to substrates used for detecting protons. More specifically, methods set forth herein can be used to produce clonal populations of amplicons that are used to detect protons.
[0143] The above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously. In particular embodiments, different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner. In embodiments using surface-bound target nucleic acids, the target nucleic acids can be in an array format. In an array format, the target nucleic acids can typically be bound to a surface in a spatially distinguishable manner. The target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle or binding to a polymerase or other molecule that is attached to the surface. The array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR as described in further detail below.
[0144] The methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features/cm², 100 features/cm², 500 features/cm², 1,000 features/cm², 5,000 features/cm², 10,000 features/cm², 50,000 features/cm², 100,000 features/cm², 1,000,000 features/cm², 5,000,000 features/cm², or higher.
[0145] An advantage of the methods set forth herein is that they provide for rapid and efficient detection of a plurality of target nucleic acids in parallel. Accordingly, the present disclosure provides integrated systems capable of preparing and detecting nucleic acids using techniques known in the art such as those exemplified above. Thus, an integrated system of the present disclosure can include fluidic components capable of delivering amplification reagents and/or sequencing reagents to one or more immobilized DNA fragments, the system comprising components such as pumps, valves, reservoirs, fluidic lines and the like. A flow cell can be configured and/or used in an integrated system for detection of target nucleic acids. Exemplary flow cells are described, for example, in US 2010/0111768 A1 and US Ser. No. 13/273,666, each of which is incorporated herein by reference. As exemplified for flow cells, one or more of the fluidic components of an integrated system can be used for an amplification method and for a detection method. Taking a nucleic acid sequencing embodiment as an example, one or more of the fluidic components of an integrated system can be used for an amplification method set forth herein and for the delivery of sequencing reagents in a sequencing method such as those exemplified above. Alternatively, an integrated system can include separate fluidic systems to carry out amplification methods and to carry out detection methods. Examples of integrated sequencing systems that are capable of creating amplified nucleic acids and also determining the sequence of the nucleic acids include, without limitation, the MiSeq™ platform (Illumina, Inc., San Diego, CA) and devices described in US Ser. No. 13/273,666, which is incorporated herein by reference.
[0146] The sequencing system described above sequences nucleic acid polymers present in samples received by a sequencing device. As defined herein, "sample" and its derivatives, is used in its broadest sense and includes any specimen, culture and the like that is suspected of including a target. In some embodiments, the sample comprises DNA, RNA, PNA, LNA, chimeric or hybrid forms of nucleic acids. The sample can include any biological, clinical, surgical, agricultural, atmospheric or aquatic-based specimen containing one or more nucleic acids. The term also includes any isolated nucleic acid sample such as genomic DNA, fresh-frozen or formalin-fixed paraffin-embedded nucleic acid specimen. It is also envisioned that the sample can be from a single individual, a collection of nucleic acid samples from genetically related members, nucleic acid samples from genetically unrelated members, nucleic acid samples (matched) from a single individual such as a tumor sample and normal tissue sample, or sample from a single source that contains two distinct forms of genetic material such as maternal and fetal DNA obtained from a maternal subject, or the presence of contaminating bacterial DNA in a sample that contains plant or animal DNA. In some embodiments, the source of nucleic acid material can include nucleic acids obtained from a newborn, for example as typically used for newborn screening.
[0147] The nucleic acid sample can include high molecular weight material such as genomic DNA (gDNA). The sample can include low molecular weight material such as nucleic acid molecules obtained from FFPE or archived DNA samples. In another embodiment, low molecular weight material includes enzymatically or mechanically fragmented DNA. The sample can include cell-free circulating DNA. In some embodiments, the sample can include nucleic acid molecules obtained from biopsies, tumors, scrapings, swabs, blood, mucus, urine, plasma, semen, hair, laser capture micro-dissections, surgical resections, and other clinical or laboratory obtained samples. In some embodiments, the sample can be an epidemiological, agricultural, forensic or pathogenic sample. In some embodiments, the sample can include nucleic acid molecules obtained from an animal such as a human or mammalian source. In another embodiment, the sample can include nucleic acid molecules obtained from a non-mammalian source such as a plant, bacteria, virus or fungus. In some embodiments, the source of the nucleic acid molecules may be an archived or extinct sample or species.
[0148] Further, the methods and compositions disclosed herein may be useful to amplify a nucleic acid sample having low-quality nucleic acid molecules, such as degraded and/or fragmented genomic DNA from a forensic sample. In one embodiment, forensic samples can include nucleic acids obtained from a crime scene, nucleic acids obtained from a missing persons DNA database, nucleic acids obtained from a laboratory associated with a forensic investigation or include forensic samples obtained by law enforcement agencies, one or more military services or any such personnel. The nucleic acid sample may be a purified sample or a crude DNA-containing lysate, for example derived from a buccal swab, paper, fabric or other substrate that may be impregnated with saliva, blood, or other bodily fluids. As such, in some embodiments, the nucleic acid sample may comprise low amounts of, or fragmented portions of DNA, such as genomic DNA. In some embodiments, target sequences can be present in one or more bodily fluids including but not limited to, blood, sputum, plasma, semen, urine and serum. In some embodiments, target sequences can be obtained from hair, skin, tissue samples, autopsy or remains of a victim. In some embodiments, nucleic acids including one or more target sequences can be obtained from a deceased animal or human. In some embodiments, target sequences can include nucleic acids obtained from non-human DNA such as microbial, plant or entomological DNA. In some embodiments, target sequences or amplified target sequences are directed to purposes of human identification. In some embodiments, the disclosure relates generally to methods for identifying characteristics of a forensic sample. In some embodiments, the disclosure relates generally to human identification methods using one or more target specific primers disclosed herein or one or more target specific primers designed using the primer design criteria outlined herein. In one embodiment, a forensic or human identification sample containing at least one target sequence can be amplified using any one or more of the target-specific primers disclosed herein or using the primer criteria outlined herein.
[0149] The components of the personalized sequencing system 106 can include software, hardware, or both. For example, the components of the personalized sequencing system 106 can include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices (e.g., the client device 114, the local device 108, or the server device(s) 110). When executed by the one or more processors, the computer-executable instructions of the personalized sequencing system 106 can cause the computing devices to perform the bubble detection methods described herein. Alternatively, the components of the personalized sequencing system 106 can comprise hardware, such as special purpose processing devices to perform a certain function or group of functions. Additionally, or alternatively, the components of the personalized sequencing system 106 can include a combination of computer-executable instructions and hardware.
[0150] Furthermore, the components of the personalized sequencing system 106 performing the functions described herein with respect to the personalized sequencing system 106 may, for example, be implemented as part of a stand-alone application, as a module of an application, as a plug-in for applications, as a library function or functions that may be called by other applications, and/or as a cloud-computing model. Thus, components of the personalized sequencing system 106 may be implemented as part of a stand-alone application on a personal computing device or a mobile device. Additionally, or alternatively, the components of the personalized sequencing system 106 may be implemented in any application that provides sequencing services including, but not limited to Illumina BaseSpace, Illumina DRAGEN, or Illumina TruSight software. “Illumina,” “BaseSpace,” “DRAGEN,” and “TruSight,” are either registered trademarks or trademarks of Illumina, Inc. in the United States and/or other countries.
[0151] Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. In particular, one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions, from a non-transitory computer-readable medium, (e.g., a memory, etc.), and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.
[0152] Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices). Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.
[0153] Non-transitory computer-readable storage media (devices) includes RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) (e.g., based on RAM), Flash memory, phase-change memory (PCM), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
[0154] A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
[0155] Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a NIC), and then eventually transferred to computer system RAM and/or to less volatile computer storage media (devices) at a computer system. Thus, it should be understood that non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.
[0156] Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. In some embodiments, computer-executable instructions are executed on a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure. The computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
[0157] Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.
[0158] Embodiments of the present disclosure can also be implemented in cloud computing environments. In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
[0159] A cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud-computing model can also expose various service models, such as, for example, Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). A cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this description and in the claims, a “cloud-computing environment” is an environment in which cloud computing is employed.
[0160] FIG. 11 illustrates a block diagram of a computing device 1100 that may be configured to perform one or more of the processes described above. One will appreciate that one or more computing devices such as the computing device 1100 may implement the personalized sequencing system 106 and the sequencing device system 104. As shown by FIG. 11, the computing device 1100 can comprise a processor 1102, a memory 1104, a storage device 1106, an I/O interface 1108, and a communication interface 1110, which may be communicatively coupled by way of a communication infrastructure 1112. In certain embodiments, the computing device 1100 can include fewer or more components than those shown in FIG. 11. The following paragraphs describe components of the computing device 1100 shown in FIG. 11 in additional detail.
[0161] In one or more embodiments, the processor 1102 includes hardware for executing instructions, such as those making up a computer program. As an example, and not by way of limitation, to execute instructions for dynamically modifying workflows, the processor 1102 may retrieve (or fetch) the instructions from an internal register, an internal cache, the memory 1104, or the storage device 1106 and decode and execute them. The memory 1104 may be a volatile or nonvolatile memory used for storing data, metadata, and programs for execution by the processor(s). The storage device 1106 includes storage, such as a hard disk, flash disk drive, or other digital storage device, for storing data or instructions for performing the methods described herein.
[0162] The I/O interface 1108 allows a user to provide input to, receive output from, and otherwise transfer data to and receive data from computing device 1100. The I/O interface 1108 may include a mouse, a keypad or a keyboard, a touch screen, a camera, an optical scanner, network interface, modem, other known I/O devices or a combination of such I/O interfaces. The I/O interface 1108 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers. In certain embodiments, the I/O interface 1108 is configured to provide graphical data to a display for presentation to a user. The graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
[0163] The communication interface 1110 can include hardware, software, or both. In any event, the communication interface 1110 can provide one or more interfaces for communication (such as, for example, packet-based communication) between the computing device 1100 and one or more other computing devices or networks. As an example, and not by way of limitation, the communication interface 1110 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network.
[0164] Additionally, the communication interface 1110 may facilitate communications with various types of wired or wireless networks. The communication interface 1110 may also facilitate communications using various communication protocols. The communication infrastructure 1112 may also include hardware, software, or both that couples components of the computing device 1100 to each other. For example, the communication interface 1110 may use one or more networks and/or protocols to enable a plurality of computing devices connected by a particular infrastructure to communicate with each other to perform one or more aspects of the processes described herein. To illustrate, the sequencing process can allow a plurality of devices (e.g., a client device, sequencing device, and server device(s)) to exchange information such as sequencing data and error notifications.
[0165] In the foregoing specification, the present disclosure has been described with reference to specific exemplary embodiments thereof. Various embodiments and aspects of the present disclosure(s) are described with reference to details discussed herein, and the accompanying drawings illustrate the various embodiments. The description above and drawings are illustrative of the disclosure and are not to be construed as limiting the disclosure. Numerous specific details are described to provide a thorough understanding of various embodiments of the present disclosure.
[0166] The present disclosure may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. For example, the methods described herein may be performed with less or more steps/acts or the steps/acts may be performed in differing orders. Additionally, the steps/acts described herein may be repeated or performed in parallel with one another or in parallel with different instances of the same or similar steps/acts. The scope of the present application is, therefore, indicated by the appended claims rather than by the foregoing description. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.
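By way of illustration only, and not limitation, the forward and backward passes of an HMM across a set of reference spans, with a recombination parameter that smooths haplotype set posterior probabilities between adjacent spans, can be sketched as follows. The function name `forward_backward`, the uniform prior, the symmetric transition model, and the default recombination value are assumptions made for this sketch and are not taken from the disclosure; the input likelihoods are assumed to be strictly positive.

```python
def forward_backward(likelihoods, recombination=0.01):
    """Return per-span posterior probabilities over candidate haplotype sets.

    likelihoods[t][s] is the (hypothetical) likelihood of the reads in
    reference span t given candidate haplotype set s.
    """
    n_spans = len(likelihoods)
    n_states = len(likelihoods[0])

    def transition(i, j):
        # Assumed symmetric model: stay on the same haplotype set with
        # probability 1 - recombination, otherwise switch uniformly.
        if n_states == 1:
            return 1.0
        return 1.0 - recombination if i == j else recombination / (n_states - 1)

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass: uniform prior on the first span, then propagate.
    forward = [normalize([lik / n_states for lik in likelihoods[0]])]
    for t in range(1, n_spans):
        prev = forward[-1]
        row = [
            likelihoods[t][j] * sum(prev[i] * transition(i, j) for i in range(n_states))
            for j in range(n_states)
        ]
        forward.append(normalize(row))

    # Backward pass: evidence from spans to the right of each span.
    backward = [[1.0] * n_states for _ in range(n_spans)]
    for t in range(n_spans - 2, -1, -1):
        row = [
            sum(
                transition(i, j) * likelihoods[t + 1][j] * backward[t + 1][j]
                for j in range(n_states)
            )
            for i in range(n_states)
        ]
        backward[t] = normalize(row)

    # Combine both passes into posterior probabilities for each span.
    return [
        normalize([f * b for f, b in zip(forward[t], backward[t])])
        for t in range(n_spans)
    ]


# Example: the middle span is uninformative on its own, so its posterior
# is driven by the adjacent spans when recombination is small.
example_posteriors = forward_backward([[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]])
```

In this sketch, a smaller recombination value ties each span more strongly to its neighbors, while a value that makes all transitions equally likely leaves each span's posterior determined by its own likelihoods alone.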
Claims
1. A system comprising: at least one processor; and a non-transitory computer-readable medium storing instructions that, when executed by the at least one processor, cause the system to: identify, within a set of reference spans of a reference genome, a set of nucleotide reads from a genomic sample and candidate population haplotypes from a population haplotype database; generate, for the set of reference spans of the reference genome, haplotype set scores for sets of the candidate population haplotypes based on comparing the set of nucleotide reads and the candidate population haplotypes; generate, for the genomic sample and based on the haplotype set scores, a personalized haplotype database comprising a subset of population haplotypes from the population haplotype database within the set of reference spans; and determine, utilizing the personalized haplotype database, one or more personalized alignments of the set of nucleotide reads with respective genomic regions of the reference genome.
2. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: identify one or more distinct k-mers for each candidate population haplotype within a given reference span of the set of reference spans; and generate, for the given reference span, respective haplotype set scores of the haplotype set scores for respective candidate population haplotypes of the sets of the candidate population haplotypes within the given reference span based on comparing k-mers of the set of nucleotide reads with the one or more distinct k-mers of the respective candidate population haplotypes.
3. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: determine initial alignments of the set of nucleotide reads with respective genomic regions of the reference genome; identify subsets of nucleotide reads of the set of nucleotide reads that, according to the initial alignments, align to respective reference spans of the set of reference spans; and generate the haplotype set scores for the set of reference spans based on comparing the subsets of nucleotide reads and the candidate population haplotypes within the respective reference spans of the set of reference spans.
4. The system of claim 3, further comprising instructions that, when executed by the at least one processor, cause the system to: identify, as indicated within the population haplotype database, allele-variant differences between population haplotypes and a primary contiguous sequence at respective genomic regions of the reference genome; and determine one or more of the initial alignments of the set of nucleotide reads by rescoring candidate reference alignments of the set of nucleotide reads with the primary contiguous sequence according to the allele-variant differences.
5. The system of claim 3, further comprising instructions that, when executed by the at least one processor, cause the system to: identify one or more population haplotypes for each nucleotide read of the set of nucleotide reads based on comparing the set of nucleotide reads with haplotype variants within the respective genomic regions of the initial alignments; and generate an alignment data file comprising: the initial alignments of the set of nucleotide reads; alignment scores corresponding to the initial alignments; and the identified one or more population haplotypes for each nucleotide read of the set of nucleotide reads.
6. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: determine haplotype set likelihoods for the set of reference spans by comparing the set of nucleotide reads and pairs of the candidate population haplotypes within respective reference spans of the set of reference spans; and generate the haplotype set scores for the set of reference spans based on the haplotype set likelihoods.
7. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: identify at least one reference span of the set of reference spans corresponding to a haploid region of the reference genome; and generate, for the at least one reference span, haplotype set scores for individual candidate population haplotypes based on comparing the set of nucleotide reads and the candidate population haplotypes within the at least one reference span.
8. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to select, to include within the subset of population haplotypes of the personalized haplotype database, a respective subset of the candidate population haplotypes for each reference span of the set of reference spans based on the haplotype set scores.
9. The system of claim 8, further comprising instructions that, when executed by the at least one processor, cause the system to: generate haplotype set likelihoods for the set of reference spans by comparing the set of nucleotide reads and the candidate population haplotypes; generate, utilizing a hidden Markov model (HMM) algorithm, a set of haplotype set posterior probabilities for each reference span of the set of reference spans based on respective haplotype set likelihoods for adjacent reference spans of the set of reference spans; and generate, based on the set of haplotype set posterior probabilities for each reference span of the set of reference spans, respective sets of the haplotype set scores for the set of reference spans.
10. The system of claim 9, further comprising instructions that, when executed by the at least one processor, cause the system to: generate, utilizing the haplotype set likelihoods in a forward pass of the HMM algorithm across the set of reference spans, a set of forward probabilities for each reference span of the set of reference spans; generate, utilizing a backward pass of the HMM algorithm to update the set of forward probabilities for each reference span of the set of reference spans, a set of updated probabilities for each reference span of the set of reference spans; and generate, based on the set of updated probabilities for each reference span of the set of reference spans, the set of haplotype set posterior probabilities for each reference span of the set of reference spans.
11. The system of claim 9, further comprising instructions that, when executed by the at least one processor, cause the system to utilize a recombination parameter of the HMM algorithm to reduce variation in haplotype set posterior probabilities across the set of reference spans.
12. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: categorize, utilizing a Variational Bayesian model, each nucleotide read of the set of nucleotide reads within each respective reference span of the set of reference spans as inherited from a first parent or a second parent; generate, based on each categorized nucleotide read of the set of nucleotide reads, individual haplotype scores for population haplotypes selected from the population haplotype database; and generate the haplotype set scores for pairs of the population haplotypes selected from the population haplotype database by combining individual haplotype scores from nucleotide reads inherited from the first parent with individual haplotype scores from nucleotide reads inherited from the second parent.
13. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate, for a given reference span of the set of reference spans, the haplotype set scores for respective pairs of locally distinct population haplotypes within the given reference span, each locally distinct population haplotype comprising a unique set of one or more allele-variant differences relative to other population haplotypes within the given reference span.
14. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: identify, as indicated within the personalized haplotype database, allele-variant differences between population haplotypes of the subset of population haplotypes and a primary contiguous sequence at the respective genomic regions of the reference genome; and determine one or more of the one or more personalized alignments of the set of nucleotide reads by rescoring candidate reference alignments of the set of nucleotide reads with the primary contiguous sequence according to allele-variant differences.
15. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: generate alignment scores for initial alignments of the set of nucleotide reads with respective genomic regions of the reference genome based on comparing the set of nucleotide reads with population haplotypes from the population haplotype database; generate an initial alignment data file comprising the initial alignments and corresponding alignment scores; generate personalized alignment scores for the one or more personalized alignments based on comparing the set of nucleotide reads with the subset of population haplotypes within the personalized haplotype database; and generate a personalized alignment data file comprising the one or more personalized alignments and corresponding personalized alignment scores.
16. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine genotype calls for the genomic sample based on the one or more personalized alignments.
17. A computer-implemented method comprising: identifying, within a set of reference spans of a reference genome, a set of nucleotide reads from a genomic sample and candidate population haplotypes from a population haplotype database; generating, for the set of reference spans of the reference genome, haplotype set scores for sets of the candidate population haplotypes based on comparing the set of nucleotide reads and the candidate population haplotypes; generating, for the genomic sample and based on the haplotype set scores, a personalized haplotype database comprising a subset of population haplotypes from the population haplotype database within the set of reference spans; and determining, utilizing the personalized haplotype database, one or more personalized alignments of the set of nucleotide reads with respective genomic regions of the reference genome.
18. The computer-implemented method of claim 17, further comprising: identifying one or more distinct k-mers for each candidate population haplotype within a given reference span of the set of reference spans; and generating, for the given reference span, respective haplotype set scores of the haplotype set scores for respective candidate population haplotypes of the sets of the candidate population haplotypes within the given reference span based on comparing k-mers of the set of nucleotide reads with the one or more distinct k-mers of the respective candidate population haplotypes.
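By way of illustration only, the k-mer comparison of claim 18 might be sketched as follows. The sketch treats a k-mer as "distinct" when it occurs in exactly one candidate haplotype within the span, and it scores a haplotype by counting read k-mers that hit its distinct k-mers. Both choices, and the names `kmers`, `distinct_kmers`, and `kmer_scores`, are assumptions rather than the claimed scoring.

```python
from collections import Counter


def kmers(sequence, k):
    """Set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def distinct_kmers(haplotypes, k):
    """k-mers found in exactly one candidate haplotype of the span."""
    counts = Counter(km for seq in haplotypes.values() for km in kmers(seq, k))
    return {name: {km for km in kmers(seq, k) if counts[km] == 1}
            for name, seq in haplotypes.items()}


def kmer_scores(reads, haplotypes, k):
    """Count read k-mers that match each haplotype's distinct k-mers."""
    distinct = distinct_kmers(haplotypes, k)
    scores = dict.fromkeys(haplotypes, 0)
    for read in reads:
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            for name, unique in distinct.items():
                if km in unique:
                    scores[name] += 1
    return scores
```

Per-haplotype counts such as these could then be combined into scores for sets of candidate haplotypes; that combination is left out of the sketch.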
19. The computer-implemented method of claim 17, further comprising: determining initial alignments of the set of nucleotide reads with respective genomic regions of the reference genome; identifying subsets of nucleotide reads of the set of nucleotide reads that, according to the initial alignments, align to respective reference spans of the set of reference spans; and generating the haplotype set scores for the set of reference spans based on comparing the subsets of nucleotide reads and the candidate population haplotypes within the respective reference spans of the set of reference spans.
20. The computer-implemented method of claim 19, further comprising: identifying, as indicated within the population haplotype database, allele-variant differences between population haplotypes and a primary contiguous sequence at respective genomic regions of the reference genome; and determining one or more of the initial alignments of the set of nucleotide reads by rescoring candidate reference alignments of the set of nucleotide reads with the primary contiguous sequence according to the allele-variant differences.
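By way of illustration only, rescoring a candidate reference alignment according to allele-variant differences, as recited in claims 14 and 20, might be sketched as follows. The sketch handles only ungapped alignments and single-nucleotide differences, and the match and mismatch values, the `known_alts` layout, and the function name `rescore` are assumptions made for this sketch.

```python
def rescore(read, primary_sequence, start, known_alts, match=1, mismatch=-4):
    """Score an ungapped alignment, forgiving known alternate alleles.

    primary_sequence is the primary contiguous sequence of the reference.
    known_alts maps a 0-based reference position to the set of alternate
    bases recorded for that position in a haplotype database (hypothetical
    layout). A read base that equals a known alternate allele is scored as
    a match instead of a mismatch.
    """
    score = 0
    for offset, base in enumerate(read):
        position = start + offset
        if base == primary_sequence[position] or base in known_alts.get(position, ()):
            score += match
        else:
            score += mismatch
    return score
```

For example, with `primary_sequence = "ACGTACGTAC"` and the read `"GTTC"` placed at position 2, the single difference at position 4 is penalized when `known_alts` is empty and forgiven when `known_alts = {4: {"T"}}`.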
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202580003270.4A CN121359209A (en) | 2024-02-28 | 2025-02-26 | Personalized haplotype database for improved mapping and alignment of nucleotide reads and improved genotype detection |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463558754P | 2024-02-28 | 2024-02-28 | |
| US63/558,754 | 2024-02-28 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025184234A1 (en) | 2025-09-04 |
Family
ID=95022910
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2025/017424 Pending WO2025184234A1 (en) | 2024-02-28 | 2025-02-26 | A personalized haplotype database for improved mapping and alignment of nucleotide reads and improved genotype calling |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN121359209A (en) |
| WO (1) | WO2025184234A1 (en) |
2025
- 2025-02-26 CN CN202580003270.4A patent/CN121359209A/en active Pending
- 2025-02-26 WO PCT/US2025/017424 patent/WO2025184234A1/en active Pending
Patent Citations (35)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO1991006678A1 (en) | 1989-10-26 | 1991-05-16 | Sri International | Dna sequencing |
| US6172218B1 (en) | 1994-10-13 | 2001-01-09 | Lynx Therapeutics, Inc. | Oligonucleotide tags for sorting and identification |
| US6306597B1 (en) | 1995-04-17 | 2001-10-23 | Lynx Therapeutics, Inc. | DNA sequencing by parallel oligonucleotide extensions |
| US6210891B1 (en) | 1996-09-27 | 2001-04-03 | Pyrosequencing Ab | Method of sequencing DNA |
| US6258568B1 (en) | 1996-12-23 | 2001-07-10 | Pyrosequencing Ab | Method of sequencing DNA based on the detection of the release of pyrophosphate and enzymatic nucleotide degradation |
| US20050100900A1 (en) | 1997-04-01 | 2005-05-12 | Manteia Sa | Method of nucleic acid amplification |
| US6969488B2 (en) | 1998-05-22 | 2005-11-29 | Solexa, Inc. | System and apparatus for sequential processing of analytes |
| US6274320B1 (en) | 1999-09-16 | 2001-08-14 | Curagen Corporation | Method of sequencing a nucleic acid |
| US7001792B2 (en) | 2000-04-24 | 2006-02-21 | Eagle Research & Development, Llc | Ultra-fast nucleic acid sequencing device and a method for making and using the same |
| US7329492B2 (en) | 2000-07-07 | 2008-02-12 | Visigen Biotechnologies, Inc. | Methods for real-time single molecule sequence determination |
| US7211414B2 (en) | 2000-12-01 | 2007-05-01 | Visigen Biotechnologies, Inc. | Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity |
| US7057026B2 (en) | 2001-12-04 | 2006-06-06 | Solexa Limited | Labelled nucleotides |
| US7427673B2 (en) | 2001-12-04 | 2008-09-23 | Illumina Cambridge Limited | Labelled nucleotides |
| US20060188901A1 (en) | 2001-12-04 | 2006-08-24 | Solexa Limited | Labelled nucleotides |
| WO2004018497A2 (en) | 2002-08-23 | 2004-03-04 | Solexa Limited | Modified nucleotides for polynucleotide sequencing |
| US20070166705A1 (en) | 2002-08-23 | 2007-07-19 | John Milton | Modified nucleotides |
| US20060240439A1 (en) | 2003-09-11 | 2006-10-26 | Smith Geoffrey P | Modified polymerases for improved incorporation of nucleotide analogues |
| WO2005065814A1 (en) | 2004-01-07 | 2005-07-21 | Solexa Limited | Modified molecular arrays |
| US7315019B2 (en) | 2004-09-17 | 2008-01-01 | Pacific Biosciences Of California, Inc. | Arrays of optical confinements and uses thereof |
| WO2006064199A1 (en) | 2004-12-13 | 2006-06-22 | Solexa Limited | Improved method of nucleotide detection |
| US20060281109A1 (en) | 2005-05-10 | 2006-12-14 | Barr Ost Tobias W | Polymerases |
| WO2007010251A2 (en) | 2005-07-20 | 2007-01-25 | Solexa Limited | Preparation of templates for nucleic acid sequencing |
| US7405281B2 (en) | 2005-09-29 | 2008-07-29 | Pacific Biosciences Of California, Inc. | Fluorescent nucleotide analogs and uses therefor |
| WO2007123744A2 (en) | 2006-03-31 | 2007-11-01 | Solexa, Inc. | Systems and devices for sequence by synthesis analysis |
| US20100111768A1 (en) | 2006-03-31 | 2010-05-06 | Solexa, Inc. | Systems and devices for sequence by synthesis analysis |
| US20080108082A1 (en) | 2006-10-23 | 2008-05-08 | Pacific Biosciences Of California, Inc. | Polymerase enzymes and reagents for enhanced nucleic acid sequencing |
| US20090127589A1 (en) | 2006-12-14 | 2009-05-21 | Ion Torrent Systems Incorporated | Methods and apparatus for measuring analytes using large scale FET arrays |
| US20090026082A1 (en) | 2006-12-14 | 2009-01-29 | Ion Torrent Systems Incorporated | Methods and apparatus for measuring analytes using large scale FET arrays |
| US20100282617A1 (en) | 2006-12-14 | 2010-11-11 | Ion Torrent Systems Incorporated | Methods and apparatus for detecting molecular interactions using fet arrays |
| US20100137143A1 (en) | 2008-10-22 | 2010-06-03 | Ion Torrent Systems Incorporated | Methods and apparatus for measuring analytes |
| US20120270305A1 (en) | 2011-01-10 | 2012-10-25 | Illumina Inc. | Systems, methods, and apparatuses to image a sample for biological or chemical analysis |
| US20130079232A1 (en) | 2011-09-23 | 2013-03-28 | Illumina, Inc. | Methods and compositions for nucleic acid sequencing |
| US20130260372A1 (en) | 2012-04-03 | 2013-10-03 | Illumina, Inc. | Integrated optoelectronic read head and fluidic cartridge useful for nucleic acid sequencing |
| US20230095961A1 (en) * | 2021-09-21 | 2023-03-30 | Illumina, Inc. | Graph reference genome and base-calling approach using imputed haplotypes |
| US20230420082A1 (en) * | 2022-06-27 | 2023-12-28 | Illumina Software, Inc. | Generating and implementing a structural variation graph genome |
Non-Patent Citations (16)
| Title |
|---|
| COCKROFT, S. L.; CHU, J.; AMORIN, M.; GHADIRI, M. R.: "A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution.", J. AM. CHEM. SOC., vol. 130, 2008, pages 818 - 820, XP055097434, DOI: 10.1021/ja077082c |
| DEAMER, D. W.; AKESON, M.: "Nanopores and nucleic acids: prospects for ultrarapid sequencing.", TRENDS BIOTECHNOL, vol. 18, 2000, pages 147 - 151, XP004194002, DOI: 10.1016/S0167-7799(00)01426-8 |
| DEAMER, D.; BRANTON, D.: "Characterization of nucleic acids by nanopore analysis", ACC. CHEM. RES, vol. 35, 2002, pages 817 - 825, XP002226144, DOI: 10.1021/ar000138m |
| HEALY, K: "Nanopore-based single-molecule DNA analysis.", NANOMED, vol. 2, 2007, pages 459 - 481, XP009111262, DOI: 10.2217/17435889.2.4.459 |
| KORLACH, J ET AL.: "Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures", PROC. NATL. ACAD. SCI. USA, vol. 105, 2008, pages 1176 - 1181 |
| LEVENE, M. J ET AL.: "Zero-mode waveguides for single-molecule analysis at high concentrations.", SCIENCE, vol. 299, 2003, pages 682 - 686, XP002341055, DOI: 10.1126/science.1079700 |
| LI, J.; GERSHOW, M.; STEIN, D.; BRANDIN, E.; GOLOVCHENKO, J. A.: "DNA molecules and configurations in a solid-state nanopore microscope", NAT. MATER, vol. 2, 2003, pages 611 - 615, XP009039572, DOI: 10.1038/nmat965 |
| LUNDQUIST, P. M ET AL.: "Parallel confocal detection of single molecules in real time.", OPT. LETT, vol. 33, 2008, pages 1026 - 1028, XP001522593, DOI: 10.1364/OL.33.001026 |
| METZKER, GENOME RES, vol. 15, 2005, pages 1767 - 1776 |
| MICHAEL RUEHLE, ENHANCED MAPPING AND ALIGNMENT OF NUCLEOTIDE READS UTILIZING AN IMPROVED HAPLOTYPE DATA STRUCTURE WITH ALLELE-VARIANT DIFFERENCES |
| RONAGHI, M.; KARAMOHAMED, S.; PETTERSSON, B.; UHLEN, M.; NYREN, P.: "Real-time DNA sequencing using detection of pyrophosphate release.", ANALYTICAL BIOCHEMISTRY, vol. 242, no. 1, 1996, pages 84 - 9, XP002388725, DOI: 10.1006/abio.1996.0432 |
| RONAGHI, M.; UHLEN, M.; NYREN, P.: "A sequencing method based on real-time pyrophosphate.", SCIENCE, vol. 281, no. 5375, 1998, pages 363, XP002135869, DOI: 10.1126/science.281.5375.363 |
| RONAGHI, M: "Pyrosequencing sheds light on DNA sequencing.", GENOME RES, vol. 11, no. 1, 2001, pages 3 - 11, XP000980886, DOI: 10.1101/gr.11.1.3 |
| RUPAREL ET AL., PROC NATL ACAD SCI USA, vol. 102, 2005, pages 5932 - 7 |
| SIRÉN JOUNI ET AL: "Personalized Pangenome References", BIORXIV, 15 December 2023 (2023-12-15), XP093276922, Retrieved from the Internet <URL:https://pmc.ncbi.nlm.nih.gov/articles/PMC10760139/pdf/nihpp-2023.12.13.571553v2.pdf> DOI: 10.1101/2023.12.13.571553 * |
| SONI, G. V.; MELLER, A.: "Progress toward ultrafast DNA sequencing using solid-state nanopores.", CLIN. CHEM., vol. 53, 2007, pages 1996 - 2001, XP055076185, DOI: 10.1373/clinchem.2007.091231 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN121359209A (en) | 2026-01-16 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20240120027A1 (en) | Machine-learning model for refining structural variant calls | |
| CA3223739A1 (en) | Machine-learning model for recalibrating nucleotide-base calls | |
| US20240038327A1 (en) | Rapid single-cell multiomics processing using an executable file | |
| EP4457822B1 (en) | Machine learning model for recalibrating nucleotide base calls corresponding to target variants | |
| US20230095961A1 (en) | Graph reference genome and base-calling approach using imputed haplotypes | |
| US20230420082A1 (en) | Generating and implementing a structural variation graph genome | |
| US20260011405A1 (en) | Human leukocyte antigen (hla) genotyping | |
| WO2025184234A1 (en) | A personalized haplotype database for improved mapping and alignment of nucleotide reads and improved genotype calling | |
| US20250210141A1 (en) | Enhanced mapping and alignment of nucleotide reads utilizing an improved haplotype data structure with allele-variant differences | |
| US20230313271A1 (en) | Machine-learning models for detecting and adjusting values for nucleotide methylation levels | |
| US20240127906A1 (en) | Detecting and correcting methylation values from methylation sequencing assays | |
| US20240177802A1 (en) | Accurately predicting variants from methylation sequencing data | |
| US20230340571A1 (en) | Machine-learning models for selecting oligonucleotide probes for array technologies | |
| US20250384952A1 (en) | Tandem repeat genotyping | |
| US20230420080A1 (en) | Split-read alignment by intelligently identifying and scoring candidate split groups | |
| US20240371469A1 (en) | Machine learning model for recalibrating genotype calls from existing sequencing data files | |
| WO2025160089A1 (en) | Custom multigenome reference construction for improved sequencing analysis of genomic samples | |
| WO2024249973A2 (en) | Linking human genes to clinical phenotypes using graph neural networks | |
| WO2025250996A2 (en) | Call generation and recalibration models for implementing personalized diploid reference haplotypes in genotype calling | |
| WO2025090883A1 (en) | Detecting variants in nucleotide sequences based on haplotype diversity |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25712428; Country of ref document: EP; Kind code of ref document: A1 |