WO2025221988A1 - Systems and methods for somatic small variant calling - Google Patents

Systems and methods for somatic small variant calling

Info

Publication number
WO2025221988A1
WO2025221988A1 PCT/US2025/025147 US2025025147W WO2025221988A1 WO 2025221988 A1 WO2025221988 A1 WO 2025221988A1 US 2025025147 W US2025025147 W US 2025025147W WO 2025221988 A1 WO2025221988 A1 WO 2025221988A1
Authority
WO
WIPO (PCT)
Prior art keywords
variant
sequencing
variants
threshold
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2025/025147
Other languages
French (fr)
Inventor
Mahdi Golkaram
Badri Kothandaraman PADHUKASAHASRAM
Chen Zhao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Roche Sequencing Solutions Inc
Original Assignee
Roche Sequencing Solutions Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Roche Sequencing Solutions Inc
Publication of WO2025221988A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Definitions

  • the embodiments described herein relate to systems and methods for performing variant calling on sequencing data. More particularly, the embodiments described herein relate to calling single nucleotide variants, insertions, and deletions in the sequencing data. [0004] In accordance with a first aspect of the present disclosure, a method for somatic variant calling is provided.
  • the method includes: receiving a computer file comprising a plurality of consensus reads generated from a patient sample; aligning the plurality of consensus reads; determining a callable region from the aligned consensus reads; determining an active region from the callable region based on a comparison of the callable region to a reference sequence and identifying one or more differences in the active region compared to the reference sequence; generating an assembly graph from the reference sequence and the active region, wherein the assembly graph comprises a plurality of paths, wherein each path comprises a plurality of path segments, and wherein each path segment comprises a count of the consensus reads that support the path segment; determining a plurality of haplotype sequences from a plurality of paths of the assembly graph; determining a likelihood score for each haplotype sequence based on the count of consensus reads that support the path PATENT Client Reference No.: P39265-WO-1 segments included in each haplotype sequence; determining a plurality of candidate variants by comparing a weighted count for each variant to
  • the callable region is based on a minimum coverage requirement.
  • the assembly graph is pruned to remove at least one path in the plurality of paths. In other embodiments, the assembly graph is not pruned.
  • the first threshold is dynamically determined by fitting a sample specific regression model to a background distribution of weighted counts for the candidate variant.
  • the machine learning classification model includes one or more features selected from the group consisting of: nonduplex counts, duplex counts, weighted counts, distance of the candidate variant to a 5’ end of the consensus read, substitution type for single nucleotide variants, sequence context, strand bias, mapping quality, base quality, cluster size, and duplex fraction.
  • the machine learning classification model comprises a gradient boosted decision tree algorithm.
  • the gradient boosted decision tree algorithm is selected from the group consisting of: LightGBM and XGBoost.
  • a system is provided for generating sequencing data.
  • the system may include an assay device and/or a logic system.
  • the logic system may include a processor coupled to a memory storing instructions executable by the processor.
  • the processor, upon execution of the instructions, is configured to perform the method of one or more embodiments of the first aspect.
  • the system further includes a treatment device for determining and/or administering a treatment to the patient based on at least one filtered variant in the plurality of filtered variants.
  • the system further includes a reporting device for displaying information relating to at least one filtered variant in the plurality of filtered variants.
  • a non-transitory computer-readable medium is provided.
  • FIG. 1 is a flow chart of a method for a variant caller workflow, in accordance with some embodiments.
  • FIG. 2 is a graph illustrating how a threshold is set by fitting a sample specific regression model to the background distribution of weighted variant counts for all the positions with evidence of alternate alleles (i.e., candidate variant sites) within the sample, in accordance with some embodiments.
  • FIG. 3 illustrates the relationship between false positives and sensitivity of the variant caller, in accordance with some embodiments.
  • FIG. 4 illustrates a flow chart of a three-stage filtering process that can be used to improve variant caller sensitivity and/or accuracy, in accordance with some embodiments.
  • FIG. 5 illustrates a sequencing system, in accordance with some embodiments. [0021] FIG.
  • the present disclosure provides a number of techniques for performing variant calling, as part of a secondary analysis workflow of sequencing data produced by today’s next generation sequencing devices. More particularly, a variant calling algorithm is provided for identifying single nucleotide variants (SNV) and/or insertion/deletions (InDels) from sequencing data, such as nanopore-based sequencing data.
  • a variant calling algorithm is provided that generates an assembly graph using a reference sequence for one or more active regions of sequencing data (e.g., a plurality of reads generated by sequencing a sample).
  • Weights for each edge of the graph are then calculated based on the sequencing data.
  • Candidate variants are then identified by traversing the graph.
  • the candidate variants are then filtered according to a two stage process. In the first stage, candidate variants are removed based on a comparison of the candidate variant weight with a threshold.
  • a machine learning model is used to score the candidate variants that passed the first stage according to a number of features. The score is then compared against a threshold to determine a final list of called variants for the active region.
  • an InDel caller algorithm is provided.
  • the InDel caller algorithm can include a hotspot module and an adaptive module.
  • the InDel caller algorithm uses a number of heuristics to identify InDel variants in hotspot regions.
  • the adaptive module uses a two stage approach to identify variants with low AF that might not be strongly supported in a plasma sample.
  • a core linear regression model is used to filter out a number of variants based on a comparison of each candidate variant’s score with a threshold.
  • a machine learning classifier such as a gradient boosting machine (GBM) model is used to score each variant that passes the first stage and then the classifier score is compared to a threshold to determine the final list of called variants.
  • the called variants can be further filtered by a blocklist filter.
  • an SNV caller algorithm is provided that is similar in many aspects to the InDel caller algorithm described above.
  • the SNV caller algorithm can include a hotspot module and an adaptive module.
  • the hotspot module is similar to the hotspot module of the InDel caller algorithm, with the exception that one or more additional heuristics may be included in addition to or in lieu of the heuristics used in the InDel caller algorithm.
  • the adaptive module of the SNV caller algorithm also uses a two stage approach to identify variants.
  • a core linear regression model is used to filter out a number of variants based on a comparison of each candidate variant’s score with a threshold.
  • an extended linear model that incorporates a number of additional weighted features is used to filter out variants that passed the first stage based on a comparison of each candidate variant’s score with a threshold.
  • NGS Next Generation Sequencing
  • Patent Application Publication Nos.2013/0244340, 2013/0264207, 2014/0134616, 2015/0119259, and 2015/0337366) Sanger sequencing, capillary array sequencing, thermal cycle sequencing (see, e.g., Sears et al., Biotechniques, 13:626-633 (1992)), solid-phase sequencing (see, e.g., Zimmerman et al., Methods Mol.
  • sequencing with mass spectrometry such as matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF/MS) (see, e.g., Fu et al., Nature Biotech., 16:381-384 (1998)), sequencing by hybridization (see, e.g., Drmanac et al., Nature Biotech., 16:54-58 (1998)), and NGS methods, including but not limited to sequencing by synthesis (see, e.g., HiSeq™, MiSeq™, or Genome Analyzer, each available from Illumina, Inc.
  • sequencing by ligation see, e.g., SOLiD™, available from Thermo Fisher Scientific, Inc. (Waltham, MA)
  • ion semiconductor sequencing see, e.g., Ion Torrent™, available from Thermo Fisher Scientific, Inc. (Waltham, MA)
  • SMRT® sequencing available from Pacific Biosciences of California, Inc. (Menlo Park, CA).
  • Commercially available sequencing technologies include: sequencing-by-hybridization platforms from Affymetrix, Inc. (Sunnyvale, CA), sequencing-by-synthesis platforms from Illumina, Inc. (San Diego, Calif.), and sequencing-by-ligation platforms from Thermo Fisher Scientific, Inc.
  • Bioinformatics Workflow Overview [0030] The output of an NGS sequencer is generally processed by a bioinformatics pipeline that processes the raw signal from the NGS sequencer and translates the raw signal into base calls, often referred to as raw reads, which are typically stored in a FASTQ file that combines the raw reads with associated quality data. This portion of the bioinformatics pipeline is often referred to as primary analysis.
  • the next section of the bioinformatics pipeline is called secondary analysis, and it takes the raw reads generated by the primary analysis, and performs several tasks, including alignment and variant calling.
  • Tertiary analysis is the final portion of the bioinformatics pipeline and uses the variant calling information to generate medical insights that health care practitioners can use to improve treatments for their patients.
  • Secondary Analysis
  • New sequencing technologies, such as nanopore-based sequencers, generate sequencing data with different characteristics than the sequencing data generated by the current market-leading sequencers, such as Illumina sequencers. For example, these differences can include differences in raw read accuracy and differences in the error profiles.
  • FIG. 1 illustrates a secondary analysis workflow for a variant calling algorithm, in accordance with some embodiments.
  • the variant calling algorithm leverages a portion of the approach used by Mutect2, a haplotype variant caller with somatic-specific genotyping and filtering that is available as part of the Genome Analysis Toolkit (GATK) maintained by the Broad Institute, with several changes or additions to the said algorithm.
  • the variant calling algorithm starts by receiving consensus reads 100 from, for example, a nanopore sequencer or another type of sequencer. A typical format for receiving these reads is a file in the FASTQ format. The consensus reads stored in the file can be accessed and processed by the pipeline to perform an alignment of the consensus reads against a reference sequence. A callable region 104 can then be identified as a region with a sufficient depth of read coverage to allow for variant calling.
  • the callable region 104 can have a minimum depth of read coverage of at least 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50.
  • the required depth of coverage can vary depending on a variety of factors, such as the types of bases in the region (e.g., a GC-rich region or another type of motif that may have a higher error rate during sequencing) and the quality scores of the bases in the region.
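  • The minimum-coverage rule for the callable region can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `callable_regions`, the per-base depth list, and the default of 20 are assumptions chosen for the example (20 is one of the minimum depths listed above).

```python
from typing import List, Tuple


def callable_regions(depth: List[int], min_depth: int = 20) -> List[Tuple[int, int]]:
    """Return half-open [start, end) intervals whose per-base depth is >= min_depth.

    `depth` is a hypothetical per-base read depth vector (0-based positions).
    """
    regions: List[Tuple[int, int]] = []
    start = None
    for pos, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = pos                      # a callable stretch begins here
        elif d < min_depth and start is not None:
            regions.append((start, pos))     # the stretch ended at the previous base
            start = None
    if start is not None:                    # stretch runs to the end of the vector
        regions.append((start, len(depth)))
    return regions


example_depth = [3, 8, 25, 30, 41, 12, 22, 27]
example_regions = callable_regions(example_depth, min_depth=20)
```

    A region-specific minimum (for example, a higher value in GC-rich regions) could be modeled by passing a different `min_depth` per region.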
  • mismatches or gaps between the base calls of the reads and the reference sequence are used to identify active regions 108 that contain potential SNVs (single nucleotide variants) and/or InDels (insertions and deletions) 106.
  • an assembly graph 110, such as a de Bruijn graph for example, is generated by starting with the reference sequence, which is decomposed into a series of kmers (short sequences k bases long), with each successive kmer overlapping the previous kmer by k-1 bases.
  • the kmers can be represented as nodes that can be joined by lines called edges.
  • the edges can be weighted to keep track of the number of kmers found in the sample, with the weights initially set to zero. This forms a reference graph.
  • each sequence read can similarly be decomposed into a series of kmers, and can be matched to the reference graph. Each time two successive kmers are matched to the graph, the weight of the edge joining the two kmer nodes is incremented by one.
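  • The graph construction described above can be sketched as follows. This is a simplified illustration under stated assumptions: the names `kmers` and `build_assembly_graph` are hypothetical, the graph is stored as a dictionary of edge weights, and kmer pairs from reads that are absent from the reference graph are assumed to add new edges (which is how alternate paths arise).

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (source kmer, target kmer)


def kmers(seq: str, k: int) -> List[str]:
    """Decompose a sequence into kmers; successive kmers overlap by k-1 bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_assembly_graph(reference: str, reads: Iterable[str], k: int) -> Dict[Edge, int]:
    """Build edge weights: reference edges start at 0, reads increment by one."""
    weights: Dict[Edge, int] = {}
    ref_kmers = kmers(reference, k)
    for a, b in zip(ref_kmers, ref_kmers[1:]):
        weights[(a, b)] = 0                        # reference graph, weights at zero
    for read in reads:
        read_kmers = kmers(read, k)
        for a, b in zip(read_kmers, read_kmers[1:]):
            # Assumption: unmatched kmer pairs create a new (alternate) edge.
            weights[(a, b)] = weights.get((a, b), 0) + 1
    return weights


# Three reads match the reference; one read carries a T>A change at the fifth base.
example_graph = build_assembly_graph(
    "AACGTTGCA", ["AACGTTGCA"] * 3 + ["AACGATGCA"], k=3
)
```

    In this toy example the alternate path leaves the reference at kmer ACG and rejoins it at kmer TGC.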
  • the assembly graph can then be optionally pruned by removing sections of the graph whose edge weight is less than a threshold value, such as 2.
  • a weight of 2, for example, means that 2 reads in the sample support that segment of the graph.
  • the threshold can be increased, for more aggressive pruning, which will result in faster processing and higher specificity, but with lower sensitivity.
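  • A minimal sketch of the optional pruning step is shown below. It is a simplification: it drops individual low-weight edges rather than whole graph sections, and it assumes that reference edges are always kept so that the reference path survives. The name `prune_graph` and its parameters are hypothetical.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def prune_graph(weights: Dict[Edge, int], reference_edges: Set[Edge],
                min_weight: int = 2) -> Dict[Edge, int]:
    """Keep edges supported by at least `min_weight` reads.

    Assumption: edges on the reference path are never pruned.
    """
    return {
        edge: w
        for edge, w in weights.items()
        if w >= min_weight or edge in reference_edges
    }


toy_weights = {
    ("AAC", "ACG"): 4,
    ("ACG", "CGT"): 3,
    ("ACG", "CGA"): 1,   # alternate edge supported by a single read
    ("CGA", "GAT"): 1,
}
toy_reference_edges = {("AAC", "ACG"), ("ACG", "CGT")}

pruned = prune_graph(toy_weights, toy_reference_edges, min_weight=2)
unpruned = prune_graph(toy_weights, toy_reference_edges, min_weight=0)
```

    With `min_weight=2` the singly supported alternate edges are removed, which illustrates the loss of sensitivity for low-frequency variants; `min_weight=0` leaves the graph unchanged.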
  • haplotype sequences can be generated from the graph by traversing all paths in the graph, with a likelihood score calculated for each haplotype sequence as the product of the transition probabilities of the path edges.
  • the probability of an edge can be calculated as the weight of the edge divided by the sum of the weights of all the edges that share the same source node.
  • the haplotypes with the highest likelihood scores can be used for candidate variant detection 112.
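  • The haplotype scoring described above can be sketched as follows: each edge probability is its weight divided by the total weight leaving the same source node, and a haplotype's score is the product of the edge probabilities along its path. The function name `enumerate_haplotypes`, the dictionary graph representation, and the skipping of zero-weight edges and already-visited nodes are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def enumerate_haplotypes(weights: Dict[Edge, int], source: str,
                         sink: str) -> List[Tuple[str, float]]:
    """Return (haplotype sequence, likelihood score) pairs, best first."""
    out_edges = defaultdict(list)
    out_total = defaultdict(int)
    for (a, b), w in weights.items():
        if w > 0:                                  # ignore unsupported edges
            out_edges[a].append((b, w))
            out_total[a] += w

    results: List[Tuple[str, float]] = []

    def walk(node: str, sequence: str, score: float, visited: frozenset) -> None:
        if node == sink:
            results.append((sequence, score))
            return
        for nxt, w in out_edges[node]:
            if nxt in visited:                     # avoid cycles
                continue
            prob = w / out_total[node]             # transition probability
            walk(nxt, sequence + nxt[-1], score * prob, visited | {nxt})

    walk(source, source, 1.0, frozenset({source}))
    return sorted(results, key=lambda r: r[1], reverse=True)


example_weights = {
    ("AAC", "ACG"): 4, ("ACG", "CGT"): 3, ("CGT", "GTT"): 3,
    ("GTT", "TTG"): 3, ("TTG", "TGC"): 3, ("TGC", "GCA"): 4,
    ("ACG", "CGA"): 1, ("CGA", "GAT"): 1, ("GAT", "ATG"): 1,
    ("ATG", "TGC"): 1,
}
haplotypes = enumerate_haplotypes(example_weights, "AAC", "GCA")
```

    In this toy graph the only branch is at kmer ACG, where 3 of 4 reads follow the reference edge, so the reference haplotype scores 0.75 and the alternate haplotype scores 0.25.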
  • These haplotypes are then aligned to the reference sequence in order to identify candidate variants.
  • the alignment can be done using a Smith-Waterman alignment (SWA), for example, and can be stored in a SAM or BAM file.
  • the candidate variants can be determined by comparing the aligned sequence to the reference sequence and can be stored in a .vcf file.
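  • For illustration, the scoring recurrence of a Smith-Waterman local alignment can be sketched as follows. This sketch computes only the best local alignment score, uses a linear gap penalty, and uses scoring values (match 2, mismatch -1, gap -2) that are assumptions for the example; production aligners typically use affine gap penalties and also perform a traceback to recover the alignment itself.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between sequences a and b (linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores are floored at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Haplotype with a single mismatch against the reference:
# 8 matches (2 each) and 1 mismatch (-1) give a score of 15.
snv_score = smith_waterman_score("AACGTTGCA", "AACGATGCA")
```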
  • the steps described above can be performed, for example, using optimized settings for GATK, an open source variant caller.
  • the optimized settings include: initial-tumor-lod -5, tumor-lod-to-emit -5, pruning-lod-threshold -9, max-reads-per-alignment-start 0, active-probability-threshold 0.00005, and min-pruning 0. These parameters are changed from the defaults to more sensitively detect candidate variants at 0.1% AF from realignment.
  • the optimal parameters can be identified by maximizing sensitivity based on a grid search.
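  • As a sketch only, the settings listed above could be assembled into a GATK Mutect2 command line as shown below. The file paths are hypothetical placeholders, and the mapping of the setting names to double-dash command-line arguments is an assumption that should be checked against the documentation of the installed GATK version, including how its argument parser expects negative values to be passed.

```python
from typing import List


def mutect2_command(reference: str, bam: str, out_vcf: str) -> List[str]:
    """Build (but do not run) a Mutect2 command using the settings from the text."""
    return [
        "gatk", "Mutect2",
        "-R", reference,          # hypothetical reference FASTA path
        "-I", bam,                # hypothetical aligned consensus reads
        "-O", out_vcf,            # hypothetical output VCF path
        "--initial-tumor-lod", "-5",
        "--tumor-lod-to-emit", "-5",
        "--pruning-lod-threshold", "-9",
        "--max-reads-per-alignment-start", "0",
        "--active-probability-threshold", "0.00005",
        "--min-pruning", "0",
    ]


command = mutect2_command("reference.fasta", "consensus.bam", "candidates.vcf.gz")
```

    The list form can be handed to `subprocess.run(command, check=True)` once the paths and argument spellings have been verified.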
  • the candidate variants can be filtered using a two-stage process, which is designed to reduce false positives and improve precision. As shown in FIG.1, in the first stage 114, the weighted counts of variant molecules are compared against a threshold that is dynamically learned from the sample.
  • This threshold is computed by fitting a sample specific regression model to the background distribution of weighted variant counts for all the positions with evidence of alternate alleles (i.e., candidate variant sites) within the sample.
  • smooth splines can optionally be used instead of a regression model to fit more general curves. All the variants with weighted counts above the sample threshold are retained for the second stage. The sample specific regression is illustrated in FIG. 2.
  • the vertical line in FIG.2, which shows the threshold, can be set to achieve a particular desired or predetermined value of the cumulative counts for the y-axis, such as 1/100, 1/200, 1/300, 1/400, 1/500, 1/600, 1/700, 1/800, 1/900, or 1/1000 cumulative counts per base pair of the panel.
  • This can be learned from a training dataset, where the cumulative counts can be matched to or set based on a known number of variants, depending on a desired sensitivity and precision profile.
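  • One way to picture the sample specific threshold is sketched below. The form of the regression is not specified above, so this example assumes a log-linear model: the log10 of the cumulative number of candidate sites at or above each weighted count, per base pair of the panel, is fitted with an ordinary least-squares line, and the threshold is the weighted count at which the fitted line reaches the target cumulative count. All names, the model form, and the toy data are assumptions.

```python
import math
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sample_threshold(weighted_counts: List[float], panel_bp: int,
                     target_per_bp: float) -> float:
    """Weighted count at which the fitted background reaches target_per_bp."""
    xs, ys = [], []
    for value in sorted(set(weighted_counts)):
        n_at_or_above = sum(1 for w in weighted_counts if w >= value)
        xs.append(value)
        ys.append(math.log10(n_at_or_above / panel_bp))
    intercept, slope = fit_line(xs, ys)
    if slope >= 0:
        raise ValueError("background does not decay with weighted count")
    return (math.log10(target_per_bp) - intercept) / slope


# Toy background: 1000, 100, 10, and 1 sites at or above counts 1, 2, 3, and 4.
toy_counts = [1.0] * 900 + [2.0] * 90 + [3.0] * 9 + [4.0]
toy_threshold = sample_threshold(toy_counts, panel_bp=100_000, target_per_bp=1e-4)
```

    For the toy data the fitted line is log10(rate) = -1 - x, so a target of 1e-4 cumulative counts per base pair gives a threshold close to 3.0.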
  • a machine learning classification model 116 (e.g., LightGBM or XGBoost)
  • This model uses the counts of molecules supporting variants as well as additional features such as distance of the variant with respect to the 5’ end of the fragment, substitution type (for SNVs), sequence context, cluster size, strand bias, simplex/duplex support, ref/alt, base quality score (e.g., base quality score used in GATK), MAPQ score, and strand information to produce a probability score for each variant.
  • Sequence context is the base upstream and downstream of the variant position. Therefore, if the variant occurs at position 1000 in chromosome 1, the sequence context is the combination of bases at position 999 and position 1001.
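  • The sequence context feature can be sketched as follows, assuming a 1-based variant position and a reference sequence held in memory as a string; the function name `sequence_context` is hypothetical.

```python
def sequence_context(reference: str, position: int) -> str:
    """Return the bases immediately upstream and downstream of a 1-based position."""
    if position < 2 or position > len(reference) - 1:
        raise ValueError("position needs a neighbor on both sides")
    upstream = reference[position - 2]    # 1-based position - 1
    downstream = reference[position]      # 1-based position + 1
    return upstream + downstream


# Variant at 1-based position 3 (the G) of ACGTA: neighbors are C and T.
context = sequence_context("ACGTA", 3)
```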
  • Strand bias is a measure of how unbalanced the distribution of plus and minus strands is in the data.
  • When the plus and minus strands are evenly represented, strand bias is 0. If the count for either the plus or the minus strand is much higher than the other, strand bias takes a higher value. The maximum value it can take is 0.25.
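  • The exact strand bias formula is not given above. One formula that is consistent with the two stated properties (0 for balanced strands, a maximum of 0.25) is the squared deviation of the plus-strand fraction from 0.5, sketched below. This choice is an assumption made for illustration and may differ from the metric actually used.

```python
def strand_bias(plus: int, minus: int) -> float:
    """Assumed metric: (plus fraction - 0.5) squared, which ranges from 0 to 0.25."""
    total = plus + minus
    if total == 0:
        return 0.0                       # no supporting reads, no measurable bias
    return (plus / total - 0.5) ** 2


balanced = strand_bias(10, 10)    # evenly split support
one_sided = strand_bias(20, 0)    # all support from the plus strand
skewed = strand_bias(15, 5)       # 75% of support from the plus strand
```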
  • Table 1 below lists all the features extracted from consensus reads that can be used by the machine learning classifier model 116.
  • the classifier can use any combination of the features listed herein. For example, the classifier can use all the features in some embodiments or use a subset of the features in other embodiments.
  • Table 1
        Feature            Type      Description
        Nonduplex counts   Numeric   Number of variant molecules with only support from + or - strands
        Duplex Counts      Numeric   Number of variant molecules with support from both strands (+/-)
        Weighted Counts    Numeric   Linear weighted combination of duplex and nonduplex counts
        Distance           Numeric   Median distance of variant from nearest fragment end
        Duplex fraction    Numeric   Fraction of variant molecules that are duplex
        Strand Bias        Numeric   Metric that measures the imbalance of support from + or - strands
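  • The count-based features in Table 1 can be sketched as follows. The weights of the linear combination are not given above, so the defaults used here (1.0 for duplex, 0.5 for nonduplex) are purely hypothetical, as are the names `CountFeatures` and `count_features`.

```python
from dataclasses import dataclass


@dataclass
class CountFeatures:
    duplex_counts: int
    nonduplex_counts: int
    weighted_counts: float
    duplex_fraction: float


def count_features(duplex: int, nonduplex: int, w_duplex: float = 1.0,
                   w_nonduplex: float = 0.5) -> CountFeatures:
    """Derive weighted counts and duplex fraction from molecule counts.

    The weights are hypothetical placeholders, not values from the source.
    """
    total = duplex + nonduplex
    return CountFeatures(
        duplex_counts=duplex,
        nonduplex_counts=nonduplex,
        weighted_counts=w_duplex * duplex + w_nonduplex * nonduplex,
        duplex_fraction=duplex / total if total else 0.0,
    )


features = count_features(duplex=4, nonduplex=2)
```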
  • the machine learning classification threshold can be determined, for example, based on a training set with known somatic variant positive samples and negative/healthy samples which are not expected to contain somatic variants.
  • the threshold is set or selected to maximize the F1 score for variants in the data.
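  • Selecting a classification threshold that maximizes the F1 score can be sketched as follows. The example assumes a labeled training set of classifier scores, treats a variant as called when its score is strictly greater than the threshold, and evaluates only thresholds equal to 0.0 or to an observed score; the function names and the toy data are hypothetical.

```python
from typing import List, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores: List[float], labels: List[bool]) -> Tuple[float, float]:
    """Return (threshold, F1) for the candidate threshold with the highest F1."""
    best_t, best_f1 = 0.0, -1.0
    for t in [0.0] + sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s <= t and y)
        f1 = f1_score(tp, fp, fn)
        if f1 > best_f1:                 # ties keep the lower threshold
            best_t, best_f1 = t, f1
    return best_t, best_f1


toy_scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.2]
toy_labels = [True, True, False, True, False, False]
chosen_threshold, chosen_f1 = best_threshold(toy_scores, toy_labels)
```

    For the toy data, a threshold of 0.4 calls all three true variants and one false positive, which gives the highest F1 of 6/7.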
  • the F1 score is a metric that combines precision and recall. As illustrated in FIG. 4, an optional third filtering step can be used to remove systemic noise that can be introduced during the sequencing process, such as errors introduced during sample preparation, for example.
  • Starting with the candidate variants 400 and assembled BAM 401 that are the output of a variant caller (corresponding to step 114 in FIG. 1), the first filtering stage 402 passes candidate variants with weighted counts greater than a sample specific threshold, as described above with respect to step 114 in FIG. 1.
  • If the candidate variant is below the threshold, it is filtered out as background noise 404.
  • If the candidate variant has a weighted count that is greater than the first stage 402 threshold, it is passed to the second stage 406, the machine learning classifier filtering step (as described above with respect to step 116 in FIG. 1).
  • If the variant has an ML score that is less than the ML threshold, then it is filtered out 408 for one or more reasons, such as having low quality, low MAPQ, low base quality, strand bias, low depth, etc.
  • If the variant has an ML score that is greater than the ML threshold, then it is passed to an optional third stage 410 blocklist filter.
  • the blocklist filter can be created from normal samples starting with the VCF files from a variant caller such as GATK/Mutect2. Any variant that appears in multiple normal samples, which are expected to not have any variants, with a count greater than or equal to 3 can be added to the blocklist. [0051] If the variant passed to the third stage 410 is on the blocklist, it is filtered out as systemic noise 412. If the variant passed to the third stage 410 is not on the blocklist, it is passed as a called variant 414. [0052] In some embodiments, the third stage 410 can include an additional rescue step for variants on the blocklist.
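  • The three-stage filter of FIG. 4 and the blocklist construction can be sketched as follows. Several details are assumptions: the count of 3 is interpreted as the number of normal samples in which a variant appears, a score or weighted count exactly equal to its threshold is treated as filtered, the rescue step is omitted, and all names and toy values are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Set, Tuple

VariantKey = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


@dataclass
class Candidate:
    key: VariantKey
    weighted_count: float
    ml_score: float


def build_blocklist(normal_sample_calls: Iterable[Set[VariantKey]],
                    min_samples: int = 3) -> Set[VariantKey]:
    """Variants seen in at least `min_samples` normal samples (assumed reading)."""
    seen: Counter = Counter()
    for calls in normal_sample_calls:
        seen.update(calls)               # each sample counts a variant once
    return {key for key, n in seen.items() if n >= min_samples}


def classify(c: Candidate, count_threshold: float, ml_threshold: float,
             blocklist: Set[VariantKey]) -> str:
    """Apply stage 1 (weighted count), stage 2 (ML score), stage 3 (blocklist)."""
    if c.weighted_count <= count_threshold:
        return "background_noise"        # 404
    if c.ml_score <= ml_threshold:
        return "ml_filtered"             # 408
    if c.key in blocklist:
        return "systemic_noise"          # 412
    return "called"                      # 414


recurrent = ("chr1", 1000, "C", "T")
normals = [{recurrent}, {recurrent}, {recurrent, ("chr2", 50, "G", "A")}]
blocklist = build_blocklist(normals)

results = [
    classify(Candidate(("chr7", 42, "A", "G"), 1.0, 0.99), 2.5, 0.5, blocklist),
    classify(Candidate(("chr7", 43, "A", "G"), 6.0, 0.10), 2.5, 0.5, blocklist),
    classify(Candidate(recurrent, 6.0, 0.90), 2.5, 0.5, blocklist),
    classify(Candidate(("chr7", 44, "T", "C"), 6.0, 0.90), 2.5, 0.5, blocklist),
]
```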
  • FIG. 5 illustrates a sequencing system 500 according to an embodiment of the present disclosure.
  • the system as shown includes a sample 505, such as Xpandomers within an assay device 510, where an assay 508 can be performed on sample 505.
  • sample 505 can be contacted with reagents of assay 508 to provide a signal (e.g., an intensity signal) of a physical characteristic 515 (e.g., sequence information of a cell-free nucleic acid molecule).
  • Assay 508 may include sequencing by expansion with an assay device 510, such as a nanopore sequencing device as discussed above.
  • Physical characteristic 515 (e.g., a fluorescence intensity, a voltage, or a current)
  • Detector 520 can take a measurement at intervals (e.g., periodic intervals) to obtain data points that make up a data signal.
  • an analog-to-digital converter converts an analog signal from the detector into digital form at a plurality of times.
  • Assay device 510 and detector 520 can form an assay system, e.g., a sequencing system 500 that performs sequencing according to embodiments described herein.
  • a data signal 525 is sent from detector 520 to logic system 530.
  • data signal 525 can be used to determine sequences and/or locations in a reference genome of nucleic acid molecules (e.g., DNA and/or RNA).
  • Data signal 525 can include various measurements made at a same time, e.g., different colors of fluorescent dyes or different electrical signals for different molecules of sample 505, and thus data signal 525 can correspond to multiple signals.
  • Data signal 525 may be stored in a local memory 535, an external memory 540, or a storage device 545.
  • the sequencing system 500 can be comprised of multiple assay devices 510 and detectors 520.
  • Logic system 530 may be, or may include, a computer system, ASIC, processor, graphics processing unit (GPU), etc. It may also include or be coupled with a display (e.g., monitor, LED display, etc.) and a user input device (e.g., mouse, keyboard, buttons, etc.).
  • Logic system 530 and the other components may be part of a stand-alone or network connected computer system, or they may be directly attached to or incorporated in a device (e.g., a sequencing device) that includes detector 520 and/or assay device 510.
  • Logic system 530 may also include software that executes in a processor 550.
  • Logic system 530 may include a computer readable medium storing instructions for controlling sequencing system 500 to perform any of the methods described herein.
  • logic system 530 can provide commands to a system that includes assay device 510 such that sequencing or other physical operations are performed. Such physical operations can be performed in a particular order, e.g., with reagents being added and removed in a particular order.
  • Sequencing system 500 may also include a treatment device 560, which can provide a treatment to the subject.
  • Treatment device 560 can determine a treatment and/or be used to perform a treatment. Examples of such treatment can include surgery, radiation therapy, chemotherapy, immunotherapy, targeted therapy, hormone therapy, and stem cell transplant.
  • Sequencing system 500 may also include a reporting device 555, which can present results of any of the methods described herein, e.g., as determined using the sequencing system 500.
  • Reporting device 555 can be in communication with a reporting module within logic system 530 that can aggregate, format, and send a report to reporting device 555.
  • the reporting module can present information determined using any of the methods described herein.
  • the information can be presented by reporting device 555 in any format that can be recognized and interpreted by a user of the sequencing system 500.
  • the information can be presented by reporting device 555 in a displayed, printed, or transmitted format, or any combination thereof.
  • Any of the computer systems mentioned herein may utilize any suitable number of subsystems. Examples of such subsystems are shown in FIG.6 in computer system 600.
  • a computer system 600 includes a single computer apparatus, where the subsystems can be the components of the computer apparatus.
  • a computer system 600 can include multiple computer apparatuses, each being a subsystem, with internal components.
  • a computer system can include desktop and laptop computers, tablets, mobile phones, and other mobile devices.
  • The subsystems shown in FIG. 6 are interconnected via a system bus 675. Additional subsystems such as a printer 674, keyboard 678, storage device(s) 679, 682, monitor 676 (e.g., a display screen, such as an LED), which is coupled to display adapter 682, and others are shown. Peripherals and input/output (I/O) devices, which couple to I/O controller 671, can be connected to the computer system by any number of means known in the art such as input/output (I/O) port 677 (e.g., USB, FireWire®).
  • I/O port 677 or external interface 681 can be used to connect computer system 600 to a wide area network such as the Internet, a mouse input device, or a scanner.
  • the interconnection via system bus 675 allows the central processor 673 to communicate with each subsystem and to control the execution of a plurality of instructions from system memory 672 or the storage device(s) 679 (e.g., a fixed disk, such as a hard drive, or optical disk), as well as the exchange of information between subsystems.
  • the system memory 672 and/or the storage device(s) 679 may embody a computer readable medium.
  • a computer system can include a plurality of the same components or subsystems, e.g., connected together by external interface 681, by an internal interface, or via removable storage devices that can be connected and removed from one component to another component.
  • computer systems, subsystems, or apparatuses can communicate over a network.
  • one computer can be considered a client and another computer a server, where each can be part of a same computer system.
  • a client and a server can each include multiple systems, subsystems, or components.
  • methods may involve various numbers of clients and/or servers, including at least 10, 20, 50, 100, 200, 500, 1,000, or 10,000 devices.
  • Methods can include various numbers of communication messages between devices, including at least 100, 200, 500, 1,000, 10,000, 50,000, 100,000, 500,000, or one million communication messages.
  • Such communications can involve at least 1 MB, 10 MB, 100 MB, 1 GB, 10 GB, or 100 GB of data.
  • aspects of embodiments can be implemented in the form of control logic using hardware circuitry (e.g., an application specific integrated circuit or field programmable gate array) and/or using computer software stored in a memory with a generally programmable processor in a modular or integrated manner, and thus a processor can include memory storing software instructions that configure hardware circuitry, as well as an FPGA with configuration instructions or an ASIC.
  • a processor can include a single-core processor, multi-core processor on a same integrated chip, or multiple processing units on a single circuit board or networked, as well as dedicated hardware.
  • other ways and/or methods to implement embodiments of the present disclosure using hardware and a combination of hardware and software.
  • Any of the software components or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C, C++, C#, Objective-C, Swift, or scripting language such as Perl or Python using, for example, conventional or object-oriented techniques.
  • the software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission.
  • a suitable non-transitory computer readable medium can include random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard-drive or a floppy disk, or an optical medium such as a compact disk (CD) or DVD (digital versatile disk) or Blu-ray disk, flash memory, and the like.
  • the computer readable medium may be any combination of such devices.
  • the order of operations may be re-arranged.
  • a process can be terminated when its operations are completed but could have additional steps not included in a figure.
  • a process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.
  • Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet.
  • a computer readable medium may be created using a data signal encoded with such programs.
  • Computer readable media encoded with the program code may be packaged with a compatible device (e.g., as firmware) or provided separately from other devices (e.g., via Internet download). Any such computer readable medium may reside on or within a single computer product (e.g., a hard drive, a CD, or an entire computer system), and may be present on or within different computer products within a system or network.
  • a computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.
  • Any of the methods described herein may be totally or partially performed with a computer system including one or more processors, which can be configured to perform the steps. Any operations performed with a processor may be performed in real-time.
  • the term “real-time” may refer to computing operations or processes that are completed within a certain time constraint. As examples, a time constraint may be 30 seconds, 1 minute, 10 minutes, 30 minutes, 1 hour, 4 hours, 1 day, or 7 days.
  • embodiments can be directed to computer systems configured to perform the steps of any of the methods described herein, potentially with different components performing a respective step or a respective group of steps.
  • steps of methods herein can be performed at a same time or at different times or in a different order. Additionally, portions of these steps may be used with portions of other steps from other methods. Also, all or portions of a step may be optional. Additionally, any of the steps of any of the methods can be performed with modules, units, circuits, or other means of a system for performing these steps.
  • the term “and/or” includes any and all combinations of one or more of the associated listed items and may be abbreviated as “/”.
  • Spatially relative terms such as “under”, “below”, “lower”, “over”, “upper” and the like, may be used herein for ease of description to describe one element or feature’s relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if a device in the figures is inverted, elements described as “under” or “beneath” other elements or features would then be oriented “over” the other elements or features.
  • the exemplary term “under” can encompass both an orientation of over and under.
  • the device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly.
  • the terms “upwardly”, “downwardly”, “vertical”, “horizontal” and the like are used herein for the purpose of explanation only unless specifically indicated otherwise.
  • first and second may be used herein to describe various features/elements (including steps), these features/elements should not be limited by these terms, unless the context indicates otherwise. These terms may be used to distinguish one feature/element from another feature/element.
  • a numeric value may have a value that is +/- 0.1% of the stated value (or range of values), +/- 1% of the stated value (or range of values), +/- 2% of the stated value (or range of values), +/- 5% of the stated value (or range of values), +/- 10% of the stated value (or range of values), etc.
  • Any numerical values given herein should also be understood to include about or approximately that value, unless the context indicates otherwise. For example, if the value “10” is disclosed, then “about 10” is also disclosed. Any numerical range recited herein is intended to include all sub-ranges subsumed therein.

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The variant caller algorithms described herein can be used to identify various mutations in sequencing data. The output of a variant caller, such as an open-source variant caller such as GATK/mutect2, can be further filtered to improve sensitivity and/or accuracy. A first filter stage can be used to remove background noise, and a second filter stage can use a machine learning classifier model to further filter the variants. An optional third filter stage can use a blocklist to filter systemic noise.

Description

INTERNATIONAL PATENT APPLICATION

Title: SYSTEMS AND METHODS FOR SOMATIC SMALL VARIANT CALLING

Inventors:
Mahdi Golkaram, a citizen of Iran, resident of San Diego, CA
Badri Padhukasahasram, a U.S. citizen, resident of Bangaluru, India
Chen Zhao, a citizen of China, resident of San Diego, CA

Assignee: Roche Sequencing Solutions, Inc., 4300 Hacienda Drive, Pleasanton, CA 94588, United States of America

Entity: Large

SYSTEMS AND METHODS FOR SOMATIC SMALL VARIANT CALLING

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This Application claims the benefit of United States Provisional Patent Application No. 63/635,568, filed on April 17, 2024, which is hereby incorporated by reference in its entirety.

BACKGROUND

[0002] The development of affordable and rapid DNA sequencing technologies has enabled the development of targeted therapeutics that rely on the use of DNA biomarkers to identify patients that are suitable for receiving the targeted therapy. For example, mutations in certain genes, such as genes involved in cell proliferation, are known to lead to certain types of cancers that can be treated very effectively with specific types of drugs. Other mutations are known to confer resistance to certain therapies. Therefore, there is a need for improved systems and methods to identify variants from sequencing data.

SUMMARY

[0003] The embodiments described herein relate to systems and methods for performing variant calling on sequencing data. More particularly, the embodiments described herein relate to calling single nucleotide variants, insertions, and deletions in the sequencing data.

[0004] In accordance with a first aspect of the present disclosure, a method for somatic variant calling is provided.
The method includes: receiving a computer file comprising a plurality of consensus reads generated from a patient sample; aligning the plurality of consensus reads; determining a callable region from the aligned consensus reads; determining an active region from the callable region based on a comparison of the callable region to a reference sequence and identifying one or more differences in the active region compared to the reference sequence; generating an assembly graph from the reference sequence and the active region, wherein the assembly graph comprises a plurality of paths, wherein each path comprises a plurality of path segments, and wherein each path segment comprises a count of the consensus reads that support the path segment; determining a plurality of haplotype sequences from a plurality of paths of the assembly graph; determining a likelihood score for each haplotype sequence based on the count of consensus reads that support the path segments included in each haplotype sequence; determining a plurality of candidate variants by comparing a weighted count for each variant to a first threshold that is dynamically determined from the patient sample, wherein the weighted count is the number of consensus reads with the candidate variant; filtering candidate variants with a weighted count greater than the first threshold with a machine learning classification model that generates a probability score for each candidate variant; and determining a plurality of filtered variants by comparing the probability score for each candidate variant to a second threshold.

PATENT Client Reference No.: P39265-WO-1

[0005] In some embodiments, the callable region is based on a minimum coverage requirement.

[0006] In some embodiments, the assembly graph is pruned to remove at least one path in the plurality of paths. In other embodiments, the assembly graph is not pruned.
[0007] In some embodiments, the first threshold is dynamically determined by fitting a sample-specific regression model to a background distribution of weighted counts for the candidate variant.

[0008] In some embodiments, the machine learning classification model includes one or more features selected from the group consisting of: nonduplex counts, duplex counts, weighted counts, distance of the candidate variant to a 5’ end of the consensus read, substitution type for single nucleotide variants, sequence context, strand bias, mapping quality, base quality, cluster size, and duplex fraction.

[0009] In some embodiments, the machine learning classification model comprises a gradient boosted decision tree algorithm.

[0010] In some embodiments, the gradient boosted decision tree algorithm is selected from the group consisting of: LightGBM and XGBoost.

[0011] In accordance with a second aspect of the present disclosure, a system is provided for generating sequencing data. The system may include an assay device and/or a logic system. The logic system may include a processor coupled to a memory storing instructions executable by the processor. The processor, upon execution of the instructions, is configured to perform the method of one or more embodiments of the first aspect.

[0012] In some embodiments of the second aspect, the system further includes a treatment device for determining and/or administering a treatment to the patient based on at least one filtered variant in the plurality of filtered variants.

[0013] In some embodiments of the second aspect, the system further includes a reporting device for displaying information relating to at least one filtered variant in the plurality of filtered variants.

[0014] In accordance with a third aspect of the present disclosure, a non-transitory computer-readable medium is provided.
The computer-readable medium stores a set of instructions that, upon execution by at least one processor, cause the processor to perform the method of one or more embodiments of the first aspect.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] The features of the invention are set forth with particularity in the claims that follow. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings thereof.

[0016] FIG. 1 is a flow chart of a method for a variant caller workflow, in accordance with some embodiments.

[0017] FIG. 2 is a graph illustrating how a threshold is set by fitting a sample-specific regression model to the background distribution of weighted variant counts for all the positions with evidence of alternate alleles (i.e., candidate variant sites) within the sample, in accordance with some embodiments.

[0018] FIG. 3 illustrates the relationship between false positives and sensitivity of the variant caller, in accordance with some embodiments.

[0019] FIG. 4 illustrates a flow chart of a three-stage filtering process that can be used to improve variant caller sensitivity and/or accuracy, in accordance with some embodiments.

[0020] FIG. 5 illustrates a sequencing system, in accordance with some embodiments.

[0021] FIG. 6 illustrates an exemplary computer apparatus, in accordance with some embodiments.

DETAILED DESCRIPTION

[0022] The present disclosure provides a number of techniques for performing variant calling, as part of a secondary analysis workflow of sequencing data produced by today’s next generation sequencing devices.
More particularly, a variant calling algorithm is provided for identifying single nucleotide variants (SNVs) and/or insertions/deletions (InDels) from sequencing data, such as nanopore-based sequencing data.

[0023] In at least one aspect of a secondary analysis workflow, a variant calling algorithm is provided that generates an assembly graph using a reference sequence for one or more active regions of sequencing data (e.g., a plurality of reads generated by sequencing a sample). Weights for each edge of the graph are then calculated based on the sequencing data. Candidate variants are then identified by traversing the graph. The candidate variants are then filtered according to a two-stage process. In the first stage, candidate variants are removed based on a comparison of the candidate variant weight with a threshold. In the second stage, a machine learning model is used to score the candidate variants that passed the first stage according to a number of features. The score is then compared against a threshold to determine a final list of called variants for the active region.

[0024] In another aspect of the secondary analysis workflow, an InDel caller algorithm is provided. The InDel caller algorithm can include a hotspot module and an adaptive module. In the hotspot module, the InDel caller algorithm uses a number of heuristics to identify InDel variants in hotspot regions. In parallel, the adaptive module uses a two-stage approach to identify variants with a low allele frequency (AF) that might not be strongly supported in a plasma sample. In the first stage, a core linear regression model is used to filter out a number of variants based on a comparison of each candidate variant’s score with a threshold. In the second stage, a machine learning classifier, such as a gradient boosting machine (GBM) model, is used to score each variant that passes the first stage, and the classifier score is then compared to a threshold to determine the final list of called variants.
Optionally, the called variants can be further filtered by a blocklist filter.

[0025] In yet another aspect of the secondary analysis workflow, an SNV caller algorithm is provided that is similar in many aspects to the InDel caller algorithm described above. The SNV caller algorithm can include a hotspot module and an adaptive module. The hotspot module is similar to the hotspot module of the InDel caller algorithm, with the exception that one or more additional heuristics may be included in addition to or in lieu of the heuristics used in the InDel caller algorithm. The adaptive module of the SNV caller algorithm also uses a two-stage approach to identify variants. In the first stage, a core linear regression model is used to filter out a number of variants based on a comparison of each candidate variant’s score with a threshold. In the second stage, an extended linear model that incorporates a number of additional weighted features is used to filter out variants that passed the first stage based on a comparison of each candidate variant’s score with a threshold.

[0026] These secondary analysis workflows can be used to analyze sequencing data from a number of next generation sequencing devices, including nanopore-based sequencing systems. Correctly identifying variants from the raw sequencing data is an important step in the bioinformatics pipeline used to diagnose patients and/or create therapies tailored to a particular disease.

Sequencing

[0027] Prepared nucleic acid molecules of interest (e.g., a sequencing library) can be sequenced using a sequencing assay as part of the procedure for determining sequencing reads for a plurality of microsatellite loci. Any of a number of sequencing technologies or sequencing assays can be utilized.
The term "Next Generation Sequencing (NGS)" as used herein refers to sequencing methods that allow for massively parallel sequencing of clonally amplified molecules and of single nucleic acid molecules (or of nucleic acid analogues).

[0028] Non-limiting examples of sequencing assays that are suitable for use with the methods disclosed herein include nanopore sequencing (see, e.g., U.S. Patent Application Publication Nos. 2013/0244340, 2013/0264207, 2014/0134616, 2015/0119259, and 2015/0337366), Sanger sequencing, capillary array sequencing, thermal cycle sequencing (see, e.g., Sears et al., Biotechniques, 13:626-633 (1992)), solid-phase sequencing (see, e.g., Zimmerman et al., Methods Mol. Cell Biol., 3:39-42 (1992)), sequencing with mass spectrometry such as matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF/MS) (see, e.g., Fu et al., Nature Biotech., 16:381-384 (1998)), sequencing by hybridization (see, e.g., Drmanac et al., Nature Biotech., 16:54-58 (1998)), and NGS methods, including but not limited to sequencing by synthesis (see, e.g., HiSeq™, MiSeq™, or Genome Analyzer, each available from Illumina, Inc. (San Diego, CA)), sequencing by ligation (see, e.g., SOLiD™, available from Thermo Fisher Scientific, Inc. (Waltham, MA)), ion semiconductor sequencing (see, e.g., Ion Torrent™, available from Thermo Fisher Scientific, Inc. (Waltham, MA)), and SMRT® sequencing, available from Pacific Biosciences of California, Inc. (Menlo Park, CA).

[0029] Commercially available sequencing technologies include: sequencing-by-hybridization platforms from Affymetrix, Inc. (Sunnyvale, CA), sequencing-by-synthesis platforms from Illumina, Inc. (San Diego, CA), and sequencing-by-ligation platforms from Thermo Fisher Scientific, Inc. (Waltham, MA). Other sequencing technologies include, but are not limited to, the Ion Torrent technology by Thermo Fisher Scientific, Inc.
(Waltham, MA), and nanopore sequencing by Roche Sequencing Solutions, Inc. (Santa Clara, CA) and/or Oxford Nanopore Technologies, plc (Oxford, United Kingdom).

Bioinformatics Workflow Overview

[0030] The output of an NGS sequencer is generally processed by a bioinformatics pipeline that translates the raw signal from the sequencer into base calls, often referred to as raw reads, which are typically stored in a FASTQ file that combines the raw reads with associated quality data. This portion of the bioinformatics pipeline is often referred to as primary analysis.

[0031] The next section of the bioinformatics pipeline is called secondary analysis. It takes the raw reads generated by the primary analysis and performs several tasks, including alignment and variant calling.

[0032] Tertiary analysis is the final portion of the bioinformatics pipeline and uses the variant calling information to generate medical insights that health care practitioners can use to improve treatments for their patients.

Secondary Analysis

[0033] New sequencing technologies, such as nanopore-based sequencers, generate sequencing data with different characteristics than sequencing data generated by the current market-leading sequencers, such as Illumina sequencers. For example, these differences can include differences in raw read accuracy and differences in the error profiles. Because Illumina sequencers currently dominate the market, the vast majority of the secondary analysis software tools that have been developed are custom tailored to process the type of data that is generated by the Illumina sequencers. These software tools, which typically work very well with data from Illumina sequencers, may not work well with data generated by new next generation sequencing technologies, such as nanopore sequencers.
Consequently, there is a need to develop new secondary analysis tools that work well with the new sequencing technologies that are currently being developed. In addition, although the variant calling methods described herein may be particularly effective with nanopore sequencing data, the methods can also be used with other types of sequencing data, such as data from an Illumina sequencer.

[0034] FIG. 1 illustrates a secondary analysis workflow for a variant calling algorithm, in accordance with some embodiments. The variant calling algorithm leverages a portion of the approach used by Mutect2, a haplotype variant caller with somatic-specific genotyping and filtering that is available as part of the Genome Analysis Toolkit (GATK) maintained by the Broad Institute, with several changes or additions to that algorithm.

[0035] The variant calling algorithm starts by receiving consensus reads 100 from, for example, a nanopore sequencer or another type of sequencer. These reads are typically received in a file using the FASTQ format. The consensus reads stored in the file can be accessed and processed by the pipeline to perform an alignment of the consensus reads against a reference sequence. A callable region 104 can then be identified as a region with a sufficient depth of read coverage to allow for variant calling. For example, the callable region 104 can have a minimum depth of read coverage of at least 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50. In some embodiments, the required depth of coverage can vary depending on a variety of factors, such as the types of bases in the region (e.g., a GC-rich region or another type of motif that may have a higher error rate during sequencing), and the quality scores of the bases in the region.
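The callable region step in paragraph [0035] can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the per-base depth list, the function name `callable_regions`, and the default minimum depth of 20 (one of the example values listed above) are hypothetical, and a real pipeline would derive depth from the aligned reads and might vary the requirement by sequence content or base quality.

```python
from typing import List, Tuple


def callable_regions(depth: List[int], min_depth: int = 20) -> List[Tuple[int, int]]:
    """Return half-open [start, end) intervals whose per-base depth is >= min_depth.

    `depth` is a hypothetical per-position read depth over one contig.
    """
    regions: List[Tuple[int, int]] = []
    start = None
    for pos, d in enumerate(depth):
        if d >= min_depth:
            if start is None:
                start = pos  # open a new callable interval
        elif start is not None:
            regions.append((start, pos))  # close the interval at the first low-depth base
            start = None
    if start is not None:
        regions.append((start, len(depth)))  # interval runs to the end of the contig
    return regions


if __name__ == "__main__":
    toy_depth = [3, 25, 30, 22, 8, 8, 40, 41]
    print(callable_regions(toy_depth, min_depth=20))
```

Raising `min_depth` shrinks the callable regions, which trades coverage of the panel for more reliable calls within the regions that remain.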
[0036] Next, within the callable regions 104, mismatches or gaps between the base calls of the reads and the reference sequence are used to identify active regions 108 that contain potential SNVs (single nucleotide variants) and/or InDels (insertions and deletions) 106.

[0037] For each active region 108, an assembly graph 110, such as a de Bruijn graph for example, is generated by starting with the reference sequence, which is decomposed into a series of kmers (short sequences k bases long), with each successive kmer overlapping the previous kmer by k-1 bases. The kmers can be represented as nodes that can be joined by lines called edges. The edges can be weighted to keep track of the number of kmers found in the sample, with the weights initially set to zero. This forms a reference graph.

[0038] Next, each sequence read can similarly be decomposed into a series of kmers, and can be matched to the reference graph. Each time two successive kmers are matched to the graph, the weight of the edge joining the two kmer nodes is incremented by one. If a kmer cannot be matched to the graph, a new node and edge are added. This is repeated for all the kmers in the sequence read. Under this method, the weights indicate the number of times a particular kmer was found in the sequence reads. In addition, the weights can also be used to determine the most likely path through the graph.

[0039] In some embodiments, the assembly graph can then be optionally pruned by removing sections of the graph that are supported by an edge weight that is less than a threshold value, such as 2. A weight of 2, for example, means that 2 reads in the sample support that segment of the graph. The threshold can be increased for more aggressive pruning, which will result in faster processing and higher specificity, but with lower sensitivity. Decreasing the threshold means less pruning, which lowers the specificity but increases the sensitivity.
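The graph construction and pruning described in paragraphs [0037] to [0039] can be sketched in Python as follows. This is a toy illustration, not the GATK/Mutect2 implementation: the function names, the example sequences, the kmer length of 3, and the choice to always retain reference edges during pruning are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (source kmer, destination kmer)


def kmers(seq: str, k: int) -> List[str]:
    """Decompose a sequence into successive kmers that overlap by k-1 bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def edges(seq: str, k: int) -> List[Edge]:
    """Pairs of successive kmers, i.e. the graph edges a sequence passes through."""
    ks = kmers(seq, k)
    return list(zip(ks, ks[1:]))


def build_graph(reference: str, reads: Iterable[str], k: int) -> Dict[Edge, int]:
    """Start from the reference graph (weights 0), then add one per read traversal."""
    weights: Dict[Edge, int] = {e: 0 for e in edges(reference, k)}
    for read in reads:
        for e in edges(read, k):
            # Unmatched kmers create new nodes/edges; matched edges are incremented.
            weights[e] = weights.get(e, 0) + 1
    return weights


def prune(weights: Dict[Edge, int], reference: str, k: int, min_weight: int) -> Dict[Edge, int]:
    """Drop edges supported by fewer than min_weight reads.

    Assumption for this sketch: reference edges are always kept.
    """
    ref_edges = set(edges(reference, k))
    return {e: w for e, w in weights.items() if w >= min_weight or e in ref_edges}


if __name__ == "__main__":
    reference = "ACGTTGCA"
    reads = ["ACGTTGCA", "ACGTTGCA", "ACGATGCA"]  # last read carries a T>A change
    graph = build_graph(reference, reads, k=3)
    print(len(graph), "edges before pruning")
    print(len(prune(graph, reference, 3, min_weight=2)), "edges after pruning at 2")
```

In this toy input the single variant-supporting read adds four edges of weight 1, all of which are removed at a pruning threshold of 2, which shows how a higher threshold can discard low-frequency variants.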
In some embodiments, the threshold can be set to zero, which means that the graph is not pruned, for maximum sensitivity.

[0040] After the graph has been assembled and optionally pruned, haplotype sequences can be generated from the graph by traversing all paths in the graph, with a likelihood score calculated for each haplotype sequence as the product of the transition probabilities of the path edges. The probability of an edge can be calculated as the weight of the edge divided by the sum of the weights of all the edges that share the same source node. The haplotypes with the highest likelihood scores can be used for candidate variant detection 112.

[0041] These haplotypes are then aligned to the reference sequence in order to identify candidate variants. The alignment can be done using a Smith-Waterman alignment (SWA), for example, and can be stored in a SAM or BAM file. The candidate variants can be determined by comparing the aligned sequence to the reference sequence and can be stored in a .vcf file.

[0042] The steps described above can be performed, for example, using optimized settings for GATK, an open source variant caller. The optimized settings include: initial-tumor-lod -5, tumor-lod-to-emit -5, pruning-lod-threshold -9, max-reads-per-alignment-start 0, active-probability-threshold 0.00005, and min-pruning 0. These parameters are changed from the defaults to more sensitively detect candidate variants at 0.1% AF from realignment. The optimal parameters can be identified by maximizing sensitivity based on a grid search.

[0043] The candidate variants can be filtered using a two-stage process, which is designed to reduce false positives and improve precision. As shown in FIG. 1, in the first stage 114, the weighted counts of variant molecules are compared against a threshold that is dynamically learned from the sample.
This threshold is computed by fitting a sample-specific regression model to the background distribution of weighted variant counts for all the positions with evidence of alternate alleles (i.e., candidate variant sites) within the sample. In some embodiments, smooth splines can optionally be used instead of a regression model to fit more general curves. All the variants with weighted counts above the sample threshold are retained for the second stage. The sample-specific regression is illustrated in FIG. 2. The vertical line in FIG. 2, which shows the threshold, can be set to achieve a particular desired or predetermined value of the cumulative counts for the y-axis, such as 1/100, 1/200, 1/300, 1/400, 1/500, 1/600, 1/700, 1/800, 1/900, or 1/1000 cumulative counts per base pair of the panel. This can be learned from a training dataset, where the cumulative counts can be matched to or set based on a known number of variants, depending on a desired sensitivity and precision profile.

[0044] In the second stage, as shown in FIG. 1, a machine learning classification model 116 (e.g., LightGBM or XGBoost) is used to further filter variants and generate the final calls. This model uses the counts of molecules supporting variants as well as additional features such as distance of the variant with respect to the 5’ end of the fragment, substitution type (for SNVs), sequence context, cluster size, strand bias, simplex/duplex support, ref/alt, base quality score (e.g., base quality score used in GATK), MAPQ score, and strand information to produce a probability score for each variant. Sequence context is the base upstream and downstream of the variant position. Therefore, if the variant occurs at position 1000 in chromosome 1, the sequence context is the combination of bases at position 999 and position 1001. Strand bias is a measure of how unbalanced the distribution of plus and minus strands is in the data.
If plus and minus strands are equal in number, then strand bias is 0. If either the plus or the minus strand count is much higher than the other, strand bias takes a higher value, up to a maximum of 0.25. Table 1 below lists all the features extracted from consensus reads that can be used by the machine learning classifier model 116. In some embodiments, the classifier can use any combination of the features listed herein. For example, the classifier can use all the features in some embodiments or use a subset of the features in other embodiments.

Table 1

Feature | Type | Description
Nonduplex counts | Numeric | Number of variant molecules with support from only the + or - strand
Duplex counts | Numeric | Number of variant molecules with support from both strands (+/-)
Weighted counts | Numeric | Linear weighted combination of duplex and nonduplex counts
Distance | Numeric | Median distance of variant from nearest fragment end
Duplex fraction | Numeric | Fraction of variant molecules that are duplex
Strand bias | Numeric | Metric that measures the imbalance of support from + or - strands
Mapping quality (mq) | Numeric | Median mapping quality of variant molecules
Base quality (bq) | Numeric | Median base quality of variant molecules at variant position
Cluster size | Numeric | Median number of raw reads for variant molecules/clusters
Substitution type | Categorical | For SNVs, description of the base change, e.g., A -> T
Context | Categorical | 1 bp context upstream and downstream of variant’s position

[0045] All the variants whose classifier probability score from the machine learning classifier model 116 is above the classifier threshold represent the final calls that pass this two-stage filtration. The sensitivity versus false positives trade-off from the filtration step is illustrated in FIG. 3 for varying values of the classifier threshold.
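A few of the count-based features in Table 1 can be sketched in Python as follows. The source does not give the weights of the linear combination or the strand bias formula, so both are assumptions here: the weights 1.0 and 0.5 are placeholders, and strand bias is computed as the squared deviation of the plus-strand fraction from 0.5, which is merely one formula consistent with the stated range (0 when balanced, 0.25 when all support is on one strand). The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class VariantSupport:
    duplex: int      # variant molecules supported on both strands
    plus_only: int   # nonduplex molecules seen only on the + strand
    minus_only: int  # nonduplex molecules seen only on the - strand


def count_features(
    s: VariantSupport,
    duplex_weight: float = 1.0,     # assumed weight, not from the source
    nonduplex_weight: float = 0.5,  # assumed weight, not from the source
) -> Dict[str, float]:
    nonduplex = s.plus_only + s.minus_only
    total = s.duplex + nonduplex
    # A duplex molecule contributes support to both strands.
    plus = s.duplex + s.plus_only
    minus = s.duplex + s.minus_only
    plus_fraction = plus / (plus + minus) if (plus + minus) else 0.5
    return {
        "nonduplex_counts": nonduplex,
        "duplex_counts": s.duplex,
        "weighted_counts": duplex_weight * s.duplex + nonduplex_weight * nonduplex,
        "duplex_fraction": s.duplex / total if total else 0.0,
        # Assumed formula: 0 when strands are balanced, 0.25 when one strand has all support.
        "strand_bias": (plus_fraction - 0.5) ** 2,
    }


if __name__ == "__main__":
    print(count_features(VariantSupport(duplex=2, plus_only=3, minus_only=1)))
```

A feature dictionary of this kind, extended with the remaining Table 1 entries, is the sort of per-variant input a gradient boosted classifier could score.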
[0046] The machine learning classification threshold can be determined, for example, based on a training set with known somatic variant positive samples and negative/healthy samples which are not expected to contain somatic variants. The threshold is set or selected to maximize the F1 score for variants in the data. The F1 score is a metric that combines precision and recall as their harmonic mean.

[0047] As illustrated in FIG. 4, an optional third filtering step can be used to remove systemic noise that can be introduced during the sequencing process, such as errors introduced during sample preparation, for example.

[0048] Starting with the candidate variants 400 and assembled BAM 401 that is the output of a variant caller (corresponding to step 114 in FIG. 1), the first filtering stage 402 passes candidate variants with weighted counts greater than a sample-specific threshold, as described above with respect to step 114 in FIG. 1. If the candidate variant is below the threshold, it is filtered out as background noise 404.

[0049] If the candidate variant has a weighted count that is greater than the first stage 402 threshold, it is passed to the second stage 406 (as described above with respect to step 116 in FIG. 1) for the machine learning classifier filtering step 406. If the variant has an ML score that is less than the ML threshold, then it is filtered out 408 for one or more reasons, such as having low quality, low MAPQ, low base quality, strand bias, low depth, etc. If the variant has an ML score that is greater than the ML threshold, then it is passed to an optional third stage 410 blocklist filter.

[0050] The blocklist filter can be created from normal samples starting with the VCF files from a variant caller such as GATK/mutect2. Any variant that appears in multiple normal samples, which are expected to not have any variants, with a count greater than or equal to 3 can be added to the blocklist.
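The blocklist construction in paragraph [0050] can be sketched in Python as follows. The sketch assumes that the "count greater than or equal to 3" refers to the number of normal samples in which a variant is called; the source wording could also be read as a per-sample supporting count. The variant tuple layout, the function name `build_blocklist`, and the in-memory call sets (standing in for parsed VCF files) are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref allele, alt allele)


def build_blocklist(normal_calls: Iterable[Set[Variant]], min_samples: int = 3) -> Set[Variant]:
    """Collect variants called in at least `min_samples` normal samples.

    Each element of `normal_calls` is the set of variants called in one normal sample.
    """
    seen_in = Counter()
    for calls in normal_calls:
        seen_in.update(set(calls))  # count each variant at most once per sample
    return {variant for variant, n in seen_in.items() if n >= min_samples}


if __name__ == "__main__":
    a = ("chr1", 1000, "A", "T")
    b = ("chr2", 2500, "G", "GA")
    c = ("chr7", 42, "C", "G")
    normals = [{a, b}, {a}, {a, b}, {b, c}]
    blocklist = build_blocklist(normals, min_samples=3)
    print(sorted(blocklist))
    print(c in blocklist)  # seen in only one normal sample, so not blocklisted
```

Because normal samples are not expected to carry somatic variants, a call that recurs across several of them is more likely to reflect systemic noise than a true variant.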
[0051] If the variant passed to the third stage 410 is on the blocklist, it is filtered out as systemic noise 412. If the variant passed to the third stage 410 is not on the blocklist, it is passed as a called variant 414.

[0052] In some embodiments, the third stage 410 can include an additional rescue step for variants on the blocklist. If the variant passed to the third stage 410 is on the blocklist and has a high score (i.e., a score above a threshold) based on the allele frequency (AF) in the target sample vs AF in the healthy samples, then the variant is rescued and passed as a called variant 414. The formula for calculating the score is:

$\text{score} = \dfrac{AF_{\text{sample}} - \mu}{\sigma}$, (Eq. 1)

where $AF_{\text{sample}}$ is the allele frequency in the target sample, $\mu$ is the mean of allele frequencies in the healthy samples, and $\sigma$ is the standard deviation of allele frequencies in the healthy samples. The threshold can be determined using training data to rescue known variants that are on the blocklist.

Example Systems

[0053] FIG. 5 illustrates a sequencing system 500 according to an embodiment of the present disclosure. The system as shown includes a sample 505, such as Xpandomers, within an assay device 510, where an assay 508 can be performed on sample 505. For example, sample 505 can be contacted with reagents of assay 508 to provide a signal (e.g., an intensity signal) of a physical characteristic 515 (e.g., sequence information of a cell-free nucleic acid molecule). Assay 508 may include sequencing by expansion with an assay device 510, such as a nanopore sequencing device as discussed above. Physical characteristic 515 (e.g., a fluorescence intensity, a voltage, or a current) from the sample is detected by detector 520. Detector 520 can take a measurement at intervals (e.g., periodic intervals) to obtain data points that make up a data signal. In one embodiment, an analog-to-digital converter converts an analog signal from the detector into digital form at a plurality of times.
[0054] Assay device 510 and detector 520 can form an assay system, e.g., a sequencing system 500 that performs sequencing according to embodiments described herein. A data signal 525 is sent from detector 520 to logic system 530. As an example, data signal 525 can be used to determine sequences and/or locations in a reference genome of nucleic acid molecules (e.g., DNA and/or RNA). Data signal 525 can include various measurements made at a same time, e.g., different colors of fluorescent dyes or different electrical signals for different molecules of sample 505, and thus data signal 525 can correspond to multiple signals. Data signal 525 may be stored in a local memory 535, an external memory 540, or a storage device 545. The sequencing system 500 can be comprised of multiple assay devices 510 and detectors 520.
[0055] Logic system 530 may be, or may include, a computer system, ASIC, processor, graphics processing unit (GPU), etc. It may also include or be coupled with a display (e.g., monitor, LED display, etc.) and a user input device (e.g., mouse, keyboard, buttons, etc.). Logic system 530 and the other components may be part of a stand-alone or network connected computer system, or they may be directly attached to or incorporated in a device (e.g., a sequencing device) that includes detector 520 and/or assay device 510. Logic system 530 may also include software that executes in a processor 550. Logic system 530 may include a computer readable medium storing instructions for controlling sequencing system 500 to perform any of the methods described herein. For example, logic system 530 can provide commands to a system that includes assay device 510 such that sequencing or other physical operations are performed. Such physical operations can be performed in a particular order, e.g., with reagents being added and removed in a particular order.
Such physical operations may be performed by a robotics system, e.g., including a robotic arm, as may be used to obtain a sample and perform an assay 508. Logic system 530 can also perform any steps of methods described herein that perform computer processing, such as, but not limited to, base calling, alignment, variant calling, and the variant caller algorithm shown in FIG. 1.
[0056] Sequencing system 500 may also include a treatment device 560, which can provide a treatment to the subject. Treatment device 560 can determine a treatment and/or be used to perform a treatment. Examples of such treatment can include surgery, radiation therapy, chemotherapy, immunotherapy, targeted therapy, hormone therapy, and stem cell transplant. Logic system 530 may be connected to treatment device 560, e.g., to provide results of a method described herein. The treatment device may receive inputs from other devices, such as an imaging device and user inputs (e.g., to control the treatment, such as controls over a robotic system).
[0057] Sequencing system 500 may also include a reporting device 555, which can present results of any of the methods described herein, e.g., as determined using the sequencing system 500. Reporting device 555 can be in communication with a reporting module within logic system 530 that can aggregate, format, and send a report to reporting device 555. The reporting module can present information determined using any of the methods described herein. The information can be presented by reporting device 555 in any format that can be recognized and interpreted by a user of the sequencing system 500. For example, the information can be presented by reporting device 555 in a displayed, printed, or transmitted format, or any combination thereof.
[0058] Any of the computer systems mentioned herein may utilize any suitable number of subsystems. Examples of such subsystems are shown in FIG. 6 in computer system 600.
In some embodiments, a computer system 600 includes a single computer apparatus, where the subsystems can be the components of the computer apparatus. In other embodiments, a computer system 600 can include multiple computer apparatuses, each being a subsystem, with internal components. A computer system can include desktop and laptop computers, tablets, mobile phones, and other mobile devices.
[0059] The subsystems shown in FIG. 6 are interconnected via a system bus 675. Additional subsystems such as a printer 674, keyboard 678, storage device(s) 679, 682, monitor 676 (e.g., a display screen, such as an LED), which is coupled to display adapter 682, and others are shown. Peripherals and input/output (I/O) devices, which couple to I/O controller 671, can be connected to the computer system by any number of means known in the art such as input/output (I/O) port 677 (e.g., USB, FireWire®). For example, I/O port 677 or external interface 681 (e.g., Ethernet, Wi-Fi, etc.) can be used to connect computer system 600 to a wide area network such as the Internet, a mouse input device, or a scanner. The interconnection via system bus 675 allows the central processor 673 to communicate with each subsystem and to control the execution of a plurality of instructions from system memory 672 or the storage device(s) 679 (e.g., a fixed disk, such as a hard drive, or optical disk), as well as the exchange of information between subsystems. The system memory 672 and/or the storage device(s) 679 may embody a computer readable medium. Another subsystem is a data collection device 685, such as a camera, microphone, accelerometer, and the like. Any of the data mentioned herein can be output from one component to another component and can be output to the user.
[0060] A computer system can include a plurality of the same components or subsystems, e.g., connected together by external interface 681, by an internal interface, or via removable storage devices that can be connected and removed from one component to another component. In some embodiments, computer systems, subsystems, or apparatuses can communicate over a network. In such instances, one computer can be considered a client and another computer a server, where each can be part of a same computer system. A client and a server can each include multiple systems, subsystems, or components. In various embodiments, methods may involve various numbers of clients and/or servers, including at least 10, 20, 50, 100, 200, 500, 1,000, or 10,000 devices. Methods can include various numbers of communication messages between devices, including at least 100, 200, 500, 1,000, 10,000, 50,000, 100,000, 500,000, or one million communication messages. Such communications can involve at least 1 MB, 10 MB, 100 MB, 1 GB, 10 GB, or 100 GB of data.
[0061] Aspects of embodiments can be implemented in the form of control logic using hardware circuitry (e.g., an application specific integrated circuit or field programmable gate array) and/or using computer software stored in a memory with a generally programmable processor in a modular or integrated manner, and thus a processor can include memory storing software instructions that configure hardware circuitry, as well as an FPGA with configuration instructions or an ASIC. As used herein, a processor can include a single-core processor, multi-core processor on a same integrated chip, or multiple processing units on a single circuit board or networked, as well as dedicated hardware.
Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement embodiments of the present disclosure using hardware and a combination of hardware and software.
[0062] Any of the software components or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C, C++, C#, Objective-C, Swift, or scripting language such as Perl or Python using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission. A suitable non-transitory computer readable medium can include random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard-drive or a floppy disk, or an optical medium such as a compact disk (CD) or DVD (digital versatile disk) or Blu-ray disk, flash memory, and the like. The computer readable medium may be any combination of such devices. In addition, the order of operations may be re-arranged. A process can be terminated when its operations are completed but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.
[0063] Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet. As such, a computer readable medium may be created using a data signal encoded with such programs.
Computer readable media encoded with the program code may be packaged with a compatible device (e.g., as firmware) or provided separately from other devices (e.g., via Internet download). Any such computer readable medium may reside on or within a single computer product (e.g., a hard drive, a CD, or an entire computer system), and may be present on or within different computer products within a system or network. A computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.
[0064] Any of the methods described herein may be totally or partially performed with a computer system including one or more processors, which can be configured to perform the steps. Any operations performed with a processor may be performed in real-time. The term “real-time” may refer to computing operations or processes that are completed within a certain time constraint. As examples, a time constraint may be 30 seconds, 1 minute, 10 minutes, 30 minutes, 1 hour, 4 hours, 1 day, or 7 days. Thus, embodiments can be directed to computer systems configured to perform the steps of any of the methods described herein, potentially with different components performing a respective step or a respective group of steps. Although presented as numbered steps, steps of methods herein can be performed at a same time or at different times or in a different order. Additionally, portions of these steps may be used with portions of other steps from other methods. Also, all or portions of a step may be optional. Additionally, any of the steps of any of the methods can be performed with modules, units, circuits, or other means of a system for performing these steps.
[0065] The specific details of particular embodiments may be combined in any suitable manner without departing from the spirit and scope of embodiments of the disclosure.
However, other embodiments of the disclosure may be directed to specific embodiments relating to each individual aspect, or specific combinations of these individual aspects.
[0066] The above description of example embodiments of the present disclosure has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise form described, and many modifications and variations are possible in light of the teaching above.
[0067] When a feature or element is herein referred to as being “on” another feature or element, it can be directly on the other feature or element or intervening features and/or elements may also be present. In contrast, when a feature or element is referred to as being “directly on” another feature or element, there are no intervening features or elements present. It will also be understood that, when a feature or element is referred to as being “connected”, “attached” or “coupled” to another feature or element, it can be directly connected, attached or coupled to the other feature or element or intervening features or elements may be present. In contrast, when a feature or element is referred to as being “directly connected”, “directly attached” or “directly coupled” to another feature or element, there are no intervening features or elements present. Although described or shown with respect to one embodiment, the features and elements so described or shown can apply to other embodiments. It will also be appreciated by those of skill in the art that references to a structure or feature that is disposed “adjacent” another feature may have portions that overlap or underlie the adjacent feature.
[0068] Terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
For example, as used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items and may be abbreviated as “/”.
[0069] Spatially relative terms, such as “under”, “below”, “lower”, “over”, “upper” and the like, may be used herein for ease of description to describe one element or feature’s relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if a device in the figures is inverted, elements described as “under” or “beneath” other elements or features would then be oriented “over” the other elements or features. Thus, the exemplary term “under” can encompass both an orientation of over and under. The device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly. Similarly, the terms “upwardly”, “downwardly”, “vertical”, “horizontal” and the like are used herein for the purpose of explanation only unless specifically indicated otherwise.
[0070] Although the terms “first” and “second” may be used herein to describe various features/elements (including steps), these features/elements should not be limited by these terms, unless the context indicates otherwise.
These terms may be used to distinguish one feature/element from another feature/element. Thus, a first feature/element discussed below could be termed a second feature/element, and similarly, a second feature/element discussed below could be termed a first feature/element without departing from the teachings of the present invention.
[0071] Throughout this specification and the claims which follow, unless the context requires otherwise, the word “comprise”, and variations such as “comprises” and “comprising”, mean that various components can be co-jointly employed in the methods and articles (e.g., compositions and apparatuses including device and methods). For example, the term “comprising” will be understood to imply the inclusion of any stated elements or steps but not the exclusion of any other elements or steps.
[0072] As used herein in the specification and claims, including as used in the examples and unless otherwise expressly specified, all numbers may be read as if prefaced by the word “about” or “approximately,” even if the term does not expressly appear. The phrase “about” or “approximately” may be used when describing magnitude and/or position to indicate that the value and/or position described is within a reasonable expected range of values and/or positions. For example, a numeric value may have a value that is +/- 0.1% of the stated value (or range of values), +/- 1% of the stated value (or range of values), +/- 2% of the stated value (or range of values), +/- 5% of the stated value (or range of values), +/- 10% of the stated value (or range of values), etc. Any numerical values given herein should also be understood to include about or approximately that value, unless the context indicates otherwise. For example, if the value “10” is disclosed, then “about 10” is also disclosed. Any numerical range recited herein is intended to include all sub-ranges subsumed therein.
It is also understood that when a value is disclosed, “less than or equal to” the value, “greater than or equal to” the value, and possible ranges between values are also disclosed, as appropriately understood by the skilled artisan. For example, if the value “X” is disclosed, then “less than or equal to X” as well as “greater than or equal to X” (e.g., where X is a numerical value) is also disclosed. It is also understood that throughout the application, data is provided in a number of different formats, and that this data represents endpoints and starting points, and ranges for any combination of the data points. For example, if a particular data point “10” and a particular data point “15” are disclosed, it is understood that greater than, greater than or equal to, less than, less than or equal to, and equal to 10 and 15 are considered disclosed as well as between 10 and 15. It is also understood that each unit between two particular units is also disclosed. For example, if 10 and 15 are disclosed, then 11, 12, 13, and 14 are also disclosed.
[0073] Although various illustrative embodiments are described above, any of a number of changes may be made to various embodiments without departing from the scope of the invention as described by the claims. For example, the order in which various described method steps are performed may often be changed in alternative embodiments, and in other alternative embodiments one or more method steps may be skipped altogether. Optional features of various device and system embodiments may be included in some embodiments and not in others. Therefore, the foregoing description is provided primarily for exemplary purposes and should not be interpreted to limit the scope of the invention as it is set forth in the claims.
[0074] The examples and illustrations included herein show, by way of illustration and not of limitation, specific embodiments in which the subject matter may be practiced. As mentioned, other embodiments may be utilized and derived there from, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. Such embodiments of the inventive subject matter may be referred to herein individually or collectively by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept, if more than one is, in fact, disclosed. Thus, although specific embodiments have been illustrated and described herein, any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.
CLAIMS
1. A method for somatic variant calling, the method comprising:
receiving a computer file comprising a plurality of consensus reads generated from a patient sample;
aligning the plurality of consensus reads;
determining a callable region from the aligned consensus reads;
determining an active region from the callable region based on a comparison of the callable region to a reference sequence and identifying one or more differences in the active region compared to the reference sequence;
generating an assembly graph from the reference sequence and the active region, wherein the assembly graph comprises a plurality of paths, wherein each path comprises a plurality of path segments, and wherein each path segment comprises a count of the consensus reads that support the path segment;
determining a plurality of haplotype sequences from a plurality of paths of the assembly graph;
determining a likelihood score for each haplotype sequence based on the count of consensus reads that support the path segments included in each haplotype sequence;
determining a plurality of candidate variants by comparing a weighted count for each variant to a first threshold that is dynamically determined from the patient sample, wherein the weighted count is the number of consensus reads with the candidate variant;
filtering candidate variants with a weighted count greater than the first threshold with a machine learning classification model that generates a probability score for each candidate variant; and
determining a plurality of filtered variants by comparing the probability score for each candidate variant to a second threshold.
2. The method of claim 1, wherein the callable region is based on a minimum coverage requirement.
3. The method of claim 1, wherein the assembly graph is pruned to remove at least one path in the plurality of paths.
PCT/US2025/025147 2024-04-17 2025-04-17 Systems and methods for somatic small variant calling Pending WO2025221988A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202463635568P 2024-04-17 2024-04-17
US63/635,568 2024-04-17

Publications (1)

Publication Number Publication Date
WO2025221988A1 (en) 2025-10-23

Family

ID=95714702

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2025/025147 Pending WO2025221988A1 (en) 2024-04-17 2025-04-17 Systems and methods for somatic small variant calling

Country Status (1)

Country Link
WO (1) WO2025221988A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130244340A1 (en) 2012-01-20 2013-09-19 Genia Technologies, Inc. Nanopore Based Molecular Detection and Sequencing
US20130264207A1 (en) 2010-12-17 2013-10-10 Jingyue Ju Dna sequencing by synthesis using modified nucleotides and nanopore detection
US20140134616A1 (en) 2012-11-09 2014-05-15 Genia Technologies, Inc. Nucleic acid sequencing using tags
US20150119259A1 (en) 2012-06-20 2015-04-30 Jingyue Ju Nucleic acid sequencing by nanopore detection of tag molecules
US20150337366A1 (en) 2012-02-16 2015-11-26 Genia Technologies, Inc. Methods for creating bilayers for use with nanopore sensors
WO2017004589A1 (en) * 2015-07-02 2017-01-05 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
US20170270245A1 (en) * 2016-01-11 2017-09-21 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FU ET AL., NATURE BIOTECH., vol. 16, 1998, pages 381 - 384
SEARS ET AL., BIOTECHNIQUES, vol. 13, 1992, pages 626 - 633
ZIMMERMAN ET AL., METHODS MOL. CELL BIOL., vol. 3, 1992, pages 39 - 42

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 25725374

Country of ref document: EP

Kind code of ref document: A1