HK40047016B - Limit of detection based quality control metric

Info

Publication number: HK40047016B
Application number: HK62021036578.4A
Authority: HK
Inventors: 萨拉·L·金宁斯; 科斯明·德丘; 巴德里·帕杜卡萨哈斯拉姆; 迪米特里·斯库沃佐夫
Original assignee: Illumina, Inc.
Priority date: 2019-06-03
Filing date: 2020-06-02
Publication date: 2024-12-06

Description

Quality control metrics based on detection limit

Incorporation by Reference

A PCT Request Form is filed concurrently with this specification as part of this application. Each application to which this application claims benefit or priority, as identified in the concurrently filed PCT Request Form, is incorporated herein by reference in its entirety and for all purposes.

Background

One of the important endeavors in human medical research is the discovery of genetic abnormalities that produce adverse health consequences. In many cases, specific genes and/or critical diagnostic markers have been identified in portions of the genome that are present at abnormal copy numbers. For example, in prenatal diagnosis, extra or missing copies of whole chromosomes are frequently occurring genetic lesions. In cancer, deletion or multiplication of copies of whole chromosomes or chromosomal segments, and higher-level amplifications of specific regions of the genome, are common occurrences.

Most information about copy number variation (CNV) has been provided by cytogenetic resolution, which permits recognition of structural abnormalities. Conventional procedures for genetic screening and biological dosimetry have used invasive procedures, such as amniocentesis, cordocentesis, or chorionic villus sampling (CVS), to obtain cells for karyotype analysis. Recognizing the need for more rapid testing methods that do not require cell culture, fluorescence in situ hybridization (FISH), quantitative fluorescence PCR (QF-PCR), and array-comparative genomic hybridization (array-CGH) have been developed as molecular cytogenetic methods for analyzing copy number variations.

The advent of technologies that can sequence entire genomes in a relatively short time, together with the discovery of circulating cell-free DNA (cfDNA), has provided the opportunity to compare genetic material originating from one chromosome with that originating from another without the risks associated with invasive sampling methods. This provides a tool for diagnosing various kinds of copy number variation in genetic sequences of interest.

Existing non-invasive prenatal diagnostic methods have limitations, including insufficient sensitivity stemming from the limited levels of cfDNA and sequencing bias stemming from the inherent nature of genomic information. There is therefore a continuing need for non-invasive methods that provide any or all of the specificity, sensitivity, and applicability needed to reliably diagnose copy number variations in a variety of clinical settings. Studies have shown that fetal cfDNA fragments in the plasma of pregnant women are, on average, shorter than maternal cfDNA fragments. Implementations herein use this difference between maternal and fetal cfDNA to determine CNV and/or fetal fraction. The embodiments disclosed herein fulfill some of the above needs. Some embodiments provide methods and systems for controlling sample quality in CNV detection, identifying samples with too low a fetal fraction or read coverage for further processing, such as resequencing. Some embodiments provide high analytical sensitivity and specificity for non-invasive prenatal diagnosis and for the diagnosis of a variety of diseases.

Summary of the Invention

Although the examples herein concern humans and the language is primarily directed to human concerns, the concepts described herein apply to genomes from any plant or animal. These and other objects and features of this disclosure will become more fully apparent from the following description and appended claims, or may be learned by practicing the disclosure as set forth below.

One aspect of this disclosure relates to methods for processing test samples, each test sample comprising cell-free nucleic acid fragments originating from a mother and a fetus. The methods are implemented using a computer system that includes memory and one or more processors. In some implementations, a method includes: (a) determining a fetal fraction value of a test sample, wherein the fetal fraction of the test sample indicates the relative amount of fetal-origin cell-free nucleic acid fragments in the test sample; (b) receiving, by the computer system, sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (c) aligning, by the computer system, the sequence reads of the cell-free nucleic acid fragments to a reference genome comprising a sequence of interest, thereby providing sequence tags; (d) determining, by the computer system, a coverage of the sequence tags for at least a portion of the reference genome; (e) determining, based on the coverage of the sequence tags determined in (d) and the fetal fraction determined in (a), that the test sample is within an exclusion region, wherein the exclusion region is defined by at least a fetal fraction limit of detection (LOD) curve, wherein the fetal fraction LOD curve varies with coverage value and indicates the minimum fetal fraction required to achieve a detection criterion at different coverages; and (f) excluding the test sample from a call of a CNV for the sequence of interest, or resequencing the test sample to obtain resequenced sequence reads for making a call of a CNV for the sequence of interest.

In some implementations, the method further includes, before (f), determining that the test sample is negative for a CNV of the sequence of interest.

In some implementations, the method further includes: repeating (a)-(d) using the resequenced sequence reads; determining that the test sample is outside the exclusion region; and calling the test sample as having, or not having, a CNV of the sequence of interest.

In some implementations, the fetal fraction LOD curve is obtained from the LODs of affected training samples, that is, training samples affected by the CNV. In some implementations, the affected samples include in silico samples. In some implementations, the affected samples include in vitro samples. In some implementations, the affected samples are obtained by combining samples having two or more fetal fractions.

In some implementations, the detection criterion is a desired confidence level that, for an observed fetal fraction, the ground-truth fetal fraction is greater than a specified LOD. In some implementations, the detection criterion is X% confidence that, for the observed fetal fraction, the ground-truth fetal fraction is greater than LOD Y%. In some implementations, X is about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99%, or 99.5%. In some implementations, Y is about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or 99% detection confidence. In some implementations, X is 50% and Y is 95%. In some implementations, the specified LOD is determined as the minimum observed fetal fraction at which Y% of affected samples can be detected. In some implementations, the detection criterion for an observed fetal fraction at an observed coverage is obtained using the distribution of ground-truth fetal fractions for that observed fetal fraction at that observed coverage.
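The two quantities in the paragraph above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary of detection rates, and the list of simulated ground-truth fetal fractions are hypothetical inputs, not data or code from this disclosure.

```python
def lod_from_detection_rates(detection_rates, y):
    """Return the smallest fetal fraction whose detection rate is at least y.

    detection_rates: hypothetical mapping {fetal_fraction: fraction of
    affected samples detected at that fetal fraction}.
    y: required detection rate, e.g. 0.95 for LOD 95%.
    Returns None if no listed fetal fraction reaches the required rate.
    """
    candidates = [ff for ff, rate in detection_rates.items() if rate >= y]
    return min(candidates) if candidates else None


def meets_detection_criterion(true_ff_samples, lod, x):
    """Check whether at least a fraction x of the ground-truth fetal
    fractions (simulated for one observed fetal fraction at one
    coverage) exceed the LOD."""
    above = sum(1 for ff in true_ff_samples if ff > lod)
    return above / len(true_ff_samples) >= x


# Hypothetical detection rates for affected samples at four fetal fractions.
rates = {0.01: 0.40, 0.02: 0.80, 0.03: 0.96, 0.04: 0.99}
lod95 = lod_from_detection_rates(rates, 0.95)

# Hypothetical ground-truth fetal fractions for one observed value.
simulated_true_ff = [0.02, 0.03, 0.04, 0.05]
passes = meets_detection_criterion(simulated_true_ff, 0.035, 0.50)
```

With X = 50% the check reduces to asking whether at least half of the ground-truth distribution lies above the LOD.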

In some implementations, the exclusion region is below the fetal fraction LOD curve.

In some implementations, the exclusion region is defined by the fetal fraction LOD curve and a coverage threshold.

In some implementations, the exclusion region is below both the fetal fraction LOD curve and the coverage threshold.
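A minimal sketch of the exclusion-region test follows. It assumes that "below the curve" means the sample's fetal fraction is smaller than the curve's value at the sample's coverage, and that "below the coverage threshold" means the coverage is smaller than the threshold. The curve `example_lod_curve` and its coefficients are hypothetical and chosen only so that the required fetal fraction falls as coverage rises.

```python
def in_exclusion_region(fetal_fraction, coverage, lod_curve,
                        coverage_threshold=None):
    """Return True if a sample falls in the exclusion region.

    lod_curve: callable mapping coverage -> minimum fetal fraction.
    coverage_threshold: optional; when given, the sample must also have
    coverage below this value to be excluded.
    """
    below_lod = fetal_fraction < lod_curve(coverage)
    if coverage_threshold is None:
        return below_lod
    return below_lod and coverage < coverage_threshold


def example_lod_curve(coverage_millions):
    # Hypothetical curve: required fetal fraction decreases with coverage
    # (coverage expressed in millions of sequence tags).
    return 0.01 + 0.06 / coverage_millions


# Same fetal fraction, two coverages: excluded at 3M tags, kept at 6M.
low_cov = in_exclusion_region(0.025, 3.0, example_lod_curve)
high_cov = in_exclusion_region(0.025, 6.0, example_lod_curve)
```

A sample flagged by this check would then be excluded from calling or sent for resequencing, as described in step (f).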

In some implementations, determining the coverage of the sequence tags for the portion of the reference genome includes: (i) dividing the reference genome into a plurality of bins; (ii) determining the number of sequence tags aligning to each bin; and (iii) determining the coverage of the sequence tags using the numbers of sequence tags in the bins within the portion of the reference genome. In some implementations, the method further includes, before (iii), adjusting the number of sequence tags aligning to each bin to account for bin-to-bin variation due to factors other than copy number variation.
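Steps (i) to (iii) can be sketched as follows. The fixed bin size, the `(chromosome, position)` tag representation, and the per-bin `correction` factors are assumptions made for illustration; the disclosure does not prescribe this particular adjustment.

```python
from collections import Counter


def bin_counts(tag_positions, bin_size):
    """Steps (i)-(ii): count sequence tags per fixed-size bin.

    tag_positions: iterable of (chromosome, position) pairs.
    Returns a Counter keyed by (chromosome, bin_index).
    """
    counts = Counter()
    for chrom, pos in tag_positions:
        counts[(chrom, pos // bin_size)] += 1
    return counts


def region_coverage(counts, chrom, start_bin, end_bin, correction=None):
    """Step (iii): sum (optionally adjusted) bin counts over a region.

    correction: hypothetical mapping {(chromosome, bin_index): factor};
    each count is divided by its factor to remove bin-to-bin variation
    unrelated to copy number. Bins without a factor are left unadjusted.
    """
    total = 0.0
    for b in range(start_bin, end_bin):
        n = counts.get((chrom, b), 0)
        factor = correction.get((chrom, b), 1.0) if correction else 1.0
        total += n / factor
    return total


tags = [("chr21", 50), ("chr21", 150), ("chr21", 160), ("chr1", 10)]
counts = bin_counts(tags, bin_size=100)
raw = region_coverage(counts, "chr21", 0, 2)
adjusted = region_coverage(counts, "chr21", 0, 2,
                           correction={("chr21", 1): 2.0})
```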

In some implementations, the fetal fraction value of the test sample is determined from the sizes of the cell-free nucleic acid fragments. In some implementations, the fetal fraction value of the test sample is determined by: obtaining a frequency distribution of the sizes of the cell-free nucleic acid fragments; and applying the frequency distribution to a model relating fetal fraction to fragment-size frequencies, to obtain the fetal fraction value.
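As a rough illustration of the two steps above, the sketch below builds a normalized fragment-size frequency distribution and feeds it to a linear model. The linear form, the size bins, and the coefficient values are all hypothetical placeholders; the disclosure says only that a model relates fetal fraction to fragment-size frequencies.

```python
from collections import Counter


def size_frequencies(fragment_sizes, size_bins):
    """Return the fraction of fragments falling in each [low, high) size bin.

    Assumes fragment_sizes is non-empty.
    """
    counts = Counter()
    for size in fragment_sizes:
        for low, high in size_bins:
            if low <= size < high:
                counts[(low, high)] += 1
                break
    n = len(fragment_sizes)
    return [counts[b] / n for b in size_bins]


def predict_fetal_fraction(freqs, coefficients, intercept):
    """Hypothetical linear model: intercept + sum(coefficient * frequency)."""
    return intercept + sum(c * f for c, f in zip(coefficients, freqs))


sizes = [120, 140, 160, 180]               # fragment lengths in bp
bins = [(100, 150), (150, 200)]            # illustrative size bins
freqs = size_frequencies(sizes, bins)
ff = predict_fetal_fraction(freqs, coefficients=[0.2, 0.0], intercept=0.01)
```

In practice the coefficients would be fitted on training samples with known fetal fractions.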

In some implementations, the fetal fraction value of the test sample is determined from coverage information for bins of the reference genome. In some implementations, the fetal fraction value is calculated by applying coverage values of a plurality of bins of the reference genome to a model relating fetal fraction to bin coverage, to obtain the fetal fraction value. In some implementations, the plurality of bins of the reference genome have higher fractions of fetal cell-free nucleic acid fragments than other bins.

In some implementations, the fetal fraction value of the test sample is determined from coverage information for bins of a sex chromosome.

Another aspect of this disclosure relates to a computer system comprising one or more processors and system memory. The one or more processors are configured to perform any of the methods described above.

An additional aspect of this disclosure relates to a computer program product comprising one or more computer-readable non-transitory storage media having stored thereon computer-executable instructions that, when executed by one or more processors of a computer system, cause the computer system to implement any of the methods described above.

Incorporation by Reference

All patents, patent applications, and other publications mentioned herein, including all sequences disclosed in these references, are expressly incorporated herein by reference to the same extent as if each individual publication, patent, or patent application were specifically and individually indicated to be incorporated by reference. The relevant portions of all cited documents are incorporated herein by reference in their entirety for the purposes indicated by the context of their citation. However, the citation of any document is not to be construed as an admission that it is prior art with respect to this disclosure.

Brief Description of the Drawings

Figure 1 schematically shows the minimum fetal fraction required to ensure detection as a function of coverage.

Figure 2 shows a two-step coverage threshold used to exclude samples.

Figure 3 shows fetal fraction distributions of three populations or samples thereof.

Figure 4A shows a workflow 200 for determining CNV using a limit of detection (LOD) QC method according to some implementations.

Figure 4B shows another LOD QC process for CNV detection.

Figure 5 shows empirical and hypothetical data illustrating the statistical concepts underlying LOD.

Figure 6 shows detection probability as a function of fetal fraction.

Figure 7 illustrates that an estimated fetal fraction includes error that causes it to deviate from the true fetal fraction.

Figure 8 shows fetal fraction error due to coverage and due to fetal fraction value.

Figure 9 shows simulated true fetal fractions for eight levels of observed fetal fraction.

Figure 10 shows distributions of true fetal fraction versus observed fetal fraction for three different levels of error or coverage.

Figure 11 shows an observed fetal fraction of 2% and its simulated true fetal fraction distributions given different errors or coverages.

Figure 12 shows the 20th percentile (left) and 25th percentile (right) of the true fetal fraction distribution as a function of observed fetal fraction.

Figure 13 shows LOD, coverage, and observed fetal fraction.

Figure 14 includes two fetal fraction LOD curves that can be obtained from data similar to those shown in Figure 13.

Figure 15 shows a method for determining the presence of a copy number variation according to some embodiments.

Figure 16A shows a flowchart of a three-pass process for evaluating copy number.

Figure 16B shows an exemplary process 800 for determining fetal fraction from coverage information according to some implementations of this disclosure.

Figure 16C shows a process for determining fetal fraction from size distribution information according to some implementations.

Figure 16D shows an exemplary process 1000 for determining fetal fraction from 8-mer frequency information according to some implementations of this disclosure.

Figure 17 shows one implementation of a dispersed system for producing a call or diagnosis from a test sample.

Figure 18 shows options for performing various operations at different locations.

Figure 19 shows chromosome Y coverage (left) and fetal fraction (FF) estimates (right) of synthetically generated samples as a function of dilution fraction.

Figure 20 shows the results of a linear fit of LOD versus coverage.

Figure 21 shows predicted LOD overlaid on observed LOD.

Figure 22 shows the probability density function of NES.

Figure 23 shows NES coverage as a function of observed fetal fraction.

Figure 24 shows data exclusion using an exclusion region defined by an LOD curve and a read threshold.

Figure 25 shows pass and fail rates for the first and second runs of the existing method and the LOD QC method.

Figure 26 shows data excluded by the two-step threshold method and rescued by the fetal fraction LOD curve method.

Figure 27 shows samples rescued by the LOD QC method.

Figure 28 shows samples excluded by the LOD QC method.

Figure 29 shows pass and fail rates for two runs of the existing method and the LOD QC method.

Figure 30 shows samples rescued by the 75% confidence LOD curve.

Figure 31 shows the sensitivity for detecting T21 samples using methods of some implementations.

Detailed Description

Definitions

As used herein, the term "about" in reference to a numerical value means ±10%.

The term "consisting of" means "including and limited to".

The term "consisting essentially of" means that a composition, method, or structure may include additional ingredients, steps, and/or parts, but only if the additional ingredients, steps, and/or parts do not materially alter the basic and novel characteristics of the claimed composition, method, or structure.

Unless otherwise indicated, practice of the methods and systems disclosed herein involves conventional techniques and apparatus commonly used in molecular biology, microbiology, protein purification, protein engineering, protein and DNA sequencing, and recombinant DNA, which are within the skill of the art. Such techniques and apparatus are known to those skilled in the art and are described in numerous texts and reference works (see, e.g., Sambrook et al., "Molecular Cloning: A Laboratory Manual," Third Edition (Cold Spring Harbor), [2001]; and Ausubel et al., "Current Protocols in Molecular Biology" [1987]).

Numeric ranges are inclusive of the numbers defining the range. Every maximum numerical limitation given throughout this specification is intended to include every lower numerical limitation, as if such lower numerical limitations were expressly written herein. Every minimum numerical limitation given throughout this specification includes every higher numerical limitation, as if such higher numerical limitations were expressly written herein. Every numerical range given throughout this specification includes every narrower numerical range that falls within the broader range, as if such narrower numerical ranges were all expressly written herein.

Unless defined otherwise herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. Various scientific dictionaries that include the terms used herein are well known and available to those skilled in the art. Although any methods and materials similar or equivalent to those described herein may be used in the practice or testing of the embodiments disclosed herein, some methods and materials are described herein.

The terms defined immediately below are more fully described by reference to this specification as a whole. It is to be understood that this disclosure is not limited to the particular methodology, protocols, and reagents described, as these may vary depending on the context in which they are used by those skilled in the art. As used herein, the singular terms "a," "an," and "the" include plural references unless the context clearly indicates otherwise.

Unless otherwise indicated, nucleic acids are written left to right in 5' to 3' orientation, and amino acid sequences are written left to right in amino to carboxy orientation, respectively.

The limit of detection (LOD) is the lowest level of a signal (e.g., an analyte, fetal fraction, a score indicating a condition, etc.) that can be detected with a defined confidence. In this application, the LOD is the lowest level of fetal fraction (or other analyte) required to detect an aneuploidy/CNV with a defined confidence.

The term "parameter" as used herein denotes a physical feature whose value or other characteristic affects a relevant condition, such as copy number variation. In some cases, the term parameter refers to a variable that affects the output of a mathematical relation or model; the variable may be an independent variable (i.e., an input to the model) or an intermediate variable based on one or more independent variables. Depending on the scope of a model, the output of one model may become an input of another model, and thereby a parameter of the other model.

The term "fragment size parameter" refers to a parameter relating to the size or length of a fragment or a collection of fragments, such as nucleic acid fragments; for example, cfDNA fragments obtained from a bodily fluid. As used herein, a parameter is "biased toward a fragment size or size range" when: 1) the parameter is favorably weighted for the fragment size or size range, e.g., a count is weighted more heavily when associated with fragments of that size or size range than with fragments of other sizes or ranges; or 2) the parameter is obtained from a value that is favorably weighted for the fragment size or size range, e.g., a ratio obtained from a count that is weighted more heavily when associated with fragments of that size or size range. A fragment size or size range may be a characteristic of a genome, or a portion thereof, when the genome produces nucleic acid fragments that are enriched in, or have a higher concentration of, that size or size range relative to nucleic acid fragments from another genome or another portion of the same genome.

The term "weighting" refers to modifying a quantity, such as a parameter or variable, using one or more values or functions that are considered the "weight." In certain embodiments, the parameter or variable is multiplied by the weight. In other embodiments, the parameter or variable is modified exponentially. In some embodiments, the function may be a linear or non-linear function. Examples of applicable non-linear functions include, but are not limited to, Heaviside step functions, boxcar functions, staircase functions, and sigmoid functions. Weighting an original parameter or variable may systematically increase or decrease the value of the weighted variable. In various embodiments, weighting may produce positive, non-negative, or negative values.
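Three of the non-linear weight functions named above can be written out directly. The sketch below applies them to fragment-size counts; the size cut-offs (80 bp, 150 bp) and the sigmoid steepness are arbitrary illustrative values, not parameters taken from this disclosure.

```python
import math


def heaviside(x, threshold):
    """Step weight: 1.0 at or above the threshold, else 0.0."""
    return 1.0 if x >= threshold else 0.0


def boxcar(x, low, high):
    """Window weight: 1.0 inside [low, high], else 0.0."""
    return 1.0 if low <= x <= high else 0.0


def sigmoid(x, midpoint, steepness):
    """Smooth weight rising from 0 to 1 around the midpoint."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


def weighted_count(fragment_sizes, weight):
    """Sum of per-fragment weights: a count biased toward favored sizes."""
    return sum(weight(size) for size in fragment_sizes)


sizes = [100, 140, 150, 170, 200]   # fragment lengths in bp
short_count = weighted_count(sizes, lambda s: boxcar(s, 80, 150))
long_count = weighted_count(sizes, lambda s: heaviside(s, 151))
```

A count weighted this way is one example of a parameter that is biased toward a fragment size range.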

The term "copy number variation" herein refers to variation in the number of copies of a nucleic acid sequence present in a test sample compared with the copy number of the nucleic acid sequence present in a reference sample. In certain embodiments, the nucleic acid sequence is 1 kb or larger. In some cases, the nucleic acid sequence is a whole chromosome or a significant portion thereof. A "copy number variant" refers to a nucleic acid sequence in which a copy number difference is found by comparing the level of the nucleic acid sequence of interest in a test sample with an expected level of that sequence. For example, the level of the nucleic acid sequence of interest in the test sample is compared with the level present in a qualified sample. Copy number variants/variations include deletions (including microdeletions), insertions (including microinsertions), duplications, multiplications, and translocations. CNVs encompass chromosomal aneuploidies and partial aneuploidies.

The term "aneuploidy" herein refers to an imbalance of genetic material caused by the loss or gain of a whole chromosome or part of a chromosome.

The terms "chromosomal aneuploidy" and "complete chromosomal aneuploidy" herein refer to an imbalance of genetic material caused by the loss or gain of a whole chromosome, and include germline aneuploidy and mosaic aneuploidy.

The terms "partial aneuploidy" and "partial chromosomal aneuploidy" herein refer to an imbalance of genetic material caused by the loss or gain of part of a chromosome (e.g., partial monosomy and partial trisomy), and encompass imbalances resulting from translocations, deletions, and insertions.

The term "plurality" refers to more than one element. For example, the term is used herein in reference to a number of nucleic acid molecules or sequence tags sufficient to identify significant differences in copy number variation between test samples and qualified samples using the methods disclosed herein. In some embodiments, at least about 3 × 10⁶ sequence tags of between about 20 bp and 40 bp are obtained for each test sample. In some embodiments, each test sample provides data for at least about 5 × 10⁶, 8 × 10⁶, 10 × 10⁶, 15 × 10⁶, 20 × 10⁶, 30 × 10⁶, 40 × 10⁶, or 50 × 10⁶ sequence tags, each sequence tag being between about 20 bp and 40 bp long.

The term "paired end reads" refers to reads from paired-end sequencing, which obtains one read from each end of a nucleic acid fragment. Paired-end sequencing may involve fragmenting strands of polynucleotides into short sequences called inserts. Fragmentation is optional or unnecessary for relatively short polynucleotides such as cell-free DNA molecules.

The terms "polynucleotide," "nucleic acid," and "nucleic acid molecule" are used interchangeably and refer to a covalently linked sequence of nucleotides (i.e., ribonucleotides for RNA and deoxyribonucleotides for DNA) in which the 3' position of the pentose of one nucleotide is joined by a phosphodiester group to the 5' position of the pentose of the next. The nucleotides include sequences of any form of nucleic acid, including, but not limited to, RNA and DNA molecules such as cfDNA molecules. The term "polynucleotide" includes, without limitation, single-stranded and double-stranded polynucleotides.

The term "test sample" herein refers to a sample, typically derived from a biological fluid, cell, tissue, organ, or organism, that comprises a nucleic acid or a mixture of nucleic acids containing at least one nucleic acid sequence to be screened for copy number variation. In certain embodiments, the sample comprises at least one nucleic acid sequence whose copy number is suspected of having undergone variation. Such samples include, but are not limited to, sputum/oral fluid, amniotic fluid, blood, a blood fraction, biopsy samples (e.g., surgical biopsy, fine needle biopsy, etc.), urine, peritoneal fluid, pleural fluid, and the like. Although the sample is often taken from a human subject (e.g., a patient), the assays can be used for copy number variation (CNV) in samples from any mammal, including, but not limited to, dogs, cats, horses, goats, sheep, cattle, pigs, etc. The sample may be used directly as obtained from the biological source or following a pretreatment that modifies the character of the sample. For example, such pretreatment may include preparing plasma from blood, diluting viscous fluids, and so forth. Methods of pretreatment may also involve, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, addition of reagents, lysing, etc. If such pretreatment methods are applied to a sample, they are typically such that the nucleic acid(s) of interest remain in the test sample, sometimes at a concentration proportional to that in an untreated test sample (i.e., a sample not subjected to any such pretreatment method). For the purposes of the methods described herein, such "treated" or "processed" samples are still considered biological "test" samples.

The term "qualified sample" or "unaffected sample" herein refers to a sample comprising a mixture of nucleic acids that are present in a known copy number, against which the nucleic acids in a test sample are compared. It is a sample that is normal, i.e., not aneuploid, for the nucleic acid sequence of interest. In some embodiments, qualified samples are used as the unaffected training samples of a training set to derive sequence masks or sequence profiles. In certain embodiments, qualified samples are used for identifying one or more normalizing chromosomes or segments for a chromosome under consideration. For example, qualified samples may be used for identifying a normalizing chromosome for chromosome 21. In such a case, the qualified sample is a sample that is not a trisomy 21 sample. Another example involves using only samples from females as qualified samples for chromosome X. Qualified samples may also be employed for other purposes, such as determining thresholds for calling affected samples, identifying thresholds for defining masked regions on a reference sequence, determining expected coverage metrics for different regions of a genome, and the like.

The term "training set" herein refers to a set of training samples that can comprise affected and/or unaffected samples and that are used to develop a model for analyzing test samples. In some embodiments, the training set includes unaffected samples. In these embodiments, thresholds for determining CNVs are established using a training set of samples that are unaffected by the copy number variation of interest. The unaffected samples in a training set may be used as the qualified samples to identify normalizing sequences (e.g., normalizing chromosomes), and the chromosome doses of the unaffected samples are used to set a threshold for each of the sequences of interest (e.g., chromosomes of interest). In some embodiments, the training set includes affected samples. The affected samples in a training set can be used to verify that affected test samples can be readily differentiated from unaffected samples.

A training set is also a statistical sample of a population of interest; a statistical sample is not to be confused with a biological sample. A statistical sample often comprises multiple individuals whose data are used to determine one or more quantitative values of interest that can be generalized to the population. The statistical sample is a subset of the individuals in the population of interest. The individuals may be persons, animals, tissues, cells, other biological samples (i.e., a statistical sample may include multiple biological samples), and other individual entities that provide data points for statistical analysis.

Usually, a training set is used in conjunction with a validation set. The term "validation set" refers to a set of individuals in a statistical sample whose data are used to validate or evaluate the quantitative values of interest determined using the training set. In some embodiments, for instance, a training set provides data for calculating a mask for a reference sequence, while a validation set provides data for evaluating the validity or effectiveness of the mask.

As used herein, "copy number assessment" refers to the statistical evaluation of the status of a genetic sequence in relation to the copy number of that sequence. For example, in some embodiments, the assessment comprises determining the presence or absence of a genetic sequence. In some embodiments, the assessment comprises determining partial or complete aneuploidy of a genetic sequence. In other embodiments, the assessment comprises discriminating between two or more samples based on the copy number of a genetic sequence. In some embodiments, the assessment comprises statistical analyses, e.g., normalization and comparison, based on the copy number of the genetic sequence.

The term "qualified nucleic acid" is used interchangeably with "qualified sequence," which is a sequence against which the amount of a sequence or nucleic acid of interest is compared. A qualified sequence is one that is present in a biological sample, preferably at a known representation (i.e., the amount of the qualified sequence is known). Generally, a qualified sequence is the sequence present in a "qualified sample." A "qualified sequence of interest" is a qualified sequence whose amount is known in a qualified sample, and is a sequence associated with a difference in the sequence of interest between a control subject and an individual with a medical condition.

The term "sequence of interest" or "nucleic acid sequence of interest" herein refers to a nucleic acid sequence that is associated with a difference in sequence representation between healthy and diseased individuals. A sequence of interest can be a sequence on a chromosome that is misrepresented (i.e., over-represented or under-represented) in a disease or genetic condition. A sequence of interest may be a portion of a chromosome (i.e., a chromosome segment) or a whole chromosome. For example, a sequence of interest can be a chromosome that is over-represented in an aneuploidy condition, or a gene encoding a tumor suppressor that is under-represented in a cancer. Sequences of interest include sequences that are over-represented or under-represented in the total population, or a subpopulation, of cells of a subject. A "qualified sequence of interest" is a sequence of interest in a qualified sample. A "test sequence of interest" is a sequence of interest in a test sample.

The term "normalizing sequence" herein refers to a sequence that is used to normalize the number of sequence tags mapped to a sequence of interest associated with the normalizing sequence. In some embodiments, a normalizing sequence comprises a robust chromosome. A "robust chromosome" is one that is unlikely to be aneuploid. In some cases involving the human chromosomes, a robust chromosome is any chromosome other than the X chromosome, the Y chromosome, chromosome 13, chromosome 18, and chromosome 21. In some embodiments, the normalizing sequence displays a variability, among samples and sequencing runs, in the number of sequence tags mapped to it that approximates the variability of the sequence of interest for which it is used as a normalizing parameter. The normalizing sequence can differentiate an affected sample from one or more unaffected samples. In some implementations, the normalizing sequence best or effectively differentiates an affected sample from one or more unaffected samples when compared with other potential normalizing sequences, such as other chromosomes. In some embodiments, the variability of the normalizing sequence is calculated as the variability in the chromosome dose for the sequence of interest across samples and sequencing runs. In some embodiments, normalizing sequences are identified in a set of unaffected samples.

A "normalizing chromosome," "normalizing denominator chromosome," or "normalizing chromosome sequence" is an example of a "normalizing sequence." A "normalizing chromosome sequence" can be composed of a single chromosome or of a group of chromosomes. In some embodiments, a normalizing sequence comprises two or more robust chromosomes. In certain embodiments, the robust chromosomes are all chromosomes other than the X chromosome, the Y chromosome, chromosome 13, chromosome 18, and chromosome 21. A "normalizing segment" is another example of a "normalizing sequence." A "normalizing segment sequence" can be composed of a single segment of a chromosome, or of two or more segments of the same or of different chromosomes. In certain embodiments, a normalizing sequence is intended to normalize for variability such as process-related variability, interchromosomal (intra-run) variability, and inter-sequencing (inter-run) variability.

The term "differentiability" herein refers to a characteristic of a normalizing chromosome that enables one to distinguish one or more unaffected (i.e., normal) samples from one or more affected (i.e., aneuploid) samples. A normalizing chromosome displaying the greatest "differentiability" is a chromosome or group of chromosomes that provides the greatest statistical difference between the distribution of chromosome doses for a chromosome of interest in a set of qualified samples and the chromosome dose for the same chromosome of interest in the corresponding chromosome of one or more affected samples.

The term "variability" herein refers to another characteristic of a normalizing chromosome that enables one to distinguish one or more unaffected (i.e., normal) samples from one or more affected (i.e., aneuploid) samples. The variability of a normalizing chromosome, measured in a set of qualified samples, refers to the variability in the number of sequence tags mapped to it, which approximates the variability in the number of sequence tags mapped to the chromosome of interest for which it serves as a normalizing parameter.

The term "sequence tag density" herein refers to the number of sequence reads that are mapped to a reference genome sequence. For example, the sequence tag density for chromosome 21 is the number of sequence reads generated by the sequencing method that are mapped to chromosome 21 of the reference genome.

The term "sequence tag density ratio" herein refers to the ratio of the number of sequence tags that are mapped to a chromosome of the reference genome (e.g., chromosome 21) to the length of that reference genome chromosome.
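The ratio defined above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and the tag count and chromosome length used below are hypothetical and are not taken from this disclosure.

```python
def sequence_tag_density_ratio(tag_count: int, chromosome_length_bp: int) -> float:
    """Tags mapped to a chromosome divided by that chromosome's reference length."""
    if chromosome_length_bp <= 0:
        raise ValueError("chromosome length must be positive")
    return tag_count / chromosome_length_bp


# Hypothetical values chosen for easy arithmetic, not real chromosome 21 data.
ratio_chr21 = sequence_tag_density_ratio(tag_count=120_000,
                                         chromosome_length_bp=48_000_000)
```

Dividing by length makes counts from chromosomes of different sizes comparable within one sample.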

The term "sequence dose" herein refers to a parameter that relates the number of sequence tags, or another parameter, identified for a sequence of interest to the number of sequence tags, or the other parameter, identified for the normalizing sequence. In some cases, the sequence dose is the ratio of the sequence tag coverage or other parameter for a sequence of interest to the sequence tag coverage or other parameter for a normalizing sequence. In some cases, the sequence dose refers to a parameter that relates the sequence tag density of a sequence of interest to the sequence tag density of a normalizing sequence. A "test sequence dose" is a parameter that relates the sequence tag density or other parameter of a sequence of interest (e.g., chromosome 21) to that of a normalizing sequence (e.g., chromosome 9), as determined in a test sample. Similarly, a "qualified sequence dose" is a parameter that relates the sequence tag density or other parameter of a sequence of interest to that of a normalizing sequence, as determined in a qualified sample.
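For the case in which the sequence dose is a simple ratio, a minimal Python sketch follows. The function name and the tag counts are hypothetical; chromosome 9 serves as the sequence used for normalization only because the text uses it as an example.

```python
def sequence_dose(tags_of_interest: float, tags_normalizing: float) -> float:
    """Ratio of the tag count (or another coverage parameter) for the sequence
    of interest to the same parameter for the sequence used for normalization."""
    if tags_normalizing <= 0:
        raise ValueError("normalizing count must be positive")
    return tags_of_interest / tags_normalizing


# Hypothetical counts: chromosome 21 as sequence of interest, chromosome 9 as denominator.
test_sequence_dose = sequence_dose(tags_of_interest=15_000, tags_normalizing=60_000)
```

The same function yields a "qualified sequence dose" when the counts come from a qualified sample instead of a test sample.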

The term "coverage" refers to the abundance of sequence tags mapped to a defined sequence. Coverage can be expressed quantitatively by sequence tag density (or count of sequence tags), sequence tag density ratio, normalized coverage metric, adjusted coverage value, etc.

The term "coverage metric" refers to a modification of raw coverage, and often represents the relative quantity of sequence tags (sometimes called counts) in a region of a genome, such as a bin. A coverage metric may be obtained by normalizing, adjusting, and/or correcting the raw coverage or count for a region of the genome. For example, a normalized coverage metric for a region may be obtained by dividing the sequence tag count mapped to the region by the total number of sequence tags mapped to the entire genome. Normalized coverage metrics allow comparison of the coverage of a bin across different samples, which may have different sequencing depths. A normalized coverage metric differs from a sequence dose in that the latter is typically obtained by dividing by the tag count mapped to a subset of the entire genome, the subset being one or more normalizing segments or chromosomes. Whether normalized or not, coverage metrics may be corrected for global profile variation from region to region on the genome, G-C fraction variation, outliers in robust chromosomes, etc.
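The normalization described above (region count divided by the genome-wide tag total) can be illustrated with a short Python sketch. The region names and counts are hypothetical, and the "genome" here is assumed to consist only of the three listed regions.

```python
def normalized_coverage(region_counts: dict[str, int]) -> dict[str, float]:
    """Divide each region's tag count by the total tag count over all regions."""
    total = sum(region_counts.values())
    if total == 0:
        raise ValueError("no tags mapped")
    return {region: count / total for region, count in region_counts.items()}


# Two hypothetical samples with the same profile but a tenfold difference in depth.
sample_a = {"bin1": 10, "bin2": 30, "bin3": 60}
sample_b = {"bin1": 100, "bin2": 300, "bin3": 600}

norm_a = normalized_coverage(sample_a)
norm_b = normalized_coverage(sample_b)
```

After normalization the two samples have the same per-region values, which is what allows coverage to be compared across samples sequenced to different depths.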

The term "next-generation sequencing (NGS)" herein refers to sequencing methods that allow massively parallel sequencing of clonally amplified molecules and of single nucleic acid molecules. Non-limiting examples of NGS include sequencing-by-ligation and sequencing-by-synthesis using reversible dye terminators.

The term "parameter" herein refers to a numerical value that characterizes a property of a system. Frequently, a parameter numerically characterizes a quantitative data set and/or a numerical relationship between quantitative data sets. For example, the ratio (or a function of the ratio) between the number of sequence tags mapped to a chromosome and the length of the chromosome to which the tags are mapped is a parameter.

The terms "threshold" and "qualified threshold" herein refer to any number that is used as a cutoff to characterize a sample, such as a test sample containing a nucleic acid from an organism suspected of having a medical condition. The threshold may be compared to a parameter value to determine whether a sample giving rise to that parameter value suggests that the organism has the medical condition. In certain embodiments, a qualified threshold is calculated using a qualifying data set and serves as a limit of diagnosis of a copy number variation (e.g., an aneuploidy) in an organism. If a threshold is exceeded by the results obtained from the methods disclosed herein, a subject can be diagnosed with a copy number variation, e.g., trisomy 21. Appropriate thresholds for the methods described herein can be identified by analyzing normalized values (e.g., chromosome doses, NCVs, or NSVs) calculated for a training set of samples. Thresholds can be identified using the qualified (i.e., unaffected) samples in a training set that comprises both qualified (i.e., unaffected) samples and affected samples. The samples in the training set known to have chromosomal aneuploidies (i.e., the affected samples) can be used to confirm that the chosen thresholds are useful in differentiating affected from unaffected samples in a test set (see the Examples herein). The choice of a threshold depends on the level of confidence that the user wishes to have in making the classification. In some embodiments, the training set used to identify appropriate thresholds comprises at least 10, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, at least 1000, at least 2000, at least 3000, at least 4000, or more qualified samples. It may be advantageous to use larger sets of qualified samples to improve the diagnostic utility of the threshold values.

The term "bin" refers to a segment of a sequence or a segment of a genome. In some embodiments, bins are contiguous with one another within the genome or chromosome. Each bin may define a sequence of nucleotides in a reference genome. The size of a bin may be 1 kb, 100 kb, 1 Mb, etc., depending on the analysis required by the particular application and on sequence tag density. In addition to their positions within a reference sequence, bins may have other characteristics, such as sample coverage and sequence structure characteristics (e.g., G-C fraction).
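As an illustration of contiguous, fixed-size segments (bins), the following Python sketch assigns mapped tag positions on one chromosome to bins by integer division. The 0-based coordinates, the 100 kb bin size, and the positions are assumptions made for the example.

```python
from collections import Counter


def count_tags_per_bin(tag_positions: list[int], bin_size_bp: int) -> Counter:
    """Count tags per contiguous fixed-width bin; bin k covers
    positions [k * bin_size_bp, (k + 1) * bin_size_bp)."""
    if bin_size_bp <= 0:
        raise ValueError("bin size must be positive")
    return Counter(position // bin_size_bp for position in tag_positions)


# Hypothetical 0-based tag start positions on a single chromosome.
positions = [5, 99_999, 100_000, 250_000, 260_000]
bin_counts = count_tags_per_bin(positions, bin_size_bp=100_000)
```

The per-bin counts are the raw coverage values to which bin-level characteristics such as G-C fraction could then be attached.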

The term "masking threshold" is used herein to refer to a quantity against which a value based on the number of sequence tags in a sequence bin is compared, wherein a bin having a value exceeding the masking threshold is masked. In some embodiments, the masking threshold can be a percentile rank, an absolute count, a mapping quality score, or another suitable value. In some embodiments, a masking threshold may be defined as the percentile rank of a coefficient of variation across multiple unaffected samples. In other embodiments, a masking threshold may be defined as a mapping quality score (e.g., a MapQ score), which relates to the reliability of aligning sequence reads to a reference genome. Note that a masking threshold is different from a copy number variation (CNV) threshold, the latter being a cutoff to characterize a sample containing a nucleic acid from an organism suspected of having a medical condition related to CNV. In some embodiments, a CNV threshold is defined relative to a normalized chromosome value (NCV) or a normalized segment value (NSV) described elsewhere herein.
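One of the options named above, a masking threshold defined as a percentile rank of the coefficient of variation (CV) across unaffected samples, might be sketched as follows. The nearest-rank percentile rule, the use of the population standard deviation, the function names, and the coverage values are all assumptions for illustration; the disclosure does not prescribe them.

```python
import math
import statistics


def coefficient_of_variation(values: list[float]) -> float:
    """Population standard deviation divided by the mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        return float("inf")
    return statistics.pstdev(values) / mean


def bins_to_mask(coverage_by_bin: dict[str, list[float]], percentile: float) -> set[str]:
    """Mask bins whose CV across unaffected samples exceeds the CV found at
    the given percentile rank (nearest-rank definition, an assumed choice)."""
    cvs = {name: coefficient_of_variation(v) for name, v in coverage_by_bin.items()}
    ordered = sorted(cvs.values())
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    cutoff = ordered[rank - 1]
    return {name for name, cv in cvs.items() if cv > cutoff}


# Hypothetical coverage of four bins in three unaffected samples.
coverage = {
    "bin0": [100, 101, 99],
    "bin1": [100, 100, 100],
    "bin2": [50, 150, 100],   # highly variable bin
    "bin3": [100, 110, 90],
}
masked = bins_to_mask(coverage, percentile=75)
```

With these numbers only the most variable bin lies above the 75th-percentile CV and is masked; a MapQ-based or count-based threshold would replace the CV computation but keep the same compare-and-mask step.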

The term "normalized value" herein refers to a numerical value that relates the number of sequence tags identified for the sequence of interest (e.g., a chromosome or chromosome segment) to the number of sequence tags identified for a normalizing sequence (e.g., a normalizing chromosome or normalizing chromosome segment). For example, a "normalized value" can be a chromosome dose as described elsewhere herein, or it can be an NCV, or it can be an NSV as described elsewhere herein.

The term "read" refers to a sequence obtained from a portion of a nucleic acid sample. Typically, though not necessarily, a read represents a short sequence of contiguous base pairs in the sample. The read may be represented symbolically by the base pair sequence (in A, T, C, or G) of the sample portion. It may be stored in a memory device and processed as appropriate to determine whether it matches a reference sequence or meets other criteria. A read may be obtained directly from a sequencing apparatus or indirectly from stored sequence information concerning the sample. In some cases, a read is a DNA sequence of sufficient length (e.g., at least about 25 bp) that can be used to identify a larger sequence or region, e.g., that can be aligned and specifically assigned to a chromosome, genomic region, or gene.

The term "genomic read" is used to refer to a read of any segment in the entire genome of an individual.

The term "sequence tag" is used herein interchangeably with the term "mapped sequence tag" to refer to a sequence read that has been specifically assigned (i.e., mapped) to a larger sequence (e.g., a reference genome) by alignment. Mapped sequence tags are uniquely mapped to a reference genome, i.e., they are assigned to a single location in the reference genome. Unless otherwise specified, tags that map to the same sequence on a reference sequence are counted once. Tags may be provided as data structures or other assemblages of data. In certain embodiments, a tag contains a read sequence and associated information for that read, such as the location of the sequence in the genome, e.g., the position on a chromosome. In certain embodiments, the location is specified for a positive strand orientation. A tag may be defined to allow a limited amount of mismatch in aligning to a reference genome. In some embodiments, tags that can be mapped to more than one location on a reference genome, i.e., tags that do not map uniquely, may not be included in the analysis.

The term "non-redundant sequence tag" refers to sequence tags that do not map to the same site; in some embodiments, such tags are counted for the purpose of determining normalized chromosome values (NCVs). Sometimes multiple sequence reads are aligned to the same location on a reference genome, yielding redundant or duplicated sequence tags. In some embodiments, duplicate sequence tags that map to the same position are omitted or counted as one "non-redundant sequence tag" for the purpose of determining NCVs. In some embodiments, non-redundant sequence tags aligned to non-excluded sites are counted to yield "non-excluded-site counts" (NES counts) for determining NCVs.
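Collapsing duplicate tags can be sketched in Python by treating each tag's site as a (chromosome, position, strand) tuple and keeping one tag per site. The tuple representation and the example tags are assumptions for illustration.

```python
Tag = tuple[str, int, str]  # (chromosome ID, position, strand), an assumed representation


def non_redundant_tags(tags: list[Tag]) -> set[Tag]:
    """Keep one tag per site, so duplicates at the same site count once."""
    return set(tags)


# Hypothetical mapped tags; the first two map to the same site.
mapped_tags = [
    ("chr21", 1000, "+"),
    ("chr21", 1000, "+"),  # duplicate of the previous tag
    ("chr21", 1000, "-"),  # same position, opposite strand: a different site
    ("chr9", 500, "+"),
]
unique_tags = non_redundant_tags(mapped_tags)
```

Four mapped tags reduce to three non-redundant tags here because orientation is treated as part of the site.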

The term "site" refers to a unique position (i.e., chromosome ID, chromosome position, and orientation) on a reference genome. In some embodiments, a site may provide the position of a residue, a sequence tag, or a segment on a sequence.

"Excluded sites" are sites found in regions of a reference genome that have been excluded for the purpose of counting sequence tags. In some embodiments, excluded sites are found in regions of chromosomes that contain repetitive sequences (e.g., centromeres and telomeres) and in regions of chromosomes that are common to more than one chromosome (e.g., regions present on the Y chromosome that are also present on the X chromosome).

"Non-excluded sites" (NESs) are sites in a reference genome that are not excluded for the purpose of counting sequence tags.

"Non-excluded-site counts" (NES counts) are the numbers of sequence tags that are mapped to NESs on a reference genome. In some embodiments, NES counts are the numbers of non-redundant sequence tags mapped to NESs. In some embodiments, coverage and related parameters, such as normalized coverage metrics, global-profile-removed coverage metrics, and chromosome dose, are based on NES counts. In one example, a chromosome dose is calculated as the ratio of the NES count for a chromosome of interest to the count for a normalizing chromosome.
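A minimal Python sketch of the NES count and the resulting chromosome dose is shown below. The representation of excluded regions as half-open intervals, the function name, and all coordinates are hypothetical.

```python
Tag = tuple[str, int, str]  # (chromosome ID, position, strand), an assumed representation


def nes_count(tags: list[Tag],
              excluded_regions: dict[str, list[tuple[int, int]]],
              chromosome: str) -> int:
    """Count non-redundant tags on `chromosome` that lie outside every
    excluded region, given as half-open [start, end) intervals."""
    sites = {tag for tag in tags if tag[0] == chromosome}  # drop duplicate tags
    regions = excluded_regions.get(chromosome, [])
    return sum(
        1
        for _chrom, position, _strand in sites
        if not any(start <= position < end for start, end in regions)
    )


# Hypothetical tags and one hypothetical excluded region on chr21.
tags = [
    ("chr21", 100, "+"), ("chr21", 100, "+"),   # duplicate pair, counted once
    ("chr21", 5000, "+"),                        # falls inside the excluded region
    ("chr21", 9000, "-"),
    ("chr9", 10, "+"), ("chr9", 20, "+"), ("chr9", 30, "-"), ("chr9", 40, "+"),
]
excluded = {"chr21": [(4000, 6000)]}

nes_chr21 = nes_count(tags, excluded, "chr21")
nes_chr9 = nes_count(tags, excluded, "chr9")
chromosome_dose = nes_chr21 / nes_chr9  # chr9 used as the denominator chromosome
```

Here chromosome 21 contributes two countable tags and chromosome 9 four, giving a dose of 0.5.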

A normalized chromosome value (NCV) relates the coverage of a test sample to the coverage of a set of training/qualified samples. In some embodiments, the NCV is based on chromosome dose. In some embodiments, the NCV relates to the difference between the chromosome dose of a chromosome of interest in a test sample and the mean of the corresponding chromosome dose in a set of qualified samples, and can be calculated as:

$$\mathrm{NCV}_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$

where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated mean and standard deviation, respectively, of the j-th chromosome dose in a set of qualified samples, and $x_{ij}$ is the observed j-th chromosome ratio (dose) for test sample i.
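The NCV described above, the test-sample dose minus the mean qualified dose, divided by the standard deviation of the qualified doses, can be sketched in Python as follows. The use of the sample (n − 1) standard deviation and the dose values are assumptions for illustration.

```python
import statistics


def normalized_chromosome_value(test_dose: float, qualified_doses: list[float]) -> float:
    """(x_ij - mean_j) / sd_j, with mean and SD estimated from qualified samples.
    The sample standard deviation (n - 1 denominator) is an assumed choice."""
    mean = statistics.fmean(qualified_doses)
    sd = statistics.stdev(qualified_doses)
    if sd == 0:
        raise ValueError("qualified doses have zero variance")
    return (test_dose - mean) / sd


# Hypothetical chromosome 21 doses from three qualified samples and one test sample.
qualified_doses_chr21 = [0.24, 0.25, 0.26]
ncv_chr21 = normalized_chromosome_value(0.28, qualified_doses_chr21)
```

With a qualified mean of 0.25 and standard deviation of 0.01, a test dose of 0.28 lies about three standard deviations above the mean; that value would then be compared with a CNV threshold.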

In some embodiments, the NCV can be calculated "on the fly" by relating the chromosome dose of a chromosome of interest in a test sample to the median of the corresponding chromosome dose in multiplexed samples sequenced on the same flow cell, as:

$$\mathrm{NCV}_{ij} = \frac{x_{ij} - M_j}{\hat{\sigma}_j}$$

where $M_j$ is the estimated median of the j-th chromosome dose in a set of multiplexed samples sequenced on the same flow cell; $\hat{\sigma}_j$ is the standard deviation of the j-th chromosome dose in one or more sets of multiplexed samples sequenced on one or more flow cells; and $x_{ij}$ is the observed j-th chromosome dose for test sample i. In this embodiment, test sample i is one of the multiplexed samples sequenced on the same flow cell from which $M_j$ is determined.

For example, for chromosome 21 of interest in test sample A, which is sequenced as one of 64 multiplexed samples on one flow cell, the NCV for chromosome 21 in test sample A is calculated as the dose of chromosome 21 in sample A minus the median of the doses for chromosome 21 determined in the 64 multiplexed samples, divided by the standard deviation of the doses for chromosome 21 determined for the 64 multiplexed samples on flow cell 1 or on additional flow cells.
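The "on the fly" calculation in the example above can be sketched in Python: the median is taken over the samples multiplexed on the test sample's own flow cell, and the standard deviation over that flow cell or, optionally, a larger pool. The function name, the optional argument, the sample standard deviation, and the five dose values (standing in for the 64 samples of the example) are assumptions.

```python
import statistics
from typing import Optional


def ncv_on_the_fly(test_dose: float,
                   flow_cell_doses: list[float],
                   sd_doses: Optional[list[float]] = None) -> float:
    """(x_ij - M_j) / sd_j, where M_j is the median dose on the same flow cell.
    `sd_doses` optionally supplies doses from additional flow cells for the SD."""
    median = statistics.median(flow_cell_doses)
    sd = statistics.stdev(sd_doses if sd_doses is not None else flow_cell_doses)
    if sd == 0:
        raise ValueError("doses have zero variance")
    return (test_dose - median) / sd


# Hypothetical chromosome 21 doses for five samples multiplexed on one flow cell;
# the test sample (dose 0.30) is itself one of them.
flow_cell_1 = [0.24, 0.25, 0.26, 0.25, 0.30]
ncv_sample_a = ncv_on_the_fly(0.30, flow_cell_1)
```

Using the median makes the center of the distribution less sensitive to the test sample's own, possibly elevated, dose, which is included in the flow cell data.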

As used herein, the terms "aligned," "alignment," and "aligning" refer to the process of comparing a read or tag to a reference sequence and thereby determining whether the reference sequence contains the read sequence. If the reference sequence contains the read, the read may be mapped to the reference sequence or, in certain embodiments, to a particular location in the reference sequence. In some cases, alignment simply tells whether or not a read is a member of a particular reference sequence (i.e., whether the read is present in or absent from the reference sequence). For example, the alignment of a read to the reference sequence for human chromosome 13 will tell whether the read is present in the reference sequence for chromosome 13. A tool that provides this information may be called a set membership tester. In some cases, an alignment additionally indicates a location in the reference sequence to which the read or tag maps. For example, if the reference sequence is the whole human genome sequence, an alignment may indicate that a read is present on chromosome 13, and may further indicate that the read is on a particular strand and/or site of chromosome 13.
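The two kinds of answer an alignment can give, membership only versus membership plus location and strand, can be illustrated with a toy Python sketch based on exact substring matching. This is not how production aligners work, and the function names and sequences are hypothetical.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Reverse complement of a DNA sequence written in A, C, G, T."""
    return sequence.translate(_COMPLEMENT)[::-1]


def is_member(read: str, reference: str) -> bool:
    """Set-membership style answer: is the read present on either strand?"""
    return read in reference or reverse_complement(read) in reference


def locate(read: str, reference: str) -> Optional[tuple[int, str]]:
    """Return (0-based position, strand) of the first exact match, or None."""
    position = reference.find(read)
    if position != -1:
        return position, "+"
    position = reference.find(reverse_complement(read))
    if position != -1:
        return position, "-"
    return None


# Toy reference and reads, far shorter than real reads or chromosomes.
reference = "ACGTTGCAAGGCT"
forward_hit = locate("TGCAAG", reference)
reverse_hit = locate("AGCCTT", reference)
miss = locate("GGGGGG", reference)
```

`is_member` answers only the yes/no question, while `locate` additionally reports where and on which strand the read maps; real aligners also tolerate a limited number of mismatches.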

Aligned reads or tags are one or more sequences that are identified as a match, in terms of the order of their nucleic acid molecules, to a known sequence from a reference genome. Alignment can be done manually, although it is typically implemented by a computer algorithm, since it would be impossible to align reads manually within a reasonable time period for implementing the methods disclosed herein. One example of an algorithm for aligning sequences is the Efficient Local Alignment of Nucleotide Data (ELAND) computer program distributed as part of the Illumina Genomics Analysis pipeline. Alternatively, a Bloom filter or similar set membership tester may be employed to align reads to reference genomes. See U.S. Patent Application No. 61/552,374, filed October 27, 2011, which is incorporated herein by reference in its entirety. The matching of a sequence read in aligning can be a 100% sequence match or less than 100% (a non-perfect match).

The term "mapping" as used herein refers to specifically assigning a sequence read to a larger sequence, e.g., a reference genome, by alignment.

As used herein, the terms "reference genome" or "reference sequence" refer to any particular known genome sequence, whether partial or complete, of any organism or virus that may be used to reference identified sequences from a subject. For example, reference genomes for human subjects and many other organisms can be found at the National Center for Biotechnology Information at ncbi.nlm.nih.gov. A "genome" refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences.

In various embodiments, the reference sequence is significantly larger than the reads that are aligned to it. For example, the reference sequence may be at least about 100 times larger, or at least about 1,000 times larger, or at least about 10,000 times larger, or at least about 10⁵ times larger, or at least about 10⁶ times larger, or at least about 10⁷ times larger than the aligned reads.

In one example, the reference sequence is that of a full-length human genome. Such sequences may be referred to as genomic reference sequences. In another example, the reference sequence is limited to a specific human chromosome, such as chromosome 13. In some embodiments, the reference chromosome is a Y chromosome sequence from human genome version hg19. Such sequences may be referred to as chromosome reference sequences. Other examples of reference sequences include genomes of other species, as well as chromosomes, sub-chromosomal regions (such as strands), etc., of any species.

In various embodiments, the reference sequence is a consensus sequence or other combination derived from multiple individuals. In certain applications, however, the reference sequence may be taken from a particular individual.

The term "clinically relevant sequence" herein refers to a nucleic acid sequence that is known or suspected to be associated or implicated with a genetic or disease condition. Determining the presence or absence of a clinically relevant sequence can be useful in determining or confirming the diagnosis of a medical condition, or in providing a prognosis for the development of a disease.

The term "derived," when used in the context of a nucleic acid or a mixture of nucleic acids, herein refers to the means by which the nucleic acids are obtained from the source from which they originate. For example, in one embodiment, a mixture of nucleic acids derived from two different genomes means that the nucleic acids (e.g., cfDNA) were naturally released by cells through naturally occurring processes such as necrosis or apoptosis. In another embodiment, a mixture of nucleic acids derived from two different genomes means that the nucleic acids were extracted from two different types of cells from a subject.

The term "based on," when used in the context of obtaining a specific quantitative value, herein refers to using another quantity as input to calculate the specific quantitative value as output.

The term "patient sample" herein refers to a biological sample obtained from a patient, i.e., a recipient of medical attention, care, or treatment. The patient sample can be any of the samples described herein. In certain embodiments, the patient sample is obtained by non-invasive procedures, e.g., a peripheral blood sample or a stool sample. The methods described herein need not be limited to humans. Thus, various veterinary applications are contemplated, in which case the patient sample may be a sample from a non-human mammal (e.g., a feline, a porcine, an equine, a bovine, and the like).

The term "mixed sample" herein refers to a sample containing a mixture of nucleic acids that are derived from different genomes.

The term "maternal sample" herein refers to a biological sample obtained from a pregnant subject, e.g., a woman.

The term "biological fluid" herein refers to a liquid taken from a biological source and includes, for example, blood, serum, plasma, sputum, lavage fluid, cerebrospinal fluid, urine, semen, sweat, tears, saliva, and the like. As used herein, the terms "blood," "plasma," and "serum" expressly encompass fractions or processed portions thereof. Similarly, where a sample is taken from a biopsy, swab, smear, etc., the "sample" expressly encompasses a processed fraction or portion derived from the biopsy, swab, smear, etc.

The terms "maternal nucleic acids" and "fetal nucleic acids" herein refer to the nucleic acids of a pregnant female subject and the nucleic acids of the fetus being carried by that pregnant female, respectively.

As used herein, the term "corresponding to" sometimes refers to a nucleic acid sequence, e.g., a gene or a chromosome, that is present in the genomes of different subjects and that does not necessarily have the same sequence in all genomes, but that serves to provide the identity, rather than the genetic information, of a sequence of interest, e.g., a gene or a chromosome.

As used herein, the term "fetal fraction" refers to the fraction of fetal nucleic acids present in a sample comprising fetal and maternal nucleic acids. Fetal fraction is often used to characterize the cfDNA in a mother's blood.

As used herein, the term "chromosome" refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin strands comprising DNA and protein components (especially histones). The conventional, internationally recognized individual human genome chromosome numbering system is employed herein.

As used herein, the term "polynucleotide length" refers to the absolute number of nucleotides in a sequence or in a region of a reference genome. The term "chromosome length" refers to the known length of the chromosome given in base pairs, e.g., as provided in the NCBI36/hg18 assembly of the human chromosomes found at |genome|.|ucsc|.|edu/cgi-bin/hgTracks?hgsid=167155613&chromInfoPage= on the World Wide Web.

The term "subject" herein refers to a human subject as well as a non-human subject such as a mammal, an invertebrate, a vertebrate, a fungus, a yeast, a bacterium, or a virus. Although the examples herein concern humans and the language is primarily directed to human concerns, the concepts disclosed herein are applicable to genomes from any plant or animal and are useful in the fields of veterinary medicine, animal sciences, research laboratories, and the like.

The term "condition" herein refers to "medical condition" as a broad term that includes all diseases and disorders, and that may also include injuries and normal health situations, such as pregnancy, that might affect a person's health, benefit from medical assistance, or have implications for medical treatments.

The term "complete," when used in reference to a chromosomal aneuploidy, herein refers to a gain or loss of an entire chromosome.

The term "partial," when used in reference to a chromosomal aneuploidy, herein refers to a gain or loss of a portion, i.e., a segment, of a chromosome.

The term "mosaic" herein denotes the presence of two populations of cells with different karyotypes in one individual who has developed from a single fertilized egg. Mosaicism may result from a mutation during development that is propagated to only a subset of the adult cells.

The term "non-mosaic" herein refers to an organism, e.g., a human fetus, composed of cells of one karyotype.

As used herein, the term "sensitivity" refers to the probability that a test result will be positive when the condition of interest is present. It may be calculated as the number of true positives divided by the sum of true positives and false negatives.

As used herein, the term "specificity" refers to the probability that a test result will be negative when the condition of interest is absent. It may be calculated as the number of true negatives divided by the sum of true negatives and false positives.
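The two definitions above can be written directly as ratios of confusion-matrix counts. The following short sketch does so; the function names and the example counts are hypothetical and are not taken from the disclosure.

```python
def sensitivity(true_positives, false_negatives):
    """Probability of a positive result when the condition is present."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Probability of a negative result when the condition is absent."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts for illustration only.
example_sensitivity = sensitivity(true_positives=95, false_negatives=5)
example_specificity = specificity(true_negatives=990, false_positives=10)
```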

The term "enrich" herein refers to the process of amplifying polymorphic target nucleic acids contained in a portion of a maternal sample, and combining the amplified product with the remainder of the maternal sample from which the portion was removed. For example, the remainder of the maternal sample can be the original maternal sample.

The term "original maternal sample" herein refers to a non-enriched biological sample obtained from a pregnant subject, e.g., a woman, who serves as the source from which a portion is removed to amplify polymorphic target nucleic acids. The "original sample" can be any sample obtained from a pregnant subject, and processed fractions thereof, e.g., a purified cfDNA sample extracted from a maternal plasma sample.

As used herein, the term "primer" refers to an isolated oligonucleotide that is capable of acting as a point of initiation of synthesis when placed under conditions that induce synthesis of an extension product (e.g., conditions that include nucleotides, an inducing agent such as DNA polymerase, and a suitable temperature and pH). The primer is preferably single stranded for maximum efficiency in amplification, but may alternatively be double stranded. If double stranded, the primer is first treated to separate its strands before being used to prepare extension products. Preferably, the primer is an oligodeoxyribonucleotide. The primer must be sufficiently long to prime the synthesis of extension products in the presence of the inducing agent. The exact length of the primer will depend on many factors, including temperature, source of primer, use of the method, and the parameters used for primer design.

Introduction and Background

CNVs in the human genome contribute significantly to human diversity and predisposition to disease (Redon et al., Nature 23:444-454 [2006]; Shaikh et al., Genome Res 19:1682-1690 [2009]). Such diseases include, but are not limited to, cancer, infectious and autoimmune diseases, diseases of the nervous system, and metabolic and/or cardiovascular diseases.

CNVs are known to contribute to genetic disease through different mechanisms, resulting in either imbalance of gene dosage or gene disruption in most cases. In addition to their direct correlation with genetic disorders, CNVs are known to mediate phenotypic changes that can be deleterious. Recently, several studies have reported an increased burden of rare or de novo CNVs in complex disorders such as autism, attention deficit hyperactivity disorder (ADHD), and schizophrenia as compared to normal controls, highlighting the potential pathogenicity of rare or unique CNVs (Sebat et al., 316:445-449 [2007]; Walsh et al., 320:539-543 [2008]). CNVs arise from genomic rearrangements, primarily owing to deletion, duplication, insertion, and unbalanced translocation events.

Studies have shown that cfDNA fragments of fetal origin are, on average, shorter than those of maternal origin. Non-invasive prenatal testing (NIPT) based on NGS data has been successfully implemented. Current methodologies involve sequencing maternal samples using short reads (25 bp to 36 bp), aligning them to the genome, computing and normalizing sub-chromosomal coverage, and finally evaluating over-representation of target chromosomes (13/18/21/X/Y) compared to the expected normalized coverage associated with a normal diploid genome. Thus, traditional NIPT assays and analyses rely on counts or coverage to evaluate the likelihood of fetal aneuploidy.

Since maternal plasma samples represent a mixture of maternal and fetal cfDNA, the success of any given NIPT method depends on its sensitivity for detecting copy number variations in samples with low fetal fraction. For counting-based methods, sensitivity is determined by (a) sequencing depth and (b) the ability of data normalization to reduce technical variance. This disclosure provides analytical methods for NIPT and other applications that derive fragment size information from, e.g., paired-end reads and use that information in the analysis pipeline. Improved analytical sensitivity provides the ability to apply NIPT methods at reduced coverage (e.g., reduced sequencing depth), which enables the technology to be used for lower-cost testing of average-risk pregnancies.

In some embodiments, methods are provided for determining copy number variation (CNV) of a fetus using maternal samples containing maternal and fetal cell-free DNA.

Aspects of this disclosure pertain to methods and systems that can identify sample measurements from which a copy number variation call can be made reliably, and sample measurements from which such a call cannot be made reliably. In other words, the disclosed embodiments can distinguish sample measurements that carry sufficient information for a confident call from those that do not.

In some existing methods for distinguishing reliable from unreliable sample measurements, fetal fraction is used as the metric for deciding whether to exclude a sample from being called. Such methods may run a sequencing routine in a particular manner for NIPT samples. The resulting samples have values of coverage, or numbers of reads, for non-excluded regions of the genome or chromosome. Such methods may also determine a fetal fraction value. If the determined fetal fraction value is below a threshold, e.g., 3%, the sample measurement is excluded. At the bottom of such a plot there is a sample exclusion region that covers sample measurements having a rather low effective read count and/or a low determined fetal fraction value. In some implementations, the exclusion region excludes even relatively high numbers of effective reads when the fetal fraction is particularly low.

As shown in certain figures below, data illustrate various examples of situations in which the determined fetal fraction value may be inaccurate for various reasons. The data show certain trends in fetal fraction error, such as the error increasing as the calling error increases (which may be caused by reduced coverage), and the error increasing as the magnitude of the true fetal fraction value decreases. Because of these problems with using fetal fraction as the mode for determining which samples to exclude, another technique should be used for determining which samples to exclude.

Fetal fraction affects the determination of fetal cell-free DNA abundance and copy number variation in NIPT. As the fetal fraction in a sample decreases, it becomes more difficult to accurately determine the relative coverage of fetal cell-free DNA, because the signal decreases and the relative noise increases. Coverage, or sequencing depth, has a similar effect on CNV detection. Because the two factors jointly affect CNV detection, the minimum fetal fraction required to ensure detection changes as a function of coverage, with a hyperbolic shape as shown in Figure 1.

Existing methods perform quality control by excluding samples with low coverage. For example, some existing methods set two coverage threshold levels depending on the fetal fraction value. In the example shown in Figure 2, the first coverage threshold is set at about 4.5M non-excluded sites (NES), or effective reads, for fetal fractions below about 5% (line 2002), and the second coverage threshold is set at about 2M for fetal fractions above about 5% (line 2006). A line or curve (2004) connects the two threshold levels.
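The static two-level rule described for Figure 2 can be sketched as follows. The function and parameter names are hypothetical, the thresholds are the approximate values quoted above, and the connecting segment (2004) is simplified here to a step at 5% because its actual shape is not specified in the text.

```python
def min_required_reads(
    fetal_fraction,
    ff_cut=0.05,               # about 5% fetal fraction
    low_ff_reads=4_500_000,    # about 4.5M NES (line 2002)
    high_ff_reads=2_000_000,   # about 2M NES (line 2006)
):
    """Coverage threshold under the static rule (step simplification)."""
    return low_ff_reads if fetal_fraction < ff_cut else high_ff_reads


def passes_static_qc(fetal_fraction, effective_reads):
    """True when the sample meets the coverage threshold for its fetal fraction."""
    return effective_reads >= min_required_reads(fetal_fraction)


# Hypothetical samples: (fetal fraction, effective reads).
low_ff_sample_ok = passes_static_qc(0.03, 4_000_000)   # below 4.5M
high_ff_sample_ok = passes_static_qc(0.08, 2_500_000)  # above 2M
```

Because the thresholds are fixed constants, the same rule is applied to every sample, population, and platform, which is the limitation discussed next.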

However, such methods have various limitations. They do not adequately capture the relationship between coverage and fetal fraction in terms of its effect on CNV detection. Moreover, the thresholds used in these existing methods are static: the same thresholds are applied to different samples, populations, and platforms. As shown in Figure 3, different sample populations have different fetal fraction distributions. Figure 3 shows the fetal fraction distributions of three populations, or samples thereof. The left panel shows the probability density functions of the three populations. The right panel shows the cumulative distributions of the three populations. The cumulative distributions clearly show that a fixed fetal fraction threshold of 0.04 excludes different proportions of samples from the three populations.

This disclosure provides methods and systems for quality control (QC) in CNV detection processes, such as those shown in Figures 4A, 4B, 15, and 16A. The QC can be performed before or after a CNV call is made, to identify samples whose fetal fraction level is too low to produce reliable results. The identified samples can be rerun to obtain new reads. If the sequencing depth is increased in the rerun, coverage increases, which increases the signal or decreases the noise of the sample. When applied to different sample populations, the disclosed implementations can account for variation among samples and populations, and they can more effectively identify samples having low fetal fraction and/or low read coverage.

Exemplary Workflow

Unless otherwise specified, the workflows and labeled steps in this disclosure may be performed in an order different from that described in the figures and examples. For example, the operation of determining a fetal fraction value shown in block 204 may be performed before or after the operations shown in 202, 206, 208, and 210. Figure 4A shows a workflow 200 for determining CNV using a limit of detection (LOD) QC method according to some implementations. In this workflow, a quality check is performed at block 212 to check whether the test sample is in an exclusion region defined by at least a fetal fraction LOD curve. In many implementations, the exclusion region is defined by the LOD curve together with a coverage threshold or a fetal fraction threshold. This step and its downstream steps may be applied to the other CNV detection processes described hereinafter. The exemplary workflow involves sequencing a test sample comprising maternal and fetal cell-free nucleic acid fragments to obtain sequence reads. See block 202. Various techniques may be used to obtain the sequence reads, including but not limited to the sequencing techniques described hereinafter.

Process 200 then proceeds to determine a fetal fraction value for the test sample. In some implementations, the sequence reads obtained in 202 are used to determine the fetal fraction. However, other nucleic acids from the same individual may also be used. In some implementations, various techniques may be used to determine the fetal fraction of the test sample, including but not limited to those described hereinafter. Briefly, in some implementations, the fetal fraction value of the test sample is determined based on the sizes of the cell-free nucleic acid fragments. In some implementations, the fetal fraction value is obtained by obtaining a frequency distribution of the sizes of the cell-free nucleic acid fragments and applying that frequency distribution to a model relating fetal fraction to the frequency of fragment sizes. In some implementations, the fetal fraction value is determined based on coverage information for bins of a reference genome, where the reference genome is divided into segments, or bins. In some implementations, the fetal fraction value is calculated by applying coverage values of a plurality of bins of the reference genome to a model relating fetal fraction to bin coverage. In some implementations, bins to which over-represented fetal cell-free nucleic acid fragments align are selected from training samples, and test reads aligned to those bins are used to determine the fetal fraction. In some implementations, the fetal fraction value of the test sample is determined based on coverage information for bins of a sex chromosome, such as the Y chromosome in a maternal sample obtained from a mother carrying a male fetus.

Process 200 also involves receiving sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample. See block 206. The sequence reads of the cell-free nucleic acid fragments are then aligned to a reference genome comprising the sequence of interest, thereby providing sequence tags. See block 208.

Process 200 proceeds to determine the coverage of the sequence tags for the sequence of interest. See block 210. Coverage is determined for each sample. In various implementations, coverage is determined over all chromosomes (the whole reference genome), over a subset of chromosomes, or at the sub-chromosomal level. Various techniques may be used to determine coverage, including but not limited to those described hereinafter. In some implementations, coverage is determined by dividing the reference genome into a plurality of bins, determining the number of sequence tags aligned to each bin, and using the numbers of sequence tags in the bins of the sequence of interest to determine the coverage of the sequence tags. In some implementations, the method further involves adjusting the number of sequence tags aligned to a bin by accounting for bin-to-bin variation arising from factors other than copy number variation. A more detailed description of methods for determining coverage is provided hereinafter.
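The binning step described above can be sketched as follows, assuming fixed-width bins on a single reference sequence and tags represented by their aligned start positions. The function names, the bin width, and the toy positions are hypothetical; the adjustment for bin-to-bin variation mentioned above is not shown.

```python
from collections import Counter


def count_tags_per_bin(tag_positions, bin_size):
    """Assign each aligned tag start position to a fixed-width bin and count."""
    return Counter(position // bin_size for position in tag_positions)


def region_coverage(bin_counts, first_bin, last_bin):
    """Sum the tag counts over the bins spanning a sequence of interest."""
    return sum(bin_counts.get(b, 0) for b in range(first_bin, last_bin + 1))


# Hypothetical aligned tag positions on one toy reference sequence.
tag_positions = [5, 120, 130, 250, 260, 270, 999]
bin_counts = count_tags_per_bin(tag_positions, bin_size=100)

# Coverage of a sequence of interest spanning bins 1 and 2, and the same
# value expressed as a fraction of all tags (one simple normalization).
roi_coverage = region_coverage(bin_counts, first_bin=1, last_bin=2)
roi_fraction = roi_coverage / sum(bin_counts.values())
```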

The next step in process 200 is a QC step. The QC step involves determining, based on the coverage of the sequence tags determined in 208 and the fetal fraction determined in 204, whether the test sample is in an exclusion region. The exclusion region is defined by at least a fetal fraction LOD curve. The fetal fraction LOD curve varies with coverage value and indicates the minimum fetal fraction needed to achieve a detection criterion at a given coverage. See Figure 14 for an example of an LOD curve.
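The exclusion-region test can be sketched as a comparison of the sample's fetal fraction against the LOD curve evaluated at the sample's coverage. In the sketch below, the curve form `0.01 + 30_000 / coverage` and its constants are purely hypothetical stand-ins chosen to have a hyperbolic shape; an actual LOD curve would be derived from training samples. The optional `min_coverage` parameter illustrates combining the curve with a coverage threshold.

```python
def example_lod_curve(coverage):
    """Hypothetical hyperbolic LOD curve: minimum fetal fraction at a coverage."""
    return 0.01 + 30_000 / coverage


def in_exclusion_region(fetal_fraction, coverage, lod_curve, min_coverage=0):
    """True when the sample falls below the LOD curve or the coverage floor."""
    if coverage < min_coverage:
        return True
    return fetal_fraction < lod_curve(coverage)


# The same fetal fraction can pass at high coverage and fail at low coverage.
excluded_at_3m = in_exclusion_region(0.025, 3_000_000, example_lod_curve)
excluded_at_1_5m = in_exclusion_region(0.025, 1_500_000, example_lod_curve)
```

Unlike a fixed fetal fraction cutoff, the required minimum here rises as coverage falls, which is the coverage dependence the LOD curve is meant to capture.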

In some implementations, the fetal fraction LOD curve is obtained using training samples affected by the CNV. The affected samples may include in silico samples obtained by combining sequence reads from different individuals. In some implementations, the affected samples are obtained by combining physical samples having two or more fetal fractions, thereby providing synthetic samples with intermediate fetal fraction values. In some implementations, the fetal fraction LOD curve is obtained using both samples affected and samples unaffected by the CNV. In some implementations, physical or simulated samples may be used to derive the LOD curve.

In some implementations, the detection criterion is a desired confidence level that, given an observed fetal fraction, the ground-truth fetal fraction is greater than a specified LOD. An LOD is the minimum level of signal (analyte, fetal fraction, score indicating a condition, etc.) that can be detected with a defined confidence. In the context of this application, the LOD is the lowest level of fetal fraction (or other analyte) needed to detect an aneuploidy/CNV with a defined confidence. Note that two confidence values are involved: the first is the confidence that the ground-truth fetal fraction is greater than the specified LOD fetal fraction, and the second is the confidence of the LOD itself. In some implementations, the detection criterion is X% confidence that, for the observed fetal fraction, the ground-truth fetal fraction is greater than LOD Y%. The ground-truth fetal fraction is an inferred fetal fraction based on the actual fetal fraction. In some implementations, X is about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99%, or 99.5%. In some implementations, Y is a detection confidence of about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or 99%. In some implementations, X is 80% and Y is 95%. In other words, the detection criterion is 80% confidence that, for the observed fetal fraction, the ground-truth fetal fraction is greater than LOD 95% (or simply LOD95), where LOD95 is the minimum fetal fraction value at which the CNV or aneuploidy can be detected 95% of the time. In other implementations, X is 50% and Y is 95%.
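One way to read the two-confidence criterion is as an empirical proportion: given a set of ground-truth (or simulated) fetal fractions associated with one observed fetal fraction at one coverage, check whether at least X% of them exceed the LOD. The sketch below follows that reading; the function name, the sample values, and the use of a simple empirical proportion are assumptions for illustration, not the disclosed procedure.

```python
def meets_detection_criterion(true_ff_samples, lod, required_confidence=0.80):
    """True when at least required_confidence of the true fetal fractions
    associated with one observed fetal fraction exceed the LOD."""
    above_lod = sum(1 for ff in true_ff_samples if ff > lod)
    return above_lod / len(true_ff_samples) >= required_confidence


# Hypothetical ground-truth fetal fractions for a single observed value.
true_ff_samples = [0.020, 0.024, 0.025, 0.027, 0.030]

passes_with_lod95_023 = meets_detection_criterion(true_ff_samples, lod=0.023)
passes_with_lod95_026 = meets_detection_criterion(true_ff_samples, lod=0.026)
```

Here `required_confidence` plays the role of X (80%), while the `lod` argument is the LOD Y% value (for example LOD95) computed separately.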

In some implementations, the specified LOD is determined as the minimum observed fetal fraction at which Y% of affected samples can be detected. In some implementations, the detection criterion for an observed fetal fraction and an observed coverage is obtained using the distribution of true fetal fractions (or simulated fetal fractions) for that observed fetal fraction at that observed coverage. Figure 5 shows empirical and hypothetical data illustrating the statistical concepts underlying the LOD. CNV detection is based on a log-likelihood ratio (LLR) indicating the likelihood that a sample contains a CNV. The upper panel shows LLR on the Y axis and fetal fraction on the X axis. Multiple samples are measured at each fetal fraction level, and a mean and standard deviation can be obtained. Given the sample data, one can infer the population distribution of LLR at each fetal fraction level. The upper panel shows the cutoff value (labeled 502A) that is applied to call a CNV. As the fetal fraction value increases, the LLR score also increases and moves further from the cutoff, allowing more samples to be detected.

In this example, 2.3% is the lowest fetal fraction at which an aneuploidy/CNV can be detected with 95% confidence (one-sided). The observed LLRs of affected samples at a fetal fraction of 2.3% are shown as 504A in the upper panel and as 504B in the lower panel. The underlying population distribution of the observed data is shown as 506 in the lower panel. The cutoff LLR is shown as 502A in the upper panel and as 502B in the lower panel. The lower panel shows that at a fetal fraction of 2.3%, 5% of the underlying population falls below the LLR cutoff and 95% of the population falls above the LLR cutoff 502B. Therefore, 95% of the samples in the population are detected as having a CNV.

As can be seen in the upper panel of Figure 5, LLR and the detection probability increase as fetal fraction increases. Figure 6 shows the detection probability as a function of fetal fraction. In this example, the detection probability at a fetal fraction of 2.3% is 95%. The fetal fraction LOD95 here is therefore 2.3%.
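The relationship between the LLR cutoff, the detection probability, and LOD95 can be illustrated with a minimal sketch. It assumes, purely for illustration, that the LLR of affected samples at each fetal fraction level is normally distributed; the function names, the cutoff, and the per-level means and standard deviations are hypothetical and are not taken from Figures 5 or 6.

```python
from statistics import NormalDist

def detection_probability(mean_llr, sd_llr, cutoff):
    """P(LLR > cutoff) for affected samples, assuming a normal LLR distribution."""
    return 1.0 - NormalDist(mean_llr, sd_llr).cdf(cutoff)

def lod(levels, cutoff, target=0.95):
    """Smallest fetal fraction whose detection probability reaches the target."""
    for ff, mean_llr, sd_llr in sorted(levels):
        if detection_probability(mean_llr, sd_llr, cutoff) >= target:
            return ff
    return None  # no tested level reaches the target

# Hypothetical (fetal fraction %, mean LLR, SD of LLR) for each tested level.
LEVELS = [(1.0, 0.5, 2.0), (2.3, 5.0, 2.0), (4.0, 12.0, 2.0)]
CUTOFF = 1.5  # hypothetical LLR cutoff

if __name__ == "__main__":
    print(lod(LEVELS, CUTOFF))
```

With these made-up inputs the 2.3% level is the first one at which at least 95% of the assumed LLR distribution lies above the cutoff, which mirrors the reading of LOD95 described in the text.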

Returning to Figure 4A, if the test sample is determined not to be within the exclusion region defined by the fetal fraction LOD curve, the method proceeds to evaluate the coverage of the sequence of interest to determine the CNV. See the “No” branch from block 212 to block 214. The process ends along this path. If the test sample is determined to be within the LOD exclusion region and it is determined that the sample should be resequenced, see the “Yes” branch of block 212 and the “Yes” branch of block 216. The sample is then resequenced, and the process from 202 to 212 is repeated. If it is determined at block 216 that the sample does not need to be resequenced, for example because it has already been resequenced, the process ends.

As shown by the experimental data below, the disclosed QC method can help improve both sensitivity and specificity. If one wishes to improve only sensitivity, one can first determine whether the test sample is called negative or positive. Only samples called negative then go through the QC process, as shown in Figure 4B.

Figure 4B shows an LOD QC process for CNV detection that is similar to the process in Figure 4A, except that a call is made first, before determining whether the sample falls within the exclusion region. See block 211. If the call is negative, the process then determines whether the sample is within the exclusion region defined by the LOD curve. See block 212. If it is not, the process ends. See the “No” branch of block 216. If the test sample is within the exclusion region (the “Yes” branch of block 212) and it is determined that resequencing is needed, the sample is resequenced and the process is repeated. If it is determined that the sample does not need to be resequenced, for example because it has already been resequenced once, the process ends. See the “No” branch of block 216.

LOD and LOD curves

Figure 7 shows that the estimated fetal fraction includes error that causes it to deviate from the true fetal fraction. The solid line shows the observed fetal fraction distribution. The dashed line shows the true fetal fraction distribution. The observed fetal fraction is inferred from other variables rather than measured directly from fetal and maternal DNA fragments. The gray line is the distribution of fetal fraction measured directly from the fetal Y chromosome. Because the gray line is measured directly, it is closer to the true fetal fraction than the observed fetal fraction is.

Figure 8 shows fetal fraction error as a function of coverage and of fetal fraction value. The left panel shows that the standard deviation of the fetal fraction estimate (a measure related to error) decreases as coverage increases. The right panel shows that the standard deviation of the fetal fraction estimate decreases as the true fetal fraction increases.

Figure 9 simulates the true fetal fractions for eight levels of observed fetal fraction. Figure 9 shows observed fetal fractions between 0% and 8% at 1% intervals, together with the ground-truth distributions for these observed fetal fractions. The vertical dashed line 9002 marks an observed fetal fraction of 0%. The solid line 9004 is the distribution of true fetal fractions for the observed fetal fraction of 0%. The dashed line 9006 marks an observed fetal fraction of 8%. The solid line 9008 shows the distribution of true fetal fractions for the observed fetal fraction of 8%. As shown, when the observed fetal fraction is low, such as at line 9002 and 0%, the ground-truth fetal fraction distribution 9004 lies further from the observed fetal fraction than the true fetal fraction distribution at 8% (9008) lies from the observed 8% fetal fraction (9006). In addition, the distributions for lower fetal fractions tend to be more sharply peaked. In other words, as the observed fetal fraction increases, the distribution of true fetal fractions becomes flatter and deviates less from the observed fetal fraction.

Figure 10 shows the distributions of true fetal fraction versus observed fetal fraction for three different levels of error or coverage (note that coverage is inversely related to error). The left panel of Figure 10 is the same as Figure 9 and is simulated with a 1% fetal fraction error. The middle panel shows the fetal fraction distributions simulated with a 1.5% error. The right panel shows the fetal fraction distributions simulated with a 2% error. As can be seen from the figure, the difference between the observed and true fetal fraction is affected by coverage: the larger the error, the larger the difference between the observed and true fetal fraction.

Figure 11 shows an observed fetal fraction of 2% and its simulated true fetal fraction distributions for different errors or coverages. The observed fetal fraction is indicated by line 1102. The true fetal fraction simulated with the smallest error, 1% (or the highest coverage), has distribution 1112. Distribution 1114 is the true fetal fraction simulated with a 1.5% error, or intermediate coverage. Distribution 1116 is the true fetal fraction distribution simulated with the largest error, 2%, or the lowest coverage. The 5th percentiles, that is, the 95% confidence levels, of distributions 1112, 1114, and 1116 are marked by lines 1122, 1124, and 1126, respectively. These distributions show that as error increases or coverage decreases, the true fetal fraction deviates further from the observed fetal fraction, and the 95% confidence level also increases. Figure 11 also shows the 20th percentile (1128) and the 25th percentile (1130) of distribution 1116, which correspond to the 80% and 75% confidence levels, respectively. Other confidence levels, such as a 50% confidence level, can be determined similarly.

Figure 12 shows the 20th percentile (left panel) and the 25th percentile (right panel) of the true fetal fraction distribution as a function of observed fetal fraction. As the observed fetal fraction increases, the 20th percentile of the true fetal fraction increases accordingly. The left panel shows the 80% confidence data, that is, the 20th percentile of the true fetal fraction distribution. The right panel shows the 75% confidence data. In the left panel, the relationship between the 20th-percentile true fetal fraction and the observed fetal fraction is shown as separate lines, each indicating a different coverage level (and corresponding FF error level). The left panel shows ten different coverage levels (1M-10M). The panel thus shows the relationship among three variables: the observed fetal fraction, the true fetal fraction at the 20th percentile (or 80% confidence), and coverage. Looking at the left panel, for example, two patterns are visible at an observed FF of 2% (corresponding to the observed FF in Figure 11): (a) the 20th percentile (or 80% confidence) of the true FF is above 2%, and (b) the 20th percentile (or 80% confidence) of the true FF increases as coverage decreases (or error increases). Both patterns can also be seen in Figure 11.

At the low end of observed fetal fraction, the 20th percentile of the true fetal fraction increases as coverage decreases. At the high end of observed fetal fraction, this relationship is reversed. To compute the LOD curve, the LOD95 for a particular coverage is first determined empirically as some value (a fetal fraction). This LOD95 value is also taken as the value of the desired percentile (e.g., the 20th percentile) of the true fetal fraction. With this value, one can determine the observed fetal fraction from the true-versus-observed function (the lines in the left panel of Figure 12).

For example, the LOD95 for a coverage of 1 million is empirically determined to be 6.50% FF, which is also the 20th percentile of the true fetal fraction at 1202. From the plot, one can determine the observed fetal fraction to be 8% (as shown at 1204). The LOD95 for a coverage of 2 million is empirically determined to be 4.59% FF, which is also the 20th percentile of the true fetal fraction at 1206. From the plot, one can determine the observed fetal fraction to be 4.2% (as shown at 1208). The observed fetal fractions for other coverages can be obtained similarly from the other lines in the plot. These points are the observed fetal fractions required to have 80% confidence that the true fetal fraction exceeds the LOD95 at the different coverages.

Figure 13 shows LOD, coverage, and observed fetal fraction. The table shows the effective read count in the first column from the left, the LOD95 in the third column, and, in the fourth column, the observed fetal fraction required to reach 80% confidence that the true fetal fraction is above LOD95. The observed fetal fraction required for 75% confidence is shown in the rightmost column and can be obtained from the data shown in the right panel of Figure 12. The second column shows the FF error for the different effective read counts, indicating that error decreases as coverage increases.

Figure 14 includes two fetal fraction LOD curves that can be obtained from data like those shown in Figure 13. In some implementations, an LOD curve is used in combination with a coverage threshold, such as the 1M coverage threshold shown in Figure 14.

In some implementations, as shown in Figure 14, the exclusion region lies below the fetal fraction LOD curve. Numerically, the exclusion region is the region in which each point has a lower fetal fraction or coverage value than the corresponding point on the fetal fraction LOD curve. In some implementations, the exclusion region is defined by the fetal fraction LOD curve and a coverage threshold. The exclusion region lies below both the fetal fraction LOD curve and the coverage threshold.
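A test for whether a sample falls below the fetal fraction LOD curve might look like the following minimal sketch. The 1M and 2M curve points reuse the required observed fetal fractions quoted in the worked example above (8% and 4.2%); the remaining points, the linear interpolation between points, and the function names are assumptions made for illustration. The coverage threshold is left out because the text does not spell out exactly how it combines with the curve.

```python
from bisect import bisect_left

# (coverage in millions of reads, required observed FF in %).
# The 1M and 2M points come from the worked example; the others are hypothetical.
LOD_CURVE = [(1.0, 8.0), (2.0, 4.2), (4.0, 3.0), (10.0, 2.0)]

def required_observed_ff(coverage_m, curve=LOD_CURVE):
    """Required observed FF at a coverage, linearly interpolated (an assumption)."""
    xs = [c for c, _ in curve]
    if coverage_m <= xs[0]:
        return curve[0][1]       # clamp below the first curve point
    if coverage_m >= xs[-1]:
        return curve[-1][1]      # clamp above the last curve point
    i = bisect_left(xs, coverage_m)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (coverage_m - x0) / (x1 - x0)

def in_exclusion_region(observed_ff, coverage_m, curve=LOD_CURVE):
    """True when the sample's observed FF lies below the LOD curve at its coverage."""
    return observed_ff < required_observed_ff(coverage_m, curve)

if __name__ == "__main__":
    print(in_exclusion_region(3.0, 2.0), in_exclusion_region(5.0, 2.0))
```

A sample flagged by `in_exclusion_region` would be a candidate for the resequencing decision described for Figures 4A and 4B.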

In some implementations, the method of generating the LOD curve can be summarized as follows. The LOD curve is obtained using simulated “samples,” where each “sample” includes the following items.

· A coverage value sampled from a normal distribution of coverage values known from clinical samples

· A ground-truth FF value sampled from a known ground-truth FF distribution determined from the observed FF distribution in clinical samples

· An observed FF value determined by adding error (corresponding to the coverage value) to the ground-truth FF value

A large number of such “samples” are simulated. For each observed fetal fraction value (in fractional increments) and at each coverage level (1-10M, in increments), all corresponding true FF values in the data set are collected, and the (100 − X)th percentile of these true FF values is selected. For each possible combination of observed FF value and coverage, there is now a single true FF value for which we can say that, given that observed FF value, we are X% confident that the true FF is at least as high as the selected true FF.

For each coverage level, the observed FF is plotted against the selected true FF, giving a plot with one curve per coverage level.

In addition, an LOD study such as that of Example 1 describes the relationship between coverage and LOD95 (the derivation of this relationship involves affected samples). The LOD95 (Y) values from that relationship corresponding to coverage levels 1-10M (in increments) are then superimposed on the observed-FF-versus-true-FF plot described above. The point at which each of these horizontal LOD95 (Y) lines intersects the corresponding (by coverage) curve on the observed-versus-true FF plot is read off the x-axis to give the observed FF required to ensure X% confidence that the true FF exceeds LOD95 (Y).

The coverage values are then plotted against the required observed FF to obtain the LOD QC curve.
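The simulation steps summarized above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce Figures 12 to 14: the lognormal true-FF model, the error model (standard deviation 2%/√coverage), the 0.5% observed-FF bin width, the coverage distribution parameters, and the LOD95(coverage) function are all hypothetical. The LOD95 function 6.5/√coverage was chosen only because it passes close to the two values quoted in the text (6.50% at 1M and 4.59% at 2M).

```python
import math
import random
from collections import defaultdict

def percentile(values, p):
    """Nearest-rank p-th percentile (0 <= p <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def simulate(n, rng):
    """Yield n (coverage in M reads, true FF %, observed FF %) triples."""
    for _ in range(n):
        # Coverage drawn from a normal distribution, snapped to levels 1..10M.
        cov = min(10, max(1, round(rng.gauss(5.5, 2.0))))
        true_ff = rng.lognormvariate(math.log(8.0), 0.5)  # hypothetical true-FF model
        err_sd = 2.0 / math.sqrt(cov)                     # hypothetical error model
        yield cov, true_ff, true_ff + rng.gauss(0.0, err_sd)

def percentile_table(samples, x_conf, step=0.5, min_count=1):
    """Map (coverage, observed-FF bin) to the (100 - x_conf)th percentile of true FF."""
    bins = defaultdict(list)
    for cov, true_ff, obs_ff in samples:
        bins[(cov, round(obs_ff / step) * step)].append(true_ff)
    return {key: percentile(vals, 100 - x_conf)
            for key, vals in bins.items() if len(vals) >= min_count}

def required_observed_ff(table, cov, lod95):
    """Smallest observed-FF bin at this coverage whose selected true FF reaches lod95."""
    hits = [obs for (c, obs), true_ff in table.items()
            if c == cov and true_ff >= lod95]
    return min(hits) if hits else None

def lod_qc_curve(n=100_000, x_conf=80, seed=0):
    """Return {coverage: required observed FF} from one simulated data set."""
    table = percentile_table(simulate(n, random.Random(seed)), x_conf, min_count=50)
    # Hypothetical LOD95(coverage); see the assumptions stated in the lead-in.
    return {cov: required_observed_ff(table, cov, 6.5 / math.sqrt(cov))
            for cov in range(1, 11)}

if __name__ == "__main__":
    for cov, ff in lod_qc_curve().items():
        print(cov, ff)
```

Taking the smallest qualifying bin stands in for reading the intersection off the x-axis; it matches the graphical reading only where the percentile curve rises monotonically with observed FF, which is why sparsely populated bins are dropped with `min_count`.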

Evaluating CNV

Methods for determining CNV

Using the sequence coverage values, fragment size parameters, and/or methylation levels provided by the methods disclosed herein, one can determine various genetic conditions related to copy number and CNV of sequences, chromosomes, or chromosome segments with improved sensitivity, selectivity, and/or efficiency relative to using sequence coverage obtained by conventional methods. For example, in some embodiments, a masked reference sequence is used to determine the presence or absence of any two or more different complete fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acid molecules. The exemplary methods provided below align reads to a reference sequence (including a reference genome). The alignment can be performed on an unmasked or a masked reference sequence, yielding sequence tags mapped to the reference sequence. In some embodiments, only sequence tags falling on unmasked segments of the reference sequence are taken into account to determine copy number variation.

In some embodiments, evaluating a nucleic acid sample for CNV involves characterizing the status of a chromosomal or segment aneuploidy by one of three types of calls: “normal” or “unaffected,” “affected,” and “no call.” Thresholds for calling normal and affected are typically set. A parameter related to aneuploidy or other copy number variation is measured in the sample, and the measured value is compared to the thresholds. For duplication-type aneuploidies, a call of affected is made if the chromosome or segment dose (or other measured value of sequence content) is above a defined threshold set for affected samples. For such aneuploidies, a call of normal is made if the chromosome or segment dose is below a threshold set for normal samples. By contrast, for deletion-type aneuploidies, a call of affected is made if the chromosome or segment dose is below a defined threshold for affected samples, and a call of normal is made if the chromosome or segment dose is above a threshold set for normal samples. For example, in the case of a trisomy, the “normal” call is determined by a value of a parameter (e.g., a test chromosome dose) that is below a user-defined reliability threshold, and the “affected” call is determined by a parameter (e.g., a test chromosome dose) that is above a user-defined reliability threshold. A “no call” result is determined by a parameter (e.g., a test chromosome dose) that lies between the thresholds for making a “normal” or an “affected” call. The term “no call” is used interchangeably with “unclassified.”
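The three-way call logic can be written down compactly. The sketch below is illustrative only: the function name, the `deletion` flag, and the threshold values in the usage example are hypothetical, and the parameter could be a chromosome dose or any other measured value.

```python
def call_sample(value, normal_threshold, affected_threshold, deletion=False):
    """Return 'affected', 'normal', or 'no call' for one measured parameter.

    For duplication-type aneuploidies, values above affected_threshold are
    affected and values below normal_threshold are normal. For deletion-type
    aneuploidies the directions are mirrored.
    """
    if deletion:
        # Negating everything mirrors the comparison directions.
        value = -value
        normal_threshold = -normal_threshold
        affected_threshold = -affected_threshold
    if value > affected_threshold:
        return "affected"
    if value < normal_threshold:
        return "normal"
    return "no call"  # between the two thresholds

if __name__ == "__main__":
    # Hypothetical thresholds on a standardized score.
    print(call_sample(4.5, 2.5, 4.0))
    print(call_sample(3.0, 2.5, 4.0))
    print(call_sample(-4.5, -2.5, -4.0, deletion=True))
```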

Parameters that can be used to determine CNV include, but are not limited to, coverage, fragment-size-biased or fragment-size-weighted coverage, the fraction or ratio of fragments within a defined size range, and the methylation level of fragments. As described herein, coverage is obtained from counts of reads aligned to a region of a reference genome and is optionally normalized to produce sequence tag counts. In some embodiments, the sequence tag counts can be weighted by fragment size.

In some embodiments, the fragment size parameter is biased toward a fragment size characteristic of one of the genomes. A fragment size parameter is a parameter related to fragment size. A parameter is biased toward a fragment size when: 1) the parameter is favorably weighted for the fragment size, e.g., a count weighted more heavily for that size than for other sizes; or 2) the parameter is obtained from a value that is favorably weighted for the fragment size, e.g., a ratio obtained from a count weighted more heavily for that size. A size is characteristic of a genome when the genome has an enriched or higher concentration of nucleic acid of that size relative to another genome or another portion of the same genome.

In some embodiments, the method for determining the presence or absence of any complete fetal chromosomal aneuploidy in a maternal test sample comprises (a) obtaining sequence information for fetal and maternal nucleic acids in the maternal test sample; (b) using the sequence information and the methods described above to identify a number of sequence tags, a sequence coverage quantity, a fragment size parameter, or another parameter for each of the chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags or another parameter for one or more normalizing chromosome sequences; (c) using the number of sequence tags or the other parameter identified for each of the chromosomes of interest and the number of sequence tags or the other parameter identified for each of the normalizing chromosomes to calculate a single chromosome dose for each of the chromosomes of interest; and (d) comparing each chromosome dose to a threshold, thereby determining the presence or absence of any complete fetal chromosomal aneuploidy in the maternal test sample.

In some embodiments, step (a) above can comprise sequencing at least a portion of the nucleic acid molecules of the test sample to obtain the sequence information for the fetal and maternal nucleic acid molecules of the test sample. In some embodiments, step (c) comprises calculating a single chromosome dose for each of the chromosomes of interest as the ratio of the number of sequence tags or the other parameter identified for each of the chromosomes of interest to the number of sequence tags or the other parameter identified for the normalizing chromosome sequence. In some other embodiments, the chromosome dose is based on a processed sequence coverage quantity or another parameter derived from the number of sequence tags. In some embodiments, only unique, non-redundant sequence tags are used to calculate the processed sequence coverage quantity or other parameter. In some embodiments, the processed sequence coverage quantity is a sequence tag density ratio, which is the number of sequence tags normalized by sequence length. In some embodiments, the processed sequence coverage quantity or other parameter is a normalized sequence tag count or another normalized parameter, which is the number of sequence tags or other parameter of a sequence of interest divided by that of all or a substantial portion of the genome. In some embodiments, the processed sequence coverage quantity or other parameter (such as a fragment size parameter) is adjusted according to a global profile of the sequence of interest. In some embodiments, the processed sequence coverage quantity or other parameter is adjusted according to the within-sample correlation between GC content and sequence coverage for the test sample. In some embodiments, the processed sequence coverage quantity or other parameter results from a combination of these processes, which are further described elsewhere herein.

In some embodiments, the chromosome dose is calculated as the ratio of the processed sequence coverage or other parameter for each of the chromosomes of interest to the processed sequence coverage or other parameter for the normalizing chromosome sequence.
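As a concrete reading of the ratio described above, the following sketch computes a chromosome dose from per-chromosome tag counts. The function name and the counts are hypothetical, and summing the counts of several normalizing chromosomes into a single denominator is an assumption about how "one or more" normalizing sequences are combined.

```python
def chromosome_dose(tag_counts, chrom_of_interest, normalizing_chroms):
    """Ratio of the count for the chromosome of interest to the normalizing count."""
    # Assumption: several normalizing chromosomes are combined by summing counts.
    denominator = sum(tag_counts[c] for c in normalizing_chroms)
    if denominator == 0:
        raise ValueError("normalizing chromosomes have no sequence tags")
    return tag_counts[chrom_of_interest] / denominator

if __name__ == "__main__":
    # Hypothetical sequence tag counts for one sample.
    counts = {"chr21": 150, "chr9": 400, "chr11": 600}
    print(chromosome_dose(counts, "chr21", ["chr9", "chr11"]))
```

The same function applies unchanged if the dictionary holds processed coverage values or another parameter instead of raw tag counts.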

In any one of the above embodiments, the complete chromosomal aneuploidies are selected from complete chromosomal trisomies, complete chromosomal monosomies, and complete chromosomal polysomies. The complete chromosomal aneuploidies are selected from complete aneuploidies of any one of chromosomes 1-22, X, and Y. For example, the different complete fetal chromosomal aneuploidies are selected from trisomy 2, trisomy 8, trisomy 9, trisomy 20, trisomy 21, trisomy 13, trisomy 16, trisomy 18, trisomy 22, 47,XXX, 47,XYY, and monosomy X.

In any one of the above embodiments, steps (a) to (d) are repeated for test samples from different maternal subjects, and the method comprises determining the presence or absence of any two or more different complete fetal chromosomal aneuploidies in each of the test samples.

In any one of the above embodiments, the method can further comprise calculating a normalized chromosome value (NCV), where the NCV relates the chromosome dose to the mean of the corresponding chromosome dose in a set of qualified samples as:

$$\mathrm{NCV}_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$

where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated mean and standard deviation, respectively, of the j-th chromosome dose in the set of qualified samples, and $x_{ij}$ is the observed j-th chromosome dose for test sample i.

In a variant that uses the median dose on the same flow cell in place of the qualified-set mean, the NCV takes the form:

$$\mathrm{NCV}_{ij} = \frac{x_{ij} - M_j}{\hat{\sigma}_j}$$

where $M_j$ is the estimated median of the j-th chromosome dose in a set of multiplexed samples sequenced on the same flow cell; $\hat{\sigma}_j$ is the standard deviation of the j-th chromosome dose in one or more sets of multiplexed samples sequenced on one or more flow cells; and $x_{ij}$ is the observed j-th chromosome dose for test sample i. In this embodiment, test sample i is one of the multiplexed samples sequenced on the same flow cell from which $M_j$ is determined.
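Reading the NCV as a standardized score of the observed dose against a center (the qualified-set mean, or the flow-cell median) and a standard deviation, as the variable definitions above suggest, a minimal sketch is given below. The function names and dose values are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption.

```python
from statistics import mean, median, stdev

def ncv(x_ij, qualified_doses):
    """NCV of an observed dose relative to the qualified-set mean and SD."""
    mu_j = mean(qualified_doses)
    sigma_j = stdev(qualified_doses)  # assumption: sample (n - 1) standard deviation
    return (x_ij - mu_j) / sigma_j

def ncv_median(x_ij, flow_cell_doses, sigma_j):
    """NCV of an observed dose relative to the median dose on the same flow cell."""
    return (x_ij - median(flow_cell_doses)) / sigma_j

if __name__ == "__main__":
    # Hypothetical doses for the j-th chromosome.
    qualified = [0.10, 0.12, 0.14]
    print(ncv(0.18, qualified))
    print(ncv_median(0.18, qualified, 0.02))
```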

In some embodiments, a method is provided for determining the presence or absence of different partial fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acids. The method involves procedures analogous to those outlined above for detecting complete aneuploidies. However, instead of analyzing complete chromosomes, segments of chromosomes are analyzed. See U.S. Patent Application Publication No. 2013/0029852, which is incorporated by reference.

Figure 15 shows a method for determining the presence of copy number variation according to some embodiments. Process 100 shown in Figure 15 uses sequence tag coverage based on the number of sequence tags (i.e., the sequence tag count) to determine CNV. However, similar to the description above for calculating NCV, other variables or parameters, such as size, size ratio, and methylation level, can be used instead of coverage. In some implementations, two or more variables are combined to determine CNV. Furthermore, coverage and other parameters can be weighted based on the size of the fragments from which the tags are derived. For ease of reading, only coverage is mentioned in process 100 shown in Figure 15, but it should be noted that other parameters, such as size, size ratio, methylation level, and counts weighted by size, can be used in place of coverage.

In operations 130 and 135, qualified sequence tag coverage (or the value of another parameter) and test sequence tag coverage (or the value of another parameter) are determined. The present disclosure provides processes for determining coverage quantities that offer improved sensitivity and selectivity relative to conventional methods. Operations 130 and 135 are marked with asterisks and drawn in heavy-lined boxes to indicate that these operations contribute to the improvement over the prior art. In some embodiments, the sequence tag coverage quantities are normalized, adjusted, trimmed, and otherwise processed to improve the sensitivity and selectivity of the analysis. These processes are further described elsewhere herein.

In overview, the method uses normalizing sequences of qualified training samples to determine the CNV of test samples. In some embodiments, the qualified training samples are unaffected and have normal copy numbers. Normalizing sequences provide a mechanism for normalizing measurements for within-run and between-run variability. Normalizing sequences are identified using sequence information from a set of qualified samples obtained from subjects known to comprise cells having a normal copy number for any one sequence of interest, e.g., a chromosome or segment thereof. The determination of normalizing sequences is outlined in steps 110, 120, 130, 145, and 146 of the embodiment of the method shown in Figure 15. In some embodiments, the normalizing sequences are used to calculate the sequence dose for the test sequence. See step 150. In some embodiments, the normalizing sequences are also used to calculate a threshold against which the sequence dose of the test sequence is compared. See step 150. The sequence information obtained from the normalizing sequences and the test sequence is used to make a statistically meaningful identification of chromosomal aneuploidies in the test sample (step 160).

Turning to the details of a method for determining the presence of copy number variation according to some embodiments, Figure 15 provides a flowchart 100 of an embodiment for determining CNV of a sequence of interest (e.g., a chromosome or a segment thereof) in a biological sample. In some embodiments, the biological sample is obtained from a subject and comprises a mixture of nucleic acids contributed by different genomes. The different genomes can be contributed to the sample by two individuals, for example by a fetus and the mother carrying the fetus. The different genomes can also be contributed to the sample by three or more individuals, for example by two or more fetuses and the mother carrying them. Alternatively, the genomes are contributed to the sample by aneuploid cancer cells and normal euploid cells from the same subject (e.g., a plasma sample from a cancer patient).

Apart from analyzing a patient's test sample, one or more normalized chromosomes or one or more normalized chromosome segments are selected for each possible chromosome of interest. The normalized chromosomes or segments are identified asynchronously from the routine testing of patient samples, which may take place in a clinical setting. In other words, the normalized chromosomes or segments are identified before patient samples are tested. The associations between normalized chromosomes or segments and chromosomes or segments of interest are stored for use during testing. As explained below, such associations are typically maintained over periods of time that span the testing of many samples. The following discussion concerns embodiments for selecting normalized chromosomes or chromosome segments for an individual chromosome or segment of interest.

A set of qualified samples is obtained to identify qualified normalized sequences and to provide variance values for use in making statistically meaningful identifications of CNV in test samples. In step 110, a plurality of biological qualified samples is obtained from a plurality of subjects known to comprise cells having a normal copy number for any one sequence of interest. In one embodiment, the qualified samples are obtained from mothers pregnant with a fetus that has been confirmed by cytogenetic means to have a normal copy number of chromosomes. The biological qualified samples may be a biological fluid (e.g., plasma) or any suitable sample as described below. In some embodiments, a qualified sample contains a mixture of nucleic acid molecules (e.g., cfDNA molecules). In some embodiments, the qualified sample is a maternal plasma sample that contains a mixture of fetal and maternal cfDNA molecules. Sequence information for normalized chromosomes and/or segments thereof is obtained by sequencing at least a portion of the nucleic acids (e.g., fetal and maternal nucleic acids) using any known sequencing method. Preferably, any one of the next-generation sequencing (NGS) methods described elsewhere herein is used to sequence the fetal and maternal nucleic acids as single or clonally amplified molecules. In various embodiments, the qualified samples are processed as disclosed below before and during sequencing. They may be processed using the apparatus, systems, and kits disclosed herein.

In step 120, at least a portion of each of the qualified nucleic acids contained in the qualified samples is sequenced to generate millions of sequence reads (e.g., 36 bp reads), which are aligned to a reference genome (e.g., hg18). In some embodiments, the sequence reads comprise about 20 bp, about 25 bp, about 30 bp, about 35 bp, about 40 bp, about 45 bp, about 50 bp, about 55 bp, about 60 bp, about 65 bp, about 70 bp, about 75 bp, about 80 bp, about 85 bp, about 90 bp, about 95 bp, about 100 bp, about 110 bp, about 120 bp, about 130 bp, about 140 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 350 bp, about 400 bp, about 450 bp, or about 500 bp. It is expected that technological advances will enable single-end reads of greater than 500 bp, and hence reads of greater than about 1000 bp when paired-end reads are generated. In one embodiment, the mapped sequence reads comprise 36 bp. In another embodiment, the mapped sequence reads comprise 25 bp.

Sequence reads are aligned to a reference genome, and the reads that map uniquely to the reference genome are known as sequence tags. Sequence tags falling on masked segments of a masked reference sequence are not counted for the CNV analysis.

In one embodiment, at least about 3 × 10⁶ qualified sequence tags, at least about 5 × 10⁶ qualified sequence tags, at least about 8 × 10⁶ qualified sequence tags, at least about 10 × 10⁶ qualified sequence tags, at least about 15 × 10⁶ qualified sequence tags, at least about 20 × 10⁶ qualified sequence tags, at least about 30 × 10⁶ qualified sequence tags, at least about 40 × 10⁶ qualified sequence tags, or at least about 50 × 10⁶ qualified sequence tags comprising reads of 20 bp to 40 bp are obtained from reads that map uniquely to the reference genome.

In step 130, all the tags obtained from sequencing the nucleic acids in the qualified samples are counted to obtain a qualified sequence tag coverage. Similarly, in operation 135, all the tags obtained from a test sample are counted to obtain a test sequence tag coverage. The present disclosure provides processes for determining coverage quantities that offer improved sensitivity and selectivity relative to conventional methods. Operations 130 and 135 are marked with asterisks and drawn in heavy-lined boxes to indicate that they contribute to the improvement over the prior art. In some embodiments, the sequence tag coverage quantities are normalized, adjusted, trimmed, and otherwise processed to improve the sensitivity and selectivity of the analysis. These processes are described further elsewhere herein.

Once all qualified sequence tags have been mapped and counted in each of the qualified samples, the sequence tag coverage is determined for the sequence of interest (e.g., a clinically relevant sequence) in the qualified samples, as is the sequence tag coverage for the additional sequences from which normalized sequences are subsequently identified.

In some embodiments, the sequence of interest is a chromosome associated with a complete chromosomal aneuploidy (e.g., chromosome 21), and the qualified normalized sequence is a complete chromosome that is not associated with a chromosomal aneuploidy and whose variation in sequence tag coverage approximates that of the sequence (i.e., chromosome) of interest (e.g., chromosome 21). The selected normalized chromosome(s) may be the one chromosome or the group of chromosomes whose variation in sequence tag coverage best approximates that of the sequence of interest. Any one or more of chromosomes 1-22, X, and Y can be a sequence of interest, and one or more chromosomes can be identified in the qualified samples as the normalized sequence for each of chromosomes 1-22, X, and Y. As described elsewhere herein, the normalized chromosome can be an individual chromosome or a group of chromosomes.

In another embodiment, the sequence of interest is a chromosome segment associated with a partial aneuploidy (e.g., a chromosomal deletion or insertion, or an unbalanced chromosomal translocation), and the normalized sequence is a chromosome segment (or a group of segments) that is not associated with the partial aneuploidy and whose variation in sequence tag coverage approximates that of the chromosome segment associated with the partial aneuploidy. The selected normalized chromosome segment(s) may be the one or more segments whose variation in sequence tag coverage best approximates that of the sequence of interest. Any one or more segments of any one or more of chromosomes 1-22, X, and Y can be a sequence of interest.

In other embodiments, the sequence of interest is a chromosome segment associated with a partial aneuploidy and the normalized sequence is one or more whole chromosomes. In still other embodiments, the sequence of interest is a whole chromosome associated with an aneuploidy and the normalized sequence is one or more chromosome segments that are not associated with the aneuploidy.

Whether a single sequence or a group of sequences is identified in the qualified samples as the normalized sequence(s) for any one or more sequences of interest, the qualified normalized sequence may be chosen so that its variation in sequence tag coverage, or in a fragment size parameter, best or effectively approximates that of the sequence of interest as determined in the qualified samples. For example, a qualified normalized sequence is the sequence that produces the smallest variability across the qualified samples when it is used to normalize the sequence of interest; that is, the variability of the normalized sequence is closest to that of the sequence of interest determined in the qualified samples. Stated differently, the qualified normalized sequence is the sequence selected to produce the least variation in sequence dose (for the sequence of interest) across the qualified samples. The process thus selects a sequence that, when used as a normalized chromosome, is expected to produce the smallest run-to-run variability in chromosome dose for the sequence of interest.

The normalized sequence identified in the qualified samples for any one or more sequences of interest remains the normalized sequence of choice for determining the presence or absence of aneuploidy in test samples over days, weeks, months, and possibly years, provided that the procedures needed to generate sequencing libraries and to sequence the samples remain essentially unaltered over time. As described above, normalized sequences for determining the presence of aneuploidy are chosen for (possibly among other reasons) the variability, across samples (e.g., different samples) and sequencing runs (e.g., runs performed on the same day and/or on different days), in the number of sequence tags mapped to them or in the value of a fragment size parameter, where that variability best approximates the variability of the sequence of interest for which they serve as a normalizing parameter. Substantial alterations in these procedures will affect the number of tags mapped to all sequences, which in turn will determine which sequence or group of sequences has, across samples in the same and/or different sequencing runs performed on the same or different days, a variability closest to that of the sequence of interest; the set of normalized sequences would then need to be redetermined. Substantial alterations in procedures include changes in the laboratory protocol used for preparing the sequencing library (including changes related to preparing samples for multiplex rather than singleplex sequencing) and changes in sequencing platform (including changes in the chemistry used for sequencing).

In some embodiments, the normalized sequence chosen to normalize a particular sequence of interest is the sequence that best distinguishes one or more qualified samples from one or more affected samples. In other words, the normalized sequence is the sequence with the greatest differentiability: it provides optimal differentiation for the sequence of interest in an affected test sample, so that the affected test sample is easily distinguished from other, unaffected samples. In other embodiments, the normalized sequence is a sequence that combines the smallest variability with the greatest differentiability.

The level of differentiability can be determined as a statistical difference between the sequence doses (e.g., chromosome doses or segment doses) in a population of qualified samples and the chromosome dose(s) in one or more test samples, as described below and shown in the Examples. For example, differentiability can be represented numerically as a t-test value, which expresses the statistical difference between the chromosome doses in a population of qualified samples and the chromosome dose(s) in one or more test samples. Similarly, differentiability can be based on segment doses instead of chromosome doses. Alternatively, differentiability can be represented numerically as a normalized chromosome value (NCV), which is a z-score for chromosome doses, provided that the distribution of the NCV is normal. Similarly, where chromosome segments are the sequences of interest, the differentiability of segment doses can be represented numerically as a normalized segment value (NSV), which is a z-score for chromosome segment doses, provided that the distribution of the NSV is normal. In determining the z-score, the mean and standard deviation of chromosome or segment doses in a set of qualified samples can be used. Alternatively, the mean and standard deviation of chromosome or segment doses in a training set comprising both qualified samples and affected samples can be used. In other embodiments, the normalized sequence is the sequence with the smallest variability and the greatest differentiability, or with an optimal combination of small variability and large differentiability.
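The z-score form of differentiability described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `normalized_chromosome_value`, the example dose values, and the use of the sample (n − 1) standard deviation are all assumptions made for the sketch.

```python
from statistics import mean, stdev

def normalized_chromosome_value(test_dose, qualified_doses):
    """Return the z-score of a test dose against qualified-sample doses.

    Hypothetical helper: the text only states that an NCV is a z-score
    built from the mean and standard deviation of the qualified doses.
    """
    mu = mean(qualified_doses)
    sigma = stdev(qualified_doses)  # sample standard deviation (assumption)
    if sigma == 0:
        raise ValueError("qualified doses show no spread; z-score undefined")
    return (test_dose - mu) / sigma

# Made-up chromosome doses for five qualified samples and one test sample.
qualified = [0.100, 0.102, 0.098, 0.101, 0.099]
ncv = normalized_chromosome_value(0.110, qualified)
print(f"NCV = {ncv:.2f}")
```

The same function would serve for a normalized segment value (NSV) if segment doses were passed in instead of chromosome doses.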

The method identifies sequences that inherently have similar characteristics and that are prone to similar variation across samples and sequencing runs; the identified sequences can be used to determine sequence doses in test samples.

Determination of sequence doses

In some embodiments, as described in step 146 of Figure 1, chromosome or segment doses for one or more chromosomes or segments of interest are determined in all qualified samples, and a normalized chromosome or segment sequence is identified in step 145. Some normalized sequences are provided before sequence doses are calculated. One or more normalized sequences are then identified according to various criteria, as described further below; see step 145. In some embodiments, for example, the identified normalized sequence yields the smallest variability in sequence dose for the sequence of interest across all qualified samples.

In step 146, based on the calculated qualified tag densities, a qualified sequence dose (i.e., a chromosome dose or a segment dose) for a sequence of interest is determined as the ratio of the sequence tag coverage for the sequence of interest to the qualified sequence tag coverage for the additional sequences from which normalized sequences are subsequently identified in step 145. The identified normalized sequences are subsequently used to determine sequence doses in test samples.

In one embodiment, the sequence dose in the qualified samples is a chromosome dose, calculated as the ratio of the number of sequence tags (or the fragment size parameter) for a chromosome of interest to the number of sequence tags for a normalized chromosome sequence in a qualified sample. The normalized chromosome sequence can be a single chromosome, a group of chromosomes, a segment of one chromosome, or a group of segments from different chromosomes. Accordingly, the chromosome dose for a chromosome of interest is determined in a qualified sample as the ratio of the number of tags for the chromosome of interest to the number of tags for (i) a normalized chromosome sequence composed of a single chromosome, (ii) a normalized chromosome sequence composed of two or more chromosomes, (iii) a normalized segment sequence composed of a single segment of a chromosome, (iv) a normalized segment sequence composed of two or more segments from one chromosome, or (v) a normalized segment sequence composed of two or more segments of two or more chromosomes.

Examples for determining a chromosome dose for chromosome 21 according to (i)-(v) are as follows. The chromosome dose for the chromosome of interest (e.g., chromosome 21) is determined as the ratio of the sequence tag coverage of chromosome 21 to one of the following sequence tag coverages: (i) each of all the remaining chromosomes, i.e., chromosomes 1-20, chromosome 22, chromosome X, and chromosome Y; (ii) all possible combinations of two or more remaining chromosomes; (iii) a segment of another chromosome (e.g., chromosome 9); (iv) two segments of one other chromosome, e.g., two segments of chromosome 9; (v) two segments of two different chromosomes, e.g., a segment of chromosome 9 and a segment of chromosome 14.
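The ratio that defines a chromosome dose can be illustrated with a short sketch. The function `sequence_dose`, the coverage numbers, and the choice to sum the coverages when the normalized sequence is a group are assumptions for illustration; the text above only specifies that the dose is a ratio of tag counts.

```python
def sequence_dose(coverage, sequence_of_interest, normalizing_sequences):
    """Ratio of tag coverage for the sequence of interest to the combined
    coverage of the normalized sequence(s).

    Assumption: a group of normalized sequences contributes the sum of
    its members' coverages to the denominator.
    """
    numerator = coverage[sequence_of_interest]
    denominator = sum(coverage[name] for name in normalizing_sequences)
    if denominator == 0:
        raise ValueError("normalized sequences have zero coverage")
    return numerator / denominator

# Made-up tag counts for one qualified sample.
coverage = {"chr21": 150_000, "chr9": 500_000, "chr14": 400_000}

dose_single = sequence_dose(coverage, "chr21", ["chr14"])         # case (i)
dose_group = sequence_dose(coverage, "chr21", ["chr9", "chr14"])  # case (ii)
print(dose_single, dose_group)
```

Cases (iii)-(v) would follow the same pattern with segment identifiers as dictionary keys in place of whole-chromosome names.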

In another embodiment, the sequence dose in the qualified samples is a segment dose, as opposed to a chromosome dose. The segment dose is calculated as the ratio of the number of sequence tags for a segment of interest (which is not a whole chromosome) to the number of sequence tags for a normalized segment sequence in a qualified sample. The normalized segment sequence can be any of the normalized chromosome or segment sequences discussed above.

Identification of normalized sequences

In step 145, a normalized sequence is identified for a sequence of interest. In some embodiments, for example, the normalized sequence is selected on the basis of the calculated sequence doses, e.g., as the sequence that yields the smallest variability in sequence dose for the sequence of interest across all qualified training samples. The method identifies sequences that inherently have similar characteristics and that are prone to similar variation across samples and sequencing runs; the identified sequences can be used to determine sequence doses in test samples.

Normalized sequences for one or more sequences of interest can be identified in a set of qualified samples, and the sequences identified in the qualified samples are subsequently used to calculate the sequence dose for one or more sequences of interest in each of the test samples (step 150), in order to determine the presence or absence of aneuploidy in each test sample. The normalized sequence identified for a chromosome or segment of interest may differ when different sequencing platforms are used and/or when there are differences in the purification of the nucleic acids to be sequenced and/or in the preparation of the sequencing library. Irrespective of the sample preparation and/or sequencing platform used, the use of normalized sequences according to the methods described herein provides a specific and sensitive measure of copy number variation of a chromosome or segment thereof.

In some embodiments, more than one normalized sequence is identified; that is, different normalized sequences can be determined for one sequence of interest, and multiple sequence doses can be determined for one sequence of interest. For example, the variation in chromosome dose for chromosome 21 of interest, e.g., the coefficient of variation (CV = standard deviation / mean), is smallest when the sequence tag coverage of chromosome 14 is used. However, two, three, four, five, six, seven, eight, or more normalized sequences can be identified for use in determining a sequence dose for the sequence of interest in a test sample. For example, a second dose for chromosome 21 in any one test sample can be determined using chromosome 7, chromosome 9, chromosome 11, or chromosome 12 as the normalized chromosome sequence, because each of these chromosomes has a CV close to that of chromosome 14.
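A minimal sketch of ranking candidate normalized chromosomes by the coefficient of variation of the resulting dose is given below. The helper names, the per-sample coverage dictionaries, and the numbers are hypothetical; only the definition CV = standard deviation / mean comes from the text.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = standard deviation / mean (sample standard deviation assumed)."""
    return stdev(values) / mean(values)

def rank_normalizing_candidates(samples, sequence_of_interest):
    """Rank every other sequence by the CV of the dose it would produce
    for the sequence of interest across the qualified samples."""
    candidates = [name for name in samples[0] if name != sequence_of_interest]
    cv_by_candidate = {}
    for name in candidates:
        doses = [s[sequence_of_interest] / s[name] for s in samples]
        cv_by_candidate[name] = coefficient_of_variation(doses)
    # Smallest CV first: the top entry is the preferred candidate,
    # and the next entries are natural choices for additional doses.
    return sorted(cv_by_candidate.items(), key=lambda item: item[1])

# Made-up tag counts for three qualified samples.
samples = [
    {"chr21": 100, "chr14": 200, "chr9": 300},
    {"chr21": 110, "chr14": 220, "chr9": 300},
    {"chr21": 90, "chr14": 180, "chr9": 310},
]
ranked = rank_normalizing_candidates(samples, "chr21")
print(ranked)
```

In this toy data chr14 tracks chr21 exactly, so it ranks first; real data would show small but nonzero CVs for every candidate.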

In some embodiments, when a single chromosome is chosen as the normalized chromosome sequence for a chromosome of interest, the normalized chromosome sequence is the chromosome that yields chromosome doses for the chromosome of interest with the smallest variability across all samples tested (e.g., the qualified samples). In some cases, the best normalized chromosome may not have the least variation but may instead have a distribution of qualified doses that best distinguishes one or more test samples from the qualified samples; that is, the best normalized chromosome may have the greatest differentiability rather than the lowest variation.

In some embodiments, the normalized sequences include one or more robust autosomal sequences or segments thereof. In some embodiments, the robust autosomes include all autosomes except the chromosome(s) of interest. In some embodiments, the robust autosomes include all autosomes except chromosomes X, Y, 13, 18, and 21. In some embodiments, the robust autosomes include all autosomes except those determined from a sample to deviate from a normal diploid state, which can be useful in characterizing cancer genomes that have abnormal copy numbers relative to a normal diploid genome.

Determination of aneuploidy in test samples

Based on the normalized sequences identified in the qualified samples, a sequence dose is determined for a sequence of interest in a test sample that comprises a mixture of nucleic acids derived from genomes that differ in one or more sequences of interest.

In step 115, a test sample is obtained from a subject suspected of carrying, or known to carry, a clinically relevant CNV of a sequence of interest. The test sample may be a biological fluid (e.g., plasma) or any suitable sample as described below. As explained, the sample may be obtained using a non-invasive procedure such as a simple blood draw. In some embodiments, the test sample contains a mixture of nucleic acid molecules (e.g., cfDNA molecules). In some embodiments, the test sample is a maternal plasma sample that contains a mixture of fetal and maternal cfDNA molecules.

In step 125, at least a portion of the test nucleic acids in the test sample is sequenced as described for the qualified samples to generate millions of sequence reads, e.g., 36 bp reads. In various embodiments, 2 × 36 bp paired-end reads are used for paired-end sequencing. As in step 120, the reads generated from sequencing the nucleic acids in the test sample are uniquely mapped or aligned to a reference genome to produce tags. As described in step 120, at least about 3 × 10⁶ qualified sequence tags, at least about 5 × 10⁶ qualified sequence tags, at least about 8 × 10⁶ qualified sequence tags, at least about 10 × 10⁶ qualified sequence tags, at least about 15 × 10⁶ qualified sequence tags, at least about 20 × 10⁶ qualified sequence tags, at least about 30 × 10⁶ qualified sequence tags, at least about 40 × 10⁶ qualified sequence tags, or at least about 50 × 10⁶ qualified sequence tags comprising reads of 20 bp to 40 bp are obtained from reads that map uniquely to the reference genome. In certain embodiments, the reads produced by the sequencing apparatus are provided in an electronic format. Alignment is accomplished using computational apparatus as discussed below. Individual reads are compared against the reference genome, which is often vast (millions of base pairs), to identify sites where the reads uniquely correspond with the reference genome.

In some embodiments, the alignment procedure permits limited mismatch between reads and the reference genome. In some cases, 1, 2, or 3 base pairs in a read are permitted to mismatch the corresponding base pairs in the reference genome, and the read is still mapped.
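To make the idea of unique mapping with a limited number of mismatches concrete, the toy sketch below scans a short reference string by brute force. It is an illustration under stated assumptions only: production aligners use indexed search rather than a linear scan, and the names `hamming_distance` and `unique_site`, along with the sequences, are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def unique_site(read, reference, max_mismatches=1):
    """Return the single reference offset where the read aligns with at
    most `max_mismatches` mismatches, or None if it aligns nowhere or
    at more than one site (such a read does not become a tag)."""
    window = len(read)
    hits = [
        i
        for i in range(len(reference) - window + 1)
        if hamming_distance(read, reference[i:i + window]) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None

reference = "ACGTACGTTTGACCA"  # toy reference, not a real genome
exact = unique_site("TTGACC", reference)               # exact match
one_mismatch = unique_site("TTGACG", reference)        # last base differs
repeated = unique_site("ACGT", reference, max_mismatches=0)  # occurs twice
print(exact, one_mismatch, repeated)
```

A read that matches two sites, like the last example, is discarded here because only uniquely mapped reads count as sequence tags.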

In step 135, all or most of the tags obtained from sequencing the nucleic acids in the test sample are counted, using computational apparatus as described below, to determine a test sequence tag coverage. In some embodiments, each read is aligned to a particular region of the reference genome (a chromosome or segment in most cases), and the read is converted to a tag by appending site information to it. As this process unfolds, the computational apparatus may keep a running count of the number of tags/reads mapped to each region of the reference genome (a chromosome or segment in most cases). The counts are stored for each chromosome or segment of interest and for each corresponding normalized chromosome or segment.
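The running count per region can be sketched with a counter keyed by region name. The tag representation as `(chromosome, position)` pairs and the function name `count_tags_by_region` are assumptions for illustration.

```python
from collections import Counter

def count_tags_by_region(tags):
    """Keep a running count of tags per region.

    Each tag is assumed to be a (chromosome, position) pair, i.e. a read
    with its site information appended.
    """
    counts = Counter()
    for chromosome, _position in tags:
        counts[chromosome] += 1
    return counts

# Made-up tags for illustration.
tags = [("chr21", 1_000), ("chr14", 5_000), ("chr21", 2_500), ("chr9", 42)]
counts = count_tags_by_region(tags)
print(counts["chr21"], counts["chr14"], counts["chr9"])
```

Because `Counter` returns zero for missing keys, regions with no tags can be queried without special handling.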

In certain embodiments, the reference genome has one or more excluded regions that are part of a true biological genome but are not included in the reference genome. Reads that might align to these excluded regions are not counted. Examples of excluded regions include regions of long repeated sequences, regions of similarity between the X and Y chromosomes, and the like. When a masked reference sequence obtained by the masking techniques described above is used, only tags on the unmasked segments of the reference sequence are taken into account for the CNV analysis.
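Restricting the count to unmasked segments can be sketched as a simple interval filter. The half-open `(start, end)` interval convention, the function name, and the coordinates are assumptions made for this illustration.

```python
def count_unmasked_tags(tag_positions, masked_intervals):
    """Count tags whose position lies outside every masked interval.

    Intervals are half-open (start, end) pairs on one reference sequence
    (an assumed convention for this sketch).
    """
    def is_masked(position):
        return any(start <= position < end for start, end in masked_intervals)

    return sum(1 for position in tag_positions if not is_masked(position))

# Made-up tag positions and masked segments.
positions = [5, 15, 25, 35]
masks = [(10, 20), (30, 40)]
unmasked_count = count_unmasked_tags(positions, masks)
print(unmasked_count)
```

For genome-scale data a sorted interval lookup would be more efficient than the linear scan used here.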

In some embodiments, the method determines whether to count a tag more than once when multiple reads align to the same site on a reference genome or sequence. There may be occasions when two tags have the same sequence and therefore align to an identical site on a reference sequence. The method used to count tags may, under certain circumstances, exclude from the count identical tags deriving from the same sequenced sample. If a disproportionate number of tags are identical in a given sample, this suggests a strong bias or other defect in the procedure. Therefore, in accordance with certain embodiments, the counting method does not count tags from a given sample that are identical to tags from that sample that were previously counted.

Various criteria may be set for choosing when to disregard an identical tag from a single sample. In certain embodiments, a defined percentage of the tags that are counted must be unique. If more tags than this threshold allows are not unique, they are disregarded. For example, if the defined percentage requires that at least 50% be unique, identical tags are not counted until the percentage of unique tags in the sample exceeds 50%. In other embodiments, the threshold percentage of unique tags is at least about 60%. In other embodiments, the threshold percentage of unique tags is at least about 75%, or at least about 90%, or at least about 95%, or at least about 98%, or at least about 99%. A threshold of 90% may be set for chromosome 21: if 30M tags are aligned to chromosome 21, then at least 27M of them must be unique. If 3M counted tags are not unique and the 30 million and first tag is also not unique, that tag is not counted. The particular threshold or other criterion used to determine when not to count further identical tags can be selected using appropriate statistical analysis. One factor influencing this threshold or other criterion is the relative amount of sequenced sample to the size of the genome to which tags can be aligned. Other factors include the size of the reads and similar considerations.
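One way to read the uniqueness criterion is as a floor on the share of unique tags among the tags counted so far. The sketch below follows that reading, which is an assumption; the function name `count_with_unique_floor` and the order-dependent behaviour are illustrative choices rather than anything the text specifies.

```python
def count_with_unique_floor(tags, min_unique_fraction):
    """Count tags, skipping any duplicate whose inclusion would push the
    unique share of counted tags below `min_unique_fraction`."""
    seen = set()
    counted = 0
    unique = 0
    for tag in tags:
        if tag not in seen:
            seen.add(tag)
            unique += 1
            counted += 1
        elif unique / (counted + 1) >= min_unique_fraction:
            counted += 1  # the duplicate still fits above the floor
        # otherwise the duplicate is disregarded
    return counted, unique

# Worked numbers from the text: a 90% floor on 30M chromosome 21 tags.
min_unique_chr21 = 30_000_000 * 90 // 100   # 27,000,000 unique tags required
max_duplicates_chr21 = 30_000_000 - min_unique_chr21

counted, unique = count_with_unique_floor(["a", "b", "a", "c", "a", "a", "d"], 0.75)
```

In the small example, the first repeated "a" is dropped because counting it would leave only two of three tags unique, while a later repeat is accepted once three unique tags have been seen.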

In one embodiment, the number of test sequence tags mapped to a sequence of interest is normalized to the known length of that sequence to provide a test sequence tag density ratio. As described for the qualified samples, normalization to the known length of a sequence of interest is not required; it may be included as a step that reduces the number of digits in a value to simplify it for human interpretation. Once all the mapped test sequence tags in the test sample are counted, the sequence tag coverage is determined for a sequence of interest (e.g., a clinically relevant sequence) in the test sample, as are the sequence tag coverages for additional sequences that correspond to at least one normalizing sequence identified in the qualified samples.

In step 150, a test sequence dose is determined for a sequence of interest in the test sample, based on the identity of at least one normalizing sequence in the qualified samples. In various embodiments, the test sequence dose is computed from the sequence tag coverages of the sequence of interest and of the corresponding normalizing sequence, as described herein. The computing apparatus responsible for this task electronically accesses the association between the sequence of interest and its associated normalizing sequence, which may be stored in a database, table, or graph, or be included as code in program instructions.

As described elsewhere herein, the at least one normalizing sequence can be a single sequence or a group of sequences. The sequence dose for a sequence of interest in a test sample is the ratio of the sequence tag coverage determined for the sequence of interest in the test sample to the sequence tag coverage of at least one normalizing sequence determined in the test sample, where the normalizing sequence in the test sample corresponds to the normalizing sequence identified in the qualified samples for that particular sequence of interest. For example, if the normalizing sequence identified for chromosome 21 in the qualified samples is a chromosome (e.g., chromosome 14), then the test sequence dose for chromosome 21 (the sequence of interest) is the ratio of the sequence tag coverage of chromosome 21 to the sequence tag coverage of chromosome 14, each determined in the test sample. Chromosome doses are determined similarly for chromosomes 13, 18, X, Y, and other chromosomes associated with chromosomal aneuploidies. A normalizing sequence for a chromosome of interest can be one chromosome or a group of chromosomes, or one chromosome segment or a group of chromosome segments. As described previously, a sequence of interest can be part of a chromosome, e.g., a chromosome segment. Accordingly, the dose for a chromosome segment can be determined as the ratio of the sequence tag coverage determined for the segment in the test sample to the sequence tag coverage determined for the normalizing chromosome segment in the test sample, where the normalizing segment in the test sample corresponds to the normalizing segment (a single segment or a group of segments) identified in the qualified samples for that particular segment of interest. Chromosome segments can range in size from kilobases (kb) to megabases (Mb) (e.g., about 1 kb to 10 kb, or about 10 kb to 100 kb, or about 100 kb to 1 Mb).
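The dose ratio can be illustrated with a short sketch. The coverage numbers are invented for the example, and summing coverages when the normalizing sequence is a group of chromosomes is an assumption; the text only states that the dose is a ratio of coverages.

```python
def sequence_dose(coverage, interest, normalizing):
    """Ratio of tag coverage on the sequence(s) of interest to tag coverage
    on the normalizing sequence(s), all measured in the same test sample."""
    numerator = sum(coverage[name] for name in interest)
    denominator = sum(coverage[name] for name in normalizing)
    return numerator / denominator

# Hypothetical tag coverages from one test sample.
coverage = {"chr21": 300_000, "chr14": 750_000, "chr9": 1_000_000}

dose_single = sequence_dose(coverage, ["chr21"], ["chr14"])          # one normalizing chromosome
dose_group = sequence_dose(coverage, ["chr21"], ["chr14", "chr9"])   # a group of chromosomes
```

The same function applies to chromosome segments if `coverage` is keyed by segment instead of by chromosome.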

In step 155, threshold values are derived from standard deviation values established for the qualified sequence doses determined in a plurality of qualified samples and for the sequence doses determined for samples known to be aneuploid for a sequence of interest. Note that this operation is typically performed asynchronously with the analysis of patient test samples. It may be performed, for example, concurrently with the selection of normalizing sequences from the qualified samples. Accurate classification depends on the differences between the probability distributions of the different classes (i.e., types of aneuploidy). In some examples, thresholds are chosen from the empirical distribution for each type of aneuploidy (e.g., trisomy 21). Possible threshold values for classifying trisomy 13, trisomy 18, trisomy 21, and monosomy X aneuploidies were established as described in the Examples, which describe the use of the method for determining chromosomal aneuploidies by sequencing cfDNA extracted from a maternal sample comprising a mixture of fetal and maternal nucleic acids. The threshold value determined to distinguish samples affected by an aneuploidy of one chromosome can be the same as or different from the threshold for a different aneuploidy. As shown in the Examples, the threshold value for each chromosome of interest is determined from the variability in the dose of that chromosome across samples and sequencing runs. The less variable the chromosome dose for any chromosome of interest, the narrower the spread of its dose across all the unaffected samples that are used to set the threshold for determining different aneuploidies.

Returning to the process flow associated with classifying a patient test sample, in step 160 the copy number variation of the sequence of interest is determined in the test sample by comparing the test sequence dose for the sequence of interest to at least one threshold value established from the qualified sequence doses. This operation may be performed by the same computing apparatus employed to measure sequence tag coverages and/or calculate segment doses.

In step 160, the calculated dose for a test sequence of interest is compared to threshold values chosen according to a user-defined "threshold of reliability" to classify the sample as "normal", "affected", or "no-call". A "no-call" sample is one for which a definitive diagnosis cannot be made with reliability. Each type of affected sample (e.g., trisomy 21, partial trisomy 21, monosomy X) has its own thresholds, one for calling normal (unaffected) samples and another for calling affected samples (although in some cases the two thresholds coincide). As described elsewhere herein, under some circumstances a no-call can be converted to a call (affected or normal) if the fetal fraction of nucleic acid in the test sample is sufficiently high. The classification of the test sequence may be reported by the computing apparatus employed in other operations of this process flow. In some cases, the classification is reported in an electronic format and may be displayed, emailed, texted, or otherwise sent to interested persons.

In some embodiments, the determination of CNV comprises calculating an NCV or NSV that relates the chromosome or segment dose to the mean of the corresponding chromosome or segment dose in a set of qualified samples, as described above. The CNV can then be determined by comparing the NCV/NSV to a predetermined copy number evaluation threshold value.

The copy number evaluation threshold can be chosen to optimize the rates of false positives and false negatives. The higher the copy number evaluation threshold, the less likely a false positive. Similarly, the lower the threshold, the less likely a false negative. Thus, a trade-off exists between a first ideal threshold, above which only true positives are classified, and a second ideal threshold, below which only true negatives are classified.

Thresholds are set largely depending on the variability in chromosome doses for a particular chromosome of interest, as determined in a set of unaffected samples. The variability depends on a number of factors, including the fraction of fetal cfDNA present in a sample. The variability (CV) is determined from the mean or median and the standard deviation of chromosome doses across a population of unaffected samples. Thus, the thresholds for classifying aneuploidy use NCVs according to:

$$NCV_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$

(where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated mean and standard deviation, respectively, of the j-th chromosome dose in a set of qualified samples, and $x_{ij}$ is the observed j-th chromosome dose for test sample i)

with an associated fetal fraction as follows:

$$FF_{ij} = 2 \times \left| \frac{x_{ij} - \hat{\mu}_j}{\hat{\mu}_j} \right| = 2 \times \left| NCV_{ij} \right| \times CV_j, \qquad CV_j = \frac{\hat{\sigma}_j}{\hat{\mu}_j}$$

Thus, for every NCV of a chromosome of interest, the expected fetal fraction associated with the given NCV value can be calculated from the CV, which is based on the mean and standard deviation of the chromosome ratio for that chromosome across a population of unaffected samples.
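The relationship between a chromosome dose, its NCV, and the associated fetal fraction can be sketched numerically. The sketch assumes that the NCV is the dose standardized by the qualified-sample mean and standard deviation, and that the associated fetal fraction takes the form 2 × NCV × CV (what one extra fetal copy of the chromosome would imply); the function names and dose values are illustrative.

```python
from statistics import mean, stdev

def ncv(dose, qualified_doses):
    """Normalized chromosome value: dose standardized against qualified samples."""
    return (dose - mean(qualified_doses)) / stdev(qualified_doses)

def fetal_fraction_from_ncv(ncv_value, qualified_doses):
    """Expected fetal fraction for a given NCV, assuming FF = 2 * NCV * CV."""
    cv = stdev(qualified_doses) / mean(qualified_doses)
    return 2 * ncv_value * cv

# Hypothetical chromosome 21 doses from unaffected (qualified) samples.
qualified = [0.398, 0.400, 0.402, 0.399, 0.401]
test_dose = 0.408

test_ncv = ncv(test_dose, qualified)
test_ff = fetal_fraction_from_ncv(test_ncv, qualified)
```

With these numbers the test dose sits 2% above the qualified mean, which corresponds to a fetal fraction of about 4% under the stated assumption.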

Subsequently, based on the relationship between fetal fraction and NCV values, a decision boundary can be chosen above which samples are determined to be positive (affected) based on normal distribution quantiles. As described above, in some embodiments a threshold is set for the optimal trade-off between the detection of true positives and the rate of false negative results. That is, the threshold is chosen to maximize the sum of true positives and true negatives, or to minimize the sum of false positives and false negatives.

Certain embodiments provide a method for prenatal diagnosis of a fetal chromosomal aneuploidy in a biological sample comprising fetal and maternal nucleic acid molecules. The diagnosis is made by obtaining sequence information from at least a portion of the mixture of fetal and maternal nucleic acid molecules derived from a biological test sample (e.g., a maternal plasma sample); computing from the sequencing data a normalizing chromosome dose for one or more chromosomes of interest and/or a normalizing segment dose for one or more segments of interest; determining a statistically significant difference between the chromosome dose for the chromosome of interest and/or the segment dose for the segment of interest, respectively, in the test sample and a threshold value established in a plurality of qualified (normal) samples; and providing the prenatal diagnosis based on the statistical difference. As described in step 160 of the method, a diagnosis of normal or affected is made. A "no-call" is provided when the diagnosis of normal or affected cannot be made with confidence.

In some embodiments, two thresholds can be chosen. A first threshold is chosen to minimize the false positive rate; samples above it are classified as "affected". A second threshold is chosen to minimize the false negative rate; samples below it are classified as "unaffected". Samples with NCVs above the second threshold but below the first threshold can be classified as "aneuploidy suspected" or "no-call" samples, for which the presence or absence of aneuploidy can be confirmed by independent means. The region between the first and second thresholds can be referred to as the "no-call" region.

In some embodiments, the suspected and no-call thresholds are shown in Table 1. As can be seen, the NCV thresholds vary across chromosomes. In some embodiments, the thresholds vary according to the fetal fraction (FF) of the sample, as explained above. In some embodiments, the threshold techniques applied here contribute to improved sensitivity and selectivity.

Table 1. Suspected and affected NCV thresholds bracketing the no-call range

| | Suspected | Affected |
|---|---|---|
| Chromosome 13 | 3.5 | 4.0 |
| Chromosome 18 | 3.5 | 4.5 |
| Chromosome 21 | 3.5 | 4.0 |
| Chromosome X (XO, XXX) | 4.0 | 4.0 |
| Chromosome Y (XX vs. XY) | 6.0 | 6.0 |
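The two-threshold scheme can be sketched for the autosomes using the "suspected" and "affected" values 3.5/4.0 (chromosomes 13 and 21) and 3.5/4.5 (chromosome 18) from Table 1. How values exactly on a threshold are handled is an assumption of this sketch, and the sex chromosomes are left out because their two thresholds coincide and their interpretation differs.

```python
# (suspected, affected) NCV thresholds for the autosomes in Table 1.
THRESHOLDS = {
    "chr13": (3.5, 4.0),
    "chr18": (3.5, 4.5),
    "chr21": (3.5, 4.0),
}

def classify(chromosome, ncv_value):
    """Classify a sample for one chromosome from its NCV."""
    suspected, affected = THRESHOLDS[chromosome]
    if ncv_value > affected:
        return "affected"
    if ncv_value < suspected:
        return "unaffected"
    # Between the two thresholds (boundaries included, by assumption).
    return "no-call"

calls = [classify("chr21", 5.1), classify("chr18", 4.2), classify("chr13", 1.0)]
```

An NCV of 4.2 is a no-call for chromosome 18 but would be called affected for chromosome 21, which shows why the thresholds are kept per chromosome.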

Determining Copy Number Using a Three-Pass Process, Likelihood Ratios, T-Statistics, and/or Fetal Fraction

Three-Pass Process

Figure 16A shows a flowchart of a three-pass process for evaluating copy number. It includes three overlapping passes of workflow 700: pass 1 (or 713A), which analyzes the coverage of reads associated with fragments of all sizes; pass 2 (or 713B), which analyzes the coverage of reads associated with shorter fragments; and pass 3 (or 713C), which analyzes the relative frequency of shorter reads relative to all reads.

After the read counts are obtained, coverage is determined in pass 713A using reads from fragments of all sizes. In pass 713B, coverage is determined using reads from short fragments. In pass 713C, the frequency of reads from short fragments relative to all reads is determined. The relative frequency is also referred to elsewhere herein as a size ratio or size fraction, and it is an example of a fragment size characteristic. In some implementations, short fragments are fragments shorter than about 150 base pairs. In various implementations, short fragments can be in the size range of about 50 to 150, 80 to 150, or 110 to 150 base pairs. In some implementations, the third pass, 713C, is optional.
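The size ratio of pass 713C can be sketched as the share of fragments whose length falls within a "short" window. The 80 to 150 bp default is one of the ranges given in the text; treating both bounds as inclusive and working from a flat list of fragment lengths are assumptions of the example.

```python
def short_fragment_fraction(fragment_lengths, low=80, high=150):
    """Fraction of fragments whose length lies in [low, high] base pairs."""
    short = sum(1 for length in fragment_lengths if low <= length <= high)
    return short / len(fragment_lengths)

# Hypothetical fragment lengths (bp) for the reads falling in one bin.
lengths = [60, 95, 120, 149, 166, 170, 180, 200]
size_ratio = short_fragment_fraction(lengths)
```

In practice the ratio would be computed per bin, giving one size-ratio value per bin alongside the two coverage values from passes 713A and 713B.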

The data from all three passes, 713A, 713B, and 713C, undergo normalization operations 714, 716, 718, 719, and 722 to remove variance unrelated to the copy number of the sequence of interest. These normalization operations are grouped in box 723. Operation 714 normalizes the analyzed quantity of the sequence of interest by dividing it by the total value of that quantity for a reference sequence. This normalization step uses values obtained from the test sample. Similarly, operations 718 and 722 normalize the analyzed quantity using values obtained from the test sample. Operations 716 and 719 use values obtained from a training set of unaffected samples.

Operation 716 removes variance due to a global wave profile obtained from the training set of unaffected samples. Operation 718 removes individual-specific GC variance.

Operation 719 removes further variance using a principal component analysis (PCA) method. The variance removed by the PCA method is attributable to factors unrelated to the copy number of the sequence of interest. The analyzed quantity in each bin (coverage, fragment size ratio, etc.) provides an independent variable for the PCA, and the samples of the unaffected training set provide values for these independent variables. The training set consists of samples that all have the same copy number of the sequence of interest, e.g., two copies of an autosome, one copy of the X chromosome (when male samples are used as unaffected samples), or two copies of the X chromosome (when female samples are used as unaffected samples). The variance among these samples therefore does not result from aneuploidy or other copy number differences. PCA of the training set yields principal components that are unrelated to the copy number of the sequence of interest. These principal components can then be used to remove, from the test sample, variance unrelated to the copy number of the sequence of interest.

In certain embodiments, the variance of one or more principal components is removed from the test sample's data using coefficients estimated from data of unaffected samples in regions outside the sequence of interest. In some implementations, the region represents all robust chromosomes. For example, PCA is performed on the normalized bin coverage data of the normal training samples, yielding principal components that correspond to the dimensions capturing the most variance in the data. The variance so captured is unrelated to copy number variation in the sequence of interest. Once the principal components have been obtained from the normal training samples, they are applied to the test data. A linear regression model is fitted across bins from regions outside the sequence of interest, with the test sample as the response variable and the principal components as the explanatory variables. The resulting regression coefficients are used to normalize the bin coverage of the region of interest by subtracting the linear combination of principal components defined by the estimated coefficients. This removes variance unrelated to CNV from the sequence of interest. See box 719. The residual data are used for downstream analysis. In addition, operation 722 removes outlier data points.

After the normalization operations in box 723, the coverage values of all bins have been "normalized" to remove sources of variation other than aneuploidy or other copy number variations. In a sense, the bins of the sequence of interest are enriched or altered relative to other bins for the purpose of detecting copy number variation. See box 724, which is not an operation but represents the resulting coverage values. The normalization operations in large box 723 may increase the signal and/or reduce the noise of the quantity being analyzed. Similarly, the coverage values of short fragments per bin have been normalized to remove sources of variation other than aneuploidy or other copy number variations, as shown in box 728, and the relative frequency (or size ratio) of short fragments per bin has likewise been normalized, as shown in box 732. Like box 724, boxes 728 and 732 are not operations but represent the coverage and relative frequency values after the processing in large box 723. It should be understood that the operations in large box 723 may be modified, rearranged, or removed. For example, in some embodiments, PCA operation 719 is not performed. In other embodiments, GC correction operation 718 is not performed. In other embodiments, the order of the operations is changed; for example, PCA operation 719 is performed before GC correction operation 718.

The coverage of all fragments after normalization and variance removal, shown in box 724, is used to obtain a t-statistic in box 726. Similarly, the coverage of short fragments after normalization and variance removal, shown in box 728, is used to obtain a t-statistic in box 730, and the relative frequency of short fragments after normalization and variance removal, shown in box 732, is used to obtain a t-statistic in box 734.

Applying t-statistics to copy number analysis can help improve the accuracy of the analysis. Distinguishing two distributions by their means alone does not capture the variance of the two distributions, whereas a t-statistic reflects both the mean and the variance of a distribution.

In some implementations, operation 726 calculates the t-statistic as follows:

$$t = \frac{x_1 - x_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $x_1$ is the bin coverage of the sequence of interest, $x_2$ is the bin coverage of the reference region/sequence, $s_1$ is the standard deviation of the coverage of the sequence of interest, $s_2$ is the standard deviation of the coverage of the reference region, $n_1$ is the number of bins in the sequence of interest, and $n_2$ is the number of bins in the reference region.
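A t-statistic built from these six quantities can be sketched as an unequal-variance (Welch-style) statistic. The sketch assumes that x₁ and x₂ are the mean normalized bin coverages of the two regions and that the standard deviations are sample standard deviations; the function name and the toy bin values are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def bin_t_statistic(interest_bins, reference_bins):
    """Unequal-variance t-statistic comparing bin coverages of the sequence
    of interest with bin coverages of the reference region."""
    n1, n2 = len(interest_bins), len(reference_bins)
    x1, x2 = mean(interest_bins), mean(reference_bins)
    s1, s2 = stdev(interest_bins), stdev(reference_bins)
    return (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)

# Toy coverages: means 4 and 2, sample variances 4 and 1, three bins each.
t_value = bin_t_statistic([2, 4, 6], [1, 2, 3])
```

A positive value indicates higher coverage in the sequence of interest than in the reference region; swapping the two arguments flips the sign.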

In some implementations, the reference region includes all robust chromosomes (e.g., chromosomes other than those most likely to harbor an aneuploidy). In some implementations, the reference region includes at least one chromosome outside the sequence of interest. In some simulations, the reference region includes robust chromosomes that do not contain the sequence of interest. In other implementations, the reference region includes a set of chromosomes (e.g., a subset of chromosomes selected from the robust chromosomes) that has been determined to provide the best signal detection ability for a set of training samples. In some embodiments, the signal detection ability is based on the ability of a reference region to discriminate bins harboring copy number variations from bins that do not. In some embodiments, the reference region is identified in a manner similar to that used to determine a "normalizing sequence" or "normalizing chromosome", as described in the section titled "Identification of Normalizing Sequences".

Determination of Fetal Fraction

Returning to Figure 16A, one or more fetal fraction estimates (box 735) may be combined with any of the t-statistics in boxes 726, 730, and 734 to obtain a likelihood estimate for a ploidy case. See box 736. In some implementations, the one or more fetal fractions are obtained by any one of process 800 in Figure 16B, process 900 in Figure 16C, or process 1000 in Figure 16D. These processes may be implemented in parallel using a workflow such as workflow 700 in Figure 16A.

Figure 16B shows an example process 800 for determining fetal fraction from coverage information according to some implementations of the disclosure. Process 800 starts by obtaining coverage information (e.g., sequence dose values) for the training samples of a training set. See box 802. Each sample in the training set is obtained from a pregnant woman known to be carrying a male fetus; that is, the sample contains cfDNA of a male fetus. In some implementations, operation 802 may obtain sequence coverage normalized in a manner different from the sequence dose described herein, or it may obtain other coverage values.

Process 800 then involves calculating the fetal fractions of the training samples. In some implementations, the fetal fraction may be calculated from the sequence dose values:

$$FF_j = -2 \times \left( \frac{Rx_j}{\operatorname{median}(Rx_i)} - 1 \right)$$

where $Rx_j$ is the sequence dose for a male sample and $\operatorname{median}(Rx_i)$ is the median of the sequence doses for female samples. In other implementations, the mean or another measure of central tendency may be used. In some implementations, the FF may be obtained by other methods, such as from the relative frequencies of the X and Y chromosomes. See box 804.
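The calculation can be illustrated under two assumptions that the text does not spell out: that Rx denotes the X chromosome dose, and that the fetal fraction follows from the shortfall of that dose in a male-fetus sample relative to the female median, since a male fetus contributes one X copy instead of two. The function name and dose values are hypothetical.

```python
from statistics import median

def fetal_fraction_from_x_dose(male_dose, female_doses):
    """Fetal fraction from the X chromosome dose of a male-fetus sample,
    assuming FF = -2 * (Rx_j / median(Rx_i) - 1)."""
    return -2 * (male_dose / median(female_doses) - 1)

# Hypothetical X chromosome doses.
female_doses = [0.0500, 0.0502, 0.0498]   # samples carrying female fetuses
male_dose = 0.0475                        # sample carrying a male fetus

ff = fetal_fraction_from_x_dose(male_dose, female_doses)
```

A male-fetus sample whose X dose is 5% below the female median corresponds to a fetal fraction of about 10% under this form.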

Process 800 further involves dividing the reference sequence into a plurality of bins of subsequences. In some implementations, the reference sequence is a complete genome. In some implementations, the bins are 100 kb bins. In some implementations, the genome is divided into about 25,000 bins. The process then obtains the coverages of the bins. See box 806. In some implementations, the coverages used in box 806 are obtained after the normalization operations shown in box 723 of Figure 16A. In other implementations, coverages from different size ranges may be used.
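Per-bin coverage with fixed 100 kb bins can be sketched as a simple count of read start positions. The `(chromosome, position)` read representation, 0-based positions, and the name `bin_coverage` are assumptions of the example; any normalization from box 723 would be applied to these raw counts afterwards.

```python
from collections import Counter

BIN_SIZE = 100_000  # 100 kb bins, as in the text

def bin_coverage(read_positions, bin_size=BIN_SIZE):
    """Count reads per (chromosome, bin index), using 0-based start positions."""
    counts = Counter()
    for chromosome, position in read_positions:
        counts[(chromosome, position // bin_size)] += 1
    return counts

reads = [("chr1", 50_000), ("chr1", 99_999), ("chr1", 100_000), ("chr2", 250_000)]
coverage = bin_coverage(reads)
```

Integer division assigns positions 0 to 99,999 to bin 0, 100,000 to 199,999 to bin 1, and so on.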

Each bin is associated with the coverages of the samples in the training set. Therefore, for each bin, a correlation can be obtained between the coverages of the samples and the fetal fractions of the samples. Process 800 involves obtaining the correlations between fetal fraction and coverage for all bins. See box 808. The process then selects the bins with correlation values above a threshold. See box 810. In some implementations, the 6000 bins with the highest correlation values are selected. The purpose is to identify bins that show a high correlation between coverage and fetal fraction in the training samples. These bins can then be used to predict the fetal fraction of a test sample. Although the training samples are male samples, the correlation between fetal fraction and coverage can be generalized to both male and female test samples.
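The bin selection of boxes 808 and 810 can be sketched with a Pearson correlation per bin followed by a top-k cut. Using Pearson correlation and ranking by the signed value are assumptions; the text only refers to "correlation values". The bin names and numbers are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient (inputs must not be constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def select_bins(bin_coverage, fetal_fractions, top_k):
    """Return the top_k bins whose coverage across training samples
    correlates most strongly with the samples' fetal fractions."""
    scores = {name: pearson(cov, fetal_fractions) for name, cov in bin_coverage.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# One fetal fraction per training sample, and per-bin coverage in the same order.
fetal_fractions = [0.05, 0.10, 0.15, 0.20]
training_coverage = {
    "bin_a": [1.00, 1.02, 1.04, 1.06],   # tracks fetal fraction closely
    "bin_b": [1.00, 1.00, 1.01, 0.99],   # essentially unrelated
    "bin_c": [1.00, 1.03, 1.02, 1.05],   # tracks it loosely
}
selected = select_bins(training_coverage, fetal_fractions, top_k=2)
```

With about 25,000 bins, `top_k=6000` would reproduce the selection size mentioned in the text.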

Using the selected bins with high correlation values, the process obtains a linear model relating fetal fraction to coverage. See box 812. Each selected bin provides an independent variable for the linear model, so the resulting model includes a parameter, or weight, for each bin. The bin weights are adjusted to fit the model to the data. After the linear model is obtained, process 800 involves applying the coverage data of the test sample to the model to determine the fetal fraction of the test sample. See box 814. The test sample coverage data applied are those for the bins having a high correlation between fetal fraction and coverage.

Figure 16A shows workflow 700 for processing sequence reads, information from which can be used to obtain fetal fraction estimates. In some implementations, one or more of the normalization operations in box 723 are optional. Pass 1 provides coverage information that may be used in box 806 of process 800 shown in Figure 16B.

In some implementations, multiple fetal fraction estimates can be combined to provide a composite fetal fraction estimate. Various methods can be used to obtain the individual estimates. For example, fetal fraction can be obtained from coverage information. See process 800 of Figure 16B. In some implementations, fetal fraction can also be estimated from the size distribution of fragments. See process 900 of Figure 16C. In some implementations, fetal fraction can also be estimated from the 8-mer frequency distribution. See process 1000 of Figure 16D.

In test samples that include cfDNA of a male fetus, fetal fraction can also be estimated from the coverage of the Y chromosome and/or the X chromosome. In some implementations, a composite estimate of the fetal fraction for a putatively male fetus is obtained using information selected from: a fetal fraction obtained from coverage information of bins, a fetal fraction obtained from fragment size information, a fetal fraction obtained from Y chromosome coverage, a fetal fraction obtained from the X chromosome, and any combination thereof. In some implementations, the putative sex of the fetus is obtained using the coverage of the Y chromosome. Two or more fetal fraction estimates can be combined in various ways to provide a composite estimate. For example, in some implementations an average or a weighted average can be used, where the weighting can be based on the statistical confidence of each fetal fraction estimate.

In some implementations, a composite estimate of the fetal fraction for a putatively female fetus is obtained using information selected from: a fetal fraction obtained from coverage information of bins, a fetal fraction obtained from fragment size information, and any combination thereof.
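The weighted-average combination mentioned above can be illustrated with inverse-variance weighting. This is only a sketch under assumptions: the text says the weights may reflect statistical confidence but does not specify how, so both the inverse-variance choice and the function name `combine_fetal_fractions` are hypothetical.

```python
def combine_fetal_fractions(estimates):
    """Combine fetal fraction estimates into one composite value.

    estimates: list of (value, standard_error) pairs, one per method
    (for example coverage-based, size-based, chromosome-Y-based).
    Each estimate is weighted by 1 / standard_error**2 (an assumed scheme).
    """
    weights = [1.0 / (se * se) for _, se in estimates]
    weighted_sum = sum(w * value for w, (value, _) in zip(weights, estimates))
    return weighted_sum / sum(weights)
```

For example, combining 0.10 (standard error 0.01) with 0.14 (standard error 0.02) gives 0.108, pulled toward the more confident estimate.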

Figure 16C shows a process, according to some implementations, for determining fetal fraction from size distribution information. Process 900 begins by obtaining coverage information (e.g., sequence dose values) for male training samples from a training set. See block 902. Process 900 then includes calculating the fetal fractions of the training samples using the methods described above for block 804. See block 904.

Process 900 continues by dividing a size range into a plurality of bins to provide fragment-size-based bins and determining the read frequencies of those bins. See block 906. In some implementations, the frequencies of the fragment-size-based bins are obtained without normalizing for the factors shown in block 723. In some implementations, the frequencies of the fragment-size-based bins are obtained after optionally performing the normalization operations shown in block 723 of Figure 16A. In some implementations, the size range is divided into 40 bins. In some implementations, the bin at the low end includes fragments smaller than about 55 base pairs. In some implementations, the bin at the low end includes fragments in the range of about 50 to 55 base pairs, which excludes information from reads shorter than 50 bp. In some implementations, the bin at the high end includes fragments larger than about 245 base pairs. In some implementations, the bin at the high end includes fragments in the range of about 245 to 250 base pairs, which excludes information from reads longer than 250 bp.
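As an illustration of block 906, the sketch below assumes the variant in which the 40 bins span 50 to 250 base pairs, which implies equal 5 bp bins. The equal-width layout and the function name `size_bin_frequencies` are assumptions, not details stated in the text.

```python
def size_bin_frequencies(fragment_sizes, low=50, high=250, n_bins=40):
    """Return the fraction of fragments falling in each size bin.

    Fragments shorter than `low` or of length `high` and above are dropped,
    mirroring the variant that excludes reads outside the 50-250 bp range.
    """
    width = (high - low) / n_bins  # 5 bp with the default arguments
    counts = [0] * n_bins
    for size in fragment_sizes:
        if low <= size < high:
            counts[int((size - low) // width)] += 1
    kept = sum(counts)
    return [c / kept for c in counts] if kept else [0.0] * n_bins
```

The resulting 40 frequencies per sample would serve as the independent variables of the size-based linear model.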

Process 900 continues by using the data of the training samples to obtain a linear model relating fetal fraction to the read frequencies of the fragment-size-based bins. See block 908. The linear model has the read frequencies of the size-based bins as independent variables, and it includes a parameter, or weight, for each size-based bin. The bin weights are adjusted to fit the model to the data. After the linear model is obtained, process 900 includes applying the read frequency data of a test sample to the model to determine the fetal fraction of the test sample. See block 910.

In some implementations, 8-mer frequencies can be used to calculate fetal fraction. Figure 16D shows an exemplary process 1000, according to some implementations of the present disclosure, for determining fetal fraction from 8-mer frequency information. Process 1000 begins by obtaining coverage information (e.g., sequence dose values) for male training samples from a training set. See block 1002. Process 1000 then includes calculating the fetal fractions of the training samples using any of the methods described for block 804. See block 1004.

Process 1000 also includes obtaining, from the reads of each training sample, the frequencies of 8-mers (e.g., all possible arrangements of 4 nucleotides at 8 positions). See block 1006. In some implementations, up to or close to 65,536 (that is, 4^8) 8-mers and their frequencies are obtained. In some implementations, the 8-mer frequencies are obtained without normalizing for the factors shown in block 723. In some implementations, the 8-mer frequencies are obtained after optionally performing the normalization operations.
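A minimal sketch of block 1006 follows. It assumes that every overlapping window of each read is counted and that windows containing ambiguous bases such as N are skipped; neither detail is stated in the text, and the function name `kmer_frequencies` is hypothetical.

```python
from collections import Counter

N_POSSIBLE_8MERS = 4 ** 8  # 65,536 distinct 8-mers over A, C, G, T


def kmer_frequencies(reads, k=8):
    """Return the relative frequency of each k-mer observed in the reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows with ambiguous bases
                counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}
```

Each sample then yields a sparse vector of at most 65,536 frequencies, from which the 8-mers best correlated with fetal fraction are selected.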

Each 8-mer is associated with its frequencies in the samples of the training set. A correlation can therefore be obtained, for each 8-mer, between the samples' 8-mer frequencies and the samples' fetal fractions. Process 1000 includes obtaining the correlations between fetal fraction and 8-mer frequency for all 8-mers. See block 1008. The process then selects the 8-mers whose correlation values are above a threshold. See block 1010. The aim is to identify 8-mers that show a high correlation between 8-mer frequency and fetal fraction in the training samples. These 8-mers can then be used to predict fetal fraction in a test sample. Although the training samples are male samples, the correlation between fetal fraction and 8-mer frequency can be applied to both male and female test samples.

Using the selected 8-mers having high correlation values, the process obtains a linear model relating fetal fraction to 8-mer frequency. See block 1012. Each selected 8-mer provides an independent variable for the linear model, so the model includes a parameter, or weight, for each selected 8-mer. After the linear model is obtained, process 1000 includes applying the 8-mer frequency data of a test sample to the model to determine the fetal fraction of the test sample. See block 1014.

In some implementations, when fetal fraction is calculated, the coverages (or other parameters) of different portions of the genome are weighted differently. Examples of such methods include the SeqFF method described in US Patent Application Publication US2015/0005176 and in Kim et al. (2015), Prenatal Diagnosis, Vol. 35, pp. 1-6, which are incorporated by reference in their entireties for the purpose of calculating fetal fraction. In some implementations, bins having a higher fraction of cell-free nucleic acid fragments are given greater weight in determining fetal fraction.

Determining Likelihood Ratios

Returning to Figure 16A, in some implementations process 700 includes, in operation 736, obtaining a final ploidy likelihood using the t-statistic based on the coverage of all fragments provided by operation 726, the fetal fraction estimate provided by operation 726, and the t-statistic based on the coverage of short fragments provided by operation 730. These implementations use a multivariate normal model to combine the results of pathway 1 and pathway 2. In some implementations for evaluating CNV, the ploidy likelihood is an aneuploidy likelihood, which is the likelihood of a model having an aneuploid assumption (e.g., trisomy or monosomy) minus the likelihood of a model having a euploid assumption, where the model takes the t-statistic based on the coverage of all fragments, the fetal fraction estimate, and the t-statistic based on the coverage of short fragments as inputs and provides a likelihood as output.

In some implementations, the ploidy likelihood is expressed as a likelihood ratio. In some implementations, the likelihood ratio is modeled as:

where p₁ represents the likelihood that the data come from a multivariate normal distribution representing a 3-copy or 1-copy model, p₀ represents the likelihood that the data come from a multivariate normal distribution representing a 2-copy model, T_short and T_all are T scores calculated from the chromosome coverage generated from short fragments and from all fragments, respectively, and q(ff_total) is the density distribution of the fetal fraction (estimated from the training data) that takes into account the error associated with the fetal fraction estimate. The model combines the coverage generated from short fragments with the coverage generated from all fragments, which helps to improve the separation between the coverage scores of affected and unaffected samples. In the described embodiment, the model also makes use of fetal fraction, which further improves the ability to discriminate between affected and unaffected samples. Here, the likelihood ratio is calculated using the t-statistic based on the coverage of all fragments (726), the t-statistic based on the coverage of short fragments (730), and the fetal fraction estimate provided by process 800 (or block 726), 900, or 1000, as described above. In some implementations, this likelihood ratio is used to analyze chromosomes 13, 18, and 21.
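The equation that these definitions refer to does not survive in this text. One form consistent with the symbols defined above, offered as an assumption rather than as the original formula, sums the aneuploid-model likelihood over the fetal fraction density and divides by the euploid-model likelihood. In particular, conditioning p₁ on ff_total is a guess about the original notation.

```latex
LR = \frac{\sum_{ff_{total}} q(ff_{total}) \, p_1\left(T_{short}, T_{all} \mid ff_{total}\right)}{p_0\left(T_{short}, T_{all}\right)}
```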

In some implementations, the ploidy likelihood obtained by operation 736 uses only the t-statistic obtained from the relative frequency of short fragments provided by operation 734 of pathway 3 and the fetal fraction estimate provided by operation 726 or by process 800, 900, or 1000. The likelihood ratio can be calculated according to the following formula:

where p₁ represents the likelihood that the data come from a multivariate normal distribution representing a 3-copy or 1-copy model, p₀ represents the likelihood that the data come from a multivariate normal distribution representing a 2-copy model, T_short-freq is the T score calculated from the relative frequency of short fragments, and q(ff_total) is the density distribution of the fetal fraction (estimated from the training data) that takes into account the error associated with the fetal fraction estimate. Here, the likelihood ratio is calculated using the t-statistic based on the relative frequency of short fragments (734) and the fetal fraction estimate provided by process 800 (or block 726), 900, or 1000, as described above. In some implementations, this likelihood ratio is used to analyze the X chromosome.
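The formula itself is also missing from this text. A form consistent with the definitions just given, stated here as an assumption and not as the original equation, is:

```latex
LR = \frac{\sum_{ff_{total}} q(ff_{total}) \, p_1\left(T_{short\text{-}freq} \mid ff_{total}\right)}{p_0\left(T_{short\text{-}freq}\right)}
```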

In some implementations, the likelihood ratio is calculated using the t-statistic based on the coverage of all fragments (726), the t-statistic based on the coverage of short fragments (730), and the relative frequency of short fragments (734). In addition, the fetal fraction obtained as described above can be combined with the t-statistics to calculate the likelihood ratio. Combining information from any of the three pathways 713A, 713B, and 713C can improve the discriminative power of the ploidy evaluation. In some implementations, different combinations can be used to obtain the likelihood ratio for a chromosome, for example, the t-statistics from all three pathways, the t-statistics from the first and second pathways, the fetal fraction and three t-statistics, the fetal fraction and one t-statistic, and so on. The best combination can then be selected based on model performance.

In the various implementations described above, the euploid and aneuploid models take t-statistics as inputs. However, the models can of course also take raw or otherwise transformed coverage or abundance values as inputs and provide a likelihood as output. Using t-statistics or otherwise transformed values as inputs can help improve the predictive power of the models, but such a transformation is not required in all implementations.

In some implementations for evaluating autosomes, the modeled likelihood ratio represents the likelihood that the modeled data were obtained from a trisomy or monosomy sample relative to the likelihood that the modeled data were obtained from a diploid sample. In some implementations, such a likelihood ratio can be used to determine trisomy or monosomy of an autosome.

In some implementations for evaluating sex chromosomes, a likelihood ratio for monosomy X and a likelihood ratio for trisomy X are evaluated. In addition, chromosome coverage measurements (e.g., CNV or coverage z-scores) for the X chromosome and the Y chromosome are evaluated. In some implementations, the four values are evaluated using a decision tree to determine the copy number of the sex chromosomes. In some implementations, the decision tree allows determination of a ploidy case of XX, XY, X, XXY, XXX, or XYY.

In some implementations, the likelihood ratio is converted to a log likelihood ratio, and the criterion or threshold for calling an aneuploidy or a copy number variation can be set empirically to obtain a particular sensitivity and selectivity. For example, when the log likelihood ratio is applied to a training set, a threshold of 1.5 can be set, based on the sensitivity and selectivity of the model, for calling trisomy 13 or trisomy 18. As another example, in some applications the call criterion value for trisomy of chromosome 21 can be set to 3.
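The thresholding step can be sketched as below, using the example criterion values of 1.5 for chromosomes 13 and 18 and 3 for chromosome 21. The natural logarithm, the strict "greater than" comparison, and the names `LLR_THRESHOLDS` and `call_trisomy` are assumptions, since the text does not state the log base or how ties are handled.

```python
import math

# Example call criteria from the text; keys are hypothetical labels.
LLR_THRESHOLDS = {"chr13": 1.5, "chr18": 1.5, "chr21": 3.0}


def call_trisomy(chromosome, likelihood_ratio):
    """Return True when the log likelihood ratio exceeds the chromosome's threshold."""
    log_lr = math.log(likelihood_ratio)  # natural log is an assumption
    return log_lr > LLR_THRESHOLDS[chromosome]
```

With these values, a sample whose log likelihood ratio is 2.0 would be called for chromosome 13 but not for chromosome 21.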

Samples and Sample Processing

Samples

Samples used for determining a CNV (e.g., chromosomal aneuploidies, partial aneuploidies, and the like) can include samples taken from any cell, tissue, or organ in which copy number variations for one or more sequences of interest are to be determined. Desirably, the samples contain nucleic acids that are present in cells and/or nucleic acids that are “cell-free” (e.g., cfDNA).

In some embodiments it is advantageous to obtain cell-free nucleic acids, e.g., cell-free DNA (cfDNA). Cell-free nucleic acids, including cell-free DNA, can be obtained by various methods known in the art from biological samples including but not limited to plasma, serum, and urine (see, e.g., Fan et al., Proc Natl Acad Sci 105:16266-16271 [2008]; Koide et al., Prenatal Diagnosis 25:604-607 [2005]; Chen et al., Nature Med. 2:1033-1035 [1996]; Lo et al., Lancet 350:485-487 [1997]; Botezatu et al., Clin Chem. 46:1078-1084 [2000]; and Su et al., J Mol. Diagn. 6:101-107 [2004]). To separate cell-free DNA from cells in a sample, various methods can be used, including but not limited to fractionation, centrifugation (e.g., density gradient centrifugation), DNA-specific precipitation, high-throughput cell sorting, and/or other separation methods. Commercially available kits for manual and automated separation of cfDNA are available (Roche Diagnostics, Indianapolis, IN; Qiagen, Valencia, CA; Macherey-Nagel, Duren, DE). Biological samples comprising cfDNA have been used in assays to determine the presence or absence of chromosomal abnormalities (e.g., trisomy 21) by sequencing assays that can detect chromosomal aneuploidies and/or various polymorphisms.

In various embodiments, the cfDNA present in the sample can be enriched specifically or non-specifically prior to use (e.g., prior to preparing a sequencing library). Non-specific enrichment of sample DNA refers to whole-genome amplification of the genomic DNA fragments of the sample, which can be used to increase the level of sample DNA prior to preparing a cfDNA sequencing library. Non-specific enrichment can be the selective enrichment of one of the two genomes present in a sample that comprises more than one genome. For example, non-specific enrichment can be selective for the fetal genome in a maternal sample, which can be achieved by known methods to increase the relative proportion of fetal to maternal DNA in the sample. Alternatively, non-specific enrichment can be the non-selective amplification of both genomes present in the sample. For example, non-specific amplification can be of fetal and maternal DNA in a sample comprising a mixture of DNA from the fetal and maternal genomes. Methods for whole-genome amplification are known in the art. Degenerate oligonucleotide-primed PCR (DOP), primer extension PCR (PEP), and multiple displacement amplification (MDA) are examples of whole-genome amplification methods. In some embodiments, a sample comprising a mixture of cfDNA from different genomes is not enriched for the cfDNA of the genomes present in the mixture. In other embodiments, a sample comprising a mixture of cfDNA from different genomes is non-specifically enriched for any one of the genomes present in the sample.

The sample comprising the nucleic acids to which the methods described herein are applied typically comprises a biological sample (“test sample”), e.g., as described above. In some embodiments, the nucleic acids to be screened for one or more CNVs are purified or isolated by any of a number of well-known methods.

Accordingly, in certain embodiments the sample comprises or consists of purified or isolated polynucleotides, or it can comprise a sample such as a tissue sample, a biological fluid sample, a cell sample, and the like. Suitable biological fluid samples include, but are not limited to, blood, plasma, serum, sweat, tears, sputum, urine, ear discharge, lymph, saliva, cerebrospinal fluid, lavage fluid, bone marrow suspension, vaginal fluid, transcervical lavage fluid, brain fluid, ascites, milk, secretions of the respiratory, intestinal, and genitourinary tracts, amniotic fluid, and leukapheresis samples. In some embodiments, the sample is one that is easily obtainable by non-invasive procedures, e.g., blood, plasma, serum, sweat, tears, sputum, urine, ear discharge, saliva, or feces. In certain embodiments, the sample is a peripheral blood sample or the plasma and/or serum fractions of a peripheral blood sample. In other embodiments, the biological sample is a swab or smear, a biopsy specimen, or a cell culture. In another embodiment, the sample is a mixture of two or more biological samples, e.g., a biological sample can comprise two or more of a biological fluid sample, a tissue sample, and a cell culture sample. As used herein, the terms “blood,” “plasma,” and “serum” expressly encompass fractions or processed portions thereof. Similarly, where a sample is taken from a biopsy, swab, smear, etc., the “sample” expressly encompasses a processed fraction or portion derived from the biopsy, swab, smear, etc.

In certain embodiments, samples can be obtained from sources including, but not limited to, samples from different individuals, samples from different developmental stages of the same or different individuals, samples from different diseased individuals (e.g., individuals with cancer or suspected of having a genetic disorder), samples from normal individuals, samples obtained at different stages of a disease in an individual, samples obtained from individuals subjected to different treatments for a disease, samples from individuals subjected to different environmental factors, samples from individuals with a predisposition to a pathology, samples from individuals exposed to an infectious disease agent (e.g., HIV), and the like.

In one illustrative but non-limiting embodiment, the sample is a maternal sample obtained from a pregnant female, for example a pregnant woman. In this instance, the sample can be analyzed using the methods described herein to provide a prenatal diagnosis of potential chromosomal abnormalities in the fetus. The maternal sample can be a tissue sample, a biological fluid sample, or a cell sample. As non-limiting examples, biological fluids include blood, plasma, serum, sweat, tears, sputum, urine, ear discharge, lymph, saliva, cerebrospinal fluid, lavage fluid, bone marrow suspension, vaginal fluid, transcervical lavage fluid, brain fluid, ascites, milk, secretions of the respiratory, intestinal, and genitourinary tracts, and leukapheresis samples.

In another illustrative but non-limiting embodiment, the maternal sample is a mixture of two or more biological samples, e.g., the biological sample can comprise two or more of a biological fluid sample, a tissue sample, and a cell culture sample. In some embodiments, the sample is one that is easily obtainable by non-invasive procedures, e.g., blood, plasma, serum, sweat, tears, sputum, urine, milk, ear discharge, saliva, and feces. In some embodiments, the biological sample is a peripheral blood sample and/or the plasma and serum fractions thereof. In other embodiments, the biological sample is a swab or smear, a biopsy specimen, or a sample of a cell culture. As disclosed above, the terms “blood,” “plasma,” and “serum” expressly encompass fractions or processed portions thereof. Similarly, where a sample is taken from a biopsy, swab, smear, etc., the “sample” expressly encompasses a processed fraction or portion derived from the biopsy, swab, smear, etc.

In certain embodiments, samples can also be obtained from in vitro cultured tissues, cells, or other polynucleotide-containing sources. The cultured samples can be taken from sources including, but not limited to, cultures (e.g., tissue or cells) maintained in different media and conditions (e.g., pH, pressure, or temperature), cultures (e.g., tissue or cells) maintained for different periods of time, cultures (e.g., tissue or cells) treated with different factors or reagents (e.g., a drug candidate or a modulator), or cultures of different types of tissue and/or cells.

Methods of isolating nucleic acids from biological sources are well known and will differ depending on the nature of the source. One of skill in the art can readily isolate nucleic acids from a source as needed for the methods described herein. In some instances, it can be advantageous to fragment the nucleic acid molecules in the nucleic acid sample. Fragmentation can be random, or it can be specific, as achieved, for example, using restriction endonuclease digestion. Methods for random fragmentation are well known in the art and include, for example, limited DNase digestion, alkali treatment, and physical shearing. In one embodiment, sample nucleic acids are obtained as cfDNA, which is not subjected to fragmentation.

Sequencing Library Preparation

In one embodiment, the methods described herein can utilize next-generation sequencing (NGS) technologies, which allow multiple samples to be sequenced individually as genomic molecules (i.e., singleplex sequencing) or as pooled samples comprising indexed genomic molecules (e.g., multiplex sequencing) in a single sequencing run. These methods can generate up to several billion reads of DNA sequences. In various embodiments, the sequences of genomic nucleic acids and/or of indexed genomic nucleic acids can be determined using, for example, the NGS technologies described herein. In various embodiments, analysis of the large amount of sequence data obtained using NGS can be performed using one or more processors as described herein.

In various embodiments, the use of such sequencing technologies does not involve the preparation of sequencing libraries.

However, in certain embodiments the sequencing methods contemplated herein involve the preparation of sequencing libraries. In one illustrative approach, sequencing library preparation involves the production of a random collection of adaptor-modified DNA fragments (e.g., polynucleotides) that are ready to be sequenced. Sequencing libraries of polynucleotides can be prepared from DNA or RNA, including equivalents or analogs of either DNA or cDNA, for example, DNA or cDNA that is complementary or copy DNA produced from an RNA template by the action of reverse transcriptase. The polynucleotides may originate in double-stranded form (e.g., dsDNA such as genomic DNA fragments, cDNA, PCR amplification products, and the like) or, in certain embodiments, the polynucleotides may originate in single-stranded form (e.g., ssDNA, RNA, etc.) and have been converted to dsDNA form. By way of illustration, in certain embodiments single-stranded mRNA molecules may be copied into double-stranded cDNAs suitable for use in preparing a sequencing library. The precise sequence of the primary polynucleotide molecules is generally not material to the method of library preparation and may be known or unknown. In one embodiment, the polynucleotide molecules are DNA molecules. More particularly, in certain embodiments the polynucleotide molecules represent the entire genetic complement of an organism or substantially the entire genetic complement of an organism, and are genomic DNA molecules (e.g., cellular DNA, cell-free DNA (cfDNA), etc.) that typically include both intron sequences and exon (coding) sequences, as well as non-coding regulatory sequences such as promoter and enhancer sequences. In certain embodiments, the primary polynucleotide molecules comprise human genomic DNA molecules, e.g., cfDNA molecules present in the peripheral blood of a pregnant subject.

Preparation of sequencing libraries for some NGS sequencing platforms is facilitated by the use of polynucleotides comprising a specific range of fragment sizes. Preparation of such libraries typically involves the fragmentation of large polynucleotides (e.g., cellular genomic DNA) to obtain polynucleotides in the desired size range.

Fragmentation can be achieved by any of a number of methods known to those of skill in the art. For example, fragmentation can be achieved by mechanical means including, but not limited to, nebulization, sonication, and hydroshearing. However, mechanical fragmentation typically cleaves the DNA backbone at C-O, P-O, and C-C bonds, resulting in a heterogeneous mix of blunt ends and 3'- and 5'-overhanging ends with broken C-O, P-O, and/or C-C bonds (see, e.g., Alnemri and Litwack, J Biol Chem 265:17323-17333 [1990]; Richards and Boyer, J Mol Biol 11:327-240 [1965]). These ends may need to be repaired because they may lack the 5'-phosphate required for subsequent enzymatic reactions, e.g., ligation of the sequencing adaptors that are needed to prepare DNA for sequencing.

In contrast, cfDNA typically exists as fragments of less than about 300 base pairs, and consequently fragmentation is not typically necessary for generating a sequencing library from cfDNA samples.

Typically, whether polynucleotides are forcibly fragmented (e.g., fragmented in vitro) or naturally exist as fragments, they are converted to blunt-ended DNA having 5'-phosphates and 3'-hydroxyls. Standard protocols, e.g., protocols for sequencing on the Illumina platform described elsewhere herein, instruct users to end-repair sample DNA, to purify the end-repaired products prior to dA-tailing, and to purify the dA-tailed products prior to the adaptor-ligation step of library preparation.

Various embodiments of the sequencing library preparation methods described herein obviate the need to perform one or more of the steps typically mandated by standard protocols to obtain a modified DNA product that can be sequenced by NGS. An abbreviated method (ABB method), a 1-step method, and a 2-step method are examples of methods for preparing a sequencing library; they can be found in patent application 13/555,037, filed July 20, 2012, which is incorporated herein by reference in its entirety.

Marker nucleic acids for tracking and verifying sample integrity

In various embodiments, verification of sample integrity and sample tracking can be accomplished by sequencing mixtures of sample genomic nucleic acids (e.g., cfDNA) and accompanying marker nucleic acids that have been introduced into the samples, e.g., prior to processing.

Marker nucleic acids can be combined with the test sample (e.g., a biological source sample) and subjected to processes that include, for example, one or more of the following steps: fractionating the biological source sample (e.g., obtaining an essentially cell-free plasma fraction from a whole blood sample), purifying nucleic acids from a fractionated (e.g., plasma) or unfractionated (e.g., tissue) biological source sample, and sequencing. In some embodiments, sequencing comprises preparing a sequencing library. The sequence or combination of sequences of the marker molecules that are combined with a source sample is chosen to be unique to that source sample. In some embodiments, the unique marker molecules in a sample all have the same sequence. In other embodiments, the unique marker molecules in a sample are a plurality of sequences, e.g., a combination of two, three, four, five, six, seven, eight, nine, ten, fifteen, twenty, or more different sequences.

In one embodiment, the integrity of a sample can be verified using a plurality of marker nucleic acid molecules having identical sequences. Alternatively, the identity of a sample can be verified using a plurality of marker nucleic acid molecules that have at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 25, at least 30, at least 35, at least 40, at least 50, or more different sequences. Verifying the integrity of a plurality of biological samples (i.e., two or more biological samples) requires that each of the two or more samples be marked with marker nucleic acids having sequences that are unique to each of the plurality of test samples being marked. For example, a first sample can be marked with a marker nucleic acid having sequence A, and a second sample can be marked with a marker nucleic acid having sequence B. Alternatively, a first sample can be marked with marker nucleic acid molecules all having sequence A, and a second sample can be marked with a mixture of sequences B and C, wherein sequences A, B, and C are marker molecules having different sequences.
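The marking scheme above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that markers are detected by exact substring matching in reads and that a sample passes only when the detected marker set equals the set assigned to it. The function names, the toy 8-base markers, and the `min_reads` threshold are hypothetical and are not part of the described method.

```python
from typing import Dict, Iterable, Set


def detected_markers(reads: Iterable[str], all_markers: Set[str],
                     min_reads: int = 1) -> Set[str]:
    """Return the markers seen in at least `min_reads` reads (exact match)."""
    counts = {marker: 0 for marker in all_markers}
    for read in reads:
        for marker in all_markers:
            if marker in read:
                counts[marker] += 1
    return {marker for marker, n in counts.items() if n >= min_reads}


def verify_sample(sample_id: str, reads: Iterable[str],
                  expected: Dict[str, Set[str]], min_reads: int = 1) -> bool:
    """True only if the reads contain exactly the markers assigned to sample_id."""
    all_markers = set().union(*expected.values())
    return detected_markers(reads, all_markers, min_reads) == expected[sample_id]


# Hypothetical assignment: sample S1 carries sequence A; S2 carries B and C.
expected = {
    "S1": {"ACGTTGCA"},              # sequence A
    "S2": {"TTGACCGA", "GGCATCAT"},  # sequences B and C
}
ok = verify_sample("S1", ["GGACGTTGCATT", "ACGTTGCA"], expected)
missing = verify_sample("S2", ["TTGACCGA"], expected)
mixed = verify_sample("S1", ["ACGTTGCA", "TTGACCGA"], expected)
```

Under these assumptions, a missing marker or a marker belonging to another sample both cause verification to fail, which is the behavior one would want for detecting sample swaps or cross-contamination.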

The marker nucleic acids can be added to the sample at any stage of sample preparation that occurs prior to library preparation (if libraries are to be prepared) and sequencing. In one embodiment, marker molecules can be combined with an unprocessed source sample. For example, the marker nucleic acid can be provided in a collection tube that is used to collect a blood sample. Alternatively, the marker nucleic acids can be added to the blood sample following the blood draw. In one embodiment, the marker nucleic acid is added to the vessel that is used to collect a biological fluid sample, e.g., the marker nucleic acids are added to a blood collection tube that is used to collect a blood sample. In another embodiment, the marker nucleic acids are added to a fraction of the biological fluid sample. For example, the marker nucleic acid is added to the plasma and/or serum fraction of a blood sample, e.g., a maternal plasma sample. In yet another embodiment, the marker molecules are added to a purified sample, e.g., a sample of nucleic acids that has been purified from a biological sample. For example, the marker nucleic acid is added to a sample of purified maternal and fetal cfDNA. Similarly, the marker nucleic acids can be added to a biopsy specimen prior to processing the specimen. In some embodiments, the marker nucleic acids can be combined with a carrier that delivers the marker molecules into the cells of the biological sample. Cell-delivery carriers include pH-sensitive and cationic liposomes.

In various embodiments, the marker molecules have antigenomic sequences, that is, sequences that are absent from the genome of the biological source sample. In one exemplary embodiment, the marker molecules used to verify the integrity of a human biological source sample have sequences that are absent from the human genome. In an alternative embodiment, the marker molecules have sequences that are absent from the source sample and from any one or more other known genomes. For example, the marker molecules used to verify the integrity of a human biological source sample have sequences that are absent from the human genome and from the mouse genome. This alternative allows for verifying the integrity of a test sample that comprises two or more genomes. For example, the integrity of a human cell-free DNA sample obtained from a subject affected by a pathogen, e.g., a bacterium, can be verified using marker molecules having sequences that are absent from both the human genome and the genome of the affecting bacterium. Genome sequences of numerous pathogens, e.g., bacteria, viruses, yeasts, fungi, protozoa, etc., are publicly available on the World Wide Web at ncbi.nlm.nih.gov/genomes. In another embodiment, marker molecules are nucleic acids having sequences that are absent from any known genome. The sequences of marker molecules can be randomly generated algorithmically.
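One way to picture algorithmic generation of an antigenomic marker is the following minimal Python sketch. It is an illustration under stated assumptions, not the method of this disclosure: candidates are drawn uniformly at random and rejected if any k-mer of the candidate, on either strand, occurs in a set of k-mers taken from the genomes to be excluded. The helper names, the toy reference string, and the choice of k = 8 are hypothetical; a real screen would index complete reference genomes and would typically use alignment rather than exact k-mer lookup.

```python
import random
from typing import Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmers(seq: str, k: int) -> Set[str]:
    """All overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def generate_marker(length: int, excluded_kmers: Set[str], k: int,
                    rng: random.Random, max_tries: int = 10_000) -> str:
    """Draw random sequences until one shares no k-mer with the excluded set."""
    for _ in range(max_tries):
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        # Check both strands so the marker cannot match the reverse strand either.
        seen = kmers(candidate, k) | kmers(reverse_complement(candidate), k)
        if seen.isdisjoint(excluded_kmers):
            return candidate
    raise RuntimeError("no antigenomic candidate found; try a larger k")


# Toy stand-in for the genome(s) to be excluded (hypothetical sequence).
toy_reference = "ACGTACGTTTGACCAGTAGGCATCGATCGGATTACA"
excluded = kmers(toy_reference, 8)
marker = generate_marker(30, excluded, 8, random.Random(0))
```

Excluding several genomes at once, as in the human-plus-pathogen example above, amounts to taking the union of their k-mer sets before generating the marker.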

In various embodiments, marker molecules can be naturally occurring deoxyribonucleic acids (DNA), ribonucleic acids, or artificial nucleic acid analogs (nucleic acid mimics), including peptide nucleic acids (PNA), morpholino nucleic acids, locked nucleic acids, glycol nucleic acids, and threose nucleic acids, which are distinguished from naturally occurring DNA or RNA by changes to the backbone of the molecule or by the absence of a phosphodiester backbone. The deoxyribonucleic acids can be from naturally occurring genomes or can be generated in a laboratory through the use of enzymes or by solid-phase chemical synthesis. Chemical methods can also be used to generate DNA mimics that are not found in nature. Available derivatives of DNA in which the phosphodiester linkage has been replaced but the deoxyribose is retained include, but are not limited to, DNA mimics having backbones formed by thioformacetal or carboxamide linkages, which have been shown to be good structural DNA mimics. Other DNA mimics include morpholino derivatives and the peptide nucleic acids (PNA), which contain an N-(2-aminoethyl)glycine-based pseudopeptide backbone (Ann Rev Biophys Biomol Struct 24:167-183 [1995]). PNA is an extremely good structural mimic of DNA (or of ribonucleic acid [RNA]), and PNA oligomers are able to form very stable duplex structures with Watson-Crick complementary DNA and RNA (or PNA) oligomers; they can also bind to targets in duplex DNA by helix invasion (Mol Biotechnol 26:233-248 [2004]). Another good structural mimic/analog of DNA that can be used as a marker molecule is phosphorothioate DNA, in which one of the non-bridging oxygens is replaced by a sulfur. This modification reduces the action of endonucleases and exonucleases (including 5' to 3' and 3' to 5' DNA POL 1 exonuclease), nucleases S1 and P1, RNases, serum nucleases, and snake venom phosphodiesterase.

The length of the marker molecules can be distinct or indistinct from that of the sample nucleic acids, i.e., the length of the marker molecules can be similar to that of the sample genomic molecules, or it can be greater or smaller than that of the sample genomic molecules. The length of the marker molecules is measured by the number of nucleotide or nucleotide analog bases that constitute the marker molecule. Marker molecules having lengths that differ from those of the sample genomic molecules can be distinguished from source nucleic acids using separation methods known in the art. For example, differences in the length of the marker and sample nucleic acid molecules can be determined by electrophoretic separation, e.g., capillary electrophoresis. Size differentiation can be advantageous for quantifying and assessing the quality of the marker and sample nucleic acids. Preferably, the marker nucleic acids are shorter than the genomic nucleic acids, yet long enough that they can be excluded from mapping to the genome of the sample. For example, a 30-base human sequence is needed to map uniquely to the human genome. Accordingly, in certain embodiments, marker molecules used in sequencing bioassays of human samples should be at least 30 bp in length.

The choice of length of the marker molecule is determined primarily by the sequencing technology that is used to verify the integrity of a source sample. The length of the sample genomic nucleic acids being sequenced can also be considered. For example, some sequencing technologies employ clonal amplification of polynucleotides, which can require that the genomic polynucleotides to be clonally amplified be of a minimum length. For example, sequencing using the Illumina GAII sequence analyzer includes in vitro clonal amplification by bridge PCR (also known as cluster amplification) of polynucleotides that have a minimum length of 110 bp, to which adaptors are ligated to provide a nucleic acid of at least 200 bp and less than 600 bp that can be clonally amplified and sequenced. In some embodiments, the length of the adaptor-ligated marker molecule is between about 200 bp and about 600 bp, between about 250 bp and 550 bp, between about 300 bp and 500 bp, or between about 350 bp and 450 bp. In other embodiments, the length of the adaptor-ligated marker molecule is about 200 bp. For example, when sequencing fetal cfDNA that is present in a maternal sample, the length of the marker molecule can be chosen to be similar to that of fetal cfDNA molecules. Thus, in one embodiment, the length of the marker molecule used in an assay that comprises massively parallel sequencing of cfDNA in a maternal sample to determine the presence or absence of a fetal chromosomal aneuploidy can be about 150 bp, about 160 bp, about 170 bp, about 180 bp, about 190 bp, or about 200 bp; preferably, the marker molecule is about 170 bp. Other sequencing approaches, e.g., SOLiD sequencing, Polony sequencing, and 454 sequencing, use emulsion PCR to clonally amplify DNA molecules for sequencing, and each technology dictates the minimum and the maximum length of the molecules that are to be amplified. The length of marker molecules to be sequenced as clonally amplified nucleic acids can be up to about 600 bp. In some embodiments, the length of marker molecules to be sequenced can be greater than 600 bp.

Single-molecule sequencing technologies do not employ clonal amplification of molecules and are capable of sequencing nucleic acids over a very broad range of template lengths; in most situations they do not require that the molecules to be sequenced be of any specific length. However, the yield of sequences per unit mass depends on the number of 3'-end hydroxyl groups, so relatively short templates are more efficient for sequencing than long templates. If starting with nucleic acids longer than 1000 nt, it is generally advisable to shear the nucleic acids to an average length of 100 to 200 nt so that more sequence information can be generated from the same mass of nucleic acids. Thus, the length of the marker molecule can range from tens of bases to thousands of bases. The length of marker molecules used for single-molecule sequencing can be up to about 25 bp, up to about 50 bp, up to about 75 bp, up to about 100 bp, up to about 200 bp, up to about 300 bp, up to about 400 bp, up to about 500 bp, up to about 600 bp, up to about 700 bp, up to about 800 bp, up to about 900 bp, up to about 1000 bp, or longer.
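The relationship between template length and the number of sequenceable ends per unit mass can be made concrete with a rough calculation. The Python sketch below assumes single-stranded templates, one 3'-hydroxyl end per molecule, and an approximate average mass of 330 g/mol per nucleotide; that mass value and the function name are illustrative assumptions rather than figures taken from this description.

```python
AVOGADRO = 6.02214076e23       # molecules per mole
AVG_NT_MASS_G_PER_MOL = 330.0  # assumed average mass of one nucleotide (ssDNA)


def templates_per_ng(length_nt: int) -> float:
    """Approximate number of single-stranded templates (3' ends) in 1 ng."""
    grams = 1e-9
    molar_mass = length_nt * AVG_NT_MASS_G_PER_MOL
    return grams / molar_mass * AVOGADRO


short = templates_per_ng(150)     # sheared to roughly 150 nt
long_ = templates_per_ng(1000)    # unsheared 1000 nt templates
fold_gain = short / long_         # ratio of available 3' ends per unit mass
```

Because the count scales inversely with length, shearing 1000 nt molecules to about 150 nt yields roughly 1000/150, or between six and seven times, as many templates from the same mass under these assumptions.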

The length chosen for a marker molecule is also determined by the length of the genomic nucleic acid that is being sequenced. For example, cfDNA circulates in the human bloodstream as genomic fragments of cellular genomic DNA. Fetal cfDNA molecules found in the plasma of pregnant women are generally shorter than maternal cfDNA molecules (Chan et al., Clin Chem 50:8892 [2004]). Size fractionation of circulating fetal DNA has confirmed that the average length of circulating fetal DNA fragments is <300 bp, while maternal DNA has been estimated to be between about 0.5 and 1 kb (Li et al., Clin Chem 50:1002-1011 [2004]). These findings are consistent with those of Fan et al., who determined using NGS that fetal cfDNA is rarely >340 bp (Fan et al., Clin Chem 56:1279-1286 [2010]). DNA isolated from urine with a standard silica-based method consists of two fractions: high-molecular-weight DNA, which originates from shed cells, and a low-molecular-weight (150-250 base pair) fraction of transrenal DNA (Tr-DNA) (Botezatu et al., Clin Chem 46:1078-1084, 2000; and Su et al., J Mol Diagn 6:101-107, 2004). Applying a newly developed technique for isolating cell-free nucleic acids from body fluids to the isolation of transrenal nucleic acids has shown that DNA and RNA fragments much shorter than 150 base pairs are present in urine (US Patent Application Publication 20080139801). In embodiments in which cfDNA is the genomic nucleic acid that is sequenced, the marker molecules chosen can be up to about the length of the cfDNA. For example, the length of marker molecules used in maternal cfDNA samples to be sequenced as single nucleic acid molecules or as clonally amplified nucleic acids can be between about 100 bp and 600 bp. In other embodiments, the sample genomic nucleic acids are fragments of larger molecules. For example, the sample genomic nucleic acid that is sequenced is fragmented cellular DNA. In embodiments in which fragmented cellular DNA is sequenced, the length of the marker molecules can be up to the length of the DNA fragments. In some embodiments, the length of the marker molecules is at least the minimum length required for mapping the sequence read uniquely to the appropriate reference genome. In other embodiments, the length of the marker molecule is the minimum length required to exclude the marker molecule from being mapped to the sample reference genome.

In addition, marker molecules can be used to verify samples that are not assayed by nucleic acid sequencing and that can instead be verified by common biotechniques other than sequencing, e.g., real-time PCR.

Sample controls (e.g., in-process positive controls for sequencing and/or analysis)

In various embodiments, marker sequences introduced into the samples, e.g., as described above, can function as positive controls to verify the accuracy and efficacy of sequencing and of subsequent processing and analysis.

Accordingly, compositions and methods are provided for supplying an in-process positive control (IPC) for sequencing DNA in a sample. In certain embodiments, positive controls are provided for sequencing cfDNA in a sample comprising a mixture of genomes. An IPC can be used to relate baseline shifts in sequence information obtained from different sets of samples, e.g., samples that are sequenced at different times on different sequencing runs. Thus, for example, an IPC can relate the sequence information obtained for a maternal test sample to the sequence information obtained from a set of qualified samples that were sequenced at a different time.

Similarly, in the case of segment analysis, an IPC can relate the sequence information obtained from a subject for a particular segment to the sequence information obtained from a set of qualified samples (of similar sequences) that were sequenced at a different time. In certain embodiments, an IPC can relate the sequence information obtained from a subject for particular cancer-related loci to the sequence information obtained from a set of qualified samples (e.g., from known amplifications, deletions, etc.).

In addition, IPCs can be used as markers to track samples through the sequencing process. IPCs can also provide a qualitative positive sequence dose value, e.g., an NCV, for one or more aneuploidies of chromosomes of interest, e.g., trisomy 21, trisomy 13, or trisomy 18, to support proper interpretation and to ensure the dependability and accuracy of the data. In certain embodiments, IPCs can be created to comprise nucleic acids from male and female genomes so as to provide doses for chromosomes X and Y in a maternal sample.

The type and the number of in-process controls depend on the type or nature of the test needed. For example, for a test requiring the sequencing of DNA from a sample comprising a mixture of genomes to determine whether a chromosomal aneuploidy exists, the in-process control can comprise DNA obtained from a sample known to comprise the same chromosomal aneuploidy that is being tested. In some embodiments, the IPC includes DNA from a sample known to comprise an aneuploidy of a chromosome of interest. For example, the IPC for a test to determine the presence or absence of a fetal trisomy, e.g., trisomy 21, in a maternal sample comprises DNA obtained from an individual with trisomy 21. In some embodiments, the IPC comprises a mixture of DNA obtained from two or more individuals with different aneuploidies. For example, for a test to determine the presence or absence of trisomy 13, trisomy 18, trisomy 21, and monosomy X, the IPC comprises a combination of DNA samples obtained from pregnant women each carrying a fetus with one of the aneuploidies being tested. In addition to complete chromosomal aneuploidies, IPCs can be created to provide positive controls for tests to determine the presence or absence of partial aneuploidies.

An IPC that serves as the control for detecting a single aneuploidy can be created using a mixture of cellular genomic DNA obtained from two subjects, one of whom is the contributor of the aneuploid genome. For example, an IPC that serves as a control for a test to determine a fetal trisomy, e.g., trisomy 21, can be created by combining genomic DNA from a male or female subject carrying the trisomic chromosome with genomic DNA from a female subject known not to carry the trisomic chromosome. Genomic DNA can be extracted from cells of both subjects and sheared to provide fragments of between about 100 bp and 400 bp, between about 150 bp and 350 bp, or between about 200 bp and 300 bp to simulate the circulating cfDNA fragments in maternal samples. The proportion of fragmented DNA from the subject carrying the aneuploidy, e.g., trisomy 21, is chosen to simulate the proportion of circulating fetal cfDNA found in maternal samples, providing an IPC comprising a mixture of fragmented DNA in which about 5%, about 10%, about 15%, about 20%, about 25%, or about 30% of the DNA comes from the subject carrying the aneuploidy. The IPC can comprise DNA from different subjects each carrying a different aneuploidy. For example, the IPC can comprise about 80% unaffected female DNA, and the remaining 20% can be DNA from three different subjects carrying trisomy 21, trisomy 13, and trisomy 18, respectively. The mixture of fragmented DNA is prepared for sequencing. Processing of the mixture of fragmented DNA can comprise preparing a sequencing library, which can be sequenced using any massively parallel method in singleplex or multiplex fashion. Stock solutions of the genomic IPC can be stored and used in multiple diagnostic tests.
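The mixing arithmetic for such an IPC can be sketched in a few lines of Python. The sketch assumes that the aneuploid share is divided equally among the contributing subjects, which the text above does not specify, and it uses the common first-order approximation that a fraction f of trisomic DNA raises the relative representation of the affected chromosome by about f/2. The function names and the 100 ng total are hypothetical.

```python
from typing import Dict, List


def ipc_mix(total_ng: float, unaffected_fraction: float,
            aneuploid_sources: List[str]) -> Dict[str, float]:
    """Mass of each DNA source, splitting the aneuploid share equally (assumption)."""
    share = (1.0 - unaffected_fraction) / len(aneuploid_sources)
    mix = {"unaffected": total_ng * unaffected_fraction}
    for name in aneuploid_sources:
        mix[name] = total_ng * share
    return mix


def expected_dose_ratio(trisomic_fraction: float) -> float:
    """First-order expected chromosome dose relative to a euploid sample.

    A trisomic genome carries 3 copies instead of 2, i.e. 1.5x, so a fraction f
    of trisomic DNA gives roughly (1 - f) * 1.0 + f * 1.5 = 1 + f / 2.
    """
    return 1.0 + trisomic_fraction / 2.0


mix = ipc_mix(100.0, 0.80, ["trisomy 21", "trisomy 13", "trisomy 18"])
t21_fraction = mix["trisomy 21"] / sum(mix.values())
t21_ratio = expected_dose_ratio(t21_fraction)
```

With the 80%/20% example and an equal three-way split, each trisomic source contributes about 6.7% of the DNA, so each affected chromosome would be expected to be elevated by only about 3% under this approximation.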

Alternatively, the IPC can be created using cfDNA obtained from a mother known to carry a fetus with a known chromosomal aneuploidy. For example, cfDNA can be obtained from a pregnant woman carrying a fetus with trisomy 21. The cfDNA is extracted from the maternal sample, cloned into a bacterial vector, and grown in bacteria to provide an ongoing source of the IPC. The DNA can be extracted from the bacterial vector using restriction enzymes. Alternatively, the cloned cfDNA can be amplified by, e.g., PCR. The IPC DNA can be processed for sequencing in the same runs as the cfDNA from the test samples that are to be analyzed for the presence or absence of chromosomal aneuploidies.

While the creation of IPCs is described above with respect to trisomies, it will be appreciated that IPCs can be created to reflect other partial aneuploidies including, for example, various segment amplifications and/or deletions. Thus, for example, where various cancers are known to be associated with particular amplifications (e.g., breast cancer associated with 20q13), IPCs can be created that incorporate those known amplifications.

Sequencing methods

As indicated above, the prepared samples (e.g., sequencing libraries) are sequenced as part of the procedure for identifying copy number variation(s). Any of a number of sequencing technologies can be utilized.

Some sequencing technologies are available commercially, such as the sequencing-by-hybridization platform from Affymetrix Inc. (Sunnyvale, CA), the sequencing-by-synthesis platforms from 454 Life Sciences (Bradford, CT), Illumina/Solexa (Hayward, CA), and Helicos Biosciences (Cambridge, MA), and the sequencing-by-ligation platform from Applied Biosystems (Foster City, CA), as described below. In addition to the single-molecule sequencing performed using the sequencing-by-synthesis of Helicos Biosciences, other single-molecule sequencing technologies include, but are not limited to, the SMRT™ technology of Pacific Biosciences, the ION TORRENT™ technology, and nanopore sequencing developed, for example, by Oxford Nanopore Technologies.

While the automated Sanger method is considered a "first-generation" technology, Sanger sequencing, including automated Sanger sequencing, can also be employed in the methods described herein. Additional suitable sequencing methods include, but are not limited to, nucleic acid imaging technologies, e.g., atomic force microscopy (AFM) or transmission electron microscopy (TEM). Illustrative sequencing technologies are described in greater detail below.

In one illustrative but non-limiting embodiment, the methods described herein include obtaining sequence information for the nucleic acids in a test sample, such as cfDNA in a maternal sample, or cfDNA or cellular DNA in a subject being screened for cancer, using Illumina's sequencing-by-synthesis and reversible terminator-based sequencing chemistry (e.g., as described in Bentley et al., Nature 6:53-59 [2009]). The template DNA can be genomic DNA, such as cellular DNA or cfDNA. In some embodiments, genomic DNA from isolated cells is used as the template and is fragmented into lengths of several hundred base pairs. In other embodiments, cfDNA is used as the template, and fragmentation is not required because cfDNA exists as short fragments. For example, fetal cfDNA circulates in the bloodstream as fragments approximately 170 base pairs (bp) in length (Fan et al., Clin Chem 56:1279-1286 [2010]), and this DNA does not need to be fragmented prior to sequencing. Illumina's sequencing technology relies on attaching fragmented genomic DNA to a planar, optically transparent surface on which anchor oligonucleotides are bound. The template DNA is end-repaired to produce 5'-phosphorylated blunt ends, and the polymerase activity of the Klenow fragment is used to add a single A base to the 3' end of the blunt, phosphorylated DNA fragments. This addition prepares the DNA fragments for ligation to oligonucleotide adaptors, which have a single T base overhang at their 3' end to increase ligation efficiency. The adaptor oligonucleotides are complementary to the flow cell anchor oligonucleotides (not to be confused with the anchor/anchored reads in repeat expansion analysis). Under limiting-dilution conditions, adaptor-modified, single-stranded template DNA is added to the flow cell and immobilized by hybridization to the anchor oligonucleotides. The attached DNA fragments are extended and bridge amplified to create an ultra-high-density sequencing flow cell with hundreds of millions of clusters, each containing about 1,000 copies of the same template. In one embodiment, the randomly fragmented genomic DNA is amplified using PCR before it is subjected to cluster amplification. Alternatively, an amplification-free (e.g., PCR-free) genomic library preparation is used, and the randomly fragmented genomic DNA is enriched using cluster amplification alone (Kozarewa et al., Nature Methods 6:291-295 [2009]). The templates are sequenced using a robust four-color DNA sequencing-by-synthesis technology that employs reversible terminators with removable fluorescent dyes. High-sensitivity fluorescence detection is achieved using laser excitation and total internal reflection optics. Short sequence reads of about tens to a few hundred base pairs are aligned against a reference genome, and unique mappings of the short sequence reads to the reference genome are identified using specially developed data analysis pipeline software. After completion of the first read, the templates can be regenerated in situ to enable a second read from the opposite end of the fragments. Thus, either single-end or paired-end sequencing of the DNA fragments can be used.

Various embodiments of the present disclosure may use sequencing-by-synthesis that allows paired-end sequencing. In some embodiments, Illumina's sequencing-by-synthesis involves clustering fragments. Clustering is a process in which each fragment molecule is isothermally amplified. In some embodiments, as in the example described here, the fragment has two different adaptors attached to its two ends, which allow the fragment to hybridize with two different oligonucleotides on the surface of a flow cell lane. The fragment further includes or is connected to two index sequences at its two ends, which provide labels to identify different samples in multiplex sequencing. On some sequencing platforms, a fragment to be sequenced is also referred to as an insert.

In some implementations, the flow cell used for clustering on the Illumina platform is a glass slide with lanes. Each lane is a glass channel coated with a lawn of two types of oligonucleotides. Hybridization is enabled by the first of the two types of oligonucleotides on the surface. This oligonucleotide is complementary to the first adaptor on one end of the fragment. A polymerase creates a complementary strand of the hybridized fragment. The double-stranded molecule is denatured, and the original template strand is washed away. The remaining strand, in parallel with many other remaining strands, is clonally amplified through bridge amplification.

In bridge amplification, a strand folds over, and the second adaptor region on the second end of the strand hybridizes with the second type of oligonucleotide on the flow cell surface. A polymerase generates a complementary strand, forming a double-stranded bridge molecule. This double-stranded molecule is denatured, resulting in two single-stranded molecules tethered to the flow cell through two different oligonucleotides. The process is then repeated over and over, and occurs simultaneously for millions of clusters, resulting in clonal amplification of all the fragments. After bridge amplification, the reverse strands are cleaved and washed off, leaving only the forward strands. The 3' ends are blocked to prevent unwanted priming.

After clustering, sequencing starts with extending a first sequencing primer to generate the first read. In each cycle, fluorescently labeled nucleotides compete for addition to the growing chain. Only one is incorporated, based on the sequence of the template. After the addition of each nucleotide, the cluster is excited by a light source and emits a characteristic fluorescent signal. The number of cycles determines the length of the read. The emission wavelength and the signal intensity determine the base call. For a given cluster, all identical strands are read simultaneously. Hundreds of millions of clusters are sequenced in a massively parallel manner. At the completion of the first read, the read product is washed away.
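The relationship between cycles, emission signals, and base calls described above can be illustrated with a minimal sketch. It assumes one intensity value per base channel per cycle and a simple "brightest channel wins" rule. The channel order, the `min_purity` cutoff, and the function name are hypothetical choices for illustration and do not represent the base-calling model of any actual instrument.

```python
# Hypothetical channel-to-base order; real instruments use calibrated,
# chemistry-specific models rather than a fixed lookup.
CHANNEL_BASES = ("A", "C", "G", "T")


def call_bases(intensities, min_purity=0.6):
    """Call one base per cycle from four-channel intensities.

    intensities: list of 4-tuples, one tuple per sequencing cycle.
    A cycle is called 'N' when the brightest channel does not carry at
    least `min_purity` of the total signal (an assumed cutoff).
    """
    read = []
    for cycle in intensities:
        total = sum(cycle)
        brightest = max(range(4), key=lambda i: cycle[i])
        if total <= 0 or cycle[brightest] / total < min_purity:
            read.append("N")
        else:
            read.append(CHANNEL_BASES[brightest])
    return "".join(read)
```

The length of the returned string equals the number of cycles, which mirrors the statement that the number of cycles determines the read length.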

In the next step of protocols involving two index primers, an index 1 primer is introduced and hybridized to the index 1 region on the template. Index regions provide identification of fragments, which is useful for demultiplexing samples in a multiplex sequencing process. The index 1 read is generated in a manner similar to the first read. After completion of the index 1 read, the read product is washed away and the 3' end of the strand is deprotected. The template strand then folds over and binds to the second oligonucleotide on the flow cell. The index 2 sequence is read in the same manner as index 1. The index 2 read product is then washed off at the completion of the step.

After the two indices are read, read 2 is initiated by using a polymerase to extend the second flow cell oligonucleotide, forming a double-stranded bridge. This double-stranded DNA is denatured, and the 3' ends are blocked. The original forward strand is cleaved off and washed away, leaving the reverse strand. Read 2 begins with the introduction of a read 2 sequencing primer. As with read 1, the sequencing steps are repeated until the desired length is achieved. The read 2 product is washed away. This entire process generates millions of reads representing all the fragments. Sequences from pooled sample libraries are separated based on the unique indices introduced during sample preparation. For each sample, reads with similar stretches of base calls are locally clustered. Forward and reverse reads are paired, creating contiguous sequences. These contiguous sequences are aligned to the reference genome for variant identification.
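The separation of pooled reads by index can be sketched as follows. This is a minimal illustration that assumes exact matching of both index sequences against a sample sheet. The tuple layout, the `sample_sheet` dictionary, and the "undetermined" label are assumptions made for the example; production demultiplexers typically also tolerate index mismatches.

```python
from collections import defaultdict


def demultiplex(reads, sample_sheet):
    """Group read sequences by sample using their (index1, index2) pair.

    reads: iterable of (index1, index2, sequence) tuples.
    sample_sheet: dict mapping (index1, index2) -> sample name.
    Reads whose index pair is absent from the sample sheet are collected
    under the key 'undetermined'.
    """
    by_sample = defaultdict(list)
    for index1, index2, sequence in reads:
        sample = sample_sheet.get((index1, index2), "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)
```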

The sequencing-by-synthesis example described above involves paired-end reads, which are used in many embodiments of the disclosed methods. Paired-end sequencing involves two reads, one from each end of a fragment. When a pair of reads is mapped to a reference sequence, the base-pair distance between the two reads can be determined, and this distance can then be used to determine the length of the fragment from which the reads were obtained. In some cases, a fragment straddling two bins will have one of its paired-end reads aligned to one bin and the other aligned to an adjacent bin. This becomes rarer as bins get longer or reads get shorter. Various methods can be used to account for the bin membership of these fragments. For example, they can be omitted when determining the fragment size frequency of a bin; they can be counted for both of the adjacent bins; they can be assigned to the bin that contains the larger number of the fragment's base pairs; or they can be assigned to both bins with a weight related to the portion of base pairs in each bin.
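The last two options above (majority assignment and proportional weighting) can be made concrete with a short sketch. It assumes half-open reference coordinates, fixed-size bins, and reads of a pair that map to the same chromosome. The function names and the coordinate convention are illustrative assumptions, not part of the disclosed method.

```python
def fragment_span(read1_start, read1_end, read2_start, read2_end):
    """Return the (start, end) of the fragment implied by a mapped read pair.

    Coordinates are half-open: a read covers [start, end).
    """
    return min(read1_start, read2_start), max(read1_end, read2_end)


def bin_weights(start, end, bin_size):
    """Split the fragment [start, end) across fixed-size bins.

    Each overlapped bin receives a weight equal to the fraction of the
    fragment's base pairs that fall inside it, so the weights sum to 1.
    Assumes end > start.
    """
    length = end - start
    weights = {}
    position = start
    while position < end:
        bin_index = position // bin_size
        bin_end = (bin_index + 1) * bin_size
        overlap = min(end, bin_end) - position
        weights[bin_index] = overlap / length
        position += overlap
    return weights
```

Under these assumptions, the fragment length is simply `end - start`, and the majority-assignment option corresponds to choosing the bin with the largest weight, for example `max(weights, key=weights.get)`.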

Paired-end reads may use inserts of different lengths (i.e., different fragment sizes to be sequenced). As the default meaning in this disclosure, paired-end reads refers to reads obtained from inserts of various lengths. In some instances, to distinguish short-insert paired-end reads from long-insert paired-end reads, the latter are also referred to as mate pair reads. In some embodiments involving mate pair reads, two biotin junction adaptors are first attached to the two ends of a relatively long insert (e.g., several kb). The biotin junction adaptors then link the two ends of the insert to form a circularized molecule. A sub-fragment encompassing the biotin junction adaptors can then be obtained by further fragmenting the circularized molecule. The sub-fragment, which includes the two ends of the original fragment in opposite sequence order, can then be sequenced by the same procedure as for the short-insert paired-end sequencing described above. Further details of mate pair sequencing using an Illumina platform are shown in an online publication at the following URL, which is incorporated by reference in its entirety: res|.|illumina|.|com/documents/products/technotes/technote_nextera_matepair_data_processing. Additional information about paired-end sequencing can be found in U.S. Patent 7,601,499 and U.S. Patent Publication 2012/0053063, which are incorporated by reference with regard to materials on paired-end sequencing methods and apparatus.

After sequencing of the DNA fragments, sequence reads of a predetermined length (e.g., 100 bp) are mapped or aligned to a known reference genome. The mapped or aligned reads and their corresponding locations on the reference sequence are also referred to as tags. In one embodiment, the reference genome sequence is the NCBI36/hg18 sequence, which is available on the World Wide Web at genome dot ucsc dot edu/cgi-bin/hgGateway?org=Human&db=hg18&hgsid=166260105. Alternatively, the reference genome sequence is GRCh37/hg19, which is available on the World Wide Web at genome dot ucsc dot edu/cgi-bin/hgGateway. Other sources of public sequence information include GenBank, dbEST, dbSTS, EMBL (the European Molecular Biology Laboratory), and DDBJ (the DNA Data Bank of Japan). A number of computer algorithms are available for aligning sequences, including but not limited to BLAST (Altschul et al., 1990), BLITZ (MPsrch) (Sturrock and Collins, 1993), FASTA (Pearson and Lipman, 1988), BOWTIE (Langmead et al., Genome Biology 10:R25.1-R25.10 [2009]), or ELAND (Illumina, Inc., San Diego, CA, USA). In one embodiment, one end of the clonally expanded copies of the plasma cfDNA molecules is sequenced and processed by bioinformatic alignment analysis for the Illumina Genome Analyzer, which uses the Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) software.

Other sequencing methods and systems can be used to obtain sequence reads.

Apparatus and systems for determining CNV

Analysis of the sequencing data and the diagnosis derived from it are typically performed using various computer-executed algorithms and programs. Therefore, certain embodiments employ processes involving data stored in or transferred through one or more computer systems or other processing systems. Embodiments disclosed herein also relate to apparatus for performing these operations. This apparatus may be specially constructed for the required purposes, or it may be a general-purpose computer (or a group of computers) selectively activated or reconfigured by a computer program and/or data structure stored in the computer. In some embodiments, a group of processors performs some or all of the recited analytical operations collaboratively (e.g., via a network or cloud computing) and/or in parallel. A processor or group of processors for performing the methods described herein may be of various types, including microcontrollers and microprocessors such as programmable devices (e.g., CPLDs and FPGAs) and non-programmable devices such as gate array ASICs or general-purpose microprocessors.

In addition, certain embodiments relate to tangible and/or non-transitory computer-readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations. Examples of computer-readable media include, but are not limited to, semiconductor memory devices, magnetic media such as disk drives, magnetic tape, optical media such as CDs, magneto-optical media, and hardware devices that are specially configured to store and execute program instructions, such as read-only memory devices (ROM) and random access memory (RAM). The computer-readable media may be directly controlled by an end user, or the media may be indirectly controlled by the end user. Examples of directly controlled media include media located at a user facility and/or media that are not shared with other entities. Examples of indirectly controlled media include media that are indirectly accessible to the user via an external network and/or via a service providing shared resources, such as the "cloud". Examples of program instructions include both machine code, such as that produced by a compiler, and files containing higher-level code that may be executed by the computer using an interpreter.

In various embodiments, the data or information employed in the disclosed methods and apparatus is provided in an electronic format. Such data or information may include reads and tags derived from a nucleic acid sample, counts or densities of such tags that align with particular regions of a reference sequence (e.g., that align to a chromosome or chromosome segment), reference sequences (including reference sequences providing solely or primarily polymorphisms), chromosome and segment doses, calls such as aneuploidy calls, normalized chromosome and segment values, pairs of chromosomes or segments and corresponding normalizing chromosomes or segments, counseling recommendations, diagnoses, and the like. As used herein, data or other information provided in electronic format is available for storage on a machine and transmission between machines. Conventionally, data in electronic format is provided digitally and may be stored as bits and/or bytes in various data structures, lists, databases, and the like. The data may be embodied electronically, optically, and so on.

One embodiment provides a computer program product for generating an output indicating the presence or absence of an aneuploidy (e.g., a fetal aneuploidy) or cancer in a test sample. The computer product may contain instructions for performing any one or more of the above-described methods for determining a chromosomal anomaly. As explained, the computer product may include a non-transitory and/or tangible computer-readable medium having computer-executable or compilable logic (e.g., instructions) recorded thereon for enabling a processor to determine chromosome doses and, in some cases, whether a fetal aneuploidy is present or absent. In one example, the computer product includes a computer-readable medium having computer-executable or compilable logic (e.g., instructions) recorded thereon for enabling a processor to diagnose a fetal aneuploidy, the logic comprising: a receiving procedure for receiving sequencing data from at least a portion of the nucleic acid molecules from a maternal biological sample, wherein the sequencing data comprises calculated chromosome and/or segment doses; computer-assisted logic for analyzing a fetal aneuploidy from the received data; and an output procedure for generating an output indicating the presence, absence, or kind of the fetal aneuploidy.

The sequence information from the sample under consideration may be mapped to chromosome reference sequences to identify the number of sequence tags for each of any one or more chromosomes of interest, and to identify the number of sequence tags for the normalizing segment sequence for each of the one or more chromosomes of interest. In various embodiments, the reference sequences are stored in a database such as a relational or object database, for example.

It should be understood that it is impractical, and in most cases even impossible, for an unaided human being to perform the computational operations of the methods disclosed herein. For example, mapping a single 30 bp read from a sample to any one of the human chromosomes might require years of effort without the assistance of a computational apparatus. The problem is compounded because reliable aneuploidy calls generally require mapping thousands (e.g., at least about 10,000) or even millions of reads to one or more chromosomes.

The methods disclosed herein can be performed using a system for evaluating the copy number of a genetic sequence of interest in a test sample. The system includes: (a) a sequencer for receiving nucleic acids from the test sample and providing nucleic acid sequence information from the sample; (b) a processor; and (c) one or more computer-readable storage media having stored thereon instructions for execution on the processor to carry out a method for identifying any CNV (e.g., chromosomal or partial aneuploidies).

In some embodiments, the methods are instructed by a computer-readable medium having stored thereon computer-readable instructions for carrying out a method for identifying any CNV (e.g., chromosomal or partial aneuploidies). Thus, one embodiment provides a computer program product comprising one or more computer-readable non-transitory storage media having stored thereon computer-executable instructions that, when executed by one or more processors of a computer system, cause the computer system to implement a method for evaluating the copy number of a sequence of interest in a test sample comprising fetal and maternal cell-free nucleic acids. The method includes: (a) receiving sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (b) aligning the sequence reads of the cell-free nucleic acid fragments to a reference genome comprising the sequence of interest, thereby providing test sequence tags, wherein the reference genome is divided into a plurality of bins; (c) determining the sizes of the cell-free nucleic acid fragments present in the test sample; (d) weighting the test sequence tags based on the sizes of the cell-free nucleic acid fragments from which the tags were obtained; (e) calculating coverages for the bins based on the weighted tags of (d); and (f) identifying a copy number variation in the sequence of interest from the calculated coverages. In some implementations, weighting the test sequence tags involves biasing the coverages toward test sequence tags obtained from cell-free nucleic acid fragments of a size or size range characteristic of one genome in the test sample. In some implementations, weighting the test sequence tags involves assigning a value of 1 to tags obtained from cell-free nucleic acid fragments of the size or size range, and assigning a value of 0 to other tags. In some implementations, the method further involves determining, in bins of the reference genome including the sequence of interest, values of a fragment size parameter, the fragment size parameter comprising a quantity of the cell-free nucleic acid fragments in the test sample having fragment sizes shorter or longer than a threshold value. Here, identifying the copy number variation in the sequence of interest involves using the values of the fragment size parameter as well as the coverages calculated in (e). In some implementations, the system is configured to evaluate copy number in the test sample using the various methods and processes discussed above.
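Steps (d) and (e), together with the fragment size parameter, can be sketched under simplifying assumptions. The sketch uses the 1/0 weighting mentioned above, fixed-size bins, and tags represented as (chromosome, position, fragment size) tuples. The size range of 80 to 150 bp, the 150 bp threshold, and all function names are hypothetical values chosen for illustration, not values taken from this disclosure.

```python
from collections import defaultdict


def weighted_bin_coverage(tags, bin_size, size_range=(80, 150)):
    """Sum per-bin tag weights, where each weight depends on fragment size.

    tags: iterable of (chromosome, position, fragment_size) tuples.
    size_range: inclusive (low, high) bounds (assumed values); tags from
    fragments inside the range get weight 1, all others weight 0.
    Returns a dict keyed by (chromosome, bin_index).
    """
    low, high = size_range
    coverage = defaultdict(float)
    for chromosome, position, fragment_size in tags:
        weight = 1.0 if low <= fragment_size <= high else 0.0
        coverage[(chromosome, position // bin_size)] += weight
    return dict(coverage)


def short_fragment_fraction(fragment_sizes, threshold=150):
    """One possible fragment size parameter: the fraction of fragments
    shorter than `threshold` (an assumed cutoff)."""
    sizes = list(fragment_sizes)
    if not sizes:
        return 0.0
    return sum(1 for size in sizes if size < threshold) / len(sizes)
```

A graded weighting function could replace the 1/0 rule without changing the structure of the coverage calculation.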

In some embodiments, the instructions may further include automatically recording information pertinent to the method, such as chromosome doses and the presence or absence of a fetal chromosomal aneuploidy, in a patient medical record for the human subject providing the maternal test sample. The patient medical record may be maintained by, for example, a laboratory, physician's office, hospital, health maintenance organization, insurance company, or personal medical record website. Further, based on the results of the processor-implemented analysis, the method may further involve prescribing, initiating, and/or altering treatment of the human subject from whom the maternal test sample was taken. This may involve performing one or more additional tests or analyses on additional samples taken from the subject.

The disclosed methods can also be performed using a computer processing system that is adapted or configured to perform a method for identifying any CNV (e.g., chromosomal or partial aneuploidies). One embodiment provides a computer processing system that is adapted or configured to perform a method as described herein. In one embodiment, the apparatus includes a sequencing device adapted or configured for sequencing at least a portion of the nucleic acid molecules in a sample to obtain the type of sequence information described elsewhere herein. The apparatus may also include components for processing the sample. Such components are described elsewhere herein.

Sequence or other data can be input into a computer or stored on a computer-readable medium either directly or indirectly. In one embodiment, a computer system is directly coupled to a sequencing device that reads and/or analyzes sequences of nucleic acids from samples. Sequences or other information from such tools are provided via an interface in the computer system. Alternatively, the sequences processed by the system are provided from a sequence storage source such as a database or other repository. Once available to the processing apparatus, a memory device or mass storage device buffers or stores, at least temporarily, the nucleic acid sequences. In addition, the memory device may store tag counts for various chromosomes or genomes, etc. The memory may also store various routines and/or programs for analyzing the presented sequence or mapped data. Such programs/routines may include programs for performing statistical analyses, etc.

In one example, a user places a sample into a sequencing apparatus. Data is collected and/or analyzed by the sequencing apparatus, which is connected to a computer. Software on the computer allows for data collection and/or analysis. Data can be stored, displayed (via a monitor or other similar device), and/or sent to another location. The computer may be connected to the Internet, which is used to transmit data to a handheld device used by a remote user (e.g., a physician, scientist, or analyst). It should be understood that the data can be stored and/or analyzed prior to transmission. In some embodiments, raw data is collected and sent to a remote user or apparatus that will analyze and/or store the data. Transmission can occur via the Internet, but can also occur via satellite or other connection. Alternatively, data can be stored on a computer-readable medium, and the medium can be shipped to an end user (e.g., via mail). The remote user can be in the same or a different geographical location, including, but not limited to, a building, city, state, country, or continent.

In some embodiments, the methods also include collecting data regarding a plurality of polynucleotide sequences (e.g., reads, tags, and/or reference chromosome sequences) and sending the data to a computer or other computational system. For example, the computer can be connected to laboratory equipment, such as a sample collection apparatus, a nucleotide amplification apparatus, a nucleotide sequencing apparatus, or a hybridization apparatus. The computer can then collect applicable data gathered by the laboratory equipment. The data can be stored on the computer at any step, e.g., while collected in real time, prior to sending, during or in conjunction with sending, or following sending. The data can be stored on a computer-readable medium that can be extracted from the computer. The data collected or stored can be transmitted from the computer to a remote location, e.g., via a local network or a wide area network such as the Internet. At the remote location, various operations can be performed on the transmitted data, as described below.

Among the types of electronically formatted data that may be stored, transmitted, analyzed, and/or manipulated in the systems, apparatus, and methods disclosed herein are the following:

Reads obtained by sequencing nucleic acids in a test sample

Tags obtained by aligning reads to a reference genome or other reference sequence or sequences

The reference genome or sequence

Sequence tag density: counts or numbers of tags for each of two or more regions (typically chromosomes or chromosome segments) of a reference genome or other reference sequences

Identities of normalizing chromosomes or chromosome segments for particular chromosomes or chromosome segments of interest

Doses for chromosomes or chromosome segments (or other regions) obtained from chromosomes or segments of interest and corresponding normalizing chromosomes or segments

Thresholds for calling chromosome doses as affected, unaffected, or no call

The actual calls of chromosome doses

Diagnoses (clinical conditions associated with the calls)

Recommendations for further tests derived from the calls and/or diagnoses

Treatment and/or monitoring plans derived from the calls and/or diagnoses

These various types of data may be obtained, stored, analyzed, and/or manipulated at one or more locations using distinct apparatus. The processing options span a wide spectrum. At one end of the spectrum, all or much of this information is stored and used at the location where the test sample is processed, e.g., a doctor's office or other clinical setting. At the other extreme, the sample is obtained at one location, it is processed and optionally sequenced at a different location, reads are aligned and calls are made at one or more different locations, and diagnoses, recommendations, and/or plans are prepared at still another location (which may be the location where the sample was obtained).

In various embodiments, the reads are generated with the sequencing apparatus and then transmitted to a remote site where they are processed to produce aneuploidy calls. At this remote location, as an example, the reads are aligned to a reference sequence to produce tags, which are counted and assigned to chromosomes or segments of interest. Also at the remote location, the counts are converted to doses using the associated normalizing chromosomes or segments. Still further, at the remote location, the doses are used to generate aneuploidy calls.
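The tag-to-call chain described above can be sketched in a few lines. This is a minimal illustration only, assuming that a dose is the ratio of the tag count on the chromosome of interest to the summed tag count on its normalizing chromosome(s); the function names, the example counts, and both thresholds are hypothetical and are not taken from this specification.

```python
from collections import Counter


def count_tags(tag_chromosomes):
    """Count aligned tags per chromosome (input: one chromosome name per tag)."""
    return Counter(tag_chromosomes)


def chromosome_dose(counts, chrom_of_interest, normalizing_chroms):
    """Dose = tags on the chromosome of interest / tags on the normalizing chromosome(s)."""
    denominator = sum(counts[c] for c in normalizing_chroms)
    if denominator == 0:
        raise ValueError("no tags on the normalizing chromosome(s)")
    return counts[chrom_of_interest] / denominator


def make_call(dose, unaffected_max, affected_min):
    """Three-way call: affected, unaffected, or no call between the two thresholds."""
    if dose >= affected_min:
        return "affected"
    if dose <= unaffected_max:
        return "unaffected"
    return "no call"


# Hypothetical tag assignments: 12 tags on chr21, 100 on a normalizing chromosome.
tags = ["chr21"] * 12 + ["chr9"] * 100
counts = count_tags(tags)
dose = chromosome_dose(counts, "chr21", ["chr9"])

# Hypothetical thresholds, chosen only to exercise the three-way call.
call = make_call(dose, unaffected_max=0.105, affected_min=0.115)
print(dose, call)
```

Keeping counting, dose calculation, and calling as separate functions mirrors the way the text allows each step to run at a different location.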

Among the processing operations that may be employed at distinct locations are the following:

Sample collection

Sample processing preliminary to sequencing

Sequencing

Analyzing sequence data and deriving aneuploidy calls

Diagnosis

Reporting a diagnosis and/or a call to a patient or health care provider

Developing a plan for further treatment, testing, and/or monitoring

Executing the plan

Counseling

Any one or more of these operations may be automated, as described elsewhere herein. Typically, the sequencing, the analysis of sequence data, and the derivation of aneuploidy calls will be performed computationally. The other operations may be performed manually or automatically.

Examples of locations where sample collection may be performed include health practitioners' offices, clinics, patients' homes (where a sample collection tool or kit is provided), and mobile health care vehicles. Examples of locations where sample processing prior to sequencing may be performed include health practitioners' offices, clinics, patients' homes (where a sample processing apparatus or kit is provided), mobile health care vehicles, and facilities of aneuploidy analysis providers. Examples of locations where sequencing may be performed include health practitioners' offices, clinics, patients' homes (where a sample sequencing apparatus and/or kit is provided), mobile health care vehicles, and facilities of aneuploidy analysis providers. The location where the sequencing takes place may be provided with a dedicated network connection for transmitting sequence data (typically reads) in an electronic format. Such a connection may be wired or wireless and may be configured to send the data to a site where the data can be processed and/or aggregated prior to transmission to a processing site. Data aggregators can be maintained by health organizations such as health maintenance organizations (HMOs).

The analyzing and/or deriving operations may be performed at any of the foregoing locations or, alternatively, at a further remote site dedicated to computation and/or the service of analyzing nucleic acid sequence data. Such locations include, for example, clusters such as general-purpose server farms, the facilities of an aneuploidy analysis service business, and the like. In some embodiments, the computational apparatus employed to perform the analysis is leased or rented. The computational resources may be part of an Internet-accessible collection of processors, such as processing resources colloquially known as the cloud. In some cases, the computations are performed by a parallel or massively parallel group of processors that are affiliated or unaffiliated with one another. The processing may be accomplished using distributed processing such as cluster computing, grid computing, and the like. In such embodiments, a cluster or grid of computational resources collectively forms a super virtual computer composed of multiple processors or computers acting together to perform the analysis and/or derivation described herein. These technologies, as well as more conventional supercomputers, may be employed to process sequence data as described herein. Each is a form of parallel computing that relies on processors or computers. In the case of grid computing, these processors (often whole computers) are connected by a network (private, public, or the Internet) using a conventional network protocol such as Ethernet. By contrast, a supercomputer has many processors connected by a local high-speed computer bus.

In certain embodiments, the diagnosis (e.g., the fetus has Down syndrome or the patient has a particular type of cancer) is generated at the same location as the analyzing operation. In other embodiments, it is performed at a different location. In some examples, reporting the diagnosis is performed at the location where the sample was taken, although this need not be the case. Examples of locations where the diagnosis can be generated or reported and/or where plan development is performed include health practitioners' offices, clinics, Internet sites accessible by computers, and handheld devices such as cell phones, tablets, and smart phones having a wired or wireless connection to a network. Examples of locations where counseling is performed include health practitioners' offices, clinics, Internet sites accessible by computers, handheld devices, and the like.

In some embodiments, the sample collection, sample processing, and sequencing operations are performed at a first location, and the analyzing and deriving operations are performed at a second location. In some cases, however, the sample collection takes place at one location (e.g., a health practitioner's office or clinic), and the sample processing and sequencing take place at a different location, which is optionally the same location where the analyzing and deriving take place.

In various embodiments, the sequence of operations listed above may be triggered by a user or entity initiating sample collection, sample processing, and/or sequencing. After one or more of these operations have begun execution, the other operations may naturally follow. For example, the sequencing operation may cause reads to be automatically collected and sent to a processing apparatus, which then conducts, often automatically and possibly without further user intervention, the sequence analysis and the derivation of aneuploidy calls. In some implementations, the result of this processing operation is then automatically delivered, possibly reformatted as a diagnosis, to a system component or entity that handles reporting the information to a health professional and/or patient. As explained, such information can also be automatically processed to produce a treatment, testing, and/or monitoring plan, possibly along with counseling information. Thus, initiating an early-stage operation can trigger an end-to-end sequence in which the health professional, patient, or other concerned party is provided with a diagnosis, a plan, counseling, and/or other information useful for acting on a physical condition. This is accomplished even though parts of the overall system are physically separated and possibly remote from the location of, e.g., the sample and sequencing apparatus.

Figure 17 shows one implementation of a dispersed system for producing a call or diagnosis from a test sample. A sample collection location 01 is used for obtaining a test sample from a patient such as a pregnant female or a putative cancer patient. The sample is then provided to a processing and sequencing location 03, where the test sample may be processed and sequenced as described above. Location 03 includes apparatus for processing the sample as well as apparatus for sequencing the processed sample. As described elsewhere herein, the result of the sequencing is a collection of reads, which is typically provided in an electronic format and supplied to a network such as the Internet, indicated by reference numeral 05 in Figure 17.

The sequence data is provided to a remote location 07 where analysis and call generation are performed. This location may include one or more powerful computational devices such as computers or processors. After the computational resources at location 07 have completed their analysis and generated a call from the sequence information received, the call is relayed back to the network 05. In some implementations, not only is a call generated at location 07, but an associated diagnosis is generated as well. The call and/or diagnosis is then transmitted across the network and back to the sample collection location 01, as illustrated in Figure 17. As explained, this is simply one of many variations on how the various operations associated with generating a call or diagnosis may be divided among various locations. One common variant involves providing sample collection, processing, and sequencing at a single location. Another variant involves providing processing and sequencing at the same location as analysis and call generation.

Figure 18 elaborates on the options for performing the various operations at distinct locations. In the most granular sense depicted in Figure 18, each of the following operations is performed at a separate location: sample collection, sample processing, sequencing, read alignment, calling, diagnosis, and reporting and/or plan development.

In one embodiment that aggregates some of these operations, sample processing and sequencing are performed at one location, and read alignment, calling, and diagnosis are performed at a separate location. See the portion of Figure 18 identified by reference character A. In another implementation, identified by character B in Figure 18, sample collection, sample processing, and sequencing are all performed at the same location. In this implementation, read alignment and calling are performed at a second location. Finally, diagnosis and reporting and/or plan development are performed at a third location. In the implementation depicted by character C in Figure 18, sample collection is performed at a first location; sample processing, sequencing, read alignment, calling, and diagnosis are all performed together at a second location; and reporting and/or plan development are performed at a third location. Finally, in the implementation labeled D in Figure 18, sample collection is performed at a first location; sample processing, sequencing, read alignment, and calling are all performed at a second location; and diagnosis and reporting and/or plan management are performed at a third location.

One embodiment provides a system for determining the presence or absence of an aneuploidy in a test sample comprising fetal and maternal nucleic acids, the system including a sequencer for receiving a nucleic acid sample and providing fetal and maternal nucleic acid sequence information from the sample, and one or more processors configured to: (a) determine a fetal fraction value of the test sample, wherein the fetal fraction of the test sample indicates the relative amount of cell-free nucleic acid fragments of fetal origin in the test sample; (b) receive, by a computer system, sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (c) align, by the computer system, the sequence reads of the cell-free nucleic acid fragments to a reference genome comprising a sequence of interest, thereby providing sequence tags; (d) determine, by the computer system, a coverage of the sequence tags for at least a portion of the reference genome; and (e) determine that the test sample is within an exclusion region based on the coverage of the sequence tags determined in (d) and the fetal fraction determined in (a), wherein the exclusion region is defined by at least a fetal fraction limit of detection (LOD) curve, and wherein the fetal fraction LOD curve varies with coverage values and indicates the minimum fetal fraction required to achieve a detection criterion at different coverages.
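Step (e) can be pictured as a point-in-region test in the (coverage, fetal fraction) plane. The sketch below is an illustration under stated assumptions, not the claimed implementation: it assumes a power-law LOD curve (a straight line on a log-log scale), and the reference point, the exponent, the optional minimum-coverage threshold, and all function names are hypothetical.

```python
def lod_fetal_fraction(coverage, lod_ref=0.03, coverage_ref=4e6, exponent=-0.5):
    """Minimum fetal fraction needed at a given coverage.

    Hypothetical power-law curve: LOD falls as coverage rises. The
    parameter values are placeholders, not values from the specification.
    """
    return lod_ref * (coverage / coverage_ref) ** exponent


def in_exclusion_region(fetal_fraction, coverage, min_coverage=None):
    """True if the sample falls below the LOD curve or, when a coverage
    threshold is supplied, below that threshold."""
    if min_coverage is not None and coverage < min_coverage:
        return True
    return fetal_fraction < lod_fetal_fraction(coverage)


# A low-coverage sample needs a higher fetal fraction to pass than a deep one.
print(in_exclusion_region(0.05, 1e6))    # 5% FF at 1 million reads
print(in_exclusion_region(0.02, 16e6))   # 2% FF at 16 million reads
print(in_exclusion_region(0.10, 1e6, min_coverage=2e6))
```

Because the curve trades fetal fraction against coverage, a sample with modest fetal fraction can still pass when it has been sequenced deeply enough, which a pair of fixed thresholds cannot express.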

In some embodiments of any of the systems provided herein, the sequencer is configured to perform next-generation sequencing (NGS). In some embodiments, the sequencer is configured to perform massively parallel sequencing using sequencing-by-synthesis with reversible dye terminators. In other embodiments, the sequencer is configured to perform sequencing-by-ligation. In still other embodiments, the sequencer is configured to perform single-molecule sequencing.

In some embodiments of any of the systems provided herein, the one or more processors are programmed to perform the various methods described above.

Another aspect of the disclosure relates to a computer program product including a non-transitory machine-readable medium storing program code that, when executed by one or more processors of a computer system, causes the computer system to: (a) determine a fetal fraction value of a test sample, wherein the fetal fraction of the test sample indicates the relative amount of cell-free nucleic acid fragments of fetal origin in the test sample; (b) receive, by the computer system, sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (c) align, by the computer system, the sequence reads of the cell-free nucleic acid fragments to a reference genome comprising a sequence of interest, thereby providing sequence tags; (d) determine, by the computer system, a coverage of the sequence tags for at least a portion of the reference genome; and (e) determine that the test sample is within an exclusion region based on the coverage of the sequence tags determined in (d) and the fetal fraction determined in (a), wherein the exclusion region is defined by at least a fetal fraction limit of detection (LOD) curve, and wherein the fetal fraction LOD curve varies with coverage values and indicates the minimum fetal fraction required to achieve a detection criterion at different coverages.

In some embodiments of the systems provided herein, the computer program product includes a non-transitory machine-readable medium storing program code to be executed by one or more processors to perform the various methods described above.

Experimental

Example 1

The purpose of this example is to illustrate how an LOD curve is obtained using the methods described above.

Study Design

For this study, a portion of the data generated in the VeriSeq NIPT Precision Study (BRIGID-0147 VeriSeq NIPT: Precision Study Protocol and BRIGID-0166 VeriSeq NIPT: Precision Study Report) was used to determine the limit of detection of the VeriSeq NIPT Solution system.

In the VeriSeq NIPT Precision Study, pooled cfDNA from trisomy 21 and non-pregnant plasma pools was combined to create a trisomy 21 affected pool at 5% fetal fraction, which was processed through the VeriSeq NIPT assay. The male maternal pool was also processed through the VeriSeq NIPT assay as is.

Table 1 shows the within-laboratory precision portion of the precision study design used for the LOD study. The within-laboratory precision portion consisted of 2 Hamilton instruments, 2 NextSeq instruments, and three reagent lots, for a total of 12 runs over 6 days.

Table 1. Within-laboratory precision portion of the precision study

Automation and sequencing instruments were paired with consistent operators. Operator and instrument variation were combined.

The data generated with reagent lots 1 and 2 were used to establish the limit of detection. By combining sequence data from the T21 pool (5% FF) with sequence data from non-pregnant women, we generated synthetic sequence data for T21 samples with different fetal fractions. This process was repeated for 5 different levels of sequencing depth (2 million, 4 million, 6 million, 8 million, and 10 million uniquely aligned reads of coverage; see Table 2). Dilution points were chosen on a fine grid over a wide range (1.25% to 4.5%, in steps of 0.25%) so as to cover the expected limit of detection at all levels of sequencing depth.
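One way to picture the in silico dilution is as a proportional mix of reads. The sketch below assumes that a target fetal fraction f is reached by drawing a share f / 5% of the reads from the affected (5% FF) data and the remainder from the non-pregnant (0% FF) data; that mixing rule, the toy read labels, and the function names are assumptions made for illustration, and the actual procedure is the one described in the referenced reports.

```python
import random

# Dilution grid from the text: 1.25% to 4.5% fetal fraction in 0.25% steps.
DILUTION_GRID = [1.25 + 0.25 * i for i in range(14)]


def dilute(affected_reads, nonpregnant_reads, target_ff, source_ff, n_reads, rng):
    """Build one synthetic sample with roughly `target_ff` percent fetal fraction.

    Assumes fetal fraction scales linearly with the share of reads drawn
    from the affected data (source_ff percent) versus non-pregnant data (0%).
    """
    if not 0 <= target_ff <= source_ff:
        raise ValueError("target_ff must lie between 0 and source_ff")
    n_affected = round(n_reads * target_ff / source_ff)
    sample = rng.sample(affected_reads, n_affected)
    sample += rng.sample(nonpregnant_reads, n_reads - n_affected)
    return sample


# Toy stand-ins for aligned reads; real inputs would be read records.
affected = [("affected", i) for i in range(2000)]
nonpregnant = [("nonpregnant", i) for i in range(2000)]

rng = random.Random(0)  # fixed seed so replicates are reproducible
synthetic = dilute(affected, nonpregnant, target_ff=2.5, source_ff=5.0,
                   n_reads=1000, rng=rng)
n_from_affected = sum(1 for label, _ in synthetic if label == "affected")
print(len(DILUTION_GRID), n_from_affected)
```

Calling `dilute` repeatedly with different seeds or a shared generator would give the replicate synthetic samples at each grid point and depth.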

For each dilution point and each coverage level, 20 replicate synthetic samples were generated (for a more detailed description of the method, see DEV Report-0072, Mathematical Model for Log Likelihood Ratio Score and Limit of Detection Prediction).

Using the in silico generated dilution series, the LOD for T21 was established at each coverage level using the probit regression method described in BRIGID-0150 (VeriSeq NIPT Solution Design Validation Protocol Limit of Detection). In addition, the LODs for T13 and T18 were determined from the T21 LOD results and the known relationship between the T21 LOD and the T13 and T18 LODs, as described in DEV Report-0072.

Results

The quantitative scores, log likelihood ratios (LLR), and fetal fraction estimates for each replicate run were computed with the R&D Clinical Aneuploidy Detection and Analysis Suite (cADAS) v3.2.

Sample Filtering and Outliers

All no-template control (NTC) samples were excluded from the analysis. After removing these samples, 564 samples remained, including N = 48 from the T21 pool and N = 48 from the non-pregnant pool. These samples were used further for the in silico dilution analysis. The remaining samples were not used directly but served as filler for the analysis. One of the non-pregnant samples was flagged as failing NES_FF_QC; this QC failure was ignored and the sample was included in the analysis, because the sample is non-pregnant (it contains no fetal DNA) and is therefore expected to fail this metric.

In Silico Dilution Quality Verification

The affected samples were prepared from a pooled mix of male and female plasma samples at approximately 50% each. Y chromosome DNA is therefore present in the samples and can be used to verify the in silico dilution process. Figure 19 shows chromosome Y coverage (left panel) and the FF estimate (right panel) of the synthetically generated samples as a function of the dilution fraction. Both plots are linear, showing agreement between FF (starting non-pregnant FF = 0%, affected FF = 5%) and the in silico dilution level.

Determining the Limit of Detection of the Trisomy 21 Test

The procedure for determining the LOD from experimental observations at a specified coverage level is described in BRIGID 0150vA (VeriSeq NIPT Solution Design Validation Protocol Limit of Detection).

The measured T21 LODs for each lot and each coverage level are summarized in Table 2.

Table 2. Limit of detection

Limit of Detection versus Coverage

The theoretical model described in DEV Report-0072 (Mathematical Model for Log Likelihood Ratio Score and Limit of Detection Prediction) predicts that LOD versus coverage is a straight line on a log-log scale. Figure 20 shows the result of a linear fit of LOD versus coverage. We observe that the experimental data closely match the predicted behavior.

The fitted LOD versus coverage line predicts the limit of detection at any coverage. The predicted LOD overlaid on the observed LOD is shown in Figure 21.
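A straight line on a log-log scale corresponds to a power law, so the fit can be done by ordinary least squares on the logarithms. The sketch below shows that fit and the resulting prediction at an arbitrary coverage. The input values are synthetic points generated from a made-up power law purely to exercise the code; they are not the Table 2 measurements, and the function names are hypothetical.

```python
import math


def fit_loglog(coverages, lods):
    """Least-squares line through (log10 coverage, log10 LOD); returns slope, intercept."""
    xs = [math.log10(c) for c in coverages]
    ys = [math.log10(v) for v in lods]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_lod(coverage, slope, intercept):
    """Evaluate the fitted line at any coverage and map back from log space."""
    return 10 ** (intercept + slope * math.log10(coverage))


# Synthetic illustration only: LOD (% FF) following 6.0 * (reads / 1e6) ** -0.5.
coverages = [2e6, 4e6, 6e6, 8e6, 10e6]
lods = [6.0 * (c / 1e6) ** -0.5 for c in coverages]

slope, intercept = fit_loglog(coverages, lods)
lod_at_5m = predict_lod(5e6, slope, intercept)
print(round(slope, 3), round(lod_at_5m, 3))
```

Fitting in log space keeps the model to two parameters, which suits the five coverage levels available per lot.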

Limit of Detection for the Average Case

The overall population of samples has variable coverage. Figure 22 shows the coverage distribution (NES) in a large set of samples (N = 14,400) processed with version V2 of the VeriSeq NIPT assay. The expected LOD for the overall population is calculated as the expectation of the coverage-dependent LOD over the coverage distribution:

$$\overline{\mathrm{LOD}} = \int \mathrm{LOD}(\mathrm{NES})\, p(\mathrm{NES})\, d\mathrm{NES}$$

where LOD(NES) is the LOD as a function of coverage, and p(NES) is the probability density function of NES shown in Figure 22. The resulting average-case LOD for T21 is shown in Table 3.
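In practice an expectation of this kind can be approximated from a coverage histogram by weighting the LOD at each bin center with the share of samples in that bin. The sketch below assumes that discretization; the two-bin histogram, the power-law LOD function, and the names are hypothetical placeholders rather than the Figure 22 distribution or the fitted curve.

```python
def expected_lod(lod_fn, bin_centers, bin_counts):
    """Approximate the population-average LOD from a coverage histogram.

    Each bin contributes LOD(bin center) weighted by its share of samples,
    a discrete stand-in for integrating LOD(NES) * p(NES) over NES.
    """
    total = sum(bin_counts)
    if total <= 0:
        raise ValueError("histogram is empty")
    return sum(lod_fn(c) * n for c, n in zip(bin_centers, bin_counts)) / total


def example_lod(nes):
    """Hypothetical LOD curve in % fetal fraction (not the fitted curve)."""
    return 6.0 * (nes / 1e6) ** -0.5


# Hypothetical histogram: equal numbers of samples at 4 and 16 million reads.
centers = [4e6, 16e6]
sample_counts = [50, 50]
average_lod = expected_lod(example_lod, centers, sample_counts)
print(average_lod)
```

Because the LOD curve is convex in coverage, the population average is higher than the LOD evaluated at the mean coverage, which is why the average is taken over the distribution rather than at a single point.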

Table 3. Limit of detection at average population coverage

| Metric | Lot 1 | Lot 2 | Lot 3 |
| --- | --- | --- | --- |
| LOD21 | 2.81% | 2.40% | 2.43% |

Limit of Detection for the Other Aneuploidies, T13 and T18

As described in the previous study (Mathematical Model for Log Likelihood Ratio Score and Limit of Detection Prediction), the limit of detection for aneuploidies of different chromosomes can be inferred from the known LOD of at least one of the chromosomes. Using the relationship from that study, we can infer the LODs for trisomy 13 and trisomy 18 by applying the previously determined multiplication factors. The results are shown in Tables 4 and 5.

Table 4. Limit of detection at average population coverage for T13 and T18, inferred for evaluation against the acceptance criteria

| Aneuploidy | Observed LoD | LoD factor | LoD inferred from T21 |
| --- | --- | --- | --- |
| T21 | 2.81% | 100% | 2.81% |
| T18 | NA | 81% | 2.27% |
| T13 | NA | 72% | 2.02% |

Table 5. Precision %CV acceptance criteria

| Metric | %FF | Pass criterion | Pass/Fail |
| --- | --- | --- | --- |
| LOD13 | 2.02% | LOD < 4% | Pass |
| LOD18 | 2.27% | LOD < 4% | Pass |
| LOD21 | 2.81% | LOD < 4% | Pass |
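The inference and the acceptance check amount to one multiplication and one comparison per aneuploidy. The sketch below reproduces that arithmetic from the rounded values printed in the tables; the dictionary layout and names are illustrative assumptions. Note that multiplying the rounded inputs gives 2.2761% for T18, slightly above the tabulated 2.27%, presumably because the tables were computed from unrounded values.

```python
# Values as printed in the tables (percent fetal fraction and unitless factors).
LOD_T21 = 2.81
LOD_FACTORS = {"T21": 1.00, "T18": 0.81, "T13": 0.72}
PASS_LIMIT = 4.0  # acceptance criterion: LOD < 4% FF


def infer_lods(lod_t21, factors):
    """Scale the T21 LOD by each aneuploidy's factor."""
    return {name: lod_t21 * factor for name, factor in factors.items()}


def passes(lod, limit=PASS_LIMIT):
    """Acceptance check: the LOD must be strictly below the limit."""
    return lod < limit


inferred = infer_lods(LOD_T21, LOD_FACTORS)
for name, lod in sorted(inferred.items()):
    print(name, round(lod, 4), "Pass" if passes(lod) else "Fail")
```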

Conclusion

The limit of detection (LOD) of the VeriSeq NIPT system has been determined for the fetal fraction of fetal cfDNA (trisomy 13/18/21) in a background of normal maternal diploid cfDNA. The LOD values are summarized in Table 6. For all aneuploidies tested, the observed LOD was found to be below the value specified in the test requirements document. Based on the data presented in Table 6, the determined limits of detection pass the acceptance criteria of another study for all targeted aneuploidies.

Table 6. Aneuploidy fetal fraction LOD

| Aneuploidy | LOD (%FF) |
| --- | --- |
| T13 | 2.02% |
| T18 | 2.27% |
| T21 | 2.81% |

Example 2

Example 2 presents empirical data on various performance metrics for the conventional method, labeled NES, and for the fetal fraction LOD curve method described above.

Figure 23 shows NES coverage as a function of observed fetal fraction. A two-step coverage threshold technique was used to exclude samples. The various samples falling into the exclusion region can be seen in the figure. Most of the excluded samples have fetal fractions between 0% and 20%, and their coverage is bounded by the two threshold levels.

Figure 24 shows data exclusion using an exclusion region defined by the LOD curve and a read threshold. The left panel uses a read threshold of 2 million reads. The right panel shows data using a coverage threshold of 1 million reads. As shown, the coverage threshold can be lowered when it is desirable to exclude fewer samples. Many of the excluded samples have relatively high read coverage and relatively low observed fetal fraction. In contrast, in Figure 23, many of the excluded samples have relatively high fetal fraction and relatively low coverage.

Figure 25 shows the pass and fail rates for the first and second runs of the existing method and of the LOD QC method described above. The data show that the two methods have similar pass and fail rates. The final failure rate is 0.42% for the existing method and 0.38% for the LOD QC method.

Figure 26 shows data excluded by the two-step threshold method and rescued by the fetal fraction LOD curve method, labeled 2604. Samples excluded by the LOD QC method and rescued by the conventional method are shown in region 2602. The left panel of Figure 26 shows data for a coverage threshold of 2 million reads combined with the LOD curve. The right panel shows rescue data for a lower coverage threshold of 1 million reads. With the 2 million read coverage threshold of the left panel, the LOD QC method excluded 61 additional samples. When the coverage threshold was lowered to 1 million reads, however, the LOD QC method excluded 151 fewer samples.

Figure 27 shows samples rescued by the LOD QC method. The left panel shows, in region 2702, samples that were excluded by the conventional method and rescued by the LOD QC method. After being rerun, 80% of these samples fell within inclusion region 2704, 10% remained in the same region 2702, and 9.7% fell below the LOD curve into the exclusion region. The right panel shows samples rescued by applying the LOD curve with a coverage threshold of 1 million reads. Region 2708 shows samples that were excluded by the conventional method and rescued with the 1 million read threshold. After the samples were rerun, 80% moved into inclusion region 2710, 13.7% remained in the same region 2708, and 6.3% fell below the LOD curve into the exclusion region. The figure shows that the LOD method, alone or combined with a coverage threshold, can rescue a large percentage of samples that would otherwise be excluded. In practice, the LOD QC method helps avoid the need to rerun the rescued samples, saving time, cost, and resources for CNV detection.

Figure 28 shows, in region 2802, data excluded by the LOD QC method. After a rerun, a few samples moved above the LOD curve into the inclusion region 2708. More of the data remained in the same region 2802 or moved further below the LOD curve 2708. The results in Figures 27 and 28 indicate that the LOD QC method includes many samples that do in fact satisfy the QC conditions and excludes samples that fail the QC criteria even when rerun. This shows that the LOD QC method can intelligently and correctly separate positive and negative samples.

Figure 29 shows the pass and fail rates of the existing method and the LOD QC method over two runs. Overall, the pass and fail rates in the first and second runs are similar for the two methods.

Figure 30 shows that the 75% confidence LOD curve rescued sample 3002, which was a false positive for trisomy 21 (T21). The curve also rescued sample 3004, which was a false negative for trisomy 18 (T18). This yields a 100% reduction in T21 false positives and a 50% reduction in T18 false negatives, showing that the LOD QC method can increase both sensitivity and specificity.

Figure 31 shows that, for simulated T21 samples, samples that pass LOD QC can be detected with a sensitivity of 99.7%, whereas samples that fail LOD QC can be detected with a sensitivity of 88.1%. Regardless of LOD QC status, the overall sensitivity across all samples is 99.7%.

Claims

1. A method for processing a test sample containing cell-free nucleic acid fragments derived from a mother and a fetus, the method being implemented using a computer system that includes a memory and one or more processors, the method comprising:

(a) Determining an observed value of the fetal fraction of the test sample, wherein the fetal fraction of the test sample indicates the relative amount of fetal cell-free nucleic acid fragments in the test sample;

(b) Receiving, by the computer system, sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample;

(c) Aligning, by the computer system, the sequence reads of the cell-free nucleic acid fragments to a reference genome containing the sequence of interest, thereby providing sequence tags;

(d) Determining, by the computer system, a sequence tag coverage for at least a portion of the reference genome;

(e) Determining that the test sample is within an exclusion region based on the sequence tag coverage determined in (d) and the observed value of the fetal fraction determined in (a), wherein the exclusion region is defined by at least a fetal fraction limit of detection (LOD) curve, wherein the fetal fraction LOD curve varies with coverage value and indicates, for a given coverage value, the minimum observed fetal fraction required to achieve a given confidence level, the given confidence level being a confidence that the corresponding true fetal fraction is greater than the minimum fetal fraction required to achieve a detection criterion; and

(f) In response to determining that the test sample is within the exclusion region, excluding the test sample from determination of a copy number variation (CNV) of the sequence of interest, or resequencing the test sample to obtain resequencing reads for determining the CNV of the sequence of interest,

wherein the fetal fraction LOD curve is obtained by:

(i) For each of a plurality of observed fetal fractions and for each of a plurality of coverage levels, selecting one or more training samples from a first plurality of training samples, wherein the true fetal fractions of the selected training samples have the given confidence level, and wherein each training sample includes cell-free nucleic acid fragments derived from a mother and a fetus;

(ii) Using the one or more selected training samples, obtaining curves of true fetal fraction versus observed fetal fraction at the plurality of coverage levels;

(iii) Using a second plurality of training samples, obtaining true fetal fractions that are at least the minimum fetal fraction required to achieve the detection criterion for the plurality of coverage levels;

(iv) Using the curves of true fetal fraction versus observed fetal fraction and the true fetal fractions that are at least the minimum fetal fraction required to achieve the detection criterion, obtaining, for the plurality of coverage levels, a plurality of minimum observed fetal fractions required to achieve the given confidence level, the given confidence level being a confidence that the true fetal fraction is at least the minimum fetal fraction required to achieve the detection criterion; and

(v) Obtaining the fetal fraction LOD curve using the plurality of minimum observed fetal fractions for the plurality of coverage levels.
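One possible reading of steps (i)-(v) can be sketched as follows. This is an illustrative sketch only, not the claimed implementation: the function name `fetal_fraction_lod_curve`, its parameters, the binning of observed fetal fractions, and the use of an empirical quantile to express the confidence level are all assumptions. The minimum true fetal fraction per coverage level (step (iii)) is taken as a given input rather than derived from a second set of training samples.

```python
import numpy as np

def fetal_fraction_lod_curve(coverage, observed_ff, true_ff,
                             required_true_ff, ff_edges, confidence=0.75):
    """Return [(coverage_level, min_observed_ff), ...] (hypothetical helper).

    required_true_ff maps each coverage level to the minimum true fetal
    fraction assumed to satisfy the detection criterion (step iii input).
    """
    coverage = np.asarray(coverage)
    observed_ff = np.asarray(observed_ff, dtype=float)
    true_ff = np.asarray(true_ff, dtype=float)
    ff_edges = np.asarray(ff_edges, dtype=float)
    centers = 0.5 * (ff_edges[:-1] + ff_edges[1:])

    curve = []
    for level in sorted(required_true_ff):
        at_level = coverage == level
        min_observed = float("nan")  # stays NaN if no bin meets the criterion
        for lo, hi, center in zip(ff_edges[:-1], ff_edges[1:], centers):
            # Step (i): training samples at this coverage and observed FF bin.
            selected = at_level & (observed_ff >= lo) & (observed_ff < hi)
            if not selected.any():
                continue
            # Step (ii): true FF that `confidence` of these samples exceed.
            guaranteed_true_ff = np.quantile(true_ff[selected], 1.0 - confidence)
            # Step (iv): lowest observed FF whose guaranteed true FF is enough.
            if guaranteed_true_ff >= required_true_ff[level]:
                min_observed = float(center)
                break
        # Step (v): one point of the LOD curve per coverage level.
        curve.append((level, min_observed))
    return curve
```

The returned points can then be interpolated across coverage values to give a continuous curve. A step search over bins is used here instead of inverting a fitted curve, which is a simplification chosen for robustness when the empirical quantiles are not monotonic.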

2. The method according to claim 1, further comprising:

Repeating (a)-(d) using the resequencing reads; and

Determining that the test sample is outside the exclusion region.

3. The method of claim 1, wherein the fetal fraction LOD curve is obtained based on the LOD of affected training samples that are affected by the CNV.

4. The method of claim 3, wherein the affected training sample includes a computer-simulated sample.

5. The method of claim 3, wherein the affected training sample comprises an in vitro sample.

6. The method of claim 3, wherein the affected training sample is obtained by combining samples having two or more fetal fractions.

7. The method of claim 1, wherein the detection criterion is that, at the minimum observed value of the fetal fraction and the given coverage value, there is the given confidence level that the corresponding true fetal fraction is greater than a specified LOD required to detect affected samples at a detection confidence level.

8. The method of claim 7, wherein the detection criterion is an X% confidence level that, for the minimum observed value of the fetal fraction, the corresponding true fetal fraction is greater than the specified LOD required to detect affected samples at a Y% detection confidence level.

9. The method of claim 8, wherein X% is about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99%, or 99.5%.

10. The method of claim 8, wherein Y% is a detection confidence level of about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or 99%.

11. The method of claim 8, wherein X% is 50% and Y% is 95%.

12. The method of claim 7, wherein the specified LOD is determined as the minimum observed fetal fraction at which Y% of affected samples can be detected.

13. The method of claim 7, wherein the distribution of true fetal fractions observed at an observed coverage is used to obtain the detection criterion for the observed fetal fraction at said coverage.

14. The method according to any one of claims 1-13, wherein the exclusion region is below the fetal fraction LOD curve.

15. The method according to any one of claims 1-13, wherein the exclusion region is defined by the fetal fraction LOD curve and a coverage threshold.

16. The method according to any one of claims 1-13, wherein the exclusion region is below both the fetal fraction LOD curve and the coverage threshold.
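The alternative exclusion regions of claims 14 and 16 can be illustrated with a minimal sketch. The function names, the linear interpolation between curve points, and the numeric curve values below are hypothetical and are not taken from the specification.

```python
import numpy as np

# Hypothetical LOD curve: minimum observed fetal fraction at each coverage.
CURVE_COVERAGE = np.array([1e6, 2e6, 4e6, 8e6])
CURVE_MIN_FF = np.array([0.08, 0.05, 0.03, 0.02])

def below_lod_curve(coverage, observed_ff,
                    curve_coverage=CURVE_COVERAGE, curve_min_ff=CURVE_MIN_FF):
    """True if the observed fetal fraction lies below the LOD curve."""
    # Interpolate the required fetal fraction at this coverage; values
    # outside the curve's range are clamped to the end points by np.interp.
    required_ff = np.interp(coverage, curve_coverage, curve_min_ff)
    return bool(observed_ff < required_ff)

def in_exclusion_region(coverage, observed_ff, coverage_threshold=None):
    """Exclusion test with or without a coverage threshold.

    Without a threshold, the region is everything below the LOD curve
    (as in claim 14). With a threshold, the sample must be below both the
    curve and the coverage threshold (as in claim 16).
    """
    below_curve = below_lod_curve(coverage, observed_ff)
    if coverage_threshold is None:
        return below_curve
    return below_curve and coverage < coverage_threshold
```

For example, a sample with 3 million reads and an observed fetal fraction of 0.02 falls below this illustrative curve, but it is not excluded when a 2 million read threshold is also required, because its coverage is above that threshold.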

17. The method according to any one of claims 1-13, wherein determining the sequence tag coverage for the at least a portion of the reference genome comprises:

(i) Dividing the reference genome into a plurality of groups;

(ii) Determining the number of sequence tags aligned to each group; and

(iii) Determining the sequence tag coverage using the numbers of sequence tags in the groups of the at least a portion of the reference genome.
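Steps (i)-(iii) amount to counting aligned tags per fixed-size group of the reference and summing over the groups of interest. The sketch below assumes a single linear coordinate system and equal-width groups; the function names and parameters are hypothetical.

```python
def group_tag_counts(tag_positions, genome_length, group_size):
    """Count sequence tags per group (steps i and ii).

    tag_positions are 0-based alignment start coordinates; tags that fall
    outside [0, genome_length) are ignored.
    """
    n_groups = -(-genome_length // group_size)  # ceiling division
    counts = [0] * n_groups
    for pos in tag_positions:
        if 0 <= pos < genome_length:
            counts[pos // group_size] += 1
    return counts

def region_coverage(counts, first_group, last_group):
    """Sum tag counts over an inclusive range of groups (step iii)."""
    return sum(counts[first_group:last_group + 1])
```

In this sketch the coverage is a raw tag count; any adjustment for variation between groups that is unrelated to copy number would be applied to `counts` before summing.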

18. The method of claim 17, further comprising, prior to (iii), adjusting the number of sequence tags aligned to each group to account for inter-group variation due to factors other than copy number variation.

19. The method according to any one of claims 1-13, wherein the observed value of the fetal fraction of the test sample is determined based on the sizes of the cell-free nucleic acid fragments.

20. The method of claim 19, wherein the observed value of the fetal fraction of the test sample is determined by:

Obtaining a frequency distribution of the sizes of the cell-free nucleic acid fragments; and

Applying the frequency distribution to a model that relates fetal fraction to fragment size frequency to obtain the observed value of the fetal fraction.
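The two steps above can be sketched with a simple linear model. The claim does not specify the form of the model, so the linear form, the size bins, and the `weights` and `intercept` parameters are assumptions made only for illustration; in practice they would come from a model fitted to training data.

```python
import numpy as np

def size_frequency(fragment_sizes, size_edges):
    """Normalized frequency distribution of fragment sizes over size bins."""
    counts, _ = np.histogram(fragment_sizes, bins=size_edges)
    total = counts.sum()
    if total == 0:
        raise ValueError("no fragments fall within the size bins")
    return counts / total

def predict_fetal_fraction(frequencies, weights, intercept):
    """Apply a hypothetical linear model to the size frequencies.

    The result is clipped to [0, 1] because it represents a fraction.
    """
    estimate = intercept + float(np.dot(frequencies, weights))
    return min(max(estimate, 0.0), 1.0)
```

A linear model is used here only because it is the simplest mapping from a frequency vector to a single value; any regression model with the same inputs and output could be substituted.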

21. The method of claim 17, wherein the observed value of the fetal fraction of the test sample is determined based on coverage information for the groups of the reference genome.

22. The method of claim 21, wherein the observed value of the fetal fraction is calculated by applying coverage values of a plurality of groups of the reference genome to a model that relates fetal fraction to group coverage.

23. The method of claim 22, wherein the plurality of groups of the reference genome have a higher fraction of fetal cell-free nucleic acid fragments than other groups.

24. The method according to any one of claims 1-13, wherein the observed value of the fetal fraction of the test sample is determined based on coverage information for groups of a sex chromosome.

25. A system for assessing the copy number of a nucleic acid sequence of interest in a test sample, the system comprising a processor and one or more computer-readable storage media having instructions stored thereon for performing the following operations on the processor:

(b) Receiving sequence reads obtained by sequencing cell-free nucleic acid fragments in the test sample;

(c) Aligning the sequence reads of the cell-free nucleic acid fragments to a reference genome containing the sequence of interest to provide sequence tags;

(d) Determining a sequence tag coverage for at least a portion of the reference genome; and

(e) Determining that the test sample is within an exclusion region based on the sequence tag coverage determined in (d) and the observed value of the fetal fraction determined in (a), wherein the exclusion region is defined by at least a fetal fraction limit of detection (LOD) curve, wherein the fetal fraction LOD curve varies with coverage value and indicates, for a given coverage value, the minimum observed fetal fraction required to achieve a given confidence level, the given confidence level being a confidence that the corresponding true fetal fraction is greater than the minimum fetal fraction required to achieve a detection criterion,

wherein the fetal fraction LOD curve is obtained by:

(i) For each of a plurality of observed fetal fractions and for each of a plurality of coverage levels, selecting one or more training samples from a first plurality of training samples, wherein the true fetal fractions of the selected training samples have the given confidence level, and wherein each training sample includes cell-free nucleic acid fragments derived from a mother and a fetus.

26. A machine-readable medium storing program code, which, when executed by one or more processors of a computer system, causes the computer system to:

(a) Determine an observed value of the fetal fraction of the test sample, wherein the fetal fraction of the test sample indicates the relative amount of fetal cell-free nucleic acid fragments in the test sample;

The LOD curve mentioned above is obtained through the following:

27. A system for assessing the copy number of a nucleic acid sequence of interest in a test sample, the system comprising one or more processors and one or more computer-readable storage media having program code stored thereon for performing the following:

(d) Determine the coverage of the sequence tag for at least a portion of the reference genome;

(f) In response to determining that the test sample is within the exclusion region, exclude the test sample from determination of a copy number variation (CNV) of the sequence of interest, or obtain resequencing reads from resequencing of the test sample for determining the CNV of the sequence of interest.

The LOD curve mentioned above is obtained through the following:

28. The system of claim 27, wherein the program code includes code for implementing the following:

Use the resequencing reads to repeat (a)-(d);

Determine that the test sample is outside the exclusion region; and

A CNV determination is made for the test sample regarding the sequence of interest.

29. The system of claim 27, wherein the fetal fraction LOD curve is obtained based on the LOD of affected training samples that are affected by the CNV.

30. The system of claim 29, wherein the affected training sample includes a computer-simulated sample.

31. The system of claim 29, wherein the affected training sample comprises an in vitro sample.

32. The system of claim 29, wherein the affected training sample is obtained by combining samples having two or more fetal fractions.

33. The system of claim 27, wherein the detection criterion is that, at the minimum observed value of the fetal fraction and the given coverage value, there is the given confidence level that the corresponding true fetal fraction is greater than a specified LOD required to detect affected samples at a detection confidence level.

34. The system of claim 33, wherein the detection criterion is an X% confidence level that, for the minimum observed value of the fetal fraction, the corresponding true fetal fraction is greater than the specified LOD required to detect affected samples at a Y% detection confidence level.

35. The system of claim 34, wherein X% is about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99%, or 99.5%.

36. The system of claim 34, wherein Y% is a detection confidence level of about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or 99%.

37. The system of claim 34, wherein X% is 50% and Y% is 95%.

38. The system of claim 33, wherein the specified LOD is determined as the minimum observed fetal fraction at which Y% of affected samples can be detected.

39. The system of claim 33, wherein the distribution of true fetal fractions observed at an observed coverage is used to obtain the detection criterion for the observed fetal fraction at said coverage.

40. The system according to any one of claims 27-39, wherein the exclusion region is below the fetal fraction LOD curve.

41. The system according to any one of claims 27-39, wherein the exclusion region is defined by the fetal fraction LOD curve and a coverage threshold.

42. The system according to any one of claims 27-39, wherein the exclusion region is below both the fetal fraction LOD curve and the coverage threshold.

43. The system according to any one of claims 27-39, wherein determining the sequence tag coverage for the at least a portion of the reference genome comprises:

(i) Dividing the reference genome into a plurality of groups;

(ii) Determining the number of sequence tags aligned to each group; and

44. The system of claim 43, wherein determining the sequence tag coverage of the portion of the reference genome further includes, prior to (iii), adjusting the number of sequence tags aligned with the group by taking into account inter-group variations due to factors other than copy number variations.

45. The system according to any one of claims 27-39, wherein the observed value of the fetal fraction of the test sample is determined based on the sizes of the cell-free nucleic acid fragments.

46. The system of claim 45, wherein the observed value of the fetal fraction of the test sample is determined by the following steps:

47. The system of claim 43, wherein the observed value of the fetal fraction of the test sample is determined based on coverage information for the groups of the reference genome.

48. The system of claim 47, wherein the observed value of the fetal fraction is calculated by applying coverage values of a plurality of groups of the reference genome to a model that relates fetal fraction to group coverage.

49. The system of claim 48, wherein the plurality of groups of the reference genome have a higher fraction of fetal cell-free nucleic acid fragments than other groups.

50. The system according to any one of claims 27-39, wherein the observed value of the fetal fraction of the test sample is determined based on coverage information for groups of a sex chromosome.

51. A machine-readable medium storing program code, which, when executed by one or more processors of a computer system, causes the computer system to perform a method comprising the steps of:

(c) The computer system aligns the sequence read of the cell-free nucleic acid fragment with a reference genome containing the sequence of interest, thereby providing a sequence tag;

The LOD curve mentioned above is obtained through the following:

52. The machine-readable medium of claim 51, wherein the test sample has been determined to be negative for the CNV determination of the sequence of interest.

53. The machine-readable medium of claim 51, further comprising the step of:

Use the resequencing reads to repeat (a)-(d);

Determine that the test sample is outside the exclusion region; and

54. The machine-readable medium of claim 51, wherein the fetal fraction LOD curve is obtained based on the LOD of affected training samples that are affected by the CNV.

55. The machine-readable medium of claim 54, wherein the affected training sample comprises a computer-simulated sample.

56. The machine-readable medium of claim 54, wherein the affected training sample comprises an in vitro sample.

57. The machine-readable medium of claim 54, wherein the affected training sample is obtained by combining samples having two or more fetal fractions.

58. The machine-readable medium of claim 51, wherein the detection criterion is that, at the minimum observed value of the fetal fraction and the given coverage value, there is the given confidence level that the corresponding true fetal fraction is greater than a specified LOD required to detect affected samples at a detection confidence level.

59. The machine-readable medium of claim 58, wherein the detection criterion is an X% confidence level that, for the minimum observed value of the fetal fraction, the corresponding true fetal fraction is greater than the specified LOD required to detect affected samples at a Y% detection confidence level.

60. The machine-readable medium of claim 59, wherein X% is about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99%, or 99.5%.

61. The machine-readable medium of claim 59, wherein Y% is a detection confidence level of about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or 99%.

62. The machine-readable medium of claim 59, wherein X% is 50% and Y% is 95%.

63. The machine-readable medium of claim 58, wherein the specified LOD is determined as the minimum observed fetal fraction at which Y% of affected samples can be detected.

64. The machine-readable medium of claim 58, wherein the distribution of true fetal fractions observed at an observed coverage is used to obtain the detection criterion for the observed fetal fraction at said coverage.

65. The machine-readable medium according to any one of claims 51-64, wherein the exclusion region is below the fetal fraction LOD curve.

66. The machine-readable medium according to any one of claims 51-64, wherein the exclusion region is defined by the fetal fraction LOD curve and a coverage threshold.

67. The machine-readable medium according to any one of claims 51-64, wherein the exclusion region is below both the fetal fraction LOD curve and the coverage threshold.

68. The machine-readable medium according to any one of claims 51-64, wherein determining the sequence tag coverage for the at least a portion of the reference genome comprises:

(i) Dividing the reference genome into a plurality of groups;

(ii) Determining the number of sequence tags aligned to each group; and

69. The machine-readable medium of claim 68, wherein determining the sequence tag coverage of the portion of the reference genome further includes, prior to (iii), adjusting the number of sequence tags aligned with the group by taking into account inter-group variations due to factors other than copy number variations.

70. The machine-readable medium according to any one of claims 51-64, wherein the observed value of the fetal fraction of the test sample is determined based on the sizes of the cell-free nucleic acid fragments.

71. The machine-readable medium of claim 70, wherein the observed fetal fraction of the test sample is determined by the following steps:

72. The machine-readable medium of claim 68, wherein the observed value of the fetal fraction of the test sample is determined based on coverage information for the groups of the reference genome.

73. The machine-readable medium of claim 72, wherein the observed value of the fetal fraction is calculated by applying coverage values of a plurality of groups of the reference genome to a model that relates fetal fraction to group coverage.

74. The machine-readable medium of claim 73, wherein the plurality of groups of the reference genome have a higher fraction of fetal cell-free nucleic acid fragments than other groups.

75. The machine-readable medium according to any one of claims 51-64, wherein the observed value of the fetal fraction of the test sample is determined based on coverage information for groups of a sex chromosome.