HK1246901B

HK1246901B - Method and computer system for analyzing genomic DNA of organisms

Info

Publication number: HK1246901B
Application number: HK18106330.6A
Authority: HK
Inventors: R‧卓马纳克; B‧A‧彼得斯; B‧G‧科尔马尼
Original assignee: 完整基因有限公司
Priority date: 2011-04-14
Filing date: 2018-05-16
Publication date: 2022-04-01

Description

Method and computer system for analyzing genomic DNA of organisms

This application is a divisional of the application filed on April 13, 2012, with application number 201280029331.7 (PCT application number PCT/US2012/033686), entitled "Processing and Analysis of Complex Nucleic Acid Sequence Data".

Cross-Reference to Related Applications

This application claims the benefit of priority to U.S. Provisional Patent Application No. 61/517,196, filed April 14, 2011, which is hereby incorporated by reference in its entirety.

This application claims the benefit of priority to U.S. Provisional Patent Application No. 61/527,428, filed August 25, 2011, which is hereby incorporated by reference in its entirety.

This application claims the benefit of priority to U.S. Provisional Patent Application No. 61/546,516, filed October 12, 2011, which is hereby incorporated by reference in its entirety.

Background of the Invention

There is a need for improved techniques for analyzing complex nucleic acids, in particular methods for improving sequence accuracy and for analyzing sequences that carry a large number of errors introduced through nucleic acid amplification.

Furthermore, there is a need for improved techniques for determining the parental contributions to the genomes of higher organisms, i.e., haplotype phasing of human genomes. Methods for haplotype phasing, including computational methods and experimental phasing, are reviewed in Browning and Browning, Nature Reviews Genetics 12:703-714, 2011.

Summary of the Invention

The present invention provides techniques for analyzing sequence information derived from sequencing of complex nucleic acids (as defined herein). These techniques yield haplotype phasing, error reduction, and other features, and are based on algorithms and analytical techniques developed in conjunction with Long Fragment Read (LFR) technology.

According to one aspect of the present invention, methods are provided for determining the sequence of a complex nucleic acid (e.g., a whole genome) of one or more organisms, that is, an individual organism or a population of organisms. Such methods comprise: (a) receiving at one or more computing devices a plurality of reads of the complex nucleic acid; and (b) producing, with the computing devices, an assembled sequence of the complex nucleic acid from the reads, the assembled sequence comprising fewer than 1.0, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.08, 0.07, 0.06, 0.05, or 0.04 false single nucleotide variants per megabase at a call rate of 70, 75, 80, 85, 90, or 95 percent or greater, wherein the methods are performed by one or more computing devices. In some aspects, a computer-readable non-transitory storage medium stores one or more sequences of instructions that comprise instructions which, when executed by one or more computing devices, cause the one or more computing devices to perform the steps of such methods.

According to one embodiment, in which such methods involve haplotype phasing, the method further comprises identifying a plurality of sequence variants in the assembled sequence and phasing the sequence variants (e.g., 70, 75, 80, 85, 90, or 95 percent or more of the sequence variants) to produce a phased sequence, i.e., a sequence in which the sequence variants are phased. Such phasing information can be used for error correction. For example, according to one embodiment, such methods comprise identifying as an error a sequence variant that is inconsistent with the phasing of at least two (or three or more) phased sequence variants.

According to another such embodiment, in such methods the step of receiving a plurality of reads of the complex nucleic acid comprises a computing device and/or computer logic thereof receiving a plurality of reads from each of a plurality of aliquots, each aliquot comprising one or more fragments of the complex nucleic acid. Information regarding the aliquot of origin of such fragments is useful for correcting errors or for calling a base that otherwise would have been a "no-call." According to one such embodiment, such methods comprise a computing device and/or computer logic thereof calling a base at a position of the assembled sequence on the basis of preliminary base calls for that position from two or more aliquots. For example, methods may comprise calling a base at a position of the assembled sequence on the basis of preliminary base calls from at least two, at least three, at least four, or more than four aliquots. In some embodiments, such methods may comprise identifying a base call as true if it is present in at least two, at least three, at least four, or more than four aliquots. In some embodiments, such methods may comprise identifying a base call as true if it is present in at least a majority (or at least 60%, at least 75%, or at least 80%) of the aliquots in which a preliminary base call is made for that position in the assembled sequence. According to another such embodiment, such methods comprise a computing device and/or computer logic thereof identifying a base call as true if it is present three or more times in reads from two or more aliquots.
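The aliquot-support rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `supported_bases`, the `(aliquot_id, base)` input layout, and the way the two thresholds are combined are all assumptions made for the example.

```python
from collections import defaultdict

def supported_bases(prelim_calls, min_aliquots=2, min_fraction=0.0):
    """Return the set of bases accepted as true at one position.

    prelim_calls: iterable of (aliquot_id, base) preliminary base calls.
    A base is accepted if it was seen in at least `min_aliquots` distinct
    aliquots AND in at least `min_fraction` of all aliquots that made a
    preliminary call at this position (hypothetical combination rule).
    """
    aliquots_by_base = defaultdict(set)
    for aliquot_id, base in prelim_calls:
        aliquots_by_base[base].add(aliquot_id)

    # Every aliquot that made any preliminary call at this position.
    calling_aliquots = set()
    for ids in aliquots_by_base.values():
        calling_aliquots |= ids

    accepted = set()
    for base, ids in aliquots_by_base.items():
        enough_aliquots = len(ids) >= min_aliquots
        enough_fraction = len(ids) >= min_fraction * len(calling_aliquots)
        if enough_aliquots and enough_fraction:
            accepted.add(base)
    return accepted
```

Counting distinct aliquots rather than reads is the point of the sketch: two reads carrying the same base from a single aliquot count once, because an amplification error inside one aliquot can be copied into many reads.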

According to another such embodiment, the aliquot from which a read originates is determined by identifying an aliquot-specific tag (or set of aliquot-specific tags) attached to each fragment. Optionally, such aliquot-specific tags comprise an error-correction or error-detection code (e.g., a Reed-Solomon error-correction code). According to one embodiment of the invention, upon sequencing a fragment and its attached aliquot-specific tag, the resulting read comprises tag sequence data and fragment sequence data. If the tag sequence data is correct, i.e., if the tag sequence matches a tag sequence used for aliquot identification, or alternatively if the tag sequence data has one or more errors that can be corrected using the error-correction code, reads that include such tag sequence data can be used for all purposes. In particular, they can be used in a first computer process (e.g., executed on one or more computing devices) that requires tag sequence data and produces a first output, including without limitation haplotype phasing, sample multiplexing, library multiplexing, phasing, or any error-correction process that depends on correct tag sequence data (e.g., an error-correction process based on identifying the aliquot of origin of a particular read). If the tag sequence is incorrect and cannot be corrected, reads that include such incorrect tag sequence data are not discarded. Instead, they are used in a second computer process (e.g., executed by one or more computing devices) that does not require tag sequence data, including without limitation mapping, assembly, and set-based statistics, and that produces a second output.
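The routing of reads between the tag-dependent and tag-independent processes can be sketched as below. Note the substitution: a real Reed-Solomon decoder is replaced here by a much simpler stand-in that accepts a tag only if exactly one known tag lies within a small Hamming distance. The names `resolve_tag` and `route_reads`, and the `(tag, fragment)` read layout, are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def resolve_tag(tag, known_tags, max_errors=1):
    """Return the aliquot tag for `tag`, or None if it cannot be corrected.

    Stand-in for an error-correcting decoder: an exact match is accepted,
    otherwise the tag is corrected only when a single known tag is within
    `max_errors` mismatches (ambiguous cases are left uncorrected).
    """
    if tag in known_tags:
        return tag
    candidates = [t for t in known_tags
                  if len(t) == len(tag) and hamming(t, tag) <= max_errors]
    return candidates[0] if len(candidates) == 1 else None

def route_reads(reads, known_tags):
    """Split reads into inputs for the two processes.

    Returns (tag_dependent, tag_independent):
      tag_dependent   - (aliquot_tag, fragment) pairs with a usable tag,
                        e.g. for phasing or aliquot-based error correction
      tag_independent - every fragment, including those with bad tags,
                        e.g. for mapping and assembly
    """
    tag_dependent, tag_independent = [], []
    for tag, fragment in reads:
        tag_independent.append(fragment)  # never discarded
        aliquot_tag = resolve_tag(tag, known_tags)
        if aliquot_tag is not None:
            tag_dependent.append((aliquot_tag, fragment))
    return tag_dependent, tag_independent
```

The design choice mirrored from the text is that an uncorrectable tag removes a read only from the tag-dependent path; its fragment sequence still contributes to mapping and assembly.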

According to another embodiment, such methods further comprise: a computing device and/or computer logic thereof providing a first phased sequence of a region of the complex nucleic acid, the region comprising a short tandem repeat; a computing device and/or computer logic thereof comparing reads (e.g., regular or mate-pair reads) of the first phased sequence of the region with reads of a second phased sequence of the region (e.g., using sequence coverage); and a computing device and/or computer logic thereof identifying an expansion of the short tandem repeat in one of the first phased sequence or the second phased sequence based on the comparison.

According to another embodiment, the method further comprises a computing device and/or computer logic thereof obtaining genotype data from at least one parent of the organism and producing an assembled sequence of the complex nucleic acid from the reads and the genotype data.

According to another embodiment, the method further comprises a computing device and/or computer logic thereof performing steps that comprise: aligning a plurality of the reads for a first region of the complex nucleic acid, thereby creating an overlap between the aligned reads; identifying N candidate heterozygous positions (hets) within the overlap; clustering the space of 2^N to 4^N possibilities, or a selected subspace thereof, thereby creating a plurality of clusters; identifying the two clusters with the highest density, each identified cluster comprising a substantially noise-free center; and repeating the foregoing steps for one or more additional regions of the complex nucleic acid. The clusters identified for each region may define contigs, and these contigs may be matched with one another to form groups of contigs, one group representing each haplotype.
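One way to picture the search over the 2^N possibilities is the toy sketch below. It is an assumption-laden simplification rather than the clustering the patent describes: it enumerates only the biallelic 2^N subspace, encodes alleles as "0"/"1" with "-" for an uncovered site, and scores each candidate center by the number of observations compatible with it. The function name and encoding are hypothetical.

```python
from itertools import product

def top_two_haplotypes(observations, n_sites):
    """Return the two candidate haplotypes with the highest support.

    observations: strings of length n_sites over {"0", "1", "-"}, one per
    aligned read or fragment; "-" means the site was not covered.
    Each of the 2^N candidate centers is scored by how many observations
    agree with it at every covered site (a crude density estimate).
    """
    density = {}
    for center in product("01", repeat=n_sites):
        density[center] = sum(
            all(o == "-" or o == c for o, c in zip(obs, center))
            for obs in observations
        )
    # Rank centers from densest to sparsest and keep the best two.
    ranked = sorted(density, key=density.get, reverse=True)
    return "".join(ranked[0]), "".join(ranked[1])
```

Exhaustive enumeration is only practical for small N; the text's mention of a "selected subspace" suggests that a real implementation would restrict the candidates, which this sketch does not attempt.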

According to another embodiment, such methods further comprise providing an amount of the complex nucleic acid and sequencing the complex nucleic acid to produce the reads.

According to another embodiment, in such methods the complex nucleic acid is selected from the group consisting of a genome, an exome, a transcriptome, a methylome, a mixture of genomes of different organisms, and a mixture of genomes of different cell types of an organism.

According to another aspect of the present invention, an assembled human genome sequence produced by any of the foregoing methods is provided. For example, one or more computer-readable non-transitory storage media store an assembled human genome sequence produced by any of the foregoing methods. According to another aspect, a computer-readable non-transitory storage medium stores one or more sequences of instructions that comprise instructions which, when executed by one or more computing devices, cause the one or more computing devices to perform any, some, or all of the foregoing methods.

According to another aspect of the present invention, methods are provided for determining a whole human genome sequence, such methods comprising: (a) receiving, at one or more computing devices, a plurality of reads of the genome; and (b) producing, with the one or more computing devices, an assembled sequence of the genome from the reads, the assembled sequence comprising fewer than 600 false heterozygous single nucleotide variants per gigabase at a genome call rate of 70% or greater. According to one embodiment, the assembled sequence of the genome has a genome call rate of 70% or greater and an exome call rate of 70% or greater. In some aspects, a computer-readable non-transitory storage medium stores one or more sequences of instructions that comprise instructions which, when executed by one or more computing devices, cause the one or more computing devices to perform any of the inventive methods described herein.

According to another aspect of the present invention, methods are provided for determining a whole human genome sequence, such methods comprising: (a) receiving, at one or more computing devices, a plurality of reads from each of a plurality of aliquots, each aliquot comprising one or more fragments of the genome; and (b) producing, with the one or more computing devices, a phased, assembled sequence of the genome from the reads, the assembled sequence comprising fewer than 1000 false single nucleotide variants per gigabase at a genome call rate of 70% or greater. In some aspects, a computer-readable non-transitory storage medium stores one or more sequences of instructions that comprise instructions which, when executed by one or more computing devices, cause the one or more computing devices to perform such methods.

Brief Description of the Drawings

Figures 1A and 1B show examples of sequencing systems.

Figure 2 shows an example of a computing device that can be used in, or in conjunction with, a sequencer and/or a computer system.

Figure 3 shows the general architecture of the LFR algorithm.

Figure 4 shows pairwise analysis of nearby heterozygous SNPs.

Figure 5 shows an example of selecting a hypothesis and assigning a score to the hypothesis.

Figure 6 shows graph construction.

Figure 7 shows graph optimization.

Figure 8 shows contig alignment.

Figure 9 shows parent-assisted universal phasing.

Figure 10 shows natural contig separations.

Figure 11 shows universal phasing.

Figure 12 shows error detection using LFR.

Figure 13 shows an example of a method for reducing the number of false negatives, in which a confident heterozygous SNP call can be made despite a small number of reads.

Figure 14 shows detection of CTG repeat expansions in human embryos using haplotype-resolved clone coverage.

Figure 15 is a graph showing amplification of purified genomic DNA standards (1.031, 8.25, and 66 picograms [pg]) and of 1 or 10 PVP40 cells using a multiple displacement amplification (MDA) protocol, as described in Example 1.

Figure 16 shows data relating to GC bias resulting from amplification using two MDA protocols. The average cycle number across the entire plate was determined and subtracted from the cycle number of each individual marker to compute a "Δ cycle" number. The Δ cycle was plotted against the GC content of the 1000 base pairs surrounding each marker to indicate the relative GC bias of each sample (not shown). The absolute values of the Δ cycles were summed to create the "Δ sum" metric. A low Δ sum and a relatively flat plot of the data against GC content yield a well-represented whole genome sequence. The Δ sum was 61 for our MDA method and 287 for SurePlex-amplified DNA, indicating that our protocol produces much less GC bias than the SurePlex protocol.
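As a worked illustration of the Δ cycle and Δ sum arithmetic described for Figure 16, the sketch below uses made-up cycle numbers; the function name `delta_sum` and the input values are assumptions, not data from the patent.

```python
def delta_sum(marker_cycles):
    """Compute per-marker delta cycles and their summed absolute value.

    marker_cycles: amplification cycle number measured for each marker
    on the plate. The plate-wide mean is subtracted from each marker to
    give its delta cycle; the delta sum is the sum of absolute deltas.
    """
    plate_mean = sum(marker_cycles) / len(marker_cycles)
    deltas = [cycle - plate_mean for cycle in marker_cycles]
    return deltas, sum(abs(d) for d in deltas)

# Hypothetical plate with four markers.
deltas, total = delta_sum([20, 22, 24, 26])
```

For these illustrative inputs the plate mean is 23, the Δ cycles are -3, -1, 1, and 3, and the Δ sum is 8. A smaller Δ sum means the markers amplify more uniformly.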

Figure 17 shows genomic coverage of samples 7C and 10C. Coverage was plotted using a 10-megabase moving average of 100-kilobase coverage windows normalized to haploid genome coverage. The dashed lines at copy numbers 1 and 3 represent haploid and triploid copy numbers, respectively. Both embryos are male and have haploid copy number for the X and Y chromosomes. No other losses or gains of whole chromosomes or large chromosomal segments were found in these samples.
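The smoothing described for Figure 17 can be sketched as a plain moving average. Assuming 100-kilobase windows, a 10-megabase average spans 100 consecutive windows; the function name, the parameter names, and the choice of a simple unweighted, non-padded average are all assumptions.

```python
def normalized_moving_average(window_coverage, haploid_coverage,
                              windows_per_average=100):
    """Smooth per-window coverage expressed in haploid copy-number units.

    window_coverage: mean coverage of each consecutive 100 kb window.
    haploid_coverage: coverage expected for a single (haploid) copy.
    windows_per_average: number of windows per average (100 windows of
    100 kb correspond to 10 Mb).
    Returns one value per full span; the ends are not padded.
    """
    normalized = [c / haploid_coverage for c in window_coverage]
    k = windows_per_average
    return [sum(normalized[i:i + k]) / k
            for i in range(len(normalized) - k + 1)]
```

On such a plot, values near 1 indicate a single copy (as for X and Y in a male sample), values near 2 the normal diploid state, and values near 3 a gained copy.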

Figure 18 is a schematic representation of an embodiment of a barcode adapter design for use in the methods of the invention. LFR adapters are composed of a unique 5' barcode adapter, a common 5' adapter, and a common 3' adapter. The common adapters are each designed with 3' dideoxynucleotides that cannot ligate to the 3' fragment, which eliminates adapter dimer formation. After ligation, the blocked portion of the adapter is removed and replaced with an unblocked oligonucleotide. The remaining nick is resolved by subsequent nick translation with Taq polymerase and ligation with T4 ligase.

Figure 19 shows cumulative GC coverage plots. Cumulative coverage of GC was plotted for LFR and standard libraries to compare differences in GC bias. For sample NA19240 (a and b), three LFR libraries (Replicate 1, Replicate 2, and 10 cell) and one standard library are plotted for both the entire genome (c) and the coding-only portion (d). In all LFR libraries a loss of coverage in high-GC regions is evident, and it is more pronounced in coding regions (b and d), which contain a higher proportion of GC-rich regions.

Figure 20 shows a comparison of haplotyping performance between genome assemblies. Variant calls for the standard and LFR assembled libraries were combined and used as loci for phasing, except where specified. The LFR phasing rate is based on a calculation of parental phased heterozygous SNPs. *For individuals without parental genome data (NA12891, NA12892, and NA20431), the phasing rate was calculated by dividing the number of phased heterozygous SNPs by the number of heterozygous SNPs expected to be real (the number of SNPs for which phasing was attempted, minus 50,000 expected errors). N50 calculations are based on the total assembled length of all contigs relative to the NCBI build 36 human reference genome (build 37 in the case of NA19240 10 cell and high coverage, and NA20431 high coverage). Haploid fragment coverage is four times greater than the number of cells because all DNA is denatured to single strands before being dispersed across a 384-well plate. An insufficient amount of starting DNA explains the lower phasing efficiency in the NA20431 genome. #The 10-cell sample was measured, by the coverage of individual wells, to contain more than 10 cells, which is likely the result of these cells being at various stages of the cell cycle during collection. Phasing rates ranged from 84% to 97%.
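The two summary statistics mentioned for Figure 20 can be written out as follows. The phasing-rate formula follows the footnote above; the N50 function uses the standard definition (the contig length at which the running total of lengths, sorted longest first, reaches half the total assembled length), which is assumed here rather than stated in the text. Function names and example numbers are illustrative only.

```python
def phasing_rate(phased_het_snps, attempted_het_snps, expected_errors=50_000):
    """Phased SNPs divided by the SNPs expected to be real.

    The denominator is the number of heterozygous SNPs for which phasing
    was attempted, minus the expected number of erroneous calls.
    """
    return phased_het_snps / (attempted_het_snps - expected_errors)

def n50(contig_lengths):
    """Standard N50: length of the contig at which the cumulative length
    (longest contigs first) reaches half of the total assembled length."""
    half_total = sum(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    return 0  # empty input

# Hypothetical numbers, not taken from the figure.
rate = phasing_rate(1_900_000, 2_050_000)
```

With these made-up inputs the denominator is 2,000,000, so the phasing rate is 0.95.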

Figure 21 shows the LFR haplotyping algorithm. (a) Variation extraction: variations are extracted from the reads of the tagged aliquots. The 10-base Reed-Solomon codes enable tag recovery via error correction. (b) Heterozygous SNP-pair connectivity evaluation: the matrix of shared aliquots is computed for each heterozygous SNP pair within a certain neighborhood. Loop 1 runs over all heterozygous SNPs on one chromosome. Loop 2 runs over all heterozygous SNPs on the chromosome that lie in the neighborhood of the Loop 1 heterozygous SNP. This neighborhood is constrained by the expected number of heterozygous SNPs and the expected fragment length. (c) Graph generation: an undirected graph is generated in which nodes correspond to heterozygous SNPs and connections correspond to the orientation and strength of the best hypothesis for the relationship between those SNPs. (As used herein, a "node" is a datum [data item or data object] that can have one or more values representing a base call or other sequence variation, such as a het or an indel, in a polynucleotide sequence.) The orientation is binary. Figure 21 depicts a flipped and an unflipped relationship between heterozygous SNP pairs, respectively. The strength is defined by applying fuzzy-logic operations to the elements of the shared-aliquot matrix. (d) Graph optimization: the graph is optimized via a minimum spanning tree operation. (e) Contig generation: each subtree is reduced to a contig by keeping the first heterozygous SNP unchanged and flipping or not flipping the other heterozygous SNPs on the subtree based on their paths to the first heterozygous SNP. The designation of Parent 1 (P1) and Parent 2 (P2) for each contig is arbitrary. Gaps in the chromosome-wide tree define the boundaries of the different subtrees/contigs on that chromosome. (f) Mapping LFR contigs to parental chromosomes: using parental information, a Mom or Dad label is placed on the P1 and P2 haplotypes of each contig.
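Steps (b) through (e) can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the algorithm as implemented: the fuzzy-logic strength is replaced by the absolute difference between "unflipped" and "flipped" shared-well counts, the neighborhood is a fixed number of following SNPs (`max_gap`), and the spanning tree is built with Kruskal's method by taking the strongest edges first. All names are hypothetical.

```python
def phase_snps(wells_by_allele, max_gap=2):
    """Group heterozygous SNPs into contigs and orient their alleles.

    wells_by_allele[i] = (wells containing allele 0 of SNP i,
                          wells containing allele 1 of SNP i), as sets.
    Returns a list of contigs; each contig maps SNP index -> 0 (same
    orientation as the contig's first SNP) or 1 (flipped).
    """
    n = len(wells_by_allele)

    # (b)+(c): shared-well matrix and edge orientation/strength per pair.
    edges = []
    for i in range(n):
        for j in range(i + 1, min(n, i + 1 + max_gap)):
            a, b = wells_by_allele[i], wells_by_allele[j]
            m = [[len(a[x] & b[y]) for y in (0, 1)] for x in (0, 1)]
            unflipped = m[0][0] + m[1][1]
            flipped = m[0][1] + m[1][0]
            strength = abs(unflipped - flipped)  # stand-in for fuzzy logic
            if strength > 0:
                edges.append((strength, i, j, flipped > unflipped))

    # (d): spanning forest that keeps the strongest edges (Kruskal).
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = {i: [] for i in range(n)}
    for strength, i, j, flip in sorted(edges, reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree[i].append((j, flip))
            tree[j].append((i, flip))

    # (e): walk each subtree from its first SNP, accumulating flips.
    contigs, seen = [], set()
    for root in range(n):
        if root in seen:
            continue
        orientation = {root: 0}
        seen.add(root)
        stack = [root]
        while stack:
            u = stack.pop()
            for v, flip in tree[u]:
                if v not in seen:
                    seen.add(v)
                    orientation[v] = orientation[u] ^ int(flip)
                    stack.append(v)
        contigs.append(orientation)
    return contigs
```

SNPs that share no wells with any neighbor end up in their own contig, which corresponds to the gaps in the chromosome-wide tree mentioned in step (e). Step (f), labeling P1 and P2 with parental data, is outside this sketch.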

Figure 22 shows haplotype discordance between replicate LFR libraries. Two replicate libraries from each of samples NA12877 and NA19240 were compared at all shared phased heterozygous SNP loci. This is a comprehensive comparison, because most phased loci are shared between the two libraries.

Figure 23 shows the error reduction enabled by LFR. The heterozygous SNP calls of the standard library alone, and in combination with the LFR calls, were each phased independently by replicate LFR libraries. In general, LFR introduces approximately 10-fold more false positive variant calls. This most likely results from the stochastic incorporation of incorrect bases during phi29-based multiple displacement amplification. Importantly, if heterozygous SNP calls are required to be phased and to be found in three or more independent wells, the reduction in errors is dramatic, and the result is better than that of the standard library without error correction. LFR can also remove errors from the standard library, improving call accuracy approximately 10-fold.

Figure 24 shows LFR re-calling of no-call positions. To demonstrate the potential of LFR to rescue no-call positions, three example positions on chromosome 18 that were left uncalled (no-call) by standard software were selected. By phasing them with a C/T heterozygous SNP that is part of an LFR contig, these positions can be partially or fully called. The distribution of shared wells (wells having at least one read for each of two bases in a pair; there are 16 pairs of bases for a pair of assessed loci) allows the three N/N positions to be re-called as A/N, C/C, and T/C and defines C-A-C-T and T-N-C-C as the haplotypes. Using well information allows LFR to accurately call an allele with as few as 2-3 reads if they are found in 2-3 expected wells, about three-fold fewer than without well information.

Figure 25 shows the number of genes with multiple detrimental variations in each analyzed sample.

Figure 26 shows genes with allelic expression differences and TFBS-altering SNPs in NA20431. Out of a non-exhaustive list of genes that demonstrated significant allelic differences in expression, six genes were found to have SNPs that alter TFBSs and that correlate with the expression differences observed between alleles. All positions are given relative to NCBI build 37. "CDS" stands for coding sequence and "UTR3" for 3' untranslated region.

Detailed Description of the Invention

As used herein and in the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to "a polymerase" refers to one agent or a mixture of such agents, and reference to "the method" includes reference to equivalent steps and/or methods known to those skilled in the art, and so forth.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. All publications mentioned herein are incorporated herein by reference for the purpose of describing and disclosing the devices, compositions, formulations, and methodologies that are described in the publications and that might be used in connection with the presently described invention.

Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range, and any other stated or intervening value in that stated range, is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, which are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details. In other instances, well-known features and procedures well known to those skilled in the art have not been described in order to avoid obscuring the invention.

Although the present invention is described primarily with reference to specific embodiments, it is also envisioned that other embodiments will become apparent to those skilled in the art upon reading the present disclosure, and it is intended that such embodiments be contained within the present inventive methods.

测序系统和数据分析Sequencing systems and data analysis

在一些实施方案中，可以通过测序系统实施DNA样品(例如诸如代表全人基因组的样品)的测序。图1中显示了测序系统的两个例子。In some embodiments, sequencing of a DNA sample (e.g., such as a sample representing a whole human genome) can be performed by a sequencing system. Two examples of sequencing systems are shown in FIG1 .

图1A和1B是实例测序系统190的框图，所述测序系统190配置为实施依照本文中描述的实施方案的用于核酸序列分析的技术和/或方法。测序系统190可以包含多个子系统或者与多个子系统联系，所述子系统诸如例如一个或多个测序仪诸如测序仪191、一个或多个计算机系统诸如计算机系统197和一个或多个数据储存库诸如数据储存库195。在图1A中显示的实施方案中，系统190的多个子系统可以通过一个或多个网络193通信连接，所述网络193可以包括包交换或其它类型的网络基础设施装置(例如路由器、开关等)，其配置为促成远程系统间的信息交换。在图1B中显示的实施方案中，测序系统190是测序装置，其中多个子系统(例如诸如测序仪191、计算机系统197和可能数据储存库195)是通信和/或操作偶联并在测序装置内集成的组件。Figures 1A and 1B are block diagrams of an example sequencing system 190, which is configured to implement the techniques and/or methods for nucleic acid sequence analysis according to the embodiments described herein. The sequencing system 190 may include or be associated with multiple subsystems, such as, for example, one or more sequencers such as a sequencer 191, one or more computer systems such as a computer system 197, and one or more data repositories such as a data repository 195. In the embodiment shown in Figure 1A, the multiple subsystems of the system 190 may be communicatively connected via one or more networks 193, which may include packet switching or other types of network infrastructure devices (e.g., routers, switches, etc.) configured to facilitate information exchange between remote systems. In the embodiment shown in Figure 1B, the sequencing system 190 is a sequencing device, in which multiple subsystems (e.g., such as a sequencer 191, a computer system 197, and possibly a data repository 195) are components that are coupled for communication and/or operation and integrated within the sequencing device.

In some operational contexts, data repository 195 and/or computer system 197 of the embodiments shown in FIGS. 1A and 1B may be configured within a cloud computing environment 196. In a cloud computing environment, the storage devices comprising a data repository and/or the computing devices comprising a computer system may be allocated and instantiated for use as a utility and on demand; thus, the cloud computing environment provides as services the infrastructure (e.g., physical and virtual machines, raw/block storage, firewalls, load balancers, aggregators, networks, storage clusters, etc.), the platforms (e.g., a computing device and/or a solution stack that may include an operating system, a programming language execution environment, a database server, a web server, an application server, etc.), and the software (e.g., applications, application programming interfaces or APIs, etc.) necessary to perform any storage-related and/or computing tasks.

It is noted that in various embodiments, the techniques described herein can be performed by various systems and devices that include some or all of the above subsystems and components (e.g., sequencers, computer systems, and data repositories) in various configurations and form factors; thus, the example embodiments and configurations shown in FIGS. 1A and 1B are to be regarded in an illustrative rather than a restrictive sense.

Sequencer 191 is configured and operable to receive target nucleic acids 192 derived from fragments of a biological sample and to perform sequencing on the target nucleic acids. Any suitable machine that can perform sequencing may be used, and such a machine may use any of various sequencing techniques, including without limitation sequencing by hybridization, sequencing by ligation, sequencing by synthesis, single-molecule sequencing, optical sequence detection, electromagnetic sequence detection, voltage-change sequence detection, and any other now-known or later-developed technique suitable for generating sequencing reads from DNA. In various embodiments, a sequencer can sequence the target nucleic acids and generate sequencing reads that may or may not include gaps and that may or may not be mate-pair (or paired-end) reads. As shown in FIGS. 1A and 1B, sequencer 191 sequences target nucleic acids 192 and obtains sequencing reads 194, which are transmitted for (temporary and/or persistent) storage in one or more data repositories 195 and/or for processing by one or more computer systems 197.

Data repository 195 may be implemented on one or more storage devices (e.g., hard disk drives, optical disks, solid-state drives, etc.) that may be configured as an array of disks (e.g., a SCSI array), a storage cluster, or any other suitable storage device organization. The storage device(s) of a data repository can be configured as internal/integral components of system 190 or as external components attachable to system 190 (e.g., external hard drives or disk arrays), as shown in FIG. 1B, and/or may be communicatively interconnected in a suitable manner such as, for example, a grid, a storage cluster, a storage area network (SAN), and/or network-attached storage (NAS), as shown in FIG. 1A. In various embodiments and implementations, a data repository may be implemented on the storage devices as one or more file systems that store information as files, as one or more databases that store information in data records, and/or as any other suitable data storage organization.

Computer system 197 may include one or more computing devices that comprise general-purpose processors (e.g., central processing units, or CPUs), memory, and computer logic 199 which, along with configuration data and/or operating system (OS) software, can perform some or all of the techniques and methods described herein and/or can control the operation of sequencer 191. For example, any of the methods described herein (e.g., for error correction, haplotype phasing, etc.) can be performed in whole or in part by a computing device including a processor that can be configured to execute logic 199 for performing the various steps of the method. Further, although method steps may be presented as numbered steps, it is understood that steps of the methods described herein can be performed at the same time (e.g., in parallel by a cluster of computing devices) or in a different order. The functionality of computer logic 199 may be implemented as a single integrated module (e.g., in an integrated logic) or may be divided among two or more software modules that may provide some additional functionality.

In some embodiments, computer system 197 may be a single computing device. In other embodiments, computer system 197 may comprise multiple computing devices that are communicatively and/or operatively interconnected in a grid, a cluster, or a cloud computing environment. Such multiple computing devices may be configured in different form factors such as computing nodes, blades, or any other suitable hardware configuration. For these reasons, computer system 197 in FIGS. 1A and 1B is to be regarded in an illustrative rather than a restrictive sense.

FIG. 2 is a block diagram of an example computing device 200 that can be configured to execute instructions for performing various data-processing and/or control functionalities as part of a sequencer and/or a computer system.

In FIG. 2, computing device 200 comprises several components that are interconnected directly, or indirectly via one or more system buses such as bus 275. Such components may include, but are not limited to, keyboard 278, persistent storage device(s) 279 (e.g., fixed disks, solid-state disks, optical disks, and the like), and display adapter 282, to which one or more display devices (e.g., LCD monitors, flat-panel monitors, plasma screens, and the like) can be coupled. Peripherals and input/output (I/O) devices, which couple to I/O controller 271, can be connected to computing device 200 by any number of means known in the art, including but not limited to one or more serial ports, one or more parallel ports, and one or more universal serial buses (USBs). External interface(s) 281 (which may include a network interface card and/or serial ports) can be used to connect computing device 200 to a network (e.g., the Internet or a local area network (LAN)). External interface(s) 281 may also include a number of input interfaces that can receive information from various external devices such as, for example, a sequencer or any component thereof.

The interconnection via system bus 275 allows one or more processors (e.g., CPUs) 273 to communicate with each connected component and to execute (and/or control the execution of) instructions from system memory 272 and/or from storage device(s) 279, as well as the exchange of information between the various components. System memory 272 and/or storage device(s) 279 may be embodied as one or more computer-readable non-transitory storage media that store the sequences of instructions executed by processor(s) 273, as well as other data. Such computer-readable non-transitory storage media include, but are not limited to, random access memory (RAM), read-only memory (ROM), electromagnetic media (e.g., hard disk drives, solid-state drives, thumb drives, floppy disks, etc.), optical media such as compact disks (CDs) or digital versatile disks (DVDs), flash memory, and the like. Various data values and other structured or unstructured information can be output from one component or subsystem to another component or subsystem, can be presented to a user via display adapter 282 and a suitable display device, can be sent through external interface(s) 281 over a network to a remote device or a remote data repository, or can be (temporarily and/or permanently) stored on storage device(s) 279.

Any of the methods and functionalities performed by computing device 200 can be implemented in the form of logic, using hardware and/or computer software, in a modular or integrated manner. As used herein, "logic" refers to a set of instructions which, when executed by one or more processors (e.g., CPUs) of one or more computing devices, are operable to perform one or more functionalities and/or to return data in the form of one or more results or data that is used by other logic elements. In various embodiments and implementations, any given logic may be implemented as one or more software components that are executable by one or more processors (e.g., CPUs), as one or more hardware components such as Application-Specific Integrated Circuits (ASICs) and/or Field-Programmable Gate Arrays (FPGAs), or as any combination of one or more software components and one or more hardware components. The software component(s) of any particular logic may be implemented, without limitation, as a standalone software application, as a client in a client-server system, as a server in a client-server system, as one or more software modules, as one or more libraries of functions, and as one or more static and/or dynamically linked libraries.

During execution, the instructions of any particular logic may be embodied as one or more computer processes, threads, fibers, and any other suitable run-time entities that can be instantiated on the hardware of one or more computing devices and can be allocated computing resources that may include, without limitation, memory, CPU time, storage space, and network bandwidth.

Techniques and algorithms for the LFR process

Base calling

The overall method for sequencing target nucleic acids using the compositions and methods of the present invention is described herein and, for example, in U.S. Patent Application Publication No. 2010/0105052-A1; published patent application numbers WO2007120208, WO2006073504, WO2007133831, and US2007099208; and U.S. Patent Application Nos. 11/679,124; 11/981,761; 11/981,661; 11/981,605; 11/981,793; 11/981,804; 11/451,691; 11/981,607; 11/981,767; 11/982,467; 11/451,692; 11/541,225; 11/927,356; 11/927,388; 11/938,096; 11/938,106; 10/547,214; 11/981,730; 11/981,685; 11/981,797; 11/934,695; 11/934,697; 11/934,703; 12/265,593; 11/938,213; 11/938,221; 12/325,922; 12/252,280; 12/266,385; 12/329,365; 12/335,168; 12/335,188; and 12/361,507, which are incorporated herein by reference in their entirety for all purposes. See also Drmanac et al., Science 327, 78-81, 2010. Long Fragment Read (LFR) methods have been disclosed in U.S. Patent Application Nos. 12/816,365, 12/329,365, 12/266,385, and 12/265,593 and in U.S. Patent Nos. 7,906,285, 7,901,891, and 7,709,197, which are hereby incorporated by reference in their entirety. Further details and improvements are provided herein.

In some embodiments, data extraction relies on two types of image data: bright-field images that demarcate the positions of all DNBs on a surface, and sets of fluorescence images acquired during each sequencing cycle. Data extraction software can be used to identify all objects in the bright-field images and then, for each such object, to compute an average fluorescence value for each sequencing cycle. For any given cycle there are four data points, corresponding to the four images taken at different wavelengths to query whether the base is A, G, C, or T. These raw data points (also referred to herein as "base calls") are consolidated, yielding a discontinuous sequencing read for each DNB.
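As a rough illustration of how four per-cycle intensity values might be consolidated into a read, the following minimal sketch calls the brightest channel in each cycle and emits a no-call when no channel clearly dominates. The function names, the A/C/G/T channel order, and the `min_ratio` threshold are hypothetical and are not taken from the disclosure, which goes on to describe a more elaborate probabilistic procedure.

```python
BASES = "ACGT"  # assumed channel order for this sketch


def call_cycle(intensities, min_ratio=1.5):
    """Return the brightest base, or 'N' when it does not clearly dominate.

    intensities: four average fluorescence values, one per base in BASES.
    min_ratio:   hypothetical threshold on brightest / second-brightest.
    """
    ranked = sorted(zip(intensities, BASES), reverse=True)
    (top, base), (second, _) = ranked[0], ranked[1]
    if top <= 0 or (second > 0 and top / second < min_ratio):
        return "N"
    return base


def call_read(cycles):
    """Concatenate the per-cycle calls of one DNB into a read string."""
    return "".join(call_cycle(c) for c in cycles)
```

A no-call symbol is used here so that ambiguous cycles remain visible to downstream mapping rather than being silently dropped.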

A computing device can assemble the population of identified bases to provide sequence information for the target nucleic acid and/or to identify the presence of particular sequences in the target nucleic acid. For example, the computing device may assemble the population of identified bases in accordance with the techniques and algorithms described herein by executing various logic; an example of such logic is software code written in any suitable programming language such as Java, C++, Perl, Python, and any other suitable conventional and/or object-oriented programming language.

When executed in the form of one or more computer processes, such logic may read, write, and/or otherwise process structured and unstructured data that may be stored in various structures on persistent storage and/or in volatile memory; examples of such storage structures include, without limitation, files, tables, database records, arrays, lists, vectors, variables, memory and/or processor registers, persistent and/or in-memory data objects instantiated from object-oriented classes, and any other suitable data structures. In some embodiments, the identified bases are assembled into a complete sequence through alignment of overlapping sequences obtained from multiple sequencing cycles performed on multiple DNBs. As used herein, the term "complete sequence" refers to the sequence of partial or whole genomes as well as of partial or whole target nucleic acids. In further embodiments, assembly methods performed by one or more computing devices or their computer logic use algorithms that can "piece together" overlapping sequences to provide a complete sequence. In still further embodiments, reference tables are used to assist in assembling the identified sequences into a complete sequence. A reference table may be compiled using existing sequencing data for the organism of choice. For example, human genome data can be accessed through the National Center for Biotechnology Information at ftp.ncbi.nih.gov/refseq/release, or through the J. Craig Venter Institute at www.jcvi.org/researchhuref/. All or a subset of human genome information can be used to create a reference table for particular sequencing queries. In addition, specific reference tables can be constructed from empirical data derived from specific populations, including genetic sequence from humans of a specific ethnicity, geographic heritage, or religiously or culturally defined population, because variation within the human genome may skew the reference data depending on the origin of the information it contains.

Exemplary methods for calling variations in a polynucleotide sequence compared to a reference polynucleotide sequence, and for polynucleotide sequence assembly (or reassembly), are provided, for example, in U.S. Patent Publication No. 2011-0004413, entitled "Method and System for Calling Variations in a Sample Polynucleotide Sequence with Respect to a Reference Polynucleotide Sequence" (which is incorporated herein by reference for all purposes).

In any of the embodiments of the invention discussed herein, a population of nucleic acid templates and/or DNBs may comprise a number of target nucleic acids sufficient to substantially cover a whole genome or a whole target polynucleotide. As used herein, "substantially covers" means that the amount of nucleotides (i.e., target sequences) analyzed contains an equivalent of at least two copies of the target polynucleotide, or in another aspect at least 10 copies, or in another aspect at least 20 copies, or in another aspect at least 100 copies. Target polynucleotides may include DNA fragments, including genomic DNA fragments and cDNA fragments, as well as RNA fragments. Guidance for the step of reconstructing target polynucleotide sequences can be found in the following references, which are incorporated by reference: Lander et al., Genomics, 2: 231-239 (1988); Vingron et al., J. Mol. Biol., 235: 1-12 (1994); and like references.

In some embodiments, four images, one for each color dye, are generated for each queried position of a complex nucleic acid that is sequenced. The position of each spot in an image and the resulting intensities for each of the four colors are determined by adjusting for crosstalk between dyes and for background intensity. A quantitative model can be fit to the resulting four-dimensional dataset. A base is called for a given spot with a quality score that reflects how well the four intensities fit the model.

Base calling of the four images for each field can be performed in several steps by one or more computing devices or their computer logic. First, the image intensities are corrected for background using a modified morphological "image open" operation. Since the locations of the DNBs line up with the camera pixel locations, intensity extraction is done as a simple read-out of pixel intensities from the background-corrected images. These intensities are then corrected for several sources of both optical and biological signal crosstalk, as described below. The corrected intensities are then passed to a probabilistic model that ultimately produces, for each DNB, a set of four probabilities for the four possible base-call outcomes. Several metrics are then combined to compute the base-call score using a pre-fitted logistic regression.
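The background-correction step can be pictured with a plain grey-scale morphological opening (erosion followed by dilation), whose result is treated as the background estimate and subtracted from the image. This is only a sketch under stated assumptions: the disclosure refers to a "modified" opening whose details are not given, and the function names and the square 3x3 window are hypothetical.

```python
def _window_filter(img, size, fn):
    """Apply fn (min or max) over a size x size window around every pixel."""
    h, w = len(img), len(img[0])
    r = size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = fn(window)
    return out


def subtract_background(img, size=3):
    """Estimate background by opening (erode, then dilate) and subtract it."""
    opened = _window_filter(_window_filter(img, size, min), size, max)
    return [[p - b for p, b in zip(row, bg_row)]
            for row, bg_row in zip(img, opened)]
```

The window must be larger than a single spot for the opening to remove spots from the background estimate; in this sketch a one-pixel spot and a 3x3 window satisfy that condition.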

Intensity correction: Several sources of biological and optical crosstalk are corrected using a linear regression model implemented as computer logic that is executed by one or more computing devices. Linear regression is preferred over deconvolution methods, which are computationally more expensive and produce results of similar quality. The sources of optical crosstalk include filter band overlaps between the four fluorescent dye spectra, and lateral crosstalk between neighboring DNBs due to light diffraction at their close proximity. The biological sources of crosstalk include incomplete washing of the previous cycle, probe synthesis errors and probe "slipping" that contaminate signals of neighboring positions, and incomplete anchor extension when interrogating "outer" bases (those more distant from the anchors). Linear regression is used to determine the part of a DNB's intensities that can be predicted from the intensities of neighboring DNBs, from the intensities of the previous cycle, or from the intensities at other DNB positions. The part of the intensities that can be explained by these sources of crosstalk is then subtracted from the originally extracted intensities. To determine the regression coefficients, the intensities on the left side of the linear regression model need to consist primarily of "background" intensities only, i.e., the intensities of DNBs that would not be called for the given base for which the regression is being performed. This requires a pre-calling step that uses the original intensities.

Once the DNBs that do not have a particular base call (with reasonable confidence) have been selected, a computing device or its computer logic performs a simultaneous regression of the sources of crosstalk:

Neighbor-DNB crosstalk is corrected using the above regression. In addition, each DNB is corrected for its particular neighborhood using a linear model that involves all neighbors over all available DNB positions.
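The regression expression itself is not reproduced in this text. As a simplified sketch of the general idea only, the code below fits an ordinary least-squares line with a single crosstalk source (a neighboring DNB's intensity) on DNBs that were pre-called as background for a given base, and then subtracts the predicted crosstalk from a measured intensity. The single-predictor form and all names are assumptions; the disclosure describes a simultaneous regression over several sources.

```python
def fit_crosstalk(source, background):
    """Ordinary least squares of background intensity on one crosstalk source.

    source:     intensities of the crosstalk source (e.g. a neighbouring DNB).
    background: intensities of DNBs pre-called as NOT being the given base.
    Returns (slope, intercept).
    """
    n = len(source)
    mean_x = sum(source) / n
    mean_y = sum(background) / n
    sxx = sum((x - mean_x) ** 2 for x in source)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(source, background))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def remove_crosstalk(intensity, source_intensity, slope):
    """Subtract the part of the intensity explained by the crosstalk source."""
    return intensity - slope * source_intensity
```

Fitting only on background DNBs keeps genuine signal from inflating the estimated crosstalk coefficient, which is the reason the text gives for the pre-calling step.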

Base-call probabilities: Calling bases using maximum intensity does not account for the different shapes of the background intensity distributions of the four bases. To address such possible differences, a probabilistic model was developed based on empirical probability distributions of the background intensities. Once the intensities are corrected, a computing device or its computer logic pre-calls some DNBs using maximum intensities (DNBs that pass a certain confidence threshold) and uses these pre-called DNBs to derive the background intensity distributions (the distributions of intensities of DNBs for which a given base is not called). Having obtained such distributions, the computing device can compute for each DNB a tail probability under the distribution that describes the empirical probability of the intensity being a background intensity. Thus, for each DNB and each of the four intensities, the computing device or its logic can obtain and store the probability of its being background. The computing device can then compute the probabilities of all possible base-call outcomes using these probabilities. The possible base-call outcomes also need to describe spots that are doubly or, in general, multiply occupied by DNBs, or that are not occupied by a DNB at all. Combining the computed probabilities with their prior probabilities (lower priors for multiply occupied or empty spots) gives rise to the probabilities of the 16 possible outcomes:

These 16 probabilities can then be combined to obtain a reduced set of four probabilities for the four possible base calls. That is:
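The expressions introduced by the two preceding paragraphs are not reproduced in this text. The following is only a hedged sketch of how such a calculation could be organized, not the formulas of the disclosure. It assumes that the 16 outcomes are the 16 subsets of {A, C, G, T} that may be "present" at a spot (empty, single, or multiple occupancy), that the four channels are independent, and that the reduction to four probabilities keeps the single-base outcomes and renormalizes. All names and prior values are hypothetical.

```python
from bisect import bisect_left
from itertools import product

BASES = "ACGT"


def tail_probability(background_sample, intensity):
    """Empirical P(background >= intensity), with add-one smoothing."""
    s = sorted(background_sample)
    n_ge = len(s) - bisect_left(s, intensity)
    return (n_ge + 1) / (len(s) + 1)


def outcome_probabilities(p_background, priors):
    """Probabilities of the 16 presence/absence patterns over A, C, G, T.

    p_background: dict base -> probability that its intensity is background.
    priors:       dict number of present bases (0..4) -> prior weight.
    """
    raw = {}
    for present in product((False, True), repeat=4):
        p = priors[sum(present)]
        for base, is_present in zip(BASES, present):
            p *= (1 - p_background[base]) if is_present else p_background[base]
        raw[present] = p
    total = sum(raw.values())
    return {pattern: p / total for pattern, p in raw.items()}


def reduce_to_bases(outcomes):
    """Assumed reduction: keep single-base outcomes and renormalise."""
    single = {b: outcomes[tuple(i == j for j in range(4))]
              for i, b in enumerate(BASES)}
    total = sum(single.values())
    return {b: p / total for b, p in single.items()}
```

The lower prior weights for empty and multiply occupied spots enter through `priors`, mirroring the statement in the text that such outcomes receive lower priors.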

Score computation: Logistic regression is used to derive the score computation formula. A computing device or its computer logic fits a logistic regression to the mapping outcomes of the base calls, using several metrics as inputs. The metrics include the probability ratio between the called base and the next-highest base, the intensity of the called base, an indicator variable for the identity of the called base, and metrics describing the overall clustering quality of the field. All metrics are transformed to be collinear with the log-odds ratio between concordant and discordant calls. The model is refined using cross-validation. The logit function with the final logistic regression coefficients is used to compute the scores in production.
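A pre-fitted logistic regression reduces to a weighted sum of the input metrics passed through the logistic function. The minimal sketch below shows that evaluation step only; the coefficient values, the metric ordering, and the function name are hypothetical, and fitting and cross-validation are not shown.

```python
import math


def base_call_score(metrics, coefficients, intercept):
    """Evaluate a pre-fitted logistic regression for one base call.

    Returns (log_odds, probability), where probability is the modelled
    chance that the call is concordant.
    """
    log_odds = intercept + sum(c * m for c, m in zip(coefficients, metrics))
    probability = 1.0 / (1.0 + math.exp(-log_odds))
    return log_odds, probability


# Hypothetical metrics: probability ratio, called-base intensity, field quality.
example = base_call_score([2.0, 0.5, 1.0], [0.8, 1.2, 0.3], -1.0)
```

Returning both the log-odds and the probability leaves open which of the two is reported as the score, since the text does not specify the scale.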

Mapping and assembly

In further embodiments, read data is encoded in a compact binary format and includes both the called base and a quality score. The quality score correlates with base accuracy. Analysis software logic, including sequence assembly software, can use the score to determine the contribution of evidence from individual bases within reads.

Reads may be "gapped" due to the DNB structure. Gap sizes vary (usually by +/- 1 base) due to the variability inherent in enzyme digestion. Due to the random-access nature of cPAL, reads may occasionally have an unread base (a "no-call") in an otherwise high-quality DNB. Reads are mate-paired.

Mapping software logic capable of aligning read data to a reference sequence can be used to map the data generated by the sequencing methods described herein. When executed by one or more computing devices, such mapping logic is generally tolerant of small variations from the reference sequence, such as those caused by individual genomic variation, read errors, or unread bases. This property often allows direct reconstruction of SNPs. To support assembly of larger variations, including large-scale structural changes or regions of dense variation, each arm of a DNB can be mapped separately, with mate-pairing constraints applied after alignment.
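The two ideas in this paragraph, tolerance of small differences and separate mapping of each arm followed by a mate-pairing constraint, can be sketched with a brute-force matcher. This is an illustrative toy under stated assumptions, not the mapping logic of the disclosure: it scans every reference offset, treats 'N' as a no-call that matches anything, and uses hypothetical parameter names for the allowed mismatches and mate gap.

```python
def map_arm(arm, reference, max_mismatches=1):
    """Return all reference offsets where the arm matches within tolerance.

    'N' in the arm is treated as an unread base and never counts as a mismatch.
    """
    hits = []
    for pos in range(len(reference) - len(arm) + 1):
        mismatches = sum(1 for a, r in zip(arm, reference[pos:pos + len(arm)])
                         if a != "N" and a != r)
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits


def pair_arms(left_hits, right_hits, left_len, min_gap, max_gap):
    """Keep (left, right) placements whose mate gap lies in the allowed range."""
    return [(l, r) for l in left_hits for r in right_hits
            if min_gap <= r - (l + left_len) <= max_gap]
```

Mapping the arms independently lets one arm anchor a placement even when the other arm falls in a region of dense variation, and the gap filter then discards placements that are inconsistent with the expected mate distance.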

As used herein, the term "sequence variant" or simply "variant" includes any variant, including but not limited to a substitution or replacement of one or more bases; an insertion or deletion of one or more bases (also referred to as an "indel"); inversion; conversion; duplication, or copy number variation (CNV); trinucleotide repeat expansion; structural variation (SV; e.g., intrachromosomal or interchromosomal rearrangement, e.g., a translocation); and so on. In a diploid genome, a "heterozygosity" or "het" is two different alleles of a particular gene in a gene pair. The two alleles may be different mutants or a wild-type allele paired with a mutant. The present methods can also be used in the analysis of non-diploid organisms, whether such organisms are haploid/monoploid (N = 1, where N = haploid number of chromosomes) or polyploid or aneuploid.

In some embodiments, assembly of sequence reads can use software logic that supports the DNB read structure (mate-paired, gapped reads with non-called bases) to generate a diploid genome assembly that can, in some embodiments, make use of the sequence information generated by the LFR methods of the present invention for phasing heterozygous sites.

The methods of the present invention can be used to reconstruct novel segments that are not present in a reference sequence. In some embodiments, algorithms that use a combination of evidential (Bayesian) reasoning and de Bruijn graph-based algorithms may be used. In some embodiments, statistical models empirically calibrated to each dataset can be used, allowing all read data to be used without pre-filtering or data trimming. Large-scale structural variations (including without limitation deletions, translocations, and the like) and copy number variations can also be detected by making use of mate-paired reads.
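For readers unfamiliar with de Bruijn graphs, the sketch below shows the basic construction: every k-mer in the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and a sequence is recovered by following unambiguous edges. This is a generic textbook illustration with hypothetical function names; it omits the Bayesian evidence weighting, error handling, and gapped-read support that the disclosure refers to.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: (k-1)-mer prefix -> set of (k-1)-mer suffixes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Extend from start while exactly one unvisited successor exists."""
    sequence, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle rather than loop forever
            break
        sequence += nxt[-1]
        seen.add(nxt)
        node = nxt
    return sequence
```

Branch points in the graph, where a node has more than one successor, are where a real assembler would weigh the read evidence for each alternative.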

Phasing LFR data

Figure 3 depicts the main steps in phasing LFR data. These steps are as follows:

(1) Graph construction using LFR data: One or more computing devices or computer logic thereof generates an undirected graph in which the vertices represent heterozygous SNPs and the edges represent the connections between those heterozygous SNPs. Each edge is composed of an orientation and a connection strength. The one or more computing devices may store such a graph in storage structures including, but not limited to, files, tables, database records, arrays, lists, vectors, variables, memory and/or processor registers, persistent and/or in-memory data objects instantiated from object-oriented classes, and any other suitable transient and/or persistent data structures.

(2) Graph construction using mate pair data: Step 2 is similar to step 1, except that the connections are made based on mate pair data rather than LFR data. For a connection to be made, a DNB must be found that contains the two heterozygous SNPs of interest in the same read (same arm or mate arm).

(3) Graph combination: A computing device or computer logic thereof represents each of the above graphs as an N×N sparse matrix, where N is the number of candidate heterozygous SNPs on the chromosome. Within each of the above methods, two nodes can have only one connection. When the two methods are combined, two nodes can have up to two connections. Therefore, the computing device or computer logic thereof may use a selection algorithm to select one connection as the connection of choice. For these studies, the quality of the mate pair data was found to be significantly inferior to that of the LFR data. Therefore, only the LFR-derived connections were used.

(4) Graph trimming: A series of heuristics is devised and applied by a computing device to the stored graph data in order to remove some of the erroneous connections. More precisely, a node must satisfy the condition of at least two connections in one direction and one connection in the other direction; otherwise, it is eliminated.

(5) Graph optimization: A computing device or computer logic thereof optimizes the graph by generating a minimum spanning tree (MST). The power function, which serves as the edge weight to be minimized, is set to -|strength|. During this process, lower-strength edges are eliminated where possible because they compete with stronger paths. The MST therefore provides a natural selection of the strongest and most reliable connections.

(6) Contig building: Once the minimum spanning tree is generated and/or stored in a computer-readable medium, a computing device or logic thereof can re-orient all the nodes while holding one node (here, the first node) constant. This first node is the anchor node. For each node, the computing device then finds the path to the anchor node. The orientation of the node under test is the aggregate of the orientations of the edges on that path.

(7) Universal phasing: After the above steps, a computing device or logic thereof phases each of the contigs built in the previous steps. The results of this part are referred to as pre-phased, as opposed to phased, to indicate that this is not the final phasing. Because the first node was chosen arbitrarily as the anchor node, the phasing of the whole contig is not necessarily in line with the parental chromosomes. For universal phasing, a few heterozygous SNPs on the contig for which trio information is available are used. These trio heterozygous SNPs are then used to identify the alignment of the contig. At the end of the universal phasing step, all contigs have been labeled properly and can therefore be considered a chromosome-wide contig.
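The graph optimization of step (5) can be illustrated with a minimal sketch. It assumes a hypothetical edge layout `(u, v, orientation, strength)` and uses Kruskal's algorithm with a union-find structure; the text only states that an MST is generated with weight -|strength|, so the choice of Kruskal's algorithm and all names here are assumptions made for illustration.

```python
def minimum_spanning_forest(num_nodes, edges):
    """Kruskal's algorithm over edges given as (u, v, orientation, strength).

    The edge weight is -|strength|, so the strongest links are considered
    first. Returns the kept edges (one tree per connected component).
    """
    parent = list(range(num_nodes))

    def find(x):
        # Find the component representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    # Sorting by -|strength| ascending visits the strongest edges first.
    for u, v, orientation, strength in sorted(edges, key=lambda e: -abs(e[3])):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so keep it
            parent[root_u] = root_v
            kept.append((u, v, orientation, strength))
    return kept


# Hypothetical example: three het SNPs, the weak 0-2 link is dropped.
example_edges = [(0, 1, +1, 0.9), (1, 2, -1, 0.8), (0, 2, +1, 0.2)]
example_tree = minimum_spanning_forest(3, example_edges)
```

In this toy example the weak link between nodes 0 and 2 loses to the stronger path through node 1, which is the competition effect described in step (5).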

Contig generation

To make contigs, a computing device or computer logic thereof tests two hypotheses for each heterozygous SNP pair: the forward orientation and the reverse orientation. A forward orientation means that the two heterozygous SNPs are connected in the same order in which they were originally listed (initially alphabetical). A reverse orientation means that the two heterozygous SNPs are connected in the reverse order of their original listing. Figure 4 depicts the pairwise analysis of nearby heterozygous SNPs, which involves assigning forward and reverse orientations to a heterozygous SNP pair.

Each orientation has a numerical support that indicates the validity of the corresponding hypothesis. This support is a function of the 16 cells of the connectivity matrix shown in Figure 5, which gives an example of selecting a hypothesis and assigning a score to it. To simplify the function, the 16 variables are reduced to 3: power1, power2, and impurity. Power1 and power2 are the two highest-value cells corresponding to each hypothesis. Impurity is the ratio of the sum of all other cells (those other than the two corresponding to the hypothesis) to the sum of all cells in the matrix. The selection between the two hypotheses is based on the sum of the corresponding cells: the hypothesis with the higher sum is the winning hypothesis. The calculations that follow are used only to assign the strength of that hypothesis. A strong hypothesis is one with high values for power1 and power2 and a low value for impurity.
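As a minimal sketch of the reduction described above, the following code derives the winning hypothesis, power1, power2, and impurity from a 4×4 connectivity matrix. The matrix layout (rows and columns indexed A, C, G, T), the function name, and the tie-breaking rule (ties go to the forward hypothesis) are assumptions made for illustration only.

```python
BASES = "ACGT"


def score_inputs(matrix, snp1, snp2):
    """matrix[i][j]: count of observations with base i at SNP 1 and base j at SNP 2.

    snp1 and snp2 are the two alleles of each het SNP, e.g. "AC" and "GT".
    Returns (winning hypothesis, power1, power2, impurity).
    """
    a1, a2 = (BASES.index(b) for b in sorted(snp1))  # alphabetical listing
    b1, b2 = (BASES.index(b) for b in sorted(snp2))
    forward = (matrix[a1][b1], matrix[a2][b2])  # first-with-first, second-with-second
    reverse = (matrix[a1][b2], matrix[a2][b1])  # first-with-second, second-with-first
    if sum(forward) >= sum(reverse):  # assumption: ties go to forward
        winner, cells = "forward", forward
    else:
        winner, cells = "reverse", reverse
    total = sum(sum(row) for row in matrix)
    power1, power2 = max(cells), min(cells)
    # Impurity: share of the matrix that lies outside the two winning cells.
    impurity = (total - sum(cells)) / total if total else 1.0
    return winner, power1, power2, impurity


# Hypothetical counts for an A/C het SNP paired with a G/T het SNP.
example = [[0] * 4 for _ in range(4)]
example[0][2], example[1][3] = 10, 8  # A-G and C-T support the forward hypothesis
example[0][3], example[1][2] = 1, 1   # A-T and C-G are noise
example_result = score_inputs(example, "AC", "GT")
```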

The three metrics power1, power2, and impurity are fed into a fuzzy inference system (Figure 6) to reduce their effects to a single value, the score, between 0 and 1 inclusive. The fuzzy inference system (FIS) is implemented as computer logic that can be executed by one or more computing devices.

The connectivity operation is performed for each heterozygous SNP pair that lies within a reasonable distance, up to the expected contig length (e.g., 20-50 Kb). Figure 6 shows graph construction, depicting some exemplary connectivities and strengths for three nearby heterozygous SNPs.

The rules of the fuzzy inference engine are defined as follows:

(1) If power1 is small and power2 is small, then the score is very small.

(2) If power1 is medium and power2 is small, then the score is small.

(3) If power1 is medium and power2 is medium, then the score is medium.

(4) If power1 is large and power2 is small, then the score is medium.

(5) If power1 is large and power2 is medium, then the score is large.

(6) If power1 is large and power2 is large, then the score is very large.

(7) If impurity is small, then the score is large.

(8) If impurity is medium, then the score is small.

(9) If impurity is large, then the score is very small.

The definitions of small, medium, and large differ for each variable and are governed by that variable's specific membership functions. After the fuzzy inference system (FIS) is exposed to each variable set, the contribution of the input set to the rules is propagated through the fuzzy logic system, and a single (defuzzified) number is generated at the output: the score. This score is bounded between 0 and 1, with 1 indicating the highest quality.
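The nine rules above can be sketched as a small rule-based scorer. The text does not specify the membership functions or the defuzzification method, so everything numeric here is an assumption: inputs are taken to be pre-normalized to [0, 1], memberships are triangular, "and" is the minimum, and the output is a weighted average of hypothetical output levels (a Sugeno-style simplification, not necessarily the system used in the described method).

```python
def tri(x, left, peak, right):
    """Triangular membership; a shoulder is flat when left == peak or peak == right."""
    if x <= left:
        return 1.0 if left == peak else 0.0
    if x >= right:
        return 1.0 if peak == right else 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fuzzify(x):
    # Assumed membership functions, shared by all three inputs for brevity.
    return {
        "small": tri(x, 0.0, 0.0, 0.5),
        "medium": tri(x, 0.0, 0.5, 1.0),
        "large": tri(x, 0.5, 1.0, 1.0),
    }


# Assumed numeric levels for the linguistic outputs.
LEVEL = {"very small": 0.0, "small": 0.25, "medium": 0.5, "large": 0.75, "very large": 1.0}

# Rules (1)-(6): (power1 term, power2 term, score term).
POWER_RULES = [
    ("small", "small", "very small"),
    ("medium", "small", "small"),
    ("medium", "medium", "medium"),
    ("large", "small", "medium"),
    ("large", "medium", "large"),
    ("large", "large", "very large"),
]
# Rules (7)-(9): (impurity term, score term).
IMPURITY_RULES = [("small", "large"), ("medium", "small"), ("large", "very small")]


def fis_score(power1, power2, impurity):
    """Combine the nine rules into one score in [0, 1]."""
    p1, p2, imp = fuzzify(power1), fuzzify(power2), fuzzify(impurity)
    fired = [(min(p1[a], p2[b]), LEVEL[out]) for a, b, out in POWER_RULES]
    fired += [(imp[a], LEVEL[out]) for a, out in IMPURITY_RULES]
    weight = sum(w for w, _ in fired)
    return sum(w * level for w, level in fired) / weight if weight else 0.0
```

Under these assumed levels, a pair with strong, clean support (`fis_score(1.0, 1.0, 0.0)`) scores higher than a middling pair, which in turn scores higher than a weak, impure one.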

After applying the FIS to each node pair, a computing device or computer logic thereof constructs the complete graph. Figure 7 shows an example of such a graph. The nodes are colored according to the orientation of the winning hypothesis. The strength of each connection is derived by applying the FIS to the heterozygous SNP pair of interest. Once the preliminary graph is constructed (top plot of Figure 7), the computing device or computer logic thereof optimizes the graph (bottom plot of Figure 7) and reduces it to a tree. This optimization is done by generating a minimum spanning tree (MST) from the original graph. The MST guarantees a unique path from each node to any other node.

Figure 7 shows graph optimization. In this application, the first node on each contig is used as the anchor node, and all other nodes are oriented relative to that node. Depending on its orientation, each hit either has to be flipped or not in order to match the orientation of the anchor node. Figure 8 shows the contig alignment process for the given example. At the end of this process, a phased contig is available.
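The orientation of every node relative to the anchor can be sketched as a breadth-first walk over the tree. This sketch assumes that edge orientations are encoded as +1 (forward) and -1 (reverse), so that the "aggregate" of the orientations along a path is their product; that encoding and the function name are illustrative assumptions.

```python
from collections import defaultdict, deque


def orient_nodes(tree_edges, anchor=0):
    """tree_edges: (u, v, orientation) with orientation +1 (forward) or -1 (reverse).

    Returns {node: +1 or -1}; -1 means the node's alleles must be flipped
    to match the anchor node.
    """
    adjacency = defaultdict(list)
    for u, v, orientation in tree_edges:
        adjacency[u].append((v, orientation))
        adjacency[v].append((u, orientation))

    relative = {anchor: 1}  # the anchor is held constant
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        for neighbor, orientation in adjacency[node]:
            if neighbor not in relative:
                # Multiply orientations along the unique tree path to the anchor.
                relative[neighbor] = relative[node] * orientation
                queue.append(neighbor)
    return relative


# Hypothetical tree: two reverse edges in a row cancel out.
example_orientation = orient_nodes([(0, 1, -1), (1, 2, -1), (2, 3, +1)])
```

Because a tree has exactly one path between any two nodes, the result does not depend on the traversal order.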

At this point in the phasing process, the two haplotypes are separated. Although it is known that one of these haplotypes comes from the mother and one from the father, it is not known which one comes from which parent. In the next step of phasing, a computing device or computer logic thereof attempts to assign the correct parental label (mother/father) to each haplotype. This process is referred to as universal phasing. To do so, the association of at least a few of the heterozygous SNPs (on the contig) with the parents must be known. This information can be obtained by performing trio (mother-father-child) phasing. Using the sequenced genomes of the trio, some loci with known parental associations are identified, more specifically where at least one parent is homozygous. The computing device or computer logic thereof then uses these associations to assign the correct parental label (mother/father) to the whole contig, that is, to perform parent-assisted universal phasing (Figure 9).

To ensure high accuracy, the following may be implemented: (1) when possible (e.g., in the case of NA19240), obtain the trio information from multiple sources (e.g., in-house and 1000 Genomes) and use a combination of such sources; (2) require each contig to include at least two known trio-phased loci; (3) eliminate contigs that have a series of trio mismatches in a row (indicating a segmental error); and (4) eliminate contigs that have a single trio mismatch at the end of the trio loci (indicating a potential segmental error).
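A minimal sketch of parental label assignment with the filters above might look as follows. The data layout (position-to-allele dictionaries) and the function name are hypothetical, and the rejection rule is deliberately stricter than the text: any contig whose trio loci disagree with each other is rejected, rather than only those with mismatch runs or a mismatch at the end of the trio loci.

```python
def label_contig(hap_a, trio_maternal, min_loci=2):
    """Assign a parental label to one pre-phased haplotype of a contig.

    hap_a: {position: allele} for one haplotype of the contig.
    trio_maternal: {position: allele} known from trio data to be maternal.
    Returns "mother" or "father" for hap_a, or None if the contig is rejected.
    """
    shared = sorted(set(hap_a) & set(trio_maternal))
    if len(shared) < min_loci:  # filter (2): need at least two trio-phased loci
        return None
    matches = [hap_a[p] == trio_maternal[p] for p in shared]
    if all(matches):
        return "mother"
    if not any(matches):  # at a het locus, the non-maternal allele is paternal
        return "father"
    return None  # mixed evidence: treated here as a possible segmental error
```

For example, a haplotype carrying `A` at position 100 and `T` at position 300 would be labeled "mother" if trio data mark those same alleles as maternal; the other haplotype of the contig then takes the opposite label.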

Figure 10 shows natural contig separations. Whether or not parental data are used, contigs often do not continue naturally beyond a certain point. The reasons for contig separation are: (1) greater than usual DNA fragmentation or a lack of amplification in certain regions, (2) low heterozygous SNP density, (3) poly-N sequence in the reference genome, and (4) DNA repeat regions (which are prone to mis-mapping).

Figure 11 shows universal phasing. One of the major advantages of universal phasing is the ability to obtain full chromosomal "contigs." This is possible because each contig (after universal phasing) carries haplotypes with the correct parental labels. Therefore, all contigs that carry the label "mother" can be placed on the same haplotype, and a similar operation can be performed for the father's contigs.

Another major advantage of the LFR process is the ability to dramatically increase the accuracy of heterozygous SNP calling. Figure 12 shows two examples of error detection resulting from use of the LFR process. The first example is shown in Figure 12 (left), in which the connectivity matrix does not support any of the expected hypotheses. This indicates that one of the heterozygous SNPs is not really a heterozygous SNP. In this example, the A/C heterozygous SNP is in reality a homozygous locus (A/A) that was mislabeled as a heterozygous locus by the assembler. Such an error can be identified and either eliminated or (as in this case) corrected. The second example is shown in Figure 13 (right), in which the connectivity matrix supports both hypotheses at the same time. This is a sign that the heterozygous SNP calls are not real.

A "healthy" heterozygous SNP connectivity matrix is one that has only two high cells (at the expected heterozygous SNP positions, i.e., not on a straight line). All other possibilities point to potential problems and can either be eliminated or used to make alternate base calls for the loci of interest.
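The "healthy" matrix test can be sketched as a simple check. This sketch reads "not on a straight line" as "the two high cells share neither a row nor a column" and uses a fixed, hypothetical threshold `high` to decide which cells count as high; both choices are assumptions, and the sketch does not verify that the high cells sit at the called alleles.

```python
def classify_matrix(matrix, high=5):
    """Classify a 4x4 connectivity matrix of counts as "healthy" or "suspect".

    high: assumed minimum count for a cell to be considered high.
    """
    cells = [(i, j) for i in range(4) for j in range(4) if matrix[i][j] >= high]
    if len(cells) == 2:
        (row1, col1), (row2, col2) = cells
        # Two distinct rows and two distinct columns: both loci look heterozygous.
        if row1 != row2 and col1 != col2:
            return "healthy"
    return "suspect"
```

A matrix whose two high cells lie in the same row would be flagged as suspect, which corresponds to the case where one locus is actually homozygous.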

Another advantage of the LFR process is the ability to call heterozygous SNPs with weaker support (e.g., where DNBs were hard to map because of bias or the mismatch rate). Because the LFR process places an extra constraint on heterozygous SNPs, the threshold that a heterozygous SNP call requires in a non-LFR assembler can be lowered. Figure 13 shows an example of this case, in which a confident heterozygous SNP call can be made despite a small number of reads. In Figure 13 (right), under normal circumstances the low number of supporting reads would prevent any assembler from confidently calling the corresponding heterozygous SNPs. However, because the connectivity matrix is "clean," heterozygous SNP calls can be assigned to these loci with more confidence.

Annotating SNPs in splice sites

Introns in transcribed RNAs need to be spliced out before they become mRNA. The information for splicing is embedded within the sequences of these RNAs and is consensus based. Mutations in the splice site consensus sequence are the cause of many human diseases (Faustino and Cooper, Genes Dev. 17:419-437, 2003). Most splice sites conform to a simple consensus at fixed positions around an exon. In this regard, a program was developed to annotate splice site mutations. In this program, consensus splice position models (www.life.umd.edu/labs/mount/RNAinfo) are used. A look-up is performed for the patterns CAG|G in the 5'-end region of an exon ("|" denotes the beginning of the exon) and MAG|GTRAG in the 3'-end region of the same exon ("|" denotes the end of the exon). Here, M = {A, C} and R = {A, G}. Further, the splicing consensus positions are classified into two types: type I, where consensus to the model is required 100% of the time; and type II, where consensus to the model is preserved in more than 50% of cases. Presumably, a SNP mutation in a type I position will cause splicing to be missed, whereas a SNP in a type II position will only decrease the efficiency of the splicing event.

The program logic for annotating splice site mutations comprises two parts. In part 1, a file containing the model position sequences from the input reference genome is generated. In part 2, the SNPs from a sequencing project are compared with these model position sequences, and any type I and type II mutations are reported. The program logic is exon-centric rather than intron-centric (for convenience in parsing the genome). For a given exon, at its 5' end we look for the consensus "cAGg" (for positions -3, -2, -1, 0, where 0 is the beginning of the exon). Capital letters denote type I positions, and lower-case letters denote type II positions. At the 3' end of the exon, a look-up is performed for the consensus "magGTrag" (for positions -3, -2, -1, 0, 1, 2, 3, 4). Exons in the genome release that do not conform to these requirements are simply ignored (about 5% of all cases). These exons fall into other minor classes of splice site consensus and are not investigated by the program logic. Any SNP from the sequenced genome is compared with the model sequence at these genomic positions. Any mismatch at a type I position is reported. A mismatch at a type II position is reported if the mutation departs from the consensus.
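The comparison of a SNP against the model sequences can be sketched as follows. The two model strings and their position ranges are taken from the text; the function name, its arguments, and the per-site handling of non-conforming reference bases are illustrative assumptions (the text ignores non-conforming exons as a whole).

```python
# Allowed bases per model symbol; M and R follow the definitions in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "AC", "R": "AG"}

FIVE_PRIME = "cAGg"       # positions -3..0, where 0 is the first base of the exon
THREE_PRIME = "magGTrag"  # positions -3..4 around the end of the exon


def annotate_splice_snp(model, first_position, position, ref_base, alt_base):
    """Return "type I", "type II", or None for a SNP at a model position.

    model: FIVE_PRIME or THREE_PRIME; first_position: position of model[0] (-3 here).
    Upper-case model symbols are type I positions, lower-case are type II.
    """
    symbol = model[position - first_position]
    allowed = IUPAC[symbol.upper()]
    if ref_base not in allowed:
        return None  # reference does not conform to the model: not investigated
    if alt_base in allowed:
        return None  # the SNP stays within the consensus
    return "type I" if symbol.isupper() else "type II"
```

For instance, changing the `A` at position -2 of the 5' model to `G` would be reported as a type I mutation, while an A-to-G change at position 2 of the 3' model stays within R = {A, G} and is not reported.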

The above program logic detects the majority of bad splice site mutations. The bad SNPs that are reported are definitely problematic. However, there are many other bad SNPs that cause splicing problems and are not detected by this program. For example, many introns within the human genome do not conform to the consensus mentioned above. In addition, mutations at branch points in the middle of an intron may also cause splicing problems. These splice site mutations are not reported.

Annotation of SNPs affecting transcription factor binding sites (TFBS). JASPAR models are used to find TFBSs in the released human genome sequences (either build 36 or build 37). JASPAR Core is a collection of 130 TFBS positional frequency data sets for vertebrates, modeled as matrices (Bryne et al., Nucl. Acids Res. 36:D102-D106, 2008; Sandelin et al., Nucl. Acids Res. 32:D91-D94, 2004). These models were downloaded from the JASPAR website (http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?rm=browse&db=core&tax_group=vertebrates). The models are converted into position weight matrices (PWMs) using the following formula: $w_i = \log_2\left[\dfrac{f_i + p\sqrt{N_i}}{\left(N_i + \sqrt{N_i}\right)p}\right]$, where $f_i$ is the observed frequency of the specific base at position $i$; $N_i$ is the total number of observations at that position; and $p$ is the background frequency of the current nucleotide, which defaults to 0.25 (bogdan.org.ua/2006/09/11/position-frequency-matrix-to-position-weight-matrix-pfm2pwm.html; Wasserman and Sandelin, Nature Reviews Genetics 5:276-287, 2004).

A specific program, Mast (meme.sdsc.edu/meme/mast-intro.html), is used to search sequence segments within the genome for TFBS sites. A program was run to extract TFBS sites in the reference genome. The steps are outlined as follows: (i) For each gene with an mRNA, extract the putative TFBS-containing region [-5000, 1000] from the genome, where 0 is the mRNA start position. (ii) Run a Mast search of the putative TFBS-containing sequences against all PWM models. (iii) Select the hits above a given threshold. (iv) For regions with multiple or overlapping hits, select only one hit, namely the one with the highest Mast search score.
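The frequency-to-weight conversion given by the formula above can be sketched directly. The data layout (one dictionary of base counts per motif position) and the function name are assumptions; the sketch also assumes every position has at least one observation.

```python
import math


def pfm_to_pwm(pfm, background=0.25):
    """Convert a position frequency matrix into a position weight matrix.

    pfm: list of columns, each a dict mapping a base to its count f_i.
    Applies w_i = log2[((f_i + p*sqrt(N_i)) / (N_i + sqrt(N_i))) / p].
    """
    pwm = []
    for column in pfm:
        n = sum(column.values())  # N_i: total observations at this position
        root = math.sqrt(n)
        pwm.append({
            base: math.log2((f + background * root) / (n + root) / background)
            for base, f in column.items()
        })
    return pwm


# Hypothetical two-position motif: a fixed A followed by an uninformative position.
example_pwm = pfm_to_pwm([
    {"A": 4, "C": 0, "G": 0, "T": 0},
    {"A": 1, "C": 1, "G": 1, "T": 1},
])
```

The square-root term acts as a pseudocount, so a base that was never observed still receives a finite (negative) weight, and a position with uniform counts contributes a weight of zero.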

With the TFBS model hits from the reference genome generated and/or stored in a suitable computer-readable medium, a computing device or computer logic thereof can identify SNPs that are located within the hit regions. These SNPs affect the model and change the hit score. A second program was written to compute such changes in the hit score: the segment containing the SNP is run through the PWM model twice, once with the reference sequence and a second time with the SNP substitution. A SNP that causes the segment hit score to drop by more than 3 is identified as a bad SNP.
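The two-pass scoring described above can be sketched as follows. The PWM layout (one dictionary of base weights per position), the additive scoring, and the function names are assumptions made for illustration; the actual hit scores in the described method come from Mast.

```python
def pwm_score(pwm, segment):
    """Sum the PWM weights of the bases in a segment (one weight per position)."""
    return sum(column[base] for column, base in zip(pwm, segment))


def is_bad_snp(pwm, ref_segment, offset, alt_base, max_drop=3.0):
    """Score the segment twice (reference, then with the SNP) and compare.

    offset: 0-based position of the SNP within the segment.
    Returns True if the score drops by more than max_drop.
    """
    alt_segment = ref_segment[:offset] + alt_base + ref_segment[offset + 1:]
    return pwm_score(pwm, ref_segment) - pwm_score(pwm, alt_segment) > max_drop


# Hypothetical two-position model that strongly prefers "AC".
toy_pwm = [
    {"A": 1.5, "C": -2.0, "G": -2.0, "T": -2.0},
    {"A": -2.0, "C": 1.5, "G": -2.0, "T": -2.0},
]
```

With this toy model, replacing the leading `A` of `"AC"` with `G` lowers the score by 3.5, so the SNP would be flagged.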

Selection of genes with two bad SNPs. Genes with bad SNPs are classified into two categories: (1) those in which the transcribed AA sequence is affected; and (2) those in which a transcription binding site is affected. For effects on the AA sequence, the following SNP subcategories are included:

(1) Nonsense or nonstop variations. These mutations cause either a truncated protein or an extended protein. In either case, the function of the protein product is either completely lost or less efficient.

(2) Splice site variations. These mutations cause the splice site of an intron to be either destroyed (for positions that the model requires to be a certain nucleotide 100% of the time) or severely diminished (for sites that the model requires to be a certain nucleotide more than 50% of the time). In the latter case, the SNP mutates the splice site nucleotide to another nucleotide that is below 50% of consensus, as predicted by the splice site consensus sequence model. These mutations are likely to produce proteins that are truncated, are missing exons, or are severely reduced in quantity.

(3) Polyphen2 annotation of AA variations. For SNPs that change the amino acid sequence of a protein but not its length, Polyphen2 (Adzhubei et al., Nat. Methods 7:248-249, 2010) is used as the main annotation tool. Polyphen2 annotates each SNP as "benign", "unknown", "possibly damaging", or "probably damaging". Both "possibly damaging" and "probably damaging" are identified as bad SNPs. These category assignments are based on the structural predictions of the Polyphen2 software.

For transcription binding site mutations, 75% of the maximum score (maxScore) of each model is used as a screen for TFBS binding sites in the reference genome. Any model hit in the region that scores <=75% of maxScore is removed. For the remaining hits, if a SNP causes the hit score to drop by more than 3, it is considered a detrimental SNP.

Two classes of genes are reported. Class 1 genes are those that have at least 2 bad AA-affecting mutations. These mutations can all be on a single allele (Class 1.1) or be spread over 2 distinct alleles (Class 1.2). Class 2 genes are a superset of the Class 1 set. Class 2 genes are genes that contain at least 2 bad SNPs, irrespective of whether they are AA-affecting or TFBS-site-affecting, with the requirement that at least 1 SNP is AA-affecting. Class 2 genes are thus either those in Class 1 or those that have 1 detrimental AA mutation and 1 or more detrimental TFBS-affecting variations. Class 2.1 means that all of these detrimental mutations are on a single allele, whereas Class 2.2 means that the detrimental SNPs are on two distinct alleles.
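The class definitions above can be sketched as a small classifier. The input layout (a list of `(effect, allele)` pairs, with the allele taken from the phased haplotypes) and the function name are hypothetical. Because Class 2 is a superset of Class 1, this sketch returns the most specific label: a gene that qualifies for Class 1 is reported as 1.1 or 1.2 rather than as Class 2.

```python
def classify_gene(bad_snps):
    """Classify a gene from its bad SNPs.

    bad_snps: list of (effect, allele) with effect "AA" or "TFBS" and
    allele 1 or 2 (which phased haplotype carries the SNP).
    Returns "1.1", "1.2", "2.1", "2.2", or None if the gene is not reported.
    """
    aa_alleles = [allele for effect, allele in bad_snps if effect == "AA"]
    if len(aa_alleles) >= 2:
        # Class 1: at least two bad AA-affecting mutations.
        return "1.1" if len(set(aa_alleles)) == 1 else "1.2"
    if len(aa_alleles) == 1 and len(bad_snps) >= 2:
        # Class 2 only: one AA-affecting SNP plus one or more TFBS-affecting SNPs.
        alleles = {allele for _, allele in bad_snps}
        return "2.1" if len(alleles) == 1 else "2.2"
    return None
```

Classes 1.2 and 2.2 are the cases in which both copies of the gene carry a detrimental variant, which is why phased haplotypes are needed to tell them apart from 1.1 and 2.1.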

The foregoing techniques and algorithms are applicable to methods for sequencing complex nucleic acids, optionally in conjunction with LFR processing prior to sequencing (LFR in combination with sequencing may be referred to as "LFR sequencing"), which are described in detail below. Such methods for sequencing complex nucleic acids may be performed by one or more computing devices that execute computer logic. An example of such logic is software code written in any suitable programming language, such as Java, C++, Perl, Python, or any other suitable conventional and/or object-oriented programming language. When executed in the form of one or more computer processes, such logic may read, write, and/or otherwise process structured and unstructured data that may be stored in various structures on persistent storage and/or in volatile memory; examples of such storage structures include, but are not limited to, files, tables, database records, arrays, lists, vectors, variables, memory and/or processor registers, persistent and/or in-memory data objects instantiated from object-oriented classes, and any other suitable data structures.

Improving accuracy in long-read sequencing

In DNA sequencing using certain long-read technologies (e.g., nanopore sequencing), long read lengths (e.g., 10-100 kb) are available but generally come with high false negative and false positive rates. The final accuracy of sequence from such long-read technologies can be significantly enhanced using haplotype information (complete or partial phasing) according to the following general process.

First, a computing device or computer logic thereof aligns the reads to each other. A large number of heterozygous calls are expected to exist in the overlap. For example, if two to five 100 kb fragments overlap by a minimum of 10%, this results in an overlap of >10 kb, which roughly translates to 10 heterozygous loci. Alternatively, each long read is aligned to a reference genome, through which a multiple alignment of the reads is implicitly obtained.

Once the multiple read alignment has been achieved, the overlap region can be considered. The fact that the overlap may include a large number (e.g., N=10) of heterozygous loci can be leveraged to consider combinations of hets. This combinatorial modality results in a large space of possibilities for the haplotypes (4^N; if N=10, then 4^N = about 1 million). Of all these 4^N points in the N-dimensional space, only two points are expected to contain biologically viable information, namely those corresponding to the two haplotypes. In other words, there is a noise suppression ratio of 4^N/2 (here 1e6/2, or about 500,000). In reality, much of this 4^N space is degenerate, particularly because the sequences are already aligned (and are therefore similar), and also because each locus usually does not carry more than 2 possible bases (if it is a real het). Consequently, the lower bound for this space is actually 2^N (if N=10, then 2^N = about 1000). The noise suppression ratio may therefore be only 2^N/2 (here 1000/2 = 500), which is still quite impressive. As the number of false positives and false negatives grows, the size of the space expands from 2^N to 4^N, which in turn results in a higher noise suppression ratio. In other words, as the noise grows, it is automatically suppressed more. The output products are therefore expected to retain only a very small (and rather constant) amount of noise, almost independent of the input noise. (The tradeoff is the loss of yield in noisier conditions.) Of course, these suppression ratios change if (1) the errors are systematic (or there are other data idiosyncrasies), (2) the algorithms are not optimal, (3) the overlapping sections are shorter, or (4) the coverage redundancy is lower. N can be any integer greater than 1, such as 2, 3, 5, 10, or more.
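The arithmetic behind the two suppression ratios can be checked with a few lines. The variable names are illustrative, and the read length, overlap, and het density are the example values used in the surrounding text (100 kb reads, 10% overlap, about one het per kb).

```python
def noise_suppression(n_loci, alphabet):
    """Size of the haplotype space divided by the 2 true haplotypes."""
    return alphabet ** n_loci // 2


read_length_kb, overlap_fraction, hets_per_kb = 100, 0.10, 1
n = round(read_length_kb * overlap_fraction * hets_per_kb)  # 10 het loci in the overlap

upper = noise_suppression(n, 4)  # any of 4 bases at each locus: 4^10 / 2
lower = noise_suppression(n, 2)  # only 2 bases at each locus: 2^10 / 2
```

The exact values are 524,288 and 512, which the text rounds to about 500,000 and 500.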

The following approach can be used to improve the accuracy of long-read sequencing methods, which may have a large initial error rate.

First, a computing device or computer logic thereof aligns a few reads, for example 5 reads or more, such as 10-20 reads. Assuming that the reads are about 100 kb long and that the shared overlap is 10%, this results in a 10 kb overlap among the 5 reads. Also assume that there is a het every 1 kb. There would therefore be a total of 10 hets in this common region.

Next, the computing device or its computer logic fills in part (e.g., only the non-zero elements) or all of a matrix of the alpha^10 possibilities (where alpha is between 2 and 4) for the 10 candidate heterozygous loci above. In one implementation, only 2 of the alpha^10 cells of this matrix should be high density (e.g., as measured by a threshold, which may be predetermined or dynamic). These are the cells that correspond to the true heterozygous combinations, and they can be regarded as essentially noise-free centers. The remaining cells should mostly contain 0 and occasionally 1 member, especially when the errors are not systematic. If the errors are systematic, there may be a clustering event (e.g., a third cell with more than just 0 or 1 members), which makes the task more difficult. Even in this case, however, the cluster membership of the false cluster should be significantly weaker (e.g., as measured by an absolute or relative amount) than that of the two expected clusters. The tradeoff in this case is that the starting point should include more multiply aligned sequences, which relates directly to having longer reads or greater coverage redundancy.
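The matrix-filling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: each aligned read is reduced to a string holding one base per candidate heterozygous locus, only non-zero cells are stored (via a `Counter`), and the threshold is a fixed count. The names `dense_cells`, `patterns`, and `threshold` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def dense_cells(patterns: Iterable[str], threshold: int) -> List[Tuple[str, int]]:
    """Return the cells (allele patterns) whose read count meets the threshold.

    Each pattern is the string of bases one read shows at the candidate
    heterozygous loci. Only non-zero cells of the alpha^N matrix are stored.
    """
    counts = Counter(patterns)
    return [(pattern, n) for pattern, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    reads = ["ACGT"] * 6 + ["TGCA"] * 5 + ["ACGA", "TGCT"]  # two noisy reads
    print(dense_cells(reads, threshold=3))
    # [('ACGT', 6), ('TGCA', 5)]
```

With non-systematic errors, the noisy patterns each occupy a cell with a single member and fall below the threshold, leaving the two high-density cells.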

The above step assumes that two viable clusters are observed among the overlapping reads. With a large number of false positives, this will not be the case. In that situation, in the alpha-dimensional space, the two expected clusters become blurred: instead of each being a single point with high density, each is a blurred cluster of M points around the cell of interest, where the cells of interest are the noise-free centers at the middle of the clusters. This allows the clustering method to capture the location of the expected points even though the exact sequence is not represented in every read. A clustering event can also occur when the clusters are blurred (i.e., there may be more than two centers), but, in a manner similar to that described above, for a diploid organism a score (e.g., the total count over the cells of a cluster) can be used to distinguish a weaker cluster from the two true clusters. The two true clusters can be used to create contigs for multiple regions, as described herein, and the contigs can be matched into two groups to form haplotypes for a larger region of the complex nucleic acid.
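One way to picture the blurred-cluster case is a greedy grouping of allele patterns by Hamming distance, scoring each cluster by its total read count. This is only a sketch under assumptions: the greedy heuristic, the `radius` parameter, and the function names are illustrative choices, and the text does not specify a particular clustering algorithm.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of loci at which two equal-length allele patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def top_two_clusters(patterns: Iterable[str], radius: int = 1) -> List[Tuple[str, int]]:
    """Greedily cluster patterns and return the two highest-scoring centers.

    Patterns are visited from most to least frequent. A pattern within
    `radius` mismatches of its nearest center adds its count to that
    center's score; otherwise it starts a new cluster.
    """
    scores: Dict[str, int] = {}
    for pattern, n in Counter(patterns).most_common():
        nearest = min(scores, key=lambda c: hamming(c, pattern), default=None)
        if nearest is not None and hamming(nearest, pattern) <= radius:
            scores[nearest] += n
        else:
            scores[pattern] = n
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:2]


if __name__ == "__main__":
    reads = (["AAAAA"] * 4 + ["AAAAT"] * 2 + ["AAATA"]
             + ["CCCCC"] * 3 + ["CCCCG"] * 2 + ["GTGTG"])
    print(top_two_clusters(reads))
    # [('AAAAA', 7), ('CCCCC', 5)]
```

Here the weak third cluster (`GTGTG`, score 1) is separated from the two true clusters by its score, as the paragraph describes for a diploid organism.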

Finally, the computing device or its computer logic can use population-based (known) haplotypes to increase confidence and/or to provide additional guidance in finding the true clusters. One way to implement this approach is to assign a weight to each observed haplotype and a smaller, but non-zero, value to unobserved haplotypes. Doing so introduces a bias toward natural haplotypes that have already been observed in the population of interest.
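The weighting scheme can be illustrated with a small sketch. The weights (1.0 for haplotypes already observed in the population, 0.1 otherwise) and the names `weighted_scores`, `known_weight`, and `unknown_weight` are assumptions chosen for illustration; the text only requires that unobserved haplotypes receive a smaller, non-zero value.

```python
from typing import Dict, Iterable


def weighted_scores(cluster_counts: Dict[str, int],
                    population_haplotypes: Iterable[str],
                    known_weight: float = 1.0,
                    unknown_weight: float = 0.1) -> Dict[str, float]:
    """Scale each candidate haplotype's count by a population-based weight."""
    if not 0 < unknown_weight < known_weight:
        raise ValueError("unknown_weight must be non-zero and below known_weight")
    known = set(population_haplotypes)
    return {
        hap: n * (known_weight if hap in known else unknown_weight)
        for hap, n in cluster_counts.items()
    }


if __name__ == "__main__":
    counts = {"ACGT": 6, "TGCA": 5, "ACGA": 4}
    scores = weighted_scores(counts, population_haplotypes=["ACGT", "TGCA"])
    best_two = sorted(scores, key=scores.get, reverse=True)[:2]
    print(best_two)  # ['ACGT', 'TGCA']
```

Because the unobserved haplotype keeps a non-zero score, a strongly supported novel haplotype could still outrank a weakly supported known one.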

Using reads with tag sequence data containing uncorrected errors

As discussed herein, according to one embodiment of the invention, a sample of a complex nucleic acid is divided into a number of aliquots (e.g., wells in a multi-well plate), amplified, and fragmented. Aliquot-specific tags are then attached to the fragments in order to identify the aliquot from which a particular fragment of the complex nucleic acid originates. Optionally, the tags include an error-correction code, such as a Reed-Solomon error-correction (or error-detection) code. When a fragment is sequenced, both the tag and the fragment of the complex nucleic acid sequence are sequenced. If there is an error in the tag sequence, and it is impossible either to identify the aliquot from which the fragment originated or to correct the sequence using the error-correction code, the entire read may be discarded, leading to the loss of a large amount of sequence data. It should be noted that reads that include correct or corrected tag sequence data are high accuracy but low yield, whereas reads that include uncorrectable tag sequence data are low accuracy but high yield. Instead of being discarded, such sequence data are used in processes other than those that require the data to identify the aliquot of origin by relying on the association of a particular tag with a particular aliquot. Examples of processes that require reads with correct (or corrected) tag sequence data include, without limitation, sample or library multiplexing, phasing, error correction, or any other process that requires a correct (or corrected) tag sequence. Examples of processes that can employ reads with uncorrectable tag sequence data include any other process, including, without limitation, mapping, reference-based and local de novo assembly, and pool-based statistics (e.g., allele frequencies, locations of de novo mutations, etc.).
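The routing of reads described above can be sketched as a simple partition. The `TagStatus` enumeration, the `(read_id, status)` input format, and the function name are hypothetical; how a tag is decoded (e.g., with a Reed-Solomon decoder) is outside the scope of this sketch and is assumed to have happened upstream.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class TagStatus(Enum):
    CORRECT = "correct"
    CORRECTED = "corrected"
    UNCORRECTABLE = "uncorrectable"


def partition_reads(reads: Iterable[Tuple[str, TagStatus]]) -> Tuple[List[str], List[str]]:
    """Split reads into (tag_dependent, tag_independent_only) lists.

    Reads with a correct or corrected tag can feed tag-dependent processes
    (multiplexing, phasing, tag-based error correction). Reads with an
    uncorrectable tag are kept, but only for tag-independent processes
    such as mapping or pool-based statistics.
    """
    tag_dependent: List[str] = []
    tag_independent_only: List[str] = []
    for read_id, status in reads:
        if status is TagStatus.UNCORRECTABLE:
            tag_independent_only.append(read_id)
        else:
            tag_dependent.append(read_id)
    return tag_dependent, tag_independent_only


if __name__ == "__main__":
    reads = [("r1", TagStatus.CORRECT), ("r2", TagStatus.UNCORRECTABLE),
             ("r3", TagStatus.CORRECTED)]
    usable_for_phasing, mapping_only = partition_reads(reads)
    print(usable_for_phasing, mapping_only)  # ['r1', 'r3'] ['r2']
```

Tag-independent processes can use the union of both lists, which is where the yield gain comes from.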

Converting long reads into virtual LFR

Algorithms designed for LFR (including the phasing algorithm) can be used for long reads by assigning a random virtual tag (with a uniform distribution) to each long fragment (10-100 kb). The virtual tag has the benefit of enabling a truly uniform distribution for each code. LFR cannot achieve this level of uniformity because of differences in the pooling of the codes and differences in the decoding efficiency of the codes. A ratio of 3:1 (and up to 10:1) can easily be observed in the representation of any two codes in LFR, whereas the virtual LFR process results in a true 1:1 ratio between any two codes.
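A minimal sketch of virtual tag assignment follows. To obtain an exact 1:1 representation rather than one that is uniform only in expectation, this sketch shuffles a balanced list of tags; that choice, the number of tags, and the name `assign_virtual_tags` are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Dict, Optional, Sequence


def assign_virtual_tags(fragment_ids: Sequence[str], n_tags: int,
                        seed: Optional[int] = None) -> Dict[str, int]:
    """Assign each long fragment a random virtual tag in range(n_tags).

    A balanced tag list is shuffled, so tag counts differ by at most one
    (exactly equal when len(fragment_ids) is a multiple of n_tags).
    """
    if n_tags < 1:
        raise ValueError("n_tags must be positive")
    rng = random.Random(seed)
    tags = [i % n_tags for i in range(len(fragment_ids))]
    rng.shuffle(tags)
    return dict(zip(fragment_ids, tags))


if __name__ == "__main__":
    ids = [f"frag{i}" for i in range(3840)]
    tagged = assign_virtual_tags(ids, n_tags=384, seed=0)
    usage = Counter(tagged.values())
    print(min(usage.values()), max(usage.values()))  # 10 10
```

Drawing each tag independently with `rng.randrange(n_tags)` would also be uniform, but only on average; the balanced shuffle is used here because the text emphasizes a true 1:1 ratio.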

Methods for sequencing complex nucleic acids

Overview

According to one aspect of the invention, methods are provided for sequencing complex nucleic acids. According to certain embodiments of the invention, methods are provided for sequencing very small amounts of such complex nucleic acids (e.g., 1 pg to 10 ng). Even after amplification, such methods produce assembled sequences characterized by a high call rate and high accuracy. According to other embodiments, aliquoting is used to identify and eliminate errors in the sequencing of complex nucleic acids. According to another embodiment, LFR is used in combination with the sequencing of complex nucleic acids.

Unless otherwise indicated, the practice of the present invention may employ conventional techniques and descriptions of organic chemistry, polymer technology, molecular biology (including recombinant techniques), cell biology, biochemistry, and immunology, which are within the skill of the art. Such conventional techniques include polymer array synthesis, hybridization, ligation, and detection of hybridization using a label. Specific illustrations of suitable techniques can be had by reference to the examples below. However, other equivalent conventional procedures can, of course, also be used. Such conventional techniques and descriptions can be found in standard laboratory manuals such as Genome Analysis: A Laboratory Manual Series (Vols. I-IV), Using Antibodies: A Laboratory Manual, Cells: A Laboratory Manual, PCR Primer: A Laboratory Manual, and Molecular Cloning: A Laboratory Manual (all from Cold Spring Harbor Laboratory Press); Stryer, L. (1995) Biochemistry (4th Ed.), Freeman, New York; Gait, "Oligonucleotide Synthesis: A Practical Approach", 1984, IRL Press, London; Nelson and Cox (2000), Lehninger, Principles of Biochemistry, 3rd Ed., W.H. Freeman Pub., New York, N.Y.; and Berg et al. (2002) Biochemistry, 5th Ed., W.H. Freeman Pub., New York, N.Y., all of which are incorporated herein by reference in their entirety for all purposes.

General methods for sequencing a target nucleic acid using the compositions and methods of the invention are described herein and in, for example, U.S. Patent Publication Nos. 2010/0105052 and 2007/0099208 and U.S. Patent Application Nos. 11/679,124 (published as US 2009/0264299); 11/981,761 (US 2009/0155781); 11/981,661 (US 2009/0005252); 11/981,605 (US 2009/0011943); 11/981,793 (US 2009/0118488); 11/451,691 (US 2007/0099208); 11/981,607 (US 2008/0234136); 11/981,767 (US 2009/0137404); 11/982,467 (US 2009/0137414); 11/451,692 (US 2007/0072208); 11/541,225 (US 2010/0081128); 11/927,356 (US 2008/0318796); 11/927,388 (US 2009/0143235); 11/938,096 (US 2008/0213771); 11/938,106 (US 2008/0171331); 10/547,214 (US 2007/0037152); 11/981,730 (US 2009/0005259); 11/981,685 (US 2009/0036316); 11/981,797 (US 2009/0011416); 11/934,695 (US 2009/0075343); 11/934,697 (US 2009/0111705); 11/934,703 (US 2009/0111706); 12/265,593 (US 2009/0203551); 11/938,213 (US 2009/0105961); 11/938,221 (US 2008/0221832); 12/325,922 (US 2009/0318304); 12/252,280 (US 2009/0111115); 12/266,385 (US 2009/0176652); 12/335,168 (US 2009/0311691); 12/335,188 (US 2009/0176234); 12/361,507 (US 2009/0263802); 11/981,804 (US 2011/0004413); and 12/329,365; and in published international patent applications WO2007120208, WO2006073504, and WO2007133831, all of which are incorporated herein by reference in their entirety for all purposes. Exemplary methods for calling variations in a polynucleotide sequence compared to a reference polynucleotide sequence, and for polynucleotide sequence assembly (or reassembly), are provided, for example, in U.S. Patent Publication No. 2011/0004413 (App. No. 12/770,089), which is incorporated herein by reference in its entirety for all purposes. See also Drmanac et al., Science 327, 78-81, 2010. Co-pending related application No. 61/623,876, entitled "Identification Of DNA Fragments And Structural Variations", is also incorporated herein by reference in its entirety for all purposes.

This method includes extracting and fragmenting target nucleic acids from a sample. The fragmented nucleic acids are used to produce target nucleic acid templates, which generally include one or more adaptors. The target nucleic acid templates are subjected to amplification methods to form nucleic acid nanoballs, which are usually disposed on a surface. Sequencing applications are performed on the nucleic acid nanoballs of the invention, usually through sequencing-by-ligation techniques, including the combinatorial probe anchor ligation ("cPAL") method, which is described in further detail below. cPAL and other sequencing methods can also be used to detect specific sequences, such as single nucleotide polymorphisms ("SNPs"), in nucleic acid constructs of the invention (which include nucleic acid nanoballs as well as linear and circular nucleic acid templates). The above-referenced patent applications and the cited article by Drmanac et al. provide additional detailed information regarding, for example: preparation of nucleic acid templates, including adaptor design and insertion of adaptors into genomic DNA fragments to produce circular library constructs; amplification of such library constructs to produce DNA nanoballs (DNBs); production of arrays of DNBs on solid supports; cPAL sequencing; and so on, all of which are used in connection with the methods disclosed herein.

As used herein, the term "complex nucleic acid" refers to a large population of nonidentical nucleic acids or polynucleotides. In certain embodiments, the target nucleic acid is genomic DNA; exome DNA (a subset of whole-genome DNA enriched for transcribed sequences, which contains the set of exons in a genome); a transcriptome (i.e., the set of all mRNA transcripts produced in a cell or population of cells, or cDNA produced from such mRNA); a methylome (i.e., the population of methylated sites and the pattern of methylation in a genome); a microbiome; a mixture of genomes of different organisms; a mixture of genomes of different cell types of an organism; or another complex nucleic acid mixture comprising large numbers of different nucleic acid molecules (examples include, without limitation, a microbiome, a xenograft, and a solid tumor biopsy comprising both normal and tumor cells), including subsets of the aforementioned types of complex nucleic acids. In one embodiment, such a complex nucleic acid has a complete sequence comprising at least one gigabase (Gb) (a diploid human genome comprises approximately 6 Gb of sequence).

Nonlimiting examples of complex nucleic acids include "circulating nucleic acids" (CNA), which are nucleic acids that circulate in human blood or other body fluids (including, but not limited to, lymph, fluid, ascites, milk, urine, stool, and bronchial lavage) and that can be distinguished as either cell-free (CF) or cell-associated nucleic acids (reviewed in Pinzani et al., Methods 50:302-307, 2010), for example circulating fetal cells in the bloodstream of an expectant mother (see, e.g., Kavanagh et al., J. Chromatogr. B 878:1905-1911, 2010) or circulating tumor cells (CTC) from the bloodstream of a cancer patient (see, e.g., Allard et al., Clin. Cancer Res. 10:6897-6904, 2004). Another example is genomic DNA from a single cell or a small number of cells, such as, for example, from a biopsy (e.g., fetal cells biopsied from the trophectoderm of a blastocyst; cancer cells from a needle aspiration of a solid tumor; etc.). Another example is pathogens, e.g., bacterial cells, viruses, or other pathogens, in a tissue, in blood, or in other body fluids.

As used herein, the term "target nucleic acid" (or polynucleotide) or "nucleic acid of interest" refers to any nucleic acid (or polynucleotide) suitable for processing and sequencing by the methods described herein. The nucleic acid may be single-stranded or double-stranded and may include DNA, RNA, or other known nucleic acids. The target nucleic acids may be those of any organism, including but not limited to viruses, bacteria, yeast, plants, fish, reptiles, amphibians, birds, and mammals (including, without limitation, mice, rats, dogs, cats, goats, sheep, cattle, horses, pigs, rabbits, monkeys and other non-human primates, and humans). A target nucleic acid may be obtained from an individual or from multiple individuals (i.e., a population). A sample from which the nucleic acid is obtained may contain nucleic acids from a mixture of cells or even organisms, such as a human saliva sample that includes human cells and bacterial cells, or a mouse xenograft that includes mouse cells and cells from a transplanted human tumor.

Target nucleic acids may be unamplified or may be amplified by any suitable nucleic acid amplification method known in the art. Target nucleic acids may be purified according to methods known in the art to remove cellular and subcellular contaminants (lipids, proteins, carbohydrates, nucleic acids other than those to be sequenced, etc.), or they may be unpurified, i.e., include at least some cellular and subcellular contaminants, including without limitation intact cells that are disrupted to release their nucleic acids for processing and sequencing. Target nucleic acids can be obtained from any suitable sample using methods known in the art. Such samples include but are not limited to: tissues, isolated cells or cell cultures, and bodily fluids (including, but not limited to, blood, urine, serum, lymph, saliva, anal and vaginal secretions, perspiration, and semen); and air, agricultural, water, and soil samples. In one aspect, the nucleic acid constructs of the invention are formed from genomic DNA.

High coverage in shotgun sequencing is desirable because it can overcome errors in base calling and assembly. As used herein, for any given position in an assembled sequence, the term "sequence coverage redundancy", "sequence coverage", or simply "coverage" means the number of reads representing that position. It can be calculated from the length of the original genome (G), the number of reads (N), and the average read length (L) as N x L/G. Coverage can also be calculated directly by counting the bases at each reference position. For a whole-genome sequence, coverage is expressed as an average over all bases in the assembled sequence. Sequence coverage is thus the average number of times a base is read (as described above). It is often expressed as "fold coverage", for example "40-fold coverage", meaning that each base in the final assembled sequence is represented by an average of 40 reads.
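Both ways of computing coverage can be written down directly. The sketch below is illustrative only: the function names, the half-open `(start, end)` interval convention, and the example numbers (a 3 Gb reference and 100-base reads) are assumptions, not values taken from the text.

```python
from typing import Iterable, Tuple


def expected_coverage(n_reads: int, mean_read_length: float, genome_length: int) -> float:
    """Coverage estimated as N x L / G."""
    return n_reads * mean_read_length / genome_length


def mean_depth(read_intervals: Iterable[Tuple[int, int]], reference_length: int) -> float:
    """Coverage computed by counting bases at each reference position.

    Each read is a half-open interval [start, end) on the reference.
    """
    depth = [0] * reference_length
    for start, end in read_intervals:
        for pos in range(max(start, 0), min(end, reference_length)):
            depth[pos] += 1
    return sum(depth) / reference_length


if __name__ == "__main__":
    # 1.2 billion reads of 100 bases over a 3 Gb reference: 40-fold coverage.
    print(expected_coverage(1_200_000_000, 100, 3_000_000_000))  # 40.0
    # Three 10-base reads tiled over a 20-base reference.
    print(mean_depth([(0, 10), (5, 15), (10, 20)], 20))  # 1.5
```

The per-position version is quadratic in the worst case and is meant only to make the definition concrete; production tools typically compute depth from sorted alignments.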

As used herein, the term "call rate" means the percentage of bases of the complex nucleic acid that are fully called, commonly with reference to a suitable reference sequence such as, for example, a reference genome. Thus, for a whole human genome, the "genome call rate" (or simply "call rate") is the percentage of the bases of the human genome that are fully called with reference to a whole human genome reference. An "exome call rate" is the percentage of the bases of the exome that are fully called with reference to an exome reference. An exome sequence may be obtained by sequencing portions of a genome that have been enriched by any of various known methods that selectively capture genomic regions of interest from a DNA sample. Alternatively, an exome sequence may be obtained by sequencing a whole human genome, which includes the exome sequences. Thus, a whole human genome sequence may have both a "genome call rate" and an "exome call rate". There is also a "raw read call rate" that reflects the number of bases assigned an A/C/G/T designation as opposed to the total number of attempted bases. (Occasionally, the term "coverage" is used in place of "call rate", but the meaning will be apparent from the context.)
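As a concrete reading of the definition, the call rate can be computed as the fraction of reference positions that received a call. The representation used here (one character per reference position, with "N" marking a no-call) and the name `call_rate` are assumptions for illustration.

```python
def call_rate(calls: str, no_call: str = "N") -> float:
    """Fraction of positions with a full call (anything other than `no_call`).

    `calls` holds one character per reference position, e.g. for a genome
    or exome reference.
    """
    if not calls:
        raise ValueError("at least one reference position is required")
    called = sum(1 for base in calls if base != no_call)
    return called / len(calls)


if __name__ == "__main__":
    print(call_rate("ACGTNNACGT"))  # 0.8
```

Restricting `calls` to exome positions gives an exome call rate in the same way; multiplying by 100 gives the percentage form used in the text.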

Preparing fragments of complex nucleic acids

Nucleic acid isolation. The target genomic DNA is isolated using conventional techniques, for example as disclosed in Sambrook and Russell, Molecular Cloning: A Laboratory Manual, cited above. In some cases, particularly if small amounts of DNA are employed in a particular step, it is advantageous to provide carrier DNA, e.g., unrelated circular synthetic double-stranded DNA, to be mixed and used with the sample DNA whenever only small amounts of sample DNA are available and there is a danger of losses through nonspecific binding, e.g., to container walls.

According to some embodiments of the invention, genomic DNA or other complex nucleic acids are obtained from an individual cell or a small number of cells, with or without purification.

Long fragments are desirable for LFR. Long fragments of genomic nucleic acid can be isolated from a cell by a number of different methods. In one embodiment, cells are lysed and the intact nuclei are pelleted with a gentle centrifugation step. The genomic DNA is then released through proteinase K and RNase digestion for several hours. The material can be treated to lower the concentration of remaining cellular waste, e.g., by dialysis for a period of time (i.e., 2-16 hours) and/or by dilution. Because such methods need not employ many disruptive processes (such as ethanol precipitation, centrifugation, and vortexing), the genomic nucleic acid remains largely intact, yielding a majority of fragments longer than 150 kilobases. In some embodiments, the fragments are from about 5 to about 750 kilobases in length. In further embodiments, the fragments are from about 150 to about 600, about 200 to about 500, about 250 to about 400, and about 300 to about 350 kilobases in length. The smallest fragment that can be used for LFR is one containing at least two heterozygous loci (approximately 2-5 kb), and there is no maximum theoretical size, although fragment length can be limited by shearing resulting from manipulation of the starting nucleic acid preparation. Techniques that produce larger fragments result in a need for fewer aliquots, and those that produce shorter fragments may require more aliquots.

Once the DNA is isolated, and before it is aliquoted into individual wells, it is carefully fragmented to avoid loss of material, particularly of sequences from the ends of each fragment, since loss of such material can result in gaps in the final genome assembly. In one embodiment, sequence loss is avoided through the use of an infrequent nicking enzyme, which creates starting sites for a polymerase, such as phi29 polymerase, at distances of approximately 100 kb from each other. As the polymerase creates a new DNA strand, it displaces the old strand, which creates overlapping sequences near the sites of polymerase initiation. As a result, very little sequence is lost.

A controlled use of a 5' exonuclease (either before or during amplification, e.g., by MDA) can promote multiple replications of the original DNA from a single cell and thus minimize the propagation of early errors through the copying of copies.

In other embodiments, long DNA fragments are isolated and manipulated in a manner that minimizes shearing or absorption of the DNA to a vessel, including, for example, isolating cells in agarose in agarose gel plugs, or in oil, or using specially coated tubes and plates.

In some embodiments, further duplicating the fragmented DNA from a single cell before aliquoting can be achieved by ligating an adaptor with a single-stranded priming overhang and using an adaptor-specific primer and phi29 polymerase to make two copies from each long fragment. This can generate the equivalent of four cells' worth of DNA from a single cell.

Fragmentation. The target genomic DNA is then fractionated or fragmented to a desired size by conventional techniques, including enzymatic digestion, shearing, or sonication, with the latter two finding particular use in the present invention.

Fragment sizes of the target nucleic acid can vary depending on the source target nucleic acid and the library construction methods used, but for standard whole-genome sequencing such fragments typically range from 50 to 600 nucleotides in length. In another embodiment, the fragments are 300 to 600 or 200 to 2000 nucleotides in length. In yet another embodiment, the fragments are 10-100, 50-100, 50-300, 100-200, 200-300, 50-400, 100-400, 200-400, 300-400, 400-500, 400-600, 500-600, 50-1000, 100-1000, 200-1000, 300-1000, 400-1000, 500-1000, 600-1000, 700-1000, 700-900, 700-800, 800-1000, 900-1000, 1500-2000, 1750-2000, and 50-2000 nucleotides in length. Longer fragments are useful for LFR.

In a further embodiment, fragments of a particular size or in a particular range of sizes are isolated. Such methods are well known in the art. For example, gel fractionation can be used to produce a population of fragments of a particular size within a range of base pairs, for example 500 base pairs ± 50 base pairs.

In many cases, enzymatic digestion of the extracted DNA is not required because the shear forces created during lysis and extraction generate fragments in the desired range. In a further embodiment, shorter fragments (1-5 kb) can be generated by enzymatic fragmentation using restriction endonucleases. In a still further embodiment, about 10 to about 1,000,000 genome equivalents of DNA ensure that the population of fragments covers the entire genome. Libraries containing nucleic acid templates generated from such a population of overlapping fragments will thus comprise target nucleic acids whose sequences, once identified and assembled, will provide most or all of the sequence of the entire genome.

In some embodiments of the invention, a controlled random enzymatic ("CoRE") fragmentation method is used to prepare fragments. CoRE fragmentation is an enzymatic endpoint assay and has the advantages of enzymatic fragmentation (such as the ability to use it on low amounts and/or volumes of DNA) without many of its drawbacks (including sensitivity to variation in substrate or enzyme concentration and sensitivity to digestion time).

In one aspect, the present invention provides a method of fragmentation referred to herein as controlled random enzymatic (CoRE) fragmentation, which can be used alone or in combination with other mechanical and enzymatic fragmentation methods known in the art. CoRE fragmentation involves a series of three enzymatic steps. First, a nucleic acid is subjected to an amplification method that is conducted in the presence of dNTPs doped with a proportion of deoxyuracil ("dU") or uracil ("U"), resulting in substitution of dUTP or UTP at a defined and controllable proportion of the T positions in both strands of the amplification product. Any suitable amplification method can be used in this step of the invention. In certain embodiments, multiple displacement amplification (MDA) in the presence of dNTPs doped with dUTP or UTP in a defined ratio to dTTP is used to create amplification products with dUTP or UTP substituted at certain points on both strands.

After amplification and insertion of the uracil moieties, the uracils are excised, usually through a combination of UDG, EndoVIII, and T4PNK, to create single-base gaps with functional 5' phosphate and 3' hydroxyl ends. The single-base gaps are created at an average spacing defined by the frequency of U in the MDA product. That is, the higher the amount of dUTP, the shorter the resulting fragments. As will be appreciated by those skilled in the art, other techniques that result in selective replacement of a nucleotide with a modified nucleotide that can similarly be cleaved may also be used, such as chemically or otherwise enzymatically susceptible nucleotides.

Treatment of the gapped nucleic acid with a polymerase having exonuclease activity results in "translation" or "translocation" of the nicks along the length of the nucleic acid until nicks on opposite strands converge, thereby creating double-strand breaks. This produces a population of double-stranded fragments of relatively homogeneous size. The exonuclease activity of the polymerase (such as Taq polymerase) excises the short DNA strand that abuts the nick, while the polymerase activity "fills in" the nick and the subsequent nucleotides in that strand. In essence, Taq moves along the strand, excising bases with its exonuclease activity and adding the same bases back, so that the nick is translocated along the strand until the enzyme reaches the end.

Because the size distribution of the double-stranded fragments is a result of the ratio of dTTP to dUTP or UTP used in the MDA reaction, rather than of the duration or degree of enzymatic treatment, this CoRE fragmentation method yields a high degree of fragmentation reproducibility and generates a population of double-stranded nucleic acid fragments that are all of similar size.
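The dependence of nick spacing on the dUTP proportion can be illustrated with a small calculation. The sketch below is only an idealized model and is not taken from the specification: it assumes that T occupies a fixed fraction of positions on each strand (the default of 0.295 is an assumed approximate value for human DNA), that dU replaces T uniformly at random, and that every U is excised. The function name and parameters are hypothetical. It gives the mean spacing between gaps on one strand, not the final double-stranded fragment length, which also depends on how far the nicks are translated before they converge.

```python
def expected_nick_spacing(du_fraction: float, t_fraction: float = 0.295) -> float:
    """Mean number of bases between uracil-derived gaps on a single strand.

    du_fraction: assumed fraction of T positions that receive dU instead of dT.
    t_fraction:  assumed fraction of positions on a strand that are T.
    Under uniform random substitution, a gap occurs at any given position with
    probability t_fraction * du_fraction, so the mean spacing is its reciprocal.
    """
    if not 0.0 < du_fraction <= 1.0:
        raise ValueError("du_fraction must be in (0, 1]")
    if not 0.0 < t_fraction <= 1.0:
        raise ValueError("t_fraction must be in (0, 1]")
    return 1.0 / (t_fraction * du_fraction)


if __name__ == "__main__":
    # More dU means more gaps per strand and therefore shorter fragments.
    for du in (0.001, 0.01, 0.05):
        spacing = expected_nick_spacing(du)
        print(f"dU fraction {du:.3f}: about {spacing:,.0f} bases between gaps")
```

The point of the model is qualitative: spacing scales inversely with the dU proportion, which is why the size distribution is set by the nucleotide ratio rather than by digestion time.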

Fragment end repair and modification. In certain embodiments, after fragmentation, target nucleic acids are further modified to prepare them for insertion of multiple adaptors according to methods of the invention.

After physical fragmentation, target nucleic acids generally have a combination of blunt and overhanging ends as well as a combination of phosphate and hydroxyl chemistries at the termini. In this embodiment, the target nucleic acids are treated with several enzymes to create blunt ends with particular chemistries. In one embodiment, a polymerase and dNTPs are used to fill in any 5' single-stranded overhangs to create a blunt end. A polymerase with 3' exonuclease activity (generally but not always the same enzyme as the 5' active one, such as T4 polymerase) is used to remove 3' overhangs. Suitable polymerases include, but are not limited to, T4 polymerase, Taq polymerase, E. coli DNA polymerase I, Klenow fragment, reverse transcriptases, phi29-related polymerases (including wild-type phi29 polymerase and derivatives of such polymerases), T7 DNA polymerase, T5 DNA polymerase, and RNA polymerases. These techniques can be used to generate blunt ends, which are useful in a variety of applications.

In further optional embodiments, the chemistry at the termini is altered to prevent target nucleic acids from ligating to each other. For example, in addition to a polymerase, a protein kinase can also be used in the process of creating blunt ends by using its 3' phosphatase activity to convert 3' phosphate groups to hydroxyl groups. Such kinases can include, without limitation, commercially available kinases such as T4 kinase, as well as kinases that are not commercially available but have the desired activity.

Similarly, a phosphatase can be used to convert terminal phosphate groups to hydroxyl groups. Suitable phosphatases include, but are not limited to, alkaline phosphatase (including calf intestinal phosphatase), Antarctic phosphatase, apyrase, pyrophosphatase, inorganic (yeast) thermostable inorganic pyrophosphatase, and the like, which are known in the art.

These modifications prevent the target nucleic acids from ligating to each other in later steps of the methods of the invention, thus ensuring that during steps in which adaptors (and/or adaptor arms) are ligated to the termini of target nucleic acids, the target nucleic acids ligate to the adaptors and not to other target nucleic acids. Target nucleic acids can be ligated to adaptors in a desired orientation. Modifying the ends avoids undesired configurations in which target nucleic acids ligate to each other and/or adaptors ligate to each other. The orientation of each adaptor-target nucleic acid ligation can also be controlled by controlling the end chemistry of both the adaptors and the target nucleic acids. Such modifications can prevent the creation of nucleic acid templates containing different fragments ligated in an unknown configuration, thus reducing and/or eliminating the errors in sequence identification and assembly that can result from such undesired templates.

The DNA can be denatured after fragmentation to produce single-stranded fragments.

Amplification. In one embodiment, after fragmentation (and in fact before or after any step outlined herein), an amplification step can be applied to the population of fragmented nucleic acids to ensure that a sufficiently large concentration of all the fragments is available for subsequent steps. According to one embodiment of the invention, methods are provided for sequencing small quantities of complex nucleic acids, including those of higher organisms, in which such complex nucleic acids are amplified to produce sufficient nucleic acid for sequencing by the methods described herein. Given sufficient amplification, the sequencing methods described herein provide highly accurate sequences at a high call rate even with a single genome equivalent as the starting material. Note that a cell contains approximately 6.6 picograms (pg) of genomic DNA. Whole genomes or other complex nucleic acids from a single cell or a small number of cells of an organism, including higher organisms such as humans, can be sequenced by the methods of the invention. Sequencing of complex nucleic acids of a higher organism can be accomplished using 1 pg, 5 pg, 10 pg, 30 pg, 50 pg, 100 pg, or 1 ng of a complex nucleic acid as the starting material, which is amplified by any nucleic acid amplification method known in the art to produce, for example, 200 ng, 400 ng, 600 ng, 800 ng, 1 μg, 2 μg, 3 μg, 4 μg, 5 μg, 10 μg, or greater quantities of the complex nucleic acid. We also disclose nucleic acid amplification protocols that minimize GC bias. However, the need for amplification, and the GC bias that follows from it, can be reduced further simply by isolating one cell or a small number of cells, culturing them for a sufficient time under suitable culture conditions known in the art, and using the progeny of the starting cell or cells for sequencing.
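The starting and target quantities listed above imply a wide range of amplification factors. As a rough illustration, the following sketch converts a starting mass into approximate cell equivalents, using the 6.6 pg per cell figure from the text, and computes the fold amplification needed to reach a target yield. The function names are hypothetical, and the calculation ignores losses and amplification bias.

```python
PG_PER_CELL = 6.6  # approximate genomic DNA per cell, as stated in the text


def cell_equivalents(start_pg: float) -> float:
    """Approximate number of cells' worth of genomic DNA in start_pg picograms."""
    return start_pg / PG_PER_CELL


def fold_amplification(start_pg: float, target_ng: float) -> float:
    """Fold increase in mass needed to go from start_pg (pg) to target_ng (ng)."""
    if start_pg <= 0:
        raise ValueError("start_pg must be positive")
    return (target_ng * 1000.0) / start_pg  # 1 ng = 1000 pg


if __name__ == "__main__":
    for start_pg, target_ng in ((1, 200), (10, 1000), (100, 1000), (1000, 10000)):
        print(
            f"{start_pg:>5} pg (~{cell_equivalents(start_pg):.1f} cells) -> "
            f"{target_ng:>6} ng requires ~{fold_amplification(start_pg, target_ng):,.0f}-fold"
        )
```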

Such amplification methods include, but are not limited to: multiple displacement amplification (MDA), polymerase chain reaction (PCR), ligation chain reaction (sometimes referred to as oligonucleotide ligase amplification, OLA), cycling probe technology (CPT), strand displacement assay (SDA), transcription-mediated amplification (TMA), nucleic acid sequence-based amplification (NASBA), rolling circle amplification (RCA) (for circularized fragments), and invasive cleavage technology.

Amplification can be performed after fragmentation, or before or after any step outlined herein.

MDA amplification protocol with reduced GC bias. In one aspect, the present invention provides methods of sample preparation in which approximately 10 Mb of DNA per aliquot is faithfully amplified, e.g., approximately 30,000-fold depending on the amount of starting DNA, prior to library construction and sequencing.
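To give a sense of scale for these numbers, the sketch below converts an aliquot's DNA content from base pairs to mass and applies the fold amplification. It assumes an average molar mass of about 650 g/mol per base pair of double-stranded DNA, which is a common approximation and not a value from the specification; the function name is hypothetical.

```python
AVOGADRO = 6.02214076e23   # molecules per mole
G_PER_MOL_PER_BP = 650.0   # assumed average molar mass of one base pair


def dsdna_mass_pg(base_pairs: float) -> float:
    """Approximate mass in picograms of double-stranded DNA of the given length."""
    grams = base_pairs * G_PER_MOL_PER_BP / AVOGADRO
    return grams * 1e12  # 1 g = 1e12 pg


if __name__ == "__main__":
    start_pg = dsdna_mass_pg(10e6)    # about 10 Mb of DNA per aliquot
    amplified_pg = start_pg * 30_000  # about 30,000-fold amplification
    print(f"10 Mb of dsDNA is roughly {start_pg * 1000:.1f} fg")
    print(f"after 30,000-fold amplification: roughly {amplified_pg:.0f} pg per aliquot")
```

Under these assumptions an aliquot starts with about ten femtograms of DNA and ends with a few hundred picograms, which is why faithful amplification is needed before library construction.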

According to one embodiment of the LFR methods of the invention, LFR begins with treatment of genomic nucleic acids, usually genomic DNA, with a 5' exonuclease to create 3' single-stranded overhangs. Such single-stranded overhangs serve as MDA initiation sites. Use of the exonuclease also eliminates the need for a heat or alkaline denaturation step prior to amplification, without introducing bias into the population of fragments. In another embodiment, alkaline denaturation is combined with the 5' exonuclease treatment, which results in a reduction in bias that is greater than what is seen with either treatment alone. The DNA treated with 5' exonuclease, and optionally with alkaline denaturation, is then diluted to sub-genome concentrations and dispersed across a number of aliquots, as discussed above. After separation into aliquots, e.g., across multiple wells, the fragments in each aliquot are amplified.

In one embodiment, phi29-based multiple displacement amplification (MDA) is used. Numerous studies have examined the range of unwanted amplification biases, background product formation, and chimeric artifacts introduced via phi29-based MDA, but many of these shortcomings have occurred under extreme conditions of amplification (greater than 1 million-fold). Commonly, LFR employs a substantially lower level of amplification and starts with long DNA fragments (e.g., approximately 100 kb), resulting in efficient MDA and more acceptable levels of amplification bias and other amplification-related problems.

We have developed an improved MDA protocol to overcome problems associated with MDA, one that uses various additives (e.g., DNA-modifying enzymes, sugars, and/or chemicals such as DMSO) and/or in which different components of the MDA reaction conditions are reduced, increased, or substituted to further improve the protocol. To minimize chimeras, reagents can also be included that reduce the availability of displaced single-stranded DNA to act as an incorrect template for the extending DNA strand, which is a common mechanism of chimera formation. A major source of the coverage bias introduced by MDA is the difference in amplification between GC-rich and AT-rich regions. This can be corrected by using different reagents in the MDA reaction and/or by adjusting the primer concentration to create an environment for even priming across all %GC regions of the genome. In some embodiments, random hexamers are used to prime MDA. In other embodiments, other primer designs are used to reduce bias. In further embodiments, use of a 5' exonuclease before or during MDA can help initiate low-bias successful priming, particularly with longer (i.e., 200 kb to 1 Mb) fragments, which are useful for sequencing regions characterized by long segmental duplications (i.e., in some cancer cells) and complex repeats.

In some embodiments, improved, more efficient fragmentation and ligation steps are used that reduce the number of rounds of MDA amplification required for preparing samples by as much as 10,000-fold, which further reduces the bias and chimera formation resulting from MDA.

In some embodiments, the MDA reaction is designed to introduce uracils into the amplification products in preparation for CoRE fragmentation. In some embodiments, a standard MDA reaction using random hexamers is used to amplify the fragments in each well; alternatively, random 8-mer primers can be used to reduce amplification bias (e.g., GC bias) in the population of fragments. In further embodiments, several different enzymes can also be added to the MDA reaction to reduce amplification bias. For example, low concentrations of non-processive 5' exonucleases and/or single-stranded binding proteins can be used to create binding sites for the 8-mers. Chemical agents such as betaine, DMSO, and trehalose can also be used to reduce bias.

After amplification of the fragments in each aliquot, the amplification products may optionally be subjected to another round of fragmentation. In some embodiments, the CoRE method is used to further fragment the fragments in each aliquot following amplification. In such embodiments, MDA amplification of the fragments in each aliquot is designed to incorporate uracils into the MDA products. Each aliquot containing MDA products is treated with a mix of uracil DNA glycosylase (UDG), DNA glycosylase-lyase endonuclease VIII, and T4 polynucleotide kinase to excise the uracil bases and create single-base gaps with functional 5' phosphate and 3' hydroxyl groups. Nick translation using a polymerase such as Taq polymerase results in double-stranded blunt-end breaks, producing ligatable fragments whose size range depends on the concentration of dUTP added to the MDA reaction. In some embodiments, the CoRE method used involves removing uracils by phi29 polymerization and strand displacement. Fragmentation of the MDA products can also be achieved by sonication or enzymatic treatment. Enzymatic treatments that can be used in this embodiment include, without limitation, DNase I, T7 endonuclease I, micrococcal nuclease, and the like.

Following fragmentation of the MDA products, the ends of the resulting fragments can be repaired. Many fragmentation techniques produce termini with overhanging ends and termini with functional groups that are not useful in later ligation reactions, such as 3' and 5' hydroxyl groups and/or 3' and 5' phosphate groups. It can be useful to have fragments that are repaired to have blunt ends. It may also be desirable to modify the termini to add or remove phosphate and hydroxyl groups in order to prevent "polymerization" of the target sequences. For example, a phosphatase can be used to eliminate phosphate groups, so that all ends contain hydroxyl groups. Each end can then be selectively altered to allow ligation between the desired components. One end of the fragments can then be "activated" by treatment with alkaline phosphatase. The fragments can then be tagged with an adaptor to identify fragments that come from the same aliquot in the LFR method.

Tagging fragments in each aliquot. After amplification, the DNA in each aliquot is tagged so as to identify the aliquot from which each fragment originated. In further embodiments, the amplified DNA in each aliquot can be further fragmented before being tagged with an adaptor, such that fragments from the same aliquot will all carry the same tag; see, for example, US 2007/0072208, which is hereby incorporated by reference.

According to one embodiment, the adaptor is designed in two segments. One segment is common to all wells and is blunt-end ligated directly to the fragments using methods described further herein. This "common" adaptor is added as two adaptor arms: one arm is blunt-end ligated to the 5' end of the fragment, and the other arm is blunt-end ligated to the 3' end of the fragment. The second segment of the tagging adaptor is a "barcode" segment that is unique to each well. This barcode is generally a unique sequence of nucleotides, and each fragment in a particular well is given the same barcode. Thus, when the tagged fragments from all the wells are recombined for sequencing applications, fragments from the same well can be identified by identifying the barcode adaptor. The barcode is ligated to the 5' end of the common adaptor arm. The common adaptor and the barcode adaptor can be ligated to the fragment sequentially or simultaneously. As will be described in further detail herein, the ends of the common adaptor and the barcode adaptor can be modified such that each adaptor segment ligates in the correct orientation and to the proper molecule. Such modifications prevent "polymerization" of the adaptor segments or of the fragments by ensuring that the fragments cannot ligate to each other and that the adaptor segments can ligate only in the illustrated orientation.

In further embodiments, a three-segment design is used for the adaptors that tag the fragments in each well. This embodiment is similar to the barcode adaptor design described above, except that the barcode adaptor segment is split into two segments. This design allows for a wide range of possible barcodes, because combinatorial barcode adaptor segments can be generated by ligating different barcode segments together to form the full barcode segment. This combinatorial design provides a larger repertoire of possible barcode adaptors while reducing the number of full-size barcode adaptors that need to be generated. In further embodiments, unique identification of each aliquot is achieved with error-correcting barcodes of 8-12 base pairs. In some embodiments, the same number of adaptors as wells is used (384 and 1536 in the non-limiting examples described above). In further embodiments, the costs associated with generating adaptors are reduced through a novel combinatorial tagging approach based on two sets of 40 half-barcode adaptors.
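The saving from combinatorial tagging follows from simple counting: two sets of 40 half-barcodes give 40 x 40 = 1,600 distinct pairs, which is enough for 1,536 wells while requiring only 80 oligonucleotides. The sketch below illustrates one way such an assignment could be enumerated. The half-barcode labels are placeholders rather than real sequences, and the function name and the order of assignment are assumptions made for illustration.

```python
from itertools import product

# Placeholder labels standing in for two sets of 40 half-barcode adaptors.
HALF_A = [f"A{i:02d}" for i in range(40)]
HALF_B = [f"B{i:02d}" for i in range(40)]


def assign_well_tags(n_wells, half_a, half_b):
    """Map each well index to a unique (half_a, half_b) pair.

    Raises ValueError if there are more wells than distinct pairs.
    """
    pairs = list(product(half_a, half_b))
    if n_wells > len(pairs):
        raise ValueError(f"{n_wells} wells but only {len(pairs)} tag pairs")
    return {well: pairs[well] for well in range(n_wells)}


if __name__ == "__main__":
    tags = assign_well_tags(1536, HALF_A, HALF_B)
    print(f"{len(HALF_A) + len(HALF_B)} half-barcodes tag {len(tags)} wells")
    print("well 0 ->", tags[0], "; well 1535 ->", tags[1535])
```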

In one embodiment, library construction involves using two different adaptors. The A and B adaptors are easily modified to each contain a different half-barcode sequence, yielding thousands of combinations. In a further embodiment, the barcode sequences are incorporated on the same adaptor. This can be achieved by splitting the B adaptor into two parts, each with a half-barcode sequence, separated by a common overhang sequence used for ligation. The two tag components have 4-6 bases each. An 8-base (2 x 4 bases) tag set is capable of uniquely tagging 65,000 aliquots. One extra base (2 x 5 bases) allows for error detection, and 12-base tags (2 x 6 bases, 12 million unique barcode sequences) can be designed to allow substantial error detection and correction in 10,000 or more aliquots using a Reed-Solomon design (U.S. patent application 12/697,995, published as US 2010/0199155, which is incorporated herein by reference). Both the 2 x 5 base and 2 x 6 base tags can include the use of degenerate bases (i.e., "wildcards") to achieve optimal decoding efficiency.
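The tag capacities quoted above come from counting sequences over a four-letter alphabet. The short sketch below shows that count and a Hamming-distance helper, which is the usual way to reason about how many read errors a tag set can detect. It is illustrative only: it does not implement the Reed-Solomon design referenced in the text, and the function names are hypothetical. Note also that the unrestricted count for 12 bases is larger than the 12 million stated in the text; the text does not explain how its figure was derived, and no attempt is made to reproduce it here.

```python
def tag_capacity(n_bases: int) -> int:
    """Number of distinct sequences of length n_bases over A, C, G, T."""
    return 4 ** n_bases


def hamming_distance(tag_a: str, tag_b: str) -> int:
    """Number of positions at which two equal-length tags differ."""
    if len(tag_a) != len(tag_b):
        raise ValueError("tags must have the same length")
    return sum(1 for a, b in zip(tag_a, tag_b) if a != b)


if __name__ == "__main__":
    for n in (8, 10, 12):
        print(f"{n}-base tags: {tag_capacity(n):,} possible sequences")
    # Tags that differ in at least two positions cannot be converted into
    # one another by a single substitution, so one error is detectable.
    print(hamming_distance("ACGTACGT", "ACGTTCGA"))
```

Using fewer tags than the full capacity is what leaves room for error detection and correction: the chosen tags can be spaced apart so that a small number of miscalled bases does not turn one valid tag into another.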

After the fragments in each well are tagged, all of the fragments are combined or pooled to form a single population. These fragments can then be used to generate nucleic acid templates or library constructs for sequencing. The nucleic acid templates generated from these tagged fragments will be identifiable as belonging to a particular well by the barcode tag adaptors attached to each fragment.
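Once the tagged fragments are pooled, assigning sequence data back to wells is a lookup on the barcode. The sketch below shows that grouping step in its simplest form. It assumes, purely for illustration, that each read is a string whose first `barcode_len` characters are the barcode and that only exact matches are accepted; in a real construct the barcode sits in the adaptor and may be read separately, and error-tolerant matching would normally be used. All names are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_well, barcode_len):
    """Group reads by the well whose barcode they carry.

    Returns (by_well, unassigned), where by_well maps a well id to the list
    of insert sequences and unassigned holds reads with unknown barcodes.
    """
    by_well = defaultdict(list)
    unassigned = []
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        well = barcode_to_well.get(barcode)
        if well is None:
            unassigned.append(read)
        else:
            by_well[well].append(insert)
    return dict(by_well), unassigned


if __name__ == "__main__":
    toy_barcodes = {"ACGT": "well_1", "TTAG": "well_2"}  # toy values
    toy_reads = ["ACGTGGGCCC", "TTAGAAATTT", "ACGTCCCGGG", "NNNNGATTACA"]
    grouped, unknown = demultiplex(toy_reads, toy_barcodes, barcode_len=4)
    print(grouped)
    print("unassigned:", unknown)
```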

Long Fragment Read (LFR) Technology

Overview

Individual human genomes are diploid in nature, with half of the homologous chromosomes being derived from each parent. The context in which variations occur on each individual chromosome can have profound effects on the expression and regulation of genes and other transcribed regions of the genome. Furthermore, determining whether two potentially detrimental mutations occur within one or both alleles of a gene is of paramount clinical importance.

Current methods for whole-genome sequencing lack the ability to separately assemble the parental chromosomes in a cost-effective way and to describe the context (haplotypes) in which variations co-occur. Simulation experiments show that chromosome-level haplotyping requires allele linkage information across a range of at least 70-100 kb. This cannot be achieved with existing technologies that use amplified DNA, which are limited to reads of less than 1000 bases because of the difficulty of uniformly amplifying long DNA molecules and the loss of linkage information during sequencing. Mate-pair technologies can provide the equivalent of an extended read length but are limited to less than 10 kb because of inefficiencies in making such DNA libraries (due to the difficulty of circularizing DNA longer than a few kb). This approach also needs extreme read coverage to link all heterozygotes.

Single-molecule sequencing of DNA fragments greater than 100 kb, if it were feasible, would be useful for haplotyping, provided that the accuracy of single-molecule sequencing were high and the detection and instrument costs were low. This is very difficult to achieve at high yield on short molecules, let alone on 100 kb fragments.

Most recent human genome sequencing has been performed on short read-length (<200 bp), highly parallelized systems, starting with hundreds of nanograms of DNA. These technologies are excellent at generating large volumes of data quickly and economically. Unfortunately, short reads, often paired with small mate-pair gap sizes (500 bp-10 kb), eliminate most SNP phase information beyond a few kilobases (McKernan et al., Genome Res. 19:1527, 2009). Furthermore, it is very difficult to maintain long DNA fragments through multiple processing steps without fragmentation caused by shearing.

At present, three personal genomes, those of J. Craig Venter (Levy et al., PLoS Biol. 5:e254, 2007), a Gujarati Indian (HapMap sample NA20847; Kitzman et al., Nat. Biotechnol. 29:59, 2011), and two Europeans (Max Planck One [MP1]; Suk et al., Genome Res., 2011; genome.cshlp.org/content/early/2011/09/02/gr.125047.111.full.pdf; and HapMap sample NA12878; Duitama et al., Nucl. Acids Res. 40:2041-2053, 2012), have been sequenced and assembled as diploid. All have involved cloning long DNA fragments into constructs, in a process similar to the bacterial artificial chromosome (BAC) sequencing used during construction of the human reference genome (Venter et al., Science 291:1304, 2001; Lander et al., Nature 409:860, 2001). While these processes generate long phased contigs (N50s of 350 kb [Levy et al., PLoS Biol. 5:e254, 2007], 386 kb [Kitzman et al., Nat. Biotechnol. 29:59-63, 2011], and 1 Mb [Suk et al., Genome Res. 21:1672-1685, 2011]), they require a large amount of initial DNA and extensive library processing, and they are too expensive to use in a routine clinical environment.

Additionally, whole-chromosome haplotyping has been demonstrated through direct isolation of metaphase chromosomes (Zhang et al., Nat. Genet. 38:382-387, 2006; Ma et al., Nat. Methods 7:299-301, 2010; Fan et al., Nat. Biotechnol. 29:51-57, 2011; Yang et al., Proc. Natl. Acad. Sci. USA 108:12-17, 2011). These methods are excellent for long-range haplotyping but have yet to be used for whole-genome sequencing, and they require preparation and isolation of whole metaphase chromosomes, which can be challenging for some clinical samples.

LFR methods overcome these limitations. LFR comprises DNA preparation and tagging, together with related algorithms and software, to enable accurate assembly of the separate sequences of the parental chromosomes (i.e., complete haplotyping) in diploid genomes at significantly reduced experimental and computational cost.

LFR is based on the physical separation of long fragments of genomic DNA (or other nucleic acids) across many different aliquots, such that there is a low probability that any given region of the genome from both the maternal and the paternal component is represented in the same aliquot. By placing a unique identifier in each aliquot and analyzing many aliquots in the aggregate, DNA sequence data can be assembled into a diploid genome; that is, the sequence of each parental chromosome can be determined. LFR does not require cloning fragments of a complex nucleic acid into a vector, as haplotyping approaches that use large-fragment (e.g., BAC) libraries do. Nor does LFR require direct isolation of individual chromosomes of an organism. Finally, LFR can be performed on an individual organism and does not require a population of organisms in order to accomplish haplotype phasing.
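The "low probability" of co-occurrence can be estimated with a simple occupancy calculation. The sketch below assumes that the fragments covering a given locus are distributed independently and uniformly at random among the aliquots; that model, the function name, and the example input of 10 copies per haplotype are illustrative assumptions and not values stated in this passage. Given one maternal fragment in a well, it computes the chance that at least one paternal fragment covering the same locus lands in the same well.

```python
def prob_other_haplotype_in_same_aliquot(n_copies: int, n_aliquots: int) -> float:
    """Chance that a well holding one fragment of a locus from one parent also
    holds at least one fragment of the same locus from the other parent.

    n_copies:   number of fragments from the other parent covering the locus
                (roughly the number of cells used as input).
    n_aliquots: number of wells over which fragments are spread uniformly.
    """
    if n_copies < 0 or n_aliquots < 1:
        raise ValueError("need n_copies >= 0 and n_aliquots >= 1")
    return 1.0 - (1.0 - 1.0 / n_aliquots) ** n_copies


if __name__ == "__main__":
    for wells in (384, 1536):
        p = prob_other_haplotype_in_same_aliquot(10, wells)
        print(f"{wells} wells, 10 copies: {p:.2%} chance of mixed haplotypes at a locus")
```

Under this model, spreading the same input over more aliquots lowers the chance that both parental copies of a region share a well, which is what allows reads carrying one barcode to be treated as coming from a single haplotype at most loci.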

As used herein, the term "vector" means a plasmid or viral vector into which a fragment of foreign DNA is inserted. A vector is used to introduce foreign DNA into a suitable host cell, where the vector and the inserted foreign DNA replicate owing to the presence in the vector of, for example, a functional origin of replication or an autonomously replicating sequence. As used herein, the term "cloning" refers to the insertion of a fragment of DNA into a vector and the replication of the vector with the inserted foreign DNA in a suitable host cell.

LFR can be used with the sequencing methods discussed in detail herein and, more generally, as a preprocessing method with any sequencing technology known in the art, including both short-read and longer-read methods. LFR can also be used in conjunction with various types of analysis, including, for example, analysis of the transcriptome, methylome, and so on. Because it requires very little input DNA, LFR can be used for sequencing and haplotyping one cell or a small number of cells, which can be particularly important for cancer, prenatal diagnostics, and personalized medicine. This can facilitate the identification of familial genetic disease, among other uses. By making it possible to distinguish calls from the two sets of chromosomes in a diploid sample, LFR also allows higher-confidence calling of variant and non-variant positions at low coverage. Additional applications of LFR include resolution of extensive rearrangements in cancer genomes and full-length sequencing of alternatively spliced transcripts.

LFR can be used to process and analyze complex nucleic acids, including but not limited to genomic DNA, that are purified or unpurified, including cells and tissues that are gently disrupted to release such complex nucleic acids without shearing or excessively fragmenting them.

In one aspect, LFR produces virtual read lengths of about 100-1000 kb.

In addition, LFR can significantly reduce the computational demands and associated costs of any short-read technology. Importantly, LFR removes the need to extend sequencing read length if doing so reduces overall yield. A further benefit of LFR is a substantial (10- to 1000-fold) reduction in the errors or questionable base calls that can result from current sequencing technologies, which typically amount to one per 100 kb, or 30,000 false positive calls per human genome, and a similar number of undetected variants per human genome. This dramatic reduction in errors minimizes the need for follow-up confirmation of detected variants and facilitates the adoption of human genome sequencing for diagnostic applications.

In addition to being applicable to all sequencing platforms, LFR-based sequencing can be applied to any application, including but not limited to the study of structural rearrangements in cancer genomes, full methylome analysis including the haplotypes of methylated sites, and de novo assembly applications for metagenomics or novel genome sequencing, even of complex polyploid genomes such as those found in plants.

In contrast to merely a consensus sequence of parental or related chromosomes, LFR provides the ability to obtain the actual sequence of individual chromosomes (in spite of their high similarity and the presence of long repeats and segmental duplications). To generate this type of data, sequence continuity is generally established over long DNA ranges, such as 100 kb to 1 Mb.

A further aspect of the invention includes software and algorithms for efficiently utilizing LFR data for whole-chromosome haplotype and structural variation mapping and for correcting false positive/negative errors to fewer than 300 errors per human chromosome.

In a further aspect, the LFR technology of the invention reduces the complexity of the DNA in each aliquot by 100- to 1000-fold, depending on the number of aliquots and cells used. Complexity reduction and haplotype separation in long DNA of more than 100 kb can help to assemble and detect all variations in human and other diploid genomes more efficiently and cost-effectively (with up to a 100-fold reduction in cost).

The LFR methods described herein can be used as a pre-processing step for sequencing diploid genomes using any sequencing method known in the art. In other embodiments, the LFR methods described herein can be used on many sequencing platforms, including for example without limitation polymerase-based sequencing-by-synthesis (e.g., HiSeq 2500 system, Illumina, San Diego, CA), ligation-based sequencing (e.g., SOLiD 5500, Life Technologies Corporation, Carlsbad, CA), ion semiconductor sequencing (e.g., Ion PGM or Ion Proton sequencers, Life Technologies Corporation, Carlsbad, CA), zero-mode waveguides (e.g., PacBio RS sequencer, Pacific Biosciences, Menlo Park, CA), nanopore sequencing (e.g., Oxford Nanopore Technologies Ltd., Oxford, United Kingdom), pyrosequencing (e.g., 454 Life Sciences, Branford, CT), or other sequencing technologies. Some of these sequencing technologies are short-read technologies, but others produce longer reads, e.g., the GS FLX+ (454 Life Sciences; up to 1000 bp), PacBio RS (Pacific Biosciences; approximately 1000 bp), and nanopore sequencing (Oxford Nanopore Technologies Ltd.; 100 kb). For haplotype phasing, longer reads are advantageous and require much less computation, although they tend to have a higher error rate, and errors in such long reads may need to be identified and corrected according to the methods set forth herein before haplotype phasing.

According to one embodiment of the invention, the basic steps of LFR include: (1) separating long fragments of a complex nucleic acid (e.g., genomic DNA) into aliquots, each aliquot containing a fraction of a genome equivalent of DNA; (2) amplifying the genomic fragments in each aliquot; (3) fragmenting the amplified genomic fragments to create short fragments of a size suitable for library construction (e.g., about 500 bases in length in one embodiment); (4) tagging the short fragments to permit identification of the aliquot from which the short fragments originated; (5) pooling the tagged fragments; (6) sequencing the pooled, tagged fragments; and (7) analyzing the resulting sequence data to map and assemble the data and to obtain haplotype information. According to one embodiment, LFR uses a 384-well plate with 10-20% of a haploid genome in each well, yielding a theoretical 19-38x physical coverage of both the maternal and paternal alleles of each fragment. This 19-38x initial DNA redundancy ensures complete genome coverage and higher variant calling and phasing accuracy. LFR avoids subcloning of fragments of the complex nucleic acid into a vector or the need to isolate individual chromosomes (e.g., metaphase chromosomes), and it can be fully automated, making it suitable for high-throughput, cost-effective applications.
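The 19-38x figure quoted above follows from simple arithmetic, sketched below in Python. The function name is hypothetical, and the sketch assumes that the haploid genomes distributed across the plate are split evenly between maternal and paternal copies.

```python
def physical_coverage(num_wells: int, haploid_fraction_per_well: float) -> float:
    """Expected physical coverage of each parental allele across a plate.

    Assumption (not stated as a formula in the text): the haploid genomes
    distributed across all wells are half maternal and half paternal.
    """
    total_haploid_genomes = num_wells * haploid_fraction_per_well
    return total_haploid_genomes / 2


# 384-well plate with 10% and 20% of a haploid genome per well
low = physical_coverage(384, 0.10)
high = physical_coverage(384, 0.20)
print(f"{low:.1f}x to {high:.1f}x per parental allele")
```

With 384 wells, 10% and 20% of a haploid genome per well give roughly 19x and 38x, matching the range stated in the embodiment.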

We have also developed techniques that use LFR for error reduction and for other purposes detailed herein. LFR methods have been disclosed in U.S. Patent Application Nos. 12/816,365, 12/329,365, 12/266,385, and 12/265,593, and in U.S. Patent Nos. 7,906,285, 7,901,891, and 7,709,197, all of which are hereby incorporated by reference in their entirety.

As used herein, the term "haplotype" means a combination of alleles at adjacent locations (loci) on a chromosome that are transmitted together or, alternatively, a set of sequence variants on a single chromosome of a chromosome pair that are statistically associated. Every human individual has two sets of chromosomes, one paternal and one maternal. Usually DNA sequencing yields only genotype information, that is, the sequence of unordered pairs of alleles along a segment of DNA. Inferring the haplotypes for a genotype separates the alleles in each unordered pair into two separate sequences, each called a haplotype. Haplotype information is necessary for many different types of genetic analysis, including disease association studies and inference of population ancestry.

As used herein, the term "phasing" (or resolution) means sorting sequence data into the two sets of parental chromosomes or haplotypes. Haplotype phasing refers to the problem of receiving as input a set of genotypes for one individual or for a population (i.e., more than one individual) and outputting a pair of haplotypes for each individual, one paternal and the other maternal. Phasing can involve resolving sequence data over a region of a genome, or over as little as two sequence variants in a read or contig, which may be referred to as local phasing or microphasing. It can also involve phasing of longer contigs (generally including more than about ten sequence variants) or even of a whole genome sequence, which may be referred to as "universal phasing." Optionally, phasing of sequence variants takes place during genome assembly.

Aliquoting multiple genome equivalents of a complex nucleic acid

The LFR method is based on the random physical separation of a genome in long fragments into many aliquots, such that each aliquot contains a fraction of a haploid genome. As the fraction of the genome in each pool decreases, the statistical probability of having corresponding fragments from both parental chromosomes in the same pool decreases markedly.

In some embodiments, 10% of a genome equivalent is aliquoted into each well of a multiwell plate. In other embodiments, 1% to 50% of a genome equivalent of the complex nucleic acid is aliquoted into each well. As noted above, the number of aliquots and of genome equivalents can depend on the number of aliquots, the initial fragment size, or other factors. Optionally, a double-stranded nucleic acid (e.g., a human genome) is denatured before aliquoting; in this way, single-stranded complements may be apportioned to different aliquots. According to one embodiment, each aliquot contains 2, 4, 6, or more copies (or complements) of the majority of strands of the complex nucleic acid (or 2, 4, 6, or more complements if the double-stranded nucleic acid is denatured before aliquoting).

For example, at 0.1 genome equivalents per aliquot (approximately 0.66 picograms, or pg, of DNA, at approximately 6.6 pg per human genome), there is a 10% chance that two fragments will overlap and a 50% chance that those fragments will be derived from different parental chromosomes. This yields 95% of the base pairs in an aliquot being non-overlapping, that is, a 5% overall chance that a particular aliquot will be uninformative for a given fragment because the aliquot contains fragments derived from both the maternal and the paternal chromosome. Aliquots that are uninformative can be identified because the sequence data resulting from such aliquots contain an increased amount of "noise," that is, impurity in the connectivity matrix between pairs of heterozygous loci. A fuzzy inference system (FIS) provides robustness against some degree of impurity, i.e., it can make correct connections despite the impurity (up to a certain extent). Even smaller amounts of genomic DNA can be used, particularly in the context of microdroplets, nanodroplets, or emulsions, in which each droplet may contain a single DNA fragment (e.g., a single 50 kb fragment of genomic DNA, or approximately 1.5 x 10^-5 genome equivalents). Even at 50% of a genome equivalent, the majority of aliquots would be informative. At higher levels, e.g., 70% of a genome equivalent, the informative wells can be identified and used. According to one aspect of the invention, 0.000015, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10, 15, 20, 25, 40, 50, 60, or 70% of a genome equivalent of the complex nucleic acid is present in each aliquot.
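The worked numbers in the example above (0.66 pg of DNA and a 5% chance that an aliquot is uninformative for a given fragment) can be reproduced with the short Python sketch below. The function names are hypothetical, and the sketch makes the simplifying assumption, suggested by the 0.1 genome equivalent case in the text, that the overlap probability is approximately equal to the genome-equivalent fraction per aliquot.

```python
PG_PER_HUMAN_GENOME = 6.6  # picograms per genome equivalent, as quoted in the text


def dna_mass_pg(genome_equivalents: float) -> float:
    """Mass of DNA (pg) corresponding to a number of genome equivalents."""
    return genome_equivalents * PG_PER_HUMAN_GENOME


def uninformative_probability(genome_equivalents: float,
                              different_parent_probability: float = 0.5) -> float:
    """Chance that an aliquot is uninformative for a given fragment.

    Assumption: the chance that a second, overlapping fragment is present
    is approximated by the genome-equivalent fraction in the aliquot.
    """
    overlap_probability = genome_equivalents
    return overlap_probability * different_parent_probability


for fraction in (0.01, 0.1):
    mass = dna_mass_pg(fraction)
    p = uninformative_probability(fraction)
    print(f"{fraction:.2f} genome equivalents: {mass:.3f} pg, "
          f"{p:.1%} uninformative, {1 - p:.1%} informative")
```

Under this approximation, lowering the fraction per aliquot from 10% to 1% lowers the chance of a mixed aliquot tenfold, which is the trend the text describes for larger numbers of smaller aliquots.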

It should be appreciated that the dilution factor can depend on the initial size of the fragments. That is, using gentle techniques to isolate genomic DNA, fragments of roughly 100 kb can be obtained, which are then aliquoted. Techniques that yield larger fragments require fewer aliquots, and techniques that produce shorter fragments may require more dilution.

We have successfully performed all six enzymatic steps in the same reaction without DNA purification, which facilitates miniaturization and automation and makes it feasible to adapt LFR to a wide variety of platforms and sample preparation methods.

According to one embodiment, each aliquot is contained in a separate well of a multiwell plate (e.g., a 384-well plate). However, any suitable type of container or system known in the art can be used to hold the aliquots, or the LFR method can be performed using microdroplets or emulsions, as described herein. According to one embodiment of the invention, volumes are reduced to sub-microliter levels. In one embodiment, automated pipetting approaches can be used in a 1536-well format.

In general, as the number of aliquots increases, for example to 1536, and the percentage of the genome in each decreases to approximately 1% of a haploid genome, the statistical support for haplotypes increases dramatically, because the sporadic presence of both the maternal and the paternal haplotype in the same well diminishes. Consequently, a large number of small aliquots, each with a negligible frequency of mixed haplotypes, allows the use of fewer cells. Similarly, longer fragments (e.g., 300 kb or longer) help to bridge segments that lack heterozygous loci.

Nanoliter (nl) dispensing tools (e.g., Hamilton Robotics Nano Pipetting head, TTP LabTech Mosquito, and others) that provide non-contact pipetting of 50-100 nl can be used for fast and low-cost pipetting to make tens of genome libraries in parallel. The increase in the number of aliquots (as compared with a 384-well plate) results in a large reduction in the complexity of the genome within each well, which reduces the overall cost of computing more than 10-fold and increases data quality. In addition, automation of this process increases throughput and lowers the hands-on cost of producing libraries.

LFR using smaller aliquot volumes (including microdroplets and emulsions)

Even further cost reductions and other advantages can be achieved using microdroplets. In some embodiments, LFR is performed with combinatorial tagging in emulsion or microfluidic devices. A reduction of volumes down to picoliter levels in 10,000 aliquots can achieve an even greater cost reduction through lower reagent and computational costs.

In one embodiment, LFR uses a reagent volume of 10 microliters (μl) per well in a 384-well format. Such volumes can be reduced, for example, by using commercially available automated pipetting approaches in a 1536-well format. Further volume reductions can be achieved using nanoliter (nl) dispensing tools (e.g., Hamilton Robotics Nano Pipetting head, TTP LabTech Mosquito, and others) that provide non-contact pipetting of 50-100 nl and that can be used for fast and low-cost pipetting to make tens of genome libraries in parallel. Increasing the number of aliquots results in a large reduction in the complexity of the genome within each well, which reduces the overall cost of computing and increases data quality. In addition, automation of this process increases throughput and lowers the cost of producing libraries.

In further embodiments, unique identification of each aliquot is achieved with 8-12 base pair error-correcting barcodes. In some embodiments, the same number of adapters as wells is used.

In further embodiments, a novel combinatorial tagging approach is used that is based on two sets of 40 half-barcode adapters. In one embodiment, library construction involves using two different adapters. The A and B adapters are easily modified so that each contains a different half-barcode sequence, yielding thousands of combinations. In a further embodiment, the barcode sequences are incorporated on the same adapter. This can be achieved by breaking the B adapter into two parts, each with a half-barcode sequence, separated by a common overhang sequence used for ligation. The two tag components each have 4-6 bases. An 8-base (2 x 4 bases) tag set is capable of uniquely tagging 65,000 aliquots. One extra base (2 x 5 bases) allows error detection, and 12-base tags (2 x 6 bases, 12 million unique barcode sequences) can be designed to allow substantial error detection and correction in 10,000 or more aliquots using a Reed-Solomon design. In exemplary embodiments, both 2 x 5 base and 2 x 6 base tags are employed, including the use of degenerate bases (i.e., "wild-cards"), to achieve optimal decoding efficiency.
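The tag capacities quoted above can be checked by counting sequences, as in the Python sketch below. The function names are hypothetical. The raw counts are upper bounds over the four-letter DNA alphabet; note that the raw count for 12 bases is about 16.8 million, whereas the text quotes 12 million, and the text does not state how that figure was derived.

```python
def raw_tag_space(total_bases: int) -> int:
    """Number of distinct sequences of the given length over A, C, G, T."""
    return 4 ** total_bases


def combinatorial_tags(half_barcodes_a: int, half_barcodes_b: int) -> int:
    """Number of full barcodes formed by pairing one half-barcode from each set."""
    return half_barcodes_a * half_barcodes_b


print(raw_tag_space(8))            # 2 x 4 bases: 65536
print(raw_tag_space(10))           # 2 x 5 bases: 1048576
print(raw_tag_space(12))           # 2 x 6 bases: 16777216
print(combinatorial_tags(40, 40))  # two sets of 40 half-barcode adapters: 1600
```

The combinatorial design is what keeps synthesis manageable: two sets of 40 adapters (80 oligonucleotide designs) cover 1,600 aliquots, whereas a one-adapter-per-well design would need 1,600 distinct adapters.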

A reduction of volumes down to picoliter levels (e.g., in 10,000 aliquots) can achieve an even greater reduction in reagent and computational costs. In some embodiments, this level of cost reduction and extensive aliquoting is accomplished by combining the LFR method with combinatorial tagging in emulsion or microfluidic-type devices. The ability to perform all enzymatic steps in the same reaction without DNA purification facilitates miniaturization and automation of this process and makes it adaptable to a wide variety of platforms and sample preparation methods.

In one embodiment, LFR methods are used in conjunction with an emulsion-type device. A first step in adapting LFR to an emulsion-type device is to prepare an emulsion reagent of combinatorial barcode-tagged adapters with a single unique barcode per droplet. Two sets of 100 half-barcodes are sufficient to uniquely identify 10,000 aliquots. However, increasing the number of half-barcode adapters to over 300 allows barcode droplets to be added at random for combination with the sample DNA, with a low likelihood that any two aliquots will contain the same combination of barcodes. Combinatorial barcode adapter droplets can be made and stored in a single tube as a reagent for thousands of LFR libraries.
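The benefit of a larger half-barcode set when droplets are combined at random can be estimated with the Python sketch below. It assumes, as a simplification not stated in the text, that each aliquot receives one barcode combination drawn uniformly and independently from all possible combinations; the function names are hypothetical.

```python
def pairwise_collision_probability(half_barcodes_per_set: int) -> float:
    """Chance that two specific aliquots receive the same barcode combination."""
    combinations = half_barcodes_per_set ** 2
    return 1.0 / combinations


def shared_barcode_fraction(num_aliquots: int, half_barcodes_per_set: int) -> float:
    """Expected fraction of aliquots whose combination is also held by another aliquot."""
    combinations = half_barcodes_per_set ** 2
    return 1.0 - (1.0 - 1.0 / combinations) ** (num_aliquots - 1)


for half_set_size in (100, 300, 1000):
    pair = pairwise_collision_probability(half_set_size)
    shared = shared_barcode_fraction(10_000, half_set_size)
    print(f"{half_set_size} half-barcodes per set: "
          f"pairwise {pair:.2e}, shared fraction {shared:.1%}")
```

Under this model, two sets of 100 half-barcodes suffice only when combinations are assigned deterministically; with random assignment, larger sets sharply reduce the fraction of aliquots that share a barcode combination.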

In one embodiment, the present invention is scaled up from 10,000 to 100,000 or more aliquot libraries. In a further embodiment, the LFR method is adapted for such a scale-up by increasing the number of initial half-barcode adapters. These combinatorial adapter droplets are then fused one-to-one with droplets containing ligation-ready DNA representing less than 1% of a haploid genome. Using a conservative estimate of 1 nl per droplet and 10,000 droplets, this represents a total volume of 10 μl for an entire LFR library.

Recent studies have also suggested an improvement in GC bias after amplification (e.g., by MDA) and a reduction in background amplification when the reaction volume is decreased to nanoliter size.

There are currently several types of microfluidic devices (e.g., devices sold by Advanced Liquid Logic, Morrisville, NC) or pico/nano-droplet systems (e.g., RainDance Technologies, Lexington, MA) that provide pico/nano-droplet generation, fusion (3000/second), and collection functions and that could be used in such embodiments of LFR. In other embodiments, approximately 10-20 nanoliter drops are deposited in plates or on glass slides in a 3072-6144 or higher format (still a cost-effective total MDA volume of 60 μl, without losing computational cost savings or the ability to sequence genomic DNA from a small number of cells), using improved nano-pipetting or acoustic droplet ejection technology (e.g., LabCyte Inc., Sunnyvale, CA) or using microfluidic devices capable of handling up to 9216 individual reaction wells (e.g., devices produced by Fluidigm, South San Francisco, CA). Increasing the number of aliquots results in a large reduction in the complexity of the genome within each well, which reduces the overall cost of computing and increases data quality. In addition, automation of this process increases throughput and lowers the cost of producing libraries.
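The total reaction volumes quoted in this section (10 μl for 10,000 droplets of 1 nl, and about 60 μl for 3072-6144 drops of 10-20 nl) follow from a unit conversion, sketched below in Python with a hypothetical helper name.

```python
def total_volume_ul(num_aliquots: int, volume_per_aliquot_nl: float) -> float:
    """Total volume in microliters for a number of aliquots of a given size in nanoliters."""
    return num_aliquots * volume_per_aliquot_nl / 1000.0


print(total_volume_ul(10_000, 1))  # emulsion droplets: 10.0 ul
print(total_volume_ul(3072, 20))   # 3072 drops of 20 nl: 61.44 ul
print(total_volume_ul(6144, 10))   # 6144 drops of 10 nl: 61.44 ul
```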

Amplification

According to one embodiment, the LFR process begins with a short treatment of genomic DNA with a 5' exonuclease to create 3' single-stranded overhangs that serve as MDA initiation sites. The use of the exonuclease eliminates the need for a heat or alkaline denaturation step prior to amplification without introducing bias into the population of fragments. Alkaline denaturation can be combined with the 5' exonuclease treatment, which results in a further reduction in bias. The DNA is then diluted to sub-genome concentrations and aliquoted. After aliquoting, the fragments in each well are amplified, e.g., using an MDA method. In certain embodiments, the MDA reaction is a modified phi29 polymerase-based amplification reaction, although another known amplification method can be used.

In some embodiments, the MDA reaction is designed to introduce uracils into the amplification products. In some embodiments, a standard MDA reaction utilizing random hexamers is used to amplify the fragments in each well. In many embodiments, rather than random hexamers, random 8-mer primers are used to reduce amplification bias in the population of fragments. In further embodiments, several different enzymes can also be added to the MDA reaction to reduce amplification bias. For example, low concentrations of non-processive 5' exonucleases and/or single-stranded binding proteins can be used to create binding sites for the 8-mers. Chemical agents such as betaine, DMSO, and trehalose can also be used to reduce bias through similar mechanisms.

Fragmentation

According to one embodiment, after amplification of the DNA in each well, the amplification products are subjected to a round of fragmentation. In some embodiments, the CoRE method described above is used to further fragment the fragments in each well following amplification. In order to use the CoRE method, the MDA reaction used to amplify the fragments in each well is designed to incorporate uracils into the MDA products. Fragmentation of the MDA products can also be achieved by sonication or enzymatic treatment.

If the CoRE method is used to fragment the MDA products, each well containing amplified DNA is treated with a mix of uracil DNA glycosylase (UDG), DNA glycosylase-lyase endonuclease VIII, and T4 polynucleotide kinase to excise the uracil bases and create single-base gaps with functional 5' phosphate and 3' hydroxyl groups. Nick translation using a polymerase such as Taq polymerase results in double-stranded blunt-end breaks, producing ligatable fragments in a size range that depends on the concentration of dUTP added to the MDA reaction. In some embodiments, the CoRE method used involves removing uracils by polymerization and strand displacement by phi29.

After fragmentation of the MDA products, the ends of the resulting fragments can be repaired. Such repair can be necessary because many fragmentation techniques generate ends with overhangs and ends with functional groups that are not usable in later ligation reactions, such as 3' and 5' hydroxyl groups and/or 3' and 5' phosphate groups. In many aspects of the present invention, it can be useful to have fragments that are repaired to have blunt ends, and in some cases it can be desirable to alter the chemistry of the ends so that the correct orientation of phosphate and hydroxyl groups is not present, thereby preventing "polymerization" of the target sequences. Control over the chemistry of the ends can be provided using methods known in the art. For example, in some cases, the use of a phosphatase eliminates all the phosphate groups, so that all ends contain hydroxyl groups. Each end can then be selectively altered to allow ligation between the desired components. One end of the fragments can then be "activated," in some embodiments by treatment with alkaline phosphatase.

After fragmentation and, optionally, end repair, the fragments are tagged with adapters.

Tagging

Generally, the tag adapter arm is designed in two segments. One segment is common to all wells and is blunt-end ligated directly to the fragments using methods described further herein. The second segment is unique to each well and contains a "barcode" sequence, so that when the contents of the wells are combined, the fragments from each well can be identified.

According to one embodiment, the "common" adapter is added as two adapter arms: one arm is blunt-end ligated to the 5' end of the fragment, and the other arm is blunt-end ligated to the 3' end of the fragment. The second segment of the tagging adapter is a "barcode" segment that is unique to each well. This barcode is generally a unique sequence of nucleotides, and each fragment in a particular well is given the same barcode. Thus, when the tagged fragments from all the wells are recombined for sequencing applications, fragments from the same well can be identified through identification of the barcode adapter. The barcode is ligated to the 5' end of the common adapter arm. The common adapter and the barcode adapter can be ligated to the fragment sequentially or simultaneously. The ends of the common adapter and the barcode adapter can be modified so that each adapter segment will ligate in the correct orientation and to the proper molecule. Such modifications prevent "polymerization" of the adapter segments or of the fragments by ensuring that the fragments cannot ligate to each other and that the adapter segments can ligate only in the illustrated orientation.

In further embodiments, a three-segment design is used for the adapters that tag the fragments in each well. This embodiment is similar to the barcode adapter design described above, except that the barcode adapter segment is split into two segments. This design allows for a wider range of possible barcodes, because combinatorial barcode adapter segments can be generated by ligating different barcode segments together to form the full barcode segment. This combinatorial design provides a larger repertoire of possible barcode adapters while reducing the number of full-size barcode adapters that need to be generated.

According to one embodiment, after the fragments in each well are tagged, all of the fragments are combined to form a single population. These fragments can then be used to generate the nucleic acid templates of the invention for sequencing. The nucleic acid templates generated from these tagged fragments are identifiable as originating from a particular well by means of the barcode tag adapter attached to each fragment. Similarly, upon sequencing of the tag, the genomic sequence to which it is attached is also identifiable as originating from that well.
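The way a sequenced tag assigns a pooled fragment back to its well of origin can be illustrated with the minimal Python sketch below. The read layout (barcode at the start of each read), the barcode sequences, and the well names are hypothetical choices made for illustration only and are not taken from the described adapter design.

```python
from collections import defaultdict


def demultiplex(reads, barcode_length, barcode_to_well):
    """Group pooled reads by well of origin using a leading barcode.

    Assumption: each read is a string that starts with the well barcode,
    followed by the genomic insert. Reads with unknown barcodes are skipped.
    """
    inserts_by_well = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        insert = read[barcode_length:]
        well = barcode_to_well.get(barcode)
        if well is not None:
            inserts_by_well[well].append(insert)
    return dict(inserts_by_well)


# Hypothetical 4-base barcodes for two wells
barcode_to_well = {"ACGT": "well_A1", "TTAG": "well_A2"}
pooled_reads = ["ACGTGGATCCA", "TTAGCCATGGA", "ACGTTTGACCA", "GGGGAAAAAAA"]

for well, inserts in demultiplex(pooled_reads, 4, barcode_to_well).items():
    print(well, inserts)
```

Reads sharing a barcode are grouped together, so the long fragment they derive from can be reconstructed per well; a real pipeline would also need to tolerate sequencing errors in the barcode, which this sketch does not attempt.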

In some embodiments, the LFR methods described herein do not include multiple levels or tiers of fragmentation/aliquoting, as described in U.S. Patent Application No. 11/451,692, filed June 13, 2006, which is incorporated herein by reference in its entirety for all purposes. That is, some embodiments use only a single round of aliquoting and also allow the aliquots to be re-pooled for a single array, rather than using a separate array for each aliquot.

LFR using one or a small number of cells as a source of complex nucleic acids

According to one embodiment, an LFR method is used to analyze the genome of a single cell or a small number of cells. The process for isolating DNA in this case is similar to the methods described above, but may be carried out in a smaller volume.

As discussed above, long fragments of genomic nucleic acid can be isolated from a cell by a number of different methods. In one embodiment, cells are lysed and the intact nuclei are pelleted with a gentle centrifugation step. The genomic DNA is then released by proteinase K and RNase digestion for several hours. In some embodiments, the material can be treated to lower the concentration of remaining cellular waste; such treatments are well known in the art and can include, without limitation, dialysis for a period of time (i.e., 2-16 hours) and/or dilution. Because such methods of isolating the nucleic acid do not involve many disruptive processes (such as ethanol precipitation, centrifugation, and vortexing), the genomic nucleic acid remains largely intact, yielding a majority of fragments longer than 150 kilobases. In some embodiments, the fragments are from about 100 to about 750 kilobases in length. In other embodiments, the fragments are from about 150 to about 600, about 200 to about 500, about 250 to about 400, or about 300 to about 350 kilobases in length.

Once the DNA is isolated, and before it is aliquoted into individual wells, the genomic DNA must be carefully fragmented to avoid loss of material, particularly loss of sequence from the ends of each fragment, since loss of such material can result in gaps in the final genome assembly. In one case, sequence loss is avoided by using an infrequently cutting nicking enzyme, which creates starting sites for a polymerase, such as phi29 polymerase, at distances of approximately 100 kb from each other. As the polymerase creates the new DNA strand, it displaces the old strand; the end result is overlapping sequence near the sites of polymerase initiation, so that very little sequence is lost.

In some embodiments, controlled use of a 5' exonuclease (before or during the MDA reaction) can promote multiple replications of the original DNA from a single cell and thus minimize the propagation of early errors through copying of copies.

In one aspect, methods of the present invention produce quality genomic data from single cells. Assuming no DNA loss, there is a benefit to starting with a small number of cells (10 or fewer) rather than with an equivalent amount of DNA from a large preparation. Starting with fewer than 10 cells and precisely aliquoting substantially all of the DNA ensures uniform coverage, in long fragments, of any given region of the genome. Starting with five or fewer cells allows four-fold or greater coverage of each 100 kb DNA fragment in each aliquot without increasing the total number of reads above 120 Gb (20-fold coverage of a 6 Gb diploid genome). However, a large number of aliquots (10,000 or more) and longer DNA fragments (>200 kb) are even more important when sequencing from a few cells, because for any given sequence there are only as many overlapping fragments as there are starting cells, and the occurrence of overlapping fragments from both parental chromosomes in one aliquot can be a devastating loss of information.
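The figures above are consistent with a simple calculation, sketched here in Python. The sketch assumes, as the text does, no DNA loss, and additionally assumes that reads are spread evenly over all input DNA; the function name is illustrative.

```python
def per_fragment_coverage(total_read_gb, n_cells, diploid_genome_gb=6.0):
    """Average read coverage of each input fragment.

    Assumes no DNA loss and reads spread evenly over all input DNA.
    """
    input_dna_gb = n_cells * diploid_genome_gb
    return total_read_gb / input_dna_gb

genome_fold = 120 / 6.0                           # 120 Gb over a 6 Gb diploid genome
five_cells = per_fragment_coverage(120, 5)        # 120 Gb over 30 Gb of input DNA
ten_cells = per_fragment_coverage(120, 10)        # 120 Gb over 60 Gb of input DNA
print(genome_fold, five_cells, ten_cells)
```

With the same 120 Gb read budget, halving the number of starting cells doubles the coverage available for each individual fragment.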

LFR is well suited to this problem because it produces excellent results starting with the genomic DNA equivalent of only about 10 cells, and even a single cell would provide enough DNA to perform LFR. In general, the first step in LFR is low-bias whole genome amplification, which can be of particular use in single-cell genomic analysis. Because of DNA strand breaks and DNA losses during handling, even single-molecule sequencing methods would likely require some level of DNA amplification from a single cell. The difficulty in sequencing single cells comes from attempting to amplify the entire genome. Studies performed on bacteria using MDA have suffered from loss of approximately half of the genome in the final assembled sequence and from a fairly large amount of variation in coverage across the regions that were sequenced. This can be explained in part by the initial genomic DNA having nicks and strand breaks that cannot be replicated at the ends and are thus lost during the MDA process. LFR provides a solution to this problem by creating long overlapping fragments of the genome prior to MDA. According to one embodiment of the invention, to achieve this, a gentle process is used to isolate genomic DNA from the cell. The largely intact genomic DNA is then lightly treated with a frequently cutting nicking enzyme, producing a semi-randomly nicked genome. The strand-displacing ability of phi29 is then used to polymerize from the nicks, creating very long (>200 kb) overlapping fragments. These fragments are then used as starting templates for LFR.

Methylation analysis using LFR

In a further aspect, methods and compositions of the present invention are used for genomic methylation analysis. Several methods are currently available for global genomic methylation analysis. One method involves bisulfite treatment of genomic DNA and sequencing of repetitive elements or of a fraction of the genome obtained by fragmentation with methylation-specific restriction enzymes. This technique yields information on total methylation but provides no locus-specific data. The next higher level of resolution uses DNA arrays and is limited by the number of features on the chip. Finally, the highest-resolution and most expensive approach requires bisulfite treatment followed by sequencing of the entire genome. Using LFR, it is possible to sequence all bases of the genome and assemble a complete diploid genome with digital information on the level of methylation at every cytosine position in the human genome (i.e., 5-base sequencing). Further, LFR allows blocks of methylated sequence of 100 kb or greater to be linked to sequence haplotypes, providing methylation haplotyping, information that cannot be obtained with any currently available method.

In one non-limiting exemplary embodiment, methylation status is obtained in a method in which genomic DNA is first aliquoted and denatured for MDA. Next, the DNA is treated with bisulfite (a step that requires denatured DNA). The remaining preparation follows the methods described, for example, in U.S. Application Serial Nos. 11/451,692, filed June 13, 2006, and 12/335,168, filed December 15, 2008, each of which is hereby incorporated by reference in its entirety for all purposes and in particular for all teachings related to nucleic acid analysis of mixtures of fragments according to long fragment read techniques.

In one aspect, MDA amplifies each strand of a specific fragment independently, so that for any given cytosine position 50% of the reads are unaffected by bisulfite (i.e., they come from the strand carrying the guanine opposite the cytosine, which bisulfite does not alter) and 50% provide methylation status. The reduced DNA complexity of each aliquot helps with accurate mapping and assembly of the less informative, mostly 3-base (A, T, G) reads.
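Why the converted strand yields mostly 3-base reads can be seen in a small in-silico sketch. It assumes idealized, complete conversion (every unmethylated cytosine becomes uracil and is read as T after amplification, while methylated cytosine is protected); the function name and the example sequence are illustrative only.

```python
def bisulfite_convert(strand, methylated_positions=frozenset()):
    """Idealized bisulfite conversion of one strand.

    Unmethylated C is read as T after conversion and amplification;
    C at a position listed in `methylated_positions` stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(strand)
    )

strand = "ACGTCCGA"
unmethylated = bisulfite_convert(strand)            # no C survives
one_methylated = bisulfite_convert(strand, {1})     # only the C at index 1 survives
print(unmethylated, one_methylated)
```

Comparing the converted read with the unconverted sequence from the complementary strand is what reveals which cytosines were methylated.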

Bisulfite treatment has been reported to fragment DNA. However, careful titration of the denaturation and bisulfite buffers can avoid excessive fragmentation of genomic DNA. A 50% conversion of cytosine to uracil can be tolerated in LFR, which allows the exposure of the DNA to bisulfite to be reduced so as to minimize fragmentation. In some embodiments, some degree of fragmentation after aliquoting is acceptable, since it does not affect haplotyping.

Using LFR to analyze cancer genomes

It has been suggested that more than 90% of cancers harbor significant losses or gains in regions of the human genome, termed aneuploidy, and some individual cancers have been observed to contain more than four copies of some chromosomes. This increased complexity in the copy number of chromosomes, and of regions within chromosomes, makes sequencing cancer genomes substantially more difficult. The ability of LFR techniques to sequence and assemble very long (>100 kb) fragments of the genome makes them well suited to the sequencing of complete cancer genomes.

Error reduction by sequencing a target nucleic acid in multiple aliquots

According to one embodiment, even if LFR-based phasing is not performed and a standard sequencing approach is used, a target nucleic acid is divided into multiple aliquots, each containing an amount of the target nucleic acid. In each aliquot, the target nucleic acid is fragmented (if fragmentation is needed), and the fragments are tagged with an aliquot-specific tag (or an aliquot-specific set of tags) before amplification. Alternatively, when working with a tissue sample, one or more cells can be distributed to each of a number of aliquots, followed by cell disruption, fragmentation, tagging of the fragments with aliquot-specific tags, and amplification. In either case, the DNA amplified from each aliquot can be sequenced separately or pooled and sequenced after pooling. An advantage of this approach is that errors introduced by amplification (or by other steps occurring in each aliquot) can be identified and corrected.

For example, a base call (e.g., identifying a particular base such as A, C, G, or T) at a particular position (e.g., with respect to a reference) of the sequence data can be accepted as true if the base call is present in sequence data from two or more aliquots (or another threshold number), or in a substantial majority of the expected aliquots (e.g., in at least 51, 70, or 80%), where the denominator can be restricted to the aliquots that have a base call at the particular position. A base call can include one allele of a changed heterozygosity or potential heterozygosity. A base call at a particular position can be accepted as false if it is present in only one aliquot (or another threshold number of aliquots), or in a substantial minority of aliquots (e.g., fewer than 10, 5, or 3 aliquots, or as measured by a relative number, such as 20 or 10%). The threshold values can be predetermined or determined dynamically from the sequencing data. If a base call at a particular position is present in neither a substantial minority nor a substantial majority of the expected aliquots (e.g., in 40-60%), it can be converted to, or accepted as, a "no call." In some embodiments and implementations, various parameters (e.g., in distributions, probabilities, and/or other functions or statistics) can be used to characterize what is considered a substantial minority or a substantial majority of aliquots. Examples of such parameters include, without limitation, one or more of the following: the number of base calls identifying a particular base; the coverage, or total number of called bases, at a particular position; the number and/or identities of the distinct aliquots that gave rise to sequence data including a particular base call; the total number of distinct aliquots that gave rise to sequence data including at least one base call at a particular position; the reference base at the particular position; and others.

In one embodiment, a combination of the above parameters for a particular base call can be input to a function to determine a score (e.g., a probability) for the particular base call. The scores can then be compared to one or more threshold values as part of determining whether a base call is accepted (e.g., above a threshold), in error (e.g., below a threshold), or a no call (e.g., if all of the scores for the base calls are below a threshold). The determination of a base call can depend on the scores of the other base calls.
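The acceptance logic described above can be sketched in Python as follows. This is one possible reading, not the implementation of the specification: the score is simply the fraction of aliquots supporting a base, the denominator is restricted to aliquots with at least one call at the position, and the cutoffs of 70% and 20% are taken from the example ranges in the text. The function and variable names are hypothetical.

```python
def classify_base_calls(calls_by_aliquot, majority=0.70, minority=0.20):
    """Classify each base seen at one position by its aliquot support.

    calls_by_aliquot maps an aliquot ID to the set of bases called at the
    position in that aliquot. Only aliquots with a call are counted.
    """
    covering = [bases for bases in calls_by_aliquot.values() if bases]
    if not covering:
        return {}
    verdicts = {}
    for base in sorted(set().union(*covering)):
        fraction = sum(base in bases for bases in covering) / len(covering)
        if fraction >= majority:
            verdicts[base] = "accepted"
        elif fraction <= minority:
            verdicts[base] = "rejected"
        else:
            verdicts[base] = "no call"
    return verdicts

# Ten hypothetical aliquots: A in all, G also in five, T also in one.
calls = {well: {"A"} for well in range(10)}
calls[3] = calls[3] | {"T"}
for well in range(5):
    calls[well] = calls[well] | {"G"}
verdicts = classify_base_calls(calls)
```

In this toy input, T appears in a single aliquot and is treated as an amplification or sequencing error, while G, seen in half of the aliquots, falls in the intermediate band and is left as a no call.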

As a basic example, if base call A is present in more than 35% (an example of a score) of the aliquots that contain a read for the position of interest, base call C is present in more than 35% of these aliquots, and the other base calls each have a score of less than 20%, then the position can be considered heterozygous, composed of A and C, possibly subject to other criteria (e.g., a minimum number of aliquots containing a read at the position of interest). Thus, each of the scores can be input into another function (e.g., a heuristic, which may use comparisons or fuzzy logic) to provide the final determination of the base call(s) for the position.
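The heterozygous example can be written as a short heuristic. The 35% and 20% cutoffs come from the example above; the minimum number of aliquots (here 5) is a hypothetical stand-in for the "other criteria", and the function name is illustrative.

```python
def heterozygous_call(scores, n_aliquots, high=0.35, low=0.20, min_aliquots=5):
    """Return the two alleles if the position looks heterozygous, else None.

    scores maps each base to the fraction of covering aliquots in which it
    was called. min_aliquots is a hypothetical extra criterion.
    """
    if n_aliquots < min_aliquots:
        return None
    strong = sorted(base for base, score in scores.items() if score > high)
    others = [score for base, score in scores.items() if base not in strong]
    if len(strong) == 2 and all(score < low for score in others):
        return tuple(strong)
    return None

het = heterozygous_call({"A": 0.48, "C": 0.45, "G": 0.05, "T": 0.02}, n_aliquots=20)
ambiguous = heterozygous_call({"A": 0.48, "C": 0.45, "G": 0.25}, n_aliquots=20)
too_few = heterozygous_call({"A": 0.48, "C": 0.45}, n_aliquots=3)
```

The fractions need not sum to one, because a single aliquot can contain fragments from both haplotypes.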

As another example, a specific number of aliquots containing a base call can be used as a threshold. For example, when analyzing a cancer sample, there may be low-prevalence somatic mutations. In such a case, the base call may appear in fewer than 10% of the aliquots covering the position, yet the base call may still be considered correct, possibly subject to other criteria. Thus, various embodiments can use absolute numbers, relative numbers, or both (e.g., as inputs to comparisons or fuzzy logic). Such numbers of aliquots can also be input into a function (as mentioned above), together with a threshold corresponding to each number, and the function can provide a score, which can in turn be compared to one or more thresholds to make a final determination of the base call at the particular position.

A further example of an error correction function concerns sequencing errors in raw reads that lead to a putative variant call inconsistent with other variant calls and their haplotypes. If 20 reads of variant A are found in 9 and 8 aliquots belonging to the respective haplotypes, and 7 reads of variant G are found in 6 wells (5 or 6 of which are shared with aliquots that have A reads), the logic can reject variant G as a sequencing error, because in a diploid genome only one variant can reside at a position in each haplotype. Variant A is supported by substantially more reads, and the G reads largely follow the aliquots of the A reads, indicating that they were most likely generated by misreading A as G. If the G reads are found almost exclusively in aliquots separate from those with A, this can indicate that the G reads are mismapped or that they come from contaminating DNA.
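The reasoning in this example can be sketched as a comparison of aliquot sets. The cutoffs for "largely shared" (80%) and "almost exclusively separate" (20%) are hypothetical, as are the function name and aliquot IDs; the counts mirror the example above, with 17 aliquots carrying A reads and 6 carrying G reads, 5 of them shared.

```python
def explain_minor_variant(major_aliquots, minor_aliquots, major_reads, minor_reads,
                          shared_cutoff=0.8, separate_cutoff=0.2):
    """Suggest why a weaker variant call conflicts with a stronger one.

    The cutoffs are illustrative assumptions, not values from the text.
    """
    if not minor_aliquots:
        return "no minor variant"
    shared = len(major_aliquots & minor_aliquots) / len(minor_aliquots)
    if minor_reads < major_reads and shared >= shared_cutoff:
        return "likely sequencing error"
    if shared <= separate_cutoff:
        return "possible mismapping or contamination"
    return "unresolved"

a_wells = set(range(17))                 # 9 + 8 aliquots with A reads
g_wells_shared = {0, 1, 2, 3, 4, 99}     # 5 of 6 wells also carry A reads
g_wells_apart = {100, 101, 102, 103, 104, 105}

same = explain_minor_variant(a_wells, g_wells_shared, major_reads=20, minor_reads=7)
apart = explain_minor_variant(a_wells, g_wells_apart, major_reads=20, minor_reads=7)
```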

Identifying expansions in regions with short tandem repeats

A short tandem repeat (STR) in DNA is a segment of DNA with a strongly periodic pattern. STRs occur when a pattern of two or more nucleotides is repeated and the repeated sequences are directly adjacent to each other; the repeats may be perfect or imperfect, i.e., there may be a few base pairs that do not match the periodic motif. The pattern generally ranges in length from 2 to 5 base pairs (bp). STRs are typically located in non-coding regions, e.g., in introns. A short tandem repeat polymorphism (STRP) occurs when homologous STR loci differ in the number of repeats between individuals. STR analysis is often used to determine genetic profiles for forensic purposes. STRs occurring in the exons of genes may represent hypermutable regions that are linked to human disease (Madsen et al., BMC Genomics 9:410, 2008).
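A perfect STR as defined above can be located with a regular expression, as in the following sketch. It handles only perfect repeats of a 2 to 5 bp motif; imperfect repeats would need a more tolerant method. The minimum copy number of 4 and the example sequence are arbitrary choices for illustration.

```python
import re

def find_perfect_strs(sequence, min_copies=4):
    """Return (start, motif, copies) for perfect repeats of a 2-5 bp motif.

    The shortest motif is preferred; imperfect repeats are not detected.
    """
    pattern = re.compile(r"([ACGT]{2,5}?)\1{" + str(min_copies - 1) + r",}")
    hits = []
    for match in pattern.finditer(sequence):
        motif = match.group(1)
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits

sequence = "TTA" + "CAG" * 5 + "GT"
strs = find_perfect_strs(sequence)
print(strs)
```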

In human genomes (and the genomes of other organisms), STRs include trinucleotide repeats, e.g., CTG or CAG repeats. Trinucleotide repeat expansion, also known as triplet repeat expansion, is caused by slippage during DNA replication and is associated with certain diseases categorized as trinucleotide repeat disorders, such as Huntington's disease. Generally, the larger the expansion, the more likely it is to cause disease or to increase the severity of disease. This property results in the "anticipation" seen in trinucleotide repeat disorders, that is, the tendency for age of onset to decrease and severity of symptoms to increase through successive generations of an affected family as these repeats expand. Identification of trinucleotide repeat expansions can be useful for accurately predicting age of onset and disease progression in trinucleotide repeat disorders.

Expansions of STRs such as trinucleotide repeats can be difficult to identify using next-generation sequencing methods. Such expansions may fail to map and may be missing or underrepresented in libraries. Using LFR, it is possible to see a significant drop in sequence coverage in an STR region. For example, a region with STRs will characteristically have a lower level of coverage than regions without such repeats, and if the region is expanded there will be a substantial further drop in coverage in that region, observable in a plot of coverage versus position in the genome.

Figure 14 shows an example of detection of a CTG repeat expansion in an affected embryo. LFR was used to determine the parental haplotypes of the embryo. In a plot of mean-normalized clone coverage versus position, the haplotype with the expanded CTG repeat has no, or very few, DNBs crossing the expansion region, leading to a drop in coverage in the region. The drop can also be detected in the combined sequence coverage of both haplotypes; however, a drop confined to one haplotype may be more difficult to identify there. For example, if the sequence coverage averages about 20, the region containing the expansion will show a significant drop, e.g., to 10 if the affected haplotype has zero coverage in the expansion region, which is a 50% drop. If the sequence coverage of the two haplotypes is compared instead, the coverage is 10 in the normal haplotype and 0 in the affected haplotype, a drop of 10 in absolute terms but a 100% drop in relative terms. Alternatively, relative amounts can be analyzed: for the combined sequence coverage the ratio is 2:1 (normal coverage versus coverage in the expansion region), whereas for the haplotypes it is 10:0 (haplotype 1 versus haplotype 2), which is infinity or zero depending on how the ratio is formed, and thus a much larger distinction.
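The arithmetic of this comparison is reproduced in the short sketch below, using the coverage values from the example (20 combined, 10 per haplotype); the function name is illustrative.

```python
def percent_drop(baseline, observed):
    """Relative drop in coverage, as a percentage of the baseline."""
    return 100.0 * (baseline - observed) / baseline

normal_hap, affected_hap = 10, 0          # per-haplotype coverage in the expansion region
combined_baseline = 20                    # typical combined coverage elsewhere
combined_in_region = normal_hap + affected_hap

combined_drop = percent_drop(combined_baseline, combined_in_region)
haplotype_drop = percent_drop(normal_hap, affected_hap)
print(combined_drop, haplotype_drop)
```

Phasing reads into haplotypes therefore turns a 50% dip in combined coverage into a complete loss of coverage on one haplotype, which is far easier to distinguish from ordinary coverage fluctuation.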

Diagnostic uses of sequence data

Sequence data generated using the methods of the present invention are useful for a wide variety of purposes. According to one embodiment, sequencing methods of the present invention are used to identify a sequence variation in a sequence of a complex nucleic acid (e.g., a whole genome sequence) that is informative regarding a characteristic or medical status of a patient, or of an embryo or fetus, such as the sex of an embryo or fetus, or the presence or prognosis of a disease having a genetic component (including, for example, cystic fibrosis, sickle cell anemia, Marfan syndrome, Huntington's disease, and hemochromatosis, or various cancers, such as breast cancer). According to another embodiment, the sequencing methods of the present invention are used to provide sequence information beginning with between one and 20 cells from a patient (including, but not limited to, a fetus or an embryo) and to assess a characteristic of the patient on the basis of the sequence.

Cancer diagnostics

Whole genome sequencing is a valuable tool in assessing the genetic basis of disease. A number of diseases (e.g., cystic fibrosis) are known to have a genetic basis.

One application of whole genome sequencing is to understand cancer. The most significant impact of next-generation sequencing on cancer genomics has been the ability to re-sequence, analyze, and compare the matched tumor and normal genomes of a single patient, as well as multiple patient samples of a given cancer type. Using whole genome sequencing, the entire spectrum of sequence variation can be considered, including germline susceptibility loci, somatic single nucleotide polymorphisms (SNPs), small insertion and deletion (indel) mutations, copy number variations (CNVs), and structural variants (SVs).

In general, the cancer genome consists of the patient's germline DNA, upon which somatic genomic alterations have been superimposed. Somatic mutations identified by sequencing can be classified as either "driver" or "passenger" mutations. So-called driver mutations are those that directly contribute to tumor progression by conferring a growth or survival advantage on the cell. Passenger mutations encompass neutral somatic mutations that have been acquired through errors in cell division, DNA replication, and repair; these mutations may be acquired while the cell is phenotypically normal or after neoplastic change is evident.

Historically, attempts have been made to elucidate the molecular mechanisms of cancer, and several "driver" mutations or biomarkers, such as HER2/neu, have been identified. On the basis of such genes, therapeutic regimens have been developed to specifically target tumors with known genetic alterations. The best-defined example of this approach is the targeting of HER2/neu in breast cancer cells by trastuzumab (Herceptin). Cancers, however, are not simple monogenic diseases; rather, they are characterized by combinations of genetic alterations that can differ among individuals. Consequently, these additional perturbations to the genome may render some drug regimens ineffective for certain individuals.

Cancer cells for whole genome sequencing may be obtained from biopsies of whole tumors (including microbiopsies of a small number of cells), from cancer cells isolated from the bloodstream or other body fluids of a patient, or from any other source known in the art.

Pre-implantation genetic diagnosis

One application of the methods of the present invention is pre-implantation genetic diagnosis. About 2 to 3% of babies are born with some type of major birth defect. The risk of some problems, caused by abnormal segregation of genetic material (chromosomes), increases with the mother's age. About 50% of problems of this type are due to Down syndrome, in which there is a third copy of chromosome 21 (trisomy 21). The other half result from other types of chromosomal anomalies, including trisomies, point mutations, structural variations, copy number variations, and so on. Many of these chromosomal problems result in a severely affected baby or in one that does not survive even to delivery.

In medicine and (clinical) genetics, pre-implantation genetic diagnosis (PGD or PIGD), also known as embryo screening, refers to procedures that are performed on embryos prior to implantation, and sometimes even on oocytes prior to fertilization. PGD can permit parents to avoid selective pregnancy termination. The term pre-implantation genetic screening (PGS) denotes procedures that do not look for a specific disease but use PGD techniques to identify embryos at risk because of, for example, a genetic condition that could lead to disease. Procedures performed on sex cells before fertilization may instead be referred to as methods of oocyte selection or sperm selection, although the methods and aims partly overlap with PGD.

Pre-implantation genetic profiling (PGP) is a method of assisted reproductive technology for selecting the embryos that appear to have the greatest chance of a successful pregnancy. When used for women of advanced maternal age and for patients with repeated in vitro fertilization (IVF) failure, PGP is mainly carried out as a screen for chromosomal abnormalities such as aneuploidy, reciprocal and Robertsonian translocations, and other abnormalities such as chromosomal inversions or deletions. In addition, PGP can examine genetic markers for characteristics, including various disease states. The principle behind the use of PGP is that, because numerical chromosomal abnormalities are known to explain most cases of pregnancy loss and a large proportion of human embryos are aneuploid, the selective replacement of euploid embryos should increase the chances of a successful IVF treatment. Whole genome sequencing provides an alternative to comprehensive chromosome analysis methods such as array comparative genomic hybridization (aCGH), quantitative PCR, and SNP microarrays. Whole genome sequencing can, for example, provide information regarding single base changes, insertions, deletions, structural variations, and copy number variations.

Because PGD can be performed on cells from different developmental stages, the biopsy procedures vary accordingly. The biopsy can be performed at all pre-implantation stages, including but not limited to unfertilized and fertilized oocytes (for polar bodies, PBs), day-three cleavage-stage embryos (for blastomeres), and blastocysts (for trophectoderm cells).

In view of the above detailed description of the invention, according to one aspect of the invention, methods are provided for sequencing a complex nucleic acid of an organism (e.g., a mammal such as a human, whether a single individual organism or a population comprising more than one individual), such methods comprising: (a) aliquoting a sample of the complex nucleic acid to produce a plurality of aliquots, each aliquot comprising an amount of the complex nucleic acid; (b) sequencing the amount of the complex nucleic acid from each aliquot to produce one or more reads from each aliquot; and (c) assembling the reads from each aliquot to produce an assembled sequence of the complex nucleic acid that comprises no more than 1, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.08, 0.06, 0.04 or fewer false single nucleotide variants per megabase at a call rate of 70, 75, 80, 85, 90 or 95% or greater. If the complex nucleic acid is a mammalian (e.g., human) genome, the assembled sequence optionally has a genome call rate of 70% or greater and an exome call rate of 70, 75, 80, 85, 90 or 95% or greater. According to one embodiment, the complex nucleic acid comprises at least 1 gigabase.

According to one embodiment of such methods, the complex nucleic acid is double-stranded, and the method comprises separating the single strands of the double-stranded complex nucleic acid before aliquoting.

According to another embodiment, such methods comprise fragmenting the amount of the complex nucleic acid in each aliquot to produce fragments of the complex nucleic acid. According to one embodiment, such methods further comprise tagging the fragments of the complex nucleic acid in each aliquot with an aliquot-specific tag (or an aliquot-specific set of tags), by which the aliquot from which a tagged fragment originated can be determined. In one embodiment, such tags are polynucleotides, including, for example, tags that comprise an error correction code, including but not limited to a Reed-Solomon error correction code.
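The tag-to-aliquot lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `group_reads_by_aliquot`, the `(tag, sequence)` read representation, and the `tag_to_aliquot` table are hypothetical, and the sketch uses exact tag matching rather than the Reed-Solomon error correction mentioned in the text.

```python
from collections import defaultdict


def group_reads_by_aliquot(tagged_reads, tag_to_aliquot):
    """Group read sequences by the aliquot their tag identifies.

    tagged_reads: iterable of (tag, sequence) pairs (hypothetical format).
    tag_to_aliquot: dict mapping each known tag to an aliquot identifier.
    Returns (groups, unassigned); reads with unknown tags are kept aside.
    Exact matching only: no error correction is attempted here.
    """
    groups = defaultdict(list)
    unassigned = []
    for tag, sequence in tagged_reads:
        aliquot = tag_to_aliquot.get(tag)
        if aliquot is None:
            unassigned.append((tag, sequence))
        else:
            groups[aliquot].append(sequence)
    return dict(groups), unassigned
```

A real pipeline would first attempt to correct tag sequencing errors before discarding a read as unassigned.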

According to another embodiment, such methods comprise pooling the aliquots before sequencing.

According to another embodiment of such methods, the sequence comprises a base call at a sequence position, and such methods comprise identifying the base call as true if it is present in reads from two or more aliquots, or in three or more reads that originate from two or more aliquots.
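The acceptance rule above lends itself to a short sketch. The Python below is a minimal illustration, assuming each observation is a `(position, base, aliquot_id)` triple taken from one read; the function name and parameters are hypothetical and not part of the described method.

```python
from collections import defaultdict


def accept_base_calls(observations, min_aliquots=2, min_reads=None):
    """Return the set of (position, base) calls accepted as true.

    observations: iterable of (position, base, aliquot_id), one per read.
    min_aliquots: the call must be seen in at least this many aliquots.
    min_reads: optional stricter rule; the call must also be supported
               by at least this many reads in total (e.g. 3).
    """
    read_counts = defaultdict(int)
    aliquots_seen = defaultdict(set)
    for position, base, aliquot_id in observations:
        key = (position, base)
        read_counts[key] += 1
        aliquots_seen[key].add(aliquot_id)

    accepted = set()
    for key, count in read_counts.items():
        if len(aliquots_seen[key]) < min_aliquots:
            continue  # seen in too few independent aliquots
        if min_reads is not None and count < min_reads:
            continue  # fails the stricter read-count rule
        accepted.add(key)
    return accepted
```

Calling the function with `min_reads=3` applies the stricter variant, in which three or more reads from two or more aliquots are required.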

According to another embodiment, such methods comprise identifying a plurality of sequence variants in the assembled sequence and phasing the sequence variants.

According to another embodiment of such methods, the sample of the complex nucleic acid comprises 1 to 20 cells of the organism, or genomic DNA isolated from the cells, which may be purified or unpurified. According to another embodiment, the sample comprises 1 pg to 100 ng of genomic DNA, for example 1 pg, 6 pg, 10 pg, 100 pg, 1 ng, 10 ng or 100 ng, or 1 pg to 1 ng, or 1 pg to 100 pg, or 6 pg to 100 pg. For reference, a single human cell contains approximately 6.6 pg of genomic DNA.
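As a quick check on these quantities, the conversion between cell count and DNA mass can be written out using the 6.6 pg per human cell figure given above. The helper names are hypothetical.

```python
PG_PER_HUMAN_CELL = 6.6  # approximate genomic DNA per human cell, from the text


def cells_to_pg(n_cells):
    """Approximate genomic DNA mass (pg) contained in n_cells human cells."""
    return n_cells * PG_PER_HUMAN_CELL


def pg_to_cell_equivalents(mass_pg):
    """Approximate number of human cell equivalents in mass_pg of DNA."""
    return mass_pg / PG_PER_HUMAN_CELL
```

On this basis, a sample of 1 to 20 cells corresponds to roughly 6.6 pg to 132 pg of genomic DNA.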

According to another embodiment, such methods comprise amplifying the amount of the complex nucleic acid in each aliquot.

According to another embodiment of such methods, the complex nucleic acid is selected from the group consisting of a genome, an exome, a transcriptome, a methylome, a mixture of genomes of different organisms, a mixture of genomes of different cell types of an organism, and subsets thereof.

According to another embodiment of such methods, the assembled sequence has a coverage of 80x, 70x, 60x, 50x, 40x, 30x, 20x, 10x, or 5x. Lower coverage can be used with longer reads.

According to another aspect of the invention, an assembled sequence of a mammalian complex nucleic acid is provided that comprises fewer than 1 false single nucleotide variant per megabase at a call rate of 70% or greater.

According to another aspect of the invention, methods are provided for sequencing a complex nucleic acid of an organism, the methods comprising: (a) providing a sample comprising 1 pg to 10 ng of the complex nucleic acid; (b) amplifying the complex nucleic acid to produce an amplified nucleic acid; and (c) sequencing the amplified nucleic acid to produce a sequence having a call rate of at least 70% of the complex nucleic acid. According to one such method, the complex nucleic acid is unpurified. According to another embodiment, such methods comprise amplifying the complex nucleic acid by multiple displacement amplification. According to another embodiment, such methods comprise amplifying the complex nucleic acid at least 10, 100, 1000, 10,000 or 100,000-fold or more.

According to another embodiment of such methods, the sample comprises 1 to 20 cells (or cell nuclei) comprising the complex nucleic acid. According to another embodiment, such methods comprise lysing the cells (or nuclei), the cells comprising the complex nucleic acid and cellular contaminants, and amplifying the complex nucleic acid in the presence of the cellular contaminants. According to another embodiment of such methods, the cells are circulating non-blood cells from the blood of a higher organism. According to another embodiment of such methods, the assembled sequence has a call rate of 70, 75, 80, 85, 90 or 95% or greater. According to another embodiment of such methods, the sequence comprises 2, 1, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.08, 0.06, 0.04 or fewer false single nucleotide variants per megabase.

According to another embodiment, such methods further comprise: aliquoting the sample to produce a plurality of aliquots, each aliquot comprising an amount of the complex nucleic acid; amplifying the amount of the complex nucleic acid in each aliquot to produce an amplified nucleic acid in each aliquot; sequencing the amplified nucleic acid from each aliquot to produce one or more reads from each aliquot; and assembling the reads to produce the sequence. According to another embodiment, such methods further comprise: fragmenting the amplified nucleic acid in each aliquot to produce fragments of the amplified nucleic acid in each aliquot; and tagging the fragments of the amplified nucleic acid in each aliquot with an aliquot-specific tag to produce tagged fragments in each aliquot. According to another embodiment of such methods, a base call at a sequence position is accepted as true if it is present in reads from two or more aliquots or, more stringently, if it occurs three or more times in reads from two or more aliquots.

According to another embodiment, such methods further comprise identifying a sequence variation in the sequence that provides information about a characteristic of the organism (e.g., a medical condition). According to another embodiment, the cells are circulating non-blood cells from the blood (or other sample) of a higher organism, including but not limited to fetal cells from the blood of a mother and cancer cells from the blood of a patient who has cancer. According to another embodiment of the invention, the complex nucleic acid is circulating nucleic acid (CNA). Thus, the characteristics of the organism to be assessed can include but are not limited to the presence of, and information about, a cancer (whether or not the organism is pregnant), and the sex of, or genetic information about, a fetus carried by a pregnant individual. For example, such methods are useful for identifying single base variations, insertions, deletions, copy number variations, structural variations or rearrangements, and so on, that are associated with disease likelihood, medical diagnosis or prognosis, and the like.

According to another embodiment of the invention, methods are provided for assessing the genetic status of an embryo (e.g., sex, parentage, the presence or absence of a genetic abnormality, or a genotype associated with a predisposition to disease, and so on), comprising: (a) providing about 1 to 20 cells of the embryo; (b) obtaining an assembled sequence produced by sequencing genomic DNA of the cells, wherein the assembled sequence has a call rate of at least 80%; and (c) comparing the assembled sequence to a reference sequence to assess the genetic status of the embryo. For example, such methods are useful for identifying single base variations, insertions, deletions, copy number variations, structural variations or rearrangements, and so on, that are associated with disease likelihood, medical diagnosis or prognosis, and the like.

According to another embodiment, methods are provided for assessing the genetic status of an embryo (e.g., sex, parentage, the presence or absence of a genetic abnormality, or a genotype associated with a predisposition to disease, and so on), comprising: (a) providing about 1 to 20 cells of the embryo; (b) obtaining an assembled sequence produced by sequencing genomic DNA of the cells, wherein the assembled sequence has a call rate of at least 80% of the genome of the embryo; and (c) comparing the assembled sequence to a reference sequence to assess the genetic status of the embryo.

According to another aspect of the invention, an assembled whole human genome sequence is provided that comprises no more than 1 false single nucleotide variant per megabase and has a call rate of at least 70%, wherein the sequence is produced by sequencing 1 pg to 10 ng of human genomic DNA.

According to another aspect of the invention, methods are provided for phasing sequence variants of a genome of an individual organism comprising a plurality of chromosomes, the methods comprising: (a) providing a sample comprising a mixture of vector-free fragments of each of the plurality of chromosomes; (b) sequencing the vector-free fragments to produce a genome sequence comprising a plurality of sequence variants; and (c) phasing the sequence variants. According to one embodiment, such methods comprise phasing at least 70, 75, 80, 85, 90 or 95% or more of the sequence variants. According to another embodiment of such methods, the genome sequence has a call rate of at least 70% of the genome. According to another embodiment of such methods, the sample comprises 1 pg to 10 ng of the genome, or 1 to 20 cells of the individual organism. According to another embodiment of such methods, the genome sequence has fewer than 1 false single nucleotide variant per megabase.

According to another aspect of the invention, methods are provided for phasing sequence variants of a genome of an individual organism comprising a plurality of chromosomes, the methods comprising: providing a sample comprising fragments of the plurality of chromosomes; sequencing the fragments, without cloning the fragments in a vector, to produce a whole genome sequence, wherein the whole genome sequence comprises a plurality of sequence variants; and phasing the sequence variants. According to one embodiment of such methods, phasing of the sequence variants occurs during assembly of the whole genome sequence.

Examples

Example 1: Comparison of DNA amplification methods

Preimplantation genetic diagnosis (PGD) is a form of prenatal diagnosis that consists of genetically screening embryos produced by in vitro fertilization (IVF) (typically 10 per cycle on average) before they are transferred to the prospective mother. It is generally used for women of advanced maternal age (over 34 years) or for couples at risk of transmitting a genetic disease. The techniques currently used for genetic screening are fluorescence in situ hybridization (FISH), comparative genomic hybridization (CGH), array CGH and SNP arrays for detecting chromosomal abnormalities, and SNP arrays and PCR for detecting gene defects. PGD for single-gene defects currently consists of custom-designed assays unique to each patient, which often combine detection of the specific mutation with linkage analysis as a backup and as a means to control for and monitor contamination. Typically, one cell is biopsied from each embryo on day 3 of development, and results are reported on day 5, which is the latest day on which the embryo can be transferred. Blastocyst biopsy is beginning to be applied; it consists of a biopsy of 3 to 15 cells from the trophectoderm of the blastocyst (day-5 embryo), followed by freezing of the embryo. Embryos can remain frozen indefinitely without significant loss of potential, which suits whole genome sequencing because the biopsy can be obtained at one site and then shipped to another site for sequencing. Whole genome sequencing of blastocyst biopsies would make possible a "universal" PGD test for single-gene defects and other genetic abnormalities that can be identified by this technology.

After conventional ovarian stimulation and egg retrieval, the eggs were fertilized by intracytoplasmic sperm injection (ICSI) to avoid sperm contamination in the PGD test. After growth to day 3, the embryos were biopsied using a fine glass needle, and one cell was removed from each embryo. Each blastomere was added individually to a clean tube, covered with molecular-grade oil, and shipped on ice to the PGD laboratory. Immediately upon arrival, the samples were processed using a test designed to amplify the CTG repeat expansion mutation in the DMPK gene and two linked markers.

After clinical PGD testing and embryo transfer, unused embryos were donated to the IVF clinic and used in developing new PGD test formats. Eight blastocysts were donated and used in these experiments.

A blastocyst biopsy provides approximately 6.6 picograms (pg) of genomic DNA per cell. Amplification provides enough DNA for whole genome sequencing. Figure 15 shows the results of amplifying 1.031 pg, 8.25 pg and 66 pg of purified genomic DNA standards and 1 or 10 PVP40 cells by MDA using our protocol (described below). The MDA reaction can be run for as long as necessary to obtain the amount of DNA required by a particular sequencing method (e.g., 30 to 120 minutes). Greater amplification is expected to introduce more GC bias.

Two DNA amplification methods were compared to identify the one that generates template DNA of sufficient quality for whole genome sequence analysis while minimizing the introduction of GC bias. We compared our protocol, a modified MDA, with SurePlex amplification (Rubicon Genomics Inc., Ann Arbor, Michigan), which is commonly used for array CGH.

Biopsies of 10 to 20 cells were obtained from embryos affected by the R-1MT mutation of myotonic dystrophy. The samples were lysed and the DNA was denatured in a single tube, then amplified either by MDA using our protocol or with the SurePlex kit according to the manufacturer's instructions. Approximately 2 μg of DNA was generated by each amplification method. Before whole genome sequence analysis, the amplified samples were screened with 96 independent qPCR markers spread across the genome in order to select the samples with the least bias. Figure 16 shows the results. Briefly, we determined the average cycle number across the entire plate and subtracted it from the cycle number of each individual marker to obtain a "delta cycle" value. The delta cycle values were plotted against the GC content of the 1000 base pairs surrounding each marker to indicate the relative GC bias of each sample. To gauge the overall "noise" of a sample, the absolute values of all delta cycles were summed to give a "delta sum" measure. In our experience, a low delta sum and a relatively flat plot of the data against GC content yield a well-represented whole genome sequence. The delta sum was 61 for our MDA method and 287 for the SurePlex-amplified DNA, indicating that our protocol produces much less GC bias than the SurePlex protocol.
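The delta cycle and delta sum measures described above are simple to compute. The following Python is a minimal sketch, assuming the qPCR cycle numbers for one sample are supplied as a list with one value per marker; the function names are hypothetical, and the least-squares slope is an added convenience for quantifying "flatness" against GC content, not a measure named in the text.

```python
def delta_cycles(cycle_numbers):
    """Subtract the plate-wide mean cycle number from each marker's value."""
    mean_cycle = sum(cycle_numbers) / len(cycle_numbers)
    return [c - mean_cycle for c in cycle_numbers]


def delta_sum(cycle_numbers):
    """Sum of absolute delta cycle values: an overall 'noise' measure."""
    return sum(abs(d) for d in delta_cycles(cycle_numbers))


def gc_slope(gc_fractions, cycle_numbers):
    """Least-squares slope of delta cycle versus GC content.

    Assumption for illustration only: a slope near zero corresponds to
    the 'relatively flat' plot described in the text.
    """
    deltas = delta_cycles(cycle_numbers)
    mean_gc = sum(gc_fractions) / len(gc_fractions)
    mean_delta = sum(deltas) / len(deltas)
    covariance = sum(
        (g - mean_gc) * (d - mean_delta) for g, d in zip(gc_fractions, deltas)
    )
    variance = sum((g - mean_gc) ** 2 for g in gc_fractions)
    return covariance / variance
```

With 96 markers per sample, a lower `delta_sum` indicates less marker-to-marker variation in amplification.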

Example 2: Whole genome sequencing of blastocyst biopsies for preimplantation genetic diagnosis (PGD)

A modified multiple displacement amplification (MDA) (Dean et al. (2002) Proc Natl Acad Sci U S A 99, 5261-5266) was used to generate sufficient template DNA (about 1 μg) for whole genome sequence analysis, as described herein. Briefly, 5 to 20 cells from each 5-day-old blastocyst were isolated, frozen, and shipped on dry ice from the laboratory in which they were isolated. The samples were thawed and lysed to release genomic DNA. Without purifying the genomic DNA away from cellular contaminants, the DNA was alkaline denatured by adding 1 μl of 400 mM KOH/10 mM EDTA. Whole genome amplification of the embryonic genomic DNA was performed using a phi29 polymerase-based MDA reaction to generate a sufficient amount of DNA (about 1 μg) for sequencing. One minute after alkaline denaturation, thiol-protected random 8-mers were added to the denatured DNA. After 2 minutes, the mixture was neutralized and a master mix was added containing final concentrations of 50 mM Tris-HCl (pH 7.5), 10 mM MgCl2, 10 mM (NH4)2SO4, 4 mM DTT, 250 μM dNTPs (USB, Cleveland, OH) and 12 units of phi29 polymerase (Enzymatics, Beverly, MA), to give a total reaction volume of 100 μl. The MDA reaction was incubated at 37°C for 45 minutes and inactivated at 65°C for 5 minutes. Approximately 2 μg of DNA was generated by the MDA reaction. This amplified DNA was then fragmented and used for library construction and sequencing as described above.

Myotonic dystrophy type 1 (DM1) is an autosomal dominant disease caused by a trinucleotide repeat expansion, namely cytosine-thymine-guanine (CTG)n, in the 3' untranslated region of the gene encoding myotonic dystrophy protein kinase (DMPK). We examined clonal coverage of the DMPK CTG repeat interval. The sequencing technology described herein yields 35 bp paired-end reads that typically span about 400 bp. In the unaffected individual and in one unknown sample, 400 bp was sufficient to span this CTG repeat region on both alleles, giving a copy number of about 2. In the affected individual and in another unknown sample, a copy number of about 1 was observed, suggesting that the repeat expansion is too large for the 400 bp paired ends to span; only the unaffected allele has coverage in this region.

Table 1 below provides summary information on mapping and assembly of the PGD embryo samples. All variant and mapping statistics are relative to build 37 of the National Center for Biotechnology Information (NCBI) human genome reference assembly. Samples 2A, 5B and 5C had poor amplification quality, resulting in a smaller fraction of the genome being called and a lower total number of SNPs identified. Samples 5B and 5C are different biopsies from the same embryo. Sample NA20502 was processed according to the standard procedure and was not amplified before library preparation.

Figure 17 shows genome coverage for two samples (7C and 10C). Coverage was plotted using a 10-megabase moving average of 100-kilobase coverage windows normalized to haploid genome coverage. The dashed lines at copy numbers 1 and 3 represent haploid and triploid copy number, respectively. Both embryos are male and have haploid copy number for the X and Y chromosomes. No other losses or gains of whole chromosomes or of large chromosomal segments are evident in these samples.
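The smoothing described above can be sketched as follows. This Python is an illustration only: it assumes per-window coverage has already been computed for consecutive 100 kb windows on one chromosome and that the haploid coverage value is supplied by the caller (how that value is estimated is not specified in the text). A span of 100 windows corresponds to the 10 Mb moving average; the function name is hypothetical.

```python
def normalized_moving_average(window_coverage, haploid_coverage, span=100):
    """Smooth copy-number estimates along a chromosome.

    window_coverage: mean read coverage of consecutive 100 kb windows.
    haploid_coverage: coverage expected for a single genome copy.
    span: windows per moving average (100 windows x 100 kb = 10 Mb).
    Returns one copy-number estimate per full span of windows.
    """
    normalized = [c / haploid_coverage for c in window_coverage]
    averages = []
    for start in range(len(normalized) - span + 1):
        window = normalized[start:start + span]
        averages.append(sum(window) / span)
    return averages
```

Values near 2 then indicate the expected diploid state, while sustained values near 1 or 3 suggest loss or gain of a chromosome or large segment.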

The worst-performing sample achieved 85% genome coverage, while the best sample covered 95% of the genome, a level similar to that of standard whole genome sequencing ("standard sequencing") performed by the methods described above using several micrograms of purified, unamplified human genomic DNA. In general, coverage is "noisy" compared with standard sequencing, but using a 10-megabase moving average allows accurate detection of whole genome and chromosome arm amplifications and deletions. We also demonstrated that many polymorphisms can be detected and that, in addition to the DMPK mutation, the risk of developing certain diseases could be used in selecting blastocysts for implantation.

In this example, the starting genomic DNA was amplified extensively (roughly 10-fold more than necessary) to ensure that a sufficient amount of genomic DNA was available for sequencing. Reducing the degree of amplification is expected to improve sequence coverage and sequencing quality. Amplification can also be reduced by allowing the biopsied tissue (or other starting material, such as a cancer biopsy or needle aspirate, or fetal or cancer cells isolated from the bloodstream) to grow in culture. This approach slightly increases the overall turnaround time of the method. However, culturing the small number of available cells results in high-fidelity "amplification" of the genomic DNA through the cellular process of chromosome replication.

Because the DMPK mutation is a trinucleotide repeat expansion, it is difficult to analyze with current sequencing methods that use mate-pair reads spanning about 400 bp. Longer mate-pair reads (e.g., 1 kilobase or longer) could be used to span these regions and therefore to sequence across them, allowing accurate determination of repeat size.

Example 3: Clinically accurate genome sequencing and haplotyping from 10 to 20 human cells

In this example, 65 to 130 pg (10 to 20 cells) of long human genomic DNA (50% of it 60 to 500 kb in length) was divided into 384 aliquots and then amplified, fragmented and tagged within each aliquot. After sequencing, the diploid (phased) genome was assembled without DNA cloning or isolation of metaphase chromosomes. Ten LFR libraries were used to produce about 3.3 terabases (Tb) of mapped reads from 7 unique genomes. Up to 97% of heterozygous single nucleotide variants (SNVs) were assembled into contigs, with 50% of covered bases (N50) in contigs longer than about 500 kb for samples of European ethnicity and about 1 Mb for an African sample. In extensive comparisons between replicate libraries, LFR haplotypes were found to be highly accurate, with 1 false positive SNV per 10 megabases (Mb). This 20- to 30-fold increase in accuracy relative to non-LFR genomes (Drmanac et al., Science 327:78, 2010; Roach et al., Am. J. Hum. Genet. 89:382-397, 2011) was achieved, despite starting with 100 picograms (pg) of DNA and amplifying it 10,000-fold in vitro, because most errors are inconsistent with the true haplotypes. We have thus demonstrated cost-effective and clinically accurate genome sequencing and haplotyping from 10 to 20 human cells.
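The N50 statistic quoted above can be computed directly from a list of contig lengths. The Python below is a minimal sketch using the common definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of all contig bases); the function name is hypothetical and the source does not state which exact convention was used.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Half of all bases lie in contigs at least this long.
    """
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")
```

For example, contigs of 400, 300, 200 and 100 kb total 1000 kb; the two longest already hold 700 kb, so the N50 is 300 kb.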

LFR is a cost-effective DNA pre-processing step that requires neither cloning nor isolation of whole metaphase chromosomes and that allows the separate parental chromosomes to be completely sequenced and assembled at a clinically relevant cost and scale. LFR can be used as a pre-processing step before any sequencing method, although we employed the short-read sequencing technology described in detail above.

LFR can produce long-range phased SNPs because it is conceptually similar to single-molecule sequencing of fragments 10 to 1000 kb in length. This is achieved by randomly separating corresponding parental DNA fragments into physically distinct pools, without any DNA cloning step, followed by fragmentation to produce shorter fragments, similar to the aliquoting of fosmid clones (Kitzman et al., Nat. Biotechnol. 29:59-63, 2011; Suk et al., Genome Res. 21:1672-1685, 2011). As the fraction of the genome in each pool decreases to less than a haploid genome, the statistical probability of having corresponding fragments from both parental chromosomes in the same pool drops markedly. Likewise, the more individual pools are interrogated, the greater the number of times a fragment from the maternal and paternal homologs will be analyzed in separate pools.

For example, a 384-well plate with 0.1 genome equivalents in each well yields a theoretical 19x coverage of both the maternal and paternal alleles of each fragment. Such high initial DNA redundancy of about 19x yields more complete genome coverage and higher variant calling and phasing accuracy than is achieved with strategies employing fosmid pools, which give coverage ranging from about 3x (Kitzman et al., Nat. Biotechnol. 29:59-63, 2011) to about 6x (Suk et al., Genome Res. 21:1672-1685, 2011).
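The 19x figure follows from simple arithmetic, sketched below in Python. Two assumptions are made here and are not stated in the text: "genome equivalent" is read as a haploid genome equivalent (which is what reproduces the quoted 19x per parental allele), and fragments are assumed to fall into wells independently, so that presence of a locus in a well follows a Poisson distribution. The function names are hypothetical.

```python
import math


def per_allele_coverage(n_wells, haploid_equivalents_per_well):
    """Expected number of copies of each parental allele across the plate."""
    total_haploid_equivalents = n_wells * haploid_equivalents_per_well
    return total_haploid_equivalents / 2  # split between the two parents


def prob_both_parents_in_well(haploid_equivalents_per_well):
    """Chance that one well holds a given locus from both parents.

    Poisson assumption: each parental copy of the locus is present in a
    well with probability 1 - exp(-mean), mean being half the well load.
    """
    mean_per_parent = haploid_equivalents_per_well / 2
    p_present = 1 - math.exp(-mean_per_parent)
    return p_present ** 2
```

Under these assumptions, 384 wells at 0.1 equivalents per well give 19.2 copies per parental allele, and only about 0.2% of wells would contain both parental versions of a given locus, which is why reads sharing a well tag can usually be attributed to a single haplotype.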

To prepare LFR libraries in a high-throughput manner, we developed an automated process that performs all LFR-specific steps in the same 384-well plate. The following is an overview of the process. First, a highly uniform amplification using a modified phi29-based multiple displacement amplification (MDA; Dean et al., Proc. Natl. Acad. Sci. U.S.A. 99:5261, 2002) is performed to replicate each fragment about 10,000-fold. Next, through a series of enzymatic steps within each well and without intervening purification steps, the DNA is fragmented and ligated to barcode adapters. Briefly, long DNA molecules are processed into blunt-ended 300 to 1,500 bp fragments by controlled random enzymatic fragmentation (CoRE). CoRE fragments DNA through removal, by uracil DNA glycosylase and endonuclease IV, of uridine bases that are incorporated at a predetermined frequency during MDA. Nick translation from the resulting single-base gaps with E. coli polymerase I resolves the fragments and produces blunt ends. Unique 10-base Reed-Solomon error-correcting barcode adapters (PCT/US2010/023083, published as WO 2010/091107, which is incorporated herein by reference), designed to reduce any bias caused by differences in the sequence and concentration of each barcode (Figure 18), are then ligated to the fragmented DNA in each well using a high-yield, low-chimera-formation protocol (Drmanac et al., Science 327:78, 2010). Finally, all 384 wells are combined, and an unsaturated polymerase chain reaction using primers common to the ligated adapters is employed to generate sufficient template for short-read sequencing platforms. More details of the LFR protocol we employed are provided below.

High molecular weight DNA was purified from cell lines GM12877, GM12878, GM12885, GM12886, GM12891, GM12892, GM19240, and GM20431 (Coriell Institute for Medical Research, Camden, NJ) using the RecoverEase DNA Isolation Kit (Agilent, La Jolla, CA) following the manufacturer's protocol. The high molecular weight DNA was partially sheared, to make it more amenable to manipulation, by pipetting 20-40 times with a Rainin P1000 pipette. 200 ng of genomic DNA was analyzed on a 1% agarose gel in 0.5X TBE buffer using a BioRad CHEF-DR II with the following parameters: 6 V/cm, a 50-90 second ramped switch time, and a 20-hour total run time. 500 ng of Yeast Chromosome PFG Marker (New England Biolabs, Ipswich, MA) and Lambda Ladder PFG Marker (New England Biolabs, Ipswich, MA) were used to determine the length of the purified genomic DNA.

Separately, the immortalized cell line GM19240 (Coriell Institute for Medical Research, Camden, NJ) was cultured in RPMI supplemented with 10% FBS under standard cell culture conditions. Single cells were isolated at 200x magnification using a micromanipulator (Eppendorf, Hamburg, Germany) and placed into a 1.5 ml microtube containing 10 μl of dH2O. The cells were denatured with 1 μl of 20 mM KOH and 0.5 mM EDTA. The denatured cells were then entered into the LFR process.

DNA from each of the cell lines was diluted and denatured at a concentration of 50 pg/μl in a solution of 20 mM KOH and 0.5 mM EDTA. After incubation at room temperature for 1 minute, 120 pg of denatured DNA was removed and added to 32 μl of 1 mM 3'-thiol-protected random octamers (IDT, Coralville, IA). After 2 minutes, the mixture was brought to a volume of 400 μl with dH2O, and 1 μl was dispensed into each well of a 384-well plate. 1 μl of a 2X phi29 polymerase-based (Enzymatics Inc., Beverly, MA) multiple displacement amplification (MDA) mix was added to each well to generate approximately 3-10 ng of DNA (10,000 to 25,000-fold amplification). The MDA reaction consisted of 50 mM Tris-HCl (pH 7.5), 10 mM MgCl2, 10 mM (NH4)2SO4, 4 mM DTT, 250 μM dNTPs (USB, Cleveland, OH), 10 μM 2'-deoxyuridine 5'-triphosphate (dUTP) (USB, Cleveland, OH), and 0.25 units of phi29 polymerase.
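As a rough check of the dilution described above, the following sketch computes the DNA mass dispensed into each well and the fold amplification implied by the lower end of the stated yield. The function names are illustrative, and the calculation assumes the 120 pg of DNA is evenly distributed in the 400 μl mixture.

```python
def per_well_input_pg(dna_pg: float, final_volume_ul: float, dispensed_ul: float = 1.0) -> float:
    """Mass of DNA (pg) in the volume dispensed to one well, assuming even mixing."""
    return dna_pg / final_volume_ul * dispensed_ul


def fold_amplification(yield_ng: float, input_pg: float) -> float:
    """Ratio of amplified product to input DNA (1 ng = 1,000 pg)."""
    return yield_ng * 1000.0 / input_pg


input_pg = per_well_input_pg(dna_pg=120.0, final_volume_ul=400.0)
fold_at_3ng = fold_amplification(yield_ng=3.0, input_pg=input_pg)
print(f"{input_pg:.2f} pg per well, about {fold_at_3ng:,.0f}-fold to reach 3 ng")
```

Under these assumptions each well receives about 0.3 pg of DNA, and a 3 ng yield corresponds to roughly 10,000-fold amplification, in line with the lower bound quoted in the text.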

Controlled random enzymatic fragmentation (CoRE) was then performed. Excess nucleotides were inactivated and uracil bases were removed by incubating the MDA reaction at 37°C for 120 minutes with a mixture of 0.031 units of shrimp alkaline phosphatase (SAP) (USB, Cleveland, OH), 0.039 units of uracil DNA glycosylase (New England Biolabs, Ipswich, MA), and 0.078 units of endonuclease IV (New England Biolabs, Ipswich, MA). SAP was heat-inactivated at 65°C for 15 minutes. A 60-minute room temperature nick translation with 0.1 units of E. coli DNA polymerase 1 (New England Biolabs, Ipswich, MA), in the same buffer with the addition of 0.1 nmol of dNTPs (USB, Cleveland, OH), resolved the gaps and fragmented the DNA into 300-1,300 base pair fragments. E. coli DNA polymerase 1 was heat-inactivated at 65°C for 10 minutes. Remaining 5' phosphates were removed by incubation with 0.031 units of SAP (USB, Cleveland, OH) for 60 minutes at 37°C. SAP was heat-inactivated at 65°C for 15 minutes.

Tagged adapter ligation and nick translation were then performed. 10-base DNA barcode adapters, unique for each well, were attached to the fragmented DNA using a two-part directional ligation approach. Approximately 0.03 pmol of fragmented MDA product was incubated for 4 hours at room temperature in a reaction with a total volume of 7 μl containing 50 mM Tris-HCl (pH 7.8), 2.5% PEG 8000, 10 mM MgCl2, 1 mM rATP, a 100-fold molar excess of 5'-phosphorylated (5'PO4) and 3'-dideoxy-terminated (3'dd) common Ad1 (Figure 18), and 75 units of T4 DNA ligase (Enzymatics, Beverly, MA). Ad1 contains a common overhang region for hybridization and ligation to the unique barcode adapters. After 4 hours, a 200-fold molar excess of unique 5'-phosphorylated tagged adapters was added to each well and incubated for 16 hours. The 384 wells were combined into a total volume of approximately 2.5 ml and purified by the addition of 2.5 ml of AMPure beads (Beckman-Coulter, Brea, CA). One round of PCR was performed to create molecules with a 5' adapter and tag on one side and a 3' blunt end on the other side. The 3' adapter was added in a ligation reaction similar to that used for the 5' adapter, as described above. To seal the nicks created by ligation, the DNA was incubated at 60°C for 5 minutes in a reaction containing 0.33 μM Ad1 PCR1 primer, 10 mM Tris-HCl (pH 8.3), 50 mM KCl, 1.5 mM MgCl2, 1 mM rATP, and 100 μM dNTPs, to exchange the 3'-dideoxy-terminated Ad1 oligomers for 3'-OH-terminated Ad1 PCR1 primers. The reaction was then cooled to 37°C and, after the addition of 90 units of Taq DNA polymerase (New England Biolabs, Ipswich, MA) and 21,600 units of T4 DNA ligase, incubated for an additional 30 minutes at 37°C, to create functional 5'-PO4 gDNA ends by Taq-catalyzed nick translation from the 3'-OH ends of the Ad1 PCR1 primers, and to seal the resulting repaired nicks by T4 DNA ligation. At this point, the material was entered into the standard DNA nanoarray sequencing process.

Starting from total RNA, RNA-Seq data were obtained using the Ovation RNA-Seq kit (NuGen, San Carlos, CA) and SPRIWork (Beckman-Coulter, Brea, CA) to prepare a sequencing library with an average insert size of 150-200 bp. 75 bp paired-end sequencing reactions were performed on a HiSeq 2000 (Illumina, San Diego, CA) at the Center for Personalized Genetic Medicine (Harvard Medical School, Boston, MA). Paired-end reads were aligned using bowtie v0.12.7 (Langmead et al., Genome Biol. 10:R25, 2009) with tophat v1.2.0 (Trapnell et al., Bioinformatics 25:1105-1111, 2009), and single nucleotide variants (SNVs) were called using GATK UnifiedGenotyper v1.1 (http://www.broadinstitute.org/gsa/wiki/index.php/GATK_release_1.1) with hg19 as the reference and dbSNP version 132 for annotating known SNPs. SNVs were also mapped to genes from RefSeq and to isoforms in the transcriptome as identified by cufflinks v1.0.3 (http://cufflinks.cbcb.umd.edu/tutorial.html).

To identify haplotypes of co-expressed alleles, the data were filtered for heterozygous SNVs that occurred both on the same LFR contig and in the same gene as at least one other heterozygous SNV. Where a transcript exhibits allele-specific expression (ASE), the heterozygous alleles expressed from one LFR-phased haplotype should all have higher, or all have lower, read counts than their counterparts on the other haplotype. Here, we identify the more highly expressed haplotype as the haplotype for which the majority of heterozygous alleles show higher expression than their counterparts. A heterozygous SNV is counted as "concordant" if its expression agrees with that of the haplotype containing it. In the case of a tie (where no haplotype has a majority), half of the heterozygous SNVs are counted as concordant. In addition, to be considered at all, a heterozygous SNV was required to have at least 20x RNA-Seq read coverage. Heterozygous SNVs were further filtered for noise from the GATK genotyper by using a binomial test to compare the observed ASE and coverage with the probability of their arising at random.
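The concordance counting described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: it assumes that "coverage" is the sum of the reads on the two alleles, that SNVs with equal counts on both alleles are skipped, and it omits the binomial filtering step. The function and constant names are hypothetical.

```python
from typing import List, Tuple

MIN_COVERAGE = 20  # minimum RNA-Seq reads per heterozygous SNV (from the text)


def concordant_count(snvs: List[Tuple[int, int]]) -> float:
    """Count concordant heterozygous SNVs for one gene on one LFR contig.

    Each tuple holds (reads on the haplotype-1 allele, reads on the
    haplotype-2 allele). Assumptions: coverage is the sum of both counts,
    and SNVs with equal counts are uninformative and skipped.
    """
    usable = [(a, b) for a, b in snvs if a + b >= MIN_COVERAGE and a != b]
    if len(usable) < 2:  # a SNV needs at least one other SNV in the same gene
        return 0.0
    hap1_higher = sum(1 for a, b in usable if a > b)
    hap2_higher = len(usable) - hap1_higher
    if hap1_higher == hap2_higher:  # tie: half are counted as concordant
        return len(usable) / 2
    return float(max(hap1_higher, hap2_higher))


majority = concordant_count([(30, 10), (25, 5), (8, 20)])
tie = concordant_count([(30, 10), (5, 25)])
too_shallow = concordant_count([(30, 10), (3, 2)])
```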

For error correction purposes, each DNB was tagged with a 10-base Reed-Solomon code capable of correcting 1 base error at an unknown position or 2 base errors at known positions (U.S. patent application Ser. No. 12/697,995, published as US2010/0199155, incorporated herein by reference). The 384 codes were selected from a comprehensive set of 4,096 Reed-Solomon codes with the above properties (U.S. patent application Ser. No. 12/697,995, incorporated herein by reference). Each code in this set has a minimum Hamming distance of 3 from any other code in the set. For this study, the error positions were assumed to be unknown.
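The practical consequence of a minimum Hamming distance of 3 is that any read with a single base error still lies closer to its true barcode than to any other. The sketch below illustrates this with brute-force nearest-code lookup; it is not a Reed-Solomon decoder, and the four toy barcodes are made-up sequences chosen only to satisfy the distance property, not codes from the referenced application.

```python
from itertools import combinations
from typing import List, Optional


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(codes: List[str]) -> int:
    """Smallest Hamming distance between any two codes in the set."""
    return min(hamming(a, b) for a, b in combinations(codes, 2))


def decode(read: str, codes: List[str]) -> Optional[str]:
    """Return the unique code within distance 1 of the read, else None.

    With a minimum distance of 3 between codes, at most one code can be
    within distance 1 of any read, so a single error is always correctable.
    """
    hits = [c for c in codes if hamming(read, c) <= 1]
    return hits[0] if len(hits) == 1 else None


# Toy 10-base barcodes (hypothetical, not real Reed-Solomon codewords).
TOY_CODES = ["AAAAAAAAAA", "CCCAAAAAAA", "AAACCCAAAA", "GGGGGGGGGG"]

one_error = decode("AAAAAAAAAT", TOY_CODES)
unassignable = decode("ACGTACGTAC", TOY_CODES)
```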

Results. To demonstrate the ability of LFR to determine an accurate diploid genome sequence, we generated three libraries from the Yoruban female HapMap sample NA19240. NA19240 has been extensively interrogated as part of a trio (NA19240 is the child of samples NA19238 and NA19239) in the HapMap Project (Consortium, Nature 437:1299-1320, 2005; Frazer et al., Nature 449:851-861, 2007), the 1000 Genomes Project (Nature 467:1061-1073, 2010), and our own efforts (www.completegenomics.com/sequence-data/download-data/). As a result, highly accurate haplotype information can be generated for approximately 1.7 million heterozygous SNPs based on the redundant sequence data from the parental samples NA19238 and NA19239. One NA19240 LFR library was generated starting from 10 cells (65 pg of DNA) of the corresponding immortalized B-cell line. Based on a total effective read coverage of 60x and the use of 384 unique fragment aliquots or pools, we estimated that the optimal number of starting cells would be 10 if the DNA were denatured before being dispensed into wells (equivalent to 20 cells of dsDNA; Table 1 below). Two replicate libraries were generated from an estimated 100-130 pg (15-20 cell equivalents) of denatured high molecular weight genomic DNA. It was determined that, when starting from denatured isolated DNA, the optimal amount for each library would be approximately 100 pg. This amount was chosen to achieve more uniform genome coverage by minimizing stochastic sampling of the sample.

All three libraries were analyzed using DNA nanoarray sequencing (Drmanac et al., Science 327:78-81, 2010). 35-base mate-paired reads were mapped to the reference genome using a custom alignment algorithm (Drmanac et al., Science 327:78-81, 2010; Carnevali et al., J. Computational Biol., 19, 2011), yielding on average more than 230 Gb of mapped data with an average genome coverage of more than 80x (Table 1 below). Analysis of the mapped LFR data showed two distinct characteristics attributable to MDA: a slight underrepresentation of GC-rich sequences (Figure 19) and an increase in chimeric sequences. In addition, the variability of coverage normalized across 100 kb windows was approximately 2-fold greater. Nevertheless, almost all genomic regions were covered with sufficient reads (5 or more), demonstrating that 10,000-fold MDA amplification by our optimized protocol can be used for comprehensive genome sequencing.

Barcodes were used to group mapped reads by their physical well location within each library, which revealed pulses of coverage, that is, sparse regions of coverage interspersed between long spans with almost no read coverage. On average, each well contained 10-20% of a haploid genome (300-600 Mb) in fragments ranging from 10 kb to more than 300 kb in length, with an N50 of approximately 60 kb (Figure 20). Initial fragment coverage was very uniform between chromosomes. As estimated from all detected fragments, the total amount of DNA actually used to make the two libraries from extracted DNA was approximately 62 pg and 84 pg (9.4 and 12.7 cell equivalents, Figure 20). This is less than the expected 100-130 pg, indicating some lost or undetected DNA or imprecision in DNA quantitation. Interestingly, the 10-cell library appeared to have been made from approximately 90 pg (13.6 cells) of DNA, most likely because some of the cells were in S phase when isolated (Figure 20).
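The cell-equivalent figures quoted above can be reproduced by dividing each DNA mass by the mass of one diploid genome. The value of 6.6 pg per cell used below is an assumption, inferred because it reproduces the stated conversions; the text itself does not give the conversion factor.

```python
PG_PER_DIPLOID_CELL = 6.6  # assumed DNA mass of one diploid human cell, in pg


def cell_equivalents(dna_pg: float) -> float:
    """Convert a DNA mass in picograms to diploid cell equivalents."""
    return dna_pg / PG_PER_DIPLOID_CELL


for dna_pg in (62, 84, 90):
    print(f"{dna_pg} pg is about {cell_equivalents(dna_pg):.1f} cell equivalents")
```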

Using a two-step custom haplotyping algorithm designed to interrogate low-coverage read data (less than 2x coverage) from approximately 40 individual wells, overlapping heterozygous SNPs from fragments of the same parental chromosome located in different wells were assembled into haplotype contigs (Figure 21). Unlike other experimental approaches (Kitzman et al., Nat. Biotechnol. 29:59-63, 2011; Suk et al., Genome Res. 21:1672-1685, 2011; Duitama et al., Nucl. Acids Res. 40:2041-2053, 2012), LFR does not define the haplotype of each initial fragment. Instead, LFR ensures complete representation of the genome by maximizing the input of DNA fragments for a given number of aliquots and a given read coverage.

In the first step, heterozygous SNPs from the unphased NA19240 genome assembly (www.completegenomics.com/sequence-data/download-data/) were combined with those from each LFR library to create a comprehensive set of SNPs for phasing. Next, a network was constructed for each chromosome, in which the nodes correspond to heterozygous SNP calls and the edges carry a connectivity score between each pair of SNPs. Along with the connectivity score, an orientation is obtained as part of the search for the best hypothesis for each pair of heterozygous SNPs. This highly redundant, sparsely connected network is then pruned using domain knowledge and subsequently optimized using Kruskal's minimum spanning tree (MST) algorithm. This produces long contigs; N50s of 950-1,200 kb were obtained from these libraries (Figure 20).
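The graph optimization step can be illustrated with a small sketch of Kruskal's algorithm on a SNP network. This is not the authors' implementation: it assumes each edge carries a score and a same-phase or opposite-phase orientation, it treats the negated score as the edge cost so that the spanning forest keeps the strongest links, and it omits the domain-knowledge pruning. All names are hypothetical.

```python
from typing import Dict, List, Tuple

# (snp_a, snp_b, connectivity score, True if the two calls are in the same phase)
Edge = Tuple[str, str, float, bool]


def phase_with_kruskal(snps: List[str], edges: List[Edge]) -> Dict[str, Tuple[str, int]]:
    """Return {snp: (contig id, haplotype 0 or 1)} from a spanning forest.

    Edges are visited from highest to lowest score, which is Kruskal's
    algorithm with cost = -score. A union-find structure with a parity bit
    tracks each SNP's phase relative to the root of its contig.
    """
    parent = {s: s for s in snps}
    parity = {s: 0 for s in snps}  # phase relative to the current parent

    def find(s: str) -> Tuple[str, int]:
        if parent[s] == s:
            return s, 0
        root, parent_phase = find(parent[s])
        parity[s] ^= parent_phase  # now relative to the root
        parent[s] = root           # path compression
        return root, parity[s]

    for a, b, _score, same_phase in sorted(edges, key=lambda e: -e[2]):
        root_a, phase_a = find(a)
        root_b, phase_b = find(b)
        if root_a == root_b:
            continue  # edge would close a cycle, so Kruskal skips it
        parent[root_b] = root_a
        parity[root_b] = phase_a ^ phase_b ^ (0 if same_phase else 1)

    return {s: find(s) for s in snps}


result = phase_with_kruskal(
    ["snp1", "snp2", "snp3", "snp4"],
    [("snp1", "snp2", 0.9, True),
     ("snp2", "snp3", 0.8, False),
     ("snp1", "snp3", 0.1, True)],  # weak, conflicting link that is ignored
)
```

In this toy input, the first three SNPs end up in one contig with the third on the opposite haplotype, and the unconnected fourth SNP forms its own contig.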

In total, approximately 2.4 million heterozygous SNPs were phased by LFR in each library (Figure 20). LFR phased approximately 90% of the heterozygous SNPs that these libraries were expected to phase. The 10-cell library phased more than 98% of the variants phased by the two libraries generated from isolated DNA, demonstrating the potential of LFR to work from a small number of isolated cells. Doubling the number of reads to approximately 160x coverage further increased the number of phased heterozygous SNPs to more than 2.58 million, raising the phasing rate to 96% (Figure 20). Combining replicates 1 and 2 (768 independent wells in total), each with 80x coverage, yielded more than 2.65 million phased heterozygous SNPs and a phasing rate of 97%. Using only the SNP loci called in the LFR library for phasing (omitting step 1 of the LFR algorithm) generally reduced the total number of phased SNPs by 5-15% (Figure 20).

Importantly, the number of SNPs phased by LFR alone, starting from the DNA of only 10-20 cells, is slightly higher than the number phased by current fosmid approaches (Kitzman et al., Nat. Biotechnol. 29:59-63, 2011; Suk et al., Genome Res. 21:1672-1685, 2011; Duitama et al., Nucl. Acids Res. 40:2041-2053, 2012). It is also substantially more than the 81% of heterozygous SNPs that can be phased using standard parental sequences (Roach et al., Am. J. Hum. Genet. 89:382-397, 2011), a limit that arises because parents share a large fraction of the variants found in their children. Adding parent-derived haplotype data to the 768-well library improved the phasing rate to 98%. Approximately 115,000 (approximately 4%) of the phased heterozygous SNPs came from the high-coverage LFR library and were not called in the standard library, indicating that MDA amplification and 160x coverage helped some regions obtain enough reads (5 or more) to be called correctly. The phasing rate of high-coverage LFR can be adjusted to balance haplotype completeness against phasing errors.

Haplotyping of a European pedigree. To further our understanding of the performance of LFR, we generated additional libraries from a pedigree of European ancestry. CEPH family 1463 was chosen because it contains individuals from three generations, allowing a comprehensive study of inheritance. This family has previously been studied as part of a public data release (www.completegenomics.com/sequence-data/download-data/). Libraries were generated from individuals of each generation. In total, more than 1.6 Tb of sequence data was generated for NA12877, NA12885, NA12886, NA12891, and NA12892. In general, phasing rates were very high across all samples, with approximately 92% of attempted SNPs phased into contigs (Figure 20). Combining two LFR libraries (Figure 20), or combining LFR with parent-based phasing, improved the overall rate of phased SNPs to 97%. The N50 contig length across all family members analyzed was 500-600 kb, which is shorter than that obtained for NA19240. An investigation of the distribution of SNPs across the genomes of several different ethnic groups explains this difference.
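For readers unfamiliar with the N50 statistic used to compare contig lengths here, the following sketch shows the usual definition: the length L such that contigs of length L or longer contain at least half of the total contig length. The input lengths in the example are invented for illustration.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or fragment) lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:  # reached at least half of the total length
            return length
    raise ValueError("no lengths given")


example_n50 = n50([100, 200, 300, 400, 500])  # hypothetical lengths in kb
```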

Origin and impact of regions of low heterozygosity in non-African populations. There are approximately twice as many regions of low heterozygosity (RLHs) of 30 kb-3 Mb in the European pedigree samples as in NA19240. An RLH is defined as a 30 kb genomic region with fewer than 1.4 heterozygous SNPs per 10 kb, approximately 7 times lower than the median density. This finding clarifies the previously reported relative excess of homozygotes in non-Africans (Gibson et al., Hum. Mol. Genet. 15:789-795, 2006; Lohmueller et al., Nature 451:994-997, 2008) and is further supported by analysis of 52 complete genomes (Nicholas Schork, personal communication). These regions are an obstacle to phasing, resulting in N50 contig lengths about half as long. More than 90% of contigs in the European genomes end in these RLHs, which vary between unrelated individuals.
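The RLH definition above lends itself to a simple window scan. The sketch below is one possible reading of that definition, not the procedure actually used: it assumes non-overlapping 30 kb windows tiled from the start of the chromosome, 0-based SNP positions, and merging of adjacent low-density windows into one region. The names and the example positions are hypothetical.

```python
from typing import List, Tuple

WINDOW_BP = 30_000
MAX_DENSITY = 1.4 / 10_000  # heterozygous SNPs per bp (from the text)


def find_rlh(het_snp_positions: List[int], chrom_length: int) -> List[Tuple[int, int]]:
    """Return merged (start, end) intervals of low-heterozygosity windows.

    Assumptions: windows are non-overlapping and tiled from position 0, and
    any partial window at the chromosome end is ignored.
    """
    n_windows = chrom_length // WINDOW_BP
    counts = [0] * n_windows
    for pos in het_snp_positions:
        window = pos // WINDOW_BP
        if window < n_windows:
            counts[window] += 1

    regions: List[Tuple[int, int]] = []
    for window, count in enumerate(counts):
        if count / WINDOW_BP < MAX_DENSITY:
            start, end = window * WINDOW_BP, (window + 1) * WINDOW_BP
            if regions and regions[-1][1] == start:
                regions[-1] = (regions[-1][0], end)  # extend the previous region
            else:
                regions.append((start, end))
    return regions


# Window 0: 10 SNPs, window 1: 2 SNPs, window 2: none, window 3: 5 SNPs.
positions = list(range(0, 30_000, 3_000)) + [31_000, 45_000] + list(range(90_000, 115_000, 5_000))
rlh_regions = find_rlh(positions, chrom_length=120_000)
```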

Approximately 3% of all heterozygous SNPs in non-African genomes (30-60% of all unphased heterozygous SNPs) fall within these RLHs, which cover a very large fraction (30-40%) of these genomes. In the Chinese and European genomes, the longer RLHs cluster around 45 heterozygous SNPs per Mb (the density across the genome outside RLHs is approximately 1,000 per Mb), indicating that they shared a common ancestor approximately 37,000-43,000 years ago (based on a mutation rate of 60-70 SNPs per 20-year generation; Roach et al., Science 328:636-639, 2010; Conrad et al., Nat. Genet. 43:712-714, 2011). This is probably the result of a strong bottleneck at or after the time humans left Africa, and falls within the previously determined range of 10,000-65,000 years ago (Li and Durbin, Nature 475:493-496, 2011). In addition, an excess of RLHs was observed on the X chromosome in the European and Indian women (NA12885, NA12892, and NA20847) compared with the African woman (NA19240), covering approximately 50% versus 17% of this chromosome, respectively (compared with 30% versus 14% for the entire genome in these same individuals). This indicates an even stronger out-of-Africa bottleneck for the X chromosome. A possible explanation is that substantially fewer women remained in Africa and had offspring with multiple men.
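The quoted 37,000-43,000 year range can be approximately reproduced as follows. This is a back-of-the-envelope sketch under stated assumptions: the 60-70 new mutations per generation are taken to apply to the whole genome, pairwise differences between two haplotypes are taken to accumulate at that per-generation rate over the haploid genome length, and an assessed genome size of about 2,870 Mb is assumed because it matches the quoted range. None of these values other than 45 SNPs per Mb, 60-70 SNPs per generation, and 20 years per generation come from the text.

```python
ASSESSED_GENOME_MB = 2870.0  # assumed genome size in Mb


def rlh_age_years(het_snps_per_mb: float,
                  mutations_per_generation: float,
                  generation_years: float = 20.0,
                  genome_mb: float = ASSESSED_GENOME_MB) -> float:
    """Estimate the time since two haplotypes shared a common ancestor."""
    genome_wide_differences = het_snps_per_mb * genome_mb
    generations = genome_wide_differences / mutations_per_generation
    return generations * generation_years


younger = rlh_age_years(45, mutations_per_generation=70)
older = rlh_age_years(45, mutations_per_generation=60)
print(f"about {younger:,.0f} to {older:,.0f} years")
```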

These observations suggest that genome-wide analysis of variation, including haplotyping, in thousands of diverse genomes will provide deep insight into human population genetics and into the impact of these extensive "inbred" regions, which typically each contain more than 100 homozygous variants, on human disease and other extreme phenotypes. They also show that approximately 2,000 RLHs longer than 100 kb are present in every non-African individual. Populations with a limited number of high-frequency haplotypes, which can result from recent bottlenecks or inbreeding (Gibson et al., Hum. Mol. Genet. 15:789-795, 2006), may also have long runs of identical heterozygous SNPs present in both parents, which limits the use of the parents for phasing or for assigning shorter LFR contigs. Thus, population history and some reproductive patterns can make phasing challenging, as demonstrated by the X chromosomes of non-African women. Despite these factors, LFR phasing performance was largely equivalent, phasing up to 97% of heterozygous SNPs in both European and African individuals, a result that should translate across all populations. In addition to combining LFR with standard genotyping of one parent as described below (a strategy that would be more limited in some families, as discussed above), using initial DNA fragments longer than 300 kb (for example, by entrapping cells or pre-purified DNA in a gel block (Cook, EMBO J. 3:1837-1842, 1984)) would span approximately 95% of all RLHs and would haplotype most de novo mutations occurring in these regions. This would not be feasible with current fosmid cloning strategies (Kitzman et al., Nat. Biotechnol. 29:59-63, 2011; Suk et al., Genome Res. 21:1672-1685, 2011), which are limited to 40 kb fragments.

LFR reproducibility and phasing error rate analysis. To understand the reproducibility of LFR, we compared the haplotype data between the two NA19240 replicate libraries. In general, the libraries were very concordant, with only 64 differences per library among the approximately 2.2 million heterozygous SNPs phased by both libraries (Figure 22). This represents a phasing error rate of 0.003%, or 1 error in 44 Mb. LFR was also highly accurate when compared with the conservative but accurate whole-chromosome phasing generated from the parental genomes NA19238 and NA19239, which had previously been sequenced by multiple methods. Of 1.57 million comparable individual loci, only about 60 instances were found in which LFR phased a variant inconsistently with the parental haplotyping (a false phasing rate of 0.002% if half of the discordances are due to sequencing errors in the parental genomes). The LFR data also contained approximately 135 contigs per library (2.2%) with one or more flipped haplotype blocks (Figure 22). Extending these analyses to the replicate libraries of the European sample NA12877 (Figure 22), and comparing them with a recent high-quality family-based analysis using four children of NA12877 and their mother NA12878 (Roach et al., Am. J. Hum. Genet. 89:382-397, 2011), yielded similar results, assuming that each method contributes half of the observed discordances. In both the NA19240 and NA12877 libraries, several contigs had numerous flipped segments. Most of these contigs tended to be located in regions of low heterozygosity (RLHs), in regions of low read coverage, or in repetitive regions observed in an unexpectedly large number of wells (for example, subtelomeric or centromeric regions).
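The error rates quoted above follow from simple ratios, sketched below. The counts (64 of about 2.2 million, and about 60 of 1.57 million with half attributed to LFR) come from the text; the genome size of about 2,870 Mb used for the "1 error in 44 Mb" figure is an assumption, as is every name in the code.

```python
ASSESSED_GENOME_MB = 2870.0  # assumed genome size in Mb


def error_rate_percent(errors: float, loci: float) -> float:
    """Errors as a percentage of the loci compared."""
    return errors / loci * 100.0


replicate_rate = error_rate_percent(64, 2.2e6)         # replicate vs. replicate
mb_per_error = ASSESSED_GENOME_MB / 64                  # spacing of discordances
parental_rate = error_rate_percent(60 * 0.5, 1.57e6)    # half attributed to LFR

print(f"{replicate_rate:.3f}% between replicates, one error per {mb_per_error:.0f} Mb")
print(f"{parental_rate:.3f}% against parental phasing")
```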

Assigning haplotype contigs to parental chromosomes. Most flipping errors can be corrected by forcing the LFR phasing algorithm to end contigs in these regions. Alternatively, these errors can be removed by the simple, low-cost addition of standard high-density array genotype data (approximately 1 million SNPs or more) from at least one parent to the LFR assembly. In addition, we found that parental genotypes can link 98% of the LFR-phased heterozygous SNPs across whole chromosomes. These data also allow haplotypes to be assigned to the maternal and paternal lineages, information that is useful for incorporating parental imprinting into genetic diagnoses. If parental data are unavailable, population genotype data can also be used to link LFR contigs across whole chromosomes, although this approach may increase phasing errors (Browning and Browning, Nat. Rev. Genet. 12:703-714, 2011). Even technically challenging approaches such as metaphase chromosome separation, which has demonstrated whole-chromosome haplotyping, cannot assign parental origin without some form of parental genotype data (Fan et al., Nat. Biotechnol. 29:51-57, 2011). This combination of two simple technologies, LFR and parental genotyping, provides accurate, complete, and annotated haplotypes at low cost.

Phasing de novo mutations. As a demonstration of the completeness and accuracy of our diploid genome sequencing, we assessed the phasing of 35 de novo mutations recently reported in the NA19240 genome (Conrad et al., Nat. Genet. 43:712-714, 2011). Thirty-four of these mutations were called in either the standard genome or one of the LFR libraries. Of those, 32 de novo mutations (16 from each parent) were phased in at least one of the two replicate LFR libraries. Not surprisingly, the two unphased variants reside in RLHs. Of these 32 variants, 21 had been phased by Conrad et al. (supra), and 18 of those were consistent with the LFR phasing results. The three discordances are probably due to errors in the previous study (Matthew Hurles, personal communication), which confirms LFR accuracy without affecting the substantive conclusions of that report.

Genome sequencing and haplotyping from 100 pg of DNA using only LFR libraries. The analyses described above incorporate heterozygous SNPs from both standard and LFR libraries. However, because starting with an amount of DNA equivalent to that found in 10-20 cells is expected to represent the genome completely, it is possible to use LFR libraries alone. We have demonstrated that MDA provides sufficiently uniform amplification and that, with high (80x) overall read coverage, an LFR library used alone allows detection of up to 93% of heterozygous SNPs without any modification to our standard library variant-calling algorithms. To demonstrate the potential of using only an LFR library, we phased NA19240 replicate 1 as well as an additional 250 Gb of reads from the same library (500 Gb in total). We observed a 15% and a 5% reduction, respectively, in the total number of SNPs phased (Figure 20). This result is not surprising, given that this library was made from 60 pg of DNA rather than the optimal amount of 200 pg (Table 1 below), and given the previously mentioned GC bias introduced during in vitro amplification by MDA. Another 285 Gb LFR library called and phased only 90% of all variants from the combined standard and LFR libraries (Figure 20). Despite the reduction in total SNPs phased, contig length was largely unaffected (N50 > 1 Mb).

Error reduction enabled by LFR for accurate genome sequencing from 10 cells. Substantial error rates (about 1 false SNV in 100-1,000 called kilobases) are a common attribute of all current massively parallel sequencing technologies. These rates are probably too high for diagnostic use, and they complicate many studies searching for new mutations. The vast majority of false-positive variations are no more likely to occur on the maternal chromosome than on the paternal chromosome. LFR can exploit this lack of consistent connectivity to the surrounding true variations to eliminate these errors from the final assembled haplotypes. Both the Yoruban trio and the European pedigree provide an excellent platform for demonstrating the error-reduction capability of LFR. We defined a set of heterozygous SNPs in NA19240 and NA12877 (greater than 85% of all heterozygous SNPs) at positions that were reported with high confidence in each of the individual's parents as matching the human reference genome on both alleles. Approximately 44,000 heterozygous SNPs in NA19240 and 30,000 in NA12877 met this criterion. By virtue of their absence from the parental genomes, these variations are de novo mutations, cell-line-specific somatic mutations, or false-positive variants. Approximately 1,000-1,500 of these variants were reproducibly phased in each of the two replicate libraries from samples NA19240 and NA12877 (Figure 23). These numbers are similar to those previously reported for de novo and cell-line-specific mutations in NA19240 (Conrad et al., Nat. Genet. 43:712-714, 2011). The remaining variants are likely to be initial false positives, of which only about 500 were phased per library. This represents a 60-fold reduction of the false-positive rate among the variations that are phased. Only about 2,400 of these false variants are present in the standard libraries, of which only about 260 were phased (less than 1 false-positive SNV in 20 Mb; 5,700 haploid Mb / 260 errors). Compared to genomes sequenced by standard methods, each LFR library exhibited a 15-fold increase in library-specific false-positive calls before phasing. Most of these false-positive SNVs were probably introduced by MDA; sampling of rare cell-line variants may account for a smaller percentage. Although the LFR libraries were generated from 100 pg of DNA and amplification via MDA introduced a large number of errors, applying the LFR phasing algorithm reduced the overall sequencing error rate to about 1 in 10^7 bases, that is, 99.99999% accuracy (approximately 600 false heterozygous SNVs per 6 Gb), which is about 10-fold lower than the error rate observed using the same ligation-based sequencing chemistry (Roach et al., Am. J. Human Genet. 89:382-397, 2011).
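The error figures quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in the text (about 600 false heterozygous SNVs in 6 Gb, and about 260 phased false variants over 5,700 haploid Mb); the function and variable names are illustrative.

```python
def per_base_error_rate(false_calls, bases_called):
    """False calls per called base."""
    return false_calls / bases_called


# About 600 false heterozygous SNVs in a 6 Gb diploid genome.
rate = per_base_error_rate(600, 6e9)
accuracy_percent = 100 * (1 - rate)

# About 260 phased false variants over 5,700 haploid Mb.
mb_per_false_call = 5700 / 260

print(f"error rate: {rate:.1e} per base")
print(f"accuracy: {accuracy_percent:.5f}%")
print(f"one false SNV per {mb_per_false_call:.1f} haploid Mb")
```

The first figure corresponds to the 99.99999% value in the text, and the second to the statement of less than 1 false-positive SNV in 20 Mb.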

Improving base calls with LFR information. In addition to phasing and eliminating false-positive heterozygous SNVs, LFR can "rescue" no-call positions or verify other calls (e.g., homozygous reference or homozygous variant) by assessing the well origin of the reads that support each base call. As a demonstration, we found positions in the genome of NA19240 replicate 1 that were not called but were adjacent to a neighboring phased heterozygous SNP. In these examples, the position could be re-called as a phased heterozygous SNP because of the presence of shared wells between the neighboring phased SNP and the no-call position (Figure 24). Although LFR may not be able to rescue all no-call positions, this simple demonstration highlights the usefulness of LFR in calling all genomic positions more accurately and thereby reducing no-calls.
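The shared-well idea can be sketched as follows. The sketch assumes that, for a no-call position, the wells contributing reads for each candidate base are known, and that the wells carrying each allele of the neighboring phased SNP are known as well. The function name, the input layout, and the `min_shared` threshold are hypothetical choices for illustration, not parameters from the disclosure.

```python
def recall_with_wells(nocall_wells, hap1_wells, hap2_wells, min_shared=2):
    """Try to re-call a no-call position as a phased heterozygous SNP.

    nocall_wells: dict base -> set of well IDs with reads supporting that base.
    hap1_wells, hap2_wells: sets of wells carrying each allele of the
    neighboring phased heterozygous SNP.
    Returns (base_on_hap1, base_on_hap2) or None if the evidence is ambiguous.
    """
    calls = {"hap1": [], "hap2": []}
    for base, wells in nocall_wells.items():
        shared1 = len(wells & hap1_wells)
        shared2 = len(wells & hap2_wells)
        if shared1 >= min_shared and shared1 > shared2:
            calls["hap1"].append(base)
        elif shared2 >= min_shared and shared2 > shared1:
            calls["hap2"].append(base)
    # Accept only when exactly one base is tied to each haplotype.
    if len(calls["hap1"]) == 1 and len(calls["hap2"]) == 1:
        return calls["hap1"][0], calls["hap2"][0]
    return None


position = {"A": {3, 17, 42}, "G": {8, 25, 60}}
print(recall_with_wells(position, {3, 17, 42, 77}, {8, 25, 60, 91}))  # ('A', 'G')
```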

Highly divergent haplotypes are present in African and non-African genomes. Haplotype analysis enabled by large-scale genotyping studies such as the HapMap project has been very important for understanding population genetics. However, resolving the complete haplotypes of individuals has been largely intractable or prohibitively expensive. Highly accurate haplotypes, with the clustered false heterozygotes that accumulate from mismapping of repeated regions filtered out (Li and Durbin, Nature 475:493-496, 2011; Roach et al., Science 328:636-639, 2010), will help in understanding many population phenomena found within individual genomes. As a demonstration, we scanned the LFR contigs of NA19240 for regions of high divergence between the maternal and paternal copies. We identified 7,000 10-kb regions containing more than 33 SNVs each, a 3-fold increase over the expected 10 SNVs. Assuming 0.1% standing variation and 0.15% base divergence per million years (based on 1% divergence between the human and chimpanzee genomes, which have evolved from a common ancestor for approximately 6 million years), our calculations suggest that approximately 50 Mb of these regions in this African genome (approximately 2.0% of the non-inbred genome) may have evolved separately for more than 1.5 million years. If the chimpanzee-human split occurred less than 5 million years ago, this estimate is closer to 1 Myr (Hobolth et al., Genome Res. 21:349-356, 2011). This genome-wide analysis is consistent with a recent study by Hammer et al. (Proc. Natl. Acad. Sci. U.S.A. 108:15123-15128, 2011) of several targeted genomic regions in African populations, which hypothesized possible interbreeding between different hominin species in Africa. Our analysis shows that 2.1% of the non-inbred genomes of European descent also have similarly divergent sequences, usually at different genomic locations. Most of these were probably introduced before humans left Africa.
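The separation-time estimate can be reproduced with a short calculation. The text gives the inputs (more than 33 SNVs per 10 kb, 0.1% standing variation, 0.15% divergence per million years) but not the formula. The sketch below assumes the excess divergence over standing variation is divided by the divergence rate. That reading is our assumption, although it does reproduce the stated figure of more than 1.5 million years.

```python
def separation_myr(snvs, window_bp, standing=0.001, rate_per_myr=0.0015):
    """Estimated time (in Myr) two haplotypes evolved apart in a window.

    Assumed formula: (observed divergence - standing variation) / rate.
    """
    divergence = snvs / window_bp
    return (divergence - standing) / rate_per_myr


estimate = separation_myr(33, 10_000)
print(round(estimate, 2))  # 1.53

# A faster rate, as implied by a more recent chimpanzee-human split,
# lowers the estimate (here 1% divergence over 5 Myr = 0.2% per Myr).
print(round(separation_myr(33, 10_000, rate_per_myr=0.002), 2))  # 1.15
```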

A single genome contains multiple genes with inactivating variations in both alleles. Highly accurate diploid genomes are a requirement for human genome sequencing to be valuable in a clinical setting. To demonstrate how LFR could be used in a diagnostic/prognostic setting, we analyzed the coding SNP data of NA19240 for nonsense and splice-site-disrupting variations. We further analyzed all missense variations using PolyPhen2 (Adzhubei et al., Nat. Methods 7:248-249, 2010) to select only those variations that encode detrimental changes. Both "possibly damaging" and "probably damaging" were considered detrimental to protein function, as were all nonsense mutations. A total of 3,485 variants matched these criteria. After phasing and removal of false positives, only 1,252 variants remained, a substantial reduction in potentially misleading information. We further reduced the list to examine only the 316 heterozygous variants for which at least two co-occur in the same gene. Using phased data, we were able to identify 189 variants present on the same allele in 79 genes. The remaining 127 SNPs were found to be spread across 47 genes that have at least one detrimental variation in each allele (Figure 25). Combining the two LFR libraries to haplotype NA19240 increased this number to 65 genes. Extending this analysis to the European pedigree showed a similar number of genes (32-49 with coding mutations in both alleles) potentially altered to the point of expressing little to no effective protein product (Figure 25). Extending the analysis to variants that disrupt transcription factor binding sites (TFBSs) introduced approximately 100 additional genes per individual. Many of these are likely to be partial-loss or no-loss-of-function changes. Because of the high accuracy of LFR, it is unlikely that these variants result from sequencing errors. Many of the detrimental mutations found may have been introduced during propagation of these cell lines. A few of these genes were found in unrelated individuals, suggesting that they may be improperly annotated or the result of systematic mapping or reference errors. The genome of NA19240 contains approximately 10 additional genes in the complete-loss-of-function category; this is most likely due to bias introduced by annotating an African genome with a European reference genome. Nevertheless, these numbers are in line with those found in several recent studies of phased individual genomes (Suk et al., Genome Res. 21:1672-1685, 2011; Lohmueller et al., Nature 451:994-997, 2008) and suggest that most generally healthy individuals probably have a small number of genes, not absolutely required for normal life, that encode ineffective protein products. We have demonstrated that LFR can place SNPs into haplotypes over large genomic distances, where the phase of those SNPs could cause a potential complete loss of function. Such information will be crucial for effective clinical interpretation of patient genomes and for carrier screening.
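The distinction drawn above, between detrimental variants that sit on the same allele and those that hit both alleles of a gene, is straightforward to express once variants are phased. The following is a minimal sketch; the input format (one gene/haplotype pair per phased detrimental variant) and the gene names are hypothetical.

```python
from collections import defaultdict


def genes_hit_on_both_alleles(variants):
    """Return genes with at least one detrimental variant on each haplotype.

    variants: iterable of (gene, haplotype) pairs, haplotype in {1, 2},
    one pair per phased detrimental heterozygous variant.
    """
    haps_by_gene = defaultdict(set)
    for gene, hap in variants:
        haps_by_gene[gene].add(hap)
    return sorted(g for g, haps in haps_by_gene.items() if haps == {1, 2})


phased = [
    ("GENE_A", 1), ("GENE_A", 1),   # two variants, same allele: one copy intact
    ("GENE_B", 1), ("GENE_B", 2),   # one variant on each allele
    ("GENE_C", 2),
]
print(genes_hit_on_both_alleles(phased))  # ['GENE_B']
```

Without phase information, GENE_A and GENE_B would look identical (two heterozygous detrimental variants each), which is why unphased data cannot separate the two cases.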

TFBS disruption linked to differences in allelic expression. Long haplotypes encompassing both cis-regulatory regions and coding sequences are critical for understanding and predicting the expression level of each allele of a gene. By analyzing 5.6 Gb of non-exhaustive expression data from RNA sequencing of lymphocytes from NA20431, we identified a small number of genes with significant differences in allelic expression. In each of these genes, the 5 kb of regulatory region upstream and the 1 kb downstream of the transcription start site were scanned for SNVs that significantly alter the binding sites of more than 300 different transcription factors (Sandelin et al., 32:D91-D94, 2004). In six examples (Figure 26), 1-3 bases were found to differ between the two alleles in each gene, significantly affecting one or more putative binding sites and potentially explaining the differential expression observed between the alleles. Although this is only one data set, and it is currently unclear how large an effect these changes have on transcription factor binding, these results demonstrate that large-scale studies of this type (Rozowsky et al., Mol. Syst. Biol. 7:522, 2011), which become feasible with LFR haplotyping, could elucidate the consequences of sequence changes in transcription factor binding sites.

Discussion. We have demonstrated the ability of LFR to accurately phase up to 97% of all detected heterozygous SNPs in a genome into long contiguous stretches of DNA (N50s of 400-1,500 kb in length). Even LFR libraries phased without candidate heterozygous SNPs from standard libraries, and thus using only 10-20 human cells, were able to phase 85-94% of the available SNPs, despite the limitations of the current implementation. In several instances, the LFR libraries used in this paper had less than optimal starting input DNA (e.g., NA20431). The improvements in phasing rate seen by combining two replicate libraries (samples NA19240 and NA12877) or by starting with more DNA (NA12892) are consistent with this conclusion. In addition, underrepresentation of GC-rich sequences resulted in less of the genome being called (90-93% versus more than 96% for standard libraries). Improvements to the MDA process (e.g., adding region-specific primers, or using less amplification by improving yields in other steps) or to the way we perform base and variant calling in LFR libraries (possibly by using the assignment of reads to wells) would help increase coverage in these regions. Moreover, as the cost of whole-genome sequencing continues to fall, higher-coverage libraries, which markedly improve call rates and phasing, will become more affordable.

A consensus haploid sequence is sufficient for many applications; however, it lacks two very important pieces of data for personalized genomics: phased heterozygous variants and the identification of false-positive and false-negative variant calls. One of the goals of personal genomics is to detect disease-causing variants and to determine with extreme confidence whether an individual carries such a variant or has one or two unaffected alleles. By independently providing sequence information from both maternal and paternal chromosomes, LFR can detect regions of the genome assembly where only one allele has been covered. Likewise, false-positive calls are avoided because LFR independently sequences both maternal and paternal chromosomes 10-20 times in different aliquots. The result is a statistically low probability that a random sequence error will recur in several aliquots at the same base position on one parental allele. Thus, LFR allows, for the first time, both accurate and cost-effective sequencing of genomes from a few (preferably 10-20) human cells, despite the use of in vitro DNA amplification and the resulting large number of unavoidable polymerase errors. In addition, by phasing SNPs over hundreds of kilobases to multiple megabases (or over entire chromosomes, by integrating LFR with routine genotyping of one or both parents), LFR can more accurately predict the effects of compound regulatory variants and parental imprinting on allele-specific gene expression and function in multiple tissue types. Taken together, this provides a highly accurate report of the potential genomic changes that could cause gain or loss of protein function. Such information, obtained inexpensively for every patient, will be crucial for the clinical use of genomic data. Moreover, successful and affordable diploid sequencing of a human genome starting from 10 cells opens the possibility of comprehensive and accurate genetic screening from a variety of tissue sources, such as circulating tumor cells or microbiopsies of pre-implantation embryos generated through in vitro fertilization.
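The claim that a random error is unlikely to recur in several aliquots can be made concrete with a binomial tail. The sketch below is a simplification under stated assumptions: aliquots are treated as independent, `p_error` is a hypothetical per-aliquot probability of an erroneous call at a given base (the value 1e-4 is purely illustrative and not from the text), and any two errors are counted as matching, which makes the result an upper bound under the independence assumption.

```python
from math import comb


def prob_error_recurs(n_aliquots, min_aliquots, p_error):
    """P(at least min_aliquots of n_aliquots show an error at one position)."""
    return sum(
        comb(n_aliquots, k) * p_error**k * (1 - p_error) ** (n_aliquots - k)
        for k in range(min_aliquots, n_aliquots + 1)
    )


# Ten aliquots cover the allele; an error is accepted only if seen in three.
p = prob_error_recurs(n_aliquots=10, min_aliquots=3, p_error=1e-4)
print(f"{p:.2e}")
```

With these illustrative numbers the leading term is comb(10, 3) * (1e-4)^3, on the order of 1e-10, which shows how quickly the requirement for agreement across aliquots suppresses random errors.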

While the present invention is satisfied by embodiments in many different forms, as described in detail in connection with preferred embodiments of the invention, it is understood that the present disclosure is to be considered exemplary of the principles of the invention and is not intended to limit the invention to the specific embodiments illustrated and described herein. Numerous variations may be made by persons skilled in the art without departing from the spirit of the invention. The scope of the invention will be measured by the appended claims and their equivalents. The abstract and the title are not to be construed as limiting the scope of the present invention, as their purpose is to enable the appropriate authorities, as well as the general public, to quickly determine the general nature of the invention. In the claims that follow, unless the term "means" is used, none of the features or elements recited therein should be construed as means-plus-function limitations pursuant to 35 U.S.C. §112, ¶6.

The present invention provides the following items:

1. A method for determining the sequence of a complex nucleic acid of one or more organisms, the method comprising:

(a) receiving, on one or more computing devices, a plurality of reads of the complex nucleic acid; and

(b) producing, with the one or more computing devices, an assembled sequence of the complex nucleic acid from the reads, the assembled sequence comprising less than one false single nucleotide variant per megabase at a call rate of 70% or greater.

2. The method of item 1, further comprising identifying a plurality of sequence variants in the assembled sequence and phasing the plurality of sequence variants to produce a phased sequence.

3. The method of item 2, comprising phasing at least three of the sequence variants and identifying as an error a sequence variant that is inconsistent with the phasing of at least two of the sequence variants.

4. The method of item 2, wherein the assembled sequence is a whole-genome sequence, the method comprising phasing at least 70% of the sequence variants.

5. The method of item 2, wherein the assembled sequence is a whole-genome sequence, the method comprising phasing at least 80% of the sequence variants.

6. The method of item 2, wherein the assembled sequence is a whole-genome sequence, the method comprising phasing at least 85% of the sequence variants.

7. The method of item 2, wherein the assembled sequence is a whole-genome sequence, the method comprising phasing at least 90% of the sequence variants.

8. The method of item 2, wherein the assembled sequence is a whole-genome sequence, the method comprising phasing at least 95% of the sequence variants.

9. The method of item 1, wherein receiving the plurality of reads of the complex nucleic acid comprises receiving a plurality of reads for each of a plurality of aliquots, each aliquot comprising one or more fragments of the complex nucleic acid.

10. The method of item 9, comprising calling a base at a position of the assembled sequence on the basis of preliminary base calls for that position from two or more aliquots.

11. The method of item 9, comprising identifying as true a base call that occurs three or more times in reads from two or more aliquots.
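The rule in item 11 (a base call is treated as true when it occurs three or more times in reads from two or more aliquots) can be written as a small filter. This is an illustrative sketch only; the input representation, a list of (aliquot ID, base) observations for a single position, is an assumption.

```python
from collections import defaultdict


def true_base_calls(observations, min_count=3, min_aliquots=2):
    """Return the set of bases accepted as true at one position.

    observations: iterable of (aliquot_id, base), one entry per read
    covering the position.
    """
    count = defaultdict(int)
    aliquots = defaultdict(set)
    for aliquot_id, base in observations:
        count[base] += 1
        aliquots[base].add(aliquot_id)
    return {
        base for base in count
        if count[base] >= min_count and len(aliquots[base]) >= min_aliquots
    }


reads_at_position = [
    (1, "A"), (1, "A"), (7, "A"),   # 3 reads, 2 aliquots: accepted
    (3, "G"), (3, "G"), (3, "G"),   # 3 reads, 1 aliquot: rejected
    (9, "T"),                       # 1 read: rejected
]
print(true_base_calls(reads_at_position))  # {'A'}
```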

12. The method of item 9, wherein an aliquot-specific tag is attached to each of the fragments, the method further comprising determining which aliquot gave rise to a read by identifying the aliquot-specific tag.

13. The method of item 12, wherein the aliquot-specific tags comprise an error-correction code, and each read comprises tag sequence data and fragment sequence data, wherein the tag sequence data is either correct tag sequence data or incorrect tag sequence data comprising one or more errors; the method further comprising:

(c) correcting the incorrect tag sequence data using the error-correction code, thereby producing corrected tag sequence data and uncorrectable tag sequence data;

(d) using reads comprising the correct tag sequence data and the corrected tag sequence data in a first computer method that requires tag sequence data, and producing a first output; and

(e) using reads comprising the uncorrectable tag sequence data in a second computer method that does not require tag sequence data, and producing a second output.

14. The method of item 13, wherein the first computer method is selected from: sample multiplexing, library multiplexing, phasing, and error-correction methods that employ tag sequence data.

15. The method of item 13, wherein the second computer method comprises mapping, assembly, and set-based statistics.

16. The method of item 13, wherein the error-correction code is a Reed-Solomon code.
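The routing described in items 13 to 15 can be sketched as below. To keep the example short, tag correction is done by nearest-codeword matching under Hamming distance, which is a deliberately simpler stand-in for the Reed-Solomon decoding named in item 16. The codebook, tag length, and function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_tag(tag, codebook, max_errors=1):
    """Return the matching codeword, or None if the tag is uncorrectable."""
    if tag in codebook:
        return tag
    near = [c for c in codebook
            if len(c) == len(tag) and hamming(c, tag) <= max_errors]
    return near[0] if len(near) == 1 else None


def route_reads(reads, codebook):
    """Split reads into those usable by tag-dependent and tag-free methods.

    reads: list of (tag_sequence, fragment_sequence).
    Returns (tagged, untagged): tagged reads carry a correct or corrected
    tag; untagged reads keep only the fragment sequence.
    """
    tagged, untagged = [], []
    for tag, fragment in reads:
        fixed = correct_tag(tag, codebook)
        if fixed is None:
            untagged.append(fragment)         # e.g. mapping, assembly
        else:
            tagged.append((fixed, fragment))  # e.g. phasing
    return tagged, untagged


codebook = {"AAAAA", "CCCGG", "GGTTT"}
reads = [("AAAAA", "ACGT"), ("CCCGA", "TTGA"), ("ACGTA", "GGCC")]
tagged, untagged = route_reads(reads, codebook)
print(tagged)    # [('AAAAA', 'ACGT'), ('CCCGG', 'TTGA')]
print(untagged)  # ['GGCC']
```

The point of the split is that reads with uncorrectable tags are not discarded: they still contribute to steps that do not need to know the aliquot of origin.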

17. The method of item 1, further comprising:

(c) providing a first phased sequence of a region of the complex nucleic acid, the region comprising a short tandem repeat;

(d) comparing reads of the first phased sequence of the region with reads of a second phased sequence of the region; and

(e) identifying, on the basis of the comparison, an expansion of the short tandem repeat in one of the first phased sequence and the second phased sequence.
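A comparison of this kind can be illustrated on two already-assembled haplotype sequences. This is a simplification: item 17 compares reads of the two phased sequences, whereas the sketch below assumes a consensus sequence per haplotype is available for the region. The repeat unit, the example sequences, and the `min_extra_units` threshold are hypothetical.

```python
import re


def longest_repeat_run(seq, unit):
    """Largest number of consecutive copies of unit found in seq."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)


def expanded_haplotype(hap1_seq, hap2_seq, unit, min_extra_units=1):
    """Report which haplotype carries the longer repeat tract, if any."""
    n1 = longest_repeat_run(hap1_seq, unit)
    n2 = longest_repeat_run(hap2_seq, unit)
    if abs(n1 - n2) < min_extra_units:
        return None
    return ("hap1" if n1 > n2 else "hap2", n1, n2)


hap1 = "TTG" + "CAG" * 5 + "CCA"
hap2 = "TTG" + "CAG" * 12 + "CCA"
print(expanded_haplotype(hap1, hap2, "CAG"))  # ('hap2', 5, 12)
```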

18. The method of item 1, further comprising obtaining genotype data from at least one parent of the organism, and producing the assembled sequence of the complex nucleic acid from the reads and the genotype data of the at least one parent.

19. The method of item 1, further comprising adding population genotype data, and producing the assembled sequence of the complex nucleic acid from the reads and the population genotype data.

20. The method of item 1, further comprising:

(c) aligning a plurality of reads of a first region of the complex nucleic acid, thereby creating an overlap between the aligned reads;

(d) identifying N heterozygous candidates within the overlap, where N is an integer greater than 2;

(e) clustering the space of 2^N to 4^N possibilities for the N heterozygous candidates, or a selected subspace of that space, thereby creating a plurality of clusters;

(f) identifying the two clusters with the highest density, each identified cluster comprising a substantially noise-free center; and

(g) repeating steps (a)-(d) for one or more additional regions of the complex nucleic acid.
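The clustering step of item 20 can be illustrated with a small greedy sketch. It assumes each fragment in the region has been reduced to a string of the bases seen at the N heterozygous candidates, so every observation is a point in the space of up to 4^N strings. The greedy choice of centers and the Hamming radius are illustrative stand-ins; the disclosure does not specify this particular clustering procedure.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def two_densest_clusters(observed, radius=1):
    """Greedily pick up to two cluster centers and their cluster sizes.

    observed: list of equal-length strings, one per fragment, giving the
    base seen at each of the N heterozygous candidates.
    Returns a list of (center, cluster_size) pairs.
    """
    remaining = dict(Counter(observed))
    clusters = []
    for _ in range(2):
        if not remaining:
            break
        # The center is the most frequent exact string still unassigned.
        center = max(remaining, key=lambda s: (remaining[s], s))
        members = [s for s in remaining if hamming(s, center) <= radius]
        clusters.append((center, sum(remaining[s] for s in members)))
        for s in members:
            del remaining[s]
    return clusters


fragments = ["ACG"] * 6 + ["GTA"] * 5 + ["ACA", "GTG", "ATG"]
print(two_densest_clusters(fragments))  # [('ACG', 8), ('GTA', 6)]
```

In this toy input the two centers, ACG and GTA, play the role of the two haplotypes, and the singleton strings within one mismatch of a center are absorbed as noise.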

21. The method of item 1, wherein the assembled sequence comprises fewer than 0.8 false single nucleotide variants per megabase.

22. The method of item 1, wherein the assembled sequence comprises fewer than 0.6 false single nucleotide variants per megabase.

23. The method of item 1, wherein the assembled sequence comprises fewer than 0.4 false single nucleotide variants per megabase.

24. The method of item 1, wherein the assembled sequence comprises fewer than 0.2 false single nucleotide variants per megabase.

25. The method of item 1, wherein the assembled sequence comprises fewer than 0.1 false single nucleotide variants per megabase.

26. The method of item 1, wherein the assembled sequence has a call rate of at least 80% of the complex nucleic acid.

27. The method of item 1, wherein the assembled sequence has a call rate of at least 85%.

28. The method of item 1, wherein the assembled sequence has a call rate of at least 90%.

29. The method of item 1, further comprising: (a) providing an amount of the complex nucleic acid, and (b) sequencing the amount of the complex nucleic acid to produce the plurality of reads.

30. The method of item 1, wherein the complex nucleic acid is selected from the group consisting of a genome, an exome, a transcriptome, a methylome, a mixture of genomes of different organisms, a mixture of genomes of different cell types of an organism, and subsets thereof.

31. The method of item 1, wherein the organism is a mammal.

32. The method of item 1, wherein the organism is a human.

33. One or more computer-readable non-transitory storage media storing an assembled human genome sequence produced by the method of item 1.

34. A computer-readable non-transitory storage medium storing instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform the method of item 1.

35. A method for determining a human genome sequence, the method comprising:

(a) receiving, on one or more computing devices, a plurality of reads of the genome; and

(b) producing, with the one or more computing devices, an assembled sequence of the genome from the reads, the assembled sequence comprising fewer than 600 false single nucleotide variants per gigabase at a genome call rate of 70% or greater.

36. The method of item 34, wherein the assembled sequence of the human genome has a genome call rate of 70% and an exome call rate of 70% or greater.

37. A computer-readable non-transitory storage medium storing instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform the method of item 35.

38. A method for determining a human genome sequence, the method comprising:

(a) receiving, at one or more computing devices, a plurality of reads from each of a plurality of aliquots, each aliquot comprising fragments of the human genome; and

(b) generating, using the one or more computing devices, a phased assembled sequence of the genome from the reads, the assembled sequence comprising fewer than 1000 false single nucleotide variants per gigabase at a genome call rate of 70% or greater.

39. A computer-readable non-transitory storage medium storing instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform the method of item 38.

Sequence Listing

<110> Complete Genomics, Inc.

Drmanac, Radoje

Peters, Brock A.

Kermani, Bahram G.

<120> Processing and analysis of complex nucleic acid sequence data

<130> 92171-836153 (5039-US)

<140> US 13/448,279

<141> 2012-04-16

<150> US 61/546,516

<151> 2011-10-12

<150> US 61/527,428

<151> 2011-08-25

<150> US 61/517,196

<151> 2011-04-14

<160> 10

<170> PatentIn version 3.5

<210> 1

<211> 20

<212> DNA

<213> Artificial sequence

<220>

<223> Synthetic polynucleotides

<400> 1

ccgcagtagc ttacgaatcg 20

<210> 2

<211> 20

<212> DNA

<213> Artificial sequence

<220>

<223> Synthetic polynucleotides

<400> 2

gatttaactg agcacttggc 20

<210> 3

<211> 10

<212> DNA

<213> Artificial sequence

<220>

<223> Synthetic polynucleotides

<400> 3

aacgagtatt 10

<210> 4

<211> 10

<212> DNA

<213> Artificial sequence

<220>

<223> Synthetic polynucleotides

<400> 4

tttggcgttc 10

<210> 5

<211> 10

<212> DNA

<213> Artificial sequence

<220>

<223> Synthetic polynucleotides

<400> 5

gtagtaccgg 10

<210> 6

<211> 10

<212> DNA

<213> Artificial sequence

<220>

<223> Synthetic polynucleotides

<400> 6

aactgagcgg 10

<210> 7

<211> 12

<212> DNA

<213> Artificial sequence

<220>

<223> Synthetic polynucleotides

<400> 7

cagtcaagtg at 12

<210> 8

<211> 12

<212> DNA

<213> Artificial sequence

<220>

<223> Synthetic polynucleotides

<400> 8

catgatgagg ac 12

<210> 9

<211> 12

<212> DNA

<213> Artificial sequence

<220>

<223> Synthetic polynucleotides

<400> 9

tcttagcatg ta 12

<210> 10

<211> 12

<212> DNA

<213> Artificial sequence

<220>

<223> Synthetic polynucleotides

<400> 10

gtaactattc ag 12

Claims

1. A method for analyzing the genomic DNA of an organism, the method comprising:

Receiving, at one or more computing devices, a plurality of reads corresponding to fragments of the genomic DNA from a plurality of aliquots, each fragment of the genomic DNA being tagged with an aliquot-specific tag sequence, and each read comprising a sequence from a fragment of the genomic DNA and an aliquot-specific tag sequence, wherein each aliquot of the plurality of aliquots contains less than one haploid genome equivalent of the genomic DNA;

Determining the aliquot of origin of each read by identifying its aliquot-specific tag sequence;

Generating, using the one or more computing devices, a phased sequence from the reads by:

Identifying a plurality of heterozygous loci corresponding to at least a portion of the genome of the organism; and

Phasing the plurality of heterozygous loci to generate a first haplotype and a second haplotype, wherein the phasing uses the aliquots of origin of the reads corresponding to the plurality of heterozygous loci to determine which alleles at the heterozygous loci are in the same haplotype, the phased sequence corresponding to at least a portion of the genome of the organism, and wherein phasing the plurality of heterozygous loci comprises:

For each of a plurality of pairs of heterozygous loci,

Determining, from the reads, a matrix of the numbers of aliquots shared between alleles of the heterozygous loci of the pair, wherein the heterozygous loci of the pair are located within a specified distance of each other.
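The pairwise matrix recited in claim 1 can be illustrated with a minimal Python sketch. The read representation (a tuple of aliquot tag, locus position, and allele) and the function name `shared_aliquot_matrix` are assumptions made for illustration only; the claim does not prescribe a data format, and the caller is assumed to have already restricted the pair to loci within the specified distance.

```python
from collections import defaultdict

def shared_aliquot_matrix(reads, locus_a, locus_b, alleles_a, alleles_b):
    """Count the aliquots shared between each allele pair of two heterozygous loci.

    reads: iterable of (aliquot_tag, locus, allele) tuples (hypothetical format).
    Returns matrix[allele_a][allele_b] -> number of aliquots containing both alleles.
    """
    # For every (locus, allele), collect the set of aliquots in which it was read.
    seen = defaultdict(set)
    for tag, locus, allele in reads:
        if locus in (locus_a, locus_b):
            seen[(locus, allele)].add(tag)
    # Each matrix element is the size of the intersection of two aliquot sets.
    return {
        a: {b: len(seen[(locus_a, a)] & seen[(locus_b, b)]) for b in alleles_b}
        for a in alleles_a
    }

reads = [
    ("tag01", 100, "A"), ("tag01", 250, "C"),
    ("tag02", 100, "A"), ("tag02", 250, "C"),
    ("tag03", 100, "G"), ("tag03", 250, "T"),
    ("tag04", 100, "G"), ("tag04", 250, "C"),  # an aliquot that conflicts with the others
]
matrix = shared_aliquot_matrix(reads, 100, 250, "AG", "CT")
```

In this toy input, A at the first locus shares two aliquots with C at the second locus, which is the kind of evidence the phasing step uses to place the two alleles on the same haplotype.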

2. The method of claim 1, wherein phasing of the plurality of heterozygous loci further comprises:

Calculating, using each matrix, a score and an orientation for the corresponding pair of heterozygous loci; and

Determining the first haplotype and the second haplotype using the scores and orientations.

3. The method of claim 2, wherein the orientation specifies which allele of the first heterozygous locus of the corresponding pair is connected to the first allele of the second heterozygous locus of the corresponding pair, wherein a forward orientation specifies that the alleles of the two loci are connected in the order listed, and a reverse orientation specifies that they are connected in the reverse of the order listed.

4. The method of claim 3, wherein a score is calculated for the connection of the corresponding pair of heterozygous loci, and wherein the calculation comprises:

Determining a first value for the forward orientation; and

Determining a second value for the reverse orientation, wherein the orientation is selected based on the larger of the first value and the second value.

5. The method of claim 2, wherein a score is calculated for the connection of corresponding pairs of heterozygous loci, and wherein the calculation comprises:

Determining an impurity value as the ratio of the sum of all matrix elements other than the two matrix elements of the connection to the total sum of all matrix elements; and

Calculating the score using the impurity value and the two matrix elements.
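The orientation choice of claim 4 and the impurity ratio of claim 5 can be sketched as follows for a 2x2 matrix of shared-aliquot counts. The function name `connection_score` is hypothetical, and the final scoring formula is only a stand-in chosen for illustration: the claims state that the score uses the impurity value and the two matrix elements (claim 6 recites a fuzzy inference engine) but do not give a formula.

```python
def connection_score(matrix):
    """Pick an orientation and compute an impurity value for a 2x2 count matrix.

    matrix[i][j] is the number of aliquots shared by allele i of the first locus
    and allele j of the second locus. Returns (orientation, impurity, score),
    or None when the matrix holds no counts.
    """
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return None
    # Forward connects the alleles as listed (diagonal); reverse connects them crossed.
    candidates = {
        "forward": (matrix[0][0], matrix[1][1]),
        "reverse": (matrix[0][1], matrix[1][0]),
    }
    # The orientation with the larger support is selected.
    orientation = max(candidates, key=lambda name: sum(candidates[name]))
    e1, e2 = candidates[orientation]
    # Impurity: everything outside the two connected elements, over the total.
    impurity = (total - e1 - e2) / total
    # ASSUMPTION: illustrative stand-in score, not the formula of the source.
    score = (1.0 - impurity) * min(e1, e2)
    return orientation, impurity, score

result = connection_score([[2, 0], [1, 1]])
```

For the matrix `[[2, 0], [1, 1]]` the diagonal holds three of the four counts, so the forward orientation is selected with an impurity of 0.25.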

6. The method of claim 5, wherein the score is determined using a fuzzy inference engine based on the impurity value and the two matrix elements corresponding to the connection.

7. The method of claim 2, wherein determining the first haplotype and the second haplotype using the scores and orientations comprises:

Optimizing, based on the scores and orientations, a graph of the connections between pairs of heterozygous loci.

8. The method of claim 7, wherein the graph is optimized by generating a minimum spanning tree.
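As a rough illustration of the graph optimization in claims 7 and 8, the sketch below applies Kruskal's algorithm to a graph whose nodes are heterozygous loci and whose edges are scored connections. Two points are assumptions rather than statements of the source: the edge cost is taken as the negated score, so that strongly supported connections are kept first, and Kruskal's algorithm is used simply as one standard way to obtain a minimum spanning tree. The function name is hypothetical.

```python
def minimum_spanning_forest(num_nodes, edges):
    """Kruskal's algorithm. edges: list of (cost, u, v). Returns the kept edges."""
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    for cost, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # keep the edge only if it joins two components
            parent[root_u] = root_v
            kept.append((cost, u, v))
    return kept

# (locus index, locus index, connection score); values are made up.
scored = [(0, 1, 5.0), (1, 2, 4.0), (0, 2, 0.5), (3, 4, 2.0)]
edges = [(-score, u, v) for u, v, score in scored]  # ASSUMPTION: cost = -score
tree = minimum_spanning_forest(5, edges)
```

Here the weak connection between loci 0 and 2 is dropped because those loci are already joined through locus 1, and loci 3 and 4 form a separate subtree because no edge links them to the rest of the graph.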

9. The method of claim 7, wherein optimizing the connection graph provides a plurality of subtrees, the method further comprising:

Reducing each of the plurality of subtrees to a contig, thereby forming a plurality of contigs; and

Phasing the plurality of contigs using sequencing information from the parents of the organism to generate the first haplotype and the second haplotype.

10. The method of claim 7, further comprising:

Removing a first heterozygous locus as a node of the graph when the first heterozygous locus does not have at least two connections to other heterozygous loci in one direction and at least one connection to another heterozygous locus in the other direction.

11. The method of claim 1, wherein determining the matrix of the numbers of shared aliquots between alleles of a particular pair of heterozygous loci comprises mapping the reads to the heterozygous loci of the particular pair and counting the aliquots shared by the mapped reads for the alleles.

12. The method of claim 1, further comprising:

Generating, using the one or more computing devices, assembled sequences of the first haplotype and the second haplotype.

13. A method for analyzing the genomic DNA of an organism, the method comprising:

Receiving, at one or more computing devices, a plurality of reads corresponding to fragments of the genomic DNA from a plurality of aliquots, each fragment of the genomic DNA being tagged with an aliquot-specific tag sequence, and each read comprising a sequence from a fragment of the genomic DNA and an aliquot-specific tag sequence, wherein each aliquot of the plurality of aliquots contains less than one haploid genome equivalent of the genomic DNA;

Generating, using the one or more computing devices, a plurality of assembled sequences aligned to overlapping regions of the genome of the organism, each assembled sequence in the overlapping regions corresponding to a different aliquot;

Identifying a plurality of heterozygous loci corresponding to at least a portion of the genome of the organism, wherein the plurality of heterozygous loci comprises N heterozygous loci, N being an integer greater than 1;

Creating a plurality of clusters by clustering the assembled sequences, based on the alleles of each assembled sequence at the N heterozygous loci, in a space of 2^N to 4^N possibilities; and

Identifying the two clusters with the highest density as corresponding to a first haplotype and a second haplotype.

14. The method of claim 13, wherein phasing of the plurality of heterozygous loci comprises:

Calculating an N-dimensional matrix, wherein each dimension corresponds to a heterozygous locus and each matrix element corresponds to the number of assembled sequences having the combination of alleles corresponding to that matrix element;

Identifying a first matrix element and a second matrix element, each of which is the center of one of the two clusters;

Determining the first haplotype at the N heterozygous loci from the first matrix element; and

Determining the second haplotype at the N heterozygous loci from the second matrix element.
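The N-dimensional matrix of claim 14 can be sketched as a sparse counter keyed by allele combinations. This is a simplification stated as an assumption: cluster density is approximated by the count of identical allele combinations, with no tolerance for sequencing errors and no population-based weighting, and each assembled sequence is represented only by its alleles at the N heterozygous loci. The function name is hypothetical.

```python
from collections import Counter

def two_densest_haplotypes(assembled):
    """Return the two most frequent allele combinations across aliquots.

    assembled: list of strings, one per aliquot, giving that aliquot's allele
    at each of the N heterozygous loci (hypothetical representation).
    """
    # Sparse N-dimensional matrix: allele combination -> number of assembled sequences.
    counts = Counter(tuple(seq) for seq in assembled)
    if len(counts) < 2:
        raise ValueError("need at least two distinct allele combinations")
    (first, _), (second, _) = counts.most_common(2)
    return "".join(first), "".join(second)

# Six aliquots covering N = 3 heterozygous loci; "ATC" stands in for a noisy sequence.
assembled = ["ACT", "ACT", "ACT", "GTC", "GTC", "ATC"]
hap1, hap2 = two_densest_haplotypes(assembled)
```

With N = 3 there are between 2^3 = 8 and 4^3 = 64 possible combinations, but only the observed ones occupy memory in the counter.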

15. The method of claim 14, further comprising:

Assigning a weight to each of the matrix elements, wherein a first weight of a first combination of alleles observed in the population of interest is greater than a second weight of a second combination of alleles observed in the population of interest.

16. The method of claim 13, further comprising:

Repeating the phasing for a plurality of regions, each corresponding to a different plurality of heterozygous loci, thereby forming, for each of the plurality of regions, two contigs from the two clusters with the highest density, to obtain a plurality of contigs; and

Phasing the plurality of contigs using sequencing information from the parents of the organism to generate the first haplotype and the second haplotype.

17. A method for analyzing the genomic DNA of an organism, the method comprising:

Phasing a plurality of heterozygous loci to produce a first haplotype and a second haplotype, wherein the phasing uses the aliquots of origin of reads corresponding to the plurality of heterozygous loci to determine which alleles at the heterozygous loci are in the same haplotype, the phased sequence corresponding to at least a portion of the genome of the organism;

Identifying a phased SNP from the plurality of heterozygous loci, wherein the phased SNP has a first allele and a second allele;

Identifying a locus that neighbors the phased SNP, wherein the locus is a no-call and has reads with a third allele and a fourth allele;

Calculating a first number of shared aliquots that contain the first allele at the phased SNP and the third allele at the locus; and

Determining whether to call the third allele at the locus based on the first number of shared aliquots.

18. The method of claim 17, further comprising:

Calling the third allele at the locus when the first number of shared aliquots is higher than a threshold, wherein the threshold is 2 or higher.

19. The method of claim 17, further comprising:

Calculating a second number of shared aliquots that contain the second allele at the phased SNP and the third allele at the locus;

Calculating a third number of shared aliquots that contain the first allele at the phased SNP and the fourth allele at the locus; and

Determining that the locus is homozygous for the third allele when the first number and the second number are higher than a threshold and the third number is lower than the threshold.

20. The method of claim 17, further comprising:

Calculating a second number of shared aliquots that contain the second allele at the phased SNP and the fourth allele at the locus; and

Determining that the locus is heterozygous for the third allele and the fourth allele when all reads carrying the third allele share an aliquot with the first allele and all reads carrying the fourth allele share an aliquot with the second allele.
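The decision rules of claims 17 to 20 can be combined into one illustrative routine. Everything about the interface is an assumption: each allele is represented by the set of aliquot tags in which it was read, the function and label names are hypothetical, the order in which the rules are tested is a choice made here, and the default threshold of 2 is simply the smallest value permitted by claim 18.

```python
def rescue_no_call(first, second, third, fourth, threshold=2):
    """Classify a no-call locus next to a phased SNP using shared aliquots.

    first, second: sets of aliquot tags carrying each allele of the phased SNP.
    third, fourth: sets of aliquot tags carrying each allele seen at the locus.
    """
    n_first_third = len(first & third)    # "first number" of shared aliquots
    n_second_third = len(second & third)  # "second number" as used in claim 19
    n_first_fourth = len(first & fourth)  # "third number" as used in claim 19

    # Claim 20: every aliquot with the third allele also carries the first allele,
    # and every aliquot with the fourth allele also carries the second allele.
    if third and fourth and third <= first and fourth <= second:
        return "heterozygous"
    # Claim 19: the third allele travels with both haplotypes, the fourth with neither.
    if (n_first_third > threshold and n_second_third > threshold
            and n_first_fourth < threshold):
        return "homozygous_third"
    # Claim 18: enough shared aliquots to call the third allele.
    if n_first_third > threshold:
        return "third_allele_called"
    return "no_call"

first, second = {1, 2, 3, 4}, {5, 6, 7}
het = rescue_no_call(first, second, {1, 2, 3}, {5, 6})
hom = rescue_no_call(first, second, {1, 2, 3, 5, 6, 7}, {9})
called = rescue_no_call(first, second, {1, 2, 3, 9}, {4})
```

Representing each allele by its set of aliquot tags reduces every count in the claims to a set intersection, and the "all reads share an aliquot" condition of claim 20 to a subset test.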

21. A computer-readable storage medium storing instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform the method of any one of claims 1-20.

22. A computer system comprising the computer-readable storage medium of claim 21.

23. A computer system comprising one or more processors configured to perform the method of any one of claims 1-20.

24. A computer system comprising means for carrying out the method of any one of claims 1-20.

25. A computer system, comprising:

One or more computer-readable storage media storing an assembled whole human genome sequence having no more than one false single nucleotide variant per megabase and a call rate of at least 70%, wherein the assembled whole human genome sequence is generated by sequencing between 1 pg and 10 ng of human genomic DNA.