CN1771336A

CN1771336A - Methods and means for nucleic acid sequencing

Info

Publication number: CN1771336A
Application number: CNA2004800097143A
Authority: CN
Inventors: S·林纳松
Original assignee: GENIZON SVENSKA AB
Current assignee: GENIZON SVENSKA AB
Priority date: 2003-02-12
Filing date: 2004-02-09
Publication date: 2006-05-10
Also published as: GB2398301A; GB2398383A; GB0303191D0; GB2398301B; GB2398383B; GB0402773D0

Abstract

Nucleic acid sequencing-by-synthesis. Primed synthesis of a second strand complementary to a template strand in repeated sets of steps, each step comprising providing one or more of the possible nucleotide complementarity classes for incorporation into the synthesized strand, and each set of steps comprising providing all four possible nucleotide complementarity classes. Three of the four possible nucleotide complementarity classes may first be provided for incorporation into the synthesized strand, then separately the fourth nucleotide complementarity class alone. Also, a DNA molecule consisting of a stem portion and first and second loop portions, wherein the stem portion consists of a first strand and a second strand, wherein the first strand and second strand are equal in length, complementary and annealed together, wherein the first loop portion joins the 3' end of the first strand to the 5' end of the second strand and the second loop portion joins the 3' end of the second strand to the 5' end of the first strand so the DNA molecule has no free 5' or 3' ends, and uses thereof, especially in sequencing.

Description

Methods and tools for nucleic acid sequencing

The present invention relates to nucleic acid sequencing. It relates in particular to "sequencing by synthesis" (SBS), in which a nucleic acid strand with a free 3' end is annealed to a nucleic acid containing a template whose sequence information is desired and is used to prime second-strand synthesis, the determination of nucleotide incorporation providing the sequence information. The invention is based in part on an elegant concept that allows unblocked nucleotides to be used in so-called "chroma sequencing", thereby overcoming various problems of existing sequencing technologies and allowing extremely large amounts of sequence to be obtained in a single working day using standard reagents and equipment. Preferred embodiments provide additional benefits. The invention also relates to algorithms and techniques for sequence analysis, and to devices and systems for sequencing. The invention allows large-scale sequencing to be automated using only standard benchtop equipment readily available in the art.

The present invention involves primed synthesis of a second strand complementary to a template strand in repeated sets of steps. Each step comprises providing one or more, but optionally fewer than all, of the possible nucleotide complementarity classes for incorporation into the synthesized strand, and each set of steps comprises providing all four possible nucleotide complementarity classes, optionally in two or more steps, at least one of which includes adding more than one nucleotide complementarity class. Preferably, three of the four possible nucleotide complementarity classes are first provided for incorporation into the synthesized strand, and then the fourth nucleotide complementarity class is provided separately on its own. Chain extension terminates with the nucleotide incorporation of the last step, for example when the fourth nucleotide is provided, because no other nucleotides are present. Determining the number, and optionally the identity, of the nucleotides between terminations allows information about the base composition and/or sequence of the template to be determined rapidly. When a single "terminating nucleotide" is used at a time, four runs, each using a different one of the four nucleotides to terminate extension, provide information from which the entire template sequence can be determined very quickly and easily.

Although many different methods are used in genome research, direct sequencing is by far the most valuable. Indeed, if sequencing could be made efficient enough, all three major scientific problems in genomics (sequence determination, genotyping and gene expression analysis) could be solved. Model species could be sequenced, individuals could be genotyped by whole-genome sequencing, and RNA populations could be analyzed exhaustively by conversion to cDNA and sequencing (directly counting the number of copies of each mRNA).

Other examples of scientific and medical questions that can be addressed by sequencing include epigenomics (the study of methylated cytosines in the genome, by bisulfite conversion of unmethylated cytosine to uracil followed by comparison of the resulting sequence with the unconverted template sequence), protein-protein interactions (by sequencing the hits obtained in yeast two-hybrid experiments), protein-DNA interactions (by sequencing the DNA fragments obtained after chromatin immunoprecipitation) and so on. Highly efficient methods for DNA sequencing are therefore needed.

However, to replace auxiliary methods such as microarrays and PCR fragment analysis, extremely high sequencing throughput is required. For example, a living cell contains approximately 300,000 copies of messenger RNA, each about 2,000 bases long on average. Completely sequencing the RNA of even a single cell therefore requires probing 600 million nucleotides. In complex tissues composed of many different cell types, the task becomes even harder because cell-type-specific transcripts are further diluted. Throughput of gigabases per day will be needed to meet these demands. The table below shows some estimates of the throughput required for each kind of experiment (human, unless otherwise noted):

Experiment                                   Required throughput
Genome sequence (10× de novo)                30 Gbp
Genome-wide polymorphisms                    3 Gbp
Complete haplotype map (200 individuals)     600 Gbp
Gene expression                              600 Mbp
Epigenomics                                  3 Gbp
10 million protein interactions              400 Mbp
Entire biosphere (one species per genus)     ~300 Tbp
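The single-cell estimate above is plain arithmetic, and a minimal Python sketch makes it easy to redo under other assumptions. The copy number and mean length are the figures quoted in the text. The function name `days_to_sequence` is hypothetical, and using the roughly 2 million nucleotides per day that this document quotes for the best Sanger instruments as a comparison rate is an illustrative choice.

```python
# Back-of-the-envelope check of the single-cell RNA estimate quoted in the text.
MRNA_COPIES_PER_CELL = 300_000   # figure quoted in the text
MEAN_MRNA_LENGTH_NT = 2_000      # figure quoted in the text

# Total nucleotides that must be probed to sequence every mRNA copy once.
nucleotides_per_cell = MRNA_COPIES_PER_CELL * MEAN_MRNA_LENGTH_NT


def days_to_sequence(total_nt: float, throughput_nt_per_day: float) -> float:
    """Days needed to read total_nt nucleotides at a given daily throughput.

    Hypothetical helper for illustration; it ignores coverage and overhead.
    """
    if throughput_nt_per_day <= 0:
        raise ValueError("throughput must be positive")
    return total_nt / throughput_nt_per_day


# Comparison rate: ~2 million nt/day, the Sanger figure quoted in this document.
sanger_days = days_to_sequence(nucleotides_per_cell, 2_000_000)
# Comparison rate: 1 gigabase/day, the order of magnitude the text calls for.
gigabase_days = days_to_sequence(nucleotides_per_cell, 1_000_000_000)

print(nucleotides_per_cell, sanger_days, gigabase_days)
```

Under these assumptions a single cell's RNA corresponds to hundreds of instrument-days at the Sanger rate but well under one day at a gigabase per day.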

The present invention enables all of the above to be achieved at reasonable cost.

Methods for DNA sequencing

Sanger sequencing using fluorescent dideoxynucleotides (Sanger et al., PNAS 74 no. 12: 5463-5467, 1977) is the most widely used method and has been successfully automated in 96- and even 384-capillary sequencers. However, the method relies on the physical separation of a large number of fragments corresponding to each base position of the template, and so does not scale easily to ultra-high-throughput sequencing (the best current instruments generate ~2 million nucleotides of sequence per day).

Sequence can also be obtained indirectly by probing the target polynucleotide with probes selected from a set of probes.

Sequencing by hybridization uses a set of probes representing all possible sequences up to a certain length (that is, the set of all k-mers, where k is limited by the number of probes that fit on the microarray surface; with 1 million probes, k = 10 can be used) and hybridizes them to the template. Reconstructing the template sequence from the set of probes is complex, and is made harder by the inherently unpredictable nature of hybridization kinetics and by the combinatorial explosion in the number of probes needed to sequence larger templates. Even if these problems could be overcome, throughput would necessarily be low, since each template requires a microarray carrying millions of probes, and arrays generally cannot be reused.
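The link between probe length and array size follows from counting k-mers over a four-letter alphabet. The sketch below, in Python with hypothetical helper names, counts the probes needed for a given k and lists the k-mers present in a toy template, that is, its hybridization spectrum under idealized, error-free hybridization (an assumption made only for illustration).

```python
from itertools import product

ALPHABET = "ACGT"


def probe_count(k: int) -> int:
    """Number of distinct k-mer probes over the four-letter DNA alphabet."""
    return len(ALPHABET) ** k


def all_kmers(k: int) -> list:
    """Enumerate every k-mer; only practical for small k."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def spectrum(template: str, k: int) -> set:
    """k-mers occurring in the template (idealized hybridization result)."""
    return {template[i:i + k] for i in range(len(template) - k + 1)}


print(probe_count(10))          # probes needed for k = 10
print(len(all_kmers(3)))        # every 3-mer, enumerated explicitly
print(sorted(spectrum("ACGTAC", 3)))
```

For k = 10 the count is just over one million, matching the array size mentioned in the text, and each extra base of probe length multiplies the array size by four.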

Nanopore sequencing (US Genomics, US patent 6,355,420) exploits the fact that when a long DNA molecule is forced through a nanopore separating two reaction chambers, bound probes can be detected as changes in the conductance between the chambers. By decorating the DNA with a subset of all possible k-mers, partial sequence can be deduced. So far no workable strategy has been proposed for obtaining full-length sequence by the nanopore route, although if it were possible, astonishing throughput could in principle be achieved (on the order of one human genome in 30 minutes).

Various approaches have been devised for sequencing by synthesis (SBS).

To increase sequencing throughput, it would be desirable to visualize the incorporation of each base on a large number of templates in parallel, for example templates located on a glass surface or in a similar reaction chamber. This is what SBS achieves (see e.g. Malamede et al., US4863849; Kumar, US5908755). There are two routes to SBS: either a by-product released by each incorporated nucleotide is detected, or a permanently attached label is detected.

Pyrosequencing (e.g. WO9323564) determines the template sequence by detecting the by-product of each incorporated monomer, in the form of inorganic diphosphate (PPi). To keep the reactions on all template molecules synchronized, monomers are added one at a time, and unincorporated monomers are degraded before the next addition. Homopolymeric subsequences (runs of the same monomer) pose a problem, however, because multiple incorporations cannot be prevented. Synchronization eventually breaks down (because the accumulated non-incorporation and misincorporation on a small fraction of templates eventually overwhelms the true signal), and the best current systems can read only about 20-30 bases, with a combined throughput of about 200,000 bases/day.

Whereas Sanger sequencing requires elaborate instrumentation (capillaries) for each template, pyrosequencing is easily parallelized in a single reaction chamber. US6274320 describes the use of rolling-circle amplification to produce tandemly repeated linear single-stranded DNA molecules attached to optical fibers, which are analyzed in pyrosequencing reactions that can therefore be run in parallel. In principle, the throughput of such a system is limited only by surface area (the number of template molecules), reaction speed and imaging equipment (resolution). However, the need to prevent PPi from diffusing away from the detector before it is converted into a detectable signal means that, in practice, the number of reaction sites must be limited. In US6274320, each reaction is confined to a miniature reaction vessel at the tip of an optical fiber, limiting the number of sequences to one per fiber.

Even more limiting is the short read length (<30 bp) achieved by pyrosequencing. Such short sequences are not directly usable in whole-genome sequencing, and the complex setup of balanced reactions makes it difficult to extend the read length further. Read lengths of up to 100 bp have been reported only occasionally and for particular templates.

A similar scheme that detects a released label is described in US6255083. WO01/23610 describes a scheme in which nucleotides are added sequentially and a label that is subsequently cleaved off by an exonuclease is detected.

The advantage in principle of detecting a released label or by-product is that the template remains free of label in subsequent steps. However, because the signal diffuses away from the template, it may be difficult to parallelize such sequencing schemes on a solid surface such as a microarray.

Instead of detecting a released by-product, each incorporated nucleotide can be detected as it is added to the growing polymer. In principle, such a scheme would proceed like pyrosequencing (adding one base at a time, cycling through the four natural nucleotides) but would instead use labeled (e.g. fluorescent) nucleotide analogs. As an example, polony sequencing (Mitra RD, Church GM, Nucleic Acids Res 1999 Dec 15; 27(24): e34, "In situ localized amplification and contact replication of many individual DNA molecules") is based on the sequential addition of fluorescently labeled nucleotides.

Detecting a label attached to each incorporated nucleotide presents a further difficulty: the signal generated in each step must be removed, computationally subtracted or physically quenched in preparation for the next step. Removal can be accomplished, for example, by photobleaching or by using a cleavable linker between the nucleotide and the label. Polony sequencing, for instance, uses specially designed fluorescent nucleotides that carry a dithiol linker between the nucleotide and the fluorescent dye. According to unpublished observations, the linker is cleaved efficiently by reducing agents such as dithiothreitol, yielding nucleotides that are at least 99.8% pure.

Because read length in SBS methods is limited mainly by the loss of synchrony that occurs in each step, it would be desirable to be able to add all four nucleotides to the sequencing reaction while retaining the ability to stop the reaction between each base incorporation. All four nucleotides would then always be available (limiting the misincorporation rate), yet it would still be possible to monitor every incorporated base.

Many researchers have independently conceived a solution sometimes called the base addition sequencing strategy (BASS). By using 3'-blocked monomers, the reaction is prevented from proceeding more than one step at a time, but the blocking moiety is labile (e.g. photocleavable or chemically degradable), so that the 3'-OH group can be exposed in preparation for the next synthesis step.

BASS comprises:

1. providing a single-stranded template and an annealed primer;

2. adding 3'-OH-blocked fluorescent nucleotides;

3. adding polymerase, which incorporates a single nucleotide;

4. reading the fluorescence;

5. removing the blocking group, for example by photocleavage;

6. repeating steps 2-5.

Variants of this scheme use permanently 3'-OH-blocked nucleotides, which are removed with an exonuclease (WO01/23610, WO93/21340), or labile 3'-OH-blocked nucleotides, which can be converted back to a functional 3'-OH group (US5302509, WO00/50642, WO91/06678, WO93/05183).

All BASS schemes have the following in common:

• A blocking or terminating nucleotide is used to prevent synthesis from proceeding more than one step at a time.

• The nucleotide incorporated in each step is also labeled, usually with a fluorescent dye.

• At the end of each cycle, the blocking moiety (or the entire terminal nucleotide) is removed in preparation for the next cycle.

Taken together, these requirements place formidable demands on the enzymes used in BASS:

• They must accept nucleotides that are both blocked at the 3' position (where modifications are generally not tolerated by enzymes) and fluorescently labeled.

• They must incorporate such nucleotides efficiently enough that only a negligible fraction of all templates falls out of synchrony in each cycle.

• They must discriminate strictly between the base pairings of such nucleotides.

• They must not remove the blocking group or terminating nucleotide prematurely.
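The efficiency requirement in this list can be made concrete with a simple model. The Python sketch below assumes that each template copy independently stays in phase with a fixed probability per cycle, so the in-phase fraction decays geometrically. That model, the function names and the example values (99% per-cycle efficiency, 50% minimum in-phase fraction) are illustrative assumptions and are not taken from the source.

```python
import math


def fraction_in_phase(per_cycle_efficiency: float, cycles: int) -> float:
    """Fraction of template copies still synchronized after a number of cycles.

    Assumes independent, identical per-cycle success probability (a simplification).
    """
    if not 0.0 < per_cycle_efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return per_cycle_efficiency ** cycles


def max_cycles(per_cycle_efficiency: float, min_fraction: float) -> int:
    """Largest cycle count that keeps at least min_fraction of copies in phase."""
    if not 0.0 < per_cycle_efficiency < 1.0:
        raise ValueError("efficiency must be in (0, 1)")
    if not 0.0 < min_fraction < 1.0:
        raise ValueError("min_fraction must be in (0, 1)")
    return math.floor(math.log(min_fraction) / math.log(per_cycle_efficiency))


print(fraction_in_phase(0.99, 100))  # in-phase fraction after 100 cycles
print(max_cycles(0.99, 0.5))         # cycles before half the copies are out of phase
```

Under this model even a 1% loss per cycle leaves only about a third of the copies in phase after 100 cycles, which illustrates why incorporation efficiency bounds the achievable read length.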

The fact that nobody has so far been able to make BASS work suggests that these difficulties are insurmountable. For example, in Metzker et al., "Termination of DNA synthesis by novel 3'-modified-deoxyribonucleoside 5'-triphosphates", Nucleic Acids Res 1994; 22(20): 4259-67, none of the eight enzymes studied tolerated both 3'-blocked dUTP and 3'-blocked dCTP, even without the added complication of a fluorescent label. Finding an enzyme that accepts all four nucleotides in 3'-blocked and fluorescently labeled form therefore looks almost hopeless.

In summary, if a sequencing-by-incorporation method could be made to work, millions of templates attached to a surface could plausibly be sequenced in parallel. The main attraction of detecting incorporated rather than released labels is that the reactions can be run in parallel on a surface. On a 10 × 10 cm surface, for example, such a system could sequence, say, 37 million templates at about 600,000 bp/s with a cycle time of 60 s (assuming a Poisson distribution at 1 template/10 μm), giving 50 Gb per 24 hours. In principle, ten human genomes could be sequenced per day on such a system. The cost of the system would be comparable to that of a fluorescence scanner, and the running costs comparable to those of present-day Sanger sequencers.
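The throughput figures in this paragraph can be reproduced approximately with a short calculation. The Python sketch below rests on an interpretation that the source does not spell out: "1 template/10 μm" is read as a mean of one template per 10 × 10 μm cell, only cells holding exactly one template are counted as usable, and each usable template yields one base per 60 s cycle. The function name is hypothetical.

```python
import math


def usable_templates(side_um: float, cell_um: float, mean_per_cell: float = 1.0) -> float:
    """Expected number of grid cells holding exactly one template.

    Uses the Poisson probability P(k = 1) = m * exp(-m) for mean occupancy m.
    """
    cells = (side_um / cell_um) ** 2
    p_single = mean_per_cell * math.exp(-mean_per_cell)
    return cells * p_single


SIDE_UM = 100_000        # 10 cm surface edge, from the text
CELL_UM = 10             # 10 um spacing, from the text
CYCLE_SECONDS = 60       # cycle time, from the text
SECONDS_PER_DAY = 86_400

templates = usable_templates(SIDE_UM, CELL_UM)
bases_per_second = templates / CYCLE_SECONDS       # one base per template per cycle (assumption)
bases_per_day = bases_per_second * SECONDS_PER_DAY

print(round(templates), round(bases_per_second), round(bases_per_day))
```

With these assumptions the estimate lands close to the quoted 37 million templates, about 600,000 bp/s and roughly 50 Gb per day, which suggests this is the reasoning behind the figures, though the source does not state it.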

The main remaining obstacles to achieving this goal are, first, that read lengths in SBS are too short to be usable for sequencing large genomes and, second, that no reliable way of placing templates on a surface at sufficiently high density has yet been developed.

In its various aspects, the present invention elegantly solves these problems of the prior art.

Brief description of the drawings

Figure 1 illustrates a template (top row, showing the sequenced strand) sequenced by chroma sequencing using each of the natural nucleotides (shown at left) as the terminating nucleotide. Each chroma sequence is represented as a series of dashes (measuring the number of intervening bases) and letters (measuring the number of consecutive terminating nucleotides). As the figure makes clear, aligning the reads allows the original sequence to be read off column by column.

Figure 2 shows, for the nucleotide incorporation assay of Example II, the fluorescence (arbitrary units) after attempted incorporation of dTTP (labeled with Cy3), dATP and dGTP with and without DNA polymerase (Klenow). The expected result is two incorporated dTTPs, and the figure clearly demonstrates that such an incorporation event generates enough signal for incorporation to be detected reliably above background noise.

Figure 3 illustrates an embodiment of a reaction chamber suitable for solid-phase chroma sequencing in a regular microarray scanner. The diagram shows the reaction chamber assembly, which uses a regular 25 × 75 mm glass slide (1) on which templates can be spotted or randomly attached. During the reaction, a rubber gasket (2) seals the glass against the reaction chamber. The inlet (3) and outlet (4) are connected by connectors (5) to a reagent distribution system as illustrated in Figure 4.

Figure 4 illustrates an embodiment of a reagent distribution system suitable for performing chroma sequencing in the reaction chamber of Figure 3. A 10-port valve (1) allows reagents to be distributed to and from the chamber (2) and a waste line (6), and up to eight reagent containers (3) can hold the different reagents and wash buffers required by any given chroma sequencing protocol. The syringe pump (4) and valve (1), together with the scanner (5, partial view of the slide holder shown), can readily be motorized and computer-controlled to give a fully automated system.

The present invention is based on the development of a novel sequencing strategy that improves on previously described sequencing-by-synthesis methods while avoiding most of their difficulties. The strategy is easy to parallelize, visualizes the incorporation of each monomer directly (that is, without size fractionation) and offers the possibility of long read lengths.

The invention is based on the realization that, contrary to what has previously been assumed, an SBS method need not stop at every position (whether by adding one base at a time, as in pyrosequencing or the method of WO01/23610, or by using blocked nucleotides, as in BASS).

Instead, sequencing can proceed in jumps, from each occurrence of a particular "terminating" nucleotide to the next. The intervening nucleotides may be labeled. The terminating nucleotide may be labeled. This provides an improvement that may be an ideal compromise between two kinds of scheme: those using blocking groups (in which every step is productive but unblocking is problematic) and those that achieve synchronization by adding one base at a time (in which unblocking is avoided at the cost of making more steps unproductive, which aggravates the loss of synchrony). Moreover, in contrast to BASS, the invention removes the need to place the label on the same nucleotide as the blocking group.
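The jump-wise reading described here, together with the dash-and-letter representation described for Figure 1, can be sketched in a few lines of Python. This is an idealized illustration rather than the analysis method of the source: it assumes noise-free counts, reads that all start at the same template position, and hypothetical function names (`chroma_read`, `run_lengths`, `reconstruct`).

```python
from itertools import groupby

BASES = "ACGT"


def chroma_read(strand: str, terminator: str) -> str:
    """Dash-and-letter read of the sequenced strand for one terminating nucleotide.

    Intervening bases become '-', terminating nucleotides keep their letter.
    """
    return "".join(b if b == terminator else "-" for b in strand)


def run_lengths(read: str) -> list:
    """Collapse a read into (kind, count) runs, the quantities a run would measure."""
    return [("gap" if ch == "-" else "stop", len(list(group)))
            for ch, group in groupby(read)]


def reconstruct(reads: dict) -> str:
    """Align the four reads and take the single lettered entry in each column."""
    lengths = {len(r) for r in reads.values()}
    if len(lengths) != 1:
        raise ValueError("reads must be aligned and of equal length")
    sequence = []
    for column in zip(*reads.values()):
        letters = [c for c in column if c != "-"]
        if len(letters) != 1:
            raise ValueError("each column needs exactly one terminating nucleotide")
        sequence.append(letters[0])
    return "".join(sequence)


strand = "GATTACA"  # toy sequenced strand, chosen for illustration
reads = {base: chroma_read(strand, base) for base in BASES}
for base in BASES:
    print(base, reads[base], run_lengths(reads[base]))
print(reconstruct(reads))
```

Each read alone gives only the spacing of one nucleotide, but because every position is a terminating nucleotide in exactly one of the four reads, the aligned columns determine the full sequence.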

One aspect of the invention provides a sequencing-by-synthesis method characterized in that nucleotides are incorporated in a stepwise fashion, in which one step potentially allows incorporation of more than one kind of nucleotide.

In a preferred embodiment, one step potentially allows incorporation of three of the four possible nucleotides, depending on the underlying template sequence. Preferably, a different step allows incorporation of the fourth possible nucleotide, that is, the one remaining after the three that could potentially be incorporated in the first step.

In other embodiments, different steps are performed so that all four nucleotides can be incorporated within one set of steps, at least one step allowing incorporation of more than one but fewer than all of the possible nucleotides. As discussed further below, prior-art methods can be summarized either as having a set of four different repeated steps that can be cycled, each step in principle allowing only one of the four nucleotides to be incorporated (the actual number of nucleotides incorporated depends on the underlying template sequence), or as having a single repeated step that includes all four blocked nucleotides, again allowing only one of the four nucleotides to be incorporated in each step; both can be summarized as "1-1-1-1" methods. A single step that in principle allows incorporation of all four nucleotides, which can be summarized as a "4" method, is of no use for sequencing, because the sequenced strand would immediately polymerize to the end of the template. In its various embodiments, the invention allows sequencing by synthesis in which nucleotides are incorporated in steps following a pattern other than "4" or "1-1-1-1". Thus, in a preferred embodiment, nucleotides are incorporated in a set of steps following "3-1", as already mentioned. In other embodiments, a set of steps follows "2-2" or "1-2-1", or an irregular pattern in which nucleotides may be repeated within a set of steps (e.g. "2-2-3"). A set of steps is cycled as required. In addition, sets of steps with different patterns can be combined.
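The step patterns named above ("1-1-1-1", "3-1", "4" and so on) can be compared with a small simulation. The Python sketch below models an idealized polymerase that, in each step, extends the growing strand for as long as the next required nucleotide is among those offered. The function names, the toy strand and the error-free behavior are assumptions made for illustration and do not come from the source.

```python
from typing import List, Sequence, Set, Tuple


def run_step_set(strand: str, position: int,
                 steps: Sequence[Set[str]]) -> Tuple[int, List[int]]:
    """Run one set of steps; return the new position and bases incorporated per step.

    strand is the sequence of the strand being synthesized, read in order of synthesis.
    """
    counts = []
    for offered in steps:
        incorporated = 0
        # Extension continues until the next base needed is not on offer.
        while position < len(strand) and strand[position] in offered:
            position += 1
            incorporated += 1
        counts.append(incorporated)
    return position, counts


def sequence_by_steps(strand: str, steps: Sequence[Set[str]],
                      max_sets: int = 1000) -> List[List[int]]:
    """Cycle the step set until the strand is complete; return counts per set."""
    position, history = 0, []
    for _ in range(max_sets):  # guard against step sets that can never finish
        if position >= len(strand):
            break
        position, counts = run_step_set(strand, position, steps)
        history.append(counts)
    return history


strand = "GATTACA"  # toy sequence, chosen for illustration
patterns = {
    "3-1 (A terminates)": [set("CGT"), {"A"}],
    "1-1-1-1": [{"A"}, {"C"}, {"G"}, {"T"}],
    "4": [set("ACGT")],
}
for name, steps in patterns.items():
    print(name, sequence_by_steps(strand, steps))
```

In this toy run the "4" pattern extends to the end of the template in a single step and so yields no positional information, the "1-1-1-1" pattern contains many steps that incorporate nothing, and the "3-1" pattern makes every step productive while still stopping at each terminating nucleotide.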

According to one aspect of the present invention, there is provided a method of determining sequence and/or base composition information for a nucleic acid, the method comprising:

(i) providing a nucleic acid comprising a first strand that comprises a nucleic acid template, wherein a free 3' end of a nucleic acid strand annealed to the first strand allows extension of a nucleic acid strand complementary to the nucleic acid template by a template-dependent nucleic acid polymerase, through template-sequence-dependent incorporation of nucleotides into the nucleic acid strand complementary to the nucleic acid template;

(ii) performing a set of one or more steps, cycling the set of one or more steps a desired number of times or performing it in combination with other sets of one or more steps, to extend the nucleic acid strand complementary to the nucleic acid template and thereby allow information indicative of the base composition or sequence of the nucleic acid to be obtained,

wherein a step comprises:

(a) in the presence of:

the nucleic acid comprising a first strand that comprises the nucleic acid template,

the free 3' end of the nucleic acid strand annealed to the first strand of the nucleic acid template, and

the template-dependent nucleic acid polymerase;

providing nucleotides selected from one, two, three or four nucleotide complementarity classes for template-dependent incorporation by the nucleic acid polymerase into the nucleic acid strand complementary to the nucleic acid template, wherein each of the nucleotides is a natural nucleotide or a nucleotide analog capable of template-dependent incorporation into a DNA strand by a nucleic acid polymerase at the free 3' end of a nucleic acid strand, and wherein, within each nucleotide complementarity class, the nucleotides and nucleotide analogs are complementary to one of adenosine (A), cytosine (C), thymine (T) and guanine (G);

and

(b) removing or inactivating unincorporated nucleotides;

and

wherein within a set of steps

nucleotides selected from all four nucleotide complementarity classes are provided and are available for template-dependent incorporation,

in at least one step, nucleotides selected from more than one, optionally two, three or four, nucleotide complementarity classes are provided and are available for template-dependent incorporation, and nucleotides of at least one nucleotide complementarity class, if incorporated into the nucleic acid strand complementary to the nucleic acid template, allow further extension of that strand, and

optionally no nucleotide complementarity class is provided in more than one step, or each nucleotide complementarity class is provided in no more than one step within the set of steps; and

wherein, if nucleotides selected from all four complementarity classes are provided in one step, nucleotides of one, two or three nucleotide complementarity classes, if incorporated into the nucleic acid strand complementary to the nucleic acid template, prevent further extension of that strand, in all copies present if multiple copies are present;

(iii) performing multiple sets of said steps, by cycling said set of steps and/or by performing said set of steps in combination with a different set of steps;

(iv) determining the nature and/or amount of nucleotides incorporated into the nucleic acid strand complementary to the nucleic acid template in at least one set of steps, by determining the nature and/or amount of nucleotides incorporated into that strand in at least one step of each set for which the nature and/or amount of incorporated nucleotides is to be determined.

As noted, the present invention allows sequencing without size fractionation.

The free 3' end of a nucleic acid annealed to the first strand, 5' of the nucleic acid (e.g. DNA) template for which sequence information and/or base composition information is desired, may be provided by a primer (e.g. an oligonucleotide primer) annealed to the first strand; by a nick in a second strand annealed to the first strand (in which case the portion of the second strand initially annealed to the nucleic acid template is displaced or degraded during extension); or by a loop in the first strand itself, that is, an extension of the first strand that loops back to allow self-priming.

Nucleotides or nucleotide analogs can be defined by their base-pairing properties. Thus all nucleotides or nucleotide analogs that are incorporated opposite natural adenosine belong to the thymine nucleotide complementarity type, those incorporated opposite natural guanine belong to the cytosine type, those incorporated opposite natural thymine belong to the adenosine type, and those incorporated opposite natural cytosine belong to the guanine type. The nucleotide complementarity type thus describes and defines the logical properties of a nucleotide or nucleotide analog with respect to template-directed polymerization.
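As a small illustration of this definition, the sketch below assigns a complementarity type from the template base opposite which a nucleotide is incorporated. The function name and lookup table are assumptions made for illustration only; they do not appear in the source.

```python
# Watson-Crick partner of each natural template base (A, C, G, T).
PARTNER = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complementarity_type(template_base: str) -> str:
    """Return the complementarity type of any nucleotide or analog that is
    incorporated opposite the given natural template base."""
    return PARTNER[template_base.upper()]


# Any analog incorporated opposite template A belongs to the T type,
# whatever its chemical identity.
print(complementarity_type("A"))  # T
```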

Providing nucleotides in the reaction medium for incorporation by the template-dependent polymerase potentially allows their incorporation.

The nucleic acid template may be deoxyribonucleic acid (DNA), the nucleic acid polymerase may be a DNA-dependent DNA polymerase, and the nucleotides may be deoxyribonucleotides or deoxyribonucleotide analogs.

The nucleic acid template may be deoxyribonucleic acid (DNA), the nucleic acid polymerase may be a DNA-dependent ribonucleic acid (RNA) polymerase, and the nucleotides may be ribonucleotides or ribonucleotide analogs.

The nucleic acid template may be ribonucleic acid (RNA), the nucleic acid polymerase may be a reverse transcriptase, and the nucleotides may be deoxyribonucleotides or deoxyribonucleotide analogs.

In preferred embodiments of the various aspects of the invention, the nucleotides used in a step in which more than one different nucleotide can potentially be incorporated are selected from standard nucleotides.

In some preferred embodiments of the various aspects of the invention, the nucleotides used in a step in which only one of the different nucleotides can potentially be incorporated are selected from standard nucleotides.

In other embodiments, modified nucleotides or analogs may be employed, as discussed further elsewhere herein.

Nucleotides employed in the invention may be labeled, and the labels may include fluorescent labels. Different nucleotides (as between the A, C, G and T complementarity types) may carry different labels, for example different fluorescent labels, possibly of different colors.

As noted, the invention provides sequencing-by-synthesis methods characterized by incorporating nucleotides in a scheme other than 4 or 1-1-1-1 (the numbers denoting how many nucleotide types are provided in each step of a set).

Thus, a preferred incorporation scheme first allows the potential incorporation of 2 or 3 nucleotides and then, in a different step, generally following a wash to remove unincorporated nucleotides, allows the potential incorporation of 2 nucleotides or 1 nucleotide. Sets of steps may be combined to provide an overall reaction scheme.

Of course, appropriate conditions for template-dependent incorporation of nucleotides at the 3' end of the DNA strand are provided in the reaction medium, in accordance with the knowledge and techniques available in the art.

In one embodiment, the invention provides a method comprising a cycle of steps or sets of steps: providing a DNA template, wherein the free 3' end of a nucleic acid strand (e.g. an annealed primer) annealed to the first strand 5' of the DNA template allows synthesis of a DNA strand complementary to the DNA template; in a first step, adding a set of labeled nucleotides (termed "intervening" nucleotides) in the presence of a polymerase under conditions in which nucleotides are incorporated into the growing strand complementary to the template, followed by washing to remove unincorporated nucleotides; then, in a second step, adding a second set of labeled nucleotides ("terminating" nucleotides) in the presence of a polymerase under conditions for primer-based incorporation of nucleotides into the growing strand, followed by washing to remove unincorporated nucleotides; and determining the labels of the incorporated nucleotides. The set of steps may be repeated for as many cycles as desired.

The number (but not the order) of incorporated nucleotides is thus determined at each step. If the labels for the different nucleotides are distinguishable, the number (but not the order) of each incorporated nucleotide species is determined.

The information about incorporated nucleotides obtained in this way, i.e. by determining the labels, is termed the chroma. A chroma is not a standard DNA sequence, but:

· it can be used as a signature sequence and aligned against known DNA sequences;

· a set of (typically) four such sequences can be reassembled into a normal DNA sequence (as explained further herein).

Embodiments of the invention, and the concept of the chroma, can be illustrated by reference to a generic sequence obtained using dA, dC and dG as intervening nucleotides and dT as the terminating nucleotide, written for example as follows:

dT[1A,2C,1G,1T]-[2A,2C,1G,3T]-[2A,2C,1G,1T]-[0A,1C,0G,1T]

where the numbers in brackets give the abundance of each intervening nucleotide between successive occurrences of dT, as measured by its label intensity, together with the number of consecutive dTs.

Several DNA sequences could give rise to these data, for example:

ACCGTGCACATTTACAGCTCT

CAGCTCCAAGTTTCACGATCT

etc.
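The ambiguity described above can be checked with a minimal sketch that computes the chroma of a synthesized strand for one terminating nucleotide. The function name `chroma_row` and the fixed (A, C, G, T) tuple order are assumptions for illustration, and ideal integer counts stand in for measured label intensities.

```python
from collections import Counter

BASES = "ACGT"


def chroma_row(seq: str, terminator: str) -> list:
    """Per-cycle counts (A, C, G, T) for one reaction with the given terminator.

    `seq` is the synthesized strand; the order of bases within a cycle is lost.
    """
    cycles, i = [], 0
    while i < len(seq):
        counts = Counter()
        # Intervening nucleotides are incorporated until the terminator is needed.
        while i < len(seq) and seq[i] != terminator:
            counts[seq[i]] += 1
            i += 1
        # The terminating nucleotide then fills its whole homopolymer run.
        while i < len(seq) and seq[i] == terminator:
            counts[terminator] += 1
            i += 1
        cycles.append(tuple(counts[b] for b in BASES))
    return cycles


a = chroma_row("ACCGTGCACATTTACAGCTCT", "T")
b = chroma_row("CAGCTCCAAGTTTCACGATCT", "T")
print(a)       # [(1, 2, 1, 1), (2, 2, 1, 3), (2, 2, 1, 1), (0, 1, 0, 1)]
print(a == b)  # True
```

Both example sequences give the same four tuples, which is why a single reaction yields a signature rather than a unique sequence.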

A base-calling strategy is provided below that uses the information, or chroma, obtained from four such sequence reads (using each of the four nucleotides in turn as the terminating nucleotide) to determine the original sequence unambiguously.

In one aspect, preferred embodiments of the invention provide a method (Scheme I) comprising:

1. Providing a single-stranded template with an annealed DNA strand having a 3' end to act as a primer.

2. Adding a set of one or more labeled nucleotides (termed "intervening nucleotides"), selected such that at least one nucleotide complementary to the template (termed a "terminating nucleotide") is excluded from the set of labeled nucleotides. Typically, three nucleotides carrying distinguishable labels are added (the fourth natural nucleotide being the terminating nucleotide).

3. Optionally, adding one or more blocked nucleotides (different from the labeled nucleotides). These are also "terminating nucleotides". Examples include 3'-O-modified nucleotides, which may carry a photocleavable group that leaves a 3'-OH upon irradiation, or other modifications, namely acyclic nucleotides and dideoxynucleotides.

4. Optionally, adding one or more non-incorporable inhibitor nucleotides (different from the labeled and the blocked nucleotides), which can serve to prevent misincorporation at template positions that have no complement in the set of labeled or blocked nucleotides. Examples include 5'-di- and monophosphate nucleotides and 5'-(α,β-methylene)triphosphate nucleotides.

5. Incubating with a suitable polymerase under conditions that result in the addition of nucleotides to the growing strand.

6. Washing away unincorporated nucleotides.

7. If any blocked nucleotides were added in step 3:

a. removing the blocking moiety, e.g. by photocleavage, enzymatic conversion or chemical reaction;

b. alternatively, replacing the entire nucleotide by exonuclease treatment followed by incorporation of an unblocked nucleotide (see e.g. WO1/23610, WO93/21340).

8. Adding the remaining nucleotides ("terminating nucleotides") necessary to ensure that every nucleotide present in the template has had its complement added, and incubating with a polymerase (not necessarily the same as in step 5) under conditions that result in the addition of nucleotides to the growing strand. The terminating nucleotides may optionally be labeled and/or 3'-blocked (as in BASS).

9. Washing away unincorporated nucleotides.

10. Detecting the presence and/or amount of each labeled nucleotide.

11. Optionally, removing or inactivating the labels and/or 3'-blocking groups. For example, fluorescent labels may be photobleached.

12. Repeating steps 2-11 until the desired number of cycles has been completed.

This sequencing method is particularly well suited to parallelization on a solid phase, both because it is simple and because it provides a robust means of synchronization. The scheme can be repeated multiple times by restarting at step 1 with fresh primer.

The nucleotides added in steps 3 and 8 are termed terminating nucleotides because, in step 5, they prevent polymerization (by being blocked or by being absent) from continuing past their complements. The set of terminating nucleotides may be varied. For example, if the reaction is performed four times from step 1, each of the four natural nucleotides can be used in turn as the terminating nucleotide.

The primer anneals to the template by base complementarity, leaving a free 3' end to which nucleotides can be added one at a time by a template-dependent DNA polymerase. As noted, a free 3' end can also be generated by nicking one strand of a double-stranded DNA molecule, or by allowing the free 3' end of a single strand to loop back for self-priming.

Note: "labeled" molecules should be understood to include pure labeled molecules as well as mixtures of labeled and unlabeled molecules. For example, labeled dTTP may be pure fluorescein-labeled dTTP, or a mixture of fluorescein-labeled dTTP and ordinary unlabeled dTTP. The optimal ratio of labeled to unlabeled molecules is determined by several factors:

· The need to obtain sufficient signal to overcome instrument noise. For example, on a PerkinElmer ScanArray, 2.5 fluorochromes per pixel yield a signal three times the noise level.

· The need to avoid multiple fluorochromes in close proximity, so as to avoid fluorescence resonance energy transfer (FRET, which causes one fluorochrome to quench another). FRET falls off with the sixth power of distance, but can still be significant over a range of a few nucleotides.

· The need to avoid multiple fluorochromes in close proximity, so as not to inhibit subsequent nucleotide incorporation by the polymerase (which may be inhibited by the steric effects of bulky fluorochromes).

· Alternatively, the labeled fraction of nucleotides can be made to terminate the growing strand, for example by using labeled acyclic or dideoxynucleotides, or by placing the label at or near the 3'-OH. As long as the labeled nucleotides make up only a small fraction of all nucleotides, the loss of signal due to termination remains insignificant, while the loss of synchrony due to the enzyme's lower affinity for modified nucleotides is avoided entirely.

Work in the inventor's laboratory has found that ~2.5% or less labeled nucleotide works well (see the Examples below). Assuming that the template is 1000 tandemly repeated copies of a 100 bp sequence, at least 25 fluorochromes per template are obtained for each incorporated nucleotide (i.e. more than 10 times above the noise level on a PerkinElmer ScanArray, if each template lies within one pixel). Assuming that four nucleotides are incorporated in an average cycle, the labels are on average 1000 bases apart, avoiding both quenching and polymerase inhibition.
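The arithmetic behind these figures can be reproduced as follows. This is only a back-of-the-envelope sketch: the variable names are illustrative, the input values are those quoted in the text, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

labeled_fraction = Fraction(25, 1000)  # ~2.5% labeled nucleotide (from the text)
copies = 1000                          # tandem repeat copies per template (from the text)
repeat_bp = 100                        # length of each repeat in bp (from the text)
nt_per_cycle = 4                       # average nucleotides incorporated per cycle (from the text)

# Labels contributed by one template for each incorporated position.
fluors_per_incorporated_nt = labeled_fraction * copies

# Labels added to one repeat copy in one cycle, and the resulting average spacing.
labels_per_copy_per_cycle = labeled_fraction * nt_per_cycle
label_spacing_bases = repeat_bp / labels_per_copy_per_cycle

print(fluors_per_incorporated_nt)  # 25
print(label_spacing_bases)         # 1000
```

On average one label is added per ten repeat copies in a cycle, which is where the 1000-base spacing comes from.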

In further embodiments of the invention, Scheme I allows, for example, a variant of BASS that relaxes some of the constraints on the polymerase. If the set of intervening nucleotides is labeled but not blocked, while the terminating nucleotides are blocked but not labeled, all four nucleotides can be added as a mixture in a single step, followed by washing and scanning as above. A polymerase that accepts both blocked and labeled nucleotides may be used, or different polymerases may be used, adding the labeled intervening nucleotides in a first step and the blocked terminating nucleotides in a second step. The chroma obtained with this modified scheme differs in that homopolymers are detected as adjacent cycles with no incorporation of intervening nucleotides; each such cycle ends with the incorporation of a single terminating nucleotide, so that the homopolymer is scanned stepwise rather than filled in a single run. In such a scheme it may be desirable to use photocleavable fluorochromes (see below) together with a photocleavable 3'-blocking group. Alternatively, blocking groups removable by mild chemical treatment may be used, for example the allyl group described in Kamal et al. (Tetrahedron Letters 1999, vol. 40, pp. 371-372).
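The effect of a blocked terminating nucleotide on the chroma can be sketched as below. The function name is hypothetical, and counts are idealized; note that in this variant the terminating nucleotide is unlabeled, so its entry (0 or 1) would be inferred from the cycle structure rather than measured.

```python
BASES = "ACGT"


def chroma_row_blocked(seq: str, terminator: str) -> list:
    """Per-cycle counts (A, C, G, T) when the terminator is 3'-blocked,
    so that at most one terminating nucleotide is incorporated per cycle."""
    cycles, i = [], 0
    while i < len(seq):
        counts = dict.fromkeys(BASES, 0)
        # Unblocked intervening nucleotides extend up to the next terminator site.
        while i < len(seq) and seq[i] != terminator:
            counts[seq[i]] += 1
            i += 1
        # A single blocked terminator is added; the block stops further extension.
        if i < len(seq):
            counts[terminator] = 1
            i += 1
        cycles.append(tuple(counts[b] for b in BASES))
    return cycles


print(chroma_row_blocked("GCACATTTAC", "T"))
# [(2, 2, 1, 1), (0, 0, 0, 1), (0, 0, 0, 1), (1, 1, 0, 0)]
```

The TTT run appears as one normal cycle followed by two cycles with no intervening incorporation, matching the stepwise scanning described above.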

In a particularly simple embodiment, one aspect of the invention provides a method (Scheme II) comprising:

1. Providing a single-stranded template with a free 3' end on an annealed DNA strand to function as a primer.

2. Adding three nucleotides carrying distinguishable labels, e.g. distinguishable fluorescent labels.

3. Optionally, adding one or more non-incorporable inhibitor nucleotides (different from the labeled nucleotides). Examples include 5'-di- and monophosphate nucleotides and 5'-(α,β-methylene)triphosphate nucleotides.

4. Incubating with a suitable polymerase under conditions that result in the addition of nucleotides to the growing strand.

5. Washing away unincorporated nucleotides.

6. Adding the remaining nucleotide (labeled, e.g. fluorescently) and incubating with a polymerase (not necessarily the same as in step 4) under conditions that result in the addition of nucleotides to the growing strand.

7. Washing away unincorporated nucleotides.

8. Detecting the presence and amount of each labeled nucleotide.

9. Inactivating the labels (e.g. by photobleaching, which need not be done in every cycle, or by chemical treatment, e.g. with dithiothreitol to cleave a disulfide bond).

10. Repeating steps 2-9 until the desired number of cycles has been completed.

For example, one may use dA/dG/dC (labeled e.g. red/green/blue) in step 2, and then add dT (labeled e.g. yellow) in step 6. Step 4 will add any number of dA, dG and dC until the first dA in the template is reached, and will then stop because no complementary nucleotide is present. The fluorescence read in step 8 for dA/dG/dC (e.g. red/green/blue) will be proportional to the numbers of dA, dG and dC between successive dTs, while the fluorescence from incorporated dT (e.g. yellow) will be proportional to the number of consecutive dTs; after spectral separation, the individual contributions can be quantified. The sequence obtained can in general be written as a series of four-member groups giving the number (but not the order) of dA, dG and dC between successive dTs, together with the number of consecutive dTs.

For example, the sequence ACGCTACGCATCAGACTTC (i.e. template TGCGATGCGTAGTCTGAAG) can be written as [1A,2C,1G,1T]-[2A,2C,1G,1T]-[2A,2C,1G,2T]-[0A,1C,0G,0T].
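The worked example can be reproduced with a short simulation of the two incorporation steps of Scheme II. This is a sketch under simplifying assumptions: extension is error-free, counts are exact, and the template is written position by position opposite the synthesized strand, as in the text. The function names are illustrative.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"


def run_scheme_ii(template: str, terminator: str, cycles: int) -> list:
    """Simulate Scheme II and return one dict of incorporated counts per cycle."""
    pos, readout = 0, []
    for _ in range(cycles):
        counts = dict.fromkeys(BASES, 0)
        # Steps 2-5: three intervening nucleotides; extension stalls where the
        # template calls for the absent terminating nucleotide.
        while pos < len(template) and COMPLEMENT[template[pos]] != terminator:
            counts[COMPLEMENT[template[pos]]] += 1
            pos += 1
        # Steps 6-7: only the terminating nucleotide; it fills its whole run.
        while pos < len(template) and COMPLEMENT[template[pos]] == terminator:
            counts[terminator] += 1
            pos += 1
        readout.append(counts)  # Step 8: detection of all four labels.
    return readout


def notation(readout: list) -> str:
    """Format the readout in the bracket notation used in the text."""
    return "-".join(
        "[" + ",".join(f"{c[b]}{b}" for b in BASES) + "]" for c in readout
    )


print(notation(run_scheme_ii("TGCGATGCGTAGTCTGAAG", "T", 4)))
# [1A,2C,1G,1T]-[2A,2C,1G,1T]-[2A,2C,1G,2T]-[0A,1C,0G,0T]
```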

By performing four different reactions according to Scheme II, varying the terminating nucleotide among the four possibilities, one can ensure that termination occurs at every base in one of the four reactions.

Although fluorochromes are convenient to use, not all fluorochromes are easily bleached. Other types of label can be used in the above procedures, as long as they can be removed, inactivated or computationally subtracted in each cycle. In further embodiments, however, to allow a wider choice of labels, removal (e.g. photobleaching of fluorochromes) may optionally be replaced by a complete restart, for example as follows:

First, one cycle is performed with labeled (e.g. fluorescent) nucleotides. The newly synthesized DNA strand is removed, e.g. by formamide treatment, and fresh primer is annealed to restart the process. This time, one cycle is performed with unlabeled nucleotides, followed by one cycle with labeled nucleotides. The process is repeated, each time with an increasing number of cycles of unlabeled nucleotides. In this way only the last cycle of each restart is labeled, eliminating the need to remove the labels of previous cycles (e.g. by bleaching the fluorochromes).
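The restart procedure amounts to a simple schedule, sketched below. The function name is hypothetical; the point of the sketch is the cost of this option, since reading n cycles this way takes 1 + 2 + ... + n = n(n+1)/2 cycles in total.

```python
def restart_schedule(n_reads: int) -> list:
    """Return (unlabeled cycles, labeled cycles) for each restart.

    Restart k runs k unlabeled cycles to reach the position of interest,
    then one labeled cycle that is actually read.
    """
    return [(k, 1) for k in range(n_reads)]


schedule = restart_schedule(4)
print(schedule)                         # [(0, 1), (1, 1), (2, 1), (3, 1)]
print(sum(u + l for u, l in schedule))  # 10
```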

The same approach can also be used to skip over regions that are not of interest, somewhat like moving the read head of a tape recorder.

As an alternative to photobleaching, modified fluorescent nucleotides carrying a cleavable linker between the nucleotide and the fluorochrome may be used. For example, such nucleotides carrying a disulfide bond, which can be cleaved efficiently by a reducing agent such as dithiothreitol, have been described (see the work of Rob Mitra and George Church on polony technology for sequencing and genotyping; details including chemical structures can be found on the Internet, e.g. at http://cbcg.lbl.gov/Genome9/Talks/mitra.pdf). Similarly, Li et al. (PNAS 2003, vol. 100 no. 2, pp. 414-419) describe photocleavable fluorescent nucleotides comprising a photolabile 2-nitrobenzyl linker.

The method according to Scheme II provides a number of advantages:

· In contrast to current SBS methods, in which most cycles are unproductive (because a single base is added at a time, with a < 50% chance of being complementary at that position), the number of cycles required to sequence n bases is n, since one of the four reactions terminates at each template position (ignoring homopolymers).

· Since synthesis restarts from the primer for each of the four reactions, factors that depend mainly on the number of cycles are four times less of a problem. In particular, loss of synchrony will occur after many cycles, but because all templates are in effect resynchronized for each of the four reactions, four times as many bases can be read under similar conditions as with SBI or pyrosequencing (see the Examples below).

· Applications that do not require the full sequence (e.g. signature sequencing for gene expression, methyl-cytosine sequencing for epigenomics, and SNP analysis of specific SNPs) can use the partial sequence obtained from just one of the four reactions. The sequence obtained contains information equivalent to 1 base pair per cycle. See Scheme III below. See also Figure 1 for an illustration of the data obtainable on the composition of each of dA, dC, dG and dT in the different reactions. Any one of those data sets is sufficient for the desired purpose, e.g. to determine which of several possible sequences (differing, for example, in a dA nucleotide) is present in a test sample.

· Homopolymer runs are always measured four times, making them easier to base-call correctly than in SBI or pyrosequencing. See base-calling algorithm II below.

Base-calling algorithm I (basic strategy)

This part of the disclosure sets out exemplary embodiments of those aspects of the invention that concern identifying a sequence from information obtained by methods that use the terminating and intervening nucleotides disclosed herein.

By performing four different reactions according to Scheme II, varying the terminating nucleotide among the four possibilities, one can ensure that termination occurs at every base in one of the four reactions. The table below shows the results, or chroma, that would be obtained from the sequence ACGCTACGCATCAGACTC (template TGCGATGCGTAGTCTGAG) in four cycles with each of the four terminating nucleotides:

Terminator  Sequence obtained (first 4 cycles)
dT          [1A,2C,1G,1T]-[2A,2C,1G,1T]-[2A,2C,1G,1T]-[0A,1C,0G,0T]
dA          [0C,0G,0T,1A]-[2C,1G,1T,1A]-[2C,1G,0T,1A]-[1C,0G,1T,1A]
dG          [1A,1C,0T,1G]-[1A,2C,1T,1G]-[2A,2C,1T,1G]-[1A,2C,1T,0G]
dC          [1A,0G,0T,1C]-[0A,1G,0T,1C]-[1A,0G,1T,1C]-[0A,1G,0T,1C]

Reading from left to right, one can readily see that the first nucleotide must be A (since the first step with A as terminator produced no fluorescence from any other base, it must have terminated without incorporating any intervening nucleotide). Removing the corresponding entry and recording A gives:

Terminator  Sequence obtained
dT          [1A,2C,1G,1T]-[2A,2C,1G,1T]-[2A,2C,1G,1T]-[0A,1C,0G,0T]
dA          [2C,1G,1T,1A]-[2C,1G,0T,1A]-[1C,0G,1T,1A]
dG          [1A,1C,0T,1G]-[1A,2C,1T,1G]-[2A,2C,1T,1G]-[1A,2C,1T,0G]
dC          [1A,0G,0T,1C]-[0A,1G,0T,1C]-[1A,0G,1T,1C]-[0A,1G,0T,1C]

Sequence: A

The only consistent entry on the left is now that for C, since it indicates the presence of only 1 A. Removing the corresponding entry and recording C gives:

Terminator  Sequence obtained
dT          [1A,2C,1G,1T]-[2A,2C,1G,1T]-[2A,2C,1G,1T]-[0A,1C,0G,0T]
dA          [2C,1G,1T,1A]-[2C,1G,0T,1A]-[1C,0G,1T,1A]
dG          [1A,1C,0T,1G]-[1A,2C,1T,1G]-[2A,2C,1T,1G]-[1A,2C,1T,0G]
dC          [0A,1G,0T,1C]-[1A,0G,1T,1C]-[0A,1G,0T,1C]

Sequence: AC

The only consistent entry on the left is now that for G:

Terminator  Sequence obtained
dT          [1A,2C,1G,1T]-[2A,2C,1G,1T]-[2A,2C,1G,1T]-[0A,1C,0G,0T]
dA          [2C,1G,1T,1A]-[2C,1G,0T,1A]-[1C,0G,1T,1A]
dG          [1A,2C,1T,1G]-[2A,2C,1T,1G]-[1A,2C,1T,0G]
dC          [0A,1G,0T,1C]-[1A,0G,1T,1C]-[0A,1G,0T,1C]

Sequence: ACG

The only consistent entry on the left is now that for C, since it indicates only 1 G between this C and the previous one, consistent with the sequence obtained so far.

Continuing in this way eventually gives the complete sequence: ACGCTACGCATCAGACTC.

In fact, it is easy to see that the total fluorescence obtained from the intervening nucleotides in each step measures the total distance between successive terminating nucleotides, while the fluorescence from the terminating nucleotide measures the number of consecutive terminating nucleotides; one can therefore always determine the sequence from a set of four reactions. This is further illustrated with reference to Figure 1.

Scanning across the four rows in Figure 1 enables the sequence to be "read". The sequence can be obtained simply by determining the number of terminating nucleotides incorporated in each cycle (from the magnitude of the measured label, e.g. fluorescence) and the number of intervening nucleotides incorporated in each cycle (again from the magnitude of the measured label), and aligning the results of the four runs, each using a different one of the four nucleotides as the terminating nucleotide. Preferably, however, the nature (i.e. the identity) of the intervening nucleotides is also determined in each run, which provides redundant information that allows very fast and accurate sequence determination and tolerates errors in measuring label magnitude, as discussed further herein.
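The distance-based reading described above can be sketched in a few lines. This is one possible rendering, not necessarily the inventor's implementation: it assumes noise-free integer counts, uses only the per-cycle totals (ignoring the redundant identities of the intervening nucleotides), and the names `chroma_row` and `call_bases` are illustrative.

```python
BASES = "ACGT"


def chroma_row(seq: str, terminator: str) -> list:
    """Ideal per-cycle counts (A, C, G, T) for one reaction (simulated data)."""
    cycles, i = [], 0
    while i < len(seq):
        counts = dict.fromkeys(BASES, 0)
        while i < len(seq) and seq[i] != terminator:
            counts[seq[i]] += 1
            i += 1
        while i < len(seq) and seq[i] == terminator:
            counts[terminator] += 1
            i += 1
        cycles.append(tuple(counts[b] for b in BASES))
    return cycles


def call_bases(rows: dict) -> str:
    """Place each terminating nucleotide at the positions implied by the
    distances measured in its own reaction; uncalled positions stay 'N'."""
    length = max(sum(sum(c) for c in cycles) for cycles in rows.values())
    called = ["N"] * length
    for terminator, cycles in rows.items():
        t = BASES.index(terminator)
        pos = 0
        for counts in cycles:
            run = counts[t]
            pos += sum(counts) - run                 # skip the intervening bases
            called[pos:pos + run] = terminator * run  # mark the terminator run
            pos += run
    return "".join(called)


seq = "ACGCTACGCATCAGACTC"
rows = {b: chroma_row(seq, b) for b in BASES}
print(call_bases(rows) == seq)  # True
```

With only the first four cycles of each reaction, as in the table above, some positions toward the end of the sequence would remain uncalled; more cycles are needed to cover them.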

Base calling algorithm II

More sophisticated base calling algorithms can be implemented using, for example, dynamic programming, least-squares optimization and/or regular expressions, in order to find the best sequence in the face of measurement error. Such algorithms can also make better use of the redundancy in the available information. In other words, instead of using only the measured distance between successive occurrences of the same nucleotide, such an algorithm finds the sequence that minimizes the difference between the expected and observed abundance of each of the three intervening nucleotides.

The inventor provides a working dynamic programming algorithm that performs well even with 20-25% noise. It first uses dynamic programming to perform a multiple alignment of the four series of measurements, minimizing at each step the difference between the expected and observed abundance of each of the three intervening nucleotides. Least-squares optimization is then used to find the most likely length of each homopolymer run, based on the four available distance measurements.

Terms and definitions

A homopolymer is a contiguous run of one particular nucleotide. A homopolymer sequence is a DNA sequence in which each homopolymer is recorded as a number rather than as repeated letters; for example, ACCGGT is written as ACGT with homopolymer lengths 1, 2, 2, 1.
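The definition above can be illustrated with a minimal sketch. The function names `to_homopolymer_sequence` and `expand` are illustrative only and do not come from the patent.

```python
from itertools import groupby


def to_homopolymer_sequence(dna):
    """Collapse runs of identical bases; return (collapsed sequence, run lengths)."""
    runs = [(base, len(list(group))) for base, group in groupby(dna)]
    collapsed = "".join(base for base, _ in runs)
    lengths = [length for _, length in runs]
    return collapsed, lengths


def expand(collapsed, lengths):
    """Inverse operation: spell out each homopolymer as repeated letters."""
    return "".join(base * length for base, length in zip(collapsed, lengths))


collapsed, lengths = to_homopolymer_sequence("ACCGGT")
print(collapsed, lengths)
```

For the example in the text, this yields the collapsed sequence ACGT with lengths 1, 2, 2, 1.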

Let a chroma be the set of measurements obtained by repeating a method of the invention, such as Scheme I, four times, using each of the four natural nucleotides in turn as the stop nucleotide. A chroma is thus a three-dimensional array of measurements indexed by cycle, stop nucleotide and measured nucleotide. For example, if 10 cycles are run for each stop nucleotide, the chroma contains 10 (cycles) × 4 (stop nucleotides) × 4 (measured nucleotides) entries, and the number at position {4, 'A', 'C'} is the cytosine fluorescence measured in cycle 4 when adenosine is used as the stop nucleotide. For convenience, let the chroma of x be the subset of the full chroma containing the measurements obtained with x as the stop nucleotide. The chroma of A is thus one quarter of the full chroma.

Let N be the number of cycles run in each repetition. The chroma thus has 4 × 4 × N entries, each derived from a label measurement.
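As a rough illustration of this data structure, the sketch below builds an ideal (noise-free) chroma for a known strand. It assumes that the input string is the strand being synthesized, that each cycle first incorporates the three intervening nucleotides and then the stop nucleotide, and that measurements are stored in a dictionary keyed by (cycle, stop nucleotide, measured nucleotide) with cycles numbered from 1. The function name `simulate_chroma` is hypothetical.

```python
BASES = "ACGT"


def simulate_chroma(strand, n_cycles):
    """Return an ideal chroma as a dict keyed by (cycle, stop, measured)."""
    chroma = {}
    for stop in BASES:
        pos = 0
        for cycle in range(1, n_cycles + 1):
            counts = dict.fromkeys(BASES, 0)
            # Step 1: the three intervening nucleotides extend up to the next stop base.
            while pos < len(strand) and strand[pos] != stop:
                counts[strand[pos]] += 1
                pos += 1
            # Step 2: the stop nucleotide alone fills its homopolymer run.
            while pos < len(strand) and strand[pos] == stop:
                counts[stop] += 1
                pos += 1
            for measured in BASES:
                chroma[(cycle, stop, measured)] = counts[measured]
    return chroma


strand = "ATGGAGCAGCGTCATTCCTTAGCGGGCAACTGTGACGATGGTGAGAAGTC"
chroma = simulate_chroma(strand, 10)
print(len(chroma))             # 4 * 4 * N entries
print(chroma[(4, "A", "C")])   # cytosine signal in cycle 4 of the chroma of A
```

With N = 10 the dictionary has 160 entries, matching the 4 × 4 × N count given above.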

Let a call sequence be a nucleotide sequence S_0, S_1, ..., S_k (where each S is one of [A, C, G, T]). The aim of base calling is to find the best call sequence given a chroma. For convenience, homopolymer runs are expressed as quantities rather than as repeats of the same base; in other words, each position i in the call sequence is associated with a quantity q_i giving the estimated number of repeats of base S_i. For consistency, the sequence is constrained so that S_(n+1) ≠ S_n for all n.

Base calling phase I: dynamic programming

The aim of base calling is to find the best call sequence given a chroma. However, there are 4 × 3^(k-1) possible call sequences of length k, a very large number even for fairly small k (for k = 20 there are more than 4 billion possible call sequences). To arrive at a usable base calling algorithm, the complexity of the problem must be reduced.
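The count quoted above follows from four choices for the first base and three for every subsequent base (any base except the previous one). A short sketch, with illustrative function names, checks the closed form against brute-force enumeration for a small k and evaluates it for k = 20.

```python
from itertools import product


def count_call_sequences(k):
    """Closed form: 4 choices for the first base, 3 for each later base."""
    return 1 if k == 0 else 4 * 3 ** (k - 1)


def enumerate_call_sequences(k):
    """Brute force: all length-k sequences with no two equal neighbors."""
    return ["".join(s) for s in product("ACGT", repeat=k)
            if all(a != b for a, b in zip(s, s[1:]))]


print(len(enumerate_call_sequences(3)), count_call_sequences(3))
print(count_call_sequences(20))
```

For k = 20 the closed form gives 4 649 045 868, in line with the "more than 4 billion" stated in the text.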

Call sequences can be classified by the number of occurrences of each nucleotide. For example, the base count {1, 2, 0, 4} corresponds to any sequence containing one A, two Cs, no G and four Ts. One such sequence is TCTATCT.
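A base count in the order {A, C, G, T} can be computed as in the following sketch (the helper name `base_count` is an illustration, not part of the patent).

```python
from collections import Counter


def base_count(call_sequence):
    """Return the number of occurrences of A, C, G and T, in that order."""
    counts = Counter(call_sequence)
    return tuple(counts[base] for base in "ACGT")


print(base_count("TCTATCT"))
```

Many different call sequences share the same base count, which is what allows the dynamic programming step to index its intermediate results by base count alone.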

The algorithm provided according to the invention exploits the fact that the best call sequence is easily found in some simple cases, and that more difficult cases can be derived from simpler ones by recursion.

Some simple cases are easily solved. The base count {0, 0, 0, 0} corresponds to the empty call sequence. The count {1, 0, 0, 0} can only correspond to the call sequence 'A', and similarly for C, G and T.

However, the base count {1, 1, 1, 1} may correspond to 'ACGT', 'TCGA' and so on. In such cases the chroma can be used to find the best call sequence.

Note that any call sequence with base count {i, j, k, l} must correspond exactly to a particular subset of the chroma, namely the subset comprising i cycles of the chroma of A, j cycles of the chroma of C, k cycles of the chroma of G and l cycles of the chroma of T. The chroma predicted by the call sequence can therefore be compared with the chroma actually measured. The best call sequence for {i, j, k, l} is the one whose predicted chroma is most similar to the corresponding subset of the measured chroma. Similarity can be measured in various ways, for example as a sum of differences, a sum of squared differences or a Pearson correlation coefficient. It can be reported as a score, either an error score to be minimized or a similarity score to be maximized.

The general case {i, j, k, l} cannot be solved directly. However, the best call sequence for {i, j, k, l} can be generated from a shorter sequence in at most four different ways: by adding 'A' to the best sequence for {i-1, j, k, l}, by adding 'C' to the best sequence for {i, j-1, k, l}, by adding 'G' to the best sequence for {i, j, k-1, l}, or by adding 'T' to the best sequence for {i, j, k, l-1}.

By computing a score (by comparing predicted and actual chroma, as described above) and selecting the minimum (or maximum, depending on the measure used), one can find which of the (at most) four extensions is best. How the score is computed is shown below; for the moment, assume that such scores are available.

We set q for the newly called base to the quantity actually measured in the chroma. For example, when considering extension with 'A' (i.e. from {i-1, j, k, l} to {i, j, k, l}), q is taken from the chroma at position {i, 'A', 'A'}, i.e. the amount of labeled adenosine measured in cycle i with adenosine as the stop nucleotide.

Thus the best call sequence for {i, j, k, l} can always be obtained by finding the best extension of a sequence containing one fewer of one of the bases. The operation can then be repeated for each shorter case until a trivially solved case such as {1, 0, 0, 0} is reached. By applying the same simple operation recursively, it is therefore always possible to obtain the best call sequence of any length. As a by-product, the homopolymer lengths q_i as measured in the chroma are obtained.

A few restrictions apply:

· A sequence cannot contain fewer than zero of any base. Thus the best call sequence for {i, j, k, 0} cannot be obtained by extending {i, j, k, -1} with 'T'. Because of this restriction, every recursion must eventually terminate at the empty sequence {0, 0, 0, 0}.

· The constraint on call sequences, S_(n+1) ≠ S_n for all n, means that if the best call sequence for {i-1, j, k, l} ends in 'A', it cannot be extended with 'A', and likewise for the other bases.

· In some cases no extension is possible. For example, {2, 0, 0, 0} cannot be generated by extending {1, 0, 0, 0} with another 'A'. In such cases no call sequence exists.

Similarity scores can be computed incrementally. Since they differ by only one cycle, the score for {i-1, j, k, l} can be reused when computing the score for {i, j, k, l}, and so on. This is done by keeping track of the length of the best call sequence and a running score for each {i, j, k, l}. When examining a possible extension from, say, {i-1, j, k, l} to {i, j, k, l} (i.e. extension with 'A'), only the part of the predicted chroma corresponding to the additional 'A' cycle needs to be computed. It is computed by examining the intervening bases in the call sequence back to the most recent 'A'. Since the best call sequence for {i-1, j, k, l} is known, it is also known how it was obtained; in particular, the measured quantity q of every intervening nucleotide is known. These quantities are summed separately for 'C', 'G' and 'T', back to the most recent 'A', to give the predicted values for the missing cycle of the predicted chroma. The differences (or squared differences, etc.) between these predicted values and the corresponding cycle of the measured chroma are then added to the running score. A normalized score can be obtained by dividing the running score by the length of the call sequence.

Note that to compute the best call sequence for {3, 2, 2, 2}, the scores for {2, 2, 2, 2}, {1, 2, 2, 2} and so on must still be computed. To find the overall best sequence, however, every possibility up to some limit (for example {N, N, N, N}) must be examined systematically, and each would entail recomputing scores all the way back to {0, 0, 0, 0}, so the combinatorial explosion remains. Dynamic programming is a clever way to avoid it.

The algorithm can be arranged so that every score, once computed, is stored in a four-dimensional N × N × N × N matrix for reuse. Thus, when the best call sequence for {3, 2, 2, 2} is computed, the scores for {2, 2, 2, 2}, {1, 2, 2, 2} and so on are stored in the matrix. When the score for, say, {2, 2, 2, 2} is needed again later, the recursion is avoided entirely and the previously computed result is simply retrieved from the matrix. This makes execution highly efficient: instead of examining approximately 3^(4N) possible call sequences, only N^4 possibilities need to be examined. For example, in a practical system with N = 20, the problem is reduced from about 10^38 computations to 160 000, turning an infeasible algorithm into an efficient one.

The longest sequence that can be reliably called by the algorithm disclosed herein is one having N homopolymers of one of the bases, more than N of one base and fewer than N of the other bases. This is evident from the fact that when one stop base exceeds N the sequence can still be called, since the missing base must fill the gaps left by the other three. When a second base exceeds N, however, the gaps left by the remaining bases can no longer be filled unambiguously. The limit is not absolute; partial sequences can still be obtained from the full chroma.

Depending on the application, one may choose to report (among others) the best sequence for any {i, j, k, l} up to {N, N, N, N}, the best sequence for {N, N, N, N}, or the best sequence in which one index equals N. The last option is used in the example below. The choice depends on factors such as whether read length is preferred over accuracy and whether partial sequences are acceptable.
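The phase I procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the inventor's implementation: the chroma is a dictionary keyed by (cycle, stop, measured) with cycles numbered from 1, the error score is a sum of squared differences, the result reported is the best-scoring sequence in which one index equals N (ties broken in favor of the longer sequence), and the input is generated by a noise-free simulator. The names `simulate_chroma`, `call_bases` and `expand` are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def simulate_chroma(strand, n_cycles):
    """Ideal, noise-free chroma keyed by (cycle, stop, measured)."""
    chroma = {}
    for stop in BASES:
        pos = 0
        for cycle in range(1, n_cycles + 1):
            counts = dict.fromkeys(BASES, 0)
            while pos < len(strand) and strand[pos] != stop:   # intervening bases
                counts[strand[pos]] += 1
                pos += 1
            while pos < len(strand) and strand[pos] == stop:   # stop-base run
                counts[stop] += 1
                pos += 1
            for measured in BASES:
                chroma[(cycle, stop, measured)] = counts[measured]
    return chroma


def call_bases(chroma, n_cycles):
    """Return the best call sequence as a list of (base, q); needs n_cycles >= 1."""
    # best[(i, j, k, l)] = (running error score, calls); calls is a tuple of (base, q).
    best = {(0, 0, 0, 0): (0.0, ())}
    # product() yields base counts in lexicographic order, so each predecessor
    # (one index lower) is processed before any state that extends it.
    for counts in product(range(n_cycles + 1), repeat=4):
        candidates = []
        for b, base in enumerate(BASES):
            prev = counts[:b] + (counts[b] - 1,) + counts[b + 1:]
            if prev not in best:                  # negative index, or no call sequence
                continue
            score, calls = best[prev]
            if calls and calls[-1][0] == base:    # constraint S_(n+1) != S_n
                continue
            cycle = counts[b]
            predicted = dict.fromkeys(BASES, 0.0)
            for called, q in reversed(calls):     # back to the most recent `base`
                if called == base:
                    break
                predicted[called] += q
            error = sum((predicted[m] - chroma[(cycle, base, m)]) ** 2
                        for m in BASES if m != base)
            q_new = chroma[(cycle, base, base)]   # measured homopolymer length
            candidates.append((score + error, calls + ((base, q_new),)))
        if candidates:
            best[counts] = min(candidates, key=lambda c: c[0])
    finals = [v for counts, v in best.items() if max(counts) == n_cycles]
    _, calls = min(finals, key=lambda v: (v[0] / len(v[1]), -len(v[1])))
    return list(calls)


def expand(calls):
    """Round each q to the nearest integer and spell out the sequence."""
    return "".join(base * round(q) for base, q in calls)
```

Storing one entry per base count in `best` plays the role of the four-dimensional matrix: each of the (N + 1)^4 states is evaluated once, with at most four candidate extensions.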

Base calling phase II: least squares (optional)

The result of phase I is a call sequence S_0, S_1, ..., S_n with corresponding homopolymer lengths q_0, q_1, ..., q_n. This can be written out in the conventional way by rounding each q to the nearest integer and spelling out the resulting DNA sequence. However, the chroma contains more information, which can be exploited to find better estimates of q_i. After all, the homopolymer length taken from each stop base is a single measurement, whereas every position in the call sequence has in fact been measured four times (once for each stop base).

An example makes this clear. Consider the following sequence:

ACGCATCAAAGCCTTACACGGTAAGCATCATC

The 'AAA' triplet at position 8 of this sequence is measured directly in the third step of the chroma of A, giving an approximate figure such as 3.43. If measurement errors are large, it may be difficult to be sure in every case how to round the measured quantity to an integer.

However, the 'AAA' triplet also contributes to the fourth step of the chroma of C, the second step of the chroma of G and the second step of the chroma of T. In two of these cases (the chromas of C and T) the triplet is in fact measured on its own, whereas in the third (the chroma of G) it is measured together with the preceding single A. Suppose that the corresponding measurements in the chromas of A, C, G and T are 3.43, 3.1, 4.2 and 2.9 respectively. These additional measurements can be used to reduce the effect of random measurement error.

Consider again the homopolymer lengths q_0, q_1, ..., q_n. Instead of accepting the raw figures obtained in phase I, a set of simultaneous equations can be formed that captures the additional information about q. The triplet above is q_8, since it is the eighth homopolymer. Likewise, the A preceding it is q_5. From the previous paragraph, the following can now be written:

q_8 = 3.43 (from the chroma of A)

q_8 = 3.1 (from the chroma of C)

q_5 + q_8 = 4.2 (from the chroma of G)

q_8 = 2.9 (from the chroma of T)

The same can be done for every position in the call sequence. The resulting system of simultaneous equations can be solved using, for example, least-squares optimization, and the solution gives the set of homopolymer lengths q_0, q_1, ..., q_n that best fits all the measurements in the chroma.
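The four equations above can be solved in the least-squares sense with a few lines of code. The sketch below uses the normal equations and Gaussian elimination in pure Python; the function name `least_squares` is illustrative, and the solver assumes the system has full column rank. In a complete system q_5 would appear in further equations of its own; here it is constrained only by the measurement from the chroma of G.

```python
def least_squares(rows, rhs):
    """Minimize ||A q - b||^2 by solving the normal equations (A^T A) q = A^T b."""
    n = len(rows[0])
    # Augmented normal-equation matrix [A^T A | A^T b].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * b for r, b in zip(rows, rhs))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Unknowns in the order (q_5, q_8); one row per measurement listed in the text.
rows = [[0, 1],   # q_8       = 3.43 (chroma of A)
        [0, 1],   # q_8       = 3.1  (chroma of C)
        [1, 1],   # q_5 + q_8 = 4.2  (chroma of G)
        [0, 1]]   # q_8       = 2.9  (chroma of T)
rhs = [3.43, 3.1, 4.2, 2.9]
q5, q8 = least_squares(rows, rhs)
print(round(q5, 2), round(q8, 2), round(q5), round(q8))
```

The fitted values are close to q_5 ≈ 1.06 and q_8 ≈ 3.14, which round to 1 and 3, in agreement with the single A and the AAA triplet in the example sequence.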

Example of an error-tolerant base calling algorithm

The table below shows simulated results of chroma sequencing, with 10 cycles for each stop nucleotide, of the template

ATGGAGCAGCGTCATTCCTTAGCGGGCAACTGTGACGATGGTGAGAAGTCAGAAAGAGAGGC

TCAGGGATTCGAGCATCGGACCTGTATGGACTCTGGGGA

(given as the sequenced strand).

Each group shows the chroma for the indicated stop nucleotide, each row shows the (simulated) measurements, in units of one base, obtained for the nucleotide indicated on the left, and each column is one cycle, consisting of the addition of the three intervening nucleotides followed by the addition of the single stop nucleotide. For example, the four numbers in bold (the first cycle of the group for A) are the measurements obtained in the first cycle of the chroma with dATP as the stop nucleotide. Since the template begins with A, only A gives a signal significantly different from zero.

Stop nucleotide A
Cycle     1      2      3      4      5      6      7      8      9     10
A      0.78   1.09   1.07      1   1.03   2.01   0.86   1.17   1.03   1.99
C     -0.19  -0.14   0.81   2.07   1.95   2.08   1.17   1.21  -0.11   0.01
G       0.2   2.17   1.09   1.86   0.02   3.96   1.91   1.01   3.05   0.96
T      0.07   0.86   0.03   1.31   3.57  -0.14   2.19   0.09    2.1   0.08

Stop nucleotide C
Cycle     1      2      3      4      5      6      7      8      9     10
A         2   1.05    0.2   1.01   0.94  -0.06   1.91   1.08   4.08   5.85
C      0.96      1   0.98   1.95   0.92   1.04    1.1   0.99   1.05   1.14
G      2.95   1.01   0.73   0.03    0.9   3.05   0.12   2.03   5.86   4.99
T      1.04   0.15   0.95   2.02   1.99   0.02  -0.03   2.14   3.07   0.12

Stop nucleotide G
Cycle     1      2      3      4      5      6      7      8      9     10
A      0.95   1.01   1.15   0.01   2.08  -0.01   2.17   0.01   1.14   1.13
C      0.06   0.02   1.01   1.11      3   1.08   2.12   0.07   1.16   0.09
G      2.06   0.87      1   1.06   0.97   2.98   1.08   0.92   0.99   2.02
T      1.08  -0.13   0.06  -0.08   5.03  -0.03   1.16   0.88   0.04   0.95

Stop nucleotide T
Cycle     1      2      3      4      5      6      7      8      9     10
A      0.97   2.02   1.06      0   3.05  -0.06   1.91   0.02   2.94   6.11
C     -0.07   2.01   0.81   1.91   3.16   0.06    0.9   0.07   -0.1   2.24
G     -0.06   4.84  -0.14   -0.2   3.97   0.96   2.01   2.06   2.94   5.37
T      1.04   0.93   2.25   2.03   1.19   0.84   0.91   0.96   0.93   0.61

Base calling with the dynamic programming algorithm described above identified the following call sequence (shown without homopolymer lengths): ATGAGCAGCGTCATCTAGCGCACTGTGACGATG, which is correct. Expanding the homopolymers by rounding to the nearest integer gives ATGGAGCAGCGTCATTCCTTAGCGGGCAACTGTGACGATGG, which is again correct and covers 41 bp of the template. Thus, with only 10 cycles of chroma sequencing, and in the presence of appreciable measurement error (10% CV in this example), 41 base pairs of sequence information can be obtained.

To evaluate the error tolerance of the algorithm, a series of 100 simulations was run on the given template with random noise equivalent to 10% CV. All 100 call sequences and all 100 expanded sequences were correct. Of these, 59 were 41 bp long, while the remainder included one additional T from the template. The algorithm shown is thus productive and error-tolerant in the face of experimental variation.

Nucleotide addition schemes

In SBS it has always been assumed that nucleotides must be added one at a time, or at least must be forced to incorporate one at a time, as in BASS. However, as shown above, other nucleotide addition schemes can be used to obtain DNA sequence, and some are better suited to avoiding the limitations of SBS (such as loss of synchrony). In this section we examine all possible nucleotide addition schemes and show that the conventional scheme is in some respects the least favorable.

A nucleotide addition scheme is a rule for adding nucleotides to an SBS reaction. It consists of successive steps, each comprising the addition of one or more nucleotides. In this section we disregard any nucleotide that is added purely as an inhibitor or that cannot be incorporated for some other reason. We also refer to any nucleotide capable of base pairing with adenine as "T" (and similarly as G, C and A for nucleotides pairing with cytosine, guanine and thymine). In particular applications, analogs or derivatives of the natural nucleotides may be used, but for sequencing purposes it is their base-pairing capacity that determines the logic of the nucleotide addition scheme. Nucleotide analogs or derivatives with multiple base-pairing capacities can be denoted "AC", "GCT" and so on to indicate this.

A cyclic scheme is a nucleotide addition scheme that repeats a basic pattern. A cyclic scheme with restarts is a nucleotide addition scheme that repeats a basic pattern and is then restarted with fresh primer using a variant of the basic pattern. A natural scheme is one in which no base is repeated until all four bases have been added.

Among natural cyclic schemes, "4" denotes the scheme in which all four nucleotides are added in the first step; it is degenerate and cannot be used for sequencing.

Scheme "1-1-1-1" is the conventional scheme, used by all previously disclosed SBS methods. Note that even BASS falls into this class: although all four nucleotides may be added simultaneously, the cleavable blocking groups force them to be incorporated one at a time.

Scheme 1-1-1-1 is the least productive scheme. This follows from the fact that after each productive step the next nucleotide on the template may be any of three possibilities (namely the three that differ from the base just sequenced), yet only a single base is added. As a result, it is the scheme most affected by loss of synchrony.

The method according to the invention is scheme 3-1, as disclosed herein. It is a fully productive scheme (every step is guaranteed to incorporate nucleotides, since the nucleotide missing from a given step is added in the following step). There are four variants of 3-1, obtained by choosing the single nucleotide from among A, C, G and T. As shown above, the four variants can be used to reconstruct the target sequence.
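The difference in productivity between the two schemes can be illustrated with a small idealized simulation. The sketch assumes perfect, fully synchronized incorporation and treats the input string as the strand being synthesized; the function name `run_scheme` and the representation of a scheme as a list of nucleotide sets are illustrative choices.

```python
def run_scheme(strand, pattern, n_steps):
    """Return the number of bases incorporated in each step of a cyclic scheme.

    `pattern` lists the nucleotides offered in each step of the basic pattern,
    e.g. ["A", "C", "G", "T"] for 1-1-1-1 or ["ACG", "T"] for 3-1 with stop T.
    """
    pos, incorporated = 0, []
    for step in range(n_steps):
        offered = pattern[step % len(pattern)]
        start = pos
        # Extension continues for as long as the next base is among those offered.
        while pos < len(strand) and strand[pos] in offered:
            pos += 1
        incorporated.append(pos - start)
    return incorporated


strand = "ACGCTACGCATCAGACTC"
print(run_scheme(strand, ["A", "C", "G", "T"], 8))   # scheme 1-1-1-1
print(run_scheme(strand, ["ACG", "T"], 8))           # scheme 3-1, stop nucleotide T
```

On this 18-base example, eight steps of 1-1-1-1 incorporate five bases, with several unproductive steps, whereas eight steps of 3-1 traverse the whole strand.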

Scheme 2-2 is another scheme that can be fully productive. It has only three variants, corresponding to AC-GT, AG-CT and AT-GC; all other combinations are simple reversals of these.

What is the minimum requirement for a scheme to guarantee that the original sequence can always be reconstructed (possibly with restarts)? Essentially, all that is required is that every homopolymer in the target sequence be separable from its two neighbors. In other words, every homopolymer must be part of at least one nucleotide incorporation step that excludes its left-hand neighbor and of one that excludes its right-hand neighbor. In scheme 1-1-1-1 every single step has this property, so the sequence can always be reconstructed.

In scheme 3-1, restarting with all four possible variants ensures that every homopolymer is part of a step that includes no other nucleotide. In principle only three of the four variants are strictly required, since three of the bases would then each be added alone in some step, which automatically distinguishes them from the fourth. Scheme 3-1 thus generates redundant information that is absent from scheme 1-1-1-1 and that can be used to improve base calling in the face of experimental noise (for example by dynamic programming, as shown above). It is therefore not only more productive than 1-1-1-1 but also more error-tolerant.

Scheme 2-2, over three restarts, also generates sufficient information to call the sequence. It is easy to see that every pair of nucleotides is separated in at least one of AC-GT, AG-CT and AT-GC. Scheme 2-2 is thus probably the most economical fully productive scheme, although the additional information generated by 3-1 may be worth the effort. Some redundancy remains (provided the nucleotides carry different labels); the error tolerance of scheme 2-2 therefore lies between those of 1-1-1-1 and 3-1.
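The pairwise separation property stated above is easy to check exhaustively. The sketch below is a simplified check at the level of nucleotide pairs, not of homopolymers in a particular target sequence; the names `separated` and `all_pairs_separable` and the way variants are written are illustrative.

```python
from itertools import combinations

# Each variant is written as the two nucleotide sets added in alternating steps.
VARIANTS_2_2 = [("AC", "GT"), ("AG", "CT"), ("AT", "GC")]
VARIANTS_3_1 = [("CGT", "A"), ("AGT", "C"), ("ACT", "G"), ("ACG", "T")]


def separated(x, y, variant):
    """True if nucleotides x and y are added in different steps of this variant."""
    return any((x in step) != (y in step) for step in variant)


def all_pairs_separable(variants):
    """True if every pair of nucleotides is separated by at least one variant."""
    return all(any(separated(x, y, v) for v in variants)
               for x, y in combinations("ACGT", 2))


print(all_pairs_separable(VARIANTS_2_2))       # the three 2-2 variants
print(all_pairs_separable(VARIANTS_3_1[:3]))   # three of the four 3-1 variants
```

Both checks succeed, consistent with the statements that the 2-2 variants separate every pair and that three of the four 3-1 variants are sufficient in principle.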

Irregular (non-cyclic) schemes may also be useful in particular circumstances. For example, when part of the sequence is known, an irregular scheme can be used to pass through regions of no interest more quickly than would otherwise be possible, or to generate even more redundant data in order to reduce base calling errors further.

In summary, our survey of nucleotide addition schemes shows that 3-1 is the most productive and error-tolerant, whereas, somewhat surprisingly, the conventional scheme 1-1-1-1 is the least productive and the most error-prone.

Signature sequencing

In another embodiment, an aspect of the invention useful for signature sequencing provides a method (Scheme III) comprising:

1. Providing a single-stranded template with an annealed primer.

2. Adding three nucleotides, one of which carries a label such as a fluorescent label.

3. Optionally adding one or more non-incorporable inhibitor nucleotides (other than the labeled nucleotide). Examples include nucleoside 5'-di- and monophosphates and nucleoside 5'-(α,β-methylene)triphosphates.

4. Incubating with a suitable polymerase under conditions that result in the addition of nucleotides to the growing strand.

5. Detecting the presence and amount of labeled nucleotide.

6. Inactivating the label, for example by photobleaching (not necessary in every cycle).

7. Adding the remaining nucleotide and incubating with a polymerase (not necessarily the same as in step 4) under conditions that result in the addition of nucleotides to the growing strand.

8. Repeating steps 2-7 until the desired number of cycles has been completed.

For example, fluorescent dC and unlabeled dA and dG can be used in step 2, with dT added in step 7. Step 4 then adds any number of dA, dG and dC up to the first occurrence of dA in the template, where it stops because no complementary dT nucleotide is present. The fluorescence reading in step 5 reveals the presence or absence of dC between each pair of dTs. The sequence obtained can generally be written as a string of binary digits indicating, for each successive pair of Ts, whether or not one or more Cs lie between them.

For example, the sequence ACGCTACGCATCAGACTC is written as 1111 and the sequence ACTCAGCTATATT as 11000. In general, such a sequence contains information equivalent to 1/2 base pair per cycle. Twenty-four cycles would be equivalent to a 12 bp signature sequence, which would be unique in, for example, the human transcriptome. Existing sequence databases and sequence alignment algorithms can readily be adapted to accommodate such binary signatures for analysis.
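The binary signatures in these two examples can be reproduced with a short sketch. It assumes that the string is the strand being synthesized, that one digit is emitted per stretch ending in a T, and that a non-empty stretch after the last T also yields a digit (an assumption made so that both examples in the text are matched). The function name `binary_signature` is hypothetical.

```python
def binary_signature(strand, labeled="C", stop="T"):
    """One digit per stretch between stop bases: 1 if it contains the labeled base."""
    segments = strand.split(stop)
    # A strand ending in the stop base leaves an empty trailing piece; drop it.
    if segments and segments[-1] == "":
        segments.pop()
    return "".join("1" if labeled in segment else "0" for segment in segments)


print(binary_signature("ACGCTACGCATCAGACTC"))
print(binary_signature("ACTCAGCTATATT"))
```

The two calls give 1111 and 11000, as in the text. Only the presence or absence of the label is needed for each digit, not its amount.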

Scheme III is particularly easy to carry out, since only qualitative measurements are required. For example, Scheme III may be particularly suitable for sequencing single molecules using fluorescence correlation spectroscopy.

Chroma sequencing using PPi detection

In another embodiment, an aspect of the present invention provides a method (Scheme IV) which comprises monitoring the release of inorganic pyrophosphate (PPi) (see e.g. WO93/23564), as opposed to using labelled nucleotides. Such a method may comprise:

2. Add a set of insertion nucleotides (i.e. more than one but fewer than all four of the possible nucleotides).

3. Optionally, add one or more non-incorporating inhibitor nucleotides (different from the insertion nucleotides). Examples include 5'-di- and mono-phosphate nucleotides and 5'-(α,β-methylene)triphosphate nucleotides.

4. Incubate with a suitable polymerase under conditions that result in the addition of nucleotides to the growing strand, while monitoring incorporation (e.g. as described in WO93/23564).

5. Add the set of terminating nucleotides and incubate with a polymerase (not necessarily the same as in step 4) under conditions that result in the addition of nucleotides to the growing strand, while monitoring incorporation (e.g. as described in WO93/23564).

6. Repeat steps 2-5 until the desired number of cycles has been completed.
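The order of additions in Scheme IV can be sketched as a small schedule generator. The function name is hypothetical, and modelling the terminating set as the complement of the insertion set is an assumption made for illustration; step 3 (inhibitor nucleotides) and the PPi monitoring itself are not modelled.

```python
from typing import FrozenSet, Iterator, Tuple

NUCLEOTIDES = frozenset("ACGT")

def scheme_iv_schedule(
    terminating: str, cycles: int
) -> Iterator[Tuple[int, str, FrozenSet[str]]]:
    """Yield (cycle, step_name, nucleotides_added) for every addition step."""
    term = frozenset(terminating.upper())
    if not term or not term < NUCLEOTIDES:
        raise ValueError("terminating set must be a non-empty proper subset of ACGT")
    insertion = NUCLEOTIDES - term  # assumption: the two sets are complementary
    if len(insertion) < 2:
        raise ValueError("the insertion set must hold more than one nucleotide")
    for cycle in range(1, cycles + 1):
        yield cycle, "insertion", insertion   # step 2 of the scheme
        yield cycle, "terminating", term      # step 5 of the scheme

for cycle, name, added in scheme_iv_schedule("A", 2):
    print(cycle, name, "".join(sorted(added)))
```

With `"A"` as the terminating nucleotide, each cycle adds C, G and T together and then A alone.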

Again, the scheme can be repeated using each of the four natural nucleotides as the terminating nucleotide. Compared with standard pyrosequencing, this scheme provides a four-fold increase in read length without modification of the standard protocol (other than changes to the order of nucleotide addition and to the required base introduction).

The following example shows the importance of loss of synchrony and the effect of using a chroma sequencing scheme. It shows the results for a target DNA sequenced by both pyrosequencing and chroma sequencing. It is assumed that a fixed fraction of all templates loses synchrony at each incorporation step. In SBI, a step is the addition of a single base. In jump sequencing, a step is the addition, in alternation, of three bases or of one base. In addition, chroma sequencing is restarted three times with fresh primer, so that each of the four natural nucleotides is used in turn as the terminating nucleotide.

Target sequence (for each terminating nucleotide, the last nucleotide reached by chroma sequencing is shown in capital letters):

atggagcagc gtcattcctt agcgggcaac tgtgacgatg gtgagaagtc

agaaagagag gctcaGGGat tcgagcatcg gacctgtAtg gactctgggg

atccTTcctt tgggCaaaat gatcccccta ccattttgcc cattactgct

Pyrosequencing

Terminated at 40 steps owing to loss of synchrony

40 reaction steps

Round   Reaction   Result
1       a c g t    a - - t
2       a c g t    2g - - -
3       a c g t    a - g -
4       a c g t    - c - -
5       a c g t    a - g -
6       a c g t    - c - -
7       a c g t    - - g t
8       a c g t    - c - -
9       a c g t    a - - 2t
10      a c g t    - 2c - 2t

Total sequence: 20 bp

Chroma sequencing

Terminated at 40 steps owing to loss of synchrony

160 reaction steps (i.e. 40 for each terminating base)

Reaction: [cgt] [a], repeated 20 times
Result:   - a | t2g a | gc a | 2g2ct a | 4t2c a | ...etc...

...restart with [gta c], [tac g] and [acg t] and repeat...

Total sequence: 88 bp + 27 bp partial sequence

In summary, chroma sequencing circumvents the loss-of-synchrony problem and achieves read lengths more than four times longer.
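The comparison above can be explored with a small, idealised simulation. This is only a sketch under stated assumptions: every added nucleotide is incorporated completely (homopolymer runs included), synchrony loss is not modelled, and the sequence is written as the strand being synthesised, as in the listing above. The function names are hypothetical, and the totals quoted in the example are not guaranteed to be reproduced.

```python
def composition(run: str) -> str:
    """Summarise one step's incorporation, e.g. 'tgg' -> 't2g', '' -> '-'."""
    if not run:
        return "-"
    counts = {}
    for base in run:
        counts[base] = counts.get(base, 0) + 1
    # dicts keep insertion order, so bases appear in order of first occurrence
    return "".join((str(n) if n > 1 else "") + base for base, n in counts.items())

def extend(seq: str, pos: int, allowed: set) -> int:
    """Advance through seq while the next base is among the nucleotides added."""
    while pos < len(seq) and seq[pos] in allowed:
        pos += 1
    return pos

def pyro_steps(seq: str, steps: int, order: str = "acgt"):
    """One nucleotide per step, cycling through `order`."""
    pos, results = 0, []
    for i in range(steps):
        end = extend(seq, pos, {order[i % len(order)]})
        results.append(composition(seq[pos:end]))
        pos = end
    return pos, results

def chroma_steps(seq: str, steps: int, terminator: str):
    """Alternate the three other nucleotides with the terminating nucleotide."""
    others = set("acgt") - {terminator}
    pos, results = 0, []
    for i in range(steps):
        end = extend(seq, pos, others if i % 2 == 0 else {terminator})
        results.append(composition(seq[pos:end]))
        pos = end
    return pos, results

head = "atggagcagcgtcattcctta"  # first 21 bases of the target sequence
print(chroma_steps(head, 10, "a"))
print(pyro_steps(head, 10))
```

For the first ten chroma steps with `a` as terminating nucleotide, the sketch gives the same step results as the start of the result row above (`-`, `a`, `t2g`, `a`, `gc`, `a`, `2g2ct`, `a`, `4t2c`, `a`), covering 21 bases.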

Solid-phase chroma sequencing

To automate and parallelise the method, embodiments of the present invention provide two main approaches.

The first approach uses arrayed or otherwise ordered templates, and is suitable when a large number of templates must be sequenced while preserving their identity.

The second approach uses random attachment to a solid support, and can be used when a large number of sequences must be obtained at random from a library.

A method according to an embodiment of one aspect of the invention, for sequencing arrayed templates (Scheme V), comprises:

1. Provide a solid support presenting a number of active regions or an active surface, each capable of binding template molecules, wherein the binding is

a. direct, or

b. indirect, via a bound primer or adapter that hybridises to, or otherwise has affinity for, the template.

2. Add single-stranded template to each active region or to the active surface, keeping track of which template is placed at each position. Each region will then consist of a large number of identical ssDNA templates, as in a spotted microarray.

3. Optionally, add primer (or use the adapter from the solid support).

4. Sequence all templates in parallel according to the invention, e.g. according to any of Schemes I-IV.

5. Obtain the sequence for each identified template.

The adapter (step 1b) need not be the same in all active regions. Different adapters can be used to fish out specific templates from a complex mixture, which offers the possibility of sequencing subset libraries.

The throughput of Scheme V is limited by the resolution of the device used to deposit the templates. With standard microarray instruments, densities of several thousand templates per square centimetre are possible.

When higher throughput is required and template density is not critical, another approach can be used.

Another embodiment of an aspect of the invention is provided as a method (Scheme VI) which comprises:

1. Provide a solid support carrying at least partially single-stranded template molecules attached at random positions (preferably at a density suited to the detection instrument), optionally amplifying each template so as to hold multiple copies of the target sequence, which are either attached to, or in close proximity to, the original template (at least closer than to any other template molecule).

2. Sequence the templates in parallel using the invention, e.g. according to any of Schemes I-IV, detecting the labelled nucleotides in parallel.

There are many ways of providing amplified templates at high density. For example, rolling circle amplification can be used as follows:

a. Provide a surface (e.g. glass) with attached primer, preferably attached via a covalent bond; as an alternative to a covalent bond, a very strong non-covalent bond (such as biotin/streptavidin) may be used.

b. Add circular template, preferably at a density suited to the detection instrument.

c. Anneal the template to the primer.

d. Amplify by rolling circle amplification to generate, at each position, a long single-stranded tandem-repeat template attached to the surface.

Lizardi et al. describe this in "Mutation detection and single-molecule counting using isothermal rolling circle amplification", Nature Genetics vol 19, p. 225.

Modifications of this method include providing a reverse primer to generate additional replication forks and so increase product yield. Alternatives to RCA include solid-phase PCR (Adessi et al., "Solid phase DNA amplification: characterisation of primer attachment and amplification mechanisms", Nucleic Acids Research 2000; 28(20): e87) and in-gel PCR ("polonies", US6485944 and Mitra RD, Church GM, "In situ localized amplification and contact replication of many individual DNA molecules", Nucleic Acids Research 1999; 27(24): e34).

A "suitable density" is preferably one that maximises throughput, e.g. a limiting dilution that ensures that as many detectors (or pixels in a detector) as possible detect a single template molecule. On any regular array, a perfect limiting dilution will leave 37% of all positions holding a single template (owing to the form of the Poisson distribution); the remaining positions will hold none or more than one.
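The 37% figure follows from the Poisson distribution: the probability of exactly one template per position is largest when the mean occupancy is one, where it equals 1/e. A minimal sketch (the function name is illustrative):

```python
import math

def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k templates at a position for a given mean occupancy."""
    return mean ** k * math.exp(-mean) / math.factorial(k)

mean = 1.0  # perfect limiting dilution: one template per position on average
single = poisson_pmf(1, mean)
empty = poisson_pmf(0, mean)
multiple = 1.0 - single - empty

print(round(single, 3), round(empty, 3), round(multiple, 3))

# Diluting further or less far both lower the single-template fraction.
for other in (0.5, 2.0):
    print(other, round(poisson_pmf(1, other), 3))
```

At a mean of one, about 37% of positions hold a single template, about 37% are empty and about 26% hold more than one.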

For example, on a Typhoon 9200 with a 25 μm pixel size, a 35 × 43 cm reaction chamber holds 240 million pixels. With limiting dilution (Poisson distribution), 37% of these will hold a single template, i.e. 89 million templates. Sequencing 50 bases on each template produces 1.7 Gb of sequence in 50 cycles. With a scan time of 45 minutes, the daily throughput is about 3 Gbp, equivalent to the entire sequence of the human genome.
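The pixel and template counts in this example can be reproduced directly from the chamber dimensions and the pixel size. The sketch below covers only those two figures and the number of scans that fit in a day; the function and variable names are illustrative.

```python
def pixel_count(width_cm: float, height_cm: float, pixel_um: float) -> int:
    """Number of square pixels of side pixel_um covering a width x height area."""
    pixels_per_cm = 10_000 / pixel_um  # 1 cm = 10,000 um
    return round(width_cm * pixels_per_cm) * round(height_cm * pixels_per_cm)

SINGLE_TEMPLATE_FRACTION = 0.37  # limiting-dilution figure quoted in the text

pixels = pixel_count(35, 43, 25)
single_templates = round(pixels * SINGLE_TEMPLATE_FRACTION)
scans_per_day = (24 * 60) // 45  # 45-minute scans

print(pixels)            # 240800000
print(single_templates)  # 89096000
print(scans_per_day)     # 32
```

If one further assumes roughly one base read per template per scan (an assumption for illustration, not a figure from the text), 32 scans of about 89 million templates come to roughly 2.85 Gb per day, in line with the quoted figure of about 3 Gbp.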

Templates suitable for solid-phase RCA should optimise yield (in terms of the number of copies of the template sequence) while providing a sequence suited to the downstream application. In general, small templates are preferred. In particular, the template may consist of a 20-25 bp primer binding sequence and a 40-150 bp insert. The primer binding sequence can be used both to initiate RCA and to prime the sequencing reaction, or the template may contain a separate sequencing primer binding site. The insert should be as small as possible while remaining long enough to accommodate the desired sequence. For example, if 10 cycles of sequencing are performed with a single terminating nucleotide, an average of 40 bases will be probed, so the template must exceed 40 bases by a sufficient margin to keep clear of the sequencing primer binding sequence.
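The 40-base estimate corresponds to a simple rule of thumb: if each of the four nucleotides is equally frequent, a given terminating nucleotide occurs on average once every four bases. The sketch below encodes only that rule; the function name and the equal-frequency default are assumptions, and homopolymer runs of the terminating nucleotide would raise the figure somewhat.

```python
def expected_bases_probed(cycles: int, terminator_frequency: float = 0.25) -> float:
    """Rule-of-thumb advance: one terminating nucleotide every 1/frequency bases."""
    if not 0 < terminator_frequency <= 1:
        raise ValueError("terminator_frequency must be in (0, 1]")
    return cycles / terminator_frequency

def insert_is_long_enough(insert_bp: int, cycles: int) -> bool:
    """True if the insert exceeds the expected number of bases probed."""
    return insert_bp > expected_bases_probed(cycles)

print(expected_bases_probed(10))      # 40.0
print(insert_is_long_enough(80, 10))  # True
print(insert_is_long_enough(40, 10))  # False
```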

To increase the signal generated by rolling-circle-amplified templates, it may be necessary to condense them. Because an RCA product is essentially a single-stranded DNA molecule consisting of as many as 1000 or even 10 000 tandem copies of the original circular template, the molecule will be very long. For example, a 100 bp template amplified 1000-fold by RCA will be of the order of 30 μm long, so its signal will extend across several different pixels (assuming a pixel resolution of 5 μm). Using a lower-resolution instrument may not help, because the sparse ssDNA product occupies only a very small part of a 30 μm pixel area and may therefore be undetectable. It is thus desirable to be able to condense the signal into a smaller area.
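The order-of-magnitude length can be estimated from the number of nucleotides in the product. The sketch below assumes a rise of 0.34 nm per nucleotide, the textbook value for B-form duplex DNA; single-stranded DNA is flexible and its extension per nucleotide varies, so the value is only a rough stand-in and the function name is illustrative.

```python
def rca_product_length_um(
    template_nt: int, copies: int, rise_nm_per_nt: float = 0.34
) -> float:
    """Approximate contour length of a tandem-repeat RCA product in micrometres."""
    return template_nt * copies * rise_nm_per_nt / 1000.0

length_um = rca_product_length_um(100, 1000)
pixels_spanned = length_um / 5  # 5 um pixel resolution, as in the text

print(round(length_um, 1))       # 34.0
print(round(pixels_spanned, 1))  # 6.8
```

A fully extended product of this size would therefore span several 5 μm pixels, which is why condensing it is useful.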

In Lizardi et al. (cited above), the RCA product is condensed by using epitope-labelled nucleotides and a multivalent antibody as a cross-linker. In a further aspect, the present invention provides a simple alternative, which is particularly convenient when the material originally to be sequenced is double-stranded DNA.

Regarding template preparation for methods according to the invention, and as a further aspect of the invention, a dsDNA template (which may be short, e.g. 80 bp) is ligated to adapter oligonucleotides carrying hairpin loops, to form a pseudo-double-stranded circular structure, or dumbbell. In such a structure, the primer binding sites for both RCA and the subsequent sequencing reaction can be placed in the hairpin loops. To avoid sequencing both strands at the same time, different primers are used for RCA and for sequencing, which ensures that only templates with a different hairpin loop at each end will be sequenced. Thus, only templates with at least one RCA primer binding site will be amplified, and only those with at least one sequencing primer binding site will be sequenced.

Because the RCA product of such a template is partially double-stranded along its whole length, it folds back into a zigzag structure and condenses into a smaller area. However, because the primer binding sites are exposed as single-stranded DNA along its whole length, primer access is not a problem. The examples below show that such templates form products of ~5-10 μm after RCA.
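The dumbbell template described above can be pictured as a single circular strand: the top strand of the insert, one hairpin loop, the bottom strand of the insert (the reverse complement of the top strand), and the second hairpin loop. The sketch below builds that strand as a string written 5' to 3' and opened at the start of the top strand. The function names and the toy sequences are hypothetical; in practice one loop would carry the RCA primer site and the other the sequencing primer site.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def dumbbell(insert_top: str, loop_a: str, loop_b: str) -> str:
    """Circular dumbbell strand, 5' to 3', linearised at the top strand's 5' end.

    loop_a joins the 3' end of the top strand to the 5' end of the bottom
    strand; loop_b joins the 3' end of the bottom strand back to the top strand.
    """
    top = insert_top.upper()
    return top + loop_a.upper() + reverse_complement(top) + loop_b.upper()

def stem_is_complementary(circle: str, stem_len: int, loop_a_len: int) -> bool:
    """Check that the two stem strands of a dumbbell built as above can anneal."""
    top = circle[:stem_len]
    bottom_start = stem_len + loop_a_len
    bottom = circle[bottom_start:bottom_start + stem_len]
    return bottom == reverse_complement(top)

circle = dumbbell("ACGTTG", "TTTT", "GGGG")
print(circle)                               # ACGTTGTTTTCAACGTGGGG
print(stem_is_complementary(circle, 6, 4))  # True
```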

Many different approaches to immobilising oligonucleotides on a surface have been described (see e.g. Lindroos et al., "Minisequencing on oligonucleotide arrays: comparison of immobilisation chemistries", Nucleic Acids Research 2001; 29(13): e69). For example, biotinylated oligonucleotides (oligos) can be attached to streptavidin-coated arrays; NH₂-modified oligos can be covalently attached to epoxysilane-derivatised or isothiocyanate-coated glass slides; succinylated oligos can be coupled via peptide bonds to aminophenyl- or aminopropyl-derivatised glass; and disulphide-modified oligos can be immobilised on mercaptosilanised glass by a thiol/disulphide exchange reaction. Many more have been described in the literature.

Apparatus for automated high-throughput sequencing

Methods according to the invention are particularly well suited to automation, because they can be performed simply by cycling a number of reagent solutions through a reaction chamber placed on or in a detector, optionally with thermal control.

In one example, the detector is a fluorescence scanner, which may, for example, operate by laser excitation, band-pass filtering and photomultiplier tube detection. The ScanArray Express (PerkinElmer) is one such instrument; it scans microscope slides at a resolution of 5 μm/pixel, can detect as few as 2 fluorochromes per pixel, and has a scan time of ~20 minutes (in four colours). The daily sequencing throughput on such an instrument would be up to 1.7 Gbp.

The reaction chamber provides:

· Easy access for the scan head.

· A sealed reaction chamber.

· An inlet for injecting reagents into, and withdrawing them from, the reaction chamber.

· An outlet that allows air and reagents to enter and leave the reaction chamber.

The reaction chamber can be constructed in the standard microarray slide format, as shown in Figure 3, suitable for insertion into a standard microarray scanner such as the ScanArray Express. The reaction chamber can be inserted into the scanner and remain there throughout the sequencing reaction. A pump and reagent flasks (e.g. as shown in Figure 4) supply reagents according to a fixed protocol, and a computer controls both the pump and the scanner, alternating between reaction and scanning. Optionally, the reaction chamber may be temperature-controlled.

A dispenser unit can be connected to a motorised outlet to direct the flow of reagents, with the whole system running under computer control. An integrated system would consist of the scanner, dispenser, outlet and reservoirs, together with the controlling computer.
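The alternation between reaction and scanning described above can be sketched as a simple control loop. Everything here is hypothetical: `dispense` and `scan` stand in for whatever pump and scanner interfaces a real instrument exposes, the wash after each reagent is an assumed detail, and the reagent names are placeholders.

```python
from typing import Callable, List, Sequence

def run_sequencing(
    cycles: int,
    reagents_per_cycle: Sequence[str],
    dispense: Callable[[str], None],
    scan: Callable[[int, int], None],
) -> None:
    """Alternate between reacting and scanning for a fixed number of cycles."""
    for cycle in range(1, cycles + 1):
        for step, reagent in enumerate(reagents_per_cycle, start=1):
            dispense(reagent)  # reaction: pump the reagent into the chamber
            dispense("wash")   # flush unincorporated nucleotides (assumed step)
            scan(cycle, step)  # imaging: record incorporation for this step

# Stand-ins that only record what would have been asked of the hardware.
log: List[tuple] = []
run_sequencing(
    cycles=2,
    reagents_per_cycle=["insertion mix", "terminating mix"],
    dispense=lambda reagent: log.append(("dispense", reagent)),
    scan=lambda cycle, step: log.append(("scan", cycle, step)),
)
print(len(log))  # 12 entries: 2 cycles x 2 steps x (2 dispenses + 1 scan)
```

Passing the pump and scanner in as callables keeps the fixed protocol separate from any particular hardware.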

According to a further aspect of the present invention, there is provided apparatus for performing a method of the invention, the apparatus comprising:

an imaging element capable of detecting incorporated or released label,

a reaction chamber for holding one or more attached templates such that they are accessible to the imaging element at least once per set of steps,

a reagent distribution system for supplying reagents to the reaction chamber.

The reaction chamber may provide, and the imaging element may be able to resolve, attached templates at a density of at least 100/cm², optionally at least 1000/cm², at least 10 000/cm² or at least 100 000/cm².

The imaging element may employ a system or device selected from the group consisting of photomultiplier tubes, photodiodes, charge-coupled devices, CMOS imaging chips, near-field scanning microscopes, far-field confocal microscopes, wide-field epi-illumination microscopes and total internal reflection microscopes.

The imaging element may detect fluorescent labels.

The imaging element may detect laser-induced fluorescence.

In one embodiment of apparatus according to the invention, the reaction chamber is a sealed structure comprising a transparent surface, a lid, and ports for attaching the reaction chamber to the reagent distribution system, wherein the transparent surface holds the template molecules on its inner surface and the imaging element is able to image through the transparent surface.

Example I - In situ template amplification

A circular single-stranded template was prepared by annealing 4 μl of 100 pmol/μl of two 5'-phosphorylated oligonucleotides (TGGTCATCAGCCTTCATGCAACCAAAGTATGAAATAACCAGCGTAATACGACTCACTATAGGGCGTGGTTATTTCATACT and TTGGTTGCATGAAGGCTGATGACCATCCTTTTCCTTACTAGCGTAATACGACTCACTATAGGGCGTAGTAAGGAAAAGGA), adding 2 μl T4 ligation buffer, 0.3 μl T4 DNA ligase (1.5 Weiss units; Fermentas) and 7 μl water, and incubating at 37°C for 1 hour. The ligase was then inactivated by incubation at 65°C for 10 minutes.

Primer A50T7RC (AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCCCTATAGTGAGTCGTATTACGC), carrying a 5'-terminal amino (-NH) moiety, was attached to Greiner silylated microarray slides by incubating 10 μM primer for 5 minutes in 100 μl MOPS (0.2 M, with sodium acetate and EDTA, prepared according to Sambrook et al., 'Molecular Cloning', third edition, Cold Spring Harbor Laboratory Press 2001), reducing for 5 minutes with 2.5 mg NaBH₄ in 1 ml PBS/ethanol (3:1), and then rinsing in 0.2% sodium dodecyl sulphate followed by distilled water.

The dried slides were then incubated for rolling circle amplification using 2 μl dUTP-Cy3 (100 μM final concentration, PerkinElmer), 2 μl each of dTTP, dATP, dCTP and dGTP (1 mM final concentration each, NEB), 4 μl Sequenase buffer, 1 μl Sequenase (13 u, Amersham Biosciences), 4 μl water and 1 μl template. Labelled nucleotides thus made up about 2.5% of all nucleotides. After incubation at 37°C for two hours, the slides were rinsed in water and scanned on a PerkinElmer ScanArray Express. The result was a large number of bright spots, each representing an amplified template. The result also shows that a labelling frequency of 2.5% is easily detected in this format (in fact, many spots saturated the detector).

A magnified view of part of the slide shows that, at an image pixel size of 5 μm, most amplified templates occupy one pixel or a small number of pixels. At this size, a very large proportion of the pixels on the scanner can be used for different template molecules, ensuring maximum throughput. White pixels completely saturated the detector, indicating that less than 2.5% label would be sufficient for detection. Assuming a template of 160 bp, 2.5% label represents approximately 4 incorporated nucleotides per template copy, which is within the range expected for a chroma sequencing reaction.
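The labelling figures quoted in this example follow from the final concentrations given for the reaction mix. The sketch below simply redoes that arithmetic; the variable names are illustrative.

```python
# Final concentrations in the rolling circle amplification mix, in uM.
labelled_um = 100.0        # dUTP-Cy3
unlabelled_um = 4 * 1000.0 # dTTP, dATP, dCTP and dGTP at 1 mM each

pool_fraction = labelled_um / (labelled_um + unlabelled_um)
print(round(pool_fraction * 100, 2))  # 2.44, i.e. about 2.5% of all nucleotides

template_bp = 160
labels_per_copy = template_bp * 0.025  # using the rounded 2.5% figure
print(labels_per_copy)                 # 4.0
```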

Example II - Single-step sequencing reaction

Biotinylated T7 primer (GCGTAATACGACTCACTATAGGGCG) was attached to Greiner streptavidin-coated microarray slides by incubation at 10 pmol/μl in Dynal binding/wash buffer (Dynal, Norway). Wells were formed on the slide by affixing a rubber membrane containing an array of 5 mm wide holes. TOPO2.1 plasmid (Clontech) was boiled, cooled on ice and then added to each well at 20 fmol/μl. After incubation at room temperature for 15 minutes, the slide was washed in binding/wash buffer for 15 minutes.

A reaction mixture containing 4 μl EcoPol buffer, 0.4 μl each of dATP, dTTP and dGTP (100 μM final concentration, NEB), 0.4 μl dUTP-Cy3 (10 μM final concentration, PerkinElmer) and 2 μl Klenow exo- DNA polymerase (NEB), made up to 40 μl with water, was added to two wells, and the same mixture with water in place of Klenow was added to two further wells. After incubation for 10 minutes and two 15-minute washes in binding/wash buffer, the slide was scanned on a Typhoon 9200.

Given the template (Clontech TOPO2.1), the expected result is the incorporation of 2 dTTP. Figure 2 shows the result, which clearly demonstrates incorporation of labelled dTTP, with a signal significantly above background (as given by the fluorescence in the reactions from which Klenow was omitted).

Claims

1. A method of determining sequence and/or base composition information for a nucleic acid, the method comprising:

(i) providing a nucleic acid comprising a first strand, the first strand comprising a nucleic acid template, wherein a free 3' end of a nucleic acid strand annealed to the first strand allows extension of a nucleic acid strand complementary to the nucleic acid template, by template-sequence-dependent incorporation of nucleotides into the strand complementary to the nucleic acid template by a template-dependent nucleic acid polymerase;

(ii) performing a set of one or more steps, the set of one or more steps being performed for a desired number of cycles, or in combination with one or more other sets of steps, to extend the nucleic acid strand complementary to the nucleic acid template, thereby allowing information indicative of the base composition or sequence of the nucleic acid to be obtained,

wherein a step comprises:

(a) in the presence of:

the nucleic acid comprising the first strand, the first strand comprising the nucleic acid template,

the free 3' end of the nucleic acid strand annealed to the first strand comprising the nucleic acid template, and

a template-dependent nucleic acid polymerase;

providing nucleotides selected from one, two, three or four nucleotide complementarity classes for template-dependent incorporation by the nucleic acid polymerase into the nucleic acid strand complementary to the nucleic acid template, wherein each nucleotide is a natural nucleotide or a nucleotide analogue capable of template-dependent incorporation into a nucleic acid strand at the free 3' end of that strand by a nucleic acid polymerase, and wherein, within each nucleotide complementarity class, the nucleotides and nucleotide analogues are complementary to one of adenosine (A), cytosine (C), thymine (T) and guanine (G);

and

(b) removing or inactivating unincorporated nucleotides;

and

wherein within a set of steps:

nucleotides selected from all four nucleotide complementarity classes are provided and are available for template-dependent incorporation,

in at least one step, nucleotides selected from more than one, optionally two, three or four, nucleotide complementarity classes are provided and are available for template-dependent incorporation, and nucleotides of at least one nucleotide complementarity class, if incorporated into the nucleic acid strand complementary to the nucleic acid template, allow further extension of that strand, and

optionally, in more than one step, a nucleotide complementarity class is not provided; and

wherein, if nucleotides selected from all four complementarity classes are provided in one step, nucleotides of one, two or three nucleotide complementarity classes, if incorporated into the nucleic acid strand complementary to the nucleic acid template, prevent further extension of that strand, in all copies present if multiple copies are present;

(iii) performing multiple such sets of steps, by cycling the set of steps and/or by performing it in combination with different sets of steps;

(iv) determining the nature and/or amount of nucleotides incorporated into the nucleic acid strand complementary to the nucleic acid template in at least one set of steps, by determining the nature and/or amount of nucleotides incorporated into that strand in at least one step of each set for which the nature and/or amount of incorporated nucleotides is to be determined.

2. according to the method for claim 1, wherein in one group of step, the Nucleotide that is selected from three kinds or the two kinds complementary types of Nucleotide is provided in first step, and the Nucleotide of choosing from the complementary type of remaining one or both Nucleotide is provided in second step.

3. according to the method for claim 2, comprise the Nucleotide that mixed in first or second step of determining step group or the amount of multiple Nucleotide, to determine the character and/or the amount of the Nucleotide that mixed to described step group.

4. according to the method for claim 3, comprise the amount of the Nucleotide that is mixed in each step of determining group, to the described group of amount that will determine the Nucleotide that mixed.

5. according to the method for claim 4, wherein in one group of step, three kinds of Nucleotide are provided in first step, and a kind of Nucleotide is provided in second step.

6. according to the method for claim 5, comprise the property quality and quantity of the Nucleotide that is mixed in definite first step.

7. according to each method in the claim 2 to 6, wherein the Nucleotide that is provided in first step is differently carried out mark separately.

8. according to each method in the claim 2 to 7, wherein the Nucleotide that is provided in second step is labeled.

9. according to each method in the claim 1 to 8, four kinds of Nucleotide that wherein are complementary to A, C, T and G are differently carried out mark separately.

10. according to the method for claim 7, claim 8 or claim 9, wherein Nucleotide is by fluorescent mark.

11. according to the method for claim 7, claim 8, claim 9 or claim 10, wherein be incorporated into when being complementary in the nucleic acid-templated nucleic acid chains when Nucleotide, the mark of described Nucleotide lost efficacy.

12. according to the method for claim 7, claim 8, claim 9 or claim 10, wherein be incorporated into when being complementary in the nucleic acid-templated nucleic acid chains when Nucleotide, the mark of described Nucleotide is from described Nucleotide cutting or discharge.

13., comprise and determining from being incorporated into the character and/or the amount of the mark that is complementary to the one or more Nucleotide cuttings the nucleic acid-templated nucleic acid chains or discharges according to the method for claim 12.

14. according to each method in the claim 5 to 13, comprise and implement a round-robin step group, wherein in every group of step of this round-robin, three kinds of Nucleotide are provided in first step, and a kind of Nucleotide is provided in second step.

15. method according to claim 14, comprise described nucleic acid is implemented four round-robin step groups, wherein in each circulation, in steps in all second steps of group a kind of Nucleotide of being provided be identical, and a kind of Nucleotide that wherein institute is provided in all second steps of group in steps in each circulation be different from other three round-robin a kind of Nucleotide of being provided in all second steps of organizing in steps.

16. according to each method in the claim 1 to 15, wherein one group of step additionally comprises provides one or more sealing Nucleotide, its termination is mixed Nucleotide in being complementary to nucleic acid-templated nucleic acid chains.

17. according to each method in the claim 1 to 16, wherein one group of step additionally comprises provides the inhibitor of one or more non-mixing property Nucleotide, it suppresses in being complementary to nucleic acid-templated nucleic acid chains mistake and mixes Nucleotide.

18. according to each method in the claim 1 to 17, wherein nucleic acid-templated is thymus nucleic acid (DNA), nucleic acid polymerase is the dependent archaeal dna polymerase of DNA, and Nucleotide is deoxyribonucleotide or deoxyribonucleotide analogue.

19. A method according to any one of claims 1 to 17, wherein the nucleic acid template is deoxyribonucleic acid (DNA), the nucleic acid polymerase is a DNA-dependent ribonucleic acid (RNA) polymerase, and the nucleotides are ribonucleotides or ribonucleotide analogues.

20. A method according to any one of claims 1 to 17, wherein the nucleic acid template is ribonucleic acid (RNA), the nucleic acid polymerase is a reverse transcriptase, and the nucleotides are deoxyribonucleotides or deoxyribonucleotide analogues.

21. A method according to any one of claims 1 to 20, wherein the nucleic acid template is provided in multiple copies.

22. A method according to claim 21, comprising providing multiple copies of the nucleic acid template by means of a nucleic acid amplification reaction.

23. A method according to claim 22, wherein the nucleic acid amplification reaction comprises rolling circle amplification.

24. A method according to claim 23, comprising:

providing a DNA molecule consisting of a stem portion and first and second loop portions, wherein the stem portion consists of a first strand and a second strand, wherein the first strand and second strand are equal in length, complementary and annealed together, and comprise a region for which sequence and/or base composition information is desired, wherein the first loop portion joins the 3' end of the first strand to the 5' end of the second strand and the second loop portion joins the 3' end of the second strand to the 5' end of the first strand, so that the DNA molecule has no free 5' or 3' ends, and wherein one loop portion comprises a primer binding site for rolling circle amplification and one loop portion comprises a primer binding site for sequencing;

performing rolling circle amplification to provide multiple copies of the nucleic acid as the nucleic acid template.

25. A method according to any one of claims 1 to 24, wherein the nucleic acid template is attached to a solid support.

26. A method according to claim 25, wherein a plurality of different nucleic acid templates are attached to the solid support in the form of an array.

27. A method according to claim 25 or claim 26, wherein the nucleic acid template is attached to the solid support by annealing to a primer attached to the solid support.

28. A method according to any one of claims 1 to 27, comprising determining a nucleic acid sequence by analysis of the determinations of the nature and/or amount of nucleotide incorporated into the nucleic acid strand complementary to the nucleic acid template.

29. A method of nucleic acid sequencing by synthesis, characterised in that nucleotides are incorporated in a stepwise fashion, wherein one step permits template-dependent incorporation of more than one different nucleotide.

30. A method according to claim 29, wherein one step permits template-dependent incorporation of three different nucleotides selected from nucleotides complementary to adenosine (A), cytosine (C), thymine (T) and guanine (G), and a different step permits template-dependent incorporation of the remaining nucleotide of this group.
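As a rough illustration of the stepwise incorporation in claims 29 and 30, the following sketch simulates idealised primer extension when three nucleotides are provided together and the remaining one is provided in a separate step. It assumes unblocked nucleotides, complete extension in every step and no misincorporation; the function names are hypothetical, and the sequence is written as the strand being synthesised (5' to 3') rather than as the template.

```python
def extend(new_strand, position, provided):
    """Extend from `position` while the next base to add is among those provided.

    Returns the new position and the number of nucleotides incorporated.
    """
    count = 0
    while position < len(new_strand) and new_strand[position] in provided:
        position += 1
        count += 1
    return position, count

def run_sets_of_steps(new_strand, single, all_bases="ACGT"):
    """Alternate 'three nucleotides' and 'one nucleotide' steps until done.

    Returns one (n_three, n_single) pair of incorporation counts per set of steps.
    """
    three = set(all_bases) - {single}
    position, counts = 0, []
    while position < len(new_strand):
        position, n_three = extend(new_strand, position, three)
        position, n_single = extend(new_strand, position, {single})
        counts.append((n_three, n_single))
    return counts

if __name__ == "__main__":
    # Strand to be synthesised; G is the nucleotide provided on its own.
    print(run_sets_of_steps("ACGGTAG", "G"))
```

Under these assumptions, every set of steps incorporates at least one nucleotide, so extension always progresses, and the per-step counts are the kind of quantity that a label measurement would report.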

31. A computer processor programmed to control a method according to any one of claims 1 to 30.

32. A computer-readable device carrying a program for a computer processor according to claim 31.

33. A computer processor programmed to provide sequence and/or base composition information for a nucleic acid by performing a method according to any one of claims 1 to 30.

34. A computer-readable device carrying a program for a computer processor according to claim 33.

35. A kit suitable for performing a method according to any one of claims 1 to 30, the kit comprising one or more sets of premixed reagents in one or more reagent containers, wherein each set of premixed reagents comprises

nucleotides selected from all four complementarity classes,

wherein at least one container contains nucleotides selected from more than one, optionally two, three or four, complementarity classes, and the nucleotides in at least one nucleotide complementarity class, if incorporated into a nucleic acid strand complementary to a nucleic acid template, allow further extension of the nucleic acid strand complementary to the nucleic acid template, and

wherein, if nucleotides selected from all four complementarity classes are provided in a single container, the nucleotides in one, two or three complementarity classes, if incorporated into a nucleic acid strand complementary to a nucleic acid template, prevent further extension of the nucleic acid strand complementary to the nucleic acid template.

36. An instrument for performing a method according to any one of claims 1 to 30, comprising:

an imaging component capable of detecting incorporated or released label,

a reaction chamber for holding one or more attached templates such that they are accessible to the imaging component at least once in every set of steps,

a reagent distribution system for providing reagents to the reaction chamber.

37. An instrument according to claim 36, wherein the reaction chamber provides attached templates, resolvable by the imaging component, at a density of at least 100/cm², optionally at least 1000/cm², at least 10 000/cm² or at least 100 000/cm².

38. An instrument according to claim 35 or claim 36, wherein the imaging component employs a system or device selected from the group consisting of: photomultiplier tubes, photodiodes, charge-coupled devices, CMOS imaging chips, near-field scanning microscopes, far-field confocal microscopes, wide-field epi-illumination microscopes and total internal reflection microscopes.

39. An instrument according to claim 35 or claim 36, wherein the imaging component detects a fluorescent label.

40. An instrument according to claim 39, wherein the imaging component detects laser-induced fluorescence.

41. An instrument according to any one of claims 35 to 40, wherein the reaction chamber is a closed structure comprising a transparent surface, a lid, and ports for attaching the reaction chamber to the reagent distribution system, wherein the transparent surface holds template molecules on its inner surface and the imaging component is capable of imaging through the transparent surface.

42. A DNA molecule consisting of a stem portion and first and second loop portions, wherein the stem portion consists of a first strand and a second strand, wherein the first strand and second strand are equal in length, complementary and annealed together, wherein the first loop portion joins the 3' end of the first strand to the 5' end of the second strand and the second loop portion joins the 3' end of the second strand to the 5' end of the first strand, so that the DNA molecule has no free 5' or 3' ends.

43. A DNA molecule according to claim 42, wherein one loop portion comprises a primer binding site for rolling circle amplification.

44. A DNA molecule according to claim 42 or claim 43, wherein one loop portion comprises a primer binding site for sequencing.

45. An array of a plurality of different DNA molecules according to claim 42, claim 43 or claim 44 attached to a solid support, optionally attached by annealing to primers attached to the solid support.

46. A method of preparing a DNA molecule according to claim 42, claim 43 or claim 44, the method comprising:

providing a double-stranded DNA molecule consisting of a first strand having a 5' end and a 3' end and a second strand having a 5' end and a 3' end; and

ligating a first adaptor to join the 3' end of the first strand to the 5' end of the second strand, and ligating a second adaptor to join the 3' end of the second strand to the 5' end of the first strand, wherein the adaptors are hairpin structures.
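The topology of the closed molecule of claims 42 and 46 can be illustrated by writing it out as a single circular sequence. This is a simplified sketch: the function names and example sequences are hypothetical, and each hairpin adaptor is reduced to its loop sequence only, whereas a real adaptor may also contribute stem bases.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def closed_molecule(first_strand, loop1, loop2):
    """Write the closed molecule as one circular sequence, 5'->3'.

    Reading starts at the 5' end of the first strand. loop1 joins the
    3' end of the first strand to the 5' end of the second strand; loop2
    joins the 3' end of the second strand back to the 5' end of the
    first strand, so the returned string is to be read as circular.
    """
    second_strand = reverse_complement(first_strand)
    return first_strand + loop1 + second_strand + loop2

if __name__ == "__main__":
    # Hypothetical stem and loop sequences, for illustration only.
    print(closed_molecule("AAGC", "TTTT", "CCCC"))
```

Because the second strand is the reverse complement of the first, the two stem strands are equal in length and complementary, and the circular representation has no free 5' or 3' ends.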

47. produce the method for multiple copied dna profiling, described method comprises implements rolling circle amplification to the dna molecular according to claim 43 or claim 44, comprises the dna molecular of the extension of multiple copied dna profiling with production.

48. produce the method for a plurality of dna profilings of multiple copied, described method comprises implements rolling circle amplification to a plurality of dna moleculars according to claim 43 or claim 34, comprises the IDNA molecule of a plurality of extensions of multiple copied dna profiling with production.

49. according to the method for claim 47 or claim 48, wherein rolling circle amplification primer or dna molecular are attached on the solid support.

50. A method according to claim 47 or claim 48, further comprising condensing the extended DNA molecule by annealing between complementary strands within the multiple copies of the DNA template in the extended DNA molecule.

51. according to the method for claim 50, wherein the dna molecular of Yan Shening is concentrated on the solid support.

52. A method according to any one of claims 47 to 51, further comprising sequencing the multiple copies of the DNA template or plurality of DNA templates in the extended DNA molecule.