HK1123582A

HK1123582A - Reagents, methods, and libraries for bead-based sequencing

Info

Publication number: HK1123582A
Application number: HK08113069.1A
Authority: HK
Inventors: K. McKernan; A. Blanchard; L. Kotler; G. Costa
Original assignee: AB Advanced Genetic Analysis Corporation
Priority date: 2005-02-01
Filing date: 2006-02-01
Publication date: 2009-06-19

Description

Reagents, methods and libraries for bead-based sequencing

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority to and the benefit of provisional applications USSN 60/649,294, filed February 1, 2005; USSN 60/656,599, filed February 25, 2005; USSN 60/673,749, filed April 21, 2005; USSN 60/699,541, filed July 15, 2005; and USSN 60/722,526, filed September 30, 2005, all of which are incorporated herein by reference.

Background

Nucleic acid sequencing technology is of great importance in a variety of fields, from basic research to clinical diagnostics. The information obtained with this technology can vary in its degree of specificity. For example, useful information may include: determining whether a particular polynucleotide differs in sequence from a reference polynucleotide, confirming the presence of a particular polynucleotide sequence in a sample, determining partial sequence information such as the identity of one or more nucleotides within a polynucleotide, determining the identity and order of nucleotides within a polynucleotide, and the like.

DNA strands are generally polymers composed of four types of subunits, namely deoxynucleotides containing the bases adenine (A), cytosine (C), guanine (G), and thymine (T). These subunits are linked to each other by covalent phosphodiester bonds that link the 5' carbon of one deoxyribose group to the 3' carbon of the next. Most naturally occurring DNA consists of two such strands, arranged in an antiparallel orientation and joined together by hydrogen bonds formed between complementary bases, i.e., A with T and G with C.
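The pairing and antiparallel orientation described above can be illustrated with a minimal Python sketch. The function name `reverse_complement` and the lookup table are illustrative choices, not part of the specification; the sketch assumes an unmodified DNA strand written 5' to 3' using only A, C, G and T.

```python
# Watson-Crick pairing: A pairs with T, and G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the antiparallel partner strand, also written 5' to 3'.

    Assumes the input contains only the four standard DNA bases.
    """
    # Reversing accounts for the antiparallel orientation of the two strands;
    # the lookup applies the complementary base pairing.
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


if __name__ == "__main__":
    print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice returns the original strand, which reflects the symmetry of the pairing rules.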

Large-scale DNA sequencing became possible with the development of the chain termination or dideoxynucleotide method (Sanger et al, Proc. Natl. Acad. Sci. 74: 5463-5467, 1977) and the chemical degradation method (Maxam and Gilbert, Proc. Natl. Acad. Sci. 74: 560-564, 1977), the former of which has been widely used, improved and automated. In particular, the use of fluorescently labeled chain terminators was important in the development of automated DNA sequencers. Common to both methods is the generation of one or more collections of labeled DNA fragments of different sizes, which must then be separated by length to identify the nucleotide at the 3' end of the fragment (chain termination method) or the nucleotide most recently removed from the fragment (chemical degradation method).

While currently available sequencing technologies have enabled significant advances, such as the sequencing of many entire genomes, these technologies have a number of drawbacks, and improvements are highly desirable in many respects. The labeled DNA fragments are typically separated by polyacrylamide gel electrophoresis. However, this step has proven to be a major bottleneck that limits the speed and accuracy of sequencing in many cases. Although capillary electrophoresis (CAE) has proven capable of supporting breakthroughs such as the sequencing of the human genome (Venter et al, Science, 291: 1304-1351, 2001; Lander et al, Nature, 409: 860-921, 2001), it still has significant drawbacks. For example, CAE still requires a time-consuming separation step and still relies on discrimination by size, which may be inaccurate.

Various alternatives to the chain termination method have been proposed. In one method, commonly referred to as "sequencing by synthesis", an oligonucleotide primer is first hybridized to a target template. The primer is then extended through successive cycles of polymerase-catalyzed addition of differently labeled nucleotides, and the incorporation of each nucleotide into the growing strand is detected. The identity of the label identifies the complementary nucleotide in the template. Alternatively, multiple reactions can be run in parallel, one for each nucleotide, and incorporation of the labeled nucleotide in a reaction using one particular nucleotide identifies the complementary nucleotide in the template. (see, e.g., Melamede, U.S. Pat. No. 4,863,849; Cheeseman, U.S. Pat. No. 5,302,509, Tsien et al, International application WO 91/06678; Rosenthal et al, International application WO 93/21340; Canard et al, Gene, 148: 1-6 (1994); Metzker et al, Nucleic Acids Research, 22: 4259-.

In order to sequence a polynucleotide of any significant length efficiently, the polymerase must incorporate exactly one nucleotide per cycle. Thus, it is often desirable to employ nucleotides that act as chain terminators, i.e., nucleotides whose incorporation prevents further extension by the polymerase. The incorporated nucleotide must then be modified enzymatically or chemically to allow the polymerase to incorporate the next nucleotide. Various nucleotide analogs have been proposed that act as chain terminators but can be modified after incorporation so that extension can continue in a subsequent step. Such "reversible terminators" are described, for example, in U.S. Pat. Nos. 5,302,509; 6,255,475; 6,309,836; and 6,613,513. However, it has proven difficult to identify reversible terminators that are incorporated efficiently by polymerases, probably because, given the small size of nucleotides, modifications that allow a nucleotide to act as a terminator also affect its incorporation into the growing polynucleotide strand.

Other sequencing methods include pyrophosphate sequencing (pyrosequencing), which is based on the detection of pyrophosphate (PPi) released during DNA polymerization (see, e.g., U.S. Pat. Nos. 6,210,891 and 6,258,568). Although it does not require electrophoretic separation, pyrosequencing has a number of disadvantages that still limit its widespread use (Franca et al, Quarterly Reviews of Biophysics, 35(2): 169-200, 2002). Sequencing by hybridization has also been proposed as an alternative (U.S. Pat. No. 5,202,231; WO 99/60170; WO 00/56937; Drmanac et al, Advances in Biochemical Engineering/Biotechnology, 11: 16-101, 2002), but it too has a number of disadvantages, including the possibility of error in distinguishing highly similar sequences. In theory, single-molecule sequencing by exonuclease is a very efficient method for rapid sequencing of long DNA molecules; it involves labeling each base on one strand and then detecting, in a sample stream, the 3' terminal nucleotides as they are cleaved off in sequence (Stephan et al, J. Biotechnol., 86: 255-. However, a number of technical hurdles must be overcome before this potential approach can be implemented (Stephan et al, 2001).

Diagnostic tests based on specific sequence variations are already in use for a variety of diseases. It is widely believed that the sequencing of the human genome opens an era of personalized medicine, in which treatment (including prophylactic treatment) will be tailored to the specific genetic makeup of the patient or selected based on the identification of specific alleles or mutations. The need for rapid and accurate determination of sequence variants of pathogens, such as HIV, is also increasing. The demand for accurate and rapid sequence determination will therefore certainly grow in the near future. Thus, there is a need for improved sequencing methods of all types.

Summary of the Invention

The present invention provides novel and improved sequencing methods that do not require fragment separation and, in certain embodiments, do not require the use of a polymerase. U.S. patents 5,740,341 and 6,306,597 to Macevicz describe alternatives to the methods discussed in the Background. Those methods are based on repeated cycles of duplex extension along a single-stranded template. In preferred embodiments of those methods, one nucleotide is identified in each cycle. The present invention improves on these methods. The improvements enable efficient implementation and are particularly suitable for high-throughput sequencing. In addition, the invention provides a method for sequence determination that comprises repeated cycles of duplex extension along a single-stranded template but does not involve identifying any individual nucleotide in each cycle.

In one aspect, the invention provides an improved sequencing method based on successive cycles of duplex extension along a single-stranded template, ligation of a labeled extension probe, and detection of the label. Typically, extension is initiated from a duplex formed by an initiator oligonucleotide and the template. The initiator oligonucleotide is extended by ligating an oligonucleotide probe to its end to form an extended duplex, and the extended duplex is then repeatedly extended by successive ligation cycles. During each cycle, one or more nucleotides in the template are identified by identifying the label on, or associated with, the successfully ligated oligonucleotide probe. The label of the newly added probe may be detected before ligation, after ligation, or both. It is generally preferred to detect the label after ligation.

In a preferred embodiment, the probe has a non-extendable moiety in the terminal position (at the end of the probe opposite the nucleotide that is ligated to the growing nucleic acid strand of the duplex), so that only a single extension of the extended duplex occurs in a single cycle. "Non-extendable" means that the moiety cannot serve as a ligase substrate without modification. For example, the moiety may be a nucleotide residue lacking a 5' phosphate or 3' hydroxyl group. The moiety may also be a nucleotide to which a blocking group is attached to prevent ligation. In a preferred embodiment of the invention, the non-extendable moiety is removed after ligation to regenerate an extendable terminus, so that the duplex can be extended further in subsequent cycles.

To enable removal of the non-extendable moiety, in certain embodiments of the invention the probe contains at least one internucleoside linkage that is cleavable under conditions that do not substantially cleave phosphodiester bonds. Such linkages are referred to herein as "cleavable internucleoside linkages" or "scissile linkages". Cleavage of the scissile linkage removes the non-extendable moiety and either regenerates an extendable probe terminus or leaves a terminal residue that can be modified to form an extendable probe terminus. The scissile linkage can be between any two nucleosides in the probe. Preferably, the scissile linkage is at least several nucleotides away from (i.e., distal to) the newly formed bond. The nucleotides between the terminal nucleotide of the extension probe that is ligated to the extendable terminus and the scissile linkage need not hybridize perfectly to the template. These nucleotides can serve as "spacers", making it possible to identify nucleotides located at intervals along the template without performing a cycle for each nucleotide within the interval.

Preferably, the cleavable internucleoside linkage and the label are positioned such that cleavage of the linkage separates the extension probe into a labeled portion and a portion that remains part of the growing nucleic acid strand, thereby allowing the labeled portion to diffuse away (e.g., upon raising the temperature). For example, the label can be attached to the terminal nucleotide of the extension probe at the end opposite the ligated nucleotide. Alternatively, the label may be removed by any other method.

The present inventors have found that phosphorothioate linkages, in which one of the bridging oxygen atoms of the phosphodiester linkage is replaced by a sulfur atom, are particularly advantageous as scissile internucleoside linkages. The sulfur atom in the phosphorothioate linkage may be attached to the 3' carbon of one nucleoside or to the 5' carbon of the adjacent nucleoside.

In certain embodiments of the above methods, a plurality of sequencing reactions are performed. These reactions use initiator oligonucleotides that hybridize to different sequences of the template, such that the ends at which the initial ligation occurs are located at different positions on the template. For example, the positions at which the initial ligation occurs can be shifted, or "dephased", from each other in increments of 1 nucleotide. Thus, after each cycle of extension with oligonucleotide probes of the same length, the same relative phase exists between the ends of the extended initiator oligonucleotides on the different templates. The reactions can be performed in parallel in separate vessels, each containing a copy of the same template, or in series, i.e., the extended duplex is removed from the template after sequence information has been obtained with the first initiator oligonucleotide, and further reactions are then performed with initiator oligonucleotides that hybridize to different sequences of the template.
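The effect of dephased initiator oligonucleotides can be sketched in Python. This is a simplified model under stated assumptions, not the specification's procedure: it assumes that every ligation cycle advances the duplex by a fixed number of nucleotides (`step`) and identifies exactly one template position per cycle, and the function names are hypothetical.

```python
from typing import Dict, List


def interrogated_positions(offset: int, step: int, cycles: int) -> List[int]:
    """Template positions read by one reaction.

    Assumes the initiator ends `offset` nucleotides into the region of
    interest and each cycle advances the duplex by `step` nucleotides,
    identifying the first position of each newly ligated segment.
    """
    return [offset + k * step for k in range(cycles)]


def coverage(step: int, cycles: int) -> Dict[int, int]:
    """Map each template position to the initiator offset that reads it.

    Uses `step` reactions whose initiators are shifted by 1 nucleotide each.
    """
    reads: Dict[int, int] = {}
    for offset in range(step):
        for position in interrogated_positions(offset, step, cycles):
            reads[position] = offset
    return reads


if __name__ == "__main__":
    # With a 5-nucleotide advance per cycle, 5 dephased reactions of
    # 4 cycles each together read 20 consecutive template positions.
    print(sorted(coverage(step=5, cycles=4)))
```

Under these assumptions, each reaction alone reads only every `step`-th position, while the set of reactions taken together reads every position in the covered region.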

In another aspect, the present invention provides solutions that can be used in various nucleic acid manipulations. In one embodiment, the invention provides a solution comprising or consisting essentially of an aqueous solution of 1.0-3.0% SDS, 100-300 mM NaCl and 5-15 mM sodium bisulfate (NaHSO₄). The solution may comprise or consist essentially of an aqueous solution of about 2% SDS, about 200 mM NaCl and about 10 mM sodium bisulfate (NaHSO₄). For example, in one embodiment, the solution comprises an aqueous solution of 2% SDS, 200 mM NaCl and 10 mM sodium bisulfate (NaHSO₄). In another embodiment, the solution consists essentially of an aqueous solution of 2% SDS, 200 mM NaCl and 10 mM sodium bisulfate (NaHSO₄). In certain embodiments, the pH of the solution is 2.0 to 3.0, such as 2.5. The solution can be used to separate double-stranded nucleic acids, such as double-stranded DNA, into single strands, i.e., to denature (melt) the double-stranded nucleic acids. In certain embodiments, both strands are DNA. In other embodiments, both strands are RNA. In other embodiments, one strand is DNA and the other strand is RNA. In other embodiments, one or both strands contain both RNA and DNA. In other embodiments, one or both strands contain at least one nucleotide other than A, G, C or T. In some embodiments, one or both strands contain non-naturally occurring nucleotides. In other embodiments, one or more residues are trigger residues, e.g., abasic residues or damaged bases. In some embodiments, one or more residues comprise a universal base. In some embodiments, one or both strands contain a scissile linkage.
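As a worked example of the composition stated above, the following Python sketch computes the mass of each solute for a given volume of the 2% SDS, 200 mM NaCl, 10 mM NaHSO₄ solution. Several points are assumptions rather than statements from the specification: the SDS percentage is read as weight per volume (grams per 100 mL), the sodium bisulfate is taken to be the anhydrous salt, and the function name `grams_needed` is hypothetical. The sketch does not address pH adjustment.

```python
from typing import Dict

# Standard molar masses in g/mol. NaHSO4 is assumed to be anhydrous;
# a hydrated salt would require a different value.
MOLAR_MASS_G_PER_MOL = {"NaCl": 58.44, "NaHSO4": 120.06}


def grams_needed(
    volume_ml: float,
    sds_percent_wv: float = 2.0,
    nacl_mm: float = 200.0,
    nahso4_mm: float = 10.0,
) -> Dict[str, float]:
    """Return grams of each solute for `volume_ml` of solution.

    Assumes the SDS percentage is w/v, i.e. grams per 100 mL.
    """
    liters = volume_ml / 1000.0
    return {
        "SDS": sds_percent_wv / 100.0 * volume_ml,
        "NaCl": nacl_mm / 1000.0 * liters * MOLAR_MASS_G_PER_MOL["NaCl"],
        "NaHSO4": nahso4_mm / 1000.0 * liters * MOLAR_MASS_G_PER_MOL["NaHSO4"],
    }


if __name__ == "__main__":
    for solute, grams in grams_needed(500.0).items():
        print(f"{solute}: {grams:.4f} g")
```

The default arguments correspond to the exemplary composition; the stated ranges (1.0-3.0% SDS, 100-300 mM NaCl, 5-15 mM NaHSO₄) can be explored by passing other values.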

The double-stranded nucleic acids may be fully or partially double-stranded. They may be free molecules in solution, or one or both strands may be physically attached (e.g., covalently or non-covalently) to a solid or semi-solid support or substrate. Of particular note, double-stranded nucleic acids incubated in these solutions are effectively separated into single strands without the application of heat and without denaturing agents that could cause gel delamination (e.g., when the nucleic acids are in or attached to a semi-solid support such as a polyacrylamide gel) or disrupt non-covalent linkages such as streptavidin (SA)-biotin linkages (e.g., when the nucleic acids are attached to a support or substrate via SA-biotin linkages). In one embodiment, the solution is used to separate the strands of double-stranded nucleic acids in which one strand is attached to a bead via an SA-biotin linkage.

The present invention also provides a method of separating the strands of a double-stranded nucleic acid, the method comprising the step of contacting the double-stranded nucleic acid with any of the above solutions, for example an aqueous solution containing about 1.0-3.0% SDS, about 100-300 mM NaCl and about 5-15 mM sodium bisulfate (NaHSO₄), such as an aqueous solution containing 1.0-3.0% SDS, 100-300 mM NaCl and 5-15 mM sodium bisulfate (NaHSO₄). In one embodiment, the solution contains about 2% SDS, about 200 mM NaCl and about 10 mM sodium bisulfate (NaHSO₄), such as 2% SDS, 200 mM NaCl and 10 mM sodium bisulfate (NaHSO₄). In another embodiment, the solution consists essentially of an aqueous solution of 2% SDS, 200 mM NaCl and 10 mM sodium bisulfate (NaHSO₄). In certain embodiments, the pH of the solution is 2.0 to 3.0, such as 2.5. In some embodiments, the double-stranded nucleic acids are incubated in the solution. In other embodiments, the double-stranded nucleic acids (preferably nucleic acids attached to a support or substrate) are washed with the solution. In some embodiments, the double-stranded nucleic acids are contacted with the solution for a time sufficient to separate at least 10% of the double-stranded nucleic acid molecules into single strands. In some embodiments, the double-stranded nucleic acids are contacted with the solution for a time sufficient to separate at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 98%, 99% or more of the double-stranded nucleic acids into single strands. In an exemplary embodiment, the double-stranded nucleic acids are contacted with the solution for 15 seconds to 3 hours. In another embodiment, the double-stranded nucleic acids are contacted with the solution for 1 minute to 1 hour. In certain embodiments, the double-stranded nucleic acids are contacted with the solution for about 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, or 60 minutes. The method may further comprise the step of removing the solution, or removing some or all of the nucleic acids from the solution, after the incubation period.

This solution can be used in one or more steps of many of the sequencing methods described herein, and in any of these methods. For example, the solution can be used to separate extended duplexes from the template. After cleavage of the scissile linkage, the portion of the extension probe that is no longer attached to the extended duplex can be removed with this solution. The solution can also be used to separate the strands of triple-stranded nucleic acids or to separate double-stranded regions of single-stranded nucleic acids that contain self-complementary portions hybridized to each other.

In another aspect, the invention provides methods for obtaining sequence information using a collection of at least two distinguishably labeled families of oligonucleotide probes. The probes in a probe family contain both an undefined portion and a defined portion. As in the methods described above, extension is initiated from the duplex formed by the initiator oligonucleotide and the template. The initiator oligonucleotide is extended by ligating an oligonucleotide probe to its terminus to form an extended duplex, which is then repeatedly extended through successive ligation cycles. The probe contains a non-extendable moiety in its terminal position (at the end of the probe opposite the nucleotide that is ligated to the growing nucleic acid strand of the duplex), so that extension of the duplex occurs only once in a single cycle. During each cycle, the label on or attached to the successfully ligated probe is detected, and the non-extendable moiety is removed or modified to generate an extendable terminus. The label corresponds to the probe family to which the probe belongs.

Successive cycles of extension, ligation and detection produce an ordered list of the probe families to which the successively ligated probes belong. Sequence information is obtained from this ordered list of probe families. However, knowing the probe family to which a newly ligated probe belongs is not in itself sufficient to determine the identity of any nucleotide in the template. Rather, knowing the probe family of the newly ligated probe excludes certain sequences as possibilities for the defined portion of the probe, but leaves at least two possible nucleotide identities at each position. Thus, there are at least two possibilities for the identity of each nucleotide in the template that lies opposite a nucleotide of the defined portion of the newly ligated probe (i.e., the template nucleotide complementary to that nucleotide of the defined portion).

In certain embodiments, after the desired number of cycles has been performed, the ordered list of probe families is used to generate a set of candidate sequences. The set of candidate sequences may itself provide sufficient information to achieve the objective. In preferred embodiments of the invention, one or more additional steps are performed to select the correct sequence from the candidate sequences. For example, the candidate sequences may be compared with a database of known sequences, and the candidate sequence closest to one of the sequences in the database is selected as the correct sequence. In other embodiments, a differently encoded set of probe families is used to perform another round of sequencing of the template through successive cycles of extension, ligation, detection and cleavage, and the correct sequence is selected using the information obtained in the second round. In other embodiments, at least one additional item of information is combined with the information obtained from the ordered list of probe families to determine the sequence.
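The relationship between an ordered list of probe families and a set of candidate sequences can be illustrated with a short Python sketch. The encoding used here is an assumption chosen for illustration and is not taken from the specification: there are four probe families, the defined portion of each probe is a dinucleotide, a dinucleotide's family is the XOR of its two base indices (A=0, C=1, G=2, T=3), and successive entries in the list describe overlapping adjacent base pairs of the sequence. All names are hypothetical.

```python
from typing import List

BASES = "ACGT"


def family_of(dinucleotide: str) -> int:
    """Family (0-3) of a dinucleotide under the assumed XOR encoding."""
    first, second = (BASES.index(base) for base in dinucleotide)
    return first ^ second


def encode(sequence: str) -> List[int]:
    """Ordered list of families for each adjacent base pair of a sequence."""
    return [family_of(sequence[i:i + 2]) for i in range(len(sequence) - 1)]


def candidate_sequences(families: List[int]) -> List[str]:
    """All sequences consistent with an ordered list of families.

    Each family leaves four possible dinucleotides, so no single base is
    determined; fixing the first base resolves every later one in turn.
    """
    candidates = []
    for first in BASES:
        sequence = [first]
        for family in families:
            previous = BASES.index(sequence[-1])
            sequence.append(BASES[previous ^ family])
        candidates.append("".join(sequence))
    return candidates


if __name__ == "__main__":
    families = encode("ATGGC")
    print(families)                       # [3, 1, 0, 3]
    print(candidate_sequences(families))  # ['ATGGC', 'CGTTA', 'GCAAT', 'TACCG']
```

In this sketch the family list narrows the possibilities to four candidates, and one additional item of information, such as a known first base or a match against a database of known sequences, would select among them.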

The invention also provides methods for error checking in sequencing with a family of probes. Certain methods can distinguish between Single Nucleotide Polymorphisms (SNPs) and sequencing errors.

The invention also provides nucleic acid fragments (e.g., DNA fragments) comprising at least two segments of interest (e.g., at least two tags) and at least three primer binding regions (PBRs), such that at least two different templates, each corresponding to a segment of interest, can be amplified from each fragment. A "primer binding region" is a portion of a nucleic acid to which an oligonucleotide can hybridize, so that the oligonucleotide can serve as an amplification primer, a sequencing primer, an initiator oligonucleotide, and the like. The primer binding region should therefore have a known sequence, so that an appropriate complementary oligonucleotide can be selected. As used herein and in the accompanying figures, a portion of a nucleic acid strand used in the methods of the invention can be referred to as a primer binding region whether, in the practice of those methods, the primer actually binds to that region (in which case the sequence of the primer is complementary or substantially complementary to the sequence of the region) or to the complement of that region (in which case the sequence of the primer is the same or substantially the same as the sequence of the region). A segment of interest is any nucleic acid segment for which sequence information is desired. For example, the segment of interest may be a tag, and for purposes of this disclosure the segment of interest may be assumed to be a tag (also referred to herein and elsewhere as an "end tag"). It should be understood, however, that the present invention is not limited to segments of interest that are tags. In certain embodiments, the at least two tags are paired tags. The nucleic acid fragment may contain one or more pairs of tags, for example 2, 3, 4, 5 or more pairs of tags. The invention also provides libraries containing such nucleic acid fragments, and methods of making templates and libraries.

The invention also provides microparticles, such as beads, having attached at least two distinct nucleic acid populations, wherein each of the at least two nucleic acid populations consists of a plurality of substantially identical nucleic acids, and wherein the nucleic acid populations are generated by amplification (such as PCR amplification) of a single nucleic acid fragment. In some embodiments, the single nucleic acid fragment comprises a 5' tag and a 3' tag, wherein the 5' and 3' tags are paired tags. In some embodiments in which the single nucleic acid fragment comprises a paired 5' tag and 3' tag, one of the populations of nucleic acids attached to the microparticle comprises at least a portion of the 5' tag, and another of the populations of nucleic acids attached to the microparticle comprises at least a portion of the 3' tag. In a preferred embodiment, one of the populations of nucleic acids comprises the complete 5' tag and another comprises the complete 3' tag.

The nucleic acid fragment comprises a plurality of PBRs, at least one of which is located between the tags and at least two of which flank the tag-containing portion of the nucleic acid fragment, thereby enabling amplification of a region containing at least a portion of the 5' tag and amplification of a region containing at least a portion of the 3' tag to produce two different populations of nucleic acids. In a preferred embodiment, both the complete 5' tag and the complete 3' tag can be amplified. For example, the nucleic acid fragment may contain first and second primer binding sites flanking the 5' tag, and third and fourth primer binding sites flanking the 3' tag. PCR amplification of the 5' tag is performed using primers that bind to the first and second primer binding sites. PCR amplification of the 3' tag is performed using primers that bind to the third and fourth primer binding sites. It will be appreciated that the primers are selected so that extension from each primer proceeds toward the region of the DNA fragment containing the tag to be amplified. Alternatively, the first primer binding site may be located upstream of one of the tags, the second primer binding site may be located downstream of the other tag, and the third primer binding site may be located between the two tags. The third primer binding site then serves as the binding site for the forward primer in the PCR amplification of one tag and for the reverse primer in the PCR amplification of the other tag. Thus, in one embodiment of the invention, there is provided a microparticle, such as a bead, having attached thereto at least two different populations of nucleic acids, wherein each of the at least two populations of nucleic acids consists of a plurality of substantially identical nucleic acids, and wherein a first population of nucleic acids comprises a 5' tag and a second, different population of nucleic acids comprises a 3' tag.

The invention also provides populations of microparticles, such as beads, in which each microparticle has attached thereto at least two different populations of nucleic acids, wherein each of the at least two populations of nucleic acids consists of a plurality of substantially identical nucleic acids, and wherein the populations of nucleic acids are generated by amplification (e.g., PCR amplification) of a single nucleic acid fragment. The populations of substantially identical nucleic acids can be, for example, templates containing a 5' tag and templates containing a 3' tag. The invention also provides arrays of such microparticles and sequencing methods that comprise sequencing a population of substantially identical nucleic acids. For example, in one embodiment, the two populations of substantially identical nucleic acids attached to a single microparticle each comprise a different primer binding region (PBR), such that one population can be sequenced without interference from the other population by using different sequencing primers. If two or more populations of substantially identical nucleic acids are attached to a microparticle, each population may have a unique PBR, such that a primer that binds a particular PBR does not bind to the PBRs present in the other populations of nucleic acids attached to the microparticle. Thus, the methods of the invention can produce microparticles having attached at least two different populations of substantially identical nucleic acids (e.g., multiple copies of a template comprising a 5' tag and multiple copies of a template comprising a 3' tag), wherein the tags are paired tags. According to the methods of the invention, the templates contain different PBRs that provide binding sites for sequencing primers. Thus, by selecting a sequencing primer that is complementary to the PBR in the template containing the 5' tag, sequence information can be obtained from the 5' tag without interference from the template containing the 3' tag, even though the template containing the 3' tag is present on the same microparticle. Likewise, by selecting a sequencing primer that is complementary to the PBR in the template containing the 3' tag, sequence information can be obtained from the 3' tag without interference from the template containing the 5' tag, even though the template containing the 5' tag is present on the same microparticle. Because the two paired tags are present on the same microparticle, the sequences of the 5' and 3' paired tags can be associated with each other, just as they can when the tags are present on a single template, as is known in the art.

The invention also provides automated sequencing systems that can be used, for example, to sequence templates that are arrayed in or on a substantially planar support. The present invention also provides image processing methods, which can be stored on a computer-readable medium such as a hard disk, a CD, a zip disk, a flash memory, etc. In certain preferred embodiments, the system identifies 40,000 or more nucleotides per second. In certain preferred embodiments, the system generates 8.6 gigabases (Gb) or more of sequence data per day (24 hours). In certain preferred embodiments, the system generates 48 Gb or more of sequence information (nucleotide identifications) per day.
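The throughput figures above are stated in different units (nucleotides per second and gigabases per day), so a small Python sketch for converting between them may be helpful. It assumes that 1 Gb means 10^9 nucleotide identifications and that the system runs continuously for 24 hours; the function names are hypothetical, and the sketch does not imply that the three figures describe the same embodiment.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a 24-hour day


def bases_per_day(bases_per_second: float) -> float:
    """Daily output for a system running continuously at the given rate."""
    return bases_per_second * SECONDS_PER_DAY


def bases_per_second(daily_bases: float) -> float:
    """Average rate needed to reach the given daily output."""
    return daily_bases / SECONDS_PER_DAY


if __name__ == "__main__":
    # 40,000 identifications per second, expressed in Gb per day.
    print(bases_per_day(40_000) / 1e9)
    # Average per-second rates implied by 8.6 Gb and 48 Gb per day.
    print(round(bases_per_second(8.6e9)))
    print(round(bases_per_second(48e9)))
```

Under these assumptions, 40,000 identifications per second corresponds to about 3.5 Gb per day, while 8.6 Gb and 48 Gb per day correspond to average rates of roughly 100,000 and 556,000 identifications per second, respectively.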

In addition, the invention provides a computer readable medium storing information generated using the sequencing method of the invention. The information may be stored in a database.

This application refers to various patents, patent applications, journal articles, and other publications, all of which are incorporated herein by reference. In addition, the following standard references are incorporated herein by reference: Current Protocols in Molecular Biology, John Wiley & Sons, New York, edition as of July 2002; Sambrook, Russell, and Sambrook, Molecular Cloning: A Laboratory Manual, third edition, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, 2001. In the event of a conflict between the specification and any document incorporated by reference, the specification shall control, it being understood that the determination of whether a conflict or inconsistency exists is within the discretion of the inventors and can be made at any time.

Brief description of the drawings

FIG. 1A is a schematic representation of initiation followed by two cycles of extension, ligation and identification.

FIG. 1B is a schematic representation of initiation followed by two cycles of extension, ligation and identification in an embodiment in which extension proceeds inward from the free end of the template toward the support.

FIG. 2 shows a color assignment scheme for oligonucleotide probes, in which the identity of the 3' base of a probe is determined by identifying the color of its fluorophore.

FIG. 3A shows a schematic representation of the hybridization of an initiator oligonucleotide to different positions of a binding region of a template followed by ligation of an extension probe to form an extended duplex.

FIG. 3B shows a schematic of the assembly of contiguous sequences by extension, ligation and cleavage methods using extension probes designed to read every 6 bases on the template molecule.

FIG. 4A shows a 5'-S-phosphorothioate linkage (3'-O-P-S-5').

FIG. 4B shows a 3'-S-phosphorothioate linkage (3'-S-P-O-5').

FIG. 5A shows a schematic of one cycle of extension, ligation and cleavage for sequencing in the 5'→3' direction using an extension probe containing a 3'-O-P-S-5' phosphorothioate linkage.

FIG. 5B shows a schematic of one cycle of extension, ligation and cleavage for sequencing in the 3'→5' direction using an extension probe containing a 3'-O-P-S-5' phosphorothioate linkage.

FIGS. 6A-6F are more detailed schematic diagrams of several sequencing reactions performed on a single template. These reactions utilize initiating oligonucleotides that bind to different portions of the template.

FIG. 7 is a schematic diagram showing the synthesis scheme of 3'-phosphoramidites for dA and dG.

FIGS. 8A-8E are gel shift test results showing two cycles of successfully ligating and cleaving extension probes containing phosphorothioate linkages.

FIG. 8F shows a schematic diagram of the ligation mechanism of DNA ligase.

FIG. 9 is a result of a gel shift assay showing ligation efficiency of degenerate oligonucleotide probes containing inosine.

FIG. 10 is a result of a gel shift assay showing the efficiency of ligation of inosine-containing degenerate oligonucleotide probes on various substrates.

FIG. 11 shows the results of an analysis evaluating the fidelity of each of two DNA ligases (T4 DNA ligase and Taq DNA ligase) in 3'→5' extension.

FIG. 12 shows the results of a gel shift assay (A) showing the ligation efficiency of inosine-containing degenerate oligonucleotide probes, and the results of direct sequencing analysis (B) of the ligation reactions, used to evaluate the fidelity of T4 DNA ligase in ligating the oligonucleotide probes. The results are tabulated and graphed in panels C-F.

FIGS. 13A-13C show the results of experiments in which ligation was performed in a gel when the bead-based template was embedded in a polyacrylamide gel on a glass slide. Figure 13A shows a ligation reaction scheme. Ligation reactions were performed in gels in the presence (B) and absence (C) of T4 DNA ligase.

FIG. 14A shows an image of an emulsion PCR reaction performed on beads with attached first amplification primer using fluorescently labeled second amplification primer and excess template.

FIG. 14B (top) shows a fluorescence image of a portion of a glass slide on which beads carrying template hybridized to a Cy3-labeled oligonucleotide are immobilized within a polyacrylamide gel. (This slide was used for a different experiment, but is representative of the slides used here.) FIG. 14B (bottom) shows a schematic of a slide equipped with a Teflon mask to contain the polyacrylamide solution.

FIG. 15 shows three sets of labeled oligonucleotide probes designed to address the problems of probe specificity and selectivity, and also shows the excitation and emission values for a set of four spectrally resolved labels.

FIG. 16 shows the results of experiments confirming the 4-color spectral characteristics of the oligonucleotide probes. Hybridization and ligation reactions were performed on slides containing four unique single-stranded template populations (A) using an oligonucleotide probe mixture containing four probes, each labeled with a unique fluorophore. The slides were imaged under bright light before and after ligation (B) and imaged under fluorescence excitation with four band-pass filters. The individual populations are shown in false color (C). The spectral characteristics, which show minimal signal overlap, are plotted in (D).

FIG. 17 shows an experiment confirming the ligation specificity of oligonucleotide extension probes. FIG. 17(A) shows a schematic diagram of the ligation. FIG. 17(B) is a bright light image, and FIG. 17(C) is the corresponding fluorescence image, of beads embedded in a polyacrylamide gel after ligation. FIG. 17(D) shows the fluorescence detected from each label before and after ligation.

FIG. 18 shows another experiment confirming the ligation specificity and selectivity of oligonucleotide extension probes. FIG. 18(A) shows a schematic diagram of the ligation. FIG. 18(B) is a bright light image, and FIG. 18(C) is the corresponding fluorescence image, of beads embedded in a polyacrylamide gel after ligation. FIG. 18(D) shows the expected and observed ligation frequencies, showing that the observed frequencies correlate closely with those predicted from the proportion of a particular extension probe in the population.

FIG. 19 shows experiments confirming that pools of oligonucleotide extension probes containing degenerate and universal bases provide specific and selective ligation in gels. FIG. 19(A) shows a schematic of a ligation experiment using four differentially labeled pools of inosine-containing degenerate probes. FIG. 19(B) is a bright light image, and FIG. 19(C) is the corresponding fluorescence image, of beads embedded in a polyacrylamide gel after ligation. FIG. 19(D) shows the expected and observed ligation frequencies, showing that the observed frequencies correlate closely with those predicted from the proportion of a particular extension probe in the population. FIG. 19(E) shows a scatter plot of the raw data and of filtered data representing the top 90% of bead signal values.

FIG. 20 is a bar graph showing the signal detected over successive cycles of hybridizing an initiating oligonucleotide (primer) to the template and stripping it from the template. As shown, only a small signal loss occurs over 10 cycles.

FIG. 21 is a photograph of an automated sequencing system that can be used, for example, to collect sequence information from templates arrayed in or on a substantially flat support. Also shown is a dedicated computer for controlling the operation of the various components of the system, processing and storing the collected image data, providing a user interface, etc. The lower part of the figure shows an enlarged view of the flow cell, oriented to achieve gravimetric bubble displacement.

FIG. 22 shows a schematic of a high throughput automatic sequencing apparatus that can be used to determine template sequences arrayed in or on a substantially flat support.

FIG. 23 shows a scatter plot of alignment inconsistency, showing that there is little inconsistency over 30 frames.

FIGS. 24A-I show schematic views of various views of a flow cell or portion thereof according to the invention.

FIG. 25A shows an exemplary encoding for a preferred set of probe families comprising partially defined probes, each comprising a defined portion 2 nucleotides in length.

FIG. 25B shows a preferred set of probe families (top panel) and ligation, detection and cleavage cycles (bottom panel).

FIG. 26 shows an exemplary encoding for another preferred set of probe families, which set includes partially defined probes, each comprising a defined portion 2 nucleotides in length.

FIGS. 27A-27C represent another method for graphically identifying the set of 24 preferred probe families defined in Table 1.

FIG. 28 shows a less preferred set of probe families, in which the probes contain a defined portion of 2 nucleotides in length.

FIG. 29A shows a diagram of a defined portion that can be used to generate a collection of probe families that includes probes that contain a defined portion that is 3 nucleotides in length.

FIG. 29B shows a mapping scheme diagram that can be used to generate a defined portion of a collection of probe families from a collection of 24 preferred probe families, the collection including probes that contain a defined portion that is 3 nucleotides in length.

FIG. 30 shows a method for sequencing with a probe family set. One embodiment employing a preferred set of probe families is described.

FIGS. 31A-31C show a method for generating candidate sequences using a first probe family set and decoding using a second probe family set to perform sequence determination.

FIG. 32 shows a method for sequencing with a less preferred combination of probe families.

FIG. 33A shows a schematic of a slide with attached beads. The DNA template is attached to the bead.

FIG. 33B shows a population of beads attached to a slide. The lower panel shows the same slide area under white light (left) and under a fluorescence microscope (right). The upper panel shows a range of bead densities.

FIGS. 34A-34C show a scheme in which two tags of a pair of tags present in a nucleic acid fragment (template) are amplified as a single population of nucleic acids and captured to microparticles by an amplification method.

FIGS. 35A and 35B show details of primer design and amplification for the scheme of FIGS. 34A-34C. Both strands of the nucleic acid fragment (template) are shown for clarity. Primers and primer binding regions having the same sequence are represented by the same color. For example, P1 is shown in dark blue, indicating that the sequence of primer P1, present both on the microparticles and in solution, is identical to the correspondingly colored portion of the template strand shown. The dark blue region of the template (labeled P1) may be referred to as a primer binding region, although the corresponding primer (P1) actually binds to the complementary portion of the other strand; the dark blue region itself is identical in sequence to primer P1.

FIGS. 35C and 35D show sequencing of first and second tags attached to microparticles generated using the methods shown in FIGS. 35A and 35B, respectively.

Definitions

In order to facilitate understanding of the present specification, the following definitions are provided. It is to be understood that, in general, terms which are not specifically defined are to be given a usual meaning or a meaning commonly accepted in the art.

As used herein, an "abasic residue" is a residue having the structure of the portion of a nucleoside or nucleotide that remains after removal of the nitrogenous base, or of enough of the nitrogenous base that the resulting molecule no longer participates in the hydrogen bonding characteristic of a nucleoside or nucleotide. Abasic residues may be generated by removing nitrogenous bases from nucleosides or nucleotides. However, the term "abasic" is used to refer to a structural feature of a residue, independent of the manner in which the residue is generated. The terms "abasic residue" and "abasic site" as used herein refer to a residue in a nucleic acid that lacks a purine or pyrimidine base.

As used herein, an "apurinic/apyrimidinic (AP) endonuclease" refers to an enzyme that cleaves a bond on the 5' side, the 3' side, or both the 5' and 3' sides of an abasic residue in a polynucleotide. In certain embodiments of the invention, the AP endonuclease is an AP lyase. Examples of AP endonucleases include, but are not limited to: Escherichia coli (E. coli) endonuclease VIII and homologs thereof, and E. coli endonuclease III and homologs thereof. It will be understood that references to particular enzymes, such as endonucleases, e.g., E. coli Endo VIII, Endo V, etc., are also intended to include homologs from other species which are recognized in the art as homologs and which have similar biochemical activity in removing damaged bases and/or cleaving DNA containing abasic or other priming residues.

The term "array" as used herein refers to a collection of entities distributed on or in a support substrate; the individual entities are preferably spaced apart a sufficient distance to allow identification of discrete features of the array using a variety of techniques. The entity can be, for example, a nucleic acid molecule, a clonal population of nucleic acid molecules, a microparticle (optionally having a clonal population of nucleic acid molecules attached) or the like. When used as a verb, the term "array" and variations thereof refer to any method of forming an array, such as distributing entities onto or into a support substrate.

A "damaged base" is a purine or pyrimidine base that differs from A, G, C or T in a manner that renders it a substrate for removal from DNA by a DNA glycosylase. For purposes of the present invention, uracil is considered a damaged base. In some embodiments of the invention, the damaged base is hypoxanthine.

When referring to a position in a polynucleotide that is one of a population of polynucleotides, "degenerate" means that the identity of the base forming part of the nucleoside occupying that position varies among different members of the population. Thus, the population contains individual members that differ in sequence at degenerate positions. The term "position" refers to a numerical value assigned to each nucleoside in a polynucleotide, typically relative to the 5' or 3' end. For example, the nucleoside at the 3' end of an extension probe can be designated as position 1. Thus, in an extension probe library of the structure 3'-XXXNXXXXXX-5', N is at position 4. Position 4 is considered a degenerate position if the identity of N can vary among different members of the library. The extension probe library is then also said to be degenerate at position N. A position is said to be k-fold degenerate if it can be occupied by k different kinds of nucleosides. For example, a position that can be occupied by nucleosides having either of two different bases is 2-fold degenerate.
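The numbering of positions and the notion of k-fold degeneracy can be illustrated with a short sketch. The library below is hypothetical: the specific bases chosen for the X positions, and the function names, are assumptions made only for illustration. Each entry lists the bases allowed at one position, with position 1 at the 3' end as in the example above.

```python
from itertools import product

# Hypothetical library of the form 3'-XXXNXXXXXX-5'.
# Entry i lists the bases allowed at position i + 1, counted from the 3' end.
LIBRARY = ["A", "C", "G", "ACGT", "T", "T", "A", "C", "G", "A"]


def degeneracy(library):
    """Map each position (1 = 3' end) to the number k of allowed bases."""
    return {pos: len(set(allowed)) for pos, allowed in enumerate(library, start=1)}


def degenerate_positions(library):
    """Positions that can be occupied by more than one kind of base."""
    return [pos for pos, k in degeneracy(library).items() if k > 1]


def members(library):
    """Enumerate every member of the library, written 3' to 5'."""
    return ["".join(bases) for bases in product(*library)]


if __name__ == "__main__":
    print(degenerate_positions(LIBRARY))  # positions that are degenerate
    print(degeneracy(LIBRARY)[4])         # k for position 4
    print(members(LIBRARY))               # the individual probe sequences
```

In this sketch only position 4 is degenerate, and it is 4-fold degenerate, so the library has four members that differ from one another only at that position.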

"Determining sequence information" includes "sequence determination" and also includes other levels of information, such as eliminating one or more possibilities for a sequence. It should be noted that sequencing a polynucleotide generally yields equivalent information about its fully complementary (100% complementary) polynucleotide, and is therefore equivalent to sequencing the fully complementary polynucleotide directly.

When referring to a plurality of elements, such as the nucleosides in an oligonucleotide probe molecule or portion thereof, "independent" means that the identity of each element is not limited or constrained by the identity of any other element, e.g., the identity of each element is selected independently of the identity of any other element. Thus, knowing the identity of one or more elements provides no information about the identity of any other element. For example, the nucleosides in the sequence NNNN are independent if each N can be A, G, C or T regardless of the identity of the other N's.
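One way to picture independence is combinatorial: in a pool whose positions are independent, every combination of the bases seen at each position occurs. The following sketch applies that test; treating independence as "the pool equals the full Cartesian product of its per-position base sets" is an assumption made for illustration, and the function names are hypothetical.

```python
from itertools import product

BASES = "AGCT"


def expand(n):
    """All sequences of length n in which each N is chosen freely from A, G, C, T."""
    return ["".join(bases) for bases in product(BASES, repeat=n)]


def is_independent(pool):
    """True if every combination of the bases observed per position is present."""
    pool = set(pool)
    if len({len(seq) for seq in pool}) != 1:
        raise ValueError("pool must be non-empty and of uniform length")
    per_position = [set(column) for column in zip(*pool)]
    return pool == {"".join(bases) for bases in product(*per_position)}


if __name__ == "__main__":
    print(len(expand(4)))                          # size of the NNNN pool
    print(is_independent(expand(4)))               # freely chosen positions
    print(is_independent(["AT", "TA", "GC", "CG"]))  # second base fixed by the first
```

The NNNN pool contains 4 x 4 x 4 x 4 = 256 sequences and passes the test. The second pool fails it, because knowing the first base reveals the second.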

"ligation" refers to the formation of a covalent bond or linkage between the ends of two or more nucleic acids, such as oligonucleotides and/or polynucleotides, in a template-driven reaction. The nature of the bond or linkage may vary widely, and the linkage may be performed enzymatically or chemically.

The term "microparticle" as used herein refers to a particle having a minimum cross-sectional dimension of 50 microns or less, preferably 10 microns or less. In certain embodiments, the minimum cross-sectional dimension is about 3 microns or less, about 1 micron or less, about 0.5 microns or less, such as about 0.1, 0.2, 0.3, or 0.4 microns. The microparticles may be made from a variety of inorganic or organic materials, including but not limited to: glass (e.g., controlled pore glass), silica, zirconia, cross-linked polystyrene, polyacrylic acid, polymethacrylic acid, titanium dioxide, latex, polystyrene, and the like. Various suitable materials and other considerations are found, for example, in U.S. patent 6,406,848. Dynabeads, available from Dynal, Oslo, Norway, are examples of commercially available microparticles that can be used in the present invention. Magnetically responsive particles may be used. The magnetic responsiveness of certain preferred microparticles facilitates collection and concentration of the microparticle-attached templates after amplification, and also facilitates other steps (e.g., washing, removal of reagents, etc.). In certain embodiments of the invention, populations of particles having different shapes (e.g., some spherical and others non-spherical) are employed.

The term "microsphere" or "bead" as used herein refers to a substantially spherical microparticle having a diameter of 50 microns or less, preferably 10 microns or less. In certain embodiments, the diameter is about 3 microns or less, about 1 micron or less, about 0.5 microns or less, such as about 0.1, 0.2, 0.3, or 0.4 microns. In certain embodiments of the invention, a population of monodisperse microspheres is employed, i.e., the microspheres are substantially uniform in size. For example, the coefficient of variation of the particle diameter may be less than 5%, such as 2% or less, 1% or less, and the like. However, in other embodiments, the coefficient of variation of the population of microparticles is 5% or greater, such as 5%, 5%-10% (inclusive), 10%-25% (inclusive), and the like. In certain embodiments, a mixed population of microparticles is employed. For example, a mixture of two populations, each with a coefficient of variation of less than 5%, can be used, resulting in a mixed population that is not monodisperse. For example, a mixture of microspheres having diameters of 1 micron and 3 microns may be used. In certain embodiments of the invention, when sequencing is performed using templates attached to a population of microspheres that is not monodisperse, the size of the microspheres provides additional information. For example, different template libraries may be attached to microspheres of different sizes. Also, since fewer template molecules can be attached to a smaller particle, the signal intensity varies with particle size, which can facilitate multiplexed sequencing.
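The coefficient of variation referred to above is the standard deviation of the particle diameters divided by their mean. The sketch below computes it for two hypothetical bead populations and for their mixture. The diameter values are invented for illustration, the use of the population (rather than sample) standard deviation is an assumption, and the 5% threshold is taken from the example given above.

```python
from statistics import mean, pstdev


def coefficient_of_variation(diameters):
    """Population standard deviation of the diameters divided by their mean."""
    return pstdev(diameters) / mean(diameters)


def is_monodisperse(diameters, threshold=0.05):
    """True if the coefficient of variation is below the threshold (5% here)."""
    return coefficient_of_variation(diameters) < threshold


# Hypothetical diameters in microns for two bead populations.
ONE_MICRON = [0.98, 1.00, 1.02, 1.01, 0.99]
THREE_MICRON = [2.95, 3.00, 3.05, 3.02, 2.98]

if __name__ == "__main__":
    for name, population in [
        ("1 micron", ONE_MICRON),
        ("3 micron", THREE_MICRON),
        ("mixture", ONE_MICRON + THREE_MICRON),
    ]:
        cv = coefficient_of_variation(population)
        print(f"{name}: CV = {cv:.1%}, monodisperse = {is_monodisperse(population)}")
```

Each population taken alone falls well under the 5% threshold, while the mixture of the two does not, which matches the statement that mixing two monodisperse populations yields a population that is not monodisperse.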

The term "nucleic acid sequence" as used herein may refer to the nucleic acid material itself and is not limited to the sequence information (i.e., the succession of letters chosen from among the five base letters A, G, C, T or U) that biochemically characterizes a specific nucleic acid, such as a DNA or RNA molecule. The nucleic acids described herein are presented in the 5'→3' orientation unless otherwise indicated.

"Nucleoside" includes a nitrogenous base linked to a sugar molecule. The term as used herein includes natural nucleosides, including the 2'-deoxy and 2'-hydroxy forms, e.g., as described by Kornberg and Baker, DNA Replication, 2nd edition (Freeman, San Francisco, 1992), and nucleoside analogs. For example, natural nucleosides include adenosine, thymidine, guanosine, cytidine, uridine, deoxyadenosine, deoxythymidine, deoxyguanosine, and deoxycytidine. Nucleoside "analogs" refer to synthetic nucleosides containing a modified base moiety and/or a modified sugar moiety, as generally described by Scheit, Nucleotide Analogs (John Wiley, New York, 1980). Such analogs include synthetic nucleosides designed to improve binding properties, reduce degeneracy, improve specificity, and the like. Nucleoside analogs include 2-aminoadenosine, 2-thiothymidine, pyrrolo-pyrimidine, 3-methyladenosine, C5-propynyl cytidine, C5-propynyl uridine, C5-bromouridine, C5-fluorouridine, C5-iodouridine, C5-methylcytidine, 7-deazaadenosine, 7-deazaguanosine, 8-oxoadenosine, 8-oxoguanosine, O(6)-methylguanine, 2-thiocytidine, and the like. Nucleoside analogs can include any universal base described herein.

The term "organism" as used herein refers to any living or non-living entity comprising a nucleic acid capable of replication and whose sequence is of interest. It includes plasmids; a virus; prokaryotes, archaebacteria, and eukaryotic cells, cell lines, fungi, protozoa, plants, animals, and the like.

In reference to the overhanging strands of a probe and a template polynucleotide, "perfectly matched duplex" means that the overhanging strand of one forms a duplex structure with the other such that each nucleoside in the duplex structure is Watson-Crick base paired with a nucleoside on the opposite strand. The term also includes the pairing of nucleoside analogs, such as deoxyinosine, nucleosides with 2-aminopurine bases, and the like, that can be used to reduce probe degeneracy, whether or not such pairing involves hydrogen bond formation.
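A check for a perfectly matched duplex can be sketched as follows. Both strands are written 5'→3', so the probe is compared against the template read in the opposite direction. The letter "I" is used here to stand for deoxyinosine and is treated as pairing with any base; that treatment, like the function names, is a simplifying assumption for illustration only.

```python
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
UNIVERSAL = {"I"}  # assumption: "I" denotes deoxyinosine and pairs with any base


def pairs(x, y):
    """True if the two nucleosides are counted as paired."""
    return (x, y) in WATSON_CRICK or x in UNIVERSAL or y in UNIVERSAL


def is_perfectly_matched(probe_5to3, template_5to3):
    """True if every position of the antiparallel duplex is paired."""
    if len(probe_5to3) != len(template_5to3):
        return False
    # The strands are antiparallel, so the template is read from its 3' end.
    return all(pairs(p, t) for p, t in zip(probe_5to3, reversed(template_5to3)))


if __name__ == "__main__":
    template = "ATGCCTG"
    print(is_perfectly_matched("CAGGCAT", template))  # fully complementary probe
    print(is_perfectly_matched("CAGGCAA", template))  # one mismatch
    print(is_perfectly_matched("CAIGCAT", template))  # inosine at one position
```

The first and third probes form a perfectly matched duplex with the template under these assumptions, and the second does not.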

The term "plurality" refers to more than one.

The term "polymorphism" has the ordinary meaning in the art and refers to a difference in genomic sequence between individuals of the same species. A "single nucleotide polymorphism" (SNP) refers to a polymorphism at a single position.

"Polynucleotide", "nucleic acid" or "oligonucleotide" refers to a linear polymer of nucleosides (including deoxyribonucleosides, ribonucleosides, or analogs thereof) linked by internucleoside linkages. Typically, a polynucleotide comprises at least three nucleosides. In certain embodiments of the invention, one or more nucleosides in the extension probe comprise a universal base. Typically, oligonucleotides range in size from a few, e.g., 3-4, monomer units to hundreds of monomer units. When a polynucleotide such as an oligonucleotide is represented by a letter sequence such as "ATGCCTG," it is understood that the nucleotides are in 5'→3' order from left to right, and that "A" means deoxyadenosine, "C" means deoxycytidine, "G" means deoxyguanosine, and "T" means thymidine, unless otherwise specified. In the art, the letters A, C, G and T are generally used to refer to the base itself, or to the nucleoside or nucleotide comprising the base.

In naturally occurring polynucleotides, the internucleoside linkage is typically a phosphodiester linkage, and the subunits are referred to as "nucleotides". However, oligonucleotide probes containing other internucleoside linkages, such as phosphorothioate linkages, are employed in certain embodiments of the invention. It is understood that one or more subunits comprising an oligonucleotide probe having a non-phosphodiester linkage may not include a phosphate group. Such nucleotide analogs are considered to fall within the term "nucleotide" as used herein, and nucleic acids containing one or more internucleoside linkages other than phosphodiester linkages are still referred to as "polynucleotides", "oligonucleotides", and the like. In other embodiments, a polynucleotide, such as an oligonucleotide probe, comprises a linkage comprising an AP endonuclease sensitive site. For example, the oligonucleotide probe may contain an abasic residue, a residue containing a damaged base that serves as a substrate for removal by a DNA glycosylase, or another residue or linkage that serves as a substrate for cleavage by an AP endonuclease. In another embodiment, the oligonucleotide probe comprises a disaccharide nucleoside.

The term "primer" refers to a short polynucleotide, typically about 10-100 nucleotides in length, that binds to a target polynucleotide or "template" by hybridizing with the target. The primer preferably provides a starting point for template-directed synthesis of a polynucleotide complementary to the target, which can take place in the presence of suitable enzymes, cofactors, and substrates such as nucleotides, oligonucleotides, and the like. The primer typically provides an end from which extension can occur. For primers used in synthesis catalyzed by a polymerase, e.g., a DNA polymerase (e.g., in "sequencing by synthesis", polymerase chain reaction (PCR) amplification, etc.), the primers typically contain, or can be modified to contain, a free 3' OH group. PCR reactions typically employ a pair of primers (first and second amplification primers) comprising an "upstream" (or "forward") primer and a "downstream" (or "reverse") primer, which pair of primers delimits the region to be amplified. For primers used in synthesis by successive cycles of extension, ligation (and optionally cleavage), the primers typically contain, or can be modified to contain, a free 5' phosphate group or 3' OH group that can serve as a substrate for DNA ligase.

As used herein, "probe family" refers to a population of probes that each contain the same label.

As used herein, the terms "sequence determination," "determining a nucleotide sequence," "sequencing," and the like, when referring to a polynucleotide, encompass determining some or all of the sequence information of the polynucleotide. That is, the terms include sequence comparison, fingerprinting, and like levels of information about a target polynucleotide, as well as the explicit identification and ordering of each nucleoside of the target polynucleotide within a region of interest. In certain embodiments of the invention, "sequencing" includes identifying a single nucleotide, while in other embodiments, more than one nucleotide is identified. In certain embodiments of the invention, the sequence information collected in a single cycle is not sufficient by itself to identify any nucleotide. The identification of nucleosides, nucleotides and/or bases is considered herein to be equivalent. It should be noted that sequencing a polynucleotide generally yields equivalent sequence information for its fully complementary (100% complementary) polynucleotide, and is therefore equivalent to sequencing the fully complementary polynucleotide directly.
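The equivalence between the sequence of a polynucleotide and that of its fully complementary polynucleotide can be made concrete with a short sketch. It assumes DNA written 5'→3' using only the letters A, C, G and T; the function name is illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq_5to3):
    """Return the fully complementary strand, also written 5' to 3'."""
    return seq_5to3.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    read = "ATGCCTG"
    other_strand = reverse_complement(read)
    print(other_strand)
    # Applying the operation twice recovers the original sequence, which is
    # why sequencing either strand yields equivalent information.
    print(reverse_complement(other_strand) == read)
```

For the example sequence ATGCCTG used elsewhere in these definitions, the fully complementary strand is CAGGCAT.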

As used herein, "sequencing reaction" refers to a set of extension, ligation and detection cycles. When the extended duplexes on the template are removed and the template is subjected to a second set of cycles, each set of cycles is considered a separate sequencing reaction, but the resulting sequence information can be combined to produce one sequence.
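The combination of sequence information from separate sequencing reactions can be sketched as follows. The sketch assumes, for illustration only, that each reaction starts at a different offset on the template and identifies one base every `step` positions (a step of 6 corresponds to probes designed to read every 6 bases, as in FIG. 3B). The data layout and function name are hypothetical, and positions not read by any reaction are shown as "N".

```python
def combine_reactions(calls_by_offset, step):
    """Interleave base calls from several reactions into one sequence.

    calls_by_offset maps the 0-based template offset at which a reaction
    starts to the string of bases it identified, one per cycle. The call
    made in cycle c of a reaction starting at offset o is placed at
    template position o + c * step.
    """
    length = max(
        (offset + step * (len(calls) - 1) + 1
         for offset, calls in calls_by_offset.items() if calls),
        default=0,
    )
    merged = ["N"] * length  # "N" marks positions not read by any reaction
    for offset, calls in calls_by_offset.items():
        for cycle, base in enumerate(calls):
            merged[offset + cycle * step] = base
    return "".join(merged)


if __name__ == "__main__":
    # Six hypothetical reactions of two cycles each, offset by one base.
    reactions = {0: "AG", 1: "TA", 2: "GA", 3: "CC", 4: "CG", 5: "TT"}
    print(combine_reactions(reactions, step=6))
    # A single reaction leaves gaps between the positions it read.
    print(combine_reactions({0: "AG"}, step=6))
```

With all six reactions the gaps left by any one reaction are filled by the others, giving a contiguous sequence; with a single reaction only every sixth position is known.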

As used herein, "semi-solid" refers to a compressible matrix containing solid and liquid components, wherein the liquid occupies pores, spaces, or other interstices between the solid matrix components. Exemplary semi-solid matrices include matrices made from polyacrylamide, cellulose, polyamide (nylon), and cross-linked agarose, dextran, and polyethylene glycol. The semi-solid support can be provided on a second support, also referred to as a substrate, such as a substantially flat rigid support, which is capable of supporting the semi-solid support.

As used herein, "support" refers to a matrix upon or in which nucleic acid molecules, microparticles, etc., can be immobilized, i.e., they can be covalently or non-covalently attached to the support, or they can be partially or completely embedded in or on the support, such that they are substantially or completely prevented from free diffusion or relative movement.

A "priming residue" is a residue that, when present in a nucleic acid, renders the nucleic acid more susceptible to cleavage (e.g., cleavage of the nucleic acid backbone) by a cleavage agent (e.g., an enzyme, silver nitrate, etc.) or combination of cleavage agents, and/or is susceptible to modification to yield a residue that renders the nucleic acid more susceptible to such cleavage, relative to an otherwise identical nucleic acid that does not comprise the priming residue. Thus, the presence of a priming residue in a nucleic acid may result in a scissile linkage in the nucleic acid. For example, an abasic residue is a priming residue because the presence of an abasic residue in a nucleic acid makes the nucleic acid susceptible to cleavage by an enzyme, such as an AP endonuclease. Nucleosides containing damaged bases are priming residues, because the presence of a nucleoside containing a damaged base in a nucleic acid also makes the nucleic acid more susceptible to cleavage by an enzyme, such as an AP endonuclease, e.g., after removal of the damaged base by a DNA glycosylase. The cleavage site may be a bond between the priming residue and an adjacent residue, or may be a bond located one or more residues away from the priming residue. For example, deoxyinosine is a priming residue because the presence of deoxyinosine in a nucleic acid makes the nucleic acid more susceptible to cleavage by E. coli endonuclease V and homologs thereof. This enzyme cleaves the second phosphodiester bond 3' to the deoxyinosine. Any of the probes disclosed herein may contain one or more priming residues. The priming residue may (but need not) comprise a ribose or deoxyribose moiety.
The cleavage agent is preferably one that does not substantially cleave nucleic acids in the absence of a priming residue, but has significant cleavage activity toward nucleic acids containing a priming residue under the same conditions, which conditions may include the presence of an agent that modifies the nucleic acid to render it more sensitive to the cleavage agent. For example, preferably, if a cleavage agent is present in a composition comprising nucleic acids of the same length, one of which contains a priming residue and the other of which does not, the probability of cleaving the nucleic acid containing the priming residue is at least 10; 25; 50; 100; 250; 500; 1000; 2500; 5000; 10,000; 25,000; 50,000; 100,000; 250,000; 500,000; or 1,000,000 or more times the probability of cleaving the nucleic acid not containing the priming residue. Stated otherwise, the ratio of the probability of cleaving a nucleic acid containing a priming residue to the probability of cleaving an otherwise identical nucleic acid not containing a priming residue is from 10 to 10⁶, or any integer subrange therein. It is understood that this ratio may vary depending on the particular nucleic acid, as well as the position and nucleotide environment of the priming residue.
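The selectivity criterion above is a ratio of two cleavage probabilities. The following minimal sketch computes that ratio and compares it with a chosen minimum; the probability values are hypothetical numbers selected only to illustrate the arithmetic, and the function names are not taken from any described method.

```python
def selectivity_ratio(p_with_residue, p_without_residue):
    """Probability of cleaving a nucleic acid containing the priming residue,
    divided by the probability of cleaving an otherwise identical nucleic
    acid that lacks it."""
    if p_without_residue <= 0:
        raise ValueError("background cleavage probability must be positive")
    return p_with_residue / p_without_residue


def meets_selectivity(p_with_residue, p_without_residue, minimum_fold=10):
    """True if the ratio is at least minimum_fold (10 is the lowest value listed)."""
    return selectivity_ratio(p_with_residue, p_without_residue) >= minimum_fold


if __name__ == "__main__":
    # Hypothetical probabilities, for illustration only.
    print(selectivity_ratio(0.9, 0.0001))
    print(meets_selectivity(0.9, 0.0001, minimum_fold=1000))
    print(meets_selectivity(0.5, 0.1))
```

In the first hypothetical case the ratio is about 9000, which satisfies a 1000-fold requirement; in the last case the ratio is 5, which falls short of even the 10-fold minimum.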

Preferably, if the nucleic acid containing the priming residue requires modification to render the nucleic acid susceptible to cleavage by a cleavage agent, such modification can be readily carried out in the presence of a suitable modifying agent, e.g., in reasonable yields and in reasonable times. For example, in certain embodiments of the invention, at least 50%, at least 60%, at least 70%, preferably at least 80%, at least 90% or more preferably at least 95% of the nucleic acid comprising a priming residue is modified within (e.g.) 24 hours, preferably within 12 hours, more preferably within less than 1 minute to 4 hours.

Various suitable priming residues and corresponding cleavage agents are described herein. Any priming residue and cleavage agent having activity similar to those described herein may be used. One of ordinary skill in the art will be able to determine whether a particular combination of priming residue and cleavage agent is suitable for use in the present invention, e.g., whether the efficiency and rate of cleavage, the selectivity of the cleavage agent for nucleic acids containing the priming residue, etc., are suitable for use in the methods of the present invention. It is noted that a "priming residue" differs from a nucleotide that merely forms part of a restriction enzyme site in that the ability of the priming residue to increase susceptibility to cleavage generally does not depend significantly on the specific sequence context in which the priming residue is found, although, as noted above, the sequence context may have some effect on the susceptibility to modification and/or cleavage. Of course, depending on the surrounding nucleotides, the priming residue may form part of a restriction site. Thus, in most cases, the cleavage agent is not a restriction enzyme, although the use of an enzyme that is both a restriction enzyme and has non-sequence-specific cleavage capability is not excluded.

As used herein, a "universal base" is a base that "pairs" with more than one base found in a naturally occurring nucleic acid, so that it can replace a naturally occurring base in a duplex. The base need not be capable of pairing with every naturally occurring base. For example, certain bases selectively pair only with purines, or only with pyrimidines. Certain preferred universal bases (fully universal bases) can pair with any base typically found in naturally occurring nucleic acids, and thus can substitute for any of these bases in the duplex. The base need not have the same ability to pair with various naturally occurring bases. If the probe mixture contains probes (one or more positions) that contain universal bases that do not pair with all naturally occurring nucleotides, it may be desirable to utilize two or more universal bases at a particular position of the probe so that at least one universal base pairs with A, at least one universal base pairs with G, at least one universal base pairs with C, and at least one universal base pairs with T.
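The requirement that the universal bases used at a given probe position collectively pair with A, G, C and T can be checked with a simple set computation. In the sketch below, the pairing table is hypothetical: the entries "purine_pairing" and "pyrimidine_pairing" are invented stand-ins for universal bases that pair only with purines or only with pyrimidines, and "hypoxanthine" is treated as fully universal in line with its description herein.

```python
NATURAL_BASES = set("AGCT")

# Hypothetical pairing table, for illustration only.
PAIRS_WITH = {
    "hypoxanthine": set("AGCT"),    # treated as a fully universal base
    "purine_pairing": set("AG"),    # stand-in: pairs only with purines
    "pyrimidine_pairing": set("CT"),  # stand-in: pairs only with pyrimidines
}


def uncovered_bases(universal_bases):
    """Natural bases with which none of the given universal bases pairs."""
    covered = set().union(*(PAIRS_WITH[name] for name in universal_bases))
    return NATURAL_BASES - covered


def covers_all(universal_bases):
    """True if, between them, the universal bases pair with A, G, C and T."""
    return not uncovered_bases(universal_bases)


if __name__ == "__main__":
    print(sorted(uncovered_bases(["purine_pairing"])))
    print(covers_all(["purine_pairing", "pyrimidine_pairing"]))
    print(covers_all(["hypoxanthine"]))
```

A probe mixture using only the purine-pairing stand-in at a position would leave C and T uncovered, whereas combining it with the pyrimidine-pairing stand-in, or using a fully universal base, covers all four.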

A variety of universal bases are known in the art, including but not limited to: hypoxanthine, 3-nitropyrrole, 4-nitroindole, 5-nitroindole, 4-nitrobenzimidazole, 5-nitroindazole, 8-aza-7-deazaadenine, 6H,8H-3,4-dihydropyrimido[4,5-c][1,2]oxazin-7-one (P. Kong Thoo Lin and D.M. Brown, Nucleic Acids Res., 1989, 17, 10373-. Hypoxanthine is a preferred fully universal base. Nucleosides containing hypoxanthine include, but are not limited to: inosine, isoinosine, 2'-deoxyinosine, 7-deaza-2'-deoxyinosine, and 2-aza-2'-deoxyinosine.

Other universal bases are known in the art, as described in the relevant parts of the following references: Loakes, D. and Brown, D.M., Nucleic Acids Res. 22: 4039-; Ohtsuka, E., et al., J. Biol. Chem. 260(5): 2605-2608, 1985; Lin, P.K.T. and Brown, D.M., Nucleic Acids Res. 20(19): 5149-5152, 1992; Nichols, R., et al., Nature 369(6480): 492-493, 1994; Rahmon, M.S. and Humayun, N.Z., Mutation Research 377(2): 263-8, 1997; Berger, M., et al., Nucleic Acids Research 28(15): 2911-2914, 2000; Amosova, O., et al., Nucleic Acids Res. 25(10): 1930-; and Loakes, D., Nucleic Acids Res. 29(12): 2437-47, 2001. Universal bases can, but need not, form hydrogen bonds with the base in the opposite position. Universal bases can form hydrogen bonds via Watson-Crick or non-Watson-Crick interactions (e.g., Hoogsteen interactions).

In certain embodiments of the invention, oligonucleotide probes comprising abasic residues are used instead of oligonucleotide probes comprising universal bases. An abasic residue can occupy a position opposite any of the four naturally occurring nucleotides and can thus serve the same function as a nucleotide containing a universal base. In some embodiments of the invention, the linkage adjacent to the abasic residue is cleaved by an AP endonuclease, but abasic residues (i.e., functioning as universal bases) may also be employed in embodiments of the invention in which other scissile linkages (e.g., phosphorothioates) are present and other cleavage reagents are employed.

Detailed description of certain preferred embodiments of the invention

A. Sequencing by successive extension, ligation and cleavage cycles

Figure 1A diagrammatically shows an overall scheme of one aspect of the invention generally similar to the method described in U.S. patent nos. 5,740,341 and 6,306,597 issued to Macevicz. For convenience, these patents are collectively referred to herein as "Macevicz". In particular, Macevicz describes a method of identifying a nucleotide sequence in a polynucleotide, the method comprising the steps of: (a) extending an initiator oligonucleotide along the polynucleotide by ligating an oligonucleotide probe to form an extended duplex; (b) identifying one or more nucleotides of the polynucleotide; and (c) repeating steps (a) and (b) until the nucleotide sequence is determined.

Macevicz also describes a method of determining the nucleotide sequence of a template polynucleotide, the method comprising the steps of: (a) providing a probe-template duplex formed by hybridization of an initiator oligonucleotide probe to a template polynucleotide, said probe having an extendable probe end; (b) ligating an extension oligonucleotide probe to said extendable probe end to form an extended duplex containing an extended oligonucleotide probe; (c) identifying in the extended duplex (1) at least one nucleotide in the template polynucleotide that is complementary to the just-ligated extension probe or (2) a nucleotide residue in the template polynucleotide immediately downstream of the extended oligonucleotide probe; (d) generating an extendable probe end on the extended probe if an extendable end is not already present, such that the generated end differs from the end to which the last extension probe was ligated; and (e) repeating steps (b), (c) and (d) until the nucleotide sequence of the target polynucleotide is determined. In certain embodiments of these methods, each extension probe comprises a chain-terminating moiety at its end distal to the initiator oligonucleotide probe. In certain embodiments, the regenerating step comprises chemically cleaving a cleavable internucleoside linkage in the extended oligonucleotide probe.

In FIG. 1A, a polynucleotide template 20 comprising a polynucleotide region 50 of unknown sequence and a binding region 40 is attached to a support 10. Nucleotide 41 at the distal end of binding region 40 is adjacent to nucleotide 51 at the proximal end of polynucleotide region 50. An initiator oligonucleotide 30 is provided that hybridizes to binding region 40 to form a duplex. The initiator oligonucleotide 30 is also referred to herein as a "primer", and the binding region 40 can be referred to as a "primer binding region". The duplex may be, but need not be, a perfectly matched duplex. The initiator oligonucleotide has an extendable terminus 31. In FIG. 1A, the initiator oligonucleotide is bound to the binding region such that the extendable terminus 31 is located opposite nucleotide 41. However, the initiator oligonucleotide may be bound elsewhere in the binding region, as described below. An extension oligonucleotide probe 60 of length N hybridizes to the template adjacent to the initiator oligonucleotide. The terminal nucleotide 61 of the extension oligonucleotide probe is ligated to the extendable terminus 31.

The terminal nucleotide 61 is complementary to the first unknown nucleotide in the polynucleotide region 50. Thus, the identity of the terminal nucleotide 61 determines the identity of nucleotide 51. Preferably, nucleotide 51 is identified by detecting a label (not shown) attached to an extension probe whose known terminal nucleotide 61 is A, G, C or T. The label is removed after detection. FIG. 2 shows a scheme for assigning different labels, e.g., fluorophores of different colors, to extension probes having different 3' terminal nucleotides.

After ligation and detection, if the probe 60 does not have such an end, an extendable probe end is generated on the extension probe 60. A second extension probe 70, also preferably of length N, anneals to the template adjacent to the extension probe 60 and is attached to the extendable terminus of the probe 60. The identity of the terminal nucleotide 71 of the extension probe 70 specifies the identity of the nucleotide 52 at the opposite position in the polynucleotide 50. Thus, the terminal nucleotide 71 constitutes the "sequencing portion" of the extension probe, which means that the hybridization specificity of the probe portion serves as the basis for determining one or more nucleotide species in the template. It will be appreciated that other nucleotides in the extension probe will generally hybridise to the template, but only those nucleotides in the probe whose species is associated with a particular label will be used to identify the nucleotides in the template.

In a preferred embodiment of the invention, generating an extendable terminus comprises cleaving an internucleoside linkage as described below. Preferably, the cleavage also removes the label. Cleavage removes a number M of nucleotides from the extension probe (not shown). Thus, the duplex is extended by N-M nucleotides in each cycle, and nucleotides located at intervals of N-M in the template are identified. It will be appreciated that typically multiple copies of a given template are attached to a single support and the sequencing reaction is performed simultaneously on these templates.

Macevicz states that the oligonucleotide probe should generally be capable of being ligated to the initiator oligonucleotide or the extended duplex to produce an extended duplex for the next cycle of extension; the ligation should be template-driven, in that the probe should form a duplex with the template prior to ligation; the probe should have a capping moiety to prevent multiple probes from being ligated to the same template in one extension cycle; the probe should be capable of being treated or modified after ligation to regenerate an extendable terminus; and the probe should have a signal portion (i.e., a detectable portion) so that sequence information about the template can be obtained after successful ligation.

Macevicz describes certain suitable initiator oligonucleotides, extension oligonucleotide probes, templates, binding sites, and features of various methods for synthesizing, designing, producing, or obtaining these components. Macevicz also describes certain suitable ligases, ligation conditions and various suitable labels. Macevicz also describes an alternative identification method in which a labeled chain-terminating nucleotide is added to a newly ligated extension probe by polymerase extension. The identity of the added nucleotide identifies the nucleotide at the opposite position in the template.

As understood by one of ordinary skill in the art, references to templates, initiator oligonucleotides, extension probes, primers, and the like generally refer to a population or pool of nucleic acid molecules that are substantially identical within the region of interest, rather than to a single molecule. Thus, for example, a "template" generally refers to a plurality of substantially identical template molecules; a "probe" generally refers to a plurality of substantially identical probe molecules, and the like. For probes that are degenerate at one or more positions, it is understood that the sequences of the probe molecules making up a particular probe differ at the degenerate positions, i.e., they may be substantially identical only at the non-degenerate positions. For the purposes of this specification, the singular forms "a", "an", and "the" are intended to include both single molecules and populations of substantially identical molecules. Where it is desired to refer to a single nucleic acid molecule (i.e., one molecule), the terms "template molecule", "probe molecule", "primer molecule", and the like are used. In some cases, the plural nature of a population of substantially identical nucleic acid molecules is stated explicitly.

A population of substantially identical nucleic acid molecules can be obtained or produced by a variety of known methods, including chemical synthesis, biosynthesis in a cell, enzymatic amplification in vitro from one or more starting nucleic acid molecules, and the like. For example, using methods well known in the art, a nucleic acid of interest can be cloned by inserting it into a suitable expression vector, such as a DNA or RNA plasmid, and then introducing the vector into cells, such as bacterial cells, in which it can replicate. Plasmid DNA or RNA containing copies of the nucleic acid of interest is then isolated from the cells. Genomic DNA isolated from viruses, cells, etc., or cDNA produced by reverse transcription of mRNA can serve as the source of a population of substantially identical nucleic acid molecules (e.g., template polynucleotides to be sequenced) without intermediate steps such as cloning or in vitro amplification, although it is generally preferred to include such intermediate steps.

It is understood that the population members are not necessarily 100% identical, as a certain number of "errors" may occur during synthesis. Preferably, at least 50% of the population members are at least 90%, or more preferably at least 95%, identical to a reference nucleic acid molecule (i.e., a molecule of defined sequence used as the basis for sequence comparison). More preferably, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, at least 99% or more of the population members are at least 90%, or more preferably at least 95%, or more preferably at least 99% identical to a reference nucleic acid molecule. Preferably, at least 98%, 99%, 99.9% or more of the population members have a percent identity to a reference nucleic acid molecule of at least 95% or more preferably at least 99%. Percent identity can be calculated as follows: the two optimally aligned sequences are compared, the number of positions at which the same nucleobase (e.g., A, T, C, G, U or I) occurs in both sequences is determined to give the number of matched positions, the number of matched positions is divided by the total number of positions, and the result is multiplied by 100. It is understood that in some instances a nucleic acid molecule such as a template, probe, primer, etc. may be part of a larger nucleic acid molecule that also contains portions that are not part of the template, probe or primer. In this case, those other portions need not be substantially identical among the individual members of the population.
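The percent identity calculation and the population thresholds described above can be illustrated with a minimal sketch. It assumes the sequences have already been optimally aligned to equal length with no gap columns; the function names `percent_identity` and `fraction_meeting_threshold` and the example sequences are hypothetical and are not part of the specification.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity of two pre-aligned sequences of equal length.

    Assumption: alignment has already been done and there are no gap columns.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count positions at which both sequences carry the same nucleobase.
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    # Matched positions divided by total positions, multiplied by 100.
    return 100.0 * matches / len(seq_a)


def fraction_meeting_threshold(population, reference, min_identity):
    """Fraction of population members at least `min_identity` percent identical to `reference`."""
    hits = sum(1 for member in population
               if percent_identity(member, reference) >= min_identity)
    return hits / len(population)


# Hypothetical reference molecule and a small population with synthesis "errors".
reference = "ACGTACGTAC"
population = ["ACGTACGTAC", "ACGTACGTAA", "ACGTTCGTAA", "ACGTACGTAC"]

print([percent_identity(m, reference) for m in population])
print(fraction_meeting_threshold(population, reference, 90.0))
```

In this toy population three of four members are at least 90% identical to the reference, so it would satisfy the "at least 50% of members at least 90% identical" criterion.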

Macevicz describes a method of attaching a template to a support (e.g., a bead) and extending towards the end of the template distal to the support, as shown in FIG. 1A. Thus, the binding region is closer to the support than the unknown sequence, and the extended duplex grows in a direction away from the support. However, the inventors have surprisingly found that it is advantageous to carry out the method in an alternative way, in which the binding region is located at the end of the template distal to the support and extension proceeds inward, in the direction of the support. FIG. 1B depicts such an embodiment, with the various elements numbered as in FIG. 1A. The inventors determined that sequencing "inward" from the distal end of the template toward the support provides better results. In particular, sequencing from the distal end of the template toward a support such as a bead results in higher ligation efficiency than sequencing outward from the support.

As further described by Macevicz, the oligonucleotide probes are preferably added to the template as a mixture of oligonucleotides containing all possible sequences of a predetermined length. For example, a probe mixture of all possible sequences 6 nucleotides in length (hexamers), having the structure NNNNNN (which can also be expressed as (N)_k where k = 6), contains 4⁶ (4096) probe species. In general, the structure of the probe is X(N)_kN*, wherein N represents any nucleotide, k is 1-100, * represents a label, and X represents a nucleotide whose identity corresponds to the label. In certain embodiments, k is 1-100, 1-50, 1-30, or 1-20, such as 4-10. One or more of the nucleotides may comprise a universal base. Probes are typically 4-fold degenerate at each position represented by N, or contain nucleotides with reduced degeneracy at one or more positions represented by N. If desired, the mixture can be divided into probe subsets ("stringency classes") whose perfectly matched duplexes with complementary sequences have similar stability or free energy of binding. These subsets can be used in separate hybridization reactions as described by Macevicz.

The complexity (i.e., the number of different sequences) of a probe mixture can be reduced by a number of methods, including the use of so-called reduced-degeneracy nucleotides or nucleotide analogues. For example, a library of probes containing all possible sequences of 8 nucleotides contains 4⁸ probes. The number of probes can be reduced to 4⁶ by using universal bases at two positions, while maintaining various desirable characteristics of the octamer library, such as length. The invention includes the use of any universal base described above or in the references cited above.
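The effect of universal bases on mixture complexity can be checked by direct enumeration. The following is a minimal sketch, assuming that "I" stands for a single universal base and that the two universal positions are the last two of the octamer; both choices, and the name `probe_mixture`, are illustrative assumptions rather than details taken from the specification.

```python
from itertools import product

BASES = "ACGT"      # the four naturally occurring DNA bases
UNIVERSAL = "I"     # assumption: one universal base, written as I

def probe_mixture(length, universal_positions=()):
    """Enumerate all probe sequences of the given length.

    Positions listed in `universal_positions` (0-based) are fixed to the
    universal base; every other position is 4-fold degenerate.
    """
    universal = set(universal_positions)
    choices = [UNIVERSAL if i in universal else BASES for i in range(length)]
    return ["".join(p) for p in product(*choices)]

hexamers = probe_mixture(6)                                    # 4**6 species
octamers = probe_mixture(8)                                    # 4**8 species
reduced = probe_mixture(8, universal_positions=(6, 7))         # 4**6 species

print(len(hexamers), len(octamers), len(reduced))
```

The reduced octamer library has the same number of species as the hexamer library while every probe remains 8 nucleotides long.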

Depending on the embodiment, the extended duplex or initiator oligonucleotide may be extended with an oligonucleotide probe in the 5'→3' direction or the 3'→5' direction, as described below. In general, the oligonucleotide probe need not form a perfectly matched duplex with the template, although such binding may be preferred. In embodiments where one nucleotide in the template is identified per extension cycle, perfect base pairing is required only for identifying that particular nucleotide. For example, in embodiments where the oligonucleotide probe is enzymatically ligated to the extended duplex, perfect base pairing, i.e., proper Watson-Crick base pairing, is required between the terminal nucleotide of the ligated probe and its complement in the template. Typically, in such embodiments, the remaining nucleotides of the probe serve as "spacers" that ensure that the next ligation occurs at a predetermined site, i.e., a predetermined number of bases further along the template. That is, they may or may not provide further sequence information. Likewise, in embodiments that rely on polymerase extension for base identification, the probe acts primarily as a spacer, and thus specific hybridization to the template is not critical.

The above method enables partial sequencing, i.e., identification of individual nucleotides spaced apart from each other in the template. In a preferred embodiment of the invention, in order to collect more complete information, a plurality of reactions is carried out, each utilizing a different initiator oligonucleotide i. Each initiator oligonucleotide i binds to a different part of the binding region. Preferably, the binding positions are such that the extendable ends of different initiator oligonucleotides, when hybridized to the binding region, are offset from one another by 1 nucleotide. For example, as shown in FIG. 3, sequencing reactions 1...N are performed. Initiator oligonucleotides i₁...i_n are of the same length, and after binding to the binding region 40 their terminal nucleotides 31, 32, 33, etc. hybridize to consecutive adjacent positions 41, 42, 43, etc. in the binding region 40. Thus, extension probes e₁...e_n bind to consecutive adjacent regions of the template and are ligated to the extendable termini of the initiator oligonucleotides. The terminal nucleotide of probe e_n, which is ligated to i_n, is complementary to nucleotide 55 of polynucleotide region 50, i.e., the first unknown nucleotide in the template. In a second extension, ligation and detection cycle, the terminal nucleotide of probe e_n is complementary to nucleotide 56 of polynucleotide region 50, i.e., the second nucleotide of the unknown sequence. Similarly, the terminal nucleotides of the extension probes ligated to the duplexes formed by initiator oligonucleotides i₂, i₃, i₄, etc. are complementary to the third, fourth and fifth nucleotides of the unknown sequence 50. It will be appreciated that the initiator oligonucleotides may bind to regions progressively further from the polynucleotide region 50, rather than progressively closer to it.

The spacer function of the non-terminal nucleotides of the extension probes makes it possible to obtain sequence information at template positions that lie a given number of nucleotides away from the position at which the initiator oligonucleotide binds, without requiring a corresponding number of cycles on any given template. For example, nucleotides spaced at intervals of N-1 nucleotides can be identified in successive cycles by ligating probes of length N and then cleaving to remove a single terminal nucleotide from the extension probe. For example, the nucleotides at positions 1, N, 2N-1, 3N-2, 4N-3 and 5N-4 in the template can be identified in 6 cycles, where the nucleotide at position 1 of the template is the one opposite the nucleotide ligated to the extendable probe end in the duplex formed by binding of the initiator oligonucleotide to the template. Similarly, if cleavage removes two nucleotides of an extension probe of length N, nucleotides spaced N-2 nucleotides apart can be identified in successive cycles. For example, 6 cycles can be used to identify the nucleotides at positions 1, N-1, 2N-3, 3N-5, 4N-7 and 5N-9 in the template. Thus, if the probe is 8 nucleotides in length and 2 nucleotides are removed per cycle, the nucleotides at positions 1, 7, 13, 19, 25 and 31 are identified. Accordingly, the number of cycles required to identify a nucleotide that is X nucleotides away from the first nucleotide in the template is about X/M, where M here denotes the length of extension probe remaining after cleavage, rather than about X.
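The position arithmetic in the preceding paragraph can be reproduced with a short sketch. The function name `positions_read` is hypothetical; the only assumption is the rule stated above, namely that each cycle advances the read position by the probe length minus the number of nucleotides removed by cleavage.

```python
def positions_read(probe_length, removed_per_cycle, cycles):
    """Template positions (1-based) identified in successive cycles of one reaction.

    Each cycle extends the duplex by (probe_length - removed_per_cycle)
    nucleotides, so successive reads are spaced by that amount.
    """
    step = probe_length - removed_per_cycle
    if step <= 0:
        raise ValueError("cleavage must leave at least one nucleotide of the probe")
    return [1 + cycle * step for cycle in range(cycles)]

# Probe of length N = 8, one nucleotide removed per cycle:
# positions 1, N, 2N-1, 3N-2, 4N-3, 5N-4.
print(positions_read(8, 1, 6))

# Probe of length 8, two nucleotides removed per cycle, first five cycles.
print(positions_read(8, 2, 5))
```

With an 8-mer and two nucleotides removed per cycle, the first five cycles read positions 1, 7, 13, 19 and 25, matching the worked example in the text.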

For example, the scheme in FIG. 3B shows the final result of applying the cycles of extension, ligation and cleavage with extension probes designed to read the template once every 6 bases. By sequentially stripping the template and re-sequencing it with 6 initiator oligonucleotides bound at offset positions in the binding region, and then combining the results, all template bases within a defined length can be determined. For example, if each of the 6 reactions comprises 10 consecutive ligations, the resulting read length is 60 consecutive base pairs, whereas if each reaction comprises 15 consecutive ligations, the resulting read length is 90 consecutive base pairs.
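How the offset reactions interleave into one contiguous read can be sketched as follows. This is an illustration only: it assumes the initiator oligonucleotides are offset by exactly one nucleotide each and that every reaction reads once every `step` bases; the function names are hypothetical.

```python
def reaction_positions(offset, step, ligations):
    """Template positions (1-based) read by one reaction whose initiator is shifted by `offset`."""
    return [1 + offset + k * step for k in range(ligations)]

def combined_read(step, ligations):
    """Merge the positions read by `step` reactions with offsets 0 .. step-1."""
    covered = set()
    for offset in range(step):
        covered.update(reaction_positions(offset, step, ligations))
    return sorted(covered)

for ligations in (10, 15):
    read = combined_read(step=6, ligations=ligations)
    # The merged positions are contiguous if they run from 1 to len(read) without gaps.
    contiguous = read == list(range(1, len(read) + 1))
    print(ligations, len(read), contiguous)
```

Six reactions of 10 ligations each cover 60 consecutive positions, and six reactions of 15 ligations each cover 90, in agreement with the read lengths given above.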

While not wishing to be bound by any theory, the inventors propose that, in contrast to this approach, most sequential sequencing-by-synthesis methods suffer from error accumulation, which ultimately limits the achievable read length. An advantageous feature of certain methods described herein is that one base is identified every n bases (depending on the position of the cleavable moiety in the probe), so that after a given number of cycles (y), base number n × y - (n - 1) is reached (e.g., the 71st base after 15 cycles as in the above example, or the 115th base after 20 cycles with a probe having 6 bases on the 3' side of the cleavage site). The ability to "restart" with an initiator oligonucleotide at the n-1, n-2, etc. positions greatly reduces the accumulation of consecutive errors (through dephasing or depletion) over a given length, since stripping the extended strand from the template and hybridizing a new initiator oligonucleotide effectively resets the background signal to zero. For example, comparing a polymerase-based sequencing-by-synthesis method with the ligation-based method described herein, if the signal-to-noise ratio is 99:1 for each extension cycle, then the signal-to-noise ratio after 100 cycles of the polymerase-based method is 37:63, compared with 85:15 for the ligase-based method. The net result is that the ligase-based approach achieves a greatly increased read length compared with the polymerase-based approach.
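The figures in the preceding paragraph can be checked numerically. The sketch below makes two assumptions that the text does not state explicitly: that the useful signal decays as 0.99 raised to the number of cycles, and that the ligase-based figure corresponds to roughly 16 cycles (about 100 bases read at about 6 bases per cycle). The function names are hypothetical.

```python
def base_reached(n, cycles):
    """Position of the base reached after `cycles` cycles when one base is read every n bases."""
    return n * cycles - (n - 1)

def signal_fraction(per_cycle_efficiency, cycles):
    """Fraction of templates still in phase, assuming independent per-cycle losses."""
    return per_cycle_efficiency ** cycles

print(base_reached(5, 15))    # the 71st base
print(base_reached(6, 20))    # the 115th base

# Polymerase-based method: 100 cycles at 99% per-cycle efficiency.
print(round(100 * signal_fraction(0.99, 100)))
# Ligase-based method: assumed 16 cycles at the same per-cycle efficiency.
print(round(100 * signal_fraction(0.99, 16)))
```

Under these assumptions the retained signal is about 37% after 100 cycles and about 85% after 16 cycles, which is consistent with the 37:63 and 85:15 ratios quoted above.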

For a variety of reasons, it is important to be able to identify a nucleotide in fewer cycles than would be needed if every preceding nucleotide in the template required a cycle of its own. In particular, the efficiency of the various steps of the process cannot reach 100%. For example, some templates may not be successfully ligated to extension probes; some extension probes may not be cleaved, etc. Thus, the reactions occurring on different copies of the template gradually become out of phase with each cycle, and the number of templates from which useful, accurate information can be obtained decreases. It is therefore particularly desirable to minimize the number of cycles required to read nucleotides located further from the extendable end of the initiator oligonucleotide. However, increasing the extension probe length increases the complexity of the probe mixture, which may reduce the effective concentration of each probe sequence. As described herein, complexity can be reduced with nucleotides having reduced degeneracy, but this may result in reduced hybridization strength and/or reduced ligation efficiency. The inventors have recognized that these competing factors need to be balanced to optimize the results. Thus, in a preferred embodiment of the invention, extension probes 8 nucleotides in length are used, with reduced degeneracy at selected positions. Furthermore, the inventors recognized the importance of selecting appropriate scissile linkages, as well as cleavage conditions and times, to optimize the efficiency of the cleavage step (i.e., the percentage of linkages successfully cleaved in each cleavage step) and its specificity for the intended linkage.

B. Oligonucleotide extension probe design

While Macevicz mentions that reduced-degeneracy nucleoside analogs can be used in oligonucleotide extension probes, he does not specify the particular positions in the extension probe at which it is especially desirable to include such residues, nor the specific probe structures (i.e., sequences) into which the reduced-degeneracy nucleosides are incorporated. The present inventors have recognized that it may be particularly advantageous to employ a specific number of reduced-degeneracy nucleosides (e.g., nucleosides containing universal bases) at specific positions of an oligonucleotide extension probe. For example, in certain embodiments of the invention, most or all of the nucleotides at position 6 or more distal positions (counting from X) contain a universal base. For example, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, or 100% of the nucleotides at position 6 or beyond can contain universal bases. These nucleotides do not necessarily all contain the same universal base. In certain embodiments of the invention, hypoxanthine and/or nitroindole are used as universal bases. For example, nucleosides such as inosine can be used.

The present inventors have recognized that excellent results can be obtained with an extension probe more than 6 nucleotides in length in which one or more nucleotides at position 6 or beyond, counting from the proximal end of the probe (i.e., from the nucleotide ligated to the extendable probe end), are reduced-degeneracy nucleotides such as nucleotides containing universal bases. That is, if the most proximal nucleotide is considered to be position 1, then one or more nucleotides at position 6 or beyond comprise universal bases; for example, 1, 2, or 3 of the nucleotides at position 6 or beyond in an 8-mer probe contain universal bases. For example, in 3'→5' sequencing, a probe of the structure 3'-XNNNNsINI-5' can be used, wherein X and N represent any nucleotide, I represents inosine (or another universal base), and "s" represents a scissile linkage, such that cleavage occurs between the fifth and sixth residues from the 3' end, and preferably at least one residue between the scissile linkage and the 5' end carries a label corresponding to the identity of X. Another design is 3'-XNNNNsNII-5'. Yet another probe design is 3'-XNNNNsIII-5'. This design results in a probe mixture of modest complexity containing 1024 different probes, is long enough to prevent the formation of significant amounts of adenylation products (see Example 1), and has the advantage that the extension products obtained after cleavage consist of unmodified DNA. One disadvantage is that this probe extends the primer by only 5 bases at a time. Since read length is a function of extension length multiplied by the number of cycles, each one-base increase in extension length increases the read length by 1 × the number of cycles (e.g., by 20 bases if 20 cycles are used). Another probe design leaves one or more inosines (or other universal bases) at the end of the extension probe after cleavage, producing an extension of 6 bases or more per cycle. For example, with the probe 3'-XNNNNIsII-5', the duplex is extended 6 bases at a time, leaving a 5' inosine at the junction.
In these designs, it is preferred that at least one residue between the scissile linkage and the 5' end carries a label corresponding to the identity of X. In certain embodiments of the invention, the third nucleotide from the distal end of the probe, i.e., counting from the end opposite the nucleotide ligated to the extendable probe end, contains a universal base (i.e., if the distal end is considered position K, then the nucleotide at position K-2 contains a universal base).
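The trade-off between mixture complexity and extension per cycle for the probe designs above can be tabulated with a small sketch. The design strings are written 3'→5' as in the text; the parser `describe_probe` is hypothetical and assumes that X and N are each 4-fold degenerate, that I is a single universal base, and that "s" marks the one scissile linkage.

```python
def describe_probe(design):
    """Summarize a probe design string such as 'XNNNNsIII' (written 3'->5').

    X = nucleotide whose identity corresponds to the label, N = any nucleotide,
    I = universal base, s = scissile linkage.
    """
    if set(design) - set("XNIs") or design.count("s") != 1:
        raise ValueError("design must use only X, N, I and exactly one 's'")
    residues = design.replace("s", "")
    # X and N positions are 4-fold degenerate; I positions add no complexity.
    degenerate = sum(1 for r in residues if r in "XN")
    return {
        "length": len(residues),
        "probes": 4 ** degenerate,
        # Residues on the 3' side of the scissile linkage remain after cleavage.
        "extension_per_cycle": design.index("s"),
    }

for design in ("XNNNNsINI", "XNNNNsNII", "XNNNNsIII", "XNNNNIsII"):
    print(design, describe_probe(design))
```

Under these assumptions, 3'-XNNNNsIII-5' gives 1024 probes with a 5-base extension per cycle, while 3'-XNNNNIsII-5' keeps the same complexity but extends the duplex by 6 bases per cycle.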

In certain embodiments of the invention, Locked Nucleic Acid (LNA) bases are employed at one or more positions of the initiator oligonucleotide probe, the extension probe, or both. See, for example, U.S. Pat. No. 6,268,490; Koshkin, A.A., et al., Tetrahedron 54: 3607-; Singh, S.K., et al., Chem. 455-456, 1998. LNA, which can be incorporated into oligonucleotides that also contain naturally occurring nucleotides and/or nucleotide analogs, can be synthesized using an automated DNA synthesizer and standard phosphoramidite chemistry. They can also be synthesized with a label such as those described below.

C. Template and support preparation method

Macevicz describes a method of first synthesizing a template comprising a plurality of substantially identical template molecules, such as amplification by conventional Polymerase Chain Reaction (PCR) in a test tube or other container. Macevicz indicates that the amplified template molecules are preferably attached to a support, such as a magnetic microparticle (e.g., a bead), after synthesis.

The present inventors have recognized that it may be advantageous to synthesize the template to be sequenced on or in the support itself, e.g., using a support such as a microparticle or any of various semi-solid supports, such as a gel matrix, to which one of a pair of amplification primers is attached before the PCR reaction is performed. This method does not require a separate step to attach the template molecules to the support after synthesis. It also makes it convenient to amplify a plurality of templates of different sequence in parallel. For example, synthesis on microparticles produces a population of individual microparticles, each having attached to it multiple copies of a particular template molecule (or its complement), such that the sequence of the template molecules attached to each microparticle differs from the sequence of the template molecules attached to the other microparticles. Thus, each support is associated with a clonal population of templates, e.g., support A has multiple copies of template X attached thereto; support B has multiple copies of template Y attached thereto; support C has multiple copies of template Z attached thereto, etc. "Clonal template population", "clonal nucleic acid population" and the like refer to a population of substantially identical template molecules, preferably produced by successive rounds of amplification starting from a single template molecule of interest (the starting template). The substantially identical template molecules may be substantially identical to the starting template or to its complement.

Amplification is typically performed by PCR, but other amplification methods may be used (see below). It will be appreciated that the members of a clonal population are not necessarily 100% identical, for example, a certain number of "errors" may occur during synthesis, such as amplification. Preferably, at least 50% of the members of the clonal population are at least 90%, or more preferably at least 95%, identical to the starting template molecule (or its complement). More preferably, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, at least 99%, or more of the population members are at least 90%, or more preferably at least 95%, or more preferably at least 99% identical to the starting template molecule (or its complement). Preferably, at least 95% or more preferably at least 99% of the population members have a percent identity with the starting template molecule (or its complement) of at least 98%, 99%, 99.9% or more.

Various techniques can be used to attach the amplification primers to the support. For example, one member of a binding pair (e.g., biotin) can be used to functionalize one end (5' end) of the primer, and the other member of the binding pair (e.g., streptavidin) can be used to functionalize the support. Any similar binding pair may be used. For example, nucleic acid tags of defined sequence may be attached to a support and primers containing complementary nucleic acid tags may hybridize to the nucleic acid tags attached to the support. Various linkers and crosslinking agents may also be employed.

Methods for performing PCR are well known in the art; see, e.g., U.S. Pat. Nos. 4,683,195, 4,683,202, and 4,965,188, and Dieffenbach, C. and Dveksler, G.S., PCR Primer: A Laboratory Manual, 2nd edition, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, 2003. Methods for amplifying nucleic acids on microparticles are well known and described in the art, e.g., standard PCR can be performed on primer-attached beads in microtiter plate wells or tubes (e.g., beads prepared as in Example 12). Although PCR is a convenient method of amplification, many other methods known in the art may be employed. For example, multiple strand displacement amplification, Helicase Displacement Amplification (HDA), nick translation, Qβ replicase amplification, rolling circle amplification and other isothermal amplification methods may be employed.

The template molecule may be obtained from any source. For example, DNA may be isolated from a sample, which may be obtained or derived from a subject. The term "sample" refers broadly to any source of template to be sequenced. The term "derived from" covers both a sample obtained directly from a subject and nucleic acids in the sample that have been further processed to obtain a template molecule. The sample source may be any viral, prokaryotic, archaeal, or eukaryotic species. In certain embodiments of the invention, the source is a human. The sample may be, for example, blood or another bodily fluid containing cells, semen, a biopsy sample, and the like. Genomic or mitochondrial DNA from any organism of interest can be sequenced. cDNA can be sequenced. RNA can also be sequenced, e.g., by first performing reverse transcription to produce cDNA using methods well known in the art, such as RT-PCR. DNA from different samples and/or subjects can be combined into mixtures. The sample can be processed in various ways. Nucleic acids can be isolated, purified, and/or amplified from a sample using known methods. Of course, entirely artificial synthetic nucleic acids and recombinant nucleic acids not derived from an organism can also be sequenced.

The template may be provided in double-stranded or single-stranded form. Typically, when the template is initially provided in double-stranded form, the two strands are then separated (e.g., the DNA is denatured), and only one of the two strands is amplified to produce a clonal population of localized template molecules, which clonal population is (e.g.) attached to a microparticle, immobilized in or on a semi-solid support, or the like.

The template may be selected or processed in a variety of other ways. For example, a template obtained by treating DNA with a methylation-sensitive restriction enzyme (e.g., MspI) can be used. This treatment, which generates DNA fragments, may be performed prior to amplification. Fragments containing methylated bases are not amplified. Sequence information obtained from hypermethylated templates can be compared to sequence information obtained from templates from the same source that were not selected for hypermethylation.

The template may be inserted into the library, or the template may be provided in the library, or the template may be derived from the library. For example, hypermethylated libraries are known in the art. The insertion of the template into the library conveniently allows for the attachment of additional nucleotide sequences to the ends of the template, such as tags, primer binding sites or starter oligonucleotides. For example, certain protocols allow for the addition of tags having multiple binding sites, such as amplification primer binding sites, initiation oligonucleotide binding sites, capture agent binding sites, and the like.

Various suitable libraries are known in the art. For example, USSN 10/978,224, PCT publications WO2005042781 and WO2005082098, and Shendure, J. et al., Science, 309(5741): 1728-32, 2005 (Sciencexpress, August 4, 2005, www.sciencexpress.org) describe libraries of particular interest and methods for their construction. It will of course be appreciated that other methods of generating such libraries may be employed. Certain libraries of particular interest contain multiple nucleic acid fragments (typically DNA), each fragment containing two nucleic acid segments of interest separated by sequences complementary to the amplification and/or sequencing primers used in the sequencing step, i.e., these sequences serve as primer binding regions (PBRs). In an embodiment of particular interest, each nucleic acid segment is a contiguous stretch of naturally occurring DNA. For example, the segments may be from the 5' and 3' ends of a contiguous portion of genomic DNA, as described in the above-mentioned references. Consistent with those references, such nucleic acid segments are referred to herein as "tags" or "end tags". Two tags derived from a stretch of contiguous nucleic acid, such as from its 5' and 3' ends, are referred to as a "paired tag" or "ditag". It should be understood that the term "paired tag", although singular in form, encompasses both tags. The distance separating the two tags is constrained by selecting, as the source of the paired tags, contiguous portions of DNA that fall within a predetermined size limit.

In addition to being separated by sequences complementary to sequencing and/or amplification primers, the nucleic acid fragments of the library will typically also contain sequences complementary to the sequencing and/or amplification primers flanking the tag, i.e., a first such sequence may be located 5 'of the tag nearer the 5' end of the fragment and a second such sequence may be located 3 'of the tag nearer the 3' end of the fragment. It is understood that the positions of the two tags present in the contiguous nucleic acid from which the tags are generated in various embodiments may, but need not, correspond to the positions of the tags in the library DNA fragments.

The nucleic acid fragments and tags may have different size ranges. The length of the nucleic acid fragment may be, for example, 80-300 nucleotides, such as 100-200, 100-150, about 200, etc. The tag may be, for example, 15-25 nucleotides in length, such as about 17-18 nucleotides in length, etc. It should be noted that these lengths are exemplary, and not limiting. Shorter or longer fragments and/or tags may be used.

It should also be noted that while obtaining paired tags from a single contiguous nucleic acid provides a convenient method for library construction, it is important that the paired tags are separated from each other by a distance ("separation distance") in the nucleic acid from which they were originally produced, wherein the separation distance falls within a predetermined range of distances. The tags are separated by a separation distance that falls within a predetermined range to enable alignment of the tag sequence to a reference sequence, such as a reference genomic sequence. Without wishing to be bound by any theory, this may be advantageous for certain applications such as genome re-sequencing, where it enables shorter read lengths to be employed while still being able to accurately locate sequences on a reference genome. The 5 'and 3' tags of the paired tags represent segments of larger nucleic acid fragments, such as genomic DNA (i.e., they have the above sequences), which are spaced within a predetermined distance from each other in naturally occurring DNA fragments, such as genomic DNA fragments. For example, in certain embodiments of the invention, in a naturally occurring DNA fragment, the 5 'and 3' tags of a pair of tags represent DNA segments within 500 nucleotides of each other, within 1kB of each other, within 2kB of each other, within 5kB of each other, within 10kB of each other, or within 20kB of each other. In certain embodiments, in a naturally occurring DNA fragment, the 5 'and 3' tags of a pair of tags are separated by 500 nucleotides to 2kB, such as 700 nucleotides to 1.2kB, about 1kB, and the like. It should be noted that the exact separation distance of the two tags of a pair is not important and is generally unknown. 
Furthermore, while the tag is originally obtained from a larger nucleic acid fragment, the term "tag" is used for any nucleic acid segment containing a tag sequence, whether present in the original sequence content or in a library fragment, an amplification product of a library fragment, a template to be sequenced, or the like.
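
To illustrate why a bounded separation distance helps when short tags are aligned to a reference sequence, the sketch below filters candidate placements of a 5' tag and a 3' tag by their separation. It is an illustration under stated assumptions, not a method taken from the specification: the `TagHit` type, the function names, the coordinates, and the assumption that both tags map to the same strand with the 5' tag upstream are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagHit:
    start: int   # 0-based start of the tag on the reference (hypothetical coordinate)
    length: int  # tag length in nucleotides


def separation(five_prime: TagHit, three_prime: TagHit) -> int:
    """Nucleotides between the end of the 5' tag and the start of the 3' tag."""
    return three_prime.start - (five_prime.start + five_prime.length)


def consistent_pairs(hits_5, hits_3, min_sep, max_sep):
    """Keep only placements whose separation lies within [min_sep, max_sep]."""
    return [
        (a, b)
        for a in hits_5
        for b in hits_3
        if min_sep <= separation(a, b) <= max_sep
    ]


# A 17-nucleotide 5' tag that matches the reference at two places,
# and a 3' tag that matches at one place.
hits_5 = [TagHit(start=1_000, length=17), TagHit(start=50_000, length=17)]
hits_3 = [TagHit(start=2_000, length=17)]

# Accept separations of 500 nucleotides to 2 kb, one of the ranges given above.
pairs = consistent_pairs(hits_5, hits_3, min_sep=500, max_sep=2_000)
print(len(pairs), pairs[0][0].start)  # 1 1000
```

Only the placement at position 1,000 is compatible with the 3' tag, so the ambiguous short 5' tag is located on the reference even though it matches in two places.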

Nucleic acid fragments (e.g., library molecules) may have the following structure:

linker 1-tag 1-linker 3-tag 2-linker 2

Tag 1 and tag 2 may be the 5' and 3' tags of a paired tag. Either tag may be the 5' tag or the 3' tag. Linker 1 and linker 2 contain primer binding regions for one or more primers. In certain embodiments, linkers 1 and 2 each contain a PBR for an amplification primer and a PBR for a sequencing primer. The primers for each linker can be nested, such that the sequencing primer PBR is located inside the amplification primer PBR. Linker 3 may contain one or more PBRs for sequencing primers to allow sequencing of tag 1 and tag 2. The term "linker" refers to a nucleic acid sequence present in a plurality of nucleic acid fragments of a library, e.g., in substantially all fragments of the library. The linker may or may not have an actual ligation function during library construction; it may simply be regarded as a defined sequence common to most or all members of a given library. Such sequences are also referred to as "universal sequences". Thus, a nucleic acid complementary to a linker or a portion thereof hybridizes to a plurality of members of the library and can serve as an amplification primer or a sequencing primer for most or all of the molecules in the library.

In certain embodiments of the invention, the nucleic acid fragment has the following structure:

linker 1-tag 1-internal adapter-tag 2-linker 2

Tag 1 and tag 2 are as described above, and linker 1 and linker 2 contain PBRs as described above. The internal adaptor contains two primer binding regions, which may be referred to as IA and IB, as described below. These PBRs can be used to generate microparticles with two separate populations of substantially identical nucleic acids attached, wherein one population of nucleic acids comprises tag 1 and the other population comprises tag 2. The two separate populations contain sequences that are at least partially different, e.g., they differ in their tag regions. A spacer may be included between the two primer binding regions of the internal adaptor. The spacer may contain an abasic residue that prevents the polymerase from extending through the spacer. Of course, a spacer containing any other blocking group that prevents polymerase extension through the spacer may be used.

In other embodiments, the nucleic acid fragment comprises one or more (e.g., 2, 4, 6, etc.) additional tags and one or more additional internal adaptors. For example, a nucleic acid fragment can have the following structure:

linker 1-tag 1-internal adaptor 1-tag 2-linker 2-tag 3-internal adaptor 2-tag 4-linker 3

It should be noted that the nucleic acid fragments of the invention, libraries of such fragments, microparticles comprising two or more populations of substantially identical nucleic acids, and arrays of such microparticles can be used in a variety of sequencing methods in addition to the ligation-based sequencing methods described herein. For example, sequencing methods such as FISSEQ, pyrosequencing, and the like can be used. See, for example, WO 2005082098. Of course, ligation-based methods may also be advantageously used. It is to be understood that, in the ligation-based methods described herein, the term "sequencing primer" may be read as "initiator oligonucleotide".

In certain embodiments of the invention, PCR is performed in separate aqueous chambers of an emulsion (also referred to as "reactors") to synthesize the templates to be sequenced. Preferably, each chamber contains a particulate support such as a bead with a suitable first amplification primer attached, a first copy of the template, a second amplification primer, and the components (e.g., nucleotides, polymerase, cofactors, etc.) necessary to perform a PCR reaction. Methods for preparing emulsions are described, for example, in U.S. Pat. Nos. 6,489,103 (Griffiths) and 5,830,663 (Embleton), and in U.S. Publication No. 20040253731 (Ghadessy). Methods for performing PCR in individual emulsion chambers to generate clonal populations of templates attached to microparticles are described, for example, in Dressman, D. et al., Proc. Natl. Acad. Sci., 100(15): 8817-8822, 2003, and PCT publication WO 2005010145.

The methods described in the above references, or modified forms thereof, can be used to generate clonal populations of microparticle-attached templates for sequencing. In a preferred non-limiting embodiment, short (< 500 nucleotides) templates suitable for PCR are generated by ligating universal adaptor sequences to each end of a population of different target sequences (templates). ("Universal" here means that the same adaptor sequences are ligated to each template, yielding "adapted" templates that can all be amplified using a single pair of PCR amplification primers.) A bulk PCR reaction is prepared using the adapted templates, one free amplification primer, microparticles with the second amplification primer attached, and other PCR reagents (e.g., polymerase, cofactors, nucleotides, etc.). The aqueous PCR reaction is mixed with an oil phase (containing light mineral oil and surfactant) at a 1:2 ratio. This mixture is vortexed to produce a water-in-oil emulsion. One milliliter of the mixture is sufficient to produce 4 × 10⁹ aqueous chambers in the emulsion, each of which is a potential PCR reactor. Aliquots of the emulsion are dispensed into the wells of a microtiter plate (e.g., 96-well plate, 384-well plate, etc.) and thermocycled to effect solid-phase PCR amplification on the microparticles. To ensure clonality, the microparticle and template concentrations are carefully controlled so that a reactor rarely contains more than one bead or more than one template molecule. For example, in certain embodiments of the invention, at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or more of the reactors contain one bead and one template. Thus, the members of each clonal population of templates are spatially confined by virtue of their attachment to the microparticle. In general, the attachment points of the templates may be substantially uniformly distributed over the surface of the particle.
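
The effect of bead and template concentration on reactor occupancy can be explored with a simple statistical model. The sketch below assumes that beads and templates partition into chambers independently and at random (Poisson statistics). That model is an assumption made here for illustration and is not stated in the specification. The figure of 4 × 10⁹ chambers per milliliter comes from the text, while the bead and template counts and the function names are hypothetical.

```python
import math

COMPARTMENTS_PER_ML = 4e9  # chambers per milliliter of mixture, as quoted in the text


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k items in a chamber when the average is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def loading_stats(beads_per_ml: float, templates_per_ml: float) -> dict:
    """Occupancy estimates under the assumed independent Poisson loading model."""
    if beads_per_ml <= 0 or templates_per_ml <= 0:
        raise ValueError("concentrations must be positive")
    mean_beads = beads_per_ml / COMPARTMENTS_PER_ML
    mean_templates = templates_per_ml / COMPARTMENTS_PER_ML
    p_one_bead = poisson_pmf(1, mean_beads)
    p_one_template = poisson_pmf(1, mean_templates)
    p_no_template = poisson_pmf(0, mean_templates)
    return {
        # chambers holding exactly one bead and exactly one template
        "one_bead_and_one_template": p_one_bead * p_one_template,
        # among chambers with any template, the share with exactly one
        "clonal_given_template": p_one_template / (1.0 - p_no_template),
        # share of chambers (and hence of beads) that receive no template
        "bead_without_template": p_no_template,
    }


# Hypothetical loading: on average 0.1 bead and 0.1 template per chamber.
stats = loading_stats(beads_per_ml=4e8, templates_per_ml=4e8)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Under this model, lowering the template concentration raises the share of template-containing chambers that are clonal but also leaves more beads without template, which is one reason the enrichment of template-attached microparticles described later can be useful.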

Of particular interest is the generation of populations of microparticles using PCR emulsion methods, wherein individual microparticles are attached to different populations of amplified nucleic acid fragments containing pairs of 5 'tags and 3' tags. In other words, it is of particular interest to generate a population of microparticles in which individual particles have different nucleic acid fragments from the library amplified as described above and linked thereto.

Methods known in the art for amplifying DNA in emulsion are limited by the ability to achieve amplification of nucleic acid molecules and attachment of these molecules to microparticles (as described in the above references). For example, PCR efficiency has been shown to decrease exponentially with longer amplicons. The reduction in PCR efficiency reduces the efficiency with which nucleic acid fragments containing paired tags and primer binding sites (as described above) are amplified in a PCR emulsion and attached to microparticles by such amplification. Thus, the method of amplifying a single population of substantially identical nucleic acid fragments containing first and second tags of a paired tag in a PCR emulsion and attaching to beads via such amplification is subject to a number of limitations.

The methods provided herein enable the use of smaller amplicons while retaining the paired tag information that would be obtained if individual nucleic acid fragments containing both the 5' and 3' tags of a paired tag were attached to microparticles by amplification. The present invention provides microparticles, such as beads, having attached at least two distinct populations of nucleic acids, wherein each of the at least two populations consists of a plurality of substantially identical nucleic acids, wherein a first population of substantially identical nucleic acids comprises a first nucleic acid segment of interest, such as a 5' tag, and a second population of nucleic acids comprises a second nucleic acid segment of interest, such as a 3' tag. The first and second populations of nucleic acids are amplified from a larger nucleic acid fragment that contains both tags and also contains a suitable arrangement of primer binding sites flanking and separating the tags, such that the two amplification reactions can be performed consecutively or (preferably) simultaneously in a single PCR emulsion reactor in the presence of the microparticle and amplification reagents. The microparticle has two distinct primer populations attached, one of which corresponds in sequence to a primer binding region in the nucleic acid fragment external to one of the tags, and the other of which corresponds in sequence to a primer binding region external to the other tag, i.e., these primer binding regions flank the tags.

The invention also provides primers that bind to a primer binding region located between two tags, so as to perform two different PCR reactions, each amplifying a portion of a nucleic acid fragment containing one tag. The amplified nucleic acid segments contain other primer binding regions that differ from each other. These other primer binding regions are present in the nucleic acid fragment, within the PBR of the amplification primers, i.e., they are nested primers. These additional PBRs serve as binding regions for two different sequencing primers. Thus, by applying one or the other of two different sequencing primers to a microparticle to which two populations of substantially identical nucleic acid segments are ligated, one or the other of the two nucleic acid segments can be sequenced without interference from the other nucleic acid segment. Each nucleic acid segment is significantly shorter than the nucleic acid fragment from which it is amplified, thus increasing the efficiency of emulsion-based PCR with a library of fragments containing paired tags, while still preserving the association between the tags of the paired tags.

The above method can be better understood by reference to FIGS. 34 and 35, in which nucleic acid portions having the same sequence are shown in the same color. The description above is to be read consistently with FIGS. 34 and 35. FIGS. 34A and 35A show the same step, with FIG. 35A providing additional detail. As shown in FIGS. 34A and 35A, paired-end library fragments containing two tags (tag 1 and tag 2) are constructed using an internal adaptor cassette (IA-IB) and unique flanking linker sequences (P1 and P2). Both the internal adaptor cassette and the flanking linker sequences contain nucleotide sequences for PCR amplification and DNA sequencing. The PCR primer regions are designed to permit the use of nested DNA sequencing primers. DNA capture microparticles (beads) are generated by attaching two oligonucleotide sequences identical to the unique flanking linker sequences. For PCR amplification, DNA capture microparticles bearing oligonucleotides having the sequences P1 and P2 are seeded into a reaction containing a single ditag library fragment (i.e., a library fragment containing the 5' and 3' tags of a paired tag) and solution-phase PCR primers.

The solution-phase flanking linker primers (P1 and P2) are added in limiting amounts relative to the internal adaptor primers (IA and IB) to promote efficient bead-driven amplification of the PCR-derived tag products (i.e., [P1 < IB], [P2 < IA]). If desired, appropriate control of the primer amounts can also ensure that the populations of nucleic acids are present in substantially equal amounts, e.g., about half of the nucleic acids on a single particle belong to the first population and about half belong to the second population. Thus, if desired, a form of asymmetric PCR can be used to control the ratio of the different populations.

During amplification, as shown in fig. 34B and 35B (where fig. 35B provides additional detail again with respect to fig. 34B), one paired-end library fragment produced two unique PCR products in the presence of four oligonucleotide primers (P1, P2, IA and IB). One population contains tag 1 flanked by P1 and IA and a second population contains tag 2 flanked by P2 and IB.
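
The relationship between the library fragment and its two PCR products can be sketched as a small data model. This is an illustration only: the segment lengths are hypothetical (the 18-nucleotide tags follow the exemplary tag length given earlier), the spacer within the internal adaptor is omitted, and the function names are not taken from the specification.

```python
# Library fragment as an ordered list of (segment name, length in nucleotides).
# Lengths other than the tag length are made up for illustration.
fragment = [
    ("P1", 40),
    ("tag1", 18),
    ("IA", 20),
    ("IB", 20),
    ("tag2", 18),
    ("P2", 40),
]


def amplicon(segments, pbr_a, pbr_b):
    """Return the stretch of the fragment bounded by two primer binding regions."""
    names = [name for name, _ in segments]
    i, j = sorted((names.index(pbr_a), names.index(pbr_b)))
    return segments[i:j + 1]


def total_length(segments):
    """Total length of a list of segments in nucleotides."""
    return sum(length for _, length in segments)


product_1 = amplicon(fragment, "P1", "IA")  # tag 1 flanked by P1 and IA
product_2 = amplicon(fragment, "P2", "IB")  # tag 2 flanked by P2 and IB

print([name for name, _ in product_1])  # ['P1', 'tag1', 'IA']
print([name for name, _ in product_2])  # ['IB', 'tag2', 'P2']
print(total_length(fragment), total_length(product_1), total_length(product_2))  # 156 78 78
```

With these illustrative lengths each product is half the length of the full fragment, which reflects the point made above that each amplified segment is significantly shorter than the fragment from which it derives.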

After amplification, the microparticles were loaded with two unique PCR populations corresponding to tag 1 and tag 2 generated from the starting library fragments. Thus, each tag contains a unique set of primers to allow for sequential sequencing of each tag, as shown in FIGS. 34C, 35C, and 35D. FIGS. 35C and 35D show sequential sequencing of tags 1 and 2 with different sequencing primers. Various sequencing methods can be employed.

Using the methods described above, microparticles having two or more populations of different nucleic acid sequences attached thereto (e.g., 4, 6, 8, 12, 16, or 20 populations) can be generated, e.g., where the populations together include 2, 3, 4, 6, 8, or 10 paired tags. Each population can be sequenced individually by providing a unique primer binding region in each sequence, as described above for the case of two tags.

The invention includes nucleic acid fragments having the structures shown in FIGS. 34 and 35 and the structures described above, libraries of such fragments, microparticles having attached thereto nucleic acid segments from such fragments, populations of such microparticles (where the sequence of the population of nucleic acids to which an individual microparticle is attached differs from the population of nucleic acids to which other microparticles are attached), arrays of microparticles, amplification primers for amplifying nucleic acid segments (tags) from nucleic acid fragments, sequencing primers for sequencing nucleic acid segments attached to microparticles, methods of making such fragments, libraries, and microparticles, and methods of sequencing nucleic acids attached to microparticles. The invention includes kits comprising any combination of the above components, optionally also containing one or more enzymes, buffers or other reagents for amplification, sequencing, etc.

The template-attached microparticles can be enriched by various methods, if desired. For example, hybridization methods may be employed in which an oligonucleotide (capture agent) complementary to a portion of the amplification product (template) attached to a microparticle is attached to a capture entity such as another (preferably larger) microparticle, a microtiter well, or another surface. This portion of the amplification product may be referred to as the target region. The target region can be incorporated into the template during amplification, e.g., at one end of the portion of the template containing unknown sequence. For example, the target region may be present in the amplification primer that is not attached to the microparticle, such that its complement is present in the amplified template. Thus, a plurality of different templates may comprise the same target region, and one capture agent can hybridize to a plurality of different templates, which allows many microparticles to be captured using only one oligonucleotide sequence as the capture agent. The microparticles that have undergone amplification are contacted with the capture agent under conditions in which hybridization can occur. As a result, microparticles with attached amplified template become attached to the capture entity via the capture agent. Unattached microparticles are then removed, and the retained microparticles are released (e.g., by raising the temperature). In certain embodiments employing particulate capture entities, the aggregates formed after hybridization, consisting of capture entities with microparticles bound to them, are separated from capture entities without bound microparticles and from microparticles not bound to capture entities, e.g., by centrifugation in a viscous solution such as glycerol. Other separation methods based on size, density, etc. may also be employed. Hybridization is only one of many methods that can be used for enrichment. For example, capture agents having affinity for any of a number of different ligands that can be incorporated into the template (e.g., during synthesis) can be used. Multiple rounds of enrichment may be employed.

FIG. 14A shows a cell image of a water-in-oil emulsion in which a PCR reaction is performed with a fluorescently labeled second amplification primer and excess template on a bead to which a first amplification primer is attached. The aqueous reactor emits weak fluorescence from the diffused free primer, while the beads emit strong fluorescence from the primer accumulated on the beads due to solid phase amplification (i.e., incorporation of the fluorescent primer into the amplification template attached to the bead by the first amplification primer). Bead signals were consistent in different size reactors.

Following amplification, the microparticles are collected (e.g., using a magnet in the case of magnetic particles) and used for sequencing by repeated cycles of extension, ligation, and cleavage, as described herein. In certain embodiments of the invention, the microparticles are arrayed in or on a semi-solid support before sequencing, as described below. Examples 12, 13, 14 and 15 provide additional details of representative, non-limiting methods that can be used to (i) prepare microparticles with attached amplification primers for template synthesis on the microparticles (example 12); (ii) prepare an emulsion containing multiple reactors for PCR (example 13); (iii) perform PCR amplification in the emulsion chambers (example 13); (iv) break the emulsion and recover the microparticles (example 13); (v) enrich for microparticles with attached clonal template populations (example 14); (vi) prepare a slide for use as a substrate for a semi-solid polyacrylamide support (example 15); and (vii) mix the microparticles with unpolymerized acrylamide to form an array of template-attached microparticles embedded in the acrylamide on the substrate (example 15). Example 15 also describes a polymerase capture protocol that can be used in certain methods when PCR is performed in a semi-solid support. Those of ordinary skill in the art will recognize that many variations of these methods are possible.

In other embodiments of the invention, templates are amplified by PCR in a semi-solid support, such as a gel, having suitable amplification primers immobilized therein. The template, the other amplification primer, and the reagents required for the PCR reaction are present in the semi-solid support. One or both primers of the amplification primer pair are attached to the semi-solid support via a suitable linking moiety such as an acrydite group. The attachment may take place during polymerization. Other reagents (e.g., template, second amplification primer, polymerase, nucleotides, cofactors, etc.) may be present prior to formation of the semi-solid support (e.g., in the liquid prior to gel formation), or one or more reagents may diffuse into the semi-solid support after it has formed. The pore size of the semi-solid support is chosen so that this diffusion can occur. As is well known in the art, in the case of polyacrylamide gels the pore size is determined primarily by the concentration of the acrylamide monomer and is also affected to some extent by the crosslinking agent. Similar considerations apply to other semi-solid support materials. Suitable crosslinkers and concentrations can be selected to achieve the desired pore size. In certain embodiments of the invention, additives such as cationic lipids, polyamines, polycations, etc., which form micelles or aggregates around the microparticles in the gel, are included in the solution prior to polymerization. The methods described in U.S. Pat. Nos. 5,705,628, 5,898,071, and 6,534,262 may also be used. For example, various "crowding reagents" can be used to crowd DNA near the beads for clonal PCR. SPRI® magnetic bead technology and/or conditions can also be employed. See, e.g., U.S. Pat. No. 5,665,572, which shows efficient PCR amplification in the presence of 10% polyethylene glycol (PEG). In certain embodiments of the methods of the invention, amplification (e.g., PCR), ligation, or both are performed in the presence of certain reagents, such as betaine, polyethylene glycol, PVP-40, and the like. These reagents may be added to the solution, present in the emulsion, and/or diffused into the semi-solid support.

The semi-solid support can be positioned or assembled on a substantially flat rigid substrate. In certain preferred embodiments, the substrate is transparent to radiation at the excitation and emission wavelengths (e.g., about 400-900 nm) used to excite and detect typical labels (e.g., fluorescent labels, quantum dots, plasmon resonance particles, nanoclusters). Materials such as glass, plastic, quartz, etc. are suitable. The semi-solid support can be adhered to the substrate and optionally immobilized on the substrate by a variety of methods. The substrate may or may not be coated with a substance that enhances adhesion or bonding, such as a silane, polylysine, and the like. U.S. Pat. No. 6,511,803 describes methods of synthesizing populations of clonal templates in a semi-solid support using PCR, methods of preparing semi-solid supports on a substantially flat substrate, and the like. Similar methods can be employed in the present invention. The substrate may have a well or depression to hold the liquid before the semi-solid support forms. Alternatively, a raised border or mask may be used for this purpose.

The above method provides an alternative to the use of reactors in an emulsion for generating spatially localized populations of clonal templates. Clonal populations are present at discrete locations in the semi-solid support, such that a signal can be obtained from each population during sequencing, e.g., by imaging, to detect newly ligated extension probes. In some embodiments of the invention, two or more different clonal populations, present as a mixture at a discrete location in the semi-solid support, are amplified from one nucleic acid fragment. Each clonal population in the mixture can contain a tag, such that a discrete location contains fragments containing a 5' tag and fragments containing a 3' tag. The clonal templates containing the 5' tag and those containing the 3' tag contain different sequencing primer binding regions, so that they can be sequenced independently of each other. This approach is analogous to the method described above for generating a plurality of populations of substantially identical nucleic acids on a microparticle and obtaining sequence information for both tags of a paired tag from one microparticle.

Typically, the semi-solid support used in any of the methods of the invention forms a layer having a thickness of about 100 microns or less, such as about 50 microns or less, such as about 20-40 microns. Preferably, a cover glass or other similar object having a substantially flat surface can be placed on the semi-solid support material prior to polymerization to help create a uniform gel layer, such as a substantially flat and/or substantially uniform thickness gel layer.

In other embodiments of the invention, a modified form of the above method may be used, in which the template is synthesized by PCR on microparticles to which suitable amplification primers are attached, and in which the microparticles are immobilized in or on the semi-solid support prior to template synthesis, i.e., they are fully or partially embedded in the semi-solid support. Typically, the semi-solid support completely surrounds the microparticles, but the microparticles may also rest on an underlying substrate. Thus, the microparticles remain in substantially fixed positions relative to each other unless the semi-solid support is disrupted. This method provides another alternative to the use of an emulsion for generating spatially confined populations of clonal templates. The microparticles can be mixed with the liquid prior to formation of the semi-solid support. Alternatively, the microparticles may be arrayed on a substantially planar substrate and the liquid added to the array of microparticles prior to polymerization, crosslinking, etc. Each microparticle has a first amplification primer attached thereto. The second amplification primer can be, but need not be, attached to the semi-solid support. Other reagents (e.g., template, second amplification primer, polymerase, nucleotides, cofactors, etc.) may be present prior to formation of the semi-solid support (e.g., in the liquid prior to gel formation), or one or more reagents may diffuse into the semi-solid support after gel formation. Typically, the semi-solid support is formed on a slide as described above.

In certain embodiments of the invention, the gel may be dissolved (e.g., digested, depolymerized, or melted) after template synthesis to facilitate recovery of the microparticles to which the clonal template populations are attached (e.g., using a magnet in the case of magnetic particles). Gels that can be dissolved, digested, depolymerized, solubilized, etc., are referred to herein as "reversible" gels. Conventional polyacrylamide polymerization involves the use of N,N'-methylenebisacrylamide (BIS) as a crosslinker and a suitable catalyst to initiate polymerization (e.g., N,N,N',N'-tetramethylethylenediamine (TEMED)). To produce a reversible gel, another crosslinker such as N,N'-diallyltartardiamide (DATD) can be used. DATD is similar in structure to BIS but possesses cis-diols that can be cleaved by periodic acid (Anker, H.S., F.E.B.S. Lett., 7: 293, 1970). DATD gels are therefore readily soluble in this reagent, e.g., in a solution containing sodium periodate. Gels prepared using DATD as a crosslinker are highly transparent and bind strongly to glass. Another crosslinker that has DATD-like properties and forms a reversible gel is ethylene diacrylate (Choules, G.L. and Zimm, B.H., Anal. Biochem., 13: 336-339). N,N'-bisacrylylcystamine (BAC) is another crosslinker that can be used to form reversible polyacrylamide gels. Another crosslinker that may be used to form a gel that dissolves in periodate is N,N'-(1,2-dihydroxyethylene)bisacrylamide (DHEBA). Various other materials that can form a reversible semi-solid support can also be used. For example, thermoreversible polymers such as Pluronic (from BASF) may be employed. Pluronics are a family of triblock copolymers of poly(ethylene oxide)-poly(propylene oxide)-poly(ethylene oxide) (PEO-PPO-PEO) (Nace, V.M. et al., Nonionic Surfactants, Marcel Dekker, NY, 1996). These materials become semi-solid (gel) when the temperature is raised (e.g., above room temperature) and liquefy when cooled. Pluronics can be chemically derivatized by various methods, for example, to facilitate attachment of primers (see, e.g., Neff, J.A. et al., J. Biomed. Mater. Res., 40: 511, 1998; Prud'homme, R.K. et al., Langmuir, 12: 4651, 1996).

After solubilization, the microparticles can be collected and sequenced using repeated cycles of extension, ligation, and cleavage. Prior to sequencing, the microparticles may be arranged in or on the second semi-solid support (e.g., at a density higher than that at which they are present in or on the first semi-solid support). The semi-solid support itself is supported by a substantially flat rigid substrate such as a glass slide.

Thus, two general approaches can be used to produce a semi-solid support in or on which an array of microparticles carrying populations of clonal templates is embedded. The first approach involves amplification (e.g., using emulsion PCR) on microparticles that are not present in a semi-solid support, followed by immobilization of the microparticles in or on the semi-solid support. The second approach involves immobilizing the microparticles in or on a semi-solid support, followed by amplification. In both cases, steps may need to be taken to reduce particle aggregation and/or to align the particles substantially in one focal plane. For example, when immobilizing particles in a polyacrylamide gel, the concentrations of monomer and crosslinker are chosen so that the particles settle to the bottom of the solution before polymerization is complete, so that they rest on the underlying flat substrate and thus lie in one plane. In certain embodiments of the invention, an object having a substantially flat surface, such as a coverslip, is placed on liquid acrylamide (or another material capable of forming a semi-solid support) containing the microparticles, so that the acrylamide is held between two surfaces in a "sandwich" structure. The sandwich is then inverted so that the particles settle by gravity and rest on the coverslip (or other object having a substantially flat surface). After polymerization, the coverslip is removed. Thus, the microparticles are embedded substantially in the same plane, near the surface of the semi-solid support (e.g., tangent to the surface).

In certain embodiments of the invention, rather than being immobilized in a semi-solid matrix as described above, supports such as microparticles are covalently or non-covalently attached to a substantially flat, rigid substrate. Various methods of attaching particles to substrates such as glass, plastic, quartz, silicon, and the like are known in the art. The substrate may be coated (e.g., spin-coated) or functionalized with any of a variety of materials (e.g., various polymers) or substances that facilitate attachment. The coating may be a thin film, a self-assembled monolayer, or the like. Attachment may be via the microparticle itself, via a moiety attached to the microparticle, or via an oligonucleotide (e.g., a template) attached to the microparticle.

In general, any pair of molecules that have an affinity for each other and so form a binding pair can be used to attach the microparticle or the template to the substrate. The first member of the binding pair is covalently or non-covalently attached to the substrate and the second member of the binding pair is covalently or non-covalently attached to the microparticle or template. The first binding member may be attached to the substrate by a linker. The second binding member may be attached to the microparticle or the template via a linker. For example, according to one method, a slide or other suitable substrate is modified with an amine-reactive group (e.g., using a PEG linker containing an amine-reactive group). Under aqueous conditions (e.g., pH 8.0), the amine-reactive group reacts with amines such as those of lysine residues in a protein (e.g., streptavidin). Thus, microparticles functionalized with amine-bearing moieties will be immobilized on the substrate. The amine-bearing moiety may be a protein or a suitably functionalized nucleic acid, such as a DNA template. Multiple moieties may be attached to a bead. For example, a bead may carry both a protein that reacts with the NHS ester to attach the bead to the substrate and a DNA template, which may be sequenced after the bead is attached to the substrate. Suitably coated glass slides bearing a polymer tether containing an amine-reactive NHS moiety at one end are available, for example, from Schott Nexterion, Schott North America, Inc. Alternatively, coated slides (e.g., biotin-coated slides) are available from Accelr8 Technology Corporation, Denver, CO. Their OptiChem™ technology represents one method of attaching microparticles to a substrate. See, for example, U.S. Patent 6,844,028. Alternatively, the microparticles can be attached to the substrate by functionalizing the polynucleotides on the beads with biotin using, for example, terminal transferase and biotin-dideoxyATP and/or biotin-deoxyATP, and then contacting the beads with a streptavidin-coated slide (available from, for example, Accelr8 Technology Corporation, Denver, CO) under conditions that promote biotin-streptavidin binding.

In general, nucleic acids, such as oligonucleotide primers, probes, templates, etc., can be modified by various methods known in the art to facilitate attachment of such nucleic acids to a microparticle or other support or substrate. In addition, microparticles or other supports can be modified using various methods known in the art to facilitate attachment of nucleic acids, to facilitate attachment of microparticles to a support or substrate, and the like. Microspheres with surface chemistries that facilitate attachment of the desired functional groups can be obtained. Some examples of such surface chemistries include, but are not limited to: amino groups (including aliphatic and aromatic amines), carboxylic acids, aldehydes, amides, chloromethyl groups, hydrazides, hydroxyl groups, sulfonates, and sulfates. These groups can react with groups present on the nucleic acid, or the nucleic acid can be modified by attaching reactive groups. In addition, a number of stable bifunctional groups are well known in the art, including homobifunctional and heterobifunctional linkers. See, e.g., the Pierce Chemical Technical Library, available at www.piercenet.com (originally published in the 1994-95 Pierce catalog), and G.T. Hermanson, Bioconjugate Techniques, Academic Press, Inc., 1996. See also U.S. Patent 6,632,655.

The microparticle arrays formed according to the methods described herein are typically random arrays. The term "randomly patterned" or "random" as used herein refers to a disordered, non-Cartesian distribution of entities (features) on a support (in other words, one not arranged at predetermined points along the x- and y-axes of a grid, or at defined "clock positions", angles, or radii relative to the center of a radial pattern) that is not achieved through deliberate design (or a program by which such a design could be achieved) or through placement of individual entities. Such a "randomly patterned" or "random" array of entities can be achieved by dropping, spraying, plating, spreading, or otherwise distributing a solution, emulsion, aerosol, vapor or dry preparation containing a library of entities onto or into a support and allowing the entities to settle onto or into the support without any intervention that directs them to specific sites in or on the support. For example, the entities can be suspended in a solution containing semi-solid support precursors (e.g., acrylamide monomers). The solution is then distributed on a second support so that a semi-solid support forms on the second support. The entities are thus embedded in or on the semi-solid support. Of course, non-random arrays may also be employed. In general, the methods of forming arrays used herein differ from methods in which polynucleotides are synthesized by successively applying individual nucleotide subunits to predetermined positions on a substrate.

Fig. 14B (top) shows a fluorescence image of a glass slide (1 inch x 3 inch) bearing a polyacrylamide gel. Beads (1 micron in diameter), with fluorescently labeled oligonucleotides hybridized to the templates attached to the beads, were immobilized in the gel. The bead surface density shown in the figure (i.e., the number of beads per unit area of substrate in the region where beads are present) is sufficient to image about 280 million beads per slide. The surface density and imageable area of a single slide can be sufficient to image at least 500 million beads. For example, Fig. 14B (bottom) shows a schematic of a slide with a Teflon® mask surrounding a clear region in which beads are embedded in a semi-solid support layer such as a polyacrylamide gel. The mask area is 864 mm². With 500 million beads, the surface density is about 578,000 beads/mm². A close-packed hexagonal array of 1 micron beads contains about 1,155,000 beads/mm². Thus, this embodiment produces an array having 52% of the theoretical maximum density. It is understood that smaller or greater numbers of beads, and lower or higher bead surface densities, than in this embodiment may be used.
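The density figures above can be reproduced with a short calculation. The following is a minimal sketch, assuming ideal hexagonal close packing of equal circles in a single plane; the function names are illustrative and not part of the described method. Computed this way, the ratio of the observed density to the close-packed density comes out near 0.50, close to but not identical with the 52% figure quoted in the text, which may reflect rounding or a slightly different packing assumption.

```python
import math

def hex_packed_density_per_mm2(bead_diameter_um: float) -> float:
    """Beads per mm^2 for equal circles in an ideal hexagonal close-packed monolayer."""
    # Each bead occupies a hexagonal cell of area (sqrt(3) / 2) * d^2.
    beads_per_um2 = 2.0 / (math.sqrt(3.0) * bead_diameter_um ** 2)
    return beads_per_um2 * 1e6  # 1 mm^2 = 1e6 um^2

def surface_density_per_mm2(n_beads: float, area_mm2: float) -> float:
    """Observed beads per mm^2 for a given bead count and imaged area."""
    return n_beads / area_mm2

# Values taken from the text: 500 million beads, 864 mm^2, 1 micron beads.
observed = surface_density_per_mm2(500e6, 864.0)
maximum = hex_packed_density_per_mm2(1.0)
fraction = observed / maximum

print(f"observed density : {observed:,.0f} beads/mm^2")
print(f"close-packed max : {maximum:,.0f} beads/mm^2")
print(f"fraction of max  : {fraction:.2f}")
```

The same two functions can be used to check whether a planned bead count and imaging area fall below a chosen fraction of close packing.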

The microparticles can be arranged in or on a substantially flat semi-solid support or another support or substrate at various densities, which can be defined in a variety of ways. For example, density may be expressed as the number of particles (e.g., spherical particles) per unit area of a substantially flat array. In certain embodiments of the invention, the number of particles per unit area of the substantially planar array is at least 80% of the number of particles in a hexagonal array ("hexagonal array" refers to a substantially planar array of particles in which each particle in the array contacts at least six other adjacent particles of equal area, as described in U.S. Patent 6,406,848). However, in other embodiments of the invention, the density of the particles is lower, e.g., the number of particles per unit area of the substantially flat array is less than 80%, 70%, 60%, or 50% of the number of particles in a hexagonal array. Without wishing to be bound by theory, lower densities such as those described above may be preferable in order to allow sufficient diffusion of reagents such as enzymes, primers, cofactors, etc., and to avoid the reagent partitioning effects that occur when certain reagents have different affinities for the particles or become trapped among them. Such effects may lead to different reaction conditions at different locations of the array, and may even prevent these reagents from reaching certain locations of the array. These problems can be more difficult to deal with when reactions are conducted in a flow cell, because the reagents pass through the flow cell in a directed manner. In certain embodiments of the invention, the chamber of the flow cell comprises a mixing device, such as a device that effects fluid mixing by mechanical or acoustic means. Many suitable mixing devices are known in the art.

The sequencing method of the invention can be carried out with templates arranged in all types of arrays, including random and non-random arrays, which can be arrays of microparticles or arrays of the templates themselves. For example, U.S. Pat. No. 5,641,658 and PCT publication No. WO0018957 describe supports on which templates are arranged. The array may be on a variety of substrates such as filter paper, membranes (e.g., nylon), metal surfaces, and the like. Other examples of array formats on which sequencing by repeated cycles of extension, ligation and cleavage can be performed are bead arrays located in wells at the terminal or distal ends of individual fibers in a fiber optic bundle. See, for example, U.S. patents and publications 6,023,540; 6,429,027; 20040185483; and 2002187515; PCT applications US98/05025 and US98/09163; and PCT publication WO0039587, which describe bead arrays and "arrays of arrays". The template-attached beads can be arranged as described therein. Preferably, amplification is performed prior to formation of the array. The arrays formed on these substrates are not necessarily substantially planar.

In other embodiments, PCR is performed on an array containing oligonucleotides attached to a substrate or support (see, e.g., U.S. Pat. Nos. 5,744,305; 5,800,992; 6,646,243 and related patents (Affymetrix); PCT publications WO 2004029586; WO 03065038; WO03040410 (Nimblegen)). Typically, such oligonucleotides contain a free 3' or 5' end. If necessary, the terminus may be modified, for example by adding a phosphate group or an OH group to the 3' end if it lacks one. Template molecules containing regions complementary to the oligonucleotides attached to the support or substrate are hybridized to the oligonucleotides, and in situ PCR is performed on the array to generate a population of clonal templates at each position of the array. The oligonucleotides attached to the array can serve as one of the amplification primers. The templates are then sequenced using the ligation-based methods described herein. Sequencing can also be performed on templates in arrays as described in U.S. publication No. 20030068629.

Other methods of preparing DNA arrays on a surface may be used. For example, alkanethiols modified with a terminal aldehyde group can be used to prepare self-assembled monolayers (SAMs) on gold surfaces. The aldehyde groups of this monolayer can react with amine-modified oligonucleotides or other amine-bearing biomolecules to form Schiff bases, which can then be reduced to stable secondary amines by treatment with sodium cyanoborohydride (Peelen and Smith, Langmuir, 21 (1): 266-71, 2005). PCR amplification of the template may then be performed. Alternatively, microparticles with attached populations of clonal templates can be attached to a surface by reacting amine groups on the microparticles, or on the templates or oligonucleotides attached to the particles, with the surface.

Another method of obtaining microparticles with attached populations of clonal templates is the "solid phase cloning" method described in U.S. Pat. No. 5,604,097, which uses oligonucleotide tags to sort polynucleotides onto microparticles so that only polynucleotides of the same sequence are attached to any particular microparticle.

In certain embodiments of the invention, sequencing by repeated cycles of extension, ligation and cleavage is performed by diffusing sequencing reagents (e.g., extension probes, ligases, phosphatases, etc.) into a semi-solid support, such as a gel, that contains clonal template populations immobilized in or on the support, each clonal population being located in a spatially distinct region of the support. In certain embodiments, the template is directly attached to the semi-solid support. However, in a preferred embodiment, the template is immobilized on a second support, such as a microparticle, which is in turn immobilized in or on the semi-solid support, as described above.

As described in Example 1, the present inventors have demonstrated that robust ligation and cleavage can be performed on templates attached to beads immobilized in polyacrylamide gels. Accordingly, the present invention provides a method of ligating a first polynucleotide to a second polynucleotide, the method comprising the steps of: (a) providing a first polynucleotide immobilized in or on a semi-solid support; (b) contacting the first polynucleotide with a second polynucleotide and a ligase; and (c) maintaining the first and second polynucleotides in the presence of the ligase under conditions suitable for ligation. Suitable conditions include providing buffers, cofactors, temperatures, times, etc. appropriate to the particular ligase being used. In a preferred embodiment, the semi-solid support is a gel such as an acrylamide gel. In another preferred embodiment, the first polynucleotide is immobilized in or on the semi-solid support by attachment to a support such as a bead, the bead itself being immobilized in or on the semi-solid support, for example by partial or complete embedding in the support matrix. Alternatively, the first polynucleotide may be directly attached to the semi-solid support, for example via an Acrydite moiety. The linkage may be covalent or non-covalent (e.g., via biotin-avidin interaction). U.S. Pat. No. 6,511,803 describes various methods that can be used to attach nucleic acid molecules to a preferred support of the present invention, namely a polyacrylamide gel.

The present invention also provides a method of cleaving a polynucleotide, the method comprising the steps of: (a) providing a polynucleotide immobilized in or on a semi-solid support, wherein the polynucleotide comprises a scissile linkage; (b) contacting the polynucleotide with a cleaving agent; and (c) maintaining the polynucleotide under conditions suitable for cleavage in the presence of the cleavage agent. Suitable conditions include providing buffers, temperatures, times, etc. appropriate for the particular cleavage agent. In a preferred embodiment, the semi-solid support is a gel such as an acrylamide gel. In another preferred embodiment, the polynucleotide is immobilized in a semi-solid support by attachment to a support, such as a bead, followed by immobilization of the bead itself in the semi-solid support. Alternatively, the polynucleotide may be attached directly to the semi-solid support by attachment of, for example, an acrydite moiety. The linkage may be covalent or non-covalent (e.g., via biotin-avidin interaction).

Macevicz discloses sequencing a template with a specific sequence. Macevicz does not discuss the possibility of performing this method in parallel to sequence multiple templates with different sequences simultaneously. The present inventors have recognized that in order to perform efficient sequencing in a high-throughput manner, it is desirable to prepare multiple supports (e.g., beads), as described above, such that each support has attached to it templates of a particular sequence, and to perform the methods described herein simultaneously on the templates attached to each support. In certain embodiments of the methods, the plurality of supports are arranged in or on a flat substrate, such as a glass slide. In certain embodiments, the supports are arranged in or on a gel. The supports may be arranged in a random manner, i.e., without predetermining the position of each support on the substrate. The supports need not be regularly spaced or arranged in an ordered array of rows and columns, etc. Preferably, the supports are arranged at a density such that a distinct signal can be detected from each of many or most of the supports. In certain preferred embodiments, the supports are distributed predominantly in one focal plane. Multiple supports with attached templates of identical sequence can be included, for example, for quality control. Sequencing reactions are performed in parallel on the templates attached to each support.

The signals may be collected in a variety of ways, including various imaging modalities. Preferably, in embodiments where sequencing is performed on microparticles arrayed on a substrate (e.g., beads embedded in a semi-solid support on a substrate) prior to detection, the resolution of the imaging device is 1 μm or less. For example, a scanning microscope equipped with a CCD camera or microarray scanner of sufficient resolution may be employed. Alternatively, the beads are passed through a flow cell or fluid station attached to a microscope equipped for fluorescence detection. Other methods of collecting signals include fiber optic bundles. Suitable image capture and processing software may be employed.

In certain embodiments of the invention, sequencing is performed in a microfluidic device. For example, beads with attached templates can be loaded into the device and reagents flowed through it. Template synthesis by PCR can also be performed in the device. An example of a suitable microfluidic device is described in U.S. Patent 6,632,655.

D. Sequencing by restarting of different initiator oligonucleotides

In a preferred embodiment of the invention, after a sufficient number of cycles, the extended strand produced by extension of the first initiator oligonucleotide is removed from the template, a second initiator oligonucleotide is annealed to the binding region, and cycles of extension, ligation and detection are then performed again. This process is repeated with any number of different initiator oligonucleotides. In embodiments where the extension probe is cleaved, the number of different initiator oligonucleotides (and thus the number of reactions) used is preferably equal to the length of the portion of the extension probe that remains hybridized to the template after release of the distal portion of the probe. Thus, according to this embodiment, sequence information (e.g., the order and identity of nucleotides) can be obtained from templates attached to a support, and a given read length can be reached using far fewer cycles per initiator oligonucleotide than would be required if consecutive nucleotides were identified one per cycle.
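The relationship between the number of initiator oligonucleotides and the positions read can be illustrated with a small sketch. It assumes, for illustration only, that each cycle advances the extended duplex by a fixed number of nucleotides `step` (the length of the probe portion that remains hybridized after cleavage), that one nucleotide is identified per cycle, and that successive initiator oligonucleotides are offset from one another by one nucleotide. The function names and the zero-based offset convention are hypothetical.

```python
def interrogated_positions(step: int, offset: int, n_cycles: int) -> list[int]:
    """Template positions read in one round started from a given initiator offset.

    Assumes one nucleotide is identified per cycle and the duplex advances
    by `step` nucleotides per cycle (illustrative model, zero-based positions).
    """
    return [offset + cycle * step for cycle in range(n_cycles)]

def combined_coverage(step: int, n_cycles: int) -> list[int]:
    """Positions read when `step` initiators, offset by 0..step-1, are used in turn."""
    positions = set()
    for offset in range(step):
        positions.update(interrogated_positions(step, offset, n_cycles))
    return sorted(positions)

# One round with a step of 5 reads every fifth position.
print(interrogated_positions(step=5, offset=2, n_cycles=3))  # [2, 7, 12]

# Five rounds of four cycles each, one per initiator, read 20 contiguous positions.
print(combined_coverage(step=5, n_cycles=4) == list(range(20)))  # True
```

Under these assumptions, using as many initiators as the step length fills every gap left by a single round, which is why that number is the natural choice.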

Embodiments in which the initiator oligonucleotides are bound sequentially to the same template have certain advantages over methods that require the template to be divided into multiple aliquots, such as the method described by Macevicz. For example, applying the initiator oligonucleotides to the same template eliminates the need to track and subsequently merge data obtained from multiple aliquots. In embodiments where the supports are arranged in a random manner such that the location of individual supports cannot be predetermined, it may be difficult or impossible to reliably combine partial sequence information from multiple supports, each having attached templates of identical sequence.

E. Identification of multiple nucleotides on a template in each cycle

Macevicz describes identifying one nucleotide on a template per cycle of extension, ligation, and detection. However, the present inventors have recognized that the method can be modified to identify multiple nucleotides on the template in each cycle. In this case, the extension probe is labeled so that the identity of two or more (preferably consecutive) nucleotides adjacent to the extended duplex can be determined from the label. In other words, the sequence-determining portion of the extension probe comprises more than one nucleotide, typically the proximal, immediately adjacent nucleotide and one or more additional (preferably consecutive) nucleotides, all of which specifically hybridize to the template. For example, rather than using 4 labels to identify the bases A, G, C and T, 16 differentially labeled probes or probe combinations can be used to identify the 16 possible dinucleotides AA, AG, AC, AT, GA, GG, GC, GT, CA, CG, CC, CT, TA, TG, TC and TT. The sequence-determining portion of each differentially labeled extension probe is complementary to one of these dinucleotides. Similar methods using more labels can identify longer nucleotide sequences in each cycle.
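The dinucleotide enumeration above, and its extension to longer sequences, can be sketched as follows. The label names are placeholders, and the choice to express the probe's sequence-determining portion as the reverse complement of the template dinucleotide is an assumed orientation convention for illustration.

```python
from itertools import product

BASES = "AGCT"
COMPLEMENT = {"A": "T", "G": "C", "C": "G", "T": "A"}

def kmers(k: int) -> list[str]:
    """All 4**k possible template sequences of length k, in A, G, C, T order."""
    return ["".join(p) for p in product(BASES, repeat=k)]

def reverse_complement(seq: str) -> str:
    """Sequence of the strand that base-pairs with `seq`, written 5' to 3'."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

# One placeholder label per template dinucleotide (16 in total).
dinucleotides = kmers(2)
label_for = {d: f"label_{i:02d}" for i, d in enumerate(dinucleotides)}

# Sequence-determining portion of the probe that would pair with each dinucleotide.
probe_portion_for = {d: reverse_complement(d) for d in dinucleotides}

print(len(dinucleotides))       # 16
print(dinucleotides[:4])        # ['AA', 'AG', 'AC', 'AT']
print(probe_portion_for["AG"])  # CT
print(len(kmers(3)))            # 64
```

The count grows as 4 to the power of the number of nucleotides identified per cycle, so reading three nucleotides per cycle would call for 64 distinguishable labels or label combinations.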

F. Labels

The term "label" as used herein refers broadly to any detectable moiety or moieties attached to a probe that can be used to distinguish between different types of probes (e.g., probes containing different terminal nucleotides). Thus, there is not necessarily a one-to-one correspondence between labels and specific detectable moieties. For example, multiple detectable moieties can be attached to one probe, resulting in a combined signal that can distinguish the probe from probes to which different detectable moieties or groups of detectable moieties are attached. For example, a method according to U.S. Pat. No. 6,632,609 and Speicher et al, Nature Genetics, 12: 368-.

The probes of the invention can be labeled in a variety of ways, including direct or indirect attachment of a fluorescent or chemiluminescent moiety, a colorimetric moiety, an enzymatic moiety that produces a detectable signal upon contact with a substrate, and the like. Macevicz teaches that the probes can be labeled with fluorescent dyes, such as those disclosed in Menchen et al., U.S. Pat. No. 5,188,934, and Begot et al., PCT application PCT/US90/05565. The terms "fluorescent dye" and "fluorophore" as used herein refer to a moiety that absorbs light energy at a particular excitation wavelength and emits light energy at a different wavelength. Preferably, the labels selected for a given probe mixture are spectrally resolvable. As used herein, "spectrally resolvable" means that the labels can be distinguished under operating conditions by their spectral characteristics, particularly their fluorescence emission wavelengths. For example, the identity of one or more terminal nucleotides may be correlated with the maximum intensity of light emission at a distinct wavelength, or with the ratio of intensities at different wavelengths. The spectral characteristics used to detect and identify a label are referred to herein as its "color". It will be appreciated that labels are often identified by specific spectral characteristics, for example by the frequency of maximum emission intensity when the label consists of one detectable moiety, or by the frequencies of the emission peaks when the label consists of a plurality of detectable moieties.
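Identification of a label by its strongest emission, with an intensity ratio used to reject ambiguous signals, can be sketched as follows. This is a minimal illustration only: the channel names, the `min_ratio` threshold, and the function name are hypothetical and are not taken from the described method.

```python
from typing import Optional

def call_label(intensities: dict[str, float], min_ratio: float = 2.0) -> Optional[str]:
    """Return the channel with the highest intensity, or None if the call is ambiguous.

    A call is accepted only if the strongest channel exceeds the second strongest
    by at least `min_ratio` (an illustrative threshold, not a recommended value).
    """
    ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
    (best_channel, best_value), (_, second_value) = ranked[0], ranked[1]
    if best_value <= 0:
        return None  # no signal in any channel
    if second_value > 0 and best_value / second_value < min_ratio:
        return None  # two channels too close to distinguish
    return best_channel

# Hypothetical four-channel measurements for two beads.
clear_bead = {"blue": 10.0, "green": 200.0, "yellow": 15.0, "red": 5.0}
mixed_bead = {"blue": 90.0, "green": 100.0, "yellow": 5.0, "red": 5.0}

print(call_label(clear_bead))  # green
print(call_label(mixed_bead))  # None
```

A label composed of several detectable moieties would instead be recognized by a characteristic pattern of ratios across channels rather than by a single dominant channel.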

Preferably, four probes are provided, with four spectrally resolvable fluorescent dyes corresponding one-to-one to the four possible terminal nucleotides of the probes. U.S. Pat. Nos. 4,855,225 and 5,188,934; international application PCT/US90/05565; and Lee et al., Nucleic Acids Research, 20: 2471-2483 (1992) disclose spectrally resolvable dye sets. In certain embodiments, a set consisting of FITC, HEX™, Texas Red and Cy5 is preferred. Many suitable dyes are commercially available, for example, from Molecular Probes, Inc. Specific examples of fluorescent dyes include, but are not limited to: Alexa Fluor dyes (Alexa Fluor 350, Alexa Fluor 488, Alexa Fluor 532, Alexa Fluor 546, Alexa Fluor 568, Alexa Fluor 594, Alexa Fluor 633, Alexa Fluor 660 and Alexa Fluor 680), AMCA-S, BODIPY dyes (BODIPY FL, BODIPY R6G, BODIPY TMR, BODIPY TR, BODIPY 530/550, BODIPY 558/568, BODIPY 564/570, BODIPY 576/589, BODIPY 581/591, BODIPY 630/650, BODIPY 650/665), CAL dyes, carboxyrhodamine 6G, carboxy-X-rhodamine (ROX), Cascade Blue, Cascade Yellow, cyanine dyes (Cy3, Cy5, Cy3.5, Cy5.5), dansyl, Dapoxyl, dialkylaminocoumarin 4 ', 5' -dimethoxycoumarin, 2 '-dimethoxycoumarin, fluorescein-7' -fluorescein, fluorescent red-coumarin, fluorescent red fluorescent, IRD dyes (IRD40, IRD 700, IRD 800), JOE, Lissamine rhodamine B, Marina Blue, methoxycoumarin, naphthofluorescein, Oregon Green 488, Oregon Green 500, Oregon Green 514, Oyster dyes, Pacific Blue, PyMPO, pyrene, rhodamine 6G, rhodamine green, rhodamine red, Rhodol Green, 2',4',5',7'-tetrabromosulfonefluorescein, tetramethylrhodamine (TMR), carboxytetramethylrhodamine (TAMRA), Texas Red and Texas Red-X. For further description see the Handbook of Fluorescent Probes and Research Products, 9th edition, Molecular Probes, Inc.

In non-radiative fluorescence resonance energy transfer (FRET), a fluorescent moiety transfers energy to a second moiety, and the detected signal is produced by the second moiety rather than by direct detection of the first. It is also within the scope of the present invention to employ a quencher. The term "quencher" refers to a moiety that, when in proximity to an excited fluorescent label, absorbs the energy of the label and dissipates that energy without emitting visible light. Examples of quenchers include, but are not limited to: DABCYL (4-(4'-dimethylaminophenylazo)benzoic acid) succinimidyl ester, diarylrhodamine carboxylic acid succinimidyl ester (QSY-7) and 4',5'-dinitrofluorescein carboxylic acid succinimidyl ester (QSY-33) (all from Molecular Probes), Quencher 1 (Q1; from Epoch) and the "Black Hole Quenchers" BHQ-1, BHQ-2 and BHQ-3 (from BioSearch, Inc.).

In addition to the various detectable moieties described above, the present invention also contemplates the use of spectrally resolvable quantum dots, metal nanoparticles or nanoclusters, and the like, which can be attached directly to the oligonucleotide probe or embedded in or attached to a polymer matrix that is subsequently attached to the probe. As mentioned above, the detectable moiety itself is not necessarily directly detectable. For example, it may act on a substrate that is then detected, or it may need to be modified to become detectable.

As noted above, in certain embodiments of the invention, the label is composed of a plurality of detectable moieties. The combined signal of these detectable moieties produces a color that is used to identify the probe. For example, a "purple" probe of a particular sequence can be constructed by attaching both "blue" and "red" detectable moieties to it. Alternatively, a mixed probe can be generated by mixing two probes of the same sequence that are labeled with different detectable moieties, thereby generating a unique color. Thus, a "purple" probe of a particular sequence can be generated by constructing two probes of that sequence: a "red" detectable moiety is attached to the first probe and a "blue" detectable moiety is attached to the second probe, and aliquots of the two probes are mixed. Different shades of purple can be produced by mixing the aliquots in different ratios. This approach provides a number of advantages. First, it enables the generation of a variety of distinguishable probes with fewer detectable moieties. Second, the use of mixed probes may provide a degree of degeneracy that may help to reduce bias that could result from the interaction of a particular detectable moiety with a particular nucleotide.

In certain embodiments of the invention, the detectable moiety is attached to a nucleotide in the oligonucleotide extension probe by a cleavable linkage, so that the detectable moiety can be removed after ligation and detection. A variety of different cleavable linkages may be employed. The term "cleavable linkage" as used herein refers to a chemical moiety that links a detectable moiety to a nucleotide and that can be cleaved to remove the detectable moiety from the nucleotide when desired, without substantially altering the nucleotide or the nucleic acid molecule of which it is a part. Depending on the nature of the linkage, cleavage may be achieved by, for example, acid or base treatment, oxidation or reduction of the linkage, or light treatment (photocleavage). Examples of cleavable linkages and cleaving agents are found in Shimkus et al., 1985, Proc. Natl. Acad. Sci. USA 82: 2593-2597; Soukup et al., 1995, Bioconjug. Chem. 6: 135-138; Shimkus et al., 1986, DNA 5: 247-255; and Herman and Fenn, 1990, Meth. Enzymol. 184: 584-588.

For example, as described in U.S. patent 6,511,803, disulfide linkages can be reduced, and thereby cleaved, with a thiol-containing reducing agent such as dithiothreitol (DTT). Fluorophores containing a thiol (SH) group (e.g., SH-containing cyanine 5 or cyanine 3 fluorophores; New England Nuclear-DuPont) are available for conjugation, as are nucleotides containing an active arylamino group (e.g., dCTP). A reactive pyridyl dithiol reacts with a sulfhydryl group to produce a sulfhydryl bond that can be cleaved with reducing agents such as dithiothreitol. A deoxynucleotide containing an active arylamino group can be linked to a pyridyl dithiol group with an NHS ester heterobifunctional crosslinker (Pierce) and then reacted with the SH group on the fluorophore to produce a disulfide-linked, cleavable nucleotide-fluorophore complex useful in the methods of the invention. Alternatively, a cis-diol linkage between the nucleotide and the fluorophore may be cleaved by periodate. Various cleavable linkages are described in U.S. patent nos. 6,664,079 and 6,632,655, U.S. published application 20030104437, WO 04/18497, and WO 03/48387.

In other embodiments of the invention, detectable moieties are used that can be rendered undetectable by exposure to electromagnetic energy, such as light (photobleaching).

In embodiments of the invention that utilize extension probes containing a label attached to the probe via a cleavable linkage, or containing a label that can be photobleached, the sequencing method generally includes a cleavage or photobleaching step in one or more cycles after ligation and detection of the label have been performed. As described above, cleavage of the scissile linkage in an oligonucleotide extension probe may not proceed to completion (i.e., less than 100% of the newly ligated probes may be cleaved in the cycle in which they are ligated). Since such probes typically contain a non-extendable terminus or are capped, they do not participate in subsequent cycles. However, failure to cleave the probe means that the label remains attached to the template molecule to which the probe is hybridized, which generates a background signal (i.e., background fluorescence) that may increase noise in subsequent cycles. Adding a cleavage or photobleaching step that removes the label or renders it undetectable reduces this background and improves the signal-to-noise ratio. The cleavage or photobleaching may be performed in each cycle, or less frequently, such as once every two cycles, every three cycles, or every five cycles or more. In certain embodiments of the invention, it is not necessary in practice to add an additional step to cleave the cleavable linker. For example, a cleavage agent such as DTT may already be present in the wash buffer that is used to remove unligated extension probes.
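The effect of the label-removal step on background can be sketched with a simple bookkeeping model. The model, the function name `label_background`, and the example fraction of uncleaved probes are assumptions for illustration only: every probe whose scissile linkage fails to cleave is taken to stop extending and to retain its label until a label cleavage or photobleaching step is performed.

```python
def label_background(n_cycles, uncleaved_fraction, clear_every=None):
    """Return a list of (signal, background) tuples, one per cycle.

    signal     : fraction of templates carrying a freshly ligated label
    background : fraction of templates carrying a stale label left over
                 from an earlier cycle in which scissile cleavage failed
    clear_every: perform label cleavage/photobleaching every k cycles
                 (None means the step is never performed)
    """
    active = 1.0       # templates still being extended
    background = 0.0   # templates retaining a label from an earlier cycle
    history = []
    for cycle in range(1, n_cycles + 1):
        history.append((active, background))
        failed = active * uncleaved_fraction  # scissile linkage not cleaved
        active -= failed                      # these strands stop extending
        background += failed                  # but their label stays attached
        if clear_every and cycle % clear_every == 0:
            background = 0.0                  # label removed or bleached
    return history

# Compare no label removal with removal in every cycle (5% uncleaved, assumed).
for never, always in zip(label_background(5, 0.05),
                         label_background(5, 0.05, clear_every=1)):
    print(never, always)
```

Under these assumptions the stale-label background grows in every cycle when the step is omitted and stays at zero when it is performed in each cycle, which is the improvement in signal-to-noise ratio described above.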

G. Preferred scissile linkages

The present inventors have found that extension probes comprising at least one phosphorothioate linkage are particularly useful in methods of sequencing by successive cycles of extension, ligation, detection and cleavage. In such a linkage, one of the bridging oxygen atoms of the phosphodiester bond is replaced by a sulfur atom. The phosphorothioate linkage may be a 5'-S-phosphorothioate linkage (3'-O-P-S-5') as shown in FIG. 4A or a 3'-S-phosphorothioate linkage (3'-S-P-O-5') as shown in FIG. 4B. It is understood that the phosphorus atom in a linkage designated 3'-O-P-S-5' or 3'-S-P-O-5' may be attached to two non-bridging oxygen atoms, as shown in FIGS. 4A and 4B (as in a typical phosphodiester linkage). Alternatively, the phosphorus atom may be attached to various other atoms or groups, such as S, CH₃, BH₃, and the like. Thus, one aspect of the invention is a labeled oligonucleotide probe comprising a phosphorothioate linkage. Although the probes are particularly useful in the sequencing methods described herein, they can also be used for a variety of other purposes. Specifically, the present invention provides (i) oligonucleotides of the form 5'-O-P-O-X-O-P-S-(N)_k-N_B*-3'; and (ii) oligonucleotides of the form 5'-N_B*-(N)_k-S-P-O-X-3'. In these probes, N represents any nucleotide, N_B represents a moiety that is not extendable by ligase, * represents a detectable moiety, X represents a nucleotide, and k is 1-100. In certain embodiments, k is 1 to 50, 1 to 30, or 1 to 20, such as 4 to 10, with the proviso that the detectable moiety may be present on any nucleotide of (N)_k instead of, or in addition to, N_B. The terminal nucleotide in these probes may or may not include a phosphate group or a hydroxyl group. It is also to be understood that in preferred embodiments the phosphorus atom is typically attached to two other (non-bridging) oxygen atoms.

Methods for synthesizing oligonucleotides containing 5'-S-phosphorothioate or 3'-S-phosphorothioate linkages are known in the art, some of which are suitable for automated solid phase oligonucleotide synthesis. For synthetic methods see, for example: Cook, AF, J. Am. Chem. Soc., 92: 190-; Chladek, S. et al., J. Am. Chem. Soc., 94: 2079-; Rybakov, VN et al., Nucleic Acids Res., 9: 189-201, 1981; Cosstick, R. and Vyle, JS, J. Chem. Soc. Chem. Commun., 992-; Mag, M. et al., Nucleic Acids Res., 19(7): 1437-1441, 1991; Xu, Y. and Kool, ET, Nucleic Acids Res., 26(13): 3159-3164, 1998; Cosstick, R. and Vyle, JS, Tetrahedron Lett., 30: 4693-4696, 1989; Cosstick, R. and Vyle, JS, Nucleic Acids Res., 18: 829-835, 1990; Sun, SG and Piccirilli, JA, Nucl. Nucl., 16: 1543-1545, 1997; Sun, SG et al., RNA, 3: 1352-1363, 1997; Vyle, JS et al., Tetrahedron Lett., 33: 3017-3020, 1992; Li, X. et al., J. Chem. Soc. Perkin Trans., 1: 2123-22129, 1994; Liu, XH and Reese, CB, Tetrahedron Lett., 37: 925-928, 1996; Weinstein, LB et al., J. Am. Chem. Soc., 118: 10341-10350, 1996; and Sabbagh, G. et al., Nucleic Acids Res., 32(2): 495-501, 2004. In addition, the present inventors have developed a novel synthesis method. For example, FIG. 7 shows a synthesis scheme for a 3'-phosphoramidite of dA. A similar scheme can be used to synthesize a 3'-phosphoramidite of dG. These phosphoramidites can be used to synthesize oligonucleotides containing 3'-S-phosphorothioate linkages attached to purine nucleosides, e.g., using an automated DNA synthesizer.

Various metal-containing agents can be used to cleave phosphorothioate linkages. The metal may be, for example, Ag, Hg, Cu, Mn, Zn or Cd. Preferably, the agent is a water-soluble salt that provides Ag⁺, Hg⁺⁺, Cu⁺⁺, Mn⁺⁺, Zn⁺ or Cd⁺ ions (salts providing ions in other oxidation states may also be employed). I₂ may also be used. Silver salts such as silver nitrate (AgNO₃), or other salts that provide Ag⁺ ions, are particularly preferred. Suitable conditions include, for example, 50 mM AgNO₃ at about 22-37 °C for 10 minutes or longer, e.g., 30 minutes. Preferably, the pH is from 4.0 to 10.0, more preferably from 5.0 to 9.0, such as from about 6.0 to 8.0, e.g., about 7.0. See, e.g., Mag, M. et al., Nucleic Acids Res., 19(7): 1437-1441, 1991. Example 1 provides an exemplary protocol.

Sequencing can be performed in the 5'→3' direction using an extension probe containing a 3'-O-P-S-5' linkage. FIG. 5A shows one cycle of hybridization, ligation and cleavage using an extension probe of the form 5'-O-P-O-X-O-P-S-NN_B*-3', in which N represents any nucleotide, N_B represents a moiety that is not extendable by ligase (e.g., N_B is a nucleotide lacking a 3' hydroxyl group or having a blocking moiety attached thereto), * represents a detectable moiety, and X represents a nucleotide whose identity corresponds to the detectable moiety. Alternatively, any of a variety of blocking moieties may be attached to the 3' terminal nucleotide to prevent multiple ligations. For example, attaching a bulky group to the sugar moiety of a nucleotide at, e.g., the 2' or 3' position will prevent ligation. A fluorescent label may serve as a suitable bulky group.

A template comprising a binding region 40 and a polynucleotide region 50 of unknown sequence is attached to a support such as a bead. In a preferred embodiment, as shown in FIG. 5A, the binding region is located at the end of the template opposite the point of attachment to the support. An initiator oligonucleotide 30 having an extendable terminus (in this case a free 3' OH group) is annealed to the binding region 40. The extension probe 60 hybridizes to the polynucleotide region 50 of the template. The nucleotide X forms a complementary base pair with the unknown nucleotide Y in the template. The extension probe 60 is ligated to the initiator oligonucleotide (e.g., using T4 ligase). After ligation, the label (not shown) attached to the extension probe 60 is detected. The label corresponds to the identity of nucleotide X. Thus, nucleotide Y is identified as the nucleotide complementary to nucleotide X. The extension probe 60 is then cleaved at the phosphorothioate linkage (e.g., using AgNO₃ or another salt that provides Ag⁺ ions) to produce an extended duplex. Cleavage leaves a phosphate group on the 3' end of the extended duplex. Treatment with phosphatase generates an extendable probe terminus on the extended duplex. The process is repeated for the desired number of cycles.

In a preferred embodiment, an extension probe containing a 3'-S-P-O-5' linkage is used to sequence in the 3'→5' direction. FIG. 5B shows one cycle of hybridization, ligation and cleavage using an extension probe of the form 5'-N_B*-NNNN-S-P-O-X-3', in which N represents any nucleotide, N_B represents a moiety that is not extendable by ligase (e.g., N_B is a nucleotide lacking a 5' phosphate group or having a blocking moiety attached thereto), * represents a detectable moiety, and X represents a nucleotide whose identity corresponds to the detectable moiety.

A template comprising a binding region 40 and a polynucleotide region 50 of unknown sequence is attached to a support such as a bead. In a preferred embodiment, as shown in FIG. 5B, the binding region is located at the end of the template opposite the point of attachment to the support. An initiator oligonucleotide 30 having an extendable terminus (in this case a free 5' phosphate group) is annealed to the binding region 40. The extension probe 60 hybridizes to the polynucleotide region 50 of the template. The nucleotide X forms a complementary base pair with the unknown nucleotide Y in the template. The extension probe 60 is ligated to the initiator oligonucleotide (e.g., using T4 ligase). After ligation, the label (not shown) attached to the extension probe 60 is detected. The label corresponds to the identity of nucleotide X. Thus, nucleotide Y is identified as the nucleotide complementary to nucleotide X. The extension probe 60 is then cleaved at the phosphorothioate linkage (e.g., using AgNO₃ or another salt that provides Ag⁺ ions) to produce an extended duplex. Cleavage leaves an extendable monophosphate group on the 5' end of the extended duplex, so no additional step is needed to generate an extendable terminus. The process is repeated for the desired number of cycles.
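The base-calling logic of a cycle (label identifies X, and Y is the complement of X) can be sketched as follows. The label-to-nucleotide assignment in `X_TO_LABEL` is hypothetical, and the sketch idealizes every step as 100% efficient; it models only the logic, not the chemistry.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Hypothetical assignment of labels to the proximal nucleotide X.
X_TO_LABEL = {"A": "blue", "C": "green", "G": "yellow", "T": "red"}
LABEL_TO_X = {label: x for x, label in X_TO_LABEL.items()}

def call_template_base(detected_label):
    """Identify template nucleotide Y from the label of the ligated probe."""
    x = LABEL_TO_X[detected_label]   # the label identifies X
    return COMPLEMENT[x]             # Y is complementary to X

def run_cycles(template_bases):
    """Idealized cycles of hybridization, ligation, detection and cleavage.

    `template_bases` lists the unknown nucleotides in the order in which
    they are interrogated; each cycle reads the next adjacent position.
    """
    calls = []
    for y in template_bases:
        x = COMPLEMENT[y]            # only the probe whose X pairs with Y ligates
        label = X_TO_LABEL[x]        # detection step
        calls.append(call_template_base(label))
        # cleavage at the phosphorothioate linkage leaves X on the duplex
    return "".join(calls)

print(run_cycles("GATTACA"))
```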

It will be appreciated that many variations of this approach may be employed. For example, the probe may be shorter or longer than 6 nucleotides; the label need not be on the 3' terminal nucleotide; the P-S linkage can be between any two adjacent nucleotides; and so on. In the above embodiments, successive cycles of extension, ligation, detection and cleavage result in the identification of nucleotides at adjacent positions. However, by placing the P-S linkage closer to the distal end of the extension probe (i.e., the end opposite the one at which ligation occurs), the sequentially identified nucleotides will be distributed at intervals along the template, as described above and in FIGS. 1 and 6.

FIGS. 6A-6F are more detailed schematic diagrams of several sequencing reactions performed sequentially on one template. Sequencing is performed in the 3'→5' direction with extension probes containing a 3'-S-P-O-5' linkage. Each sequencing reaction includes multiple cycles of extension, ligation, detection, and cleavage. Each reaction utilizes an initiator oligonucleotide that binds to a different portion of the template. The extension probe is 8 nucleotides in length and contains a phosphorothioate linkage between the 6th and 7th nucleotides from the 3' end of the probe. Nucleotides 2-6 serve as a spacer, so that each reaction identifies multiple nucleotides at intervals along the template. The complete sequence of a portion of the template is determined by performing a plurality of reactions in succession and appropriately combining the partial sequence information obtained from each reaction.

FIG. 6A shows priming with a first initiator oligonucleotide (referred to as a primer in FIGS. 6A-6F) that hybridizes to an adapter sequence (referred to above as a binding region) in the template to provide an extendable duplex. FIGS. 6B-6D show several cycles of nucleotide identification, in which every sixth base in the template is read. In FIG. 6B, a first extension probe, whose 3' terminal nucleotide is complementary to the first unknown nucleotide in the template sequence, binds to the template and is ligated to the extendable end of the primer. The label attached to the extension probe identifies the 3' terminal nucleotide of the probe as A, thereby identifying the first unknown nucleotide of the template sequence as its complement, T. FIG. 6C shows cleavage of the extension probe at the phosphorothioate linkage with AgNO₃, which releases the portion of the extension probe to which the label is attached. FIG. 6D shows further cycles of extension, ligation and cleavage. Since the spacer contained in the probe is 5 nucleotides in length, the sequencing reaction identifies every sixth nucleotide of the template.

After the desired number of cycles, the extended strand comprising the first initiator oligonucleotide is removed and a second initiator oligonucleotide, which binds to a different portion of the binding region than the first initiator oligonucleotide, is hybridized to the template. FIG. 6E shows a second sequencing reaction in which priming with the second initiator oligonucleotide is followed by several cycles of nucleotide identification. FIG. 6F shows priming with a third initiator oligonucleotide followed by several cycles of nucleotide identification. Extension from the second initiator oligonucleotide identifies every sixth base in a "reading frame" different from that of the nucleotides identified in the first sequencing reaction.
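How the interleaved reading frames combine into a contiguous sequence can be sketched as follows. The sketch assumes, for illustration only, that successive initiator oligonucleotides are offset from one another by exactly one nucleotide and that six nucleotides of each probe remain on the growing strand after cleavage; the function names are hypothetical.

```python
def interrogated_positions(offset, retained_length, n_cycles):
    """Template positions (0-based) read by one sequencing reaction.

    offset          : start position set by the initiator oligonucleotide
    retained_length : probe nucleotides left on the strand after cleavage
    """
    return [offset + retained_length * cycle for cycle in range(n_cycles)]

def assemble(reads_by_offset, retained_length):
    """Interleave the base calls of several reactions into one sequence.

    Positions that no reaction has read are reported as 'N'.
    """
    calls = {}
    for offset, bases in reads_by_offset.items():
        for cycle, base in enumerate(bases):
            calls[offset + retained_length * cycle] = base
    if not calls:
        return ""
    return "".join(calls.get(i, "N") for i in range(max(calls) + 1))

# Six reactions, three cycles each, on an 18-nucleotide example region.
template = "ACGTTGCAAGCTTAGGCA"
reads = {
    offset: [template[p] for p in interrogated_positions(offset, 6, 3)]
    for offset in range(6)
}
print(interrogated_positions(0, 6, 3))   # first reaction: every sixth base
print(assemble(reads, 6) == template)
```

With fewer than six reactions the assembled string contains 'N' at the positions of the missing reading frames, which reflects the need to perform a plurality of reactions to obtain a complete sequence.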

Although extension probes comprising phosphorothioate linkages are preferred in certain embodiments of the invention, various other scissile linkages may also be used to advantage. For example, many variations on the O-P-O linkage found in naturally occurring nucleic acids are known (see, e.g., Micklefield, J., Curr. Med. Chem., 8: 1157-). Any of the structures described therein that contain a P-O bond can be modified to contain a scissile P-S bond. For example, an NH-P-O bond may be changed to an NH-P-S bond.

In some embodiments of the invention, the extension probes contain a trigger residue that, optionally after modification by a modifying agent, renders the nucleic acid susceptible to cleavage by a cleavage agent or a combination of cleavage agents. In particular, the inventors have found that enzymes involved in DNA repair are advantageous cleavage reagents for methods of sequencing by successive cycles of extension, ligation, detection and cleavage. Typically, the presence of a trigger residue, such as a damaged base or an abasic residue, in an extension probe renders the probe susceptible to cleavage by one or more DNA repair enzymes, optionally after modification by a DNA glycosylase. Thus, extension probes containing a linkage that serves as a cleavage substrate for an enzyme involved in DNA repair, such as an AP endonuclease, may be used in the present invention. Extension probes containing residues that are substrates for modification by enzymes involved in DNA repair, such as DNA glycosylases, where the modification renders the probe susceptible to cleavage by an AP endonuclease, are also particularly useful in the present invention. In some embodiments, the extension probe contains an abasic residue, i.e., a residue that lacks a purine or pyrimidine base. The linkage between the abasic residue and the adjacent nucleoside is susceptible to cleavage by an AP endonuclease and is therefore a scissile linkage. In certain embodiments of the invention, the abasic residue comprises a 2' deoxyribose. In some embodiments, the extension probe comprises a damaged base. The damaged base is a substrate for an enzyme that removes damaged bases, such as a DNA glycosylase. After removal of the damaged base, the resulting linkage between the abasic residue and the adjacent nucleoside is susceptible to cleavage by an AP endonuclease and is therefore considered to be a scissile linkage of the present invention.

Many different AP endonucleases can be used as cleavage reagents in the present invention. Two major classes of AP endonucleases are distinguished by the mechanism by which they cleave the linkage adjacent to the abasic residue. Class I AP endonucleases, such as E. coli endonuclease III (Endo III) and endonuclease VIII (Endo VIII), as well as the human homologues hNTH1, NEIL1, NEIL2 and NEIL3, are AP lyases that cleave DNA 3' to the AP residue; this cleavage produces a 5' portion carrying a 3' terminal phosphate and a 3' portion carrying a 5' terminal phosphate. Class II AP endonucleases, such as E. coli endonuclease IV (Endo IV) and exonuclease III (Exo III), cleave DNA 5' to the AP site, which results in a 3' OH and a 5' deoxyribose phosphate moiety at the ends of the resulting fragments. See, e.g., Doublie, S. et al., Proc. Natl. Acad. Sci. 101(28), 10284-; Haltiwanger, B.M. et al., Biochem. J., 345, 85-89, 2000; Levin, J. and Demple, B., Nucl. Acids Res., 18(17), 1990; and the references cited therein, for further discussion of various class I and class II AP endonucleases and the conditions under which they remove damaged bases from DNA and/or cleave DNA containing abasic residues. It will be appreciated by those of ordinary skill in the art that various homologues of these enzymes present in other organisms (e.g., yeast) may be used in the present invention.

Certain enzymes are bifunctional, having both a glycosylase activity that removes damaged bases to produce AP residues and a lyase activity that cleaves the phosphodiester backbone 3' to the AP site resulting from the glycosylase activity. Thus, these dual-activity enzymes are both AP endonucleases and DNA glycosylases. For example, Endo VIII acts as both an N-glycosylase and an AP-lyase. The N-glycosylase activity releases damaged pyrimidines from double-stranded DNA, producing an apyrimidinic (AP) site. The AP-lyase activity cleaves 3' and 5' to the AP site, leaving a 5' phosphate and a 3' phosphate. The damaged bases recognized and excised by endonuclease VIII include urea, 5,6-dihydroxythymine, thymine glycol, 5-hydroxy-5-methylhydantoin, uracil glycol, 6-hydroxy-5,6-dihydrothymine, and methyltartronylurea. See, e.g., Dizdaroglu, M. et al., Biochemistry, 32, 12105-; Jiang, D. et al., J. Biol. Chem., 272(51), 32220-32229, 1997; Jiang, D. et al., J. Bact., 179(11), 3773-3782, 1997.

Fpg (formamidopyrimidine [fapy]-DNA glycosylase), also known as 8-oxoguanine DNA glycosylase, also acts as both an N-glycosylase and an AP-lyase. The N-glycosylase activity releases damaged purines from double-stranded DNA, creating an apurinic (AP) site. The AP-lyase activity cleaves both 3' and 5' to the AP site, thereby removing the AP site and creating a 1-base gap. Some of the damaged bases recognized and removed by Fpg include 7,8-dihydro-8-oxoguanine (8-oxoguanine), 8-oxoadenine, fapy-guanine, methyl-fapy-guanine, fapy-adenine, aflatoxin B1-fapy-guanine, 5-hydroxy-cytosine, and 5-hydroxy-uracil. See, e.g., Tchou, J. et al., J. Biol. Chem., 269, 15318-15324, 1994; Hatahet, Z. et al., J. Biol. Chem., 269, 18814-18820, 1994; Boiteux, S. et al., EMBO J., 5, 3177-3183, 1987; Jiang, D. et al., J. Biol. Chem., 272(51), 32220-32229, 1997; Jiang, D. et al., J. Bact., 179(11), 3773-3782, 1997.

A number of DNA glycosylases and AP endonucleases are commercially available, for example, from New England Biolabs, Ipswich, Mass.

In some embodiments of the invention, an extension probe containing a site that is a substrate for cleavage by an AP endonuclease is employed in sequencing method A or sequencing method AB (see below), as described above for extension probes containing phosphorothioate linkages. In any of these methods, after the extension probe is ligated to the growing nucleic acid strand, the extension probe is cleaved with an AP endonuclease to remove the portion of the probe containing the label.

Depending on the particular AP endonuclease and depending on whether sequencing is performed in the 3 '→ 5' or 5 '→ 3' direction, it may be necessary or desirable to treat the extension duplex with a polynucleotide kinase or phosphatase after cleavage to generate extendable probe ends on the extension duplex (see fig. 5A and 5B for a description of extendable probe ends). Thus, in certain methods of the invention, polynucleotide kinase or phosphatase treatment is used to generate the extendable terminus. It will be appreciated by one of ordinary skill in the art that buffers suitable for the various enzymes may be employed, and that additional washing steps may be included to remove the enzymes and provide suitable conditions for the subsequent steps of the method.
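The choice of post-cleavage treatment can be summarized in a small lookup, sketched here under the assumption (taken from the description of FIGS. 5A and 5B) that sequencing in the 5'→3' direction requires a free 3' OH and sequencing in the 3'→5' direction requires a free 5' phosphate. The function name and string labels are hypothetical, and end structures other than a plain hydroxyl or phosphate (for example a 5' deoxyribose phosphate) are deliberately not handled.

```python
# Extendable terminus required on the extended duplex for each direction.
REQUIRED_END = {"5'->3'": "3'-OH", "3'->5'": "5'-phosphate"}

# Enzyme that converts a given cleaved end into the required end.
REPAIR = {
    ("3'-phosphate", "3'-OH"): "phosphatase",
    ("5'-OH", "5'-phosphate"): "polynucleotide kinase",
}

def end_treatment(direction, cleaved_end):
    """Return the enzyme treatment needed after cleavage, or None."""
    required = REQUIRED_END[direction]
    if cleaved_end == required:
        return None                      # terminus is already extendable
    try:
        return REPAIR[(cleaved_end, required)]
    except KeyError:
        raise ValueError(
            f"no simple treatment modeled for {cleaved_end} when "
            f"sequencing {direction}"
        ) from None

print(end_treatment("5'->3'", "3'-phosphate"))
print(end_treatment("3'->5'", "5'-phosphate"))
```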

In other embodiments, the extension probe contains a damaged base that serves as a substrate for removal by a DNA glycosylase. Various cytotoxic and mutagenic DNA bases are removed by different DNA glycosylases, which initiate the base excision repair pathway following DNA damage (Krokan, H.E. et al., Biochem. J., 325 (Pt 1): 1-16, 1997). DNA glycosylases cleave the N-glycosyl bond between the damaged base and the deoxyribose, releasing the free base and creating an apurinic/apyrimidinic (AP) site. In some embodiments, the extension probe contains a uracil residue that is removed by uracil-DNA glycosylase (UDG). UDG is found in all living organisms studied to date, and a large number of such enzymes are known in the art and are useful in the present invention (Frederica et al., Biochemistry, 29, 2353-). For example, mammalian cells contain at least 4 types of UDG: mitochondrial UNG1 and nuclear UNG2, SMUG1, TDG and MBD4 (Krokan et al., Oncogene, 21, 8935-). UNG1 and UNG2 belong to a highly conserved family represented by E. coli Ung.

In embodiments where the extension probe contains a damaged base, following ligation of the extension probe to the terminus of the extendable probe, the extended duplex is contacted with a glycosylase that removes the damaged base, thereby generating an abasic residue. Extension probes containing damaged bases removed by glycosylase are considered "readily modified to contain scissile linkages". The extended duplex is then contacted with an AP endonuclease, which cleaves the linkage between the abasic residue and the adjacent nucleoside, as described above. In certain embodiments of the invention, both reactions are performed with a dual-active enzyme that is a DNA glycosylase and an AP endonuclease. In some embodiments, the extended duplex containing the damaged base is contacted with a DNA glycosylase and an AP endonuclease. In various embodiments of the invention, these enzymes may be used in combination or sequentially (i.e., glycosylase followed by endonuclease).
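The two-step treatment (glycosylase, then AP endonuclease) can be sketched on a strand written 5'→3' as a string. The representation is an assumption for illustration: "U" stands for a uracil-containing residue, "_" for the abasic residue left after base removal, and the endonuclease is modeled as a class II enzyme that cleaves 5' to the AP site.

```python
ABASIC = "_"   # placeholder symbol for an abasic (AP) residue

def remove_uracil(strand):
    """Glycosylase step: every uracil base is removed, leaving an AP residue."""
    return strand.replace("U", ABASIC)

def cleave_5prime_of_ap(strand):
    """AP endonuclease step, modeled as cleavage 5' to a single AP site.

    Returns (upstream, downstream); the downstream fragment begins with
    the abasic residue. Exactly one AP site is expected in the strand.
    """
    if strand.count(ABASIC) != 1:
        raise ValueError("expected exactly one abasic residue")
    site = strand.index(ABASIC)
    return strand[:site], strand[site:]

probe = "ACGUTT"                      # hypothetical probe, written 5'->3'
abasic_probe = remove_uracil(probe)
print(abasic_probe)
print(cleave_5prime_of_ap(abasic_probe))
```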

In some embodiments of the invention, the trigger residue comprised by the extension probe is deoxyinosine. As described above, E. coli endonuclease V (Endo V), also known as deoxyinosine 3' endonuclease, and homologs thereof cleave deoxyinosine-containing nucleic acids at the second phosphodiester bond 3' to the deoxyinosine residue, yielding 3' OH and 5' phosphate termini. Thus, this bond serves as a scissile linkage for the extension probe. Endo V and its cleavage properties are known in the art (Yao, M. and Kow, Y.W., J. Biol. Chem., 271, 30672-30673 (1996); Yao, M. and Kow, Y.W., J. Biol. Chem., 270, 28609-28616 (1995); He, B. et al., Mutat. Res., 459, 109-114 (2000)). In addition to deoxyinosine, Endo V also recognizes deoxyuridine, deoxyxanthosine and deoxyoxanosine (Hitchcock, T. et al., Nuc. Acids Res., 32(13) (2004)). Mammalian homologs such as mEndo V also have cleavage activity (Moe, A. et al., Nuc. Acids Res., 31(14), 3893-3900 (2004)). Although Endo V is a preferred cleavage agent for deoxyinosine-containing probes, other cleavage agents can also be used. For example, as a damaged base, hypoxanthine can be removed by a suitable DNA glycosylase, and the resulting extension probe containing an abasic residue can subsequently be cleaved by an AP endonuclease.
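The position of the Endo V cut relative to the deoxyinosine residue can be made concrete with a short sketch. The strand is written 5'→3' as a string in which "I" denotes deoxyinosine; this notation and the function name are assumptions for illustration. The cut is placed at the second phosphodiester bond 3' to the deoxyinosine, so one nucleotide 3' of the deoxyinosine stays on the upstream fragment.

```python
def endo_v_cleave(strand):
    """Split a 5'->3' strand at the second phosphodiester bond 3' to 'I'.

    Returns (upstream, downstream): the upstream fragment ends in a 3' OH
    and the downstream fragment begins with a 5' phosphate. A strand with
    no deoxyinosine, with more than one, or with fewer than two
    nucleotides 3' of it is rejected.
    """
    if strand.count("I") != 1:
        raise ValueError("expected exactly one deoxyinosine residue")
    cut = strand.index("I") + 2      # bond between residues I+1 and I+2
    if cut >= len(strand):
        raise ValueError("no second phosphodiester bond 3' to deoxyinosine")
    return strand[:cut], strand[cut:]

print(endo_v_cleave("ACIGTT"))
```

Rejecting strands with more than one deoxyinosine reflects the design consideration that a second trigger residue elsewhere in the probe would create an additional, unintended cleavage site.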

It will be appreciated that if deoxyinosine is used as a trigger residue, it may be desirable to avoid the use of deoxyinosine elsewhere in the probe, particularly at positions between the trigger residue and the terminus that will be ligated to the extendable probe terminus. Thus, if the probe contains one or more universal bases, nucleosides other than deoxyinosine may be employed. It will also be appreciated that when an extension probe contains a trigger residue that renders the nucleic acid susceptible to cleavage by a particular cleavage agent, it may be desirable to avoid including other residues that trigger cleavage by the same cleavage agent, both in that probe and in other probes that will be used in a sequencing reaction together with it.

The present invention includes the use of any enzyme that cleaves nucleic acids containing a trigger residue. Other enzymes can be identified by consulting the catalogs of enzyme suppliers such as New England Biolabs®, Inc. The New England Biolabs catalog, version 2005 (New England Biolabs, Ipswich, MA 01938-. Other enzymes that may be employed include, for example, hOGG1 and its homologues (Radicella, JP et al., Proc. Natl. Acad. Sci. USA, 94(15): 8010-5, 1997).

Methods for the synthesis of oligonucleotides containing trigger residues such as damaged bases, abasic residues, and the like are known in the art. Methods for the synthesis of oligonucleotides containing sites that serve as substrates for AP endonucleases, such as oligonucleotides containing abasic residues, are likewise known and are generally applicable to automated solid phase oligonucleotide synthesis. In some embodiments, an oligonucleotide is synthesized that contains a uridine at the position at which an abasic residue is desired. Treatment of the oligonucleotide with an enzyme such as UDG removes the uracil and thereby produces an abasic residue wherever uridine is present in the oligonucleotide.

In some embodiments of the invention, the oligonucleotide probe comprises a disaccharide nucleoside, as described in Nauwelaerts, K. et al., Nuc. Acids Res., 31(23), 2003. After ligation, the extended duplex is cleaved with periodate (NaIO₄) and then treated with base (e.g., NaOH) to remove the label, yielding free 3'-OH and 5'-OPO₃H₂ groups. Depending on whether sequencing is performed in the 3'→5' or 5'→3' direction, it may be necessary or desirable to treat the extended duplex with a polynucleotide kinase or phosphatase to create an extendable terminus. Thus, in certain methods of the invention, polynucleotide kinase or phosphatase treatment is used to generate the extendable terminus.

Polynucleotides containing disaccharide nucleosides are believed to contain abasic residues. For example, a polynucleotide having a ribose residue inserted between the 3 'OH of one nucleotide and the 5' phosphate group of the next nucleotide is considered to contain an abasic residue.

Capping

In some cases, not all extendable termini successfully participate in the ligation reaction in each cycle of extension, ligation and cleavage. It will be appreciated that if such termini participate in subsequent cycles, the accuracy of each nucleotide identification step will gradually decrease. Although the present inventors have demonstrated that the use of extension probes containing phosphorothioate linkages allows for efficient ligation, in certain embodiments of the invention a capping step is included to prevent unligated extendable termini from participating in subsequent cycles. When sequencing in the 5'→3' direction with an extension probe containing a 3'-O-P-S-5' phosphorothioate linkage, capping can be performed, e.g., after a ligation or detection step, by extending the unligated extendable terminus with a DNA polymerase and a non-extendable moiety, e.g., a chain-terminating nucleotide such as a dideoxynucleotide or a nucleotide to which a blocking moiety is attached. When sequencing in the 3'→5' direction with an extension probe containing a 3'-S-P-O-5' phosphorothioate linkage, the template may be treated with a phosphatase, e.g., after a ligation or detection step, which removes the 5' phosphate from unligated extendable termini and thereby caps them. Other capping methods may also be used.
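The loss of accuracy that capping is intended to prevent can be sketched with a two-population model. The model and the ligation efficiency used in the example are assumptions for illustration: strands that miss a ligation either remain extendable and lag behind the others (no capping) or are blocked permanently (capping), and incomplete cleavage is ignored.

```python
def phase_signal(n_cycles, ligation_efficiency, capping):
    """Return (in_phase, out_of_phase) label signal for each cycle.

    in_phase     : strands that have ligated in every cycle so far
    out_of_phase : strands that ligate now but missed an earlier cycle,
                   so their label reports the wrong template position
    """
    synchronized = 1.0   # extendable strands that have never missed a ligation
    lagging = 0.0        # extendable strands that have missed at least one
    rows = []
    for _ in range(n_cycles):
        correct = synchronized * ligation_efficiency
        wrong = lagging * ligation_efficiency
        rows.append((correct, wrong))
        missed = synchronized - correct
        synchronized = correct
        if not capping:
            lagging += missed     # uncapped strands stay extendable
        # with capping, the strands that missed are blocked and drop out
    return rows

# Assumed ligation efficiency of 90% per cycle.
for uncapped, capped in zip(phase_signal(5, 0.9, capping=False),
                            phase_signal(5, 0.9, capping=True)):
    print(uncapped, capped)
```

In this model the correct signal decays identically in both cases, while the out-of-phase signal grows from cycle to cycle only when capping is omitted.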

H. Sequencing with a family of oligonucleotide probes

In the sequencing methods described above, collectively referred to as "method A", there is a direct and known correspondence between the label attached to any particular extension probe and the identity of the nucleotide or nucleotides at the proximal end of the probe (i.e., the end that is ligated to the extendable probe terminus of the extended duplex). Thus, identification of the label of a newly ligated extension probe is sufficient to identify one or more nucleotides in the template. The present invention also provides sequencing methods that use a different approach to nucleotide identification. These methods, collectively referred to as "method AB", also include successive cycles of extension, ligation and, preferably, cleavage.

Sequencing method AB provided by the invention employs a collection of at least two differentially labeled oligonucleotide probe families. Each probe family is assigned a name, such as "red", "blue", "yellow" or "green", according to its label. As described above, extension is initiated from the duplex formed by the initiator oligonucleotide and the template. An oligonucleotide probe is ligated to the end of the initiator oligonucleotide to form an extended duplex, thereby extending the initiator oligonucleotide, and the duplex is then repeatedly extended by successive cycles of ligation. The probe contains a non-extendable moiety at a terminal position (at the end opposite the nucleotide that is ligated to the growing nucleic acid strand of the duplex) such that the duplex is extended only once in a single cycle. In each cycle, the label on or attached to the successfully ligated probe is detected, and the non-extendable moiety is removed or modified to generate an extendable terminus. Detection of the label determines the name of the probe family to which the probe belongs.

Successive cycles of extension, ligation and detection produce an ordered list of label names. These labels correspond to the probe families to which the successfully ligated probes, hybridized to the template at consecutive positions, belong. After ligation, the proximal nucleotide of each probe is located opposite a different nucleotide in the template. There is therefore a correspondence between the ordered list of probe family names and the nucleotide sequence of the template.

In embodiments of the invention in which the scissile linkage is located between the proximal nucleoside of the extension probe and the adjacent nucleoside, an ordered list of probe family names can be obtained by successive cycles of extension, ligation, detection and cleavage starting from a single initiator oligonucleotide, since each cycle extends the growing strand by one nucleotide. If the scissile linkage is located between two other nucleosides, the ordered list of probe family names is assembled from the results of multiple sequencing reactions that employ initiator oligonucleotides hybridizing to different positions in the binding region, as described for sequencing method A.

Knowing which probe family a newly ligated probe belongs to is not by itself sufficient to identify any nucleotide in the template. However, determining the probe family name eliminates certain combinations of nucleotides as possible sequences for at least a portion of the probe, while leaving at least two possible identities for each nucleotide. Thus, without additional information, the probe family name leaves at least two possible identities for each template nucleotide located opposite a nucleotide of the newly ligated probe. Any one cycle of extension, ligation and detection (and optionally cleavage) therefore cannot by itself identify any nucleotide in the template. It does, however, eliminate one or more possible template sequences and thereby provides sequence information. In certain embodiments of the invention, the template sequence can nevertheless be determined by appropriately designing the probes and probe families, as described below. In certain embodiments of the invention, sequencing method AB comprises two stages: the first stage obtains an ordered list of probe family names, and the second stage decodes the ordered list to determine the template sequence.

Unless otherwise indicated, sequencing methods A and AB generally employ similar methods for probe synthesis and template preparation, and similar steps of extension, ligation, cleavage and detection.

Features of oligonucleotide extension probes and probe families for sequencing method AB

The probe families for sequencing method AB are characterized in that each probe family comprises a plurality of labeled oligonucleotide probes of different sequences and, at each position of the sequence, the probe family comprises at least 2 probes that differ in the base at that position. The probes in each probe family carry the same label. Preferably, the probes comprise a scissile internucleoside linkage, which can be located anywhere in the probe. One end of the probe preferably carries a moiety that is not extendable by ligase. The probe is preferably labeled at a position between the scissile linkage and the non-extendable moiety, so that cleavage of the scissile linkage after ligation of the probe to the extendable probe terminus leaves an unlabeled portion attached to the extendable probe terminus and releases the labeled portion.

Probes in each probe family preferably contain at least j nucleosides X, where j is at least 2, and each X is at least 2-fold degenerate among the probes of the probe family. The probes of each probe family also contain at least k nucleosides N, where k is at least 2 and N represents any nucleoside. Typically, j + k is less than or equal to 100, and more typically less than or equal to 30. The nucleosides X can be located anywhere in the probe and are not necessarily contiguous. Similarly, the nucleosides N are not necessarily contiguous. In other words, nucleosides X and N can be interspersed. The nucleosides X can be considered to have a 5'→3' sequence even though they are not necessarily contiguous. For example, the nucleosides X of a probe having the structure X_A N X_G N N X_C N constitute the sequence AGC. Similarly, the nucleosides N can be considered to have a sequence.

The nucleosides X can be the same or different but cannot be selected independently, i.e., the identity of each X is limited by the identity of one or more other nucleosides X in the probe. Thus, typically only certain combinations of nucleosides X are present in any particular probe and among the probes of any particular probe family. In other words, the sequence of the nucleosides X in each probe can represent only a subset of all possible sequences of length j. Thus, the identity of one or more nucleosides X limits the possible identities of one or more of the other nucleosides X.

The nucleosides N are preferably independently selected and may be A, G, C or T (or optionally nucleosides of reduced degeneracy). The sequence of the nucleosides N preferably represents all possible sequences of length k, except that one or more N may be a nucleoside of reduced degeneracy. Thus, the probe contains two portions: the portion consisting of the nucleosides N is referred to as the unconstrained portion, and the portion consisting of the nucleosides X is referred to as the constrained portion. As noted above, these portions need not consist of contiguous nucleosides. Probes containing both a constrained and an unconstrained portion are referred to herein as partially constrained probes. The nucleoside or nucleosides of the constrained portion are preferably located at the proximal end of the probe, i.e., the end containing the nucleoside to be ligated to the extendable probe terminus, which may be the 5' or 3' end of the oligonucleotide probe in various embodiments of the invention.

Since the constrained portion of any oligonucleotide probe can have only certain sequences, knowledge of the identity of one or more nucleosides of the constrained portion provides information about one or more of its other nucleosides. This information may or may not be sufficient to identify the other nucleosides exactly, but it is sufficient to eliminate one or more possible identities of one or more other nucleosides of the constrained portion. In certain preferred embodiments of sequencing method AB, knowledge of the identity of one nucleoside of the constrained portion of the probe is sufficient to identify each of the other nucleosides of the constrained portion exactly, i.e., to determine the identity and order of the nucleosides that make up the constrained portion.

As described for the sequencing methods above, the most proximal nucleoside of the extension probe, which is complementary to the template, is ligated to the extendable terminus of the initiator oligonucleotide (in the first cycle of extension, ligation and detection) or to the extendable terminus of the extended oligonucleotide probe (in subsequent cycles). Detection determines the name of the probe family to which the newly ligated probe belongs. Since each position of the constrained portion of the probe is at least 2-fold degenerate, the probe family name by itself cannot identify any nucleotide of the constrained portion. However, since the sequence of the constrained portion is one of a subset of all possible sequences of length j, identifying the probe family does eliminate some of the possible sequences of the constrained portion. The constrained portion of the probe constitutes its sequencing portion. Thus, by eliminating one or more possible identities of one or more nucleosides of the constrained portion, identifying the probe family to which the probe belongs eliminates one or more possible identities of the template nucleotides to which the extension probe hybridizes. In a preferred embodiment of the invention, the partially constrained probe contains a scissile linkage between any two nucleosides.

In certain embodiments, the partially constrained probe has the general formula (X)_j(N)_k, wherein X represents a nucleoside; (X)_j is at least 2-fold degenerate at each position, so that each X can be any of at least 2 nucleosides with different base-pairing specificities; N represents any nucleoside; j is at least 2; k is 1-100; and at least one N or X, other than the X at the probe terminus, carries a detectable moiety. Preferably, (N)_k is independently 4-fold degenerate at each position, so that in each probe (N)_k represents all possible sequences of length k, except that one or more positions in (N)_k may be occupied by a nucleoside of reduced degeneracy. The nucleosides in (X)_j can be the same or different but cannot be independently selected. In other words, in each probe, (X)_j can represent only a subset of all possible sequences of length j. Thus, the identity of one or more nucleosides in (X)_j limits the possible identities of one or more of the other nucleosides. The probe therefore contains two portions, of which (N)_k is the unconstrained portion and (X)_j is the constrained portion.
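As a rough illustration of the general formula (X)_j(N)_k, the following Python sketch enumerates the probe sequences one probe family would contain when the constrained portion is drawn from a defined set and every N position is independently 4-fold degenerate. The function name `family_probe_sequences`, the choice k = 3, and the use of the "red" set {CT, AG, GA, TC} from the FIG. 25A encoding described later in this document are illustrative assumptions; labels, scissile linkages and the non-extendable moiety are not modeled.

```python
from itertools import product

BASES = "ACGT"

# Constrained portions (XY) of one family; taken from the "red" set of the
# FIG. 25A encoding purely as an example.
RED_CONSTRAINED = ["CT", "AG", "GA", "TC"]


def family_probe_sequences(constrained_set, k):
    """Yield every sequence (X)_j(N)_k for one probe family.

    Each constrained portion is combined with all 4**k unconstrained tails.
    """
    for constrained in constrained_set:
        for tail in product(BASES, repeat=k):
            yield constrained + "".join(tail)


# Hypothetical example with k = 3 unconstrained nucleosides.
probes = list(family_probe_sequences(RED_CONSTRAINED, k=3))
print(len(probes))  # number of distinct sequences sharing the same label
```

Under these assumptions the family contains 4 x 4^3 sequences, all carrying the same label, which is why the label alone cannot identify either constrained nucleoside.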

In certain preferred embodiments of the present invention, the partially constrained probe has the structure 5'-(X)_j(N)_k N_B*-3' or 3'-(X)_j(N)_k N_B*-5', wherein N represents any nucleoside; N_B represents a moiety that is not extendable by ligase; * represents a detectable moiety; (X)_j is a constrained portion of the probe that is at least 2-fold degenerate at each position; the nucleosides in (X)_j can be the same or different but cannot be independently selected; at least one internucleoside linkage is a scissile linkage; j is at least 2; and k is from 1 to 100; with the proviso that the detectable moiety may be present, instead of or in addition to on N_B, on any nucleoside N or X other than the X at the probe terminus. The scissile linkage can be located between two nucleosides of (X)_j, between the most distal nucleoside of (X)_j and the most proximal nucleoside of (N)_k, between nucleosides within (N)_k, or between the terminal nucleoside of (N)_k and N_B. The scissile linkage is preferably a phosphorothioate linkage.

In other more preferred embodiments of the invention, the probe has the structure 5'-(XY)(N)_k N_B*-3' or 3'-(XY)(N)_k N_B*-5', wherein N represents any nucleoside; N_B represents a moiety that is not extendable by ligase; * represents a detectable moiety; XY is a constrained portion of the probe in which X and Y represent nucleosides that can be the same or different but are not independently selected; X and Y are each at least 2-fold degenerate; at least one internucleoside linkage is a scissile linkage; and k is from 1 to 100; with the proviso that the detectable moiety may be present, instead of or in addition to on N_B, on any nucleoside N or X other than the X at the probe terminus. The scissile linkage is preferably a phosphorothioate linkage. Probes having the structure 5'-(XY)(N)_k N_B*-3' can be used for sequencing in the 5'→3' direction. Probes having the structure 3'-(XY)(N)_k N_B*-5' can be used for sequencing in the 3'→5' direction.

The structure of some preferred probes is described in more detail below. For sequencing in the 5'→3' direction, partially constrained probes having the structure 5'-O-P-O-(X)_j(N)_k-O-P-S-(N)_i N_B*-3' are used, wherein N represents any nucleoside; N_B represents a moiety that is not extendable by ligase; * represents a detectable moiety; (X)_j is a constrained portion of the probe that is at least 2-fold degenerate at each position; the nucleosides in (X)_j can be the same or different but are not independently selected; j is at least 2; (k + i) is from 1 to 100; k is from 1 to 100; and i is from 0 to 99; with the proviso that the detectable moiety may be present, instead of or in addition to on N_B, on any nucleoside of (N)_i. In certain embodiments of the invention, (X)_j is (XY), wherein X and Y are each at least 2-fold degenerate and represent nucleosides that can be the same or different but cannot be independently selected. In certain embodiments of the invention, i is 0.

Other preferred probes for sequencing in the 5'→3' direction have the structure 5'-O-P-O-(X)_j-O-P-S-(N)_i N_B*-3', wherein N represents any nucleoside; N_B represents a moiety that is not extendable by ligase; * represents a detectable moiety; (X)_j is a constrained portion of the probe that is at least 2-fold degenerate at each position; the nucleosides in (X)_j can be the same or different but cannot be independently selected; j is at least 2; and i is from 1 to 100; with the proviso that the detectable moiety may be present, instead of or in addition to on N_B, on any nucleoside of (N)_i. In certain embodiments of the invention, (X)_j is (XY), wherein positions X and Y are each at least 2-fold degenerate, and X and Y represent nucleosides that can be the same or different but cannot be independently selected. Another preferred probe for sequencing in the 5'→3' direction has the structure 5'-O-P-O-(X)_j-O-P-S-(X)_k(N)_i N_B*-3', wherein N represents any nucleoside; N_B represents a moiety that is not extendable by ligase; * represents a detectable moiety; (X)_j-O-P-S-(X)_k is a constrained portion of the probe that is at least 2-fold degenerate at each position; the nucleosides in (X)_j-O-P-S-(X)_k can be the same or different but are not independently selected; j and k are both at least 1; (j + k) is at least 2 (e.g., 2, 3, 4, or 5); and i is from 1 to 100; with the proviso that the detectable moiety may be present, instead of or in addition to on N_B, on any nucleoside of (N)_i. In certain embodiments of the invention, j and k are both 1.

For sequencing in the 3'→5' direction, partially constrained probes having the structure 5'-N_B*(N)_i-S-P-O-(N)_k-O-P-O-(X)_j-3' are used, wherein N represents any nucleoside; N_B represents a moiety that is not extendable by ligase; * represents a detectable moiety; (X)_j is a constrained portion of the probe that is at least 2-fold degenerate at each position; the nucleosides in (X)_j can be the same or different but are not independently selected; j is at least 2; (k + i) is from 1 to 100; k is from 1 to 100; and i is from 0 to 99; with the proviso that the detectable moiety may be present, instead of or in addition to on N_B, on any nucleoside of (N)_i. In certain embodiments of the invention, (X)_j is (XY), wherein X and Y are each at least 2-fold degenerate and represent nucleosides that can be the same or different but cannot be independently selected. In certain embodiments of the invention, i is 0.

Other preferred probes for sequencing in the 3'→5' direction have the structure 5'-N_B*(N)_i-S-P-O-(X)_j-3', wherein N represents any nucleoside; N_B represents a moiety that is not extendable by ligase; * represents a detectable moiety; (X)_j is a constrained portion of the probe that is at least 2-fold degenerate at each position; the nucleosides in (X)_j can be the same or different but cannot be independently selected; j is at least 2; and i is from 1 to 100; with the proviso that the detectable moiety may be present, instead of or in addition to on N_B, on any nucleoside of (N)_i. In certain embodiments of the invention, (X)_j is (XY), wherein X and Y are each at least 2-fold degenerate and represent nucleosides that can be the same or different but cannot be independently selected. In certain embodiments of the invention, j is from 2 to 5, e.g., 2, 3, 4, or 5, in any of the partially constrained probes.

Another preferred probe for sequencing in the 3'→5' direction has the structure 5'-N_B*(N)_i-S-P-O-(X)_k-O-P-O-(X)_j-3', wherein N represents any nucleoside; N_B represents a moiety that is not extendable by ligase; * represents a detectable moiety; -(X)_k-O-P-O-(X)_j is a constrained portion of the probe that is at least 2-fold degenerate at each position; the nucleosides in -(X)_k-O-P-O-(X)_j can be the same or different but are not independently selected; j and k are both at least 1; (j + k) is at least 2 (e.g., 2, 3, 4, or 5); and i is from 1 to 100; with the proviso that the detectable moiety may be present, instead of or in addition to on N_B, on any nucleoside of (N)_i. In certain embodiments, j is 1 and k is 1.

In embodiments of the invention in which the scissile linkage is located between the most proximal nucleoside of (X)_j and the next most proximal nucleoside of (X)_j, an ordered list of probe family names can be obtained by successive cycles of extension, ligation, detection and cleavage starting from a single initiator oligonucleotide, since each cycle extends the extended oligonucleotide probe by one nucleotide. In embodiments of the invention in which the scissile linkage is located between two other nucleosides, the ordered list of probe family names is assembled from the results of multiple sequencing reactions that employ initiator oligonucleotides hybridizing to different positions in the binding region, as described for sequencing method A.
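The relationship between successive cycles and the ordered list of probe family names can be sketched as follows. The sketch assumes the FIG. 25A encoding described later in this document, a two-nucleoside constrained portion, and a scissile linkage placed so that each cycle advances by one nucleotide, so that consecutive constrained portions overlap by one nucleoside. Sequences are written for the extended (probe-side) strand in the direction of extension. The helper names `family_names` and `decode` are hypothetical, and the first nucleoside is assumed to be known from some other source.

```python
# Defined sequence sets of the FIG. 25A encoding (proximal nucleoside first).
FAMILY_SETS = {
    "red": ("CT", "AG", "GA", "TC"),
    "yellow": ("CC", "AT", "GG", "TA"),
    "green": ("CA", "AC", "GT", "TG"),
    "blue": ("CG", "AA", "GC", "TT"),
}
FAMILY_OF = {seq: name for name, seqs in FAMILY_SETS.items() for seq in seqs}


def family_names(extended_strand):
    """Ordered list of family names for the overlapping dinucleosides."""
    return [FAMILY_OF[extended_strand[i:i + 2]]
            for i in range(len(extended_strand) - 1)]


def decode(first_base, names):
    """Rebuild the strand from one known nucleoside plus the family names."""
    strand = first_base
    for name in names:
        # In this encoding exactly one member of each set starts with a
        # given nucleoside, so the next nucleoside is determined.
        pair = next(s for s in FAMILY_SETS[name] if s[0] == strand[-1])
        strand += pair[1]
    return strand


names = family_names("ACGGT")
# Without a known nucleoside, the same list of names fits four strands.
candidates = [decode(base, names) for base in "ACGT"]
print(names, candidates)
```

The four candidates show why the ordered list of names alone does not fix the sequence, while a single known nucleoside resolves it under this encoding.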

It will be appreciated that probes having structures other than those described above may be used in sequencing method AB. For example, the probe may have a structure in which the constrained nucleosides X and Y are not adjacent, e.g., XNY(N)_k, or XIY(N)_k where I is a universal base. (N)_k X(N)_l, (N)_i X(N)_j Y(N)_k Z(N)_l, (N)_i X(N)_j YIZ(N)_l, and (N)_i X(N)_j Y(N)_k Z(I)_l represent other possibilities. As described for the probes above, these probes contain a scissile linkage, a detectable moiety, and a moiety at one end that is not extendable by ligase. Preferably, the probe does not comprise a detectable moiety linked to a nucleotide at the end of the probe opposite the moiety that is not extendable by ligase. Probe families made up of probes having any of these or other structures satisfy the criterion that each probe family includes a plurality of labeled oligonucleotide probes differing in sequence and that, at each position of the sequence, the probe family includes at least 2 probes differing in the base at that position. The total number of nucleosides in each probe is preferably 100 or less, such as 30 or less.

Encoding of oligonucleotide extension probe families

The sequencing methods of the invention utilize encoded probe families. An "encoding" refers to a scheme that associates a specific label with probes containing a portion whose sequence is one of a defined set of sequences, such that probes containing a portion whose sequence is a member of the defined set are labeled with that label. Typically, an encoding associates each of a plurality of distinguishable labels with one or more probes, such that each distinguishable label is associated with a different set of probes and each probe is labeled with only one label (which may comprise a combination of detectable moieties). Preferably, the probes of each set each contain a portion whose sequence is a member of the same defined set of sequences. The portion may be one nucleoside or multiple nucleosides in length, such as 2, 3, 4, 5 or more nucleosides. The portion may constitute only a small part of the full length of the probe, or may constitute the entire probe. The defined set of sequences may contain only one sequence or any number of different sequences, depending on the length of the portion. For example, if the portion is one nucleoside in length, the defined set of sequences can contain at most 4 elements (A, G, C, T). If the portion is two nucleosides in length, the defined set of sequences can contain up to 16 elements (AA, AG, AC, AT, GA, GG, GC, GT, CA, CG, CC, CT, TA, TG, TC, TT). Typically, a defined set of sequences contains fewer elements than the total number of possible sequences, and the encoding employs more than one defined set of sequences.
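The element counts quoted above (4 for a one-nucleoside portion, 16 for a two-nucleoside portion) follow from enumerating all sequences of a given length. A minimal Python sketch, in which the function name `all_sequences` is an illustrative assumption:

```python
from itertools import product

# Same base order as used in the text (A, G, C, T).
BASES = "AGCT"


def all_sequences(length):
    """Return every possible sequence of the given length (4**length items)."""
    return ["".join(p) for p in product(BASES, repeat=length)]


mononucleosides = all_sequences(1)
dinucleosides = all_sequences(2)
print(len(mononucleosides), len(dinucleosides))
print(dinucleosides)
```

A defined set of sequences for a portion of length n is then any subset of `all_sequences(n)`.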

Sequencing methods A described herein generally utilize sets of probes with a simple encoding, in which the proximal nucleoside (i.e., the nucleoside ligated to the extendable probe terminus) corresponds directly to the identity of the label. The proximal nucleoside is complementary to the template nucleotide to which it hybridizes, so that the identity of the proximal nucleoside in the newly ligated probe determines the identity of the template nucleotide located opposite it in the extended duplex. In general, probes for use in the other sequencing methods described herein have the structure X(N)_k, where X is the proximal nucleoside and each nucleoside N is 4-fold degenerate, so that all possible sequences of length k are represented in the pool of oligonucleotide probe molecules that make up the probe. Thus, for example, some oligonucleotide probe molecules contain A at position k = 1, others contain G at position k = 1, others contain C at position k = 1, and others contain T at position k = 1, and the situation is similar for the other positions k, where the nucleoside of (N)_k adjacent to X is considered to occupy position k = 1, the next nucleoside of (N)_k is considered to occupy position k = 2, and so on. However, in any given oligonucleotide probe, X represents only one base-pairing specificity, which generally corresponds to a particular nucleoside identity such as A, G, C or T. Therefore, X in the pool of probe molecules that make up a particular probe is generally A, G, C or T. FIG. 2 shows a suitable encoding for probes having the structure X(N)_k. According to this encoding, the label "red" is assigned to probes with X = C; the label "yellow" is assigned to probes with X = A; the label "green" is assigned to probes with X = G; and the label "blue" is assigned to probes with X = T. Thus, there is a one-to-one correspondence between the sequencing portion of the probe and its label.
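For comparison with the encodings of method AB, the one-to-one encoding just described can be written as a simple lookup. This is a minimal sketch that assumes standard Watson-Crick complementarity between the proximal nucleoside and the template nucleotide; the names `SIMPLE_ENCODING` and `template_base_from_label` are illustrative.

```python
# Label assigned to each proximal nucleoside X (encoding stated for FIG. 2).
SIMPLE_ENCODING = {"C": "red", "A": "yellow", "G": "green", "T": "blue"}
LABEL_TO_PROXIMAL = {label: base for base, label in SIMPLE_ENCODING.items()}

# Assumed Watson-Crick pairing between probe and template.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def template_base_from_label(label):
    """Infer the template nucleotide opposite the proximal nucleoside."""
    proximal = LABEL_TO_PROXIMAL[label]
    return COMPLEMENT[proximal]


calls = [template_base_from_label(lbl) for lbl in ("red", "yellow", "green", "blue")]
print(calls)
```

Because each label maps to exactly one proximal nucleoside, a single detection event identifies one template nucleotide, which is not the case for the probe families of method AB.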

It will be appreciated that the above approach, in which the identity of the label of the newly ligated extension probe is mapped to the identity of the most proximal nucleoside of the extension probe, may be extended to encodings in which the identity of the label corresponds not only to the most proximal nucleoside but to the 2 or more most proximal nucleosides of the extension probe, so that the identities of multiple nucleotides in the template are determined in one cycle of extension, ligation and detection (typically followed by cleavage). However, such an encoding still associates each label with a single sequence of the oligonucleotide extension probe, so as to identify the complementary nucleotides located at the opposite positions in the template. As described above, identifying two nucleotides in one cycle in this way requires 16 different oligonucleotide probes, each with its own label (i.e., 16 distinguishable labels).

Sequencing method AB uses a different approach to associating labels with probes. The same label is assigned to a plurality of probes having different sequencing portions, so that there is no one-to-one correspondence between the identity of the label and the sequence of the sequencing portion of the probe. The probes are partially constrained probes, and the constrained portion of each probe is its sequencing portion. Thus, the same label is assigned to a plurality of different probes whose constrained portions differ in sequence, each such sequence being one of a defined set of sequences. As described above, probes carrying the same label constitute a "probe family". The method employs a plurality of such probe families, each comprising a plurality of probes whose constrained portions differ in sequence, wherein each sequence is one of a defined set of sequences.

A plurality of probe families is referred to as a "set" of probe families. The probes of each probe family in a set are labeled with a label that is distinguishable from the labels used for the other probe families of the set. Each probe family preferably has its own defined set of sequences. Preferably, the constrained portions of the probes in each probe family are the same length, and preferably the constrained portions of all probe families in the set are the same length. Preferably, the union of the defined sets of sequences of the probe families in the set includes all possible sequences of the length of the constrained portion. Preferably, the set of probe families comprises or consists of 4 distinguishably labeled probe families. Preferably, the constrained portion of the probes is 2 nucleosides in length.

Any set of distinguishably labeled probe families that meets the criteria set forth above can be used to practice the methods of the invention. However, certain sets of probe families are preferred. An exemplary encoding for a preferred set of 4 distinguishably labeled probe families consisting of partially constrained probes is shown in FIG. 25A. As shown in FIG. 25A, the constrained portion consists of the 2 nucleosides closest to the 3' end of the probe. The probe families are labeled "red", "yellow", "green" and "blue". The probes of each probe family include a constrained portion whose sequence is one of the sequences in a defined set of sequences, and the defined set is different for each probe family. For example, reading each sequence from the 3' end, which is taken as the proximal end of the probe, the defined set of sequences for the "red" probe family is {CT, AG, GA, TC}; the defined set for the "yellow" probe family is {CC, AT, GG, TA}; the defined set for the "green" probe family is {CA, AC, GT, TG}; and the defined set for the "blue" probe family is {CG, AA, GC, TT}. A preferred feature is that no defined set of sequences contains a member present in any other set. In addition, the union of the defined sets of sequences of the probe families in the set includes all possible sequences of length 2, i.e., all possible dinucleosides. Another feature (preferred but not required) of this set of probe families is that each position of the constrained portion of the probes is 4-fold degenerate, i.e., each position can be occupied by A, G, C or T. Another feature (preferred but not required) is that, within each defined set of sequences, only one sequence has any particular nucleoside at any given position, such as the most proximal position or any other position.
It is particularly preferred, but not required, that if the most proximal nucleoside is considered to occupy position 1, then within each defined set of sequences only one sequence has any particular nucleoside at position 2 or at any higher position within the constrained portion. For example, in the defined set of sequences of the red probe family, only one sequence has a T at position 2; only one sequence has a G at position 2; only one sequence has an A at position 2; and only one sequence has a C at position 2.
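The features listed for the FIG. 25A encoding can be checked mechanically. The following Python sketch tests the four defined sets quoted above for three properties: no shared members, coverage of all 16 dinucleosides, and only one sequence per nucleoside at each position. The function names are illustrative assumptions, not terms from the specification.

```python
from itertools import product

# Defined sets of sequences as quoted in the text (proximal nucleoside first).
FAMILY_SETS = {
    "red": {"CT", "AG", "GA", "TC"},
    "yellow": {"CC", "AT", "GG", "TA"},
    "green": {"CA", "AC", "GT", "TG"},
    "blue": {"CG", "AA", "GC", "TT"},
}
ALL_DINUCLEOSIDES = {a + b for a, b in product("ACGT", repeat=2)}


def no_shared_members(families):
    """True if no sequence appears in more than one defined set."""
    seen = set()
    for seqs in families.values():
        if seen & seqs:
            return False
        seen |= seqs
    return True


def covers_all_dinucleosides(families):
    """True if the union of the defined sets is all 16 dinucleosides."""
    return set().union(*families.values()) == ALL_DINUCLEOSIDES


def one_sequence_per_nucleoside(families):
    """True if, in every set, each base occurs exactly once per position."""
    return all(
        len(seqs) == 4 and {seq[pos] for seq in seqs} == set("ACGT")
        for seqs in families.values()
        for pos in (0, 1)
    )


print(no_shared_members(FAMILY_SETS),
      covers_all_dinucleosides(FAMILY_SETS),
      one_sequence_per_nucleoside(FAMILY_SETS))
```

The third check also implies that each position of the constrained portion is 4-fold degenerate within every family.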

For an encoding such as that shown in FIG. 25A, knowledge of the identity of one or more nucleosides in the constrained portion of a probe of a probe family provides information about the other nucleosides in the constrained portion of that probe. In the most general sense, knowledge of the identity of one or more nucleosides of the constrained portion of a probe provides sufficient information to exclude one or more possible identities of the nucleoside at another position, since the defined set of sequences for the probe family does not include a sequence having that nucleoside at that position. In general, such knowledge provides sufficient information to exclude one or more possible identities of multiple nucleosides, such as each of the other nucleosides. In preferred encodings, knowledge of the identity of one or more nucleosides of the constrained portion of a probe excludes all but one of the possibilities for each of the other nucleosides of the constrained portion. For example, with the encoded probe families shown in FIG. 25A, if a probe is known to be a member of the red family and its proximal nucleoside is known to be C, then the adjacent nucleoside must be T. Similarly, if a probe is known to be a member of the green family and its proximal nucleoside is known to be G, then the adjacent nucleoside must be T. Thus, knowledge of the identity of one nucleoside of the constrained portion is sufficient to exclude all but one possibility for the other nucleoside, so that the identity of the other nucleoside is fully determined.
However, if the identity of at least one nucleoside of the constrained portion of the probe is not known, no information can be obtained about the identity of any particular nucleoside in the probe based solely on the name of the probe family to which it belongs, since the nucleoside at each position of the constrained portion can be A, G, C or T. FIG. 25B shows a preferred set of probe families (top panel) and the cycles of ligation, detection and cleavage (bottom panel) when sequencing method AB is used.
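The two worked examples above (red with proximal C, green with proximal G) amount to a lookup within the defined set of the probe family. A minimal Python sketch under the FIG. 25A encoding, with hypothetical helper names `other_nucleoside` and `possibilities`; position 0 denotes the proximal nucleoside and position 1 the adjacent nucleoside:

```python
# Defined sets of the FIG. 25A encoding (proximal nucleoside first).
FAMILY_SETS = {
    "red": ("CT", "AG", "GA", "TC"),
    "yellow": ("CC", "AT", "GG", "TA"),
    "green": ("CA", "AC", "GT", "TG"),
    "blue": ("CG", "AA", "GC", "TT"),
}


def other_nucleoside(family, known, known_position=0):
    """Return the nucleoside at the other position of the constrained portion.

    known_position is 0 for the proximal nucleoside, 1 for the adjacent one.
    """
    matches = [s for s in FAMILY_SETS[family] if s[known_position] == known]
    if len(matches) != 1:
        raise ValueError("encoding does not resolve the other nucleoside")
    return matches[0][1 - known_position]


def possibilities(family, position):
    """Nucleosides still possible at a position given only the family name."""
    return sorted({s[position] for s in FAMILY_SETS[family]})


print(other_nucleoside("red", "C"))    # adjacent nucleoside for red, proximal C
print(other_nucleoside("green", "G"))  # adjacent nucleoside for green, proximal G
print(possibilities("red", 0))         # family name alone excludes nothing
```

The last call illustrates the closing point of the paragraph: with only the family name, every nucleoside remains possible at each position.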

The present inventors designed 24 sets of probe families, each made up of probes with a constrained portion 2 nucleosides in length and each having the advantageous features of the set of probe families shown in FIG. 25A. These sets of probe families provide the most information, since knowing the name of the probe family to which a probe belongs, together with the identity of one nucleoside of its constrained portion, is sufficient to identify the other nucleoside of the constrained portion exactly. This applies to all probes and to all nucleosides of each constrained portion. The encoding for each of the 24 preferred sets of probe families is shown in Table 1, which assigns an encoding ID from 1 to 24 to each set. Each encoding identifies the constrained portions of a set of probe families for sequencing method AB having the general structure (XY)(N)_k, and thereby defines the set itself. In Table 1, (i) the value 1 in the column below an encoding ID indicates that probes containing the nucleosides X and Y shown in the first and second columns, respectively, are assigned to the first probe family according to that encoding; (ii) the value 2 indicates that such probes are assigned to the second probe family; (iii) the value 3 indicates that such probes are assigned to the third probe family; and (iv) the value 4 indicates that such probes are assigned to the fourth probe family. The values 1, 2, 3 and 4 each represent a label. For example, encoding 9 identifies the set of probe families shown in FIG. 25A, where 1 represents blue, 2 represents green, 3 represents red, and 4 represents yellow.
It will be appreciated that the assignment of values to labels is arbitrary; for example, 1 may equally represent green, red or yellow. Changing the association between the values 1, 2, 3 and 4 and the labels does not change the probes that make up each probe family, but only associates a different label with each probe family.
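The point that relabeling leaves the probe families themselves unchanged can be illustrated with encoding 9. In the following Python sketch, the values for encoding 9 are written out so that, with 1 = blue, 2 = green, 3 = red and 4 = yellow, they reproduce the FIG. 25A sets stated in the text; the names `ENCODING_9`, `families_from_encoding` and `partition` are illustrative, and the second label assignment is an arbitrary example.

```python
# Value (1-4) assigned to each dinucleoside XY under encoding 9.
ENCODING_9 = {
    "AA": 1, "CA": 2, "GA": 3, "TA": 4,
    "AC": 2, "CC": 4, "GC": 1, "TC": 3,
    "AG": 3, "CG": 1, "GG": 4, "TG": 2,
    "AT": 4, "CT": 3, "GT": 2, "TT": 1,
}


def families_from_encoding(encoding, label_names):
    """Group dinucleosides into named families using a value-to-label map."""
    families = {}
    for dinucleoside, value in encoding.items():
        families.setdefault(label_names[value], set()).add(dinucleoside)
    return families


def partition(families):
    """The families as an unlabeled collection of sets."""
    return {frozenset(seqs) for seqs in families.values()}


fig_25a = families_from_encoding(
    ENCODING_9, {1: "blue", 2: "green", 3: "red", 4: "yellow"})
relabeled = families_from_encoding(
    ENCODING_9, {1: "green", 2: "red", 3: "yellow", 4: "blue"})

print(fig_25a["red"])
print(partition(fig_25a) == partition(relabeled))
```

Only the names attached to the four sets differ between the two assignments; the sets of constrained portions are identical.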

Table 1: oligonucleotide probe family coding

		Encoding ID
X	Y	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24
A	A	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
C	A	2	4	3	2	2	4	3	2	2	3	4	3	2	3	3	4	2	3	4	4	2	4	3	4
G	A	4	3	2	3	3	2	4	4	3	2	2	4	4	2	4	3	4	2	3	2	3	2	4	3
T	A	3	2	4	4	4	3	2	3	4	4	3	2	3	4	2	2	3	4	2	3	4	3	2	2
A	C	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2
C	C	1	1	1	1	1	1	1	1	4	4	3	4	4	4	4	3	3	4	3	3	3	3	4	3
G	C	3	4	4	4	4	3	3	3	1	1	1	1	3	3	3	4	1	1	1	1	4	4	3	4
T	C	4	3	3	3	3	4	4	4	3	3	4	3	1	1	1	1	4	3	4	4	1	1	1	1
A	G	3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	3
C	G	4	2	4	4	4	2	4	4	1	1	1	1	1	1	1	1	4	2	2	2	4	2	2	2
G	G	1	1	1	1	2	4	2	2	4	4	4	2	2	4	2	2	2	4	4	4	1	1	1	1
T	G	2	4	2	2	1	1	1	1	2	2	2	4	4	2	4	4	1	1	1	1	2	4	4	4
A	T	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4	4
C	T	3	3	2	3	3	3	2	3	3	2	2	2	3	2	2	2	1	1	1	1	1	1	1	1
G	T	2	2	3	2	1	1	1	1	2	3	3	3	1	1	3	1	3	3	2	3	2	3	2	2
T	T	1	1	1	1	2	2	3	2	1	1	1	1	2	3	1	3	2	2	3	2	3	2	3	3

To further illustrate how Table 1 can be used to determine a preferred set of probe families, consider encoding 17. According to this encoding, probes with restricted portions AA, GC, TG and CT are assigned to label 1 (e.g., red); probes with restricted portions CA, AC, GG and TT are assigned to label 2 (e.g., yellow); probes with restricted portions TA, CC, AG and GT are assigned to label 3 (e.g., green); and probes with restricted portions GA, TC, CG and AT are assigned to label 4 (e.g., blue). The resulting probe family set is shown in FIG. 26.
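By way of a non-limiting illustration, the reading of encoding 17 described above can be sketched in Python. The row order and the 16 label values are taken from the encoding 17 column of Table 1. The function names are hypothetical, and `is_preferred` reflects only one reading of what makes a set "preferred", namely that a label together with either base of the dinucleotide fixes the other base.

```python
from collections import defaultdict

# Row order of Table 1: the second base (Y) changes slowest, the first base (X) fastest,
# giving AA, CA, GA, TA, AC, CC, GC, TC, ...
DINUCLEOTIDES = [x + y for y in "ACGT" for x in "ACGT"]

# Labels of encoding 17, read down its column of Table 1.
ENCODING_17 = [1, 2, 4, 3, 2, 3, 1, 4, 3, 4, 2, 1, 4, 1, 3, 2]


def probe_families(labels):
    """Group the 16 restricted portions by the label assigned to them."""
    families = defaultdict(list)
    for dinucleotide, label in zip(DINUCLEOTIDES, labels):
        families[label].append(dinucleotide)
    return dict(families)


def is_preferred(labels):
    """Assumed criterion: every first base and every second base sees all 4 labels."""
    by_first, by_second = defaultdict(set), defaultdict(set)
    for (x, y), label in zip(DINUCLEOTIDES, labels):
        by_first[x].add(label)
        by_second[y].add(label)
    return all(len(s) == 4 for s in by_first.values()) and all(
        len(s) == 4 for s in by_second.values()
    )


FAMILIES_17 = probe_families(ENCODING_17)
for label in sorted(FAMILIES_17):
    print(label, FAMILIES_17[label])
print("preferred:", is_preferred(ENCODING_17))
```

The same two functions can be applied to any other column of Table 1 by substituting its 16 label values.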

FIGS. 27A-27C represent another way of schematically defining the collection of 24 preferred probe family sets. The method utilizes a diagram such as that shown in FIG. 27A. The first column of the diagram represents the first base. Each label is attached to four different base sequences, which are obtained by juxtaposing the base in the first column with the base in the column of the selected label. For example, if there is an A in the column titled "first base", then probes containing a restricted portion with the sequence AA are assigned to probe family 1 (label 1); probes containing a restricted portion with the sequence AC are assigned to probe family 2 (label 2); probes containing a restricted portion with the sequence AG are assigned to probe family 3 (label 3); and probes containing a restricted portion with the sequence AT are assigned to probe family 4 (label 4). Probes containing a restricted portion starting with C, G or T are assigned to probe families in a similar manner. Thus, the filled-in diagram shown in FIG. 27A translates into the encoding shown in FIG. 27B, in which probes whose restricted portions belong to the group {AA, CC, GG, TT} are assigned to probe family 1; probes whose restricted portions belong to the group {AC, CA, GT, TG} are assigned to probe family 2; probes whose restricted portions belong to the group {AG, CT, GC, TA} are assigned to probe family 3; and probes whose restricted portions belong to the group {AT, CG, GA, TC} are assigned to probe family 4. FIG. 27C shows the diagrams that can be substituted for the shaded portion of FIG. 27A to generate each of the 24 preferred probe family sets. Methods of using the preferred probe family sets in sequencing method AB are described further below.

The 24 encoded probe family sets identified in Table 1 represent only a preferred embodiment of probe family sets for sequencing method AB. Various other encoding schemes, probe families and probe structures that rely on the same basic principle can be used, in which knowledge of the probe family name, together with knowledge of the identity of one or more nucleosides of the restricted portion, provides information about one or more other nucleosides. A less preferred set of probe families is generally less preferred than a preferred set for one of the following reasons: (i) knowledge of the probe family name and of a nucleoside identity provides a lesser amount of information, at least for some probes; or (ii) knowledge of the probe family name provides a greater amount of information, at least for some probes.

In general, a less preferred set of probe families can be used to perform the sequencing method AB in a manner similar to the use of a preferred set of probe families. However, the steps required for decoding may be different. For example, in some cases, it may be sufficient to compare candidate sequences to each other to determine at least a portion of the sequence.

An example of a less preferred set of probe families, in which the probes contain a restricted portion 2 nucleosides in length, is shown in FIG. 28. According to this encoding, probes whose restricted portion belongs to the group {AA, AC, GA, GC} are assigned to probe family 1; probes whose restricted portion belongs to the group {CA, CC, TA, TC} are assigned to probe family 2; probes whose restricted portion belongs to the group {AG, AT, GG, GT} are assigned to probe family 3; and probes whose restricted portion belongs to the group {CG, CT, TG, TT} are assigned to probe family 4. With this probe family set, knowledge of the probe family name, which is determined by detecting the label of the newly ligated extension probe, excludes some possibilities for the identity of the template nucleotide located opposite the proximal nucleoside of the newly ligated extension probe. For example, if the probe family name is 1, then the proximal nucleoside of the newly ligated extension probe must be A or G, and thus the complementary nucleotide in the template must be T or C. The nucleotide cannot be identified precisely, because at least two possibilities remain at each position of the restricted portion; however, unlike the situation with a preferred probe family set, the information obtained from a single cycle is sufficient to exclude some possibilities.
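The exclusion described above can be sketched as follows. This is a minimal Python illustration using the four groups listed for FIG. 28; it assumes that the proximal nucleoside corresponds to the first letter of the restricted portion as written, and the function names are hypothetical.

```python
# Less preferred encoding, using the groups listed for FIG. 28.
FAMILIES = {
    1: ["AA", "AC", "GA", "GC"],
    2: ["CA", "CC", "TA", "TC"],
    3: ["AG", "AT", "GG", "GT"],
    4: ["CG", "CT", "TG", "TT"],
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def proximal_options(family):
    """Bases that the proximal nucleoside of the ligated probe can be."""
    return sorted({dinucleotide[0] for dinucleotide in FAMILIES[family]})


def template_options(family):
    """Template bases able to pair with the proximal nucleoside."""
    return sorted(COMPLEMENT[base] for base in proximal_options(family))


for family in FAMILIES:
    print(family, proximal_options(family), template_options(family))
```

For probe family 1 the sketch reports A or G for the proximal nucleoside and C or T for the template, which matches the example in the text.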

In certain embodiments of the invention, probes whose restricted portion is 3 nucleosides in length are employed. In order to include probes whose restricted portions comprise all possible sequences of length 3 (which is preferred), the probe family set should include 4³ = 64 different probes. FIG. 29A shows a diagram that can be used to generate the restricted portions for a set of probe families whose probes contain a restricted portion 3 nucleosides in length (a trinucleoside). The diagram shows 4 groups of rows, with the rows labeled A, G, C and T, and 4 columns headed by probe family names 1, 2, 3 and 4. Each group of 4 rows lies opposite a box containing a nucleoside identity. To determine the probe family of a trinucleoside, the box containing the last nucleoside of the trinucleoside is first selected. Of the 4 rows adjacent to this box, the row labeled with the letter identifying the first nucleoside of the trinucleoside is selected. Within this row, the column containing the second nucleoside of the trinucleoside is selected. The trinucleoside is assigned to the probe family shown at the top of that column. For example, the trinucleoside "TCG" is assigned to a probe family as follows: since the last nucleoside is "G", attention is restricted to the group of 4 rows opposite the box containing "G", i.e., the third group. Since the first nucleoside is "T", attention is further restricted to the last row of that group. The probe family assignment is determined by the heading of the column containing the middle nucleoside. Since the middle nucleoside is "C", the trinucleoside is assigned to probe family 1. A similar approach yields the following probe family assignments: AAA = 1; ATA = 2; AGA = 3; GTA = 4; GAG = 1; TGG = 2, and so on. This process is continued until all possible trinucleosides have been assigned to a probe family.

FIG. 29B shows a method for constructing the restricted portions of additional probe family sets whose probes contain restricted portions 3 nucleosides in length. This method can be used to construct, from each of the 24 preferred probe family sets described above in which the restricted portion is 2 nucleosides in length, a probe family set containing 4 probe families. The upper panel of the figure shows an exemplary diagram representing a preferred probe family set. The columns of the upper diagram are transferred directly to the lower diagram according to the colors assigned to them in the upper diagram. The columns of the upper diagram are, from left to right, blue, green, yellow and red. The entries under column 1 of the lower diagram are, from top to bottom, blue, green, yellow and red, each group of 4 nucleosides corresponding to a column of the upper diagram. Columns 2, 3 and 4 of the lower diagram are generated by progressively shifting each group of 4 nucleosides of column 1 downward.

It will be understood that a "probe family" can be regarded as a single "super probe" comprising a plurality of different probes, each bearing the same label. In this case, the probe molecules that make up the super probe are generally not a population of molecules that are substantially identical across any portion of the probe. The use of the term "probe family" is not intended to be limiting in any way, but rather is used for convenience to describe the probes that make up such "super probes".

Decoding

As described above, successive cycles of extension, ligation, detection and cleavage using a probe family set comprising at least two differentially labeled probe families produce, in a single sequencing reaction, an ordered list of probe family names; alternatively, probe family names determined from multiple sequencing reactions initiated at different sites in the template are assembled into an ordered list. The number of cycles performed should be approximately equal to the length of the desired sequence. The ordered list contains a large amount of information but does not directly yield the sequence of interest. Additional steps must be performed, at least one of which includes gathering at least one additional item of information about the sequence, to obtain the sequence most likely to represent the sequence of interest. That sequence is referred to herein as the "correct" sequence, and the process of extracting the correct sequence from an ordered list of probe families is referred to as "decoding". It will be understood that the elements of the "ordered list" may be rearranged during or after generation of the list, so long as the information content, including the correspondence between the elements of the list and the nucleotides in the template, is preserved, and so long as any rearrangement, fragmentation and/or permutation is properly accounted for during decoding (described below). Thus, the term "ordered list" is intended to include rearranged, fragmented and/or permuted versions of an ordered list produced as described above, so long as such versions contain substantially the same information content.

The ordered list may be decoded in various ways. Some of these methods include generating a set of at least one candidate sequence from the ordered list of probe family names. This set of candidate sequences may itself provide sufficient information to achieve the goal. In a preferred embodiment, one or more additional steps are performed to select the sequence most likely to represent the sequence of interest, either from among the candidate sequences or from a set of sequences with which the candidate sequences are compared. For example, in one method, at least a portion of at least one candidate sequence is compared with at least one other sequence, and the correct sequence is selected based on the comparison. In certain embodiments of the invention, decoding comprises repeating the method with a probe family set encoded differently from the original probe family set, thereby obtaining a second ordered list of probe family names. The correct sequence is determined using information from the second ordered list. In some embodiments, information obtained from as few as one cycle of extension, ligation and detection with the alternatively encoded probe family set is sufficient to select the correct sequence. In other words, the first probe family identified with the alternatively encoded probe family set provides sufficient information to determine which candidate sequence is correct.

Other decoding methods include specifically identifying at least one nucleotide in the template using any available sequencing method, such as one cycle of sequencing method A. The information about the one or more nucleotides is used as a "key" to decode the ordered list of probe family names. Alternatively, the portion of the template that is sequenced may include a region of known sequence in addition to the region of unknown sequence. If sequencing method AB is applied to a portion of the template that includes both the unknown sequence and at least one nucleotide of known sequence, then the known sequence can be used as a "key" to decode the ordered list of probe family names. The following section describes the process of generating candidate sequences. Subsequent sections describe selecting the correct sequence by comparing candidate sequences with known sequences, by comparing them with a second set of candidate sequences, and by using known nucleotide identities.

Generating candidate sequences

It will be appreciated that the portion of the template to be sequenced is complementary to the extended duplex produced by successive cycles of extension, ligation and cleavage. Thus, generating candidate sequences for the extended duplex is equivalent to generating candidate sequences for the template region to be sequenced. In practice, candidate sequences for the template region may be generated directly, or candidate sequences for the extended duplex may be generated and their complements taken to obtain the candidate sequences for the template region. The latter approach is described here. To generate candidate sequences from the list of probe family names, the first member of the list is considered. The set of restricted portions associated with this probe family limits the possibilities for the initial nucleotides of the sequence over a length equal to the length of the restricted portion. For example, if the restricted portion is a dinucleotide, the possible sequences of the first dinucleotide in the extended duplex are limited to the restricted portions found in probes belonging to that probe family (and thus the possible sequences of the first dinucleotide in the template region to be sequenced are limited to those complementary to the restricted portions found in probes belonging to that probe family). The possibilities for the first dinucleotide are recorded, generally in silico. Similarly, the possible sequences of the second dinucleotide in the extended duplex (i.e., the dinucleotide offset by one nucleotide from the first dinucleotide) are limited to the restricted portions found in probes belonging to the second probe family in the list (and thus the possible sequences of the second dinucleotide in the template, i.e., the dinucleotide offset by one nucleotide from the first dinucleotide, are limited to those complementary to the restricted portions found in probes belonging to the second probe family). The possibilities for the second dinucleotide are also recorded. The possibilities for subsequent dinucleotides are likewise recorded until possibilities have been recorded for a number of dinucleotides corresponding to the desired length of the sequence to be determined, or until no probe family names remain in the list.

A representative example of a method of recording possibilities is depicted in FIG. 30, in which it is envisaged that a list of probe family names is generated using the probe family set shown in FIG. 25A. The left-most column of FIG. 30 shows a list of probe families in top-to-bottom order: yellow, green, red, blue. The sequence possibilities of the dinucleotides corresponding to each probe family in the list are shown on the right side of the figure. Nucleotide positions are identified above the sequence possibilities. The sequence starts at position 1, so that the first dinucleotide occupies positions 1 and 2; the second dinucleotide occupies positions 2 and 3, etc. For the yellow probe family, the possibilities are CC, AT, GG and TA, as shown in FIG. 30. For the green probe family, the possibilities are CA, AC, GT, and TG, etc. The process of recording the possible sequences of each dinucleotide continues until the desired sequence length is reached.

After the set of possibilities is generated, a first hypothesis is made as to the identity of the first nucleotide in the candidate sequence, which is assumed to be at the 5' position of the sequence, denoted as position 1 in FIG. 30. The first hypothesis may be that the nucleotide is A, the nucleotide is G, the nucleotide is C, or the nucleotide is T.

Note that the possible sequences of each dinucleotide are constrained by those of the adjacent dinucleotides, because adjacent dinucleotides overlap, i.e., the second nucleotide of the first dinucleotide is also the first nucleotide of the second dinucleotide. For example, if the first nucleotide is assumed to be C, then the first dinucleotide must be CC. If the first dinucleotide is CC, then the first position of the second dinucleotide must be C. Since CA is the only possible sequence of the second dinucleotide that has C at the first position, the second dinucleotide must be CA. Thus, the sequence of the first 3 nucleotides must be CCA. Similarly, the possible sequences of the third dinucleotide are constrained by the sequence of the second dinucleotide. If the second dinucleotide is CA, then the third dinucleotide must be AG, since this is the only possibility with A at the first position. The sequence of the first 4 nucleotides must therefore be CCAG. Continuing this process yields the 5-nucleotide sequence 5'-CCAGC-3'. CCAGC is therefore the first candidate sequence.

A second candidate sequence is generated by assuming that the first nucleotide is A. Under this assumption the first dinucleotide must be AT. TG is the only possible sequence of the second dinucleotide consistent with the first dinucleotide AT. GA is the only possible sequence of the third dinucleotide consistent with the second dinucleotide TG. AA is the only possible sequence of the fourth dinucleotide consistent with the third dinucleotide GA. These dinucleotides are assembled into the full-length candidate sequence ATGAA. Similarly, assuming that the first nucleotide is G yields the candidate sequence GGTCG, and assuming that the first nucleotide is T yields the candidate sequence TACTT. Thus, 4 candidate sequences are generated, each beginning with a different nucleotide assumed to be the first nucleotide of the sequence.
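The generation of the four candidate sequences can be sketched in Python as follows. This is a minimal illustration: the yellow and green groups are those listed in the text for FIG. 30, whereas the red and blue groups are not listed explicitly and are inferred here from the four candidate sequences given above. The function name `candidates` is hypothetical.

```python
# Probe family set assumed for this example. Yellow and green are listed in the text;
# red and blue are inferred from the candidate sequences CCAGC, ATGAA, GGTCG and TACTT.
FAMILIES = {
    "yellow": {"CC", "AT", "GG", "TA"},
    "green": {"CA", "AC", "GT", "TG"},
    "red": {"AG", "GA", "TC", "CT"},
    "blue": {"GC", "AA", "CG", "TT"},
}


def candidates(ordered_list, families=FAMILIES):
    """Return every sequence consistent with the ordered list of probe family names."""
    results = []
    for first in "ACGT":  # one hypothesis per possible first nucleotide
        partial = [first]
        for name in ordered_list:
            # Keep only dinucleotides whose first base overlaps the last base so far.
            partial = [
                seq + dinucleotide[1]
                for seq in partial
                for dinucleotide in sorted(families[name])
                if dinucleotide[0] == seq[-1]
            ]
        results.extend(partial)
    return results


print(candidates(["yellow", "green", "red", "blue"]))
```

Because each family in this set contains exactly one dinucleotide beginning with each base, every hypothesis about the first nucleotide yields exactly one candidate sequence.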

The assumption need not be made about the first nucleotide; it may equally be made about any other nucleotide. For example, the same result can be achieved by making an assumption about the identity of the fourth nucleotide, in which case the candidate sequence is generated by moving "backwards" along the template (i.e., in the 3'→5' direction). For example, assuming that the fourth nucleotide is T means that the fourth dinucleotide must be TT; the third dinucleotide must be CT; the second dinucleotide must be AC; and the first dinucleotide must be TA. (Although the identities are generated by moving through the sequence in the 3'→5' direction, the nucleotides are written in the 5'→3' direction.) Alternatively, the assumption can be made about any nucleotide within the sequence, and the dinucleotide identities generated by moving in both the 5'→3' and 3'→5' directions. It will be appreciated that, without an assumption about one of the nucleotides, the identity of no nucleotide can be determined, since each position could be occupied by A, G, C or T.
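Extension in both directions from an assumed nucleotide can be sketched as follows. This is a minimal Python illustration using the same assumed probe family set as the FIG. 30 example, in which the red and blue groups are inferred from the candidate sequences in the text. The function name and the 1-based `position` argument are hypothetical.

```python
# Same assumed probe family set as in the FIG. 30 example (red and blue are inferred).
FAMILIES = {
    "yellow": {"CC", "AT", "GG", "TA"},
    "green": {"CA", "AC", "GT", "TG"},
    "red": {"AG", "GA", "TC", "CT"},
    "blue": {"GC", "AA", "CG", "TT"},
}


def decode_from(ordered_list, position, base, families=FAMILIES):
    """Extend an assumed base at a 1-based position in both directions.

    Assumes a preferred set, so exactly one dinucleotide fits at each step.
    """
    sequence = base
    # 5'->3': the k-th dinucleotide covers positions k and k+1.
    for name in ordered_list[position - 1:]:
        (match,) = [d for d in families[name] if d[0] == sequence[-1]]
        sequence += match[1]
    # 3'->5': walk back through the dinucleotides that precede the assumed base.
    for name in reversed(ordered_list[:position - 1]):
        (match,) = [d for d in families[name] if d[1] == sequence[0]]
        sequence = match[0] + sequence
    return sequence


ORDERED_LIST = ["yellow", "green", "red", "blue"]
print(decode_from(ORDERED_LIST, 4, "T"))  # assumption made about the fourth nucleotide
```

The result is always written in the 5'→3' direction, whichever direction was used to generate it.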

With a preferred probe family set, assuming the identity of any single nucleotide (e.g., the first nucleotide) yields one and only one candidate sequence. With a less preferred set of probe families, however, it may be necessary to make assumptions about the identity of more than one nucleotide, i.e., assuming the identity of the first nucleotide does not fully determine the remainder of the sequence. For example, a less preferred set of probe families might include a family whose members have the restricted portions AA and AC. In this case, assuming that the first nucleotide is A leaves two possibilities for the second nucleotide. Sequencing with less preferred sets of probe families is discussed further below. It will be appreciated that if the restricted portion consists of non-contiguous nucleotides, the above method may still be used with minor modifications.

Sequence identification by comparing candidate sequences to known sequences

Typically, once the candidate sequences for the extended duplex have been determined as described above, the corresponding candidate sequences for the template region to be sequenced are obtained by taking their complements. In some cases, the candidate sequences themselves provide enough information to achieve the goal. For example, if the purpose of sequencing is simply to exclude certain sequence possibilities, it is sufficient to compare the candidate sequences with those possibilities. The candidate sequences shown in FIG. 30 make it possible to determine, for example, that the sequenced region is not part of a poly-A tail. Longer candidate sequences could confirm that the sequenced region is not part of a vector.

In many cases, a definitive determination of the correct sequence is required. According to a preferred embodiment of the invention, the correct sequence is identified by comparing the candidate sequences for the template region to be sequenced with a set of known sequences. The set of known sequences can be, for example, a set of sequences from a particular organism of interest. For example, if human DNA is sequenced, the candidate sequences can be compared with a draft human genome sequence. See the website at www.ncbi.nih.gov/genome/guide/human/ for guidance on publicly available sources of human genomic sequence. As another example, if nucleic acid derived from an infectious agent (e.g., a bacterium or virus isolated from a subject) is to be sequenced, a database of sequences of variants of such bacteria or viruses can be searched. Many such organism-specific databases, containing complete or partial sequences, are known in the art, and more are becoming available as sequencing efforts progress. Some representative examples include the mouse database (see, e.g., the website at www.ncbi.nlm.nih.gov/genome/seq/MmHome.html), a human immunodeficiency virus database (see, e.g., the website at hiv-web.lanl.gov/content/hiv-db/mainpage.html), a Plasmodium falciparum database (see, e.g., the website at http://www.tigr.org/tdb/edb2/pfal/htmls/index.shtml), etc. Of course, a set of sequences from a particular organism need not be employed. Searchable databases such as GenBank (see the website at http://www.ncbi.nlm.nih.gov/Genbank/), which contain sequences from a variety of organisms and viruses, can be used. The database need not even contain any sequence from the organism or virus from which the template was derived. In general, the sequences may be genomic sequences, cDNA sequences, ESTs, and the like. Multiple sequences may be searched.

It may be sufficient to perform a search only. For example, if viral nucleic acid is isolated from a patient, comparing a candidate sequence to a set of known sequences of the virus can determine whether the viral nucleic acid contains a sequence from the virus, even if no matching sequence has been detected. The presence of a match confirms that the patient is infected with the virus, while the absence of a match indicates that the patient is not infected with the virus.

In certain embodiments, the set of known sequences contains a narrow range of sequences particularly suited to the purpose of the sequencing. Thus, information about the nucleic acid being sequenced can be used to select the set of known sequences. For example, if the template is known to represent the sequence of a particular gene, the known sequences may represent different alleles, mutant or wild-type sequences, etc., of that gene at a given locus of interest. It may be necessary only to compare the candidate sequences with one known sequence to determine which candidate sequence is the correct one. For example, in certain embodiments of the invention, the template is obtained by amplifying DNA containing a region of interest (e.g., using primers flanking the region of interest). The region of interest may include a mutation or polymorphic site, such as a mutation or polymorphism that is substantially associated with a particular species. If the template is known to represent the sequence of a particular region of interest, the candidate sequences need only be compared with a reference sequence, such as the wild-type or mutant form of that region. In other words, if part or all of the template sequence is known, comparison with multiple known sequences may be unnecessary. Instead, the candidate sequence that contains all or part of the known sequence is selected as the correct sequence. For example, given that mutations in the BRCA1 and BRCA2 genes are associated with an increased risk of breast cancer, it is of great interest to determine whether a subject carries such a mutation. If the template is known to contain sequence from the BRCA1 gene, for example because primers flanking a region of interest that includes a portion of the gene were used to generate the clonal population of templates, then the candidate sequences need only be compared with the wild-type or mutant BRCA1 sequence to determine the correct sequence.

In the more general case, comparing the candidate sequences with a set of known sequences will identify any known sequences that are similar to a candidate sequence. Provided the candidate sequences are sufficiently long, the likelihood that the database contains sequences identical or very similar to more than one candidate sequence is very small. In other words, if the candidate sequences are long enough, it is unlikely that more than one of them will be identical to a sequence in the set of known sequences. Any known sequence sufficiently similar to a candidate sequence is considered a "match". It is generally necessary to set a threshold of identity required to conclude that a match exists. For example, a candidate sequence can be considered to match a known sequence if it is at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, at least 95%, at least 99%, or even 100% identical to the known sequence. Percent identity is generally evaluated over a window at least 10 nucleotides in length, e.g., 10-15 nucleotides, 15-20 nucleotides, 20-25 nucleotides, 25-30 nucleotides, etc. The window length may be selected according to a variety of criteria including, but not limited to, the number of sequences in the set of known sequences, the type or source of the known sequences, and the like. For example, if the candidate sequences are compared with a large database such as GenBank, a longer window may be required than if a database containing fewer sequences is used. In some embodiments of the invention, the sequences are compared over a plurality of different windows, which need not be adjacent to one another. Preferably, the total length of the windows is at least 10 nucleotides, such as 10-15 nucleotides, 15-20 nucleotides, 20-25 nucleotides, 25-30 nucleotides, and the like. In some cases, multiple sequences in the set of known sequences may match. These sequences may, for example, represent a homologous gene found in the same organism as that from which the template was derived, a homologous gene from a different organism, a pseudogene, corresponding cDNA and genomic sequences, and the like.
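The use of an identity threshold evaluated over a window can be sketched as follows. This is a minimal Python illustration that assumes a simple ungapped, position-by-position comparison; practical database searches would normally use an alignment tool. The function names, the default window of 10 nucleotides, the default threshold of 90% and the two "known" sequences are all hypothetical values chosen for the example.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def find_matches(candidate, known_sequences, window=10, threshold=90.0):
    """List (name, offset, identity) for each window of a known sequence whose
    ungapped identity to the first `window` bases of the candidate meets the threshold."""
    if len(candidate) < window:
        raise ValueError("candidate is shorter than the window")
    probe = candidate[:window]
    hits = []
    for name, known in known_sequences.items():
        for offset in range(len(known) - window + 1):
            identity = percent_identity(probe, known[offset:offset + window])
            if identity >= threshold:
                hits.append((name, offset, identity))
    return hits


# Made-up "known" sequences, used only to exercise the functions.
KNOWN = {"ref": "TTCCAGCATGAAGG", "other": "GGGGGGGGGGGGGG"}
print(find_matches("CCAGCATGAA", KNOWN))
```

Raising `window` or `threshold` makes spurious matches less likely, which corresponds to the observation above that a larger database may call for a longer window.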

Usually, the candidate sequence that is closest to the sequence in the group of known sequences is selected as the correct sequence. Alternatively, for example, if it is reasonable to believe that the sequencing method may produce a high error rate, then the corresponding sequence in the database is preferably selected as the correct sequence. For example, if the error rate is known to exceed a predetermined threshold, then the sequence in the database is preferably selected as the correct sequence.

The length required to make it likely that a match is found for only one of the candidate sequences depends on various factors, including but not limited to the particular set of known sequences, the threshold for accepting a match, and the like. Typically, a sequence of about 25-26 nucleotides in length occurs only once in the genome of a typical organism. Thus, generating candidate sequences of about this length is generally sufficient to identify the correct sequence. Typically, the candidate sequences should be at least 10 nucleotides in length, preferably at least 15 or at least 20 nucleotides in length, such as 20-25, 25-30, 30-35, 35-40 or 45-50 nucleotides, or even longer.

Sequence identification by comparing a first set of candidate sequences to a second set of candidate sequences

In certain embodiments of the invention, decoding is performed by generating a first ordered list of probe families using a first probe family set encoded according to a first encoding scheme and generating a first set of candidate sequences from it, and then generating a second ordered list of probe families from the same template using a second probe family set encoded according to a second encoding scheme and generating a second set of candidate sequences from it. The newly synthesized DNA strand is removed from the template between the sequencing reactions, or a template of identical sequence is sequenced using the second probe family set. The two sets of candidate sequences are then compared. It will be appreciated that, whichever probe family set is employed, one of the candidate sequences is the correct sequence, while the others are not correct (or are at most partially correct). Thus, each set of candidate sequences contains the correct sequence, but in most cases the other candidate sequences in either set differ from the sequences found in the other set. The correct sequence can therefore be determined simply by comparing the two sets of candidate sequences. The candidate sequences generated with the two differently encoded probe family sets need not be of equal length. In a preferred embodiment of the invention, the candidate sequences generated using the second probe family set may be as short as 2 nucleotides; in other words, the ordered list of probe families generated using the second probe family set may be as short as 1 element (i.e., 1 cycle of ligation and detection).

FIGS. 31A-31C show an example of candidate sequence generation and decoding using two differently encoded preferred probe family sets. FIG. 31A shows a preferred probe family set encoded according to a first encoding scheme. FIG. 31B shows the generation of 4 candidate sequences from the ordered list of probe families yellow, green, red, blue (which can be written as "2314", where red = 1, yellow = 2, green = 3, and blue = 4), in which the correct sequence is assumed to be CAGGC (shown in bold). FIG. 31C shows a preferred probe family set encoded according to a second encoding scheme. Since the first dinucleotide in the template is CA, the top probe of the yellow probe family will be ligated to the extendable terminus in the first cycle of extension. This limits the first dinucleotide to the following group: CA, TC, GG and AT. Of the candidate sequences generated with the first probe family set, only CAGGC begins with one of these dinucleotides. It must therefore be the correct sequence. In general, the first and second probe family sets preferably satisfy the following conditions when compared with each other: (i) 3 of the 4 probes of each probe family in the first set should be assigned to a new probe family in the second set; and (ii) each of the 3 reassigned probes should be assigned to a different probe family in the second set.
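Selection of the correct sequence using one cycle with a second encoding can be sketched as follows. The two encodings below are illustrative assumptions chosen so that conditions (i) and (ii) hold; they are not asserted to be the encodings of FIGS. 31A and 31C, and the numeric labels and function names are hypothetical. With the label numbering used here, the first encoding happens to give the ordered list 2314 for the sequence CAGGC, and the second encoding places CA in the same family as TC, GG and AT.

```python
# Illustrative first and second encodings (assumptions for this sketch).
FIRST = {
    1: {"AA", "CC", "GG", "TT"},
    2: {"AC", "CA", "GT", "TG"},
    3: {"AG", "CT", "GA", "TC"},
    4: {"AT", "CG", "GC", "TA"},
}
SECOND = {
    1: {"AA", "CT", "GC", "TG"},
    2: {"AC", "CG", "GA", "TT"},
    3: {"AG", "CC", "GT", "TA"},
    4: {"AT", "CA", "GG", "TC"},
}


def family_of(dinucleotide, families):
    """Name of the family whose restricted portions include the dinucleotide."""
    return next(name for name, members in families.items() if dinucleotide in members)


def encode(sequence, families):
    """Ordered list of family names for the overlapping dinucleotides of a sequence."""
    return [family_of(sequence[i:i + 2], families) for i in range(len(sequence) - 1)]


def candidates(ordered_list, families):
    """All sequences consistent with an ordered list under the given encoding."""
    results = []
    for first in "ACGT":
        partial = [first]
        for name in ordered_list:
            partial = [seq + d[1] for seq in partial
                       for d in sorted(families[name]) if d[0] == seq[-1]]
        results.extend(partial)
    return results


def reassignment_ok(first, second):
    """Consequence of conditions (i) and (ii): the 4 members of each first-set
    family fall into 4 different second-set families."""
    return all(len({family_of(d, second) for d in members}) == 4
               for members in first.values())


def select_correct(first_list, second_list):
    """Keep first-encoding candidates whose leading bases agree with the second list."""
    return [seq for seq in candidates(first_list, FIRST)
            if encode(seq[:len(second_list) + 1], SECOND) == second_list]


true_sequence = "CAGGC"
first_list = encode(true_sequence, FIRST)        # four cycles with the first set
second_list = encode(true_sequence[:2], SECOND)  # one cycle with the second set
print(first_list, second_list, select_correct(first_list, second_list))
```

A single element from the second ordered list is enough here because the four candidates begin with four dinucleotides from one first-set family, and the second encoding places those four dinucleotides in four different families.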

Decoding an ordered list of probe families with known nucleotide species

As described above, candidate sequences can be generated by assuming the identity of one nucleotide in the extended duplex or template. Depending on the particular probe family set used, it is generally necessary to generate at least 4 candidate sequences. However, if the identity of at least one nucleotide in the template (and hence in the extended duplex) is known, the generation of multiple candidate sequences can be avoided, and only one candidate sequence needs to be generated. The method of generating the candidate sequence is the same as described above. The identity of at least one nucleotide in the template can be determined by any sequencing method, including but not limited to sequencing method A, primer extension from the starting oligonucleotide using a set of differentially labeled nucleotides and a polymerase, and the like. It will be appreciated that one or more nucleotides in a template may first be sequenced using a sequencing method other than sequencing method AB, after which the starting oligonucleotide and any extension products may be removed and the same template sequenced using sequencing method AB (or vice versa).

Another approach is simply to sequence templates that contain one or more nucleotides of known identity in addition to the portion whose sequence is to be determined. For example, the portion between the region to which the starting oligonucleotide binds and the beginning of the unknown sequence may include one or more nucleotides of known identity. When sequencing method AB is performed on this portion of the template, the identity of one or more nucleotides in the sequence is known in advance and can be used to generate a single candidate sequence, which will be the correct sequence.

Thus, the above method comprises the steps of: (i) assigning an identity to the nucleotide adjacent to the nucleotide of known identity in the template by determining which identity is consistent with both the known nucleotide and the possible sequences of the constrained portion of the probe whose proximal nucleotide was ligated opposite the known nucleotide; (ii) assigning an identity to the next nucleotide by determining which identity is consistent with the possible sequences of the constrained portion of the probe whose proximal nucleotide was ligated opposite the nucleotide just assigned; and (iii) repeating step (ii) until the sequence is determined. It will be appreciated that these steps are equivalent to performing the same steps on the extended duplex, since there is an exact correspondence between the extended duplex and the template region to be sequenced.
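Steps (i) to (iii) amount to a simple walk along the ordered list of probe families. The following Python sketch illustrates this under stated assumptions: the family table is inferred from the worked examples in this description (Table 1 is not reproduced here), the names `FAMILIES`, `NEXT` and `decode_from_known` are hypothetical, and the first nucleotide is simply assumed to be known to be C. The 17-element list is the one used in the error-correction example of this description.

```python
# Assumed probe family table (label -> dinucleotides), inferred from the
# worked examples in this description rather than copied from Table 1.
FAMILIES = {
    "1": ("AA", "CC", "GG", "TT"),
    "2": ("AC", "CA", "GT", "TG"),
    "3": ("AG", "GA", "CT", "TC"),
    "4": ("AT", "TA", "CG", "GC"),
}

# Lookup table: (previous nucleotide, probe family label) -> next nucleotide.
NEXT = {
    (dinuc[0], label): dinuc[1]
    for label, dinucs in FAMILIES.items()
    for dinuc in dinucs
}


def decode_from_known(known_base, labels):
    """Decode an ordered list of probe families from one known nucleotide."""
    seq = [known_base]
    for label in labels:
        # Steps (i)/(ii): the only nucleotide consistent with the previous
        # nucleotide and the detected probe family.
        seq.append(NEXT[(seq[-1], label)])
    # Step (iii) is the loop itself: repeat until the list is exhausted.
    return "".join(seq)


decoded = decode_from_known("C", "23324322132444142")
print(decoded)
```

Because a preferred set offers exactly one continuation per (nucleotide, label) pair, a single known nucleotide yields a single candidate sequence.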

Sequencing with less preferred probe families

Sequencing method AB can be performed with a less preferred probe family set in much the same way as with a preferred probe family set. However, the results may differ in several ways. For example, certain portions of the sequence may be fully identified from the candidate sequences without additional information. FIG. 32 shows an example of sequencing using the less preferred probe family set encoded as shown in FIG. 28. Sequencing is performed generally as described for preferred probe family sets. The template of interest has the sequence "GCATGA", which yields the ordered list of probe families "12341". Assuming that the nucleotide at position 1 is A, the resulting candidate sequence is "ACATGA". However, unlike the case of the preferred probe family set, there are two possibilities for the last nucleotide, since the label "1" is associated with two different dinucleotides having G as the first nucleotide, namely "GA" and "GC". Thus, assuming that the nucleotide at position 1 is A, the second candidate sequence generated is "ACATGC". Assuming that the nucleotide at position 1 is G, the resulting candidate sequence is "GCATGA", and "GCATGC" is also produced as a candidate sequence. Since the label "1" is not associated with any dinucleotide whose first nucleotide is C or T, no candidate sequence beginning with "C" or "T" is generated. FIG. 32 shows the 4 candidate sequences aligned with each other. It can be seen that the middle 4 nucleotides in all candidate sequences are CATG. Thus, the correct sequence must include CATG at positions 2-5. If only these nucleotides are of interest, no further decoding steps need to be performed.
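The comparison of the four candidate sequences of FIG. 32 can be illustrated with a short Python sketch. The function name `shared_positions` and the use of "N" for an undetermined position are choices made for this illustration only; the candidate sequences are those listed in the text.

```python
def shared_positions(candidates, unknown="N"):
    """Keep a nucleotide where all candidates agree; mark other positions."""
    if len({len(c) for c in candidates}) != 1:
        raise ValueError("candidates must have equal length")
    return "".join(
        column[0] if len(set(column)) == 1 else unknown
        for column in zip(*candidates)
    )


fig32_candidates = ["ACATGA", "ACATGC", "GCATGA", "GCATGC"]
shared = shared_positions(fig32_candidates)
print(shared)  # NCATGN
```

Positions 2-5 are identical in every candidate, so they are determined without any further decoding step; only the first and last positions remain open.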

As mentioned above, the set of probe families need not consist of four different probe families, but can consist of more than 2 and fewer than 4^N probe families, where N is the length of the constrained portion. However, if fewer than 4 probe families are used, more than 4 candidate sequences may have to be generated, while if more than 4 probe families are used, additional labels are required. For these and other reasons, a set of 4 probe families is preferred.

Sequence identification by comparison of candidate sequences with each other

In certain embodiments of the invention, some or all of the sequences of interest may be determined by comparing candidate sequences to each other. Typically, this comparison is not sufficient to determine which candidate sequence is correct over the entire length. However, if two or more candidate sequences are identical or sufficiently similar in sequence over a portion of the template, this information may be sufficient to unambiguously identify the nucleotide sequence within that portion of the template.

If desired, the template can be resequenced one or more additional times with alternately encoded probe families to generate additional portions of the identified sequence. These portions can be combined to assemble a sequence of the desired length.

Error correction using probe families

It is often desirable to sequence multiple templates representing all or part of the same DNA sequence and to align the resulting sequences. If each template contains only part of the region of interest, a longer sequence is obtained by assembling overlapping fragments. For example, when sequencing the genome of an organism, the DNA is typically fragmented and enough fragments are sequenced that each stretch of DNA is covered by several (e.g., 4-12) different fragments. Computer software for assembling overlapping sequences into longer sequences is known to those skilled in the art.

With conventional sequencing methods, it is often the case that multiple fragments align perfectly over a region, but one of these fragments (called the aberrant fragment) differs from the other fragments at one position in the region. Determining whether such an individual difference represents a sequencing error or a true difference (e.g., a single nucleotide polymorphism) at that position can be problematic.

The present invention provides a new method for error checking using sequencing method AB. According to this method, templates containing fragments representing the same DNA segment are sequenced using a differentially labeled probe family set as described above, resulting in an ordered list of probe families for each template. The ordered lists of probe families are then aligned. If several lists align perfectly over a predetermined length, such as 10, 15, 20, or 25 or more elements, except that one list differs from the others at one position, the difference is attributed to a sequencing error. If an actual polymorphism is present, the ordered list produced by the aberrant fragment will differ from the ordered lists produced by the other fragments at two or more adjacent positions.

For example, application of sequencing method AB using the preferred probe family set encoded according to code 4 of Table 1 to a template containing the sequence 5'-CAGACGACAAGTATAATG-3' produces the following ordered list of probe families: "23324322132444142", as shown below:

23324322132444142

CAGACGACAAGTATAATG

If there is an actual SNP (e.g., CAGACGAGAAGTATAATG, where the eighth nucleotide, G, is the polymorphic site), then two consecutive elements in the list are altered: 23324333132444142, where the seventh and eighth elements (33) are the changes caused by the SNP. The correspondence between the ordered list of probe families and the sequence containing the SNP is given below:

23324333132444142

CAGACGAGAAGTATAATG

However, an error in identifying the label associated with a ligated extension probe results in a single error in the ordered list of probe families and changes the resulting candidate sequence from that point forward. For example, misidentifying the label associated with the 7th ligated extension probe gives the list 23324332132444142 (in which the seventh element, 3, is the misidentified label) and changes the resulting candidate sequence to CAGACGAGTTCATATTAC, in which every nucleotide from position 8 onward is changed by the sequencing error. The correspondence between the ordered list of probe families and the sequence is as follows:

23324332132444142

CAGACGAGTTCATATTAC
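The contrast between a SNP and a label misidentification can be reproduced with a short Python sketch. It is illustrative only: the family table is an assumption inferred from the worked examples in this description (Table 1 is not reproduced here), and the helper names `encode`, `decode` and `diff_positions` are hypothetical.

```python
# Assumed code-4 table (label -> dinucleotides), inferred from the examples.
FAMILIES = {
    "1": ("AA", "CC", "GG", "TT"),
    "2": ("AC", "CA", "GT", "TG"),
    "3": ("AG", "GA", "CT", "TC"),
    "4": ("AT", "TA", "CG", "GC"),
}
LABEL = {d: label for label, dinucs in FAMILIES.items() for d in dinucs}
NEXT = {(d[0], label): d[1] for d, label in LABEL.items()}


def encode(seq):
    """Sequence -> ordered list of probe families (one label per dinucleotide)."""
    return "".join(LABEL[seq[i:i + 2]] for i in range(len(seq) - 1))


def decode(first_base, labels):
    """Ordered list of probe families -> candidate sequence."""
    seq = first_base
    for label in labels:
        seq += NEXT[(seq[-1], label)]
    return seq


def diff_positions(a, b):
    """1-based positions at which two equal-length strings differ."""
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


reference = "CAGACGACAAGTATAATG"
with_snp = "CAGACGAGAAGTATAATG"  # SNP at nucleotide 8

ref_list = encode(reference)
snp_list = encode(with_snp)
snp_diff = diff_positions(ref_list, snp_list)  # adjacent list elements

# Misidentify the 7th label (2 read as 3) and decode the resulting list.
miscalled = ref_list[:6] + "3" + ref_list[7:]
error_seq = decode("C", miscalled)
error_diff = diff_positions(reference, error_seq)

print(snp_list, snp_diff)
print(miscalled, error_seq, error_diff)
```

In this sketch the SNP alters two adjacent elements of the list, whereas the single misidentified label alters one element of the list but every decoded nucleotide downstream of it.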

With a 3-base, 4-label scheme, a fragment containing a SNP will produce 3 consecutive differences in the ordered list of probe families for the aberrant fragment, whereas a sequencing error will produce only 1 difference. For example, using a probe family set encoded as shown in FIG. 29, the ordered list of probe families for the sequence CAGACGACAAGTATAATG is shown below:

2322224132412244

CAGACGACAAGTATAATG

An aberrant fragment containing a SNP, e.g., CAGACGAGAAGTATAATG, yields an ordered list of probe families that differs at 3 consecutive positions from the ordered list generated by fragments that do not contain the SNP, as shown below:

2322213332412244

CAGACGAGAAGTATAATG

A sequencing error will cause only one difference in the ordered list of probe families, resulting in a completely different candidate sequence from the point of the error forward.

Thus, when the ordered list of probe families produced by one fragment (an aberrant fragment) aligns with the ordered lists produced by other fragments representing the same DNA segment but differs from them at a single position, the difference may represent a sequencing error (misidentification of a probe family). When the ordered list produced by the aberrant fragment aligns with the other ordered lists but differs from them at 2 or more consecutive positions, the aberrant fragment may contain a SNP. Preferably, the aligned portion of the ordered lists of probe families is at least 3 or 4 elements in length, more preferably at least 6, 8 or more elements in length. Preferably, the aligned portions are at least 66% identical, at least 70% identical, at least 80% identical, at least 90% identical or more, such as 100% identical.

Similarly, a sequencing error may have occurred when a candidate sequence for one fragment aligns over a first portion of the sequence with the candidate sequences for other fragments representing the same DNA segment but differs substantially from them over a second portion of the sequence. When a candidate sequence for one fragment aligns with the candidate sequences for other fragments representing the same DNA segment over both portions but differs at only one position, the aberrant fragment may contain a SNP. Preferably, the aligned portion of the candidate sequences is at least 4 nucleotides in length. Preferably, the aligned portions are at least 66% identical, at least 70% identical, at least 80% identical, at least 90% identical or more, such as 100% identical.

Accordingly, the present invention provides a method of distinguishing single nucleotide polymorphisms from sequencing errors, the method comprising the steps of: (a) sequencing a plurality of templates using sequencing method AB, wherein the templates represent overlapping fragments of a single nucleic acid sequence; (b) aligning the sequences obtained in step (a); and (c) if the sequences are substantially identical over a first portion and substantially different over a second portion (each portion being at least 3 nucleotides in length), determining that the difference between the sequences represents a sequencing error. The present invention also provides a method of distinguishing single nucleotide polymorphisms from sequencing errors, the method comprising the steps of: (a) performing sequencing method AB on a plurality of templates representing overlapping segments of a nucleic acid sequence, thereby obtaining a plurality of ordered lists of probe families; (b) aligning the ordered lists of probe families obtained in step (a) to obtain an aligned region in which the ordered lists are at least 90% identical; and (c) if the ordered lists differ at only one position within the aligned region, determining that the difference between the ordered lists of probe families represents a sequencing error; or (d) if the ordered lists differ at two or more consecutive positions within the aligned region, determining that the difference between the ordered lists of probe families represents a single nucleotide polymorphism.
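Steps (c) and (d) of the second method reduce to counting where two aligned ordered lists differ. The following Python sketch is a minimal illustration, assuming the lists are already aligned and of equal length; the function name `classify_difference` and the returned strings are hypothetical, and the identity threshold of step (b) is not enforced here.

```python
def classify_difference(consensus_list, aberrant_list):
    """Classify an aberrant ordered list of probe families against a consensus."""
    if len(consensus_list) != len(aberrant_list):
        raise ValueError("lists must be aligned and of equal length")
    diffs = [
        i for i, (a, b) in enumerate(zip(consensus_list, aberrant_list))
        if a != b
    ]
    if not diffs:
        return "identical"
    if len(diffs) == 1:
        return "sequencing error"  # step (c): a single differing position
    if diffs[-1] - diffs[0] == len(diffs) - 1:
        return "possible SNP"      # step (d): consecutive differing positions
    return "unclassified"          # scattered differences: not covered here


# Ordered lists taken from the worked examples in the text.
two_base_ref = "23324322132444142"
print(classify_difference(two_base_ref, "23324332132444142"))
print(classify_difference(two_base_ref, "23324333132444142"))
print(classify_difference("2322224132412244", "2322213332412244"))
```

The same test applies to the 2-base and 3-base encodings, since a SNP produces 2 or 3 consecutive differences respectively while a label misidentification produces one.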

Delocalized information collection

As is well known in the art, a "bit" (binary digit) is a single digit in base 2, i.e., a 1 or a 0, and represents the smallest unit of digital data. Since a nucleotide can be any one of four different species, 2 bits are required to specify the identity of a nucleotide. For example, A, G, C and T may be denoted as 00, 01, 10, and 11, respectively. Specifying a probe family name in the preferred set of differentially labeled probe families likewise requires 2 bits, since there are four differentially labeled probe families.
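The two-bit representation given in the text can be written out as a small Python sketch. The names `BITS` and `to_bits` are hypothetical; the bit assignments are the ones listed above.

```python
# Two-bit code for each nucleotide, as given in the text.
BITS = {"A": "00", "G": "01", "C": "10", "T": "11"}


def to_bits(seq):
    """Encode a nucleotide sequence as a string of bits, 2 bits per nucleotide."""
    return "".join(BITS[base] for base in seq)


bits = to_bits("CAGGC")
print(bits, len(bits))
```

A sequence of n nucleotides therefore needs 2n bits, and an ordered list of n probe family names from a set of four families needs the same.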

In most conventional sequencing formats and in sequencing method A, each nucleotide is determined as a discrete unit and information corresponding to one nucleotide is collected at a time. Each detection step obtains two bits of information from one nucleotide. In contrast, sequencing method AB obtains less than 2 bits of information from each of several nucleotides in each detection step, while still obtaining 2 bits of information per detection step when the preferred set of probe families is used. Each probe family name in the ordered list of probe families represents at least 2 nucleotides of the template, the exact number being determined by the length of the sequence-determining portion of the probe. For example, consider the ordered list of probe families obtained from the sequence 5'-CAGACGACAAGTATAATG-3' using a set of probe families encoded according to code 4 of Table 1:

23324322132444142

CAGACGACAAGTATAATG

Probe family 2 is the first probe family in the list because dinucleotide CA is one of the specified portions present in the probes of probe family 2. Probe family 3 is the second probe family in the list, since dinucleotide AG is one of the specified portions present in the probes of probe family 3. As described above, since there are 4 probe families, each probe family name represents 2 bits of information. Thus, each detection step collects 2 bits of information about 2 nucleotides, i.e., on average 1 bit of information per nucleotide.

Accordingly, the present invention provides a method of sequencing, wherein the method comprises a plurality of cycles of extension, ligation and detection, wherein the detection step comprises simultaneously acquiring an average of two bits of information from each of at least two nucleotides in the template, without acquiring two bits of information from any single nucleotide. The present invention also provides a method for determining the nucleotide sequence of a template polynucleotide using a first set of oligonucleotide probe families, the method comprising the steps of: (a) performing successive cycles of extension, ligation, detection and cleavage, wherein in each cycle an average of two bits of information is acquired simultaneously from each of at least two nucleotides in the template, without acquiring two bits of information from any single nucleotide; and (b) combining the information obtained in step (a) with at least one bit of additional information to determine the sequence. In various embodiments of the present invention, the at least one bit of additional information comprises information selected from the group consisting of: the identity of a nucleotide in the template; information obtained by comparing a candidate sequence to at least one known sequence; and information obtained by repeating the method with a second set of oligonucleotide probe families.

Thus, while this approach does not yield 2 bits of information for any single nucleotide, 2 bits of information about the template are collected in a delocalized fashion in each cycle when the preferred probe family set is used. With a set of 2 or 3 probe families, less than 2 bits of information are collected per cycle.
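The bit counts quoted above follow from the number of distinguishable labels. The Python sketch below is illustrative only; it assumes that every probe family is equally likely to be detected in a cycle, so that a cycle yields at most log2(k) bits for k families, and the function names are hypothetical.

```python
import math


def bits_per_cycle(n_families):
    """Maximum information per detection step for n distinguishable families."""
    return math.log2(n_families)


def bits_per_nucleotide(n_families, constrained_length=2):
    """Average information per interrogated nucleotide in one detection step."""
    return bits_per_cycle(n_families) / constrained_length


for k in (2, 3, 4):
    print(k, round(bits_per_cycle(k), 3), round(bits_per_nucleotide(k), 3))
```

With four families this gives 2 bits per cycle and 1 bit per nucleotide for a two-nucleotide constrained portion; with 2 or 3 families the per-cycle figure drops below 2 bits, as stated in the text.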

There are many advantages to the delocalized collection of information, including the ability to apply the error checking methods described above. Furthermore, since in preferred embodiments each nucleotide in the template is interrogated more than once, delocalized collection of information helps to avoid systematic bias in the detection of a fluorophore associated with a particular nucleotide.

In addition to methods that involve successive cycles of extension, ligation, and cleavage of probes, the probe families and probe family sets described herein can be used in a variety of other sequencing methods. The invention also provides probe families and probe family sets having the sequences and structures described above, wherein the probes optionally do not contain a scissile linkage. For example, the probes may contain only phosphodiester backbone linkages and/or may be free of initiating residues. In some embodiments of the invention, the probe families are used in a sequencing method that employs successive cycles of extension and ligation but does not include cleavage in each cycle. For example, the probe families can be used in ligation-based methods such as those described in WO2005021786 and other references in the art. To employ the probe families in such a method, the labels on the probes should be attached by a cleavable linker, as described in WO2005021786, so that the labels can be removed without cleaving a scissile linkage in the nucleic acid. This method can be used to generate an ordered list of probe families, for example, by performing multiple reactions in parallel or in sequence using the probe families in place of the ligation cassettes described in WO2005021786, and then assembling the list of probe families. The list is decoded as described above.

I. Reagent kit

Various kits may be provided to carry out different embodiments of the invention. Certain kits include extension oligonucleotide probes comprising phosphorothioate linkages. The kit may further comprise one or more starting oligonucleotides. The kit may contain a cleavage reagent suitable for cleaving phosphorothioate linkages, such as AgNO₃, and a suitable buffer for performing the cleavage. Certain kits include extension oligonucleotide probes containing a priming residue, such as a nucleoside containing a damaged base or an abasic residue. The kit may further comprise one or more starting oligonucleotides. The kit may contain a cleavage reagent suitable for cleaving a linkage between a nucleoside and an adjacent abasic residue and/or a reagent suitable for removing a damaged base from a polynucleotide, such as a DNA glycosylase. Certain kits include an oligonucleotide probe comprising a disaccharide nucleotide, and periodate as a cleavage reagent. In certain embodiments, the kit contains a set of differentially labeled oligonucleotide probe families.

The kit may also include ligation reagents (e.g., ligase, buffer, etc.) and instructions for practicing embodiments of the invention. Buffers suitable for other enzymes that may be used, such as phosphatases, polymerases, etc., may be included. In some cases, these buffers may be the same. The kit may also include a support for anchoring the template, such as magnetic beads. These beads can be functionalized with PCR amplification primers. Other optional components include wash solutions; a vector into which the template can be inserted for PCR amplification; PCR reagents such as amplification primers, thermostable polymerases and nucleotides; reagents for preparing emulsions; reagents for preparing gels, and the like.

In certain preferred kits, fluorescently labeled oligonucleotide probes containing phosphorothioate linkages are provided such that probes corresponding to different terminal nucleotides of the probes carry different spectrally resolved fluorescent dyes. More preferably, four such probes are provided so as to provide a one-to-one correspondence between the four spectrally resolved fluorochromes and the four possible probe terminal nucleotides.

Identifiers, such as bar codes, radio frequency ID tags, and the like, may be present in or on the kit. For example, the identifiers may be used to uniquely identify the kits for quality control, inventory management, tracking, movement between workstations, and the like.

Kits typically include one or more vessels or containers for individually storing certain reagents. The kit may also include means for enclosing the individual containers in a relatively tight seal, such as a plastic case, which may contain instructions, packaging materials such as styrofoam, and the like, to facilitate commercial distribution.

J. Automatic sequencing system

The present invention provides a variety of automated sequencing systems that can be used to collect sequence information from multiple templates in parallel (i.e., substantially simultaneously). Preferably, the templates are arranged on a substantially flat substrate. FIG. 21 shows a photograph of a system of the present invention. As shown in the photograph, the system comprises a CCD camera, a fluorescence microscope, a movable stage, a Peltier flow cell, a temperature controller, a fluid handling device, and a dedicated computer. It should be understood that various substitutions may be made for these components. For example, another image capturing device may be employed. See example 9 for additional details of this system.

It is to be understood that various sequencing methods, including the ligation-based methods described herein and other methods, can be implemented using the automated sequencing system of the present invention and the associated image processing methods and software, including but not limited to sequencing by synthesis, such as fluorescent in situ sequencing by synthesis (FISSEQ) (see, e.g., Mitra RD et al, Anal. Biochem., 320 (1): 55-65, 2003). As with the ligation-based sequencing methods described herein, FISSEQ can be performed on templates immobilized directly in or on a semi-solid support, on microparticles immobilized in or on a semi-solid support, on templates directly attached to a substrate, and the like.

An important aspect of the system of the present invention is the flow cell. Typically, the flow chamber comprises a chamber having input and output ports through which fluid can flow. See, for example, U.S. patents 6,406,848 and 6,654,505 and PCT publication No. WO98053300 for a discussion of various flow chambers and materials and methods of making the same. Fluid flow enables the addition and removal of various reagents to entities (e.g., templates, particles, analytes, etc.) located in the flow chamber.

Preferably, a flow cell suitable for use in a sequencing system of the invention comprises a location where a substantially flat substrate, such as a slide, can be mounted to allow fluid flow over the substrate surface, and a window to allow illumination, excitation, signal acquisition, etc. In accordance with the method of the present invention, entities, such as particles, are generally disposed on a substrate prior to entering a flow cell.

In some embodiments of the invention, the flow cell is oriented vertically so that air bubbles escape from the top of the flow cell. The flow cell is arranged such that the flow path runs from the bottom end of the flow cell to the top end, e.g., the input port is located at the bottom end of the flow cell and the output port is located at the top end. Since any bubbles that may be introduced are buoyant, they quickly float to the output port without obscuring the illumination window. This approach, in which bubbles rise to the surface of a liquid because their density is lower than that of the liquid, is referred to herein as "gravity bubble displacement". Thus, the present invention provides a sequencing system with a flow cell orientation that allows for gravity bubble displacement. Preferably, the substrate to which the particles are directly or indirectly attached (e.g., covalently or non-covalently), or which carries particles adhered or immobilized in or on a semi-solid support, is mounted vertically in the flow cell, i.e., the largest planar surface of the substrate is perpendicular to the ground plane. Since in a preferred embodiment the microparticles are immobilized in or on a support or substrate, their relative positions are substantially fixed, which facilitates serial acquisition of images and image registration.

FIGS. 24A-J show schematic views of a flow cell of the present invention or portions thereof in different orientations. The flow cell of the present invention may be used for various purposes including, but not limited to: analytical methods (e.g., nucleic acid assays such as sequencing, hybridization assays, etc.; protein assays, binding assays, screening assays, etc.). The flow cell can also be used to perform synthesis, such as generating combinatorial libraries and the like.

FIG. 22 shows a schematic diagram of another automated sequencing system of the invention. The flow cell is mounted on a temperature-controlled automated stage (similar to that described in example 9) and connected to a fluid handling system, such as a syringe pump equipped with a multi-port valve. The stage accommodates multiple flow cells so that one flow cell can be imaged while other steps, such as extension, ligation, and cleavage, are performed on another flow cell. This approach maximizes the use of the expensive optics while increasing throughput.

The fluid lines are equipped with optical and/or conductivity sensors to detect bubbles and monitor reagent usage. Temperature control and sensors in the fluidic system ensure that the reagents are held at a temperature appropriate for long-term stability but are raised to the working temperature as they enter the flow cell, to avoid temperature fluctuations during the annealing, ligation and cleavage steps. The reagents are preferably prepackaged in a kit to prevent loading errors.

The optics include four cameras, each taking a picture through one of four filter sets. To reduce photobleaching, the illumination optics may be engineered to illuminate only the area being imaged, to prevent multiple illumination at the edges of the field of view. The imaging optics can be built from standard infinity-corrected microscope objectives and standard beam splitters and filters. The images may be captured with standard 2,000 x 2,000 pixel CCD cameras. The system incorporates mechanical supports for the optics. The light intensity is preferably monitored and recorded for use by the analysis software.

In order to obtain multiple images quickly (e.g., about 1800 or more non-overlapping image fields of view in one representative embodiment), the system preferably employs a fast auto-focus system. Autofocus systems based on analysis of the image itself are well known in the art. They typically require at least 5 frames per focus event. This approach is slow and expensive due to the extra illumination required to obtain a focused image (increased photobleaching). In some embodiments of the invention, another autofocus system is used, such as a system based on separate optics, which focuses as fast as a mechanical system can react. Such systems are known in the art, including, for example, focusing systems for consumer-grade CD players, which maintain sub-micron focus in real time as the CD is played.

In some embodiments of the invention, the system is operated remotely. Scripts for implementing particular protocols may be stored in a central database and downloaded for each sequencing run. The samples may be bar coded to maintain sample tracking integrity and to associate the samples with the final data. Central real-time monitoring allows fast resolution of process errors. In some embodiments, images collected by the device are immediately uploaded to a central multi-TB storage system and one or more banks of processors. Using tracking data from the central database, the processors analyze the images and generate sequence data, optionally also generating process metrics such as background fluorescence levels and bead densities, for example to track device performance.
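The need for multi-TB storage can be made concrete with a rough estimate based on the figures given for this embodiment (about 1800 image fields, four cameras, 2,000 x 2,000 pixel images). The Python sketch below is only an order-of-magnitude illustration: it assumes that every field is imaged once by each of the four cameras per detection cycle and that each pixel is stored as 2 bytes without compression, neither of which is stated in the text.

```python
FIELDS_PER_PASS = 1800          # "about 1800 or more" fields, from the text
CAMERAS = 4                     # one camera per filter set, from the text
PIXELS_PER_IMAGE = 2000 * 2000  # 2,000 x 2,000 pixel CCD, from the text
BYTES_PER_PIXEL = 2             # assumption: 16-bit pixels, uncompressed

images_per_pass = FIELDS_PER_PASS * CAMERAS
bytes_per_pass = images_per_pass * PIXELS_PER_IMAGE * BYTES_PER_PIXEL


def passes_per_terabyte(bytes_per_imaging_pass):
    """How many full imaging passes fit in 1 TB (10**12 bytes)."""
    return 10**12 // bytes_per_imaging_pass


print(images_per_pass)                      # images per detection cycle
print(bytes_per_pass / 1e9)                 # gigabytes per detection cycle
print(passes_per_terabyte(bytes_per_pass))  # full cycles per terabyte
```

Under these assumptions a single detection cycle already produces tens of gigabytes of raw images, so a run of a few dozen cycles reaches the terabyte range.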

Control software suitably coordinates the pumps, stage, cameras, filters and temperature controller, and annotates and stores the image data. A user interface is provided to, e.g., assist an operator in setting up and maintaining the apparatus; it preferably includes functions for positioning the stage and priming the fluid lines when slides are loaded or unloaded. Display functionality may be included, for example, to show the operator various operating parameters, such as temperatures, stage position, current filter configuration, status of the running protocol, etc. Preferably, an interface to a database is included for recording tracking data such as reagent lot numbers and sample IDs.

K. Image and data processing method

The present invention provides various image and data processing methods implemented at least in part as computer code (i.e., software) stored on a computer-readable medium. Further details are set forth in examples 9 and 10. In addition, sequencing methods A and AB generally employ suitable computer software to perform processing steps including, for example, keeping track of the data collected in multiple sequencing reactions, compiling the data, generating candidate sequences, performing sequence comparisons, and the like.

L. Computer-readable medium storing sequence information

In addition, the invention provides a computer-readable medium storing information generated using the sequencing methods of the invention. The information includes raw data (i.e., data that has not been further processed or analyzed), processed or analyzed data, and the like. The data includes images, numbers, and the like. Such information may be stored in a database, i.e., a collection of information (e.g., data), typically arranged for easy lookup, e.g., in a computer memory. The information includes, for example: sequences and any information about sequences, such as partial sequences, comparisons of sequences to reference sequences, results of sequence analysis, genomic information such as polymorphism information (e.g., whether a particular template contains a polymorphism) or mutation information, linkage information (i.e., information relating to the physical location of a nucleic acid sequence in a chromosome relative to another nucleic acid sequence), disease-related information (i.e., information relating the presence of or susceptibility to a disease to a physical characteristic of a subject, such as an allele of the subject), and the like. The information may be associated with a sample ID, a subject ID, etc. Other information relating to the sample, subject, etc. may be included, including but not limited to: the sample source, processing steps performed on the sample, interpretation of the information, characteristics of the sample or subject, and the like. The present invention also includes a method comprising receiving any of the above-described information in computer-readable form (e.g., stored on a computer-readable medium). The method may further comprise the step of providing diagnostic, prognostic or predictive information based on such information, or simply providing the information, preferably stored on a computer-readable medium, to a third party.

The following examples are provided for illustration and are not intended to limit the invention.

Example 1: efficient cleavage and ligation of phosphorothioate oligonucleotides

This example describes experiments showing efficient ligation and cleavage of extension oligonucleotides containing 3'-S phosphorothioate linkages.

Materials and methods

Ligation sequencing method

Template preparation: to evaluate the potential for sequencing by cycles of oligonucleotide ligation and cleavage and to explore the effect of altering certain aspects of the method, two sets of model bead-based template populations were prepared. In a preferred embodiment, the oligonucleotide ligation and cleavage cycles extend the strand in the 3'→5' direction, as described in the examples. Therefore, to evaluate ligation efficiency, the 5' end of each model template was bound to the bead and an identical primer binding region was designed at the 3' end. One set consisted of short (70 bp) oligonucleotides bound to streptavidin-coated magnetic beads (1 micron) via a bis-biotin moiety. The 3' end of each of these short template populations was designed with an identical primer binding region (40 bp) and a unique sequence region (30 bp). The short oligonucleotide template populations are referred to as ligation sequencing templates 1-7 (LST 1-7).

A second set of bead-based template populations was designed from long, PCR-generated DNA fragments (232 bp), produced by inserting a 183-bp spacer sequence (from a human p53 exon) into each template population. The templates were amplified with a forward primer containing bis-biotin and a reverse primer containing the same unique 30-base 3' end sequence as the short template populations. Single-stranded templates were generated by melting off one strand with a buffer containing sodium hydroxide. These long template populations, referred to as long-LST 1-7, are designed to mimic the species generated from the short-fragment paired-end libraries described in the copending patent application.

Primer hybridization: FAM-labeled primer (100 μM) was premixed with 100 μL of 1X Klenow buffer. After the buffer was removed from a 30 μL aliquot of template-attached magnetic beads (10⁶/μL), the primer solution was added to the beads and the resulting suspension was mixed well. After template/primer hybridization (65 ℃ for 2 min, 40 ℃ for 2 min, on ice for 2 min), the primer/buffer was removed, and the beads were washed 3 times with Wash 1E buffer and resuspended in 300 μL (10⁶/mL) of TENT buffer (containing 10 mM Tris, 2 mM EDTA, 30 mM NaOAc and 0.01% Triton X-100).

Ligation 1: Next, 2.5 × 10⁶ LST7 beads hybridized with LigSeq-FAM were incubated at 37 ℃ for 30 minutes in a mixture containing 1 μL of 100 μM LST7-1 nonamer, 4 μL of 5X T4 ligase buffer (Invitrogen), 14 μL of H₂O and 1 μL of T4 ligase (1 U/μL, Invitrogen).

Cleavage 1: The beads were then washed 3 times with 100 μL of LSWash1 (containing 1X TE, 30 mM sodium acetate, 0.01% Triton X-100); a 10 μL aliquot of this suspension was removed and stored for analysis. The beads were then washed once with 100 μL of 30 mM sodium acetate. 50 μL of 50 mM AgNO₃ was added, and the resulting mixture was incubated at 37 ℃ for 20 minutes. After removal of the AgNO₃, the beads were washed once with 100 μL of 30 mM sodium acetate. The beads were then washed 3 times with 100 μL of LSWash1 and resuspended in 90 μL of wash (TENT) buffer; a 10 μL aliquot of this suspension was removed and stored for analysis.

Ligation 2: After removal of the TENT buffer, the beads were resuspended in 14 μL of H₂O and incubated at 37 ℃ for 30 minutes with a mixture containing 1 μL of 100 μM LST7-5 nonamer, 4 μL of 5X T4 ligase buffer (Invitrogen) and 1 μL of T4 ligase (1 U/μL, Invitrogen).

Cleavage 2: The beads were washed 3 times with 100 μL of LSWash1 (1X TE, 30 mM sodium acetate, 0.01% Triton X-100) and resuspended in 45 μL of Wash 1E. A 15 μL aliquot of this suspension was removed and stored for analysis. The beads were then washed once with 100 μL of 30 mM sodium acetate and resuspended in 5 μL of 20 mM sodium acetate. 50 μL of 50 mM AgNO₃ was added to the beads, and the mixture was incubated at 37 ℃ for 20 minutes. After removal of the AgNO₃, the beads were washed once with 100 μL of 30 mM sodium acetate. The beads were then washed 3 times with 100 μL of LSWash1 and resuspended in 30 μL of Wash 1E. A 20 μL aliquot of this suspension was removed and stored for analysis.

Results

This experiment can be better understood with reference to FIG. 8. The upper part of FIG. 8 shows a general outline of the experimental procedure. The starting oligonucleotide (primer) is hybridized to a template (labeled LST7) attached to the bead by a biotin linkage. The starting oligonucleotide contains a 5' phosphate and is fluorescently labeled with FAM at its 3' end. Two 9-mer (nonamer) oligonucleotide probes (the 1st and 2nd cleavable oligonucleotides) were synthesized, each containing an internal phosphorothioate thymidine base (sT; underlined). The 1st cleavable probe was ligated to the extendable end of the primer with T4 DNA ligase and then cleaved with silver nitrate. Cleavage removes the terminal 5 nucleotides of the extension probe and creates an extendable terminus on the portion of the probe that remains ligated to the primer. The 2nd cleavable probe is then ligated to this extendable terminus and cleaved in the same way.

The ligation and cleavage steps were monitored by fluorescence capillary electrophoresis gel shift assays. In these assays, the primer was hybridized to the template strand such that its 5' phosphate could serve as a ligation substrate for incoming oligonucleotide probes (the fluorophore served as the reporter for mobility-based capillary gel electrophoresis). After each step, an aliquot of beads was taken for analysis. After oligonucleotide probe ligation, the magnetic beads were collected with a magnet, the ligation product formed by joining primer and probe on the template beads was released by thermal denaturation, and fluorescence capillary electrophoresis was performed on an automated DNA sequencing instrument (ABI 3730) with a labeled size standard (lissamine ladder; size range 15-120 nucleotides; shown as a set of orange peaks in the chromatogram, see FIG. 8). In a typical gel shift, possible peaks include: i) the primer peak (due to the absence of primer extension), ii) the adenylation peak (due to DNA ligase attaching an adenosine residue to the 5' end at a non-productive junction; see the mechanism in FIG. 8F, and see also Lehman, I.R., Science, 186: 790-797, 1974), and iii) the completion peak (due to ligation of the oligonucleotide probe). One advantage of evaluating ligation efficiency by gel shift assay is that the area under each peak is directly related to the concentration of the corresponding species.
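Because peak area tracks the concentration of each species, the degree of ligation completion can be estimated directly from the three peak areas. The following is a minimal sketch of that arithmetic; the function name and the peak areas are hypothetical illustrations and are not measurements from FIG. 8.

```python
def ligation_fractions(primer_area, adenylated_area, ligated_area):
    """Fraction of total signal in each peak, assuming area is proportional to concentration."""
    total = primer_area + adenylated_area + ligated_area
    if total <= 0:
        raise ValueError("total peak area must be positive")
    return {
        "unextended": primer_area / total,
        "adenylated": adenylated_area / total,
        "ligated": ligated_area / total,
    }


# Hypothetical peak areas in arbitrary units (illustration only)
fractions = ligation_fractions(primer_area=50.0, adenylated_area=150.0, ligated_area=800.0)
print(fractions)
```

The "ligated" fraction corresponds to the completion percentages quoted in the later examples.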

FIG. 8A shows a control ligation with T4 DNA ligase and an exact match probe containing only phosphodiester linkages (left of FIG. 8A). The orange peaks represent the size markers. The blue peak on the left indicates the position of the primer when no ligation has occurred. Ligation of the exact match probe results in a shift to the left (arrow). FIG. 8B shows ligation under the same conditions with a probe containing an internal thiolated T base (left of FIG. 8B). The same shift as for the control probe was observed (arrow). The bead-attached template population carrying the ligated phosphorothioate probe was then incubated with silver nitrate to induce probe cleavage. Gel shift analysis showed a left-shifted 4-bp cleavage product, confirming efficient cleavage (FIG. 8C). The predicted cleavage product is shown at the left of FIG. 8C. The cleaved bead-based template population was then subjected to a second round of ligation, as evidenced by the presence of a right-shifted 13-bp extension product (FIG. 8D). The predicted cleavage product is shown at the left of FIG. 8D. A second round of cleavage confirmed that multiple efficient cleavage steps can be accomplished, as indicated by the left-shifted, predicted 8-bp cleavage product (FIG. 8E).

These results demonstrate that probes containing phosphorothioate linkages were successfully ligated and cleaved.

Clearly, ligation did not proceed to 100% completion in these experiments, although a higher degree of completion was observed in other experiments with T4 DNA ligase (see below). While it is certainly desirable for ligation to proceed to completion, this is not a requirement. For example, unligated 5' ends can be effectively "capped" by treatment with a 5'-phosphatase after the ligation step. In that case, however, the number of consecutive ligations that can be performed may be limited by depletion of ligatable molecules. For a given template and a given number of consecutive ligations, the read length depends on the length of probe remaining after each ligation/cleavage cycle and on the number of sequencing reactions, also referred to as the number of "restarts"; each sequencing reaction is followed by removal of the primer and hybridization of a primer that binds to a different part of the primer binding site. This favors the use of longer probes with the cleavable linkage near the 5' end of the probe. In our experiments, hexamer probes produced more unligated adenylation product than octamers and longer probes. Octamers and longer probes ligated essentially to completion (see below). Furthermore, adding a fluorescent moiety to the 5' end of a hexamer probe appeared to reduce ligation efficiency, whereas adding a fluorescent moiety to an octamer probe had little or no effect. For these reasons, octamer or longer probes are believed to be preferred.
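The dependence of read length on the retained probe length and the number of restarts can be illustrated with a small calculation. The sketch below rests on assumptions that are not stated in this passage: each ligation/cleavage cycle identifies one template position, the strand grows by `step` nucleotides per cycle, and successive restart primers are offset by one nucleotide. All names are hypothetical.

```python
def interrogated_positions(step, cycles, offsets):
    """Template positions identified, numbered from the first position read in the first reaction.

    step    -- nucleotides left on the growing strand after each ligation/cleavage cycle
    cycles  -- number of consecutive ligations per sequencing reaction
    offsets -- primer start offsets, one per sequencing reaction ("restart")
    """
    positions = set()
    for offset in offsets:
        for cycle in range(cycles):
            positions.add(offset + cycle * step + 1)
    return sorted(positions)


def contiguous_read_length(positions):
    """Length of the unbroken run of positions starting at position 1."""
    length = 0
    for expected, position in enumerate(positions, start=1):
        if position != expected:
            break
        length = expected
    return length


# Example 1 geometry: a nonamer loses 5 nucleotides on cleavage, so 4 remain per cycle.
single = interrogated_positions(step=4, cycles=3, offsets=[0])
restarted = interrogated_positions(step=4, cycles=3, offsets=[0, 1, 2, 3])
print(single, contiguous_read_length(single))
print(restarted, contiguous_read_length(restarted))
```

Under these assumptions a single reaction samples every fourth position, while four offset reactions fill the gaps and give a contiguous read of `step * cycles` positions.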

Other experiments (described below) have demonstrated: ligation and cleavage of probes containing phosphorothioate linkages and nucleotides with reduced degeneracy; 3' end specificity and selectivity of the ligated extension probes; ligation and cleavage in gel; successive cycles of primer hybridization and removal with only a small loss of signal; 100% fidelity of 3'→5' extension by T4 or Taq ligase; and 4-color spectral resolution of the ligated extension probes. An automated system for carrying out the method was also constructed.

Example 2: efficient cleavage and ligation of phosphorothioate oligonucleotides containing nucleotides with reduced degeneracy

Another consideration for probe length, however, is the fidelity of the extension oligonucleotide and its effect on subsequent ligation efficiency. The fidelity of T4 DNA ligase has been shown to decrease rapidly beyond the 5th base from the ligation junction (Luo et al., Nucleic Acids Res., 24: 3071-3078 and 3079-3085, 1996). If a mismatch is introduced on the 5' side of the newly ligated junction, ligation efficiency may be reduced by depletion; however, no dephasing or increase in background signal (a major obstacle encountered in polymerase-based sequencing-by-synthesis methods) occurs.

Preferably, the probe set should be capable of hybridizing to any DNA sequence in order to resequence uncharacterized DNA. However, the complexity of the set of labeling probes increases exponentially with the length and number of 4-fold degenerate bases. Furthermore, complex probe sets are more difficult to synthesize and more difficult to purify while maintaining substantially the same representation for all probe species. Higher concentrations of probe mixtures are also required to maintain constant concentrations of the various species. One way to address this complexity is to use nucleotides incorporating universal bases such as deoxyinosine instead of 4-fold degenerate bases at certain positions.
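The exponential growth in probe-set complexity, and the reduction obtained by substituting universal bases, can be made concrete with a short calculation. This is a minimal sketch; the design strings are hypothetical notation in which N is a 4-fold degenerate position, I is inosine, and any other letter is a fixed base.

```python
def probe_set_complexity(design):
    """Number of distinct sequence species in a probe pool.

    Each N position multiplies the pool by 4. An inosine (I) or a fixed
    base contributes a single species at that position.
    """
    complexity = 1
    for symbol in design.upper():
        if symbol == "N":
            complexity *= 4
    return complexity


fully_degenerate = probe_set_complexity("NNNNNNNN")   # octamer, all positions degenerate
with_inosine = probe_set_complexity("NNNNNIIA")       # hypothetical: five N, two I, one fixed base
print(fully_degenerate, with_inosine)
```

Replacing three of the eight degenerate positions reduces the pool from 4⁸ to 4⁵ species, a 64-fold reduction.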

12 octanucleotide probes were designed with 4-fold degenerate bases (N; equimolar amounts of A, C, G, T) and the universal base inosine (I) at each position within the octamer (inosine could form bidentate hydrogen bonds with any of the four canonical bases in B-DNA; the order of stability of the inosine base pairs was I: C > I: A > I: T ≈ I: G). One of the goals in evaluating these probe designs was to determine how low octamer complexity could be achieved in the presence of inosine bases while still supporting efficient ligation.

In a preliminary study, several oligonucleotide probes were attached to a bead-based template (Long-LST 1) using T4 DNA ligase. After ligation, the fluorophore-labeled primer (3' FAM primer) was shifted to the right by an amount proportional to the amount of ligated oligonucleotide probe. Probe design NI8-9 showed the highest level of completion, with > 99% of the primer population shifted to the right due to efficient ligation of the probes (see FIG. 9). These reactions were carried out at 25 ℃; when the reaction temperature was increased to 37 ℃, the ligation efficiency was slightly lower and the completion rate was more variable.

Further examination of these data revealed that probes with fewer inosine bases in the first five nucleotides (underlined) 3' to the junction showed higher ligation efficiency. To investigate this further and to evaluate the possible effect of sequence context on ligation efficiency, four oligonucleotide probe designs having only one inosine residue in the first five bases 3' to the junction were screened on all templates. FIG. 10 shows gel shift assays performed with selected probe compositions on various templates using T4 DNA ligase to assess ligation completion. The data from these preliminary experiments show that ligation efficiency and completion rate are variable and sequence dependent when an inosine residue appears in the first five positions (underlined) 3' to the junction. However, efficient ligation of the octamer was consistently observed when oligonucleotide probe design NI8-9 was used, as evidenced by > 99% completion on all templates tested.

While not wishing to be bound by any theory, these data (including the presence of adenylation intermediates) support the following conclusion: the presence of an unfavorable inosine base pair in the core DNA binding site of T4 DNA ligase destabilizes the DNA-protein complex sufficiently to reduce enzyme binding and subsequent ligation. An interesting question, however, is whether such destabilizing inosine base pairs affect the fidelity of the ligated oligonucleotide probes.

Example 3: fidelity of probe attachment

Bacterial NAD-dependent ligases such as Taq DNA ligase have been reported to have high sequence fidelity at the junction: mismatches on the 3' side essentially abolish nick closure, whereas mismatches on the 5' side are tolerated to some extent (Luo et al., Nucleic Acids Res., 24: 3071-3078 and 3079-3085, 1996). T4 DNA ligase, on the other hand, has been reported to be less stringent, allowing mismatches on both the 3' and 5' sides of the junction. It was therefore of interest to evaluate the fidelity of probe ligation with T4 DNA ligase in our system in comparison with Taq DNA ligase.

Using standard ABI sequencing techniques, we developed two methods to evaluate the sequence fidelity of the ligated oligonucleotides. The first approach was designed to clone and sequence the ligation products. In this method, the ligation extension product is ligated to an adapter sequence, cloned and transformed into bacteria. Individual colonies were picked and sequenced to quantitatively assess the mismatch frequency at each position of the junction. The second approach was designed to directly sequence the ligation products. In this method, single stranded ligation products are denatured from the bead-based template and directly sequenced with complementary primers. The positions of low accuracy in the resulting sequence trace show multiple overlapping peaks, and the sequence fidelity at that position is assessed qualitatively.

The relative fidelity of probe ligation by T4 and Taq DNA ligases was assessed with the first method. A single bead-based template population (LST1) was hybridized to a universal sequencing primer that served as the starting oligonucleotide. A solution-based ligation reaction was then performed at 37 ℃ for 30 minutes in the presence of a degenerate oligonucleotide probe (N7A, 3' ANNNNNNNNN 5', 2000 picomolar) using T4 DNA ligase (15 U/1×10⁶ beads) or Taq DNA ligase (60 U/1×10⁶ beads) (FIG. 11, panel A). The ligation products were cloned and sequenced to evaluate the positional fidelity of each DNA ligase on the 3' side of the junction (positions 1-8) (FIG. 11, panels B and C). The results indicate that the fidelity of T4 DNA ligase and Taq DNA ligase was essentially the same at the first 5 positions, but that the fidelity of T4 DNA ligase was lower at positions 6-8. These results were further confirmed by subsequent cloning experiments that evaluated the DNA sequences of three degenerate inosine-containing probe designs (3'-NNNIII-5', 3'-NNNI-5', and 3'-NNNI-5') ligated to all seven templates (LST 1-7). This study confirmed that T4 DNA ligase had low sequence fidelity at positions 6-8 from the junction, but high fidelity at the first 5 positions on all templates tested (data not shown).

The fidelity of T4 DNA ligase with the degenerate inosine-containing probes was evaluated by direct sequencing. The oligonucleotide probes were evaluated in ligation reactions at 25 ℃ and 37 ℃ containing T4 DNA ligase and a bead-based template. The efficiency of oligonucleotide probe ligation was evaluated using a gel shift assay (FIG. 12, panel A). The ligation products were directly sequenced on an ABI 3730xl DNA analyzer to evaluate the fidelity of T4 DNA ligase in oligonucleotide probe ligation (FIG. 12, panel B). Ligation of an exact match oligonucleotide probe and of two representative degenerate inosine-containing oligonucleotide probes (NI8-9 and NI8-11) reached > 99% completion with a very low frequency of mismatches (no multiple peaks in the sequencing traces). The data indicate that probes that ligate efficiently also have high sequence fidelity.

In other experiments, a single bead-based template population (LST1) was hybridized to a universal sequencing primer containing a 5' phosphate, which served as the starting oligonucleotide. Solution-based ligation was performed with T4 DNA ligase (1 U/250,000 beads) in the presence of degenerate inosine-containing oligonucleotide probes (3' NNNiii 5', 3' NNNiNi 5' or 3' NNNiNNNi 5', 600 picomolar) at 37 ℃ for 30 minutes. The ligation products were cloned, and colonies were picked and sequenced. Sequence fidelity was determined by counting the number of clones with the correct base at each position from the junction. The results are tabulated and shown in FIGS. 12C-F. These studies demonstrated that 3'→5' ligation of degenerate inosine-containing probes by T4 DNA ligase proceeds with a high level of fidelity at positions 1-5.
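The clone-counting analysis amounts to a per-position tally of matches against the expected sequence. The following is a minimal sketch of such a tally; the expected sequence and clone reads are invented for illustration and do not reproduce the data of FIGS. 12C-F.

```python
def positional_fidelity(expected, clones):
    """Fraction of clones matching the expected base at each position from the junction."""
    if not clones:
        raise ValueError("at least one clone sequence is required")
    matches = [0] * len(expected)
    for clone in clones:
        if len(clone) != len(expected):
            raise ValueError("clone length must equal expected length")
        for index, (want, got) in enumerate(zip(expected, clone)):
            if want == got:
                matches[index] += 1
    return [count / len(clones) for count in matches]


# Hypothetical data: position 1 is the base adjacent to the junction
expected = "ACGTACGT"
clones = ["ACGTACGT", "ACGTACGA", "ACGTACTT", "ACGTACGT"]
fidelity = positional_fidelity(expected, clones)
print(fidelity)
```

In this made-up data set the mismatches fall only at positions 7 and 8, mirroring the qualitative pattern described for T4 DNA ligase.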

Example 4: ligation and cleavage in gel

Preliminary experiments to explore, develop and optimize the oligonucleotide ligation cycle method were performed with bead-based templates in solution, as described above. In a second set of experiments, ligation and cleavage were performed on bead-based templates embedded in polyacrylamide gels on glass slides.

Slides were prepared by mixing millions of beads, each attached to a clonal population of single-stranded DNA templates, with 5% polyacrylamide on the slide and allowing polymerization to occur there. The bead-containing polyacrylamide solution was enclosed by a Teflon® mask. FIG. 14 (upper panel) shows a fluorescent image of a portion of a slide on which beads bearing template hybridized with Cy3-labeled primer were immobilized in a polyacrylamide gel. (This slide was used for a different experiment but is representative of the slides used herein.) FIG. 14 (lower panel) shows a schematic view of a slide equipped with a Teflon mask to enclose the polyacrylamide solution.

Reagents are introduced onto the slide either by manually pipetting a suitable solution onto the slide or by placing the slide in an automated laminar flow chamber. Preliminary studies demonstrated that efficient in-gel ligation can indeed be performed on templates attached to beads immobilized in the polyacrylamide matrix of such a slide. In the experiment shown in FIG. 15, single-stranded DNA template beads were immobilized on a glass slide in a gel containing acrylamide and DATD. After polymerization, 3' fluorophore-labeled, 5' phosphorylated universal primers (sequencing primers) were diffused into the gel and allowed to hybridize (panel A). The slides were washed to remove unbound sequencing primer and incubated with a ligation mix containing T4 DNA ligase (10 U) and oligonucleotide probe at 37 ℃ for 30 minutes. The slides were then incubated in buffer containing sodium periodate (0.1 M) to digest the acrylamide polymer and release the bead-based template population. The ligation products were released from the template strands by heat denaturation, collected, and analyzed by the gel shift assay described above. Ligation reactions performed in the gel in the absence of T4 DNA ligase showed one peak representing unligated sequencing primer (panel B). Ligation reactions with octamer probes in the presence of T4 DNA ligase showed efficient oligonucleotide ligation in the gel, with > 99% of the bead-based template population ligated (panel C).

Example 5: four color detection

To maximize detection efficiency, it is desirable to employ a set of oligonucleotide probes containing a distinguishing label corresponding to each possible base addition product. This method was simulated in an automated sequencing apparatus equipped with appropriate excitation and emission filters, as shown in FIG. 15. Three sets of octamer probes were designed to address the issues of probe specificity and selectivity. The first set comprises four octamers, complementary to four unique template populations, containing different 3 'bases and 5' dye labels. The second group includes seven unique octamers containing unique 3 'bases and 5' dyes. The third set corresponds to four degenerate inosine octamer-containing probe designs, each containing a unique 3 'terminal base identified with a different 5' dye label.

To validate the four spectral species, four unique template populations were tested with probe set #1 (see FIG. 16). Slides containing four unique populations of single-stranded template attached to beads embedded in polyacrylamide were prepared (panel A). Each bead is associated with a clonal population of templates. Universal sequencing primers containing 5' phosphates were hybridized in situ, and ligation was performed using a mixture of four oligonucleotide probes carrying unique fluorophores (Cy5, CAL 610, CAL 560, FAM; 100 picomoles each) and T4 DNA ligase (10 U/slide). The slides were incubated at 37 ℃ for 30 minutes and washed to remove unbound probe. The slides were imaged under white light to produce a base image (panel B), and fluorescence excitation was performed using four bandpass filters (FITC, Cy3, Texas Red and Cy5). Fluorescence images were captured before and after ligation. The individual populations were assigned false colors (panel C), the image values for the different spectral species were plotted, and minimal signal overlap was verified (panel D).

Example 6: demonstration of ligation specificity and selectivity in gels

To verify 3' end specificity, one template population was tested with probe set #2 (see FIG. 17). Slides were prepared with beads attached to a template population (LST1.T) embedded in a polyacrylamide gel and hybridized in situ with universal sequencing primers (panel A). Ligation was performed in the gel using T4 DNA ligase (10 U/slide) and a mixture of oligonucleotide probes consisting of four 5' end-labeled probes that differed only in their 3' base. The slides were incubated at 37 ℃ for 30 minutes and washed to remove the unbound probe population. The slides were imaged under white light to generate a base image (panel B), and fluorescence excitation was performed with four bandpass filters (FITC, Cy3, Texas Red and Cy5). Fluorescence images captured before and after ligation confirmed the presence of a single FAM-based probe population (blue dots) after in-gel ligation with T4 DNA ligase, with no spectral overlap (panels C and D). These data show that the probe specificity of T4 DNA ligase is stringent and depends on the first 3' base at the junction.

To further confirm the 3' specificity and selectivity, probe set #2 was used to identify bead template population mixtures that contained one base difference and were present in varying amounts. Slides were prepared with mixtures of beads each attached to one of four template populations each having a different single nucleotide polymorphism (LST 1; A, G, C or T) as shown in FIG. 18A. These beads were embedded in polyacrylamide gel on a glass slide. Bead-based template populations were used at various frequencies, as shown in column D. The slides were hybridized in situ with universal sequencing primers. Ligation was performed in a gel using T4DNA ligase (10U/slide) and a mixture of oligonucleotide probes containing equimolar amounts (100 picomoles each) of four 5 'end-labeled probes, which differed by only one 3' base. The slides were incubated at 37 ℃ for 30 minutes and washed to remove unbound probe population. The slides were imaged under white light to generate the base image (column B) and fluorescence excitation was performed with four bandpass filters (FITC, Cy3, texas red and Cy 5). The single probe images were superimposed and a false color was generated (column C). The fluorescence images were counted using bead-call software. The results are shown in column D, which demonstrates that the observed ligation frequency (Obs) correlates with the expected frequency (Exp). The data show that probe specificity and probe selectivity are high after ligation in the presence of multiple templates and demonstrate the ability to detect Single Nucleotide Polymorphisms (SNPs), i.e., changes in one nucleotide base in segments of genomic DNA of different individuals in a population, by ligation.
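The comparison of observed and expected ligation frequencies can be sketched as a tally of per-bead color calls followed by a deviation check. The bead calls and expected frequencies below are invented for illustration; the actual bead-call software and the values in FIG. 18 are not reproduced here.

```python
from collections import Counter


def observed_frequencies(bead_calls):
    """Fraction of beads assigned to each label (one call per bead)."""
    counts = Counter(bead_calls)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def max_abs_deviation(observed, expected):
    """Largest absolute difference between observed and expected frequency over all labels."""
    labels = set(observed) | set(expected)
    return max(abs(observed.get(label, 0.0) - expected.get(label, 0.0)) for label in labels)


# Hypothetical bead calls for a mixture of four single-base template variants
calls = ["A"] * 50 + ["G"] * 30 + ["C"] * 15 + ["T"] * 5
expected = {"A": 0.50, "G": 0.30, "C": 0.15, "T": 0.05}
observed = observed_frequencies(calls)
print(observed, max_abs_deviation(observed, expected))
```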

Example 7: confirmation of ligation specificity and selectivity in gels Using four-color degenerate inosine-containing extension probes

Another set of experiments was performed with probe set #3 to evaluate the specificity and selectivity of probe ligation using a four-color pool of degenerate inosine-containing oligonucleotide probes. The results are shown in FIG. 19. Bead-based slides were prepared as described above, but using four unique populations of single-stranded template present on beads in varying amounts, and then hybridized in situ with universal sequencing primers (panel A). Ligation was performed in the gel in the presence of T4 DNA ligase (10 U/slide) using a probe pool of octamers, each designed with five degenerate bases at the 3' end (N; complexity 4⁵ = 1024), two universal bases (I, inosine), and one known nucleotide corresponding to a specific 5' fluorophore (G-Cy5, A-CAL 610, T-CAL 560, A-FAM; 600 pmoles each). The slides were incubated at 37 ℃ for 30 minutes and washed to remove the unbound probe population. The slides were imaged under white light to generate the base image (column B), and fluorescence excitation was performed with four bandpass filters (FITC, Cy3, Texas Red and Cy5). The single-probe images were superimposed and false-colored (column C). The fluorescence images were counted with bead-call software, and the frequency of each ligation product was tabulated (column D); the raw data and the filtered data representing the top 90% of bead signal values are presented in the spectral scatter plots in column E. The data demonstrate that the observed ligation frequencies (Obs) correlate with the frequencies expected (Exp) from the known concentration of each template. This verifies that probe pools containing degenerate and universal bases can be used with T4 DNA ligase to provide specific and selective ligation in gels.

Example 8: confirmation of repeated cycles of hybridization and removal of starting oligonucleotides in gels

Experiments performed on templates immobilized in gels on microscope slides mounted in an automated flow cell (see below) demonstrated that multiple cycles of annealing and stripping of starting oligonucleotides can be applied to templates attached to beads embedded in a gel on a slide with minimal signal loss. A 44-base fluorescently labeled starting oligonucleotide was used. As shown in FIG. 20, minimal signal loss occurred over 10 cycles. The starting oligonucleotide is referred to as a primer in FIG. 20. As mentioned above, one major drawback of polymerase-based sequencing-by-synthesis methods is the propensity for positive and negative dephasing to occur on individual template strands. Positive dephasing occurs when nucleotides are misincorporated into the growing strand, causing the base sequence of that particular strand to run ahead of the sequence obtained from the remaining templates, out of phase by n+1 base calls. The more common negative dephasing occurs when a strand is not fully extended, resulting in background base calls that lag behind the growing strands (n-1). The ability to efficiently strip extension products and "restart" the template by hybridizing starting oligonucleotides at different positions enables very long read lengths with little or no signal loss.

Example 9: automatic sequencing system

This example describes a representative automated sequencing system of the invention that can be used to collect sequence information for one or more templates. Preferably, the templates are located on a substantially flat substrate such as a microscope slide. For example, the templates may be attached to beads arranged on the substrate. A photograph of the system is shown in FIG. 21. The system is based on an Olympus epifluorescence microscope body (mounted on its side) equipped with an automated, autofocusing stage and a CCD camera. Four filter cubes in a rotating holder allow four-color detection at different excitation and emission wavelengths. Mounted on the stage is a flow cell equipped with a Peltier temperature controller; the flow cell can be opened and closed to accept a substrate such as a slide (with a gasket sealing the edge of the region containing the semi-solid support, such as a gel). The vertical orientation of the flow cell is an important aspect of the system of the present invention, since it allows bubbles to escape from the top of the flow cell. The flow cell may be completely filled with air to expel all reagents prior to each wash step. The flow cell is connected to a fluidics handler equipped with two 9-port Cavro syringe pumps, which can deliver 4 differentially labeled probe mixtures, cleavage reagent, any other desired reagents, enzyme equilibration buffer, wash buffer and air to the flow cell through a single port. Operation of the system is fully automated and programmable through control software running on a dedicated computer with multiple I/O ports. The Cooke Sensicam camera is equipped with a 1.3 megapixel cooled CCD, but cameras with lower or higher sensitivity (e.g., 4 megapixels, 8 megapixels, etc.) may also be used. The flow cell utilized a 0.25 micron platform with 1 micron physical dimension.

Example 10: image acquisition and processing method

This example describes representative methods of acquiring and processing images of bead arrays with attached labeled nucleic acids. Accurate feature identification and alignment are important for reliable analysis of each acquired image. To identify features, all pixels except the highest-intensity pixel of each bead are first discarded. A histogram of the pixel values of a given image is made; pixels corresponding to background are discarded and the remaining pixel values are sorted. In a uniform image, in which the intensity of all beads is substantially the same, the algorithm removes the bottom 80-90% of pixel values. The top 10-20% of pixels by value are then scanned to identify pixels that are local maxima within a 4-pixel radius. The average intensity of this region and the average intensity of its perimeter are then recorded. These values form a normal distribution, and pixels whose values fall outside the distribution are removed. The percentage of pixels initially ignored, the size of the circular region, and the cutoff used to eliminate candidate beads from the normal distribution are parameterized and can be changed if desired. Alignment is accomplished by building a feature matrix for each image in the set to be aligned. The resulting matrices are then searched for the most frequent x, y coordinate offset to identify the optimal alignment.
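The alignment step, in which the most frequent x, y offset between feature sets is taken as the best alignment, can be sketched as a simple voting scheme. This is one plausible reading of the description, assuming integer feature coordinates; the function name and coordinates are hypothetical and this is not the code used in the described system.

```python
from collections import Counter


def best_offset(reference_features, image_features):
    """Return the (dx, dy) offset that occurs most often between two feature lists.

    Every pairing of a reference feature with an image feature casts one
    vote for its offset. True correspondences all vote for the same offset,
    while spurious pairings scatter their votes.
    """
    if not reference_features or not image_features:
        raise ValueError("both feature lists must be non-empty")
    votes = Counter()
    for ref_x, ref_y in reference_features:
        for x, y in image_features:
            votes[(x - ref_x, y - ref_y)] += 1
    offset, _count = votes.most_common(1)[0]
    return offset


# Hypothetical bead centers: the second image is shifted by (3, -2) and has one extra feature
reference = [(10, 10), (20, 15), (35, 40), (50, 12)]
shifted = [(13, 8), (23, 13), (38, 38), (53, 10), (70, 70)]
offset = best_offset(reference, shifted)
print(offset)
```

The all-pairs vote is quadratic in the number of features, so a practical implementation would likely restrict candidate offsets to a small search window.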

Bead images were collected in the Cy5 channel (corresponding to the sequencing primer) before addition of extension probes. These images were used to establish the location coordinates of each bead and a profile of raw signal intensity in relative fluorescence units (RFU). For each subsequent duplex extension, a set of images was taken before and after addition of Cy3-labeled nucleotides. These images were aligned to the original Cy5 image, and RFU values were assigned to each bead and recorded. Baseline correction was performed for each base addition by taking the difference in intensity between the labeled image (fluorophore added) and the unlabeled image (before extension). These baseline-subtracted values were then normalized to the intensity found in the Cy5 image for each feature, forming the basis for determining whether a bead was extended (i.e., a bead is considered extended if the duplex attached to it is extended). With these methods, thousands of features per image can be analyzed in about 1,300 images per slide, allowing five million to one hundred million template species to be analyzed per experimental run. The algorithms were designed so that they can readily be ported from MATLAB to C++ to further improve efficiency.
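The baseline subtraction and normalization described above reduce to a short per-bead calculation. In the sketch below the function names, the RFU values and the decision threshold of 0.5 are all assumptions made for illustration; the text does not specify a threshold.

```python
def normalized_signal(before_rfu, after_rfu, reference_rfu):
    """Baseline-subtracted signal for one bead, normalized to its reference (Cy5) intensity."""
    if reference_rfu <= 0:
        raise ValueError("reference intensity must be positive")
    return (after_rfu - before_rfu) / reference_rfu


def is_extended(before_rfu, after_rfu, reference_rfu, threshold=0.5):
    """Call a bead extended when its normalized signal reaches the threshold (hypothetical value)."""
    return normalized_signal(before_rfu, after_rfu, reference_rfu) >= threshold


# Hypothetical RFU values for two beads with the same Cy5 reference intensity
bright = normalized_signal(before_rfu=100.0, after_rfu=900.0, reference_rfu=1000.0)
dim = normalized_signal(before_rfu=100.0, after_rfu=150.0, reference_rfu=1000.0)
print(bright, is_extended(100.0, 900.0, 1000.0))
print(dim, is_extended(100.0, 150.0, 1000.0))
```

Normalizing to the Cy5 reference compensates for bead-to-bead differences in the amount of hybridized primer, so a single threshold can be applied across features.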

Example 11: bead alignment and tracking and sequence decoding

This example describes representative methods of processing images of bead arrays linked with labeled nucleic acids and sequencing from the data obtained.

Image analysis begins by convolving the image with a zero-integral circular top-hat kernel whose diameter matches the bead size. This automatically normalizes the background to zero while identifying the centers of individual beads as local maxima. The local maxima are determined, and those isolated from other local maxima are used as alignment points. Alignment points are calculated for each image in the time series. For each pair of images, the alignment points are compared and a displacement vector is calculated from the average displacement of all alignment points common to both images. This provides pairwise image displacements at sub-pixel resolution.
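A zero-integral circular top-hat kernel can be built as in the following sketch, which assumes a positive disk of the bead radius surrounded by a negative ring, each weighted so that the kernel sums to zero. The `ring_width` parameter and both function names are illustrative assumptions; the example does not specify the kernel's outer dimensions.

```python
def top_hat_kernel(bead_radius, ring_width=2):
    """Build a circular top-hat kernel whose weights sum to zero.

    Pixels inside the bead radius share a total weight of +1 and pixels in
    the surrounding ring share a total weight of -1, so a flat background
    gives a response of zero.
    """
    outer = bead_radius + ring_width
    size = 2 * outer + 1
    inside, ring = [], []
    for y in range(size):
        for x in range(size):
            d2 = (x - outer) ** 2 + (y - outer) ** 2
            if d2 <= bead_radius ** 2:
                inside.append((y, x))
            elif d2 <= outer ** 2:
                ring.append((y, x))
    kernel = [[0.0] * size for _ in range(size)]
    for y, x in inside:
        kernel[y][x] = 1.0 / len(inside)
    for y, x in ring:
        kernel[y][x] = -1.0 / len(ring)
    return kernel

def response_at(image, kernel, cy, cx):
    """Kernel response centered on pixel (cy, cx).

    The kernel is symmetric, so correlation and convolution coincide.
    The caller must keep the kernel footprint inside the image.
    """
    half = len(kernel) // 2
    total = 0.0
    for ky, row in enumerate(kernel):
        for kx, weight in enumerate(row):
            total += weight * image[cy + ky - half][cx + kx - half]
    return total

kernel = top_hat_kernel(bead_radius=3)
flat = [[7.0] * 11 for _ in range(11)]
print(abs(response_at(flat, kernel, 5, 5)) < 1e-9)
```

Bead centers would then be taken as the local maxima of this response over the whole image.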

For N images, there are N × (N-1)/2 pairwise displacements, but only N-1 of them are independent, since the rest can be calculated from the independent set. For example, the displacements between images 1 and 2 and between images 1 and 3 together imply the displacement between images 2 and 3. If the measured displacement between images 2 and 3 differs from the implied displacement, the measurements are inconsistent. The magnitude of this inconsistency can be used as a measure of how well the alignment algorithm is performing. Our preliminary tests show that the inconsistency in each direction is typically less than 0.1 pixel (see fig. 23).
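The consistency check can be written out as a small sketch. It assumes each displacement is an (x, y) tuple giving the shift of the later image relative to the earlier one; the function names are hypothetical.

```python
def pair_counts(n_images):
    """Return (independent displacements, total pairwise displacements)."""
    return n_images - 1, n_images * (n_images - 1) // 2

def implied_displacement(d12, d13):
    """Displacement from image 2 to image 3 implied by d12 and d13."""
    return (d13[0] - d12[0], d13[1] - d12[1])

def inconsistency(d12, d13, d23_measured):
    """Per-axis gap between the measured and the implied 2-to-3 displacement."""
    ix, iy = implied_displacement(d12, d13)
    return (abs(d23_measured[0] - ix), abs(d23_measured[1] - iy))

# Hypothetical sub-pixel displacements, in pixels.
d12, d13, d23 = (1.0, 2.0), (1.5, 1.0), (0.56, -1.04)
print(pair_counts(10))
print(inconsistency(d12, d13, d23))
```

Applied to every triple of images, the largest inconsistency gives a single figure of merit for an alignment run.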

Once the image time series is aligned, there are two ways to track individual beads. If the bead density is low and most beads do not touch other beads, the optical centroid of each bead can be identified and the bead intensity calculated by integrating over the area around the bead. If the bead density is so high that most beads touch each other, it is not possible to identify individual beads by a dark background band around them. However, after all images are aligned to sub-pixel resolution, it is possible to identify pixels belonging to the same bead by computing the correlation of neighboring pixels over time. Highly correlated pairs of pixels can be reliably assigned to the same bead. A similar technique was applied to lane tracking in DNA sequencing gels with good results (Blanchard, A.P., Sequence-specific effects on the incorporation of dideoxynucleotides by modified T7 polymerase, California Institute of Technology, 1993). Once the beads are tracked through the entire 4-color time series, the sequence can be decoded by knowing which color corresponds to which 3'-terminal base of the probe oligonucleotide.
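The correlation test for neighboring pixels can be sketched as follows, assuming each pixel is represented by its intensity trace across the aligned time series. The Pearson coefficient is one reasonable choice of correlation measure, and the 0.9 threshold is an illustrative assumption; neither is specified in the example.

```python
from math import sqrt

def pearson(trace_a, trace_b):
    """Pearson correlation of two equal-length intensity traces."""
    n = len(trace_a)
    mean_a = sum(trace_a) / n
    mean_b = sum(trace_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(trace_a, trace_b))
    var_a = sum((a - mean_a) ** 2 for a in trace_a)
    var_b = sum((b - mean_b) ** 2 for b in trace_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant trace carries no correlation information
    return cov / sqrt(var_a * var_b)

def same_bead(trace_a, trace_b, threshold=0.9):
    """Assign two neighboring pixels to one bead if their traces correlate."""
    return pearson(trace_a, trace_b) >= threshold

# Hypothetical traces: the first two pixels brighten and dim together.
pixel_1 = [1.0, 5.0, 2.0, 8.0]
pixel_2 = [2.0, 10.0, 4.0, 16.0]
pixel_3 = [8.0, 2.0, 5.0, 1.0]
print(same_bead(pixel_1, pixel_2), same_bead(pixel_1, pixel_3))
```

Because each bead carries a different template, pixels on different beads change color independently from cycle to cycle, which is what makes the time correlation discriminating.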

Example 11: throughput calculation

Generally, the throughput of a sequencing system depends primarily on the number of images that the machine can produce per day and the number of nucleotides (bases) in the sequence data for each image. Since the machine is preferably designed to keep the camera busy all the time, the calculation is based on 100% camera utilization. In embodiments where beads are imaged in 4 colors to determine the identity of a base, 4 images taken with one camera, 2 images taken with two cameras, or one image taken with 4 cameras may be used. Four camera imaging can significantly improve throughput compared to other options, with the preferred system utilizing this approach.

Our preliminary tests show that a pixel density of 50 pixels per bead (representing 5.4 square microns) can provide a suitable density for standard image analysis. With a 4 megapixel CCD camera (now common), one CCD frame can capture ~80,000 beads (based on our existing image data). Capturing four images with different cameras and moving to the next field of view on the flow cell takes no more than 1.5 seconds. If 75% of the beads produce useful information, we would collect about 80,000 beads × 0.75 / 1.5 seconds = 40,000 bases/second of raw sequence data.

One important issue in maintaining 100% camera utilization is matching the time taken by one ligation/cleavage chemistry cycle to the time required to image the entire flow cell. A reasonable estimate of the time taken for the extension, cleavage and ligation cycle is 1 1/2 hours (5,400 seconds). These 5,400 seconds will accommodate 1,800 image fields of view, or an area of about 15 mm × 45 mm, which is a suitable size for the flow cell. A conservative estimate of the throughput of a system with four cameras and a 15 mm × 45 mm flow cell is 40,000 bases per second. This equates to about 2,000 ABI3730xl sequencers, based on the throughput of 28 runs per day with a read length of about 650 bases (20 bases/second) that we achieve with ABI3730xl sequencers. Increasing the bead density by a factor of 2.5, to 200,000 beads per image, would increase the throughput to 100,000 bases/second, approximately equal to 5,000 ABI3730xl machines. At this throughput, the total output per day is approximately 8.6 Gb, so the time required to complete a 12X human genome sequence is 4.2 days.
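The throughput figures above can be reproduced with the following arithmetic sketch. Two inputs are assumptions not stated in this example: 96 capillaries per ABI3730xl run and a haploid human genome of about 3 × 10⁹ bases. All function names are illustrative.

```python
SECONDS_PER_DAY = 86_400

def beads_per_frame(camera_pixels, pixels_per_bead):
    """Beads captured in one frame at a given pixel density."""
    return camera_pixels / pixels_per_bead

def bases_per_second(beads_per_image, useful_fraction, seconds_per_field):
    """Raw throughput when each useful bead yields one base per field."""
    return beads_per_image * useful_fraction / seconds_per_field

def capillary_bases_per_second(runs_per_day, capillaries, read_length):
    """Throughput of a capillary sequencer (capillary count is an assumption)."""
    return runs_per_day * capillaries * read_length / SECONDS_PER_DAY

def days_for_coverage(rate, genome_bases, coverage):
    """Days needed to reach a given fold coverage at a given rate."""
    return genome_bases * coverage / (rate * SECONDS_PER_DAY)

rate = bases_per_second(80_000, 0.75, 1.5)
abi_rate = capillary_bases_per_second(28, 96, 650)
print(beads_per_frame(4_000_000, 50))              # beads per 4-megapixel frame
print(rate, round(abi_rate, 1))                    # bases/second for each system
print(round(100_000 * SECONDS_PER_DAY / 1e9, 2))   # Gb per day at 100,000 bases/s
print(round(days_for_coverage(100_000, 3e9, 12), 1))
```

Under these assumptions the sketch returns the 40,000 bases/second, roughly 20 bases/second per capillary instrument, 8.6 Gb per day and 4.2 days quoted in the text.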

It should be noted that the sequencing methods of the invention described herein can be implemented with a variety of different sequencing systems, image capture and processing methods, and the like. See, for example, U.S. patents 6,406,848 and 6,654,505 and PCT publication No. WO98053300 for details.

Example 12: method for preparing microparticles for synthesizing templates thereon

This example describes a method of preparing microparticles (magnetic beads in this example) with attached amplification primers for amplifying the template (e.g., by PCR), resulting in a clonal population of template molecules attached to each microparticle. Typically, the amplification beads carry one of the primers required for the clonal PCR reaction. This primer can be covalently coupled to the bead surface or bound to streptavidin on the bead surface, for example via a biotin label. The beads can be used in standard PCR reactions (e.g., in microtiter plate wells, test tubes, etc.), in the emulsion PCR reactions described in example 13, and the like, to obtain beads with attached clonal populations of template molecules.

Material

1x TE: 10 mM Tris (pH 8), 1 mM EDTA

1x PCR buffer (ThermoPol buffer, NEB):

20mM Tris-HCl(pH 8.8)

10mM KCl

10mM(NH₄)₂SO₄

2mM MgSO₄

0.1％Triton X-100

1 M betaine (added to 1x PCR-B buffer only)

1x binding and washing buffer

5mM Tris HCl(pH 7.5)

0.5mM EDTA

1M NaCl

DNA capture primer (20-mer, 500 μM stock solution)

Bis-biotin-(HEG)5-P1: 5'-Biotin-(HEG)5-CTA AGG TAG CGA CTG TCC TA-3'

(HEG)5 = hexaethylene glycol linker, comprising an 18-carbon spacer, one of many different spacer moieties that can be used. A spacer can be included with the P1 primer, for example, to lift the oligonucleotide off the surface of the bead. Any of the primers described herein can incorporate such a spacer moiety.

Dynal stock beads (1 μm diameter), 10 mg/ml (7-12 × 10⁶ beads/μl).

Method

1. Remove 50 μl of beads (~450 × 10⁶ beads).

2. Add 200 μl of 1x TE buffer and mix well. Separate with a magnet.

3. Wash 1 time with 200 μl of 1x TE buffer. Separate with a magnet.

4. Resuspend in 100 μl of B/W buffer.

5. Add P1 oligonucleotide (1500 pmol from the 500 μM stock).

6. Rotate at room temperature for > 30 minutes.

7. Wash 3 times with 200 μl of 1x TE buffer.

8. Resuspend in 50 μl (the starting volume) of 1x TE buffer.

9. Store the DNA capture beads at 4 °C or on ice until use. Beads should be used within 1 week (beads tend to aggregate when stored for > 1 week).

Example 13: method for performing PCR on microparticles in emulsion

This example describes a method that can be used to perform PCR on microparticles in an emulsion, resulting in microparticles with attached clonal templates. The microparticles (referred to as DNA beads in the nomenclature used below) are first functionalized with a first primer (P1). The second primer (P2) is present in the aqueous phase, where the PCR reaction takes place. The aqueous phase may also contain P1 at a low concentration, for example 20-fold lower, if desired. This allows rapid establishment of the template in the aqueous phase, which is the substrate for continued amplification. As the solution is depleted of P1, the reaction is forced to utilize the P1 attached to the microparticles. P1_P2degen10 is an oligonucleotide template (100 bp) that has sequences hybridizing to P1 and P2 for amplification by PCR and about 10 degenerate bases (incorporated during oligonucleotide synthesis) that confer a complexity of 4¹⁰ on the oligonucleotide population.

I. Emulsion protocol (1 μm bead)

1. Preparing an oil phase:

Span 80(7％)

Tween 80 (0.4%)

Prepared in light mineral oil

Use only freshly prepared oil phase

Total oil phase: 450 μl

2. Prepare the aqueous phase (estimated to yield 2 × 10⁹ droplets of 115 fL each):

Reagent (stock)	μl/reaction	Final
dH₂O	156.0	-
MgCl₂ Buffer (10X)	32.0	1X
dNTP (100 mM ea)	11.3	3.5 mM each
MgCl₂ (1 M)	7.3	23 mM
Betaine (5 M)	32.0	0.5 M
P1 (primer 1) (10 μM)	1.6	11.25 picomolar
P2 (primer 2) (200 μM)	40.0	5625 picomolar
P1_P2 degen10 (100 pM)	6.6	5.9 × 10⁷/μl
DNA beads (8M/μl)	25.0	150M/emulsion
Platinum Taq (5 U/μl)	9.0	0.28 U/μl

Total aqueous phase volume: 320 μl

Final reaction: 255 μl aqueous phase + 450 μl oil phase
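Several of the final concentrations listed for the aqueous phase follow from the dilution relation C1 × V1 = C2 × V2 applied to the 320 μl total volume, as the sketch below shows. It assumes the stock concentrations are 10X buffer, 100 mM dNTP, 1 M MgCl₂ and 5 M betaine, with the volumes as read from the table; the function name is illustrative.

```python
TOTAL_AQUEOUS_UL = 320.0

def final_concentration(stock_conc, volume_ul, total_ul=TOTAL_AQUEOUS_UL):
    """Concentration after dilution, in the same unit as stock_conc."""
    return stock_conc * volume_ul / total_ul

# (stock concentration, volume in microliters, unit of the result)
components = {
    "Buffer": (10.0, 32.0, "X"),
    "dNTP": (100.0, 11.3, "mM each"),
    "MgCl2": (1000.0, 7.3, "mM"),
    "Betaine": (5.0, 32.0, "M"),
}

for name, (stock, volume, unit) in components.items():
    print(name, round(final_concentration(stock, volume), 2), unit)
```

The computed values round to the listed 1X, 3.5 mM, 23 mM and 0.5 M.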

3. The aqueous tube was transferred to ice until the emulsion was added.

4. Add 450 μl of the oil phase to a 2 ml cryovial.

5. The cryovial was placed upright in a foam socket attached to an IKA vortex. The vortex was set at 2500 rpm.

6. Add the aqueous phase to the vortexing oil phase in aliquots (3 aliquots of 85 μl each, 255 μl in total). Dispense each aliquot by inserting the pipette tip into the vortexing 2 ml cryovial and slowly releasing the aqueous phase from the tip into the oil phase. Repeat the addition 2 more times with the remaining aqueous phase.

7. Continue vortexing the emulsion at 2500 rpm for 24 minutes.

8. Transfer ~100 μl aliquots of the emulsion to a 96-well plate (4 wells in total). At the same time, add an aliquot of the remaining aqueous phase (65 μl) to a separate well as a solution-based PCR control reaction. Seal the plate and cycle as described in the next section.

Emulsion amplification (1 μm bead)

1. PCR cycling parameters for 1 μm bead emulsion (primer Tm = 62 °C):

Program: DTB-PCR

94 °C, 2 minutes, n = 1

94 °C, 15 seconds

57 °C, 30 seconds, n = 100

70 °C, 60 seconds

55 °C, 5 minutes, n = 1

10 °C, hold

2. The cycle time was about 6 hours.

3. Inspect the emulsion after cycling. A successful emulsion shows a uniform amber color with no separate aqueous phase. A "broken" emulsion shows a distinct aqueous phase at the bottom of the tube. Avoid this phase, because the bead population in it is not clonal.

4. Evaluate the post-cycling emulsion by bright-field microscopy. Remove a 2 μl aliquot of the cycled emulsion and place it on a glass slide. Cover the emulsion sample with a 22 × 60 mm coverslip.

5. Observe the emulsion with a 20X objective. Preferably, the beads should be monodisperse, with the majority of droplets containing a single bead.

Note: if the emulsion sample contains a large number of multi-bead droplets, transfer the emulsion reaction to a 1.5 ml eppendorf tube and centrifuge at 6000 rpm for 15 seconds. Remove the bead suspension that collects at the bottom of the tube. This population consists of free beads and multi-bead droplets, which are heavier than single-bead droplets and therefore settle to the bottom of the tube after brief centrifugation. This bead population is not clonal and should therefore be removed before subsequent processing. Repeat steps 4 and 5 and re-evaluate the emulsion to confirm the integrity of the single-bead droplets in the emulsion sample.

6. Break the emulsion by the method described in the next section.

Emulsion breaking and melting (1 μm bead)

Bead Break Wash (BBW) buffer

2% Triton X-100; 2% Tween 20; 10 mM EDTA

Melting solution: 100 mM NaOH

1x TE: 10 mM Tris (pH 8), 1 mM EDTA

1x Bind and Wash (B/W) buffer

5mM Tris-HCl(pH7.5)

0.5mM EDTA

1M NaCl

1. Each emulsion group (4 aliquots) was poured into a 1.5ml eppendorf tube.

2. Add 800 μl of BBW buffer. Break the emulsion by vortexing the tube for 10 seconds.

3. Centrifuge at 8000 rpm for 2 minutes.

4. Remove the top 800 μl (mainly oil phase). The DNA beads sink to the bottom of the tube.

5. Add 800 μl of BBW, vortex, and centrifuge at 8000 rpm for 2 minutes. Remove the top 600 μl.

6. Wash the beads 2 times with 600 μl of 1x TE, using a magnet for each buffer exchange.

8. Add 50 μl of melting solution to the bead pellet and resuspend the sample by vigorous pipetting. Incubate the beads with the melting solution at room temperature for 5 minutes, flicking the tube intermittently.

9. Place the tube on a magnet and remove the melting solution. Wash 1 time with 100 μl of melting solution to ensure complete removal of the second strand.

10. Wash the bead pellet 2 times with 1x TE, then resuspend in 20 μl of TE buffer and store at 4 °C, or resuspend in 20 μl of 1x B/W buffer if the next step is enrichment. If the beads appear to aggregate, exchange them into 1x PCR-B buffer.

11. Proceed to the enrichment process (optional).

Example 14: method for enriching microparticles with a population of clonal templates attached thereto

This example describes a method for enriching microparticles for which template amplification was successful, for example, in a PCR emulsion. This method utilizes larger microparticles with attached capture oligonucleotides. The capture oligonucleotide comprises a region of nucleotides complementary to a region of nucleotides present in the template.

I. Emulsion enrichment (1 μm)

A. Preparation of enrichment beads (Capture entity)

Enriching beads:

Spherotech streptavidin-coated polystyrene beads (~6.5 μm)

Bead stock (0.5% w/v): 33,125 beads/μl

Per protocol: (33,125 beads/μl) × (800 μl) = 26.5 × 10⁶ beads

Usage:

1.19 × 10⁸ beads per emulsion; estimated emulsion clonality (2%): ~3M template-positive beads per emulsion. Adding 2-3 enrichment beads per predicted template-positive emulsion bead: ~1 million enrichment beads per emulsion reaction.

Enriched oligonucleotide (capture agent):

P2-enrichment (35-mer, Tm = 73 °C)

5'-Biotin-18-carbon spacer-ttaggaccgttatagttaggtgatgcattaccctg-3'

(OR)

P2-enrichment (e.g. up to 35-mer, Tm = 52 °C)

5'-Biotin-18-carbon spacer-ggtgatgcattaccctg-3'

Glycerol solution, 60% (v/v)

6ml of glycerol

4ml nuclease free H₂O

1. Remove 800 μl of beads, centrifuge at 13,000 rpm for 1 minute, and exchange into B/W buffer. Wash 1 time with 500 μl of B/W buffer and resuspend in 100 μl of B/W buffer.

2. Add 20 μl of enrichment oligonucleotide (500 μM stock; 10,000 pmol/rxn).

3. Rotate the beads at room temperature for 1 hour.

4. Wash the beads 3 times with 500 μl of 1x TE buffer. Pellet the beads by centrifugation at 13,000 rpm for 1 minute between washes.

5. Resuspend the beads in 25 μl of B/W buffer. Concentration = 1M enrichment beads/μl.

Note: four enriched emulsion populations are pooled into 20-30 μl of 1x B/W buffer to yield ~40M template-positive beads. Multiple slides can then be run.
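The bead and oligonucleotide quantities used in preparing the enrichment beads can be checked with a short sketch. It uses only the stock values given above and assumes no bead loss during washing; the function names are illustrative.

```python
def total_beads(beads_per_ul, volume_ul):
    """Number of beads in a given volume of stock."""
    return beads_per_ul * volume_ul

def beads_per_ul_after_resuspension(n_beads, volume_ul):
    """Bead concentration after resuspending in a new volume (no loss assumed)."""
    return n_beads / volume_ul

def oligo_pmol(stock_micromolar, volume_ul):
    """Amount of oligonucleotide in pmol (1 uM x 1 ul = 1 pmol)."""
    return stock_micromolar * volume_ul

n_beads = total_beads(33_125, 800)
print(n_beads)                                       # beads taken per protocol
print(beads_per_ul_after_resuspension(n_beads, 25))  # beads/ul in 25 ul
print(oligo_pmol(500, 20))                           # pmol of enrichment oligo
```

The result of about 1.06 × 10⁶ beads/μl is consistent with the stated concentration of 1M enrichment beads/μl.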

B. Enrichment step

1. Add 20 μl of enrichment beads to a tube containing the emulsion-derived beads (20 μl). Resuspend the bead mixture by gentle pipetting (alternatively, use a ratio of 2-3 enrichment beads per expected template-positive emulsion bead).

2. If enrichment beads coated with the biotinylated P2-enrichment primer are used, incubate the bead mixture at 65 °C for 2 minutes. Move the tube to ice for 10 minutes.

Note: preliminary experiments suggest that enrichment with enrichment beads carrying a primer sequence used in the 100 cycles of PCR (such as the P2 PCR primer) may be less efficient, because it can enrich for beads containing primer dimers that were driven onto the beads in template-free droplets. If enrichment beads loaded with the shorter P2-enrichment primer above are used, incubate the bead mixture at 50 °C for 2 minutes because of the lower Tm of this shorter primer.

3. Add the bead mixture to a 1.5 ml eppendorf tube containing 300 μl of 60% glycerol solution.

4. Centrifuge at 13,000 rpm for 1 minute.

5. After centrifugation, the negative beads sink to the bottom of the tube. The enrichment beads with attached template beads will float above the glycerol phase. The upper phase bead population was collected and transferred to a clean 1.5ml eppendorf tube.

Note: beads that sink to the bottom of the tube (beads without template) can be collected with a magnet, washed by the same protocol described for template-positive beads, and analyzed.

6. 1ml of nuclease-free H₂O was added to the beads collected from the upper phase to dilute the glycerol concentration. Resuspend bead mixture with gentle pipetting. Centrifuge at 13,000rpm for 1 minute.

7. After centrifugation, remove the supernatant and wash 2 times with 100 μl of TE.

8. Add 100 μl of melting solution to the washed bead pellet. Rotate the tube at room temperature for 5 minutes.

9. Add a further 100 μl of melting solution and separate the template beads with a magnet.

10. Wash twice with 100 μl of TE to remove the nonmagnetic enrichment beads, using a magnet to separate the DNA beads from the enrichment beads.

11. Resuspend the template beads in 10-20 μl of 1x TE. If the beads appear to aggregate, dilute them into 1x PCR-B buffer.

12. The template-containing beads can be mixed with other enriched populations and applied to a slide, as described in the next example.

Example 15: method for preparing microparticle arrays immobilized in or on semi-solid supports

This example describes the preparation of a slide in which microparticles with attached template are immobilized (e.g., embedded) in a semi-solid support located on the slide. Such a slide may be referred to as polony slide. The semi-solid support used in this example was polyacrylamide. One approach employs a method of confining a polymerase molecule near a template to enhance amplification.

Slide preparation

A. Glass slide: adhesion-silane treatment

The adhesion-silane facilitates adhesion of the polyacrylamide gel to the cover glass surface. The slides should be pretreated with adhesion-silane just prior to use.

Note:

The adhesion-silane solution is stored in a chemical fume hood.

Adhesion-silane is an irritant. Prepare the solution in a chemical laboratory.

Ensure that the adhesion-silane stock has not expired.

Transfer slides from the rack without touching the slide surface.

Preparation of adhesion-silane solution:

1. Add to a 1 L plastic container:

1 L dH₂O, 1 stir bar

Add 220 μl of concentrated acetic acid (to pH 3.5). Add 4 ml of the adhesion-silane reagent and mix the solution on a stir plate for > 15 minutes.

Treating the glass slide:

2. Load the slides (all facing the same direction) onto an inverted plastic 384-well plate.

3. Wash the slides with dH₂O and pour off the dH₂O.

4. Wash with 100% ethanol and pour off the ethanol.

5. Wash again with dH₂O, pour off the dH₂O, and place the slides in a tissue culture hood with the blower and UV lamp running. Allow the washed slides to dry (~30 minutes).

6. Place the plate in a plastic container and cover the slides with adhesion-silane solution.

7. Allow the solution and slides to react for 1 hour. Shake the container intermittently to ensure that the adhesion-silane coats the glass evenly.

8. After incubation, wash the slides 3 times with dH₂O.

9. Wash once with 100% ethanol and pour off the ethanol.

10. The slides were thoroughly dried just before use.

11. The adhesion-silane treated slides were stored in a desiccator.

B. Acrylamide-based slide (Small mask)

Non-capture protocol

All reagents were placed on ice. The following pre-cooling reagents were added to a 1.5ml eppendorf tube:

Reagent	amt (μl), 2 slides	amt (μl), 1 slide
1x TE	13	6.5
Beads (1-3M, diluted in 1x TE)	10	5
Rhinohide	1	0.5
40% acrylamide:bisacrylamide (19:1, F/S)	5	2.5
TEMED (5%, prepared in 1x TE)	2	1
APS (0.5%, freshly prepared)	3	1.5
Total	34 μl	17 μl
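The gel mix scales linearly with the number of slides, which the following sketch makes explicit. The per-slide volumes are those tabulated above; the dictionary keys and function names are illustrative, and no pipetting overage is included.

```python
# Per-slide volumes in microliters, taken from the table above.
PER_SLIDE_UL = {
    "1x TE": 6.5,
    "Beads": 5.0,
    "Rhinohide": 0.5,
    "40% acrylamide:bisacrylamide": 2.5,
    "TEMED (5%)": 1.0,
    "APS (0.5%)": 1.5,
}

def scale_mix(n_slides):
    """Return the volume of each reagent needed for n_slides slides."""
    if n_slides < 1:
        raise ValueError("at least one slide is required")
    return {name: volume * n_slides for name, volume in PER_SLIDE_UL.items()}

def total_volume(mix):
    """Total volume of a mix in microliters."""
    return sum(mix.values())

for slides in (1, 2):
    print(slides, total_volume(scale_mix(slides)))
```

The totals for one and two slides reproduce the 17 μl and 34 μl given in the table.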

Pipette the mixture vigorously to disperse the beads.

Add 17 μl per slide under the coverslip.

Allow polymerization to proceed at room temperature for 60 minutes with the slide inverted.

Remove the coverslip with a clean blade.

Soak and wash the slides 2 times over 15 minutes in Wash 1E buffer (to remove unbound beads).

The bead-embedded slides can be stored in Wash 1E at 4 °C.

2. Hybridize fluorophore-labeled sequencing primers to the embedded bead population. Equilibrate the slides from Wash 1E into 1x PCR-B buffer by briefly dipping them into a Coplin jar containing 1x PCR-B buffer.

3. In a 1.5 ml eppendorf tube, add 1-6 μl of primer (100 μM stock) to 99 μl of 1x PCR buffer. Pipette 100 μl of the primer solution onto the acrylamide matrix and cover with a coverslip or a sealing gasket.

4. Heat the slide using the <DEVIN> program (65 °C for 2 minutes, then slow annealing to 30 °C) to allow the primers to hybridize to the embedded beads. Wash the slides 2 times for 2 minutes with Wash 1E. The slides are now ready for ligation-based sequencing.

Capture protocol

1. Prepare ssDNA template beads at 1M/μl. [Polony slides are prepared with 4-5M beads per slide.]

2. Resuspend the bead mixture in 30 μl of 1x PCR buffer.

3. Add 1 μl of sequencing primer (100 μM stock) and mix thoroughly.

4. Heat to 65 °C for 2 minutes.

5. Move to ice for 5 minutes.

6. Wash 3 times with 80 μl of 1x TE.

7. All solutions were removed with a magnet.

8. The following reagents were added:

Reagent	amt (μl), 2 slides
1x buffer	1.5
10x buffer	2.0
High Concentration (HC) enzyme	16.0
40% acrylamide:bisacrylamide (19:1, F/S)	14.4
Rhinohide	2.0
TEMED (5%, prepared in 1x TE)	2.0
APS (0.5%, freshly prepared)	1.5
Total	39.4 μl

Pipette the mixture to disperse the beads.

Add 17 μl per slide under the coverslip.

9. Polymerize, preferably with the slide inverted, for example using the <Pol-1> cycling program on an MJ Research Tetrad PCR instrument.

10. Remove the coverslip with a clean blade. Soak and wash the slides 2 times for 10 minutes in Wash 1E buffer (to remove unbound beads).

11. The polony slides are now ready for ligation-based sequencing.

12. Bead-embedded polony slides can be stored at 4 °C in Wash 1E, within a gasket.

Example 16: method for preparing microparticle arrays attached to solid supports

This example describes the preparation of a slide in which template-attached microparticles on the slide are attached to a solid support.

1. Slides prepared with polymer tethers with reactive NHS were stored at-20 ℃.

(slide H, product No. 1070936; Schott Nexterion; Schott North America, Inc., Elmsford, NY)

2. Slides were equilibrated to room temperature in the presence of a desiccant just prior to use.

3. The slide was washed with 50ml of 1xPBS (300mM sodium phosphate, pH8.7) for 5 minutes. The washing was repeated 2 times.

4. The slide was removed from the solution and covered with an adhesive gasket (for sample loading).

5. In a separate tube, add an aliquot of 1-4 billion protein-coated or DNA-coated beads to 1x PBS, pH 8.7. The DNA may be, for example, a DNA template for sequencing. The DNA may include, for example, an amine linker that reacts with NHS.

6. Bead samples were washed 3 times with 1xPBS, pH8.7 by buffer exchange.

7. The beads were resuspended in 125ml of 1xPBS, pH 8.7.

8. The bead solution was added to the slide gasket to uniformly coat the slide surface.

9. Slides were sealed in a dark room and the reaction was incubated at room temperature for 1-2 hours.

10. After incubation, unbound bead solution was removed and the slides were transferred to 50ml of 1 × TE (10 mM Tris, 1mM EDTA, pH 8).

11. The slides were washed 5 times with 50ml of 1 × TE, with constant stirring for 15 minutes for each wash.

12. Slides can be stored at 4 ℃ in 1XTE for several weeks.

13. If desired, bead populations can be assessed by White Light (WL) brightfield image analysis or fluorescence using complementary DNA oligonucleotides linked to fluorophore-based dyes. The DNA template may be sequenced, for example, by a ligation-based sequencing method.

FIG. 33A shows a schematic of a slide with attached beads.

It should be noted that only a small fraction of the DNA template molecules are attached to the slide. One micron bead (Dynabeads MyOne streptavidin beads; Dynal Biotech, Inc., product number 650.01) was used. However, various beads may be used.

Figure 33B shows a bead population attached to a slide. The lower panels show the same area of the slide under white light (left) and fluorescence microscopy (right). The upper panels show the range of bead densities.

Equivalents and ranges

Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. The scope of the invention is not limited to the above description but is as set forth in the appended claims. In the appended claims, articles such as "a," "an," and "the" may refer to one or more than one unless otherwise indicated herein or otherwise clearly contradicted by context. Claims or descriptions that include "or" between one or more members of a group are considered satisfied if one, more than one, or all of the group members are present in, used in, or otherwise associated with a given product or process, unless stated otherwise or otherwise evident from the context.

Furthermore, it is to be understood that the invention includes all modifications, combinations, and substitutions introducing one or more limitations, elements, clauses, descriptive terms, etc. from one or more of the listed claims into another claim. In particular, any claim dependent on another claim may be adapted to include one or more limitations present in any other claim dependent on the same basic claim.

Further, it is to be understood that any one or more embodiments may be explicitly excluded from the claims, even if the specific exclusion is not explicitly listed herein. It is also understood that when the specification and/or claims disclose reagents for sequencing (e.g., templates, microspheres, probes, probe families, etc.), such disclosure also includes methods of sequencing with the reagents according to the particular methods described herein or other methods known in the art, unless one of ordinary skill in the art would understand them differently or would describe them differently in the specification. In addition, where the specification and/or claims disclose sequencing methods, any one or more of the reagents described herein can be used in the methods, unless one of ordinary skill in the art would understand it differently, or the use of the reagents in such methods is explicitly excluded from the specification. It will also be appreciated that where specific components for sequencing are disclosed in the specification or claims, the invention also includes methods of making such reagents. The term "component" is used broadly to refer to any item used for sequencing, including templates, microparticles with attached templates, libraries, and the like. Further, the drawings are an integral part of the specification, and the invention includes structures such as the template-attached microparticles shown in the drawings and the methods described in the drawings.

Where ranges are given herein, endpoints are included. Moreover, it is to be understood that unless otherwise indicated or otherwise evident from the context and understanding of one of ordinary skill in the art, values that are expressed as ranges in different embodiments of the invention can assume any specific value or subrange within the stated range, to the tenth of the unit of the lower limit of the range, unless the context clearly dictates otherwise.

Claims

1. A method of identifying a nucleotide sequence within a template polynucleotide, the method comprising the steps of:

(a) extending an initiator oligonucleotide along the template polynucleotide by ligating an oligonucleotide probe to the initiator oligonucleotide to form an extended duplex, wherein the oligonucleotide probe comprises a phosphorothioate linkage;

(b) identifying one or more nucleotides of the polynucleotide; and

(c) repeating steps (a) and (b) until the nucleotide sequence is determined.

2. The method of claim 1, wherein the identifying step comprises detecting a label attached to the most recently ligated oligonucleotide probe.

3. The method of claim 1, further comprising the step of cleaving the phosphorothioate linkages with a cleavage agent containing an atom selected from Ag, Hg, Cu, Mn, Zn, or Cd to generate extendable probe ends.

4. The method of claim 3, wherein the cleavage agent is AgNO₃.

5. The method of claim 1, wherein the extending step is performed in or on a semi-solid support.

6. A method of determining a sequence of nucleotides in a template polynucleotide, the method comprising the steps of:

(a) Providing a probe-template duplex formed by hybridization of a probe having an extendable terminus to a template polynucleotide;

(b) ligating an extension oligonucleotide probe to the extendable terminus to form an extended duplex comprising an extended oligonucleotide probe, wherein the extension probe comprises a phosphorothioate linkage;

(c) identifying in the extended duplex at least one nucleotide within the template polynucleotide that is (1) complementary to the just-ligated extension probe or (2) a nucleotide residue immediately downstream of the extended oligonucleotide probe;

(d) generating an extendable end on the extension oligonucleotide probe if no extendable end is present, such that the generated end is different from the end to which the previous extension probe was ligated; and

(e) repeating steps (b), (c) and (d) until the nucleotide sequence within the template polynucleotide is determined.

7. The method of claim 6, wherein each extension probe comprises a non-extendable moiety at one end.

8. The method of claim 6, wherein the identifying step comprises detecting a label attached to the most recently ligated extension probe.

9. The method of claim 6, wherein said identifying step comprises removing said non-extendable moiety and extending said extended oligonucleotide probe with a nucleic acid polymerase in the presence of one or more labeled chain terminating nucleoside triphosphates.

10. The method of claim 6, further comprising the step of capping the extended oligonucleotide probe when no extension probe is ligated to the extendable terminus in the ligating step.

11. The method of claim 6, wherein the generating step comprises cleaving the phosphorothioate linkage with a cleavage agent containing an atom selected from Ag, Hg, Cu, Mn, Zn, or Cd.

12. The method of claim 11, wherein the cleavage agent is AgNO₃.

13. The method of claim 6, wherein the ligating and generating steps are performed in or on a semi-solid support.

14. The method of claim 6, wherein step (a) comprises providing in separate aliquots a plurality of different probe-template duplexes, each duplex containing an initiator oligonucleotide probe hybridized to a template polynucleotide, wherein the template polynucleotide in each duplex is the same, but the initiator oligonucleotide probe in each duplex binds to a different sequence of the template polynucleotide; and wherein steps (b) - (e) are performed independently for each aliquot.

15. The method of claim 14, wherein for each aliquot, one end of the extension oligonucleotide probe comprises a non-extendable moiety.

16. The method of claim 15, wherein for each aliquot, the identifying step comprises detecting a label attached to the most recently ligated extension probe.

17. The method of claim 15, wherein for each aliquot, the identifying step comprises removing the non-extendable moiety and extending the extended oligonucleotide probe with a nucleic acid polymerase in the presence of one or more labeled chain terminating nucleoside triphosphates.

18. The method of claim 15, further comprising the step of capping the extended oligonucleotide probe when no extension probe is ligated to the extendable terminus in the ligating step.

19. The method of claim 15, wherein the generating step comprises cleaving the phosphorothioate linkage with a cleavage agent containing an atom selected from Ag, Hg, Cu, Mn, Zn, or Cd.

20. The method of claim 19, wherein the cleavage agent is AgNO₃.

21. The method of claim 15, wherein the ligating and generating steps are performed in or on a semi-solid support.

22. The method of claim 6, further comprising the steps of: (f) removing the ligated probes and the initiator oligonucleotide from the template; (g) repeating step (a) with a second initiator oligonucleotide that binds to a different sequence of the template polynucleotide; and (h) repeating steps (b) - (e).

23. The method of claim 22, wherein the method is repeated a plurality of times, each time with an initiator oligonucleotide that binds to a different sequence of the template polynucleotide.

24. The method of claim 23, wherein one end of the extension probe comprises a non-extendable moiety.

25. The method of claim 23, wherein in each iteration, the identifying step comprises detecting a label attached to the most recently ligated extension probe.

26. The method of claim 23, wherein in each iteration, the identifying step comprises removing the non-extendable moiety and extending the extended oligonucleotide probe with a nucleic acid polymerase in the presence of one or more labeled chain terminating nucleoside triphosphates.

27. The method of claim 23, further comprising the step of capping the extended oligonucleotide probe when no extension probe is ligated to the extendable terminus in the ligating step.

28. The method of claim 23, wherein the generating step comprises cleaving the phosphorothioate linkage with a cleavage agent containing an atom selected from Ag, Hg, Cu, Mn, Zn, or Cd.

29. The method of claim 28, wherein the cleavage agent is AgNO₃.

30. The method of claim 23, wherein the ligating and generating steps are performed in or on a semi-solid support.

31. The method of claim 22, wherein the removing step comprises contacting the ligated probes, the initiator oligonucleotide and the template with an aqueous solution comprising about 1.0-3.0% SDS, 100 mM NaCl and 5-15 mM sodium bisulfate (NaHSO₄).

32. The method of claim 22, wherein the removing step comprises contacting the ligated probes, the initiator oligonucleotide and the template with a solution comprising about 2% SDS, 200 mM NaCl and 10 mM sodium bisulfate (NaHSO₄).

33. A method of identifying a sequence of nucleotides in a template polynucleotide attached to a support at a point of attachment, the method comprising the steps of:

(a) extending an initiator oligonucleotide along the template polynucleotide by ligating an oligonucleotide probe thereto to form an extended duplex, wherein extension is along the template towards its point of attachment to the support;

(b) identifying one or more nucleotides of the polynucleotide; and

(c) repeating steps (a) and (b) until the nucleotide sequence is determined.

34. The method of claim 33, wherein the identifying step comprises detecting a label attached to the most recently ligated extension probe.

35. The method of claim 33, wherein said oligonucleotide probe comprises a phosphorothioate linkage and said generating step comprises cleaving said phosphorothioate linkage with a cleavage agent comprising an atom selected from the group consisting of Ag, Hg, Cu, Mn, Zn, or Cd.

36. The method of claim 35, wherein the cleavage agent is AgNO₃.

37. The method of claim 35, wherein the ligating and generating steps are performed in or on a semi-solid support.

38. A method of determining the sequence of nucleotides in a template polynucleotide attached to a support at a point of attachment, the method comprising the steps of:

(a) providing a probe-template duplex formed by hybridization of a probe to a template polynucleotide, said probe having an extendable terminus;

(b) ligating an extension oligonucleotide probe to the extendable terminus, forming an extended duplex containing an extended oligonucleotide probe;

39. The method of claim 38, wherein each extension probe comprises a non-extendable moiety at one terminus.

40. The method of claim 38, wherein the identifying step comprises detecting a label attached to the most recently ligated extension probe.

41. The method of claim 38, wherein said identifying step comprises removing said non-extendable moiety and extending said extended oligonucleotide probe with a nucleic acid polymerase in the presence of one or more labeled chain terminating nucleoside triphosphates.

42. The method of claim 38, further comprising the step of capping the extended oligonucleotide probe when no extension probe is ligated to the extendable terminus in the ligating step.

43. The method of claim 38, wherein the oligonucleotide probe comprises a phosphorothioate linkage and the generating step comprises cleaving the phosphorothioate linkage with a cleavage agent comprising an atom selected from Ag, Hg, Cu, Mn, Zn, or Cd.

44. The method of claim 43, wherein the cleavage agent is AgNO₃.

45. The method of claim 38, further comprising the steps of: (f) removing the ligated probes and the initiator oligonucleotide from the template; (g) repeating step (a) with a second initiator oligonucleotide that binds to a different sequence of the template polynucleotide; and (h) repeating steps (b) - (e).

46. The method of claim 45, wherein the method is repeated a plurality of times, each time with an initiator oligonucleotide that binds to a different sequence of the template polynucleotide.

47. The method of claim 38, wherein the ligating and generating steps are performed in or on a semi-solid support.

48. The method of claim 38, wherein the template is attached to a microparticle attached to a substantially flat rigid substrate.

49. A method of identifying a nucleotide sequence within a template polynucleotide, the method comprising the steps of:

(a) providing template polynucleotides attached to microparticles immobilized in or on a semi-solid support;

(b) extending an initiator oligonucleotide along the template polynucleotide by ligating an oligonucleotide probe to the initiator oligonucleotide to form an extended duplex, wherein the oligonucleotide probe comprises a scissile linkage;

(c) identifying one or more nucleotides of the polynucleotide; and

(d) repeating steps (b) and (c) until the nucleotide sequence is determined.

50. The method of claim 49, wherein the extending step is performed on the semi-solid support.

51. The method of claim 49, wherein the template is attached to a microparticle attached to a substantially flat rigid substrate.

52. A method of determining a sequence of nucleotides in a template polynucleotide, the method comprising the steps of:

(a) providing a probe-template duplex formed by hybridization of a probe having an extendable end to a template polynucleotide, the probe-template duplex being attached to a microparticle embedded in or on a semi-solid support;

(b) ligating an extension oligonucleotide probe to the extendable terminus, forming an extended duplex comprising an extended oligonucleotide probe, wherein the extension probe comprises a phosphorothioate linkage;

53. The method of claim 52, wherein the ligating and generating steps are performed in the semi-solid support.

54. The method of claim 52, wherein the template is attached to a microparticle attached to a substantially flat rigid substrate.

55. A method of determining a sequence of nucleotides in a template polynucleotide, the method comprising the steps of:

(a) amplifying the template polynucleotide molecules in the emulsion chamber in the presence of the microparticles, producing microparticles to which a clonal population of template polynucleotides are attached;

(b) recovering the microparticles from the emulsion;

(c) embedding the microparticles in or on a semi-solid support;

(d) extending an initiator oligonucleotide along the template polynucleotide by ligating an oligonucleotide probe to the initiator oligonucleotide to form an extended duplex, wherein the oligonucleotide probe comprises a scissile linkage;

(e) identifying one or more nucleotides of the polynucleotide; and

(f) repeating steps (d) and (e) until the nucleotide sequence is determined.

56. The method of claim 55, wherein (i) a plurality of template polynucleotide molecules comprising different sequences are amplified in a single emulsion compartment; (ii) a plurality of microparticles are recovered from said emulsion and embedded in or on said support, each microparticle having attached thereto a clonal population of template polynucleotides, wherein said clonal populations have different sequences; and (iii) steps (d), (e) and (f) are performed in parallel on said clonal populations attached to said embedded microparticles, so as to determine a plurality of sequences in parallel.

57. A method for determining nucleotide sequence information within a template polynucleotide using a first set of at least two families of differentially labeled oligonucleotide probes, the method comprising the steps of:

(a) extending an initiator oligonucleotide along the template polynucleotide by ligating an oligonucleotide probe to the initiator oligonucleotide to form an extended duplex, wherein the oligonucleotide probe is a member of the collection of differentially labeled oligonucleotide probe families;

(b) detecting a label attached to the oligonucleotide probe;

(c) repeating steps (a) and (b) until an ordered list of probe family names is obtained; and

(d) excluding one or more possible nucleotide sequences using the ordered list of probe family names.

58. The method of claim 57, wherein step (d) comprises decoding the ordered list of probe family names to determine the sequence.

59. The method of claim 57, which comprises providing a probe-template duplex formed by hybridisation of an initiator oligonucleotide probe to a template polynucleotide, said probe having an extendable terminus, wherein said extending step comprises ligating an oligonucleotide probe to said extendable terminus to form an extended duplex containing an extended oligonucleotide probe, and further comprising the step of capping the remaining extendable termini if no oligonucleotide probe has been ligated to said extendable terminus in said extending step.

60. The method of claim 57, wherein each probe family comprises a non-extendable moiety at one end of the oligonucleotide probe.

61. The method of claim 57, further comprising, after each detecting step: (f) generating an extendable terminus on the most recently ligated oligonucleotide probe if no extendable terminus is present, such that the generated terminus is different from the terminus to which the most recently ligated oligonucleotide probe was ligated.

62. The method of claim 61, wherein said oligonucleotide probe comprises a phosphorothioate linkage, and said phosphorothioate linkage is cleaved with a cleavage agent comprising an atom selected from the group consisting of Ag, Hg, Cu, Mn, Zn, or Cd, thereby generating said extendable probe terminus.

63. The method of claim 62, wherein the cleavage agent is AgNO₃.

64. The method of claim 57, wherein the extending step is performed in or on a semi-solid support.

65. The method of claim 57, wherein the template is attached to a microparticle attached to a substantially flat rigid substrate.

66. The method of claim 57, wherein said collection comprises 2 differentially labeled probe families.

67. The method of claim 57, wherein said collection comprises 3 differentially labeled probe families.

68. The method of claim 57, wherein said collection comprises 4 differentially labeled probe families.

69. The method of claim 57, wherein said collection comprises more than 4 differentially labeled probe families.

70. The method of claim 57, wherein the oligonucleotide probes comprise constrained portions whose nucleosides are not independently selected, and wherein oligonucleotide probes that differ in the sequence of their constrained portions are assigned to probe families according to a coding scheme.

71. The method of claim 57, wherein said oligonucleotide probes are assigned to the first, second, third and fourth probe families according to one of the 24 coding schemes listed in Table 1.
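As a hedged illustration of how probes with different constrained dinucleotides might be assigned to four probe families, the sketch below groups the 16 dinucleotides by the XOR of assumed 2-bit base codes. Table 1 is not reproduced in this section, so this particular assignment, and the names `BASE_CODE` and `build_families`, are illustrative assumptions rather than a scheme confirmed by the source. The sketch also ignores the fact that a ligated probe is complementary to the template.

```python
from itertools import product

# Assumed 2-bit codes for the four bases (illustrative; not taken from Table 1).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def build_families():
    """Group the 16 dinucleotides XY into 4 families by XOR of their base codes."""
    families = {name: [] for name in range(4)}
    for x, y in product("ACGT", repeat=2):
        families[BASE_CODE[x] ^ BASE_CODE[y]].append(x + y)
    return families

families = build_families()
for name, members in families.items():
    print(name, members)
```

Under this assumed assignment each family holds four dinucleotides, so a family name alone does not fix either base, but a family name plus the identity of one base fixes the other.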

72. The method of claim 58, wherein the identity of at least one nucleotide in the template is known, and wherein the decoding step comprises:

(i) assigning an identity to the nucleotide adjacent to the nucleotide of known identity in the template by determining which identity is consistent with the possible sequences of the constrained portion of the probe whose proximal nucleotide was ligated opposite the nucleotide adjacent to the nucleotide of known identity;

(ii) assigning an identity to a subsequent nucleotide by determining which identity is consistent with the possible sequences of the constrained portion of the probe whose proximal nucleotide was ligated opposite said subsequent nucleotide; and

(iii) repeating step (ii) until the sequence is determined.

73. The method of claim 58, further comprising the steps of:

(a) determining the identity of a nucleotide in the template such that the nucleotide has a known identity, wherein the decoding step comprises:

(iii) repeating step (ii) until the sequence is determined.

74. The method of claim 73, wherein said determining step comprises contacting a template-probe duplex with a labeled nucleotide in the presence of a polymerase under conditions that enable incorporation of said labeled nucleotide if said labeled nucleotide is complementary to said template at a position adjacent to said duplex.

75. The method of claim 58, wherein the decoding step comprises: generating at least one candidate sequence from the ordered list of probe family names; and selecting a candidate sequence as the nucleotide sequence of the template.

76. The method of claim 75, wherein said generating step comprises generating at least 4 candidate sequences.

77. The method of claim 75, wherein the generating step comprises:

(i) assuming an identity for the first nucleotide of the nucleotide sequence;

(ii) determining the possible identity of the adjacent nucleotide based on the probe family name corresponding to the first nucleotide, thereby assigning an identity to the nucleotide adjacent to the first nucleotide;

(iii) determining the possible identity of the subsequent nucleotide based on the probe family name corresponding to the nucleotide whose identity was most recently assigned, thereby assigning an identity to the subsequent nucleotide;

(iv) repeating step (iii) until a candidate sequence is generated; and

(v) repeating steps (i) - (iv), wherein in each iteration a different identity is assumed for the first nucleotide, until the desired number of candidate sequences is generated.

78. The method of claim 75, wherein the step of selecting comprises comparing at least one candidate sequence to one or more known sequences and selecting the candidate sequence that has a predetermined degree of identity with, or is closest to, the one or more known sequences.

79. The method of claim 78, wherein the template is derived from an organism of interest, and wherein the step of comparing comprises comparing at least one candidate sequence to sequences in a database containing sequences obtained from the organism.

80. The method of claim 78, wherein the comparing step comprises comparing at least one candidate sequence to sequences in a database comprising a plurality of comparison sequences, each sequence comprising a different potential sequence of the test polynucleotide sequence.
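The selection described in claims 78 to 80 can be sketched as a comparison of each candidate against a set of known sequences. The position-by-position identity measure, the `min_identity` threshold and the function names are assumptions chosen for brevity; the claims do not prescribe a particular comparison algorithm, and a real pipeline would likely use a proper alignment.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter of the two sequences."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def select_candidate(candidates, known_sequences, min_identity=0.0):
    """Return (candidate, score) for the candidate closest to any known sequence,
    or None if even the best score is below the assumed threshold.
    known_sequences must be non-empty."""
    scored = [(max(identity(c, k) for k in known_sequences), c) for c in candidates]
    best_score, best = max(scored)
    return (best, best_score) if best_score >= min_identity else None

candidates = ["ACCG", "CAAT", "GTTA", "TGGC"]
print(select_candidate(candidates, ["ACCT"]))
```

Because the alternative candidates produced from one ordered list usually differ from each other at most positions, even a loose comparison against a reference tends to single out one of them.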

81. The method of claim 75, wherein said selecting step comprises:

(i) obtaining an ordered list of second probe family names from the template with a second set of differentially labeled coded probe families, wherein the codes for the probe families in the second set of probe families are different from the codes for the probe families in the first set of probe families;

(ii) generating at least one comparison sequence from the ordered list of second probe family names;

(iii) comparing a portion of at least one of the candidate sequences to a portion of at least one of the comparison sequences; and

(iv) selecting as the nucleotide sequence of said template a candidate sequence whose portion compared in step (iii) has a predetermined degree of identity with, or is closest to, the comparison sequence.

82. The method of claim 81, wherein the compared portion is a dinucleotide.

83. The method of claim 81, wherein said ordered list of second probe family names contains only one element.

84. The method of claim 57, wherein the oligonucleotide probes in each probe family have the structure 5'-(XY)(N)_k N_B*-3' or 3'-(XY)(N)_k N_B*-5', wherein N represents any nucleoside; N_B represents a moiety that cannot be extended by a ligase; * represents a detectable moiety; XY is a constrained portion of the probe, in which X and Y represent nucleosides that are the same or different but are not independently selected; X and Y are at least 2-fold degenerate; at least one internucleoside linkage is a scissile linkage; and k is from 1 to 100; with the proviso that the detectable moiety may be present on Y or on any internal nucleoside of (N)_k, in addition to or instead of being present on N_B.

85. The method of claim 84, wherein the scissile linkage is a phosphorothioate linkage.

86. The method of claim 84, wherein the detectable moiety is linked by a cleavable linker, is photobleachable, or both.

87. The method of claim 86, wherein said cleavable linker comprises a disulfide bond.

88. The method of claim 84, wherein 4 families of differentially labeled oligonucleotide probes are used, wherein oligonucleotide probes that differ in the sequence of their constrained portions are assigned to the first, second, third and fourth probe families according to one of the 24 coding schemes listed in Table 1.

89. The method of claim 57, wherein the detecting step comprises simultaneously obtaining an average of 2 bits of information from each of at least 2 nucleotides in the template, and not obtaining two bits of information from any single nucleotide.

90. The method of claim 57, wherein the detecting step comprises simultaneously obtaining less than 2 bits of information from each of at least 2 nucleotides in the template.

91. A method for determining nucleotide sequence information within a template polynucleotide using a first set of at least two families of differentially labeled oligonucleotide probes, the method comprising the steps of:

(a) contacting a probe-template complex with at least two families of differentially labelled oligonucleotide probes to hybridise an oligonucleotide probe, said probe-template complex comprising a double stranded portion with an extendable end and a single stranded portion to be sequenced, said oligonucleotide probe comprising a portion complementary to the portion of the template immediately adjacent to said duplex portion;

(b) ligating the hybridized oligonucleotide probe to the extendable terminus, thereby generating a probe-template complex comprising an extended duplex;

(c) detecting a label attached to the ligated probe;

(d) generating an extendable probe terminus on the extended duplex if no extendable probe terminus is present; and

(e) repeating steps (a) - (d) until an ordered list of probe family names is obtained.

92. The method of claim 91, wherein the detecting step comprises simultaneously obtaining an average of 2 bits of information from each of at least 2 nucleotides in the template, and not obtaining two bits of information from any single nucleotide.

93. The method of claim 91, wherein said detecting step comprises simultaneously obtaining less than 2 bits of information from each of at least 2 nucleotides in said template.

94. A method for determining nucleotide sequence information for a template polynucleotide using a first collection of oligonucleotide probe families, the method comprising the steps of:

(a) performing successive cycles of extension, ligation, detection and cleavage, wherein the detecting step comprises simultaneously obtaining an average of 2 bits of information from each of at least 2 nucleotides in the template, without obtaining 2 bits of information from any single nucleotide; and

(b) combining the information obtained in step (a) with at least one bit of other information to determine the sequence.

95. The method of claim 94, wherein said at least one other bit of information comprises an item of information selected from the group consisting of: the identity of a nucleotide in the template; information obtained by comparing a candidate sequence to at least one known sequence; and information obtained by repeating the method using a second collection of oligonucleotide probe families.

96. A method of distinguishing single nucleotide polymorphisms from sequencing errors, the method comprising the steps of:

(a) sequencing a plurality of templates using the method of claim 58, wherein said templates represent overlapping segments of a nucleic acid sequence;

(b) aligning the sequences obtained in step (a); and

(c) determining that differences between the sequences represent sequencing errors if the sequences are substantially identical over a first portion and significantly different over a second portion, the portions being at least 3 nucleotides in length.

97. A method of distinguishing single nucleotide polymorphisms from sequencing errors, the method comprising the steps of:

(a) performing steps (a) - (c) of claim 58 with a plurality of templates representing overlapping fragments of a single nucleic acid sequence, thereby obtaining a plurality of ordered lists of probe families;

(b) aligning the ordered lists of probe families obtained in step (a) to obtain an aligned region in which the lists are at least 90% identical; and

(c) determining that a difference between the ordered lists of probe families represents a sequencing error if the lists differ at only one position in the aligned region; or

(d) determining that a difference between the ordered lists of probe families represents a single nucleotide polymorphism if the lists differ at two or more adjacent positions of the aligned region.
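The rule in steps (c) and (d) of claim 97 can be sketched by scanning two aligned ordered lists for runs of differing positions. The sketch assumes the lists are already aligned and equal in length (the 90% identity check is taken as done upstream), and it uses an illustrative XOR-based encoding that is not taken from Table 1; `encode` and `classify_differences` are hypothetical names.

```python
BASES = "ACGT"  # assumed: index is the 2-bit base code, family name is the XOR

def encode(sequence):
    """Ordered list of family names for overlapping dinucleotides."""
    return [BASES.index(a) ^ BASES.index(b) for a, b in zip(sequence, sequence[1:])]

def classify_differences(list_a, list_b):
    """Return (start_position, label) for each run of differing positions."""
    if len(list_a) != len(list_b):
        raise ValueError("lists must be aligned to the same length")
    runs, start = [], None
    for i, (a, b) in enumerate(zip(list_a, list_b)):
        if a != b and start is None:
            start = i                         # a run of differences begins
        elif a == b and start is not None:
            runs.append((start, i - start))   # the run ended at position i - 1
            start = None
    if start is not None:
        runs.append((start, len(list_a) - start))
    return [(pos, "sequencing error" if length == 1 else "candidate SNP")
            for pos, length in runs]

reference = encode("ACCGTA")
print(classify_differences(reference, encode("ACCTTA")))   # one base changed
print(classify_differences(reference, [1, 0, 3, 2, 3]))    # one family name changed
```

Under such an encoding a single changed base alters the two overlapping dinucleotides that contain it, so a true polymorphism shows up as two adjacent differences, while an isolated difference is more plausibly a measurement error.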

98. A collection of at least two families of differentially labeled oligonucleotide probes, wherein the probes of each probe family comprise a constrained portion and an unconstrained portion, the constrained portion being at least 2-fold degenerate at each position, and the probes of each family comprising a scissile internucleoside linkage.

99. The collection of differentially labeled oligonucleotide probe families of claim 98, wherein each probe comprises a ligase non-extendable terminus.

100. The collection of differentially labeled oligonucleotide probe families of claim 98, wherein each probe comprises a ligase non-extendable terminus, and wherein each probe comprises a detectable moiety at a position between the scissile linkage and the ligase non-extendable terminus.

101. The collection of families of differentially labeled oligonucleotide probes according to claim 98, wherein the scissile linkage is a phosphorothioate linkage.

102. The collection of differentially labeled oligonucleotide probe families of claim 98, wherein the collection comprises 2 probe families.

103. The collection of differentially labeled oligonucleotide probe families of claim 98, wherein the collection comprises 3 probe families.

104. The collection of differentially labeled oligonucleotide probe families of claim 98, wherein the collection comprises 4 probe families.

105. The collection of differentially labeled oligonucleotide probe families of claim 98, wherein the collection comprises more than 4 probe families.

106. The collection of families of differentially labeled oligonucleotide probes according to claim 98, wherein the probes comprise detectable moieties that are linked by a cleavable linker, that are photobleachable, or both.

107. A collection of at least two differentially labeled oligonucleotide probe families, wherein the oligonucleotide probes of each probe family have the structure 5'-(X)_j(N)_k N_B-3' or 3'-(X)_j(N)_k N_B-5', wherein N represents any nucleoside; N_B represents a moiety that cannot be extended by a ligase; (X)_j is a constrained portion of said probe, in which each X represents a nucleoside, the nucleosides of (X)_j being the same or different but not independently selected; each X is at least 2-fold degenerate; j is 2-5; k is 1-100; each probe has a detectable moiety at its terminus, the detectable moiety not being in (X)_j; and wherein the probes within each probe family comprise the same label and the probes of different probe families comprise different distinguishable labels.

108. The collection of families of differentially labeled encoding oligonucleotide probes of claim 107, wherein at least one internucleoside linkage is a scissile linkage.

109. The collection of differentially labeled oligonucleotide probe families of claim 107, wherein the scissile linkage is a phosphorothioate linkage.

110. The collection of differentially labeled oligonucleotide probe families of claim 107, wherein the detectable moiety is linked by a cleavable linker, is photobleachable, or both.

111. The collection of differentially labeled oligonucleotide probe families of claim 110, wherein the cleavable linker comprises a disulfide bond.

112. The collection of differentially labeled oligonucleotide probe families of claim 107, wherein the collection consists of four probe families, wherein the oligonucleotide probes of each probe family have the structure 5'-(XY)(N)_k N_B*-3' or 3'-(XY)(N)_k N_B*-5', wherein N represents any nucleoside; N_B represents a moiety that cannot be extended by a ligase; * represents a detectable moiety; XY is a constrained portion of the probe, in which X and Y represent nucleosides that are the same or different but are not independently selected; X and Y are at least 2-fold degenerate; at least one internucleoside linkage is a scissile linkage; and k is from 1 to 100; with the proviso that the detectable moiety may be present on Y or on any internal nucleoside of (N)_k, in addition to or instead of being present on N_B.

113. The collection of differentially labeled oligonucleotide probe families of claim 112, wherein the scissile linkage is a phosphorothioate linkage.

114. The collection of differentially labeled oligonucleotide probe families of claim 112, wherein the detectable moieties are linked by a cleavable linker, are photobleachable, or both.

115. The collection of differentially labeled oligonucleotide probe families of claim 114, wherein the cleavable linker comprises a disulfide bond.

116. The collection of differentially labeled oligonucleotide probe families of claim 112, wherein oligonucleotide probes that differ in the sequence of their constrained portions are assigned to the first, second, third and fourth probe families according to one of the 24 coding schemes listed in Table 1.

117. A collection of at least two differentially labeled oligonucleotide probe families, wherein the oligonucleotide probes of each probe family have the structure 5'-(X)_j(N)_k N_B-3' or 3'-(X)_j(N)_k N_B-5', wherein N represents any nucleoside or abasic residue; N_B represents a moiety that cannot be extended by a ligase; (X)_j is a constrained portion of said probe, in which each X represents a nucleoside or an abasic residue, with the proviso that X₁ represents a nucleotide, the nucleosides of (X)_j being the same or different but not independently selected; each X is at least 2-fold degenerate; j is 2-5; k is 1-100; each probe comprises a detectable moiety at its terminus, the detectable moiety not being in (X)_j; and wherein the probes within each probe family comprise the same label and the probes of different probe families comprise different distinguishable labels.

118. The collection of families of differentially labeled encoding oligonucleotide probes according to claim 117, wherein at least one internucleoside linkage is a scissile linkage.

119. The collection of differentially labeled oligonucleotide probe families of claim 117, wherein the scissile linkage is between a nucleoside and an abasic residue.

120. The collection of families of differentially labeled encoding oligonucleotide probes according to claim 117, wherein the oligonucleotide probes comprise a priming residue.

121. The collection of differentially labeled oligonucleotide probe families of claim 117, wherein the detectable moiety is linked by a cleavable linker, is photobleachable, or both.

122. The collection of differentially labeled oligonucleotide probe families of claim 121, wherein the cleavable linker comprises a disulfide bond.

123. The collection of differentially labeled oligonucleotide probe families of claim 117, wherein the collection consists of four probe families, wherein the oligonucleotide probes of each probe family have the structure 5'-(XY)(N)_k N_B*-3' or 3'-(XY)(N)_k N_B*-5', wherein N represents any nucleoside or abasic residue; N_B represents a moiety that cannot be extended by a ligase; * represents a detectable moiety; XY is a constrained portion of the probe, in which X and Y represent nucleosides that are the same or different but are not independently selected; X and Y are at least 2-fold degenerate; at least one internucleoside linkage is a scissile linkage; and k is from 1 to 100; with the proviso that the detectable moiety may be present on Y or on any internal nucleoside of (N)_k, in addition to or instead of being present on N_B.

124. The collection of differentially labeled oligonucleotide probe families of claim 123, wherein the scissile linkage is located between a nucleoside and an abasic residue.

125. The collection of differentially labeled oligonucleotide probe families of claim 123, wherein the detectable moiety is linked by a cleavable linker, is photobleachable, or both.

126. The collection of families of differentially labeled oligonucleotide probes according to claim 125, wherein the cleavable linker comprises a disulfide bond.

127. The collection of families of differentially labeled encoding oligonucleotide probes according to claim 123, wherein the oligonucleotide probes comprise a priming residue.

128. The collection of differentially labeled oligonucleotide probe families of claim 123, wherein oligonucleotide probes whose constrained portions differ in sequence are assigned to the first, second, third and fourth probe families according to one of the 24 coding schemes listed in Table 1.

129. A kit comprising a collection of at least two families of differentially labeled oligonucleotide probes.

130. The kit of claim 129, wherein the probe comprises a cleavable internucleoside linkage.

131. The kit of claim 130, wherein the scissile internucleoside linkage is a phosphorothioate linkage.

132. The kit of claim 129, further comprising at least one selected from the group consisting of: ligase, a substance capable of cleaving phosphorothioate linkages, phosphatase, polymerase, support, buffer, thermostable polymerase, nucleotides, reagents for preparing an emulsion, and reagents for preparing a gel.

133. A method of making a plurality of template polynucleotides, the method comprising the steps of:

(a) embedding a plurality of microparticles in or on a reversible semi-solid support to form a first array of microparticles; and

(b) amplifying the starting template polynucleotide in the semi-solid support such that, following amplification, each microparticle has attached thereto a clonal population of template molecules.

134. The method of claim 133, further comprising the steps of:

(a) dissolving the semi-solid support; and

(b) collecting the microparticles.

135. The method of claim 134, further comprising the steps of:

forming a second array of microparticles in or on another semi-solid support, wherein the density of microparticles in the second array is greater than the density of microparticles in the first array.

136. A kit comprising an oligonucleotide probe comprising a phosphorothioate linkage, wherein the probe is labeled with a detectable moiety.

137. The kit of claim 136, wherein the detectable moiety is a fluorescent dye.

138. The kit of claim 136, further comprising a substance capable of cleaving phosphorothioate linkages.

139. The kit of claim 136, further comprising a ligase.

140. The kit of claim 136, further comprising a ligase and a substance capable of cleaving phosphorothioate linkages.

141. The kit of claim 136, further comprising at least one selected from the group consisting of: ligase, a substance capable of cleaving phosphorothioate linkages, phosphatase, polymerase, support, buffer, thermostable polymerase, nucleotides, reagents for preparing an emulsion, and reagents for preparing a gel.

142. The kit of claim 136, wherein the kit comprises a plurality of fluorescently labeled oligonucleotide probes comprising a phosphorothioate linkage such that probes corresponding to different probe terminal nucleotides carry different spectrally resolved fluorescent dyes.

143. An oligonucleotide probe of the form 5'-O-P-O-X-O-P-S-(N)_k N_B*-3', wherein N represents any nucleotide, N_B represents a moiety that cannot be extended by a ligase, * represents a detectable moiety, X represents a nucleotide, and k is 1-100, with the proviso that a detectable moiety may be present on any nucleotide of (N)_k, in addition to or instead of being present on N_B.

144. The oligonucleotide probe of claim 143, wherein the probe comprises at least one nucleotide with reduced degeneracy.

145. The set of oligonucleotide probes of claim 143, wherein the set comprises a plurality of fluorescently labeled oligonucleotide probes such that probes corresponding to different probe nucleotides X carry different spectrally resolved fluorochromes.

146. An oligonucleotide probe of the form 5'-N_B*(N)_kC-S-P-O-X-3', wherein N represents any nucleotide, N_B represents a moiety that cannot be extended by a ligase, * represents a detectable moiety, X represents a nucleotide, and k is 1-100, with the proviso that a detectable moiety may be present on any nucleotide of (N)_k, in addition to or instead of being present on N_B.

147. The oligonucleotide probe of claim 146, wherein the probe comprises at least one nucleotide with reduced degeneracy.

148. The set of oligonucleotide probes of claim 146, wherein the set comprises a plurality of fluorescently labeled oligonucleotide probes such that probes corresponding to different probe nucleotides X carry different spectrally resolved fluorescent dyes.

149. An oligonucleotide probe of the form 5'-O-P-O-X-O-(N)_k-O-P-S-(N)_i N_B*-3', wherein N represents any nucleotide, N_B represents a moiety that cannot be extended by a ligase, * represents a detectable moiety, X represents a nucleotide, (k + i) is 1-100, k is 1-100, and i is 0-99, with the proviso that a detectable moiety may be present on any nucleotide of (N)_i, in addition to or instead of being present on N_B.

150. The oligonucleotide probe of claim 149, wherein the probe comprises at least one nucleotide with reduced degeneracy.

151. The oligonucleotide probe of claim 149, wherein i is 0.

152. The set of oligonucleotide probes of claim 149, wherein the set comprises a plurality of fluorescently labeled oligonucleotide probes such that probes corresponding to different probe nucleotides X carry different spectrally resolvable fluorescent dyes.

153. An oligonucleotide probe of the form 5'-N_B*(N)_i-S-P-O-(N)_k-O-P-O-X-3', wherein N represents any nucleotide, N_B represents a moiety that cannot be extended by a ligase, * represents a detectable moiety, X represents a nucleotide, (k + i) is 1-100, k is 1-100, and i is 0-99, with the proviso that a detectable moiety may be present on any nucleotide of (N)_i, in addition to or instead of being present on N_B.

154. The oligonucleotide probe of claim 153, wherein the probe comprises at least one nucleotide with reduced degeneracy.

155. The oligonucleotide probe of claim 153, wherein i is 0.

156. The set of oligonucleotide probes of claim 153, wherein the set comprises a plurality of fluorescently labeled oligonucleotide probes such that probes corresponding to different probe nucleotides X carry different spectrally resolved fluorescent dyes.

157. An oligonucleotide probe of a form selected from the group consisting of: 3'-XNNSnsINI-5', 3'-XNNSnsIII-5', 3'-XNNNNSnII-5' and 3'-XNNNNIsII-5', wherein X and N represent any nucleotide, "s" represents a scissile linkage, and at least one residue between the scissile linkage and the 5' end of the oligonucleotide carries a label specific to X.

158. The probe of claim 157, wherein s represents a phosphorothioate linkage.

159. A method of ligating a first polynucleotide to a second polynucleotide, the method comprising the steps of:

(a) providing a first polynucleotide immobilized in or on a semi-solid support;

(b) contacting the first polynucleotide with a second polynucleotide and a ligase; and

(c) maintaining said first and second polynucleotides in the presence of a ligase and under conditions suitable for ligation.

160. The method of claim 159, wherein the first polynucleotide is directly attached to the semi-solid support by covalent or non-covalent linkage.

161. The method of claim 159, wherein the first polynucleotide is attached to a support immobilized in or on the semi-solid support.

162. The method of claim 161, wherein the semi-solid support is a gel and the support is a microparticle.

163. The method of claim 161, wherein the semi-solid support is a gel and the support is a magnetic microparticle.

164. A method of cleaving a polynucleotide, the method comprising the steps of:

(a) providing a polynucleotide immobilized in or on a semi-solid support, wherein the polynucleotide comprises a scissile linkage;

(b) contacting the polynucleotide with a cleaving agent; and

(c) maintaining said polynucleotide in the presence of said cleavage agent and under conditions suitable for cleavage.

165. The method of claim 164, wherein the polynucleotide is directly attached to the semi-solid support by covalent or non-covalent linkage.

166. The method of claim 164, wherein the polynucleotide is attached to a support immobilized in or on the semi-solid support.

167. The method of claim 166, wherein the semi-solid support is a gel and the support is a microparticle.

168. The method of claim 166, wherein the semi-solid support is a gel and the support is a magnetic microparticle.

169. An automated sequencing apparatus comprising a flow cell oriented to provide gravity bubble displacement.

170. An automated sequencing system that enables the identification of 40,000 nucleotides per second.

171. An automated sequencing system that generates 8.6Gb sequence information per day.

172. An automated sequencing system that generates 48Gb sequence information daily.
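The throughput figures recited in claims 170-172 can be related to one another by simple unit conversion. The following is a minimal sketch, assuming that "Gb" denotes 10^9 nucleotides and that the system runs continuously for 24 hours; both assumptions and the function names are illustrative and are not stated in the claims.

```python
# Unit conversion between per-second and per-day sequencing throughput.
# Assumptions (not from the claims): 1 Gb = 1e9 nucleotides, continuous 24 h operation.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def nucleotides_per_day(nucleotides_per_second: float) -> float:
    """Daily output for a given sustained identification rate."""
    return nucleotides_per_second * SECONDS_PER_DAY


def nucleotides_per_second(nucleotides_per_day_value: float) -> float:
    """Sustained identification rate needed for a given daily output."""
    return nucleotides_per_day_value / SECONDS_PER_DAY


# Claim 170: 40,000 nucleotides per second, expressed per day.
daily_at_40k = nucleotides_per_day(40_000)

# Claims 171 and 172: 8.6 Gb and 48 Gb per day, expressed per second.
rate_for_8_6_gb = nucleotides_per_second(8.6e9)
rate_for_48_gb = nucleotides_per_second(48e9)

print(f"40,000 nt/s -> {daily_at_40k / 1e9:.3f} Gb/day")
print(f"8.6 Gb/day  -> {rate_for_8_6_gb:,.0f} nt/s")
print(f"48 Gb/day   -> {rate_for_48_gb:,.0f} nt/s")
```

Under these assumptions, 40,000 nucleotides per second corresponds to roughly 3.5 Gb per day, while 8.6 Gb and 48 Gb per day correspond to roughly 100,000 and 556,000 nucleotides per second, respectively.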

173. A method of identifying a nucleotide sequence within a template polynucleotide, the method comprising the steps of:

(a) extending an initiator oligonucleotide along the template polynucleotide by ligating an oligonucleotide probe to the initiator oligonucleotide to form an extended duplex, wherein the oligonucleotide probe comprises a priming residue;

(b) identifying one or more nucleotides of the polynucleotide;

(c) cleaving the oligonucleotide probe with a cleavage agent to generate an extendable probe end; and

(d) repeating steps (a), (b) and (c) until the nucleotide sequence is determined.

174. The method of claim 173, wherein said identifying step comprises detecting a label attached to the most recently ligated oligonucleotide probe.

175. The method of claim 173, wherein said oligonucleotide probe comprises an abasic residue, a damaged base, or deoxyinosine.

176. The method of claim 173, wherein said oligonucleotide probe comprises a damaged base, and further comprising the step of removing said damaged base.

177. The method of claim 176, wherein the removing step comprises contacting the extended duplex with a DNA glycosylase.

178. The method of claim 173, further comprising the step of cleaving the oligonucleotide probe with a cleavage agent to generate an extendable probe end.

179. The method of claim 178, wherein said cleavage agent is selected from the group consisting of AP endonuclease, Endo V, and periodate.

180. The method of claim 178, wherein the cleavage agent is an AP endonuclease.

181. The method of claim 178, wherein the cleavage agent is endonuclease VIII.

182. The method of claim 173, wherein the extending step is performed in or on a semi-solid support.

183. A method of determining a sequence of nucleotides in a template polynucleotide, the method comprising the steps of:

(b) ligating an extension oligonucleotide probe to the extendable terminus, forming an extended duplex comprising an extended oligonucleotide probe, wherein the extension probe comprises a priming residue;

(c) identifying in the extended duplex at least one of (1) a nucleotide complementary to the just-ligated extension probe or (2) a nucleotide residue immediately downstream of the extended oligonucleotide probe;

184. The method of claim 183, wherein the extension probe comprises an abasic residue, a damaged base, or deoxyinosine.

185. The method of claim 183, wherein each extension probe comprises a non-extendable moiety at one terminus.

186. The method of claim 183, wherein the identifying step comprises detecting a label attached to the most recently ligated extension probe.

187. The method of claim 183, wherein the identifying step comprises removing the non-extendable portion and extending the extended oligonucleotide probe with a nucleic acid polymerase in the presence of one or more labeled chain terminating nucleoside triphosphates.

188. The method of claim 183, further comprising the step of capping extended oligonucleotide probes when no extension probe is ligated to the extendable terminus in the ligating step.

189. The method of claim 183, wherein the generating step comprises cleaving the oligonucleotide probe with a cleavage agent selected from the group consisting of an AP endonuclease and a periodate.

190. The method of claim 189, wherein the cleavage agent is an AP endonuclease.

191. The method of claim 189, wherein the cleavage agent is endonuclease VIII.

192. The method of claim 183, wherein the ligating and generating steps are performed in or on a semi-solid support.

193. The method of claim 183, wherein step (a) comprises providing in separate aliquots a plurality of different probe-template duplexes, each different duplex comprising an initiator oligonucleotide probe hybridized to a template polynucleotide, wherein the template polynucleotide is the same in each duplex but the initiator oligonucleotide probe binds to a different sequence of the template polynucleotide in each duplex; and steps (b)-(e) are performed independently for each aliquot.

194. The method of claim 193 wherein for each aliquot, one end of the extension oligonucleotide probe comprises a non-extendable moiety.

195. The method of claim 194, wherein for each aliquot, the identifying step comprises detecting a label attached to the most recently ligated extension probe.

196. The method of claim 194, wherein for each aliquot, said identifying step comprises removing said non-extendable moiety and extending said extended oligonucleotide probe with a nucleic acid polymerase in the presence of one or more labeled chain terminating nucleoside triphosphates.

197. The method of claim 194, further comprising the step of capping extended oligonucleotide probes when no extension probe is ligated to the extendable terminus in the ligating step.

198. The method of claim 194, wherein said generating step comprises cleaving said oligonucleotide probe with a cleavage agent selected from the group consisting of an AP endonuclease and a periodate.

199. The method of claim 198, wherein the cleavage agent is an AP endonuclease.

200. The method of claim 198, wherein the cleavage agent is endonuclease VIII.

201. The method of claim 194, wherein the ligating and generating steps are performed in or on a semi-solid support.

202. The method of claim 183, further comprising the steps of: (f) removing the ligation probes and the initiator oligonucleotide from the template; (g) repeating step (a) with a second oligonucleotide that binds to a different sequence of the template polynucleotide; and (h) repeating steps (b)-(e).

203. The method of claim 202, wherein the method is repeated a plurality of times, each time with an initiator oligonucleotide that binds to a different sequence of the template polynucleotide.

204. The method of claim 202, wherein the removing step comprises contacting the ligation probe, the initiator oligonucleotide, and the template with an aqueous solution comprising 1.0-3.0% SDS, 100 mM NaCl, and 5-15 mM sodium bisulfate (NaHSO₄).

205. The method of claim 202, wherein the removing step comprises contacting the ligation probe, initiator oligonucleotide, and template with a mixture comprising about 2% SDS, about 200 mM NaCl, and about 10 mM sodium bisulfate (NaHSO₄), such as 2% SDS, 200 mM NaCl, and 10 mM sodium bisulfate (NaHSO₄).

206. The method of claim 203, wherein one end of the extension probe comprises a non-extendable moiety.

207. The method of claim 203, wherein in each iteration, the identifying step comprises detecting a label attached to the most recently ligated extension probe.

208. The method of claim 203, wherein in each iteration said identifying step comprises removing said non-extendable moiety and extending said extended oligonucleotide probe with a nucleic acid polymerase in the presence of one or more labeled chain terminating nucleoside triphosphates.

209. The method of claim 203, further comprising the step of capping extended oligonucleotide probes when no extension probe is ligated to the extendable terminus in the ligating step.

210. The method of claim 203, wherein said generating step comprises cleaving said oligonucleotide probe with a cleavage agent selected from the group consisting of an AP endonuclease and a periodate.

211. The method of claim 210, wherein the cleavage agent is an AP endonuclease.

212. The method of claim 210, wherein the cleavage agent is endonuclease VIII.

213. The method of claim 203, wherein the ligating and generating steps are performed in or on a semi-solid support.

214. A method of identifying a sequence of nucleotides in a template polynucleotide attached to a support at a point of attachment, the method comprising the steps of:

(a) extending an initiator oligonucleotide along said template polynucleotide by ligating an oligonucleotide probe to said initiator oligonucleotide to form an extended duplex, wherein said oligonucleotide probe comprises a priming residue and said extension proceeds along said template toward its point of attachment to said support;

(b) identifying one or more nucleotides of the polynucleotide; and

(c) repeating steps (a) and (b) until the nucleotide sequence is determined.

215. The method of claim 214, wherein the identifying step comprises detecting a label attached to the most recently ligated extension probe.

216. The method of claim 214, wherein said generating step comprises cleaving said oligonucleotide probe with a cleavage agent selected from the group consisting of AP endonuclease, Endo V, and periodate.

217. The method of claim 216, wherein the cleavage agent is an AP endonuclease.

218. The method of claim 216, wherein the cleavage agent is endonuclease VIII.

219. The method of claim 216, wherein the ligating and generating steps are performed in or on a semi-solid support.

220. A method of determining the sequence of nucleotides in a template polynucleotide attached to a support at a point of attachment, the method comprising the steps of:

(a) providing a probe-template duplex formed by hybridization of a probe to a template polynucleotide, the probe having an extendable terminus;

(b) ligating an extension oligonucleotide probe to said extendable terminus, forming an extended duplex comprising an extended oligonucleotide probe, wherein said extension proceeds along said template to its point of attachment to said support, wherein said oligonucleotide probe comprises a priming residue;

(d) generating an extendable end on the extension oligonucleotide probe if an extendable end is not already present, such that the generated end is different from the end to which the last extension probe was ligated; and

221. The method of claim 220, wherein each extension probe comprises a non-extendable moiety at one terminus.

222. The method of claim 220, wherein the identifying step comprises detecting a label attached to the most recently ligated extension probe.

223. The method of claim 220, wherein said identifying step comprises removing said non-extendable moiety and extending said extended oligonucleotide probe with a nucleic acid polymerase in the presence of one or more labeled chain terminating nucleoside triphosphates.

224. The method of claim 220, further comprising the step of capping extended oligonucleotide probes when no extension probe is ligated to the extendable terminus in the ligating step.

225. The method of claim 220, wherein the generating step comprises cleaving the oligonucleotide probe with a cleavage agent.

226. The method of claim 225, wherein said cleavage agent is selected from the group consisting of AP endonuclease, Endo V, and periodate.

227. The method of claim 225, wherein the cleavage agent is an AP endonuclease.

228. The method of claim 225, wherein the cleavage agent is endonuclease VIII.

229. The method of claim 220, further comprising the steps of: (f) removing the ligation probes and the initiator oligonucleotide from the template; (g) repeating step (a) with a second oligonucleotide that binds to a different sequence of the template polynucleotide; and (h) repeating steps (b)-(e).

230. The method of claim 229, wherein the method is repeated a plurality of times, each time with an initiator oligonucleotide that binds to a different sequence of the template polynucleotide.

231. The method of claim 220, wherein the ligating and generating steps are performed in or on a semi-solid support.

232. The method of claim 220, wherein the template is attached to a microparticle attached to a substantially planar rigid substrate.

233. The method of claim 229, wherein the removing step comprises contacting the ligation probe, initiator oligonucleotide, and template with an aqueous solution comprising about 1.0-3.0% SDS, 100 mM NaCl, and 5-15 mM sodium bisulfate (NaHSO₄).

234. The method of claim 229, wherein the removing step comprises contacting the ligation probes, initiator oligonucleotides, and template with an aqueous solution comprising about 2% SDS, 200 mM NaCl, and 10 mM sodium bisulfate (NaHSO₄), such as 2% SDS, 200 mM NaCl, and 10 mM sodium bisulfate (NaHSO₄).

235. A method of identifying a nucleotide sequence within a template polynucleotide, the method comprising the steps of:

(a) providing a template polynucleotide attached to a microparticle immobilized in or on a semi-solid support;

(b) extending an initiator oligonucleotide along the template polynucleotide by ligating an oligonucleotide probe to the initiator oligonucleotide to form an extended duplex, wherein the oligonucleotide probe comprises a priming residue;

(c) identifying one or more nucleotides of the polynucleotide; and

(d) repeating steps (b) and (c) until the nucleotide sequence is determined.

236. The method of claim 235, wherein the extending step is performed in a semi-solid support.

237. The method of claim 235, wherein the template is attached to a microparticle attached to a substantially flat rigid substrate.

238. The method of claim 235, wherein the oligonucleotide comprises a scissile linkage, or is susceptible to being modified to comprise a scissile linkage, wherein the scissile linkage is between the nucleoside and the abasic residue.

239. A method of determining a nucleotide sequence of a template polynucleotide, the method comprising the steps of:

(a) providing a probe-template duplex formed by hybridization of a probe comprising an extendable terminus to a template polynucleotide, said probe-template duplex being attached to a microparticle embedded in or on a semi-solid support;

240. The method of claim 239, wherein the ligating and generating steps are performed in a semi-solid support.

241. The method of claim 239, wherein the template is attached to a microparticle attached to a substantially flat rigid substrate.

242. A method of determining a sequence of nucleotides in a template polynucleotide, the method comprising the steps of:

(a) amplifying the template polynucleotide molecules in the emulsion chamber in the presence of the microparticles, thereby generating microparticles to which a clonal population of template polynucleotides are attached;

(b) recovering the microparticles from the emulsion;

(c) embedding the microparticles in or on a semi-solid support;

(d) extending an initiator oligonucleotide along the template polynucleotide by ligating an oligonucleotide probe to the initiator oligonucleotide to form an extended duplex, wherein the oligonucleotide probe comprises a priming residue;

(e) identifying one or more nucleotides of the polynucleotide; and

(f) repeating steps (d) and (e) until the nucleotide sequence is determined.

243. The method of claim 242, wherein (i) a plurality of template polynucleotide molecules comprising different sequences are amplified in a single emulsion chamber; (ii) a plurality of microparticles are recovered from said emulsion and embedded in or on said support, each microparticle having attached thereto a clonal population of template polynucleotides, wherein said clonal populations have different sequences; and (iii) steps (d), (e) and (f) are performed in parallel on said clonal populations attached to said embedded microparticles, so as to determine a plurality of sequences in parallel.

244. A method for determining nucleotide sequence information for a template polynucleotide using a first set of at least two families of differentially labeled oligonucleotide probes, the method comprising the steps of:

(a) extending an initiator oligonucleotide along the template polynucleotide by ligating an oligonucleotide probe to the initiator oligonucleotide to form an extended duplex, wherein the oligonucleotide probe is a member of the collection of differentially labeled oligonucleotide probe families and contains a priming residue;

(b) detecting a label attached to the oligonucleotide; and

245. The method of claim 244, wherein step (d) comprises decoding the ordered list of probe family names to determine the sequence.

246. The method of claim 244, comprising providing a probe-template duplex formed by hybridisation of an initiator oligonucleotide probe to a template polynucleotide, said probe having an extendable terminus, wherein said extending step comprises ligating an oligonucleotide probe to said extendable terminus to form an extended duplex containing an extended oligonucleotide probe, and further comprising the step of capping the remaining extendable termini if no oligonucleotide probe has been ligated to said extendable terminus in said extending step.

247. The method of claim 244, wherein each probe family comprises a non-extendable moiety at one terminus of the oligonucleotide probe.

248. The method of claim 244, further comprising, after each detecting step: (f) generating an extendable terminus on the most recently ligated oligonucleotide probe if an extendable terminus is not already present, such that the generated terminus is different from the terminus to which the most recently ligated oligonucleotide probe was ligated.

249. The method of claim 248, wherein the extendable probe terminus is generated by cleaving the oligonucleotide with a cleavage agent selected from the group consisting of AP endonuclease, Endo V, and periodate.

250. The method of claim 249, wherein said cleavage agent is an AP endonuclease.

251. The method of claim 249, wherein said cleavage agent is endonuclease VIII.

252. The method of claim 244, wherein the extending step is performed in or on a semi-solid support.

253. The method of claim 244, wherein the template is attached to a microparticle attached to a substantially flat rigid substrate.

254. The method of claim 244, wherein said collection comprises 2 differentially labeled probe families.

255. The method of claim 244, wherein said collection comprises 3 differentially labeled probe families.

256. The method of claim 244, wherein said collection comprises 4 differentially labeled probe families.

257. The method of claim 244, wherein the collection comprises more than 4 differentially labeled probe families.

258. The method of claim 244, wherein the oligonucleotide probes comprise constrained portions in which the nucleosides are not independently selected, and wherein oligonucleotide probes whose constrained portions differ in sequence are assigned to a probe family according to a coding scheme.

259. The method of claim 244, wherein said oligonucleotide probes are assigned to the first, second, third and fourth probe families according to one of the 24 coding schemes listed in table 1.

260. The method of claim 245, wherein the identity of at least one nucleotide in the template is known, and wherein the decoding step comprises:

(iii) repeating step (ii) until the sequence is determined.

261. The method of claim 245, further comprising the steps of:

(iii) repeating step (ii) until the sequence is determined.

262. The method of claim 261, wherein said determining step comprises contacting a template-probe duplex with a labeled nucleotide in the presence of a polymerase under conditions that enable incorporation of said labeled nucleotide if said labeled nucleotide is complementary to said template at a position adjacent to said duplex.

263. The method of claim 245, wherein said decoding step comprises: generating at least one candidate sequence from the ordered list of probe family names; and selecting a candidate sequence as the nucleotide sequence of the template.

264. The method of claim 263, wherein the generating step comprises generating at least 4 candidate sequences.
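The decoding described in claims 263 and 264 can be illustrated with a short sketch. It assumes four probe families and one particular hypothetical coding scheme in which the family of a dinucleotide XY is the bitwise XOR of two-bit codes for X and Y (A=0, C=1, G=2, T=3); Table 1 is not reproduced in this section, so this scheme is an assumption chosen for illustration and the function names are likewise illustrative. Families are numbered 0-3 here rather than first through fourth.

```python
# Sketch: generating candidate sequences from an ordered list of probe family names.
# Assumed coding scheme (hypothetical, for illustration only):
#   family(XY) = code(X) XOR code(Y), with A=0, C=1, G=2, T=3.

BASES = "ACGT"


def family_of(x: str, y: str) -> int:
    """Probe family (0-3) assigned to the dinucleotide XY under the assumed scheme."""
    return BASES.index(x) ^ BASES.index(y)


def encode(sequence: str) -> list[int]:
    """Ordered list of probe families for each overlapping dinucleotide."""
    return [family_of(a, b) for a, b in zip(sequence, sequence[1:])]


def decode(families: list[int], first_base: str) -> str:
    """Rebuild a sequence from the family list, given an assumed first nucleotide."""
    sequence = [first_base]
    for family in families:
        # Under the XOR scheme, the previous base and the family fix the next base.
        sequence.append(BASES[BASES.index(sequence[-1]) ^ family])
    return "".join(sequence)


def candidate_sequences(families: list[int]) -> list[str]:
    """One candidate per possible identity of the first nucleotide (four in total)."""
    return [decode(families, base) for base in BASES]


families = encode("ATGGC")
print(families)                       # [3, 1, 0, 3]
print(candidate_sequences(families))  # ['ATGGC', 'CGTTA', 'GCAAT', 'TACCG']
```

Each of the four candidates is consistent with the same ordered list of families, so a further piece of information, such as a known nucleotide or a comparison against known sequences, is needed to select among them.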

265. The method of claim 263, wherein the generating step comprises:

(i) assuming an identity for the first nucleotide of the nucleotide sequence;

(iv) repeating step (iii) until a candidate sequence is generated; and

266. The method of claim 263, wherein the selecting step comprises comparing at least one candidate sequence to one or more known sequences and selecting the candidate sequence having a predetermined degree of identity or closest proximity to one or more known sequences.

267. The method of claim 266, wherein the template is derived from an organism of interest, and wherein the step of comparing comprises comparing at least one candidate sequence to sequences in a database comprising sequences obtained from the organism.

268. The method of claim 266, wherein the step of comparing comprises comparing at least one candidate sequence to sequences in a database comprising a plurality of comparison sequences, each sequence comprising a different potential sequence for the test polynucleotide sequence.

269. The method of claim 263, wherein the selecting step comprises:

270. The method of claim 269, wherein the comparison moiety is a dinucleotide.

271. The method of claim 269, wherein the second ordered list of probe family names comprises only one element.

272. The method of claim 244, wherein the oligonucleotide probes in each probe family have the following structure: 5'-(XY)(N)_k N_B*-3' or 3'-(XY)(N)_k N_B*-5', wherein N represents any nucleoside, N_B represents a moiety that cannot be extended by a ligase, * represents a detectable moiety, XY is a constrained portion of said probe in which X and Y represent nucleosides that are the same or different but are not independently selected, X and Y are at least 2-fold degenerate, at least one internucleoside linkage is a scissile linkage between a nucleoside and an abasic residue, or between a nucleoside and a residue comprising a damaged base, and k is from 1 to 100, with the proviso that the detectable moiety may be present on Y or on any internal nucleotide of (N)_k, in addition to or instead of being present on N_B.

273. The method of claim 272, wherein said at least one internucleoside linkage is a scissile linkage between a nucleoside and an abasic residue.

274. The method of claim 272, wherein the detectable moiety is linked by a cleavable linker, can be photobleached, or both.

275. The method of claim 274, wherein the cleavable linker comprises a disulfide bond.

276. The method of claim 272, wherein 4 families of differentially labeled oligonucleotide probes are used, wherein oligonucleotide probes whose constrained portions differ in sequence are assigned to the first, second, third and fourth probe families according to one of the 24 coding schemes listed in Table 1.

277. The method of claim 244, wherein the detecting step comprises simultaneously obtaining an average of 2 bits of information from each of at least 2 nucleotides in the template, and not obtaining two bits of information from any single nucleotide.

278. The method of claim 244, wherein the detecting step comprises simultaneously obtaining less than 2 bits of information from each of at least 2 nucleotides in the template.

279. A method for determining nucleotide sequence information for a template polynucleotide using a first set of at least two families of differentially labeled oligonucleotide probes, the method comprising the steps of:

(a) Contacting a probe-template complex with at least two differentially labeled oligonucleotide probe families, the probe-template complex comprising a double stranded portion having an extendable end and a single stranded portion to be sequenced, to allow hybridization of an oligonucleotide probe comprising a portion complementary to the template portion immediately adjacent to the duplex portion, wherein the probes of the probe families comprise a priming residue;

(b) ligating said hybridized oligonucleotide probe to said extendable terminus, thereby generating a probe-template duplex containing an extended duplex;

(c) detecting a label attached to the ligated probe;

(d) generating an extendable probe end on the extended duplex if no existing extendable probe end is present; and

280. The method of claim 279, wherein the detecting step comprises simultaneously obtaining an average of 2 bits of information from each of at least 2 nucleotides in the template and not obtaining two bits of information from any single nucleotide.

281. The method of claim 279, wherein the detecting step comprises simultaneously obtaining less than 2 bits of information from each of at least 2 nucleotides in the template.

282. A method for determining nucleotide sequence information for a template polynucleotide using a first collection of oligonucleotide probe families, wherein probes of the probe families contain a priming residue, the method comprising the steps of:

(a) performing successive cycles of extension, ligation, detection and cleavage, wherein the detecting step comprises simultaneously obtaining an average of two bits of information from each of at least two nucleotides in the template, and not obtaining two bits of information from any single nucleotide; and

283. The method of claim 282, wherein the at least one other bit of information comprises an item of information selected from the group consisting of: nucleotide species in the template; information obtained by comparing a candidate sequence to at least one known sequence; and information obtained by repeating the method using a second set of oligonucleotide probe families.

284. A collection of at least two families of differentially labeled oligonucleotide probes, wherein the probes of each family of probes comprise a constrained portion and an unconstrained portion, the constrained portion being at least 2-fold degenerate at each position, the probes of each family containing a priming residue.

285. The collection of differentially labeled oligonucleotide probe families of claim 284, wherein each probe comprises a terminus that cannot be extended by a ligase.

286. The collection of differentially labeled oligonucleotide probe families of claim 284, wherein each probe comprises a terminus that cannot be extended by a ligase, and wherein each probe comprises a detectable moiety at a position between the scissile linkage and the terminus that cannot be extended by a ligase.

287. The collection of differentially labeled oligonucleotide probe families of claim 284, wherein the collection comprises 2 probe families.

288. The collection of differentially labeled oligonucleotide probe families of claim 284, wherein the collection comprises 3 probe families.

289. The collection of differentially labeled oligonucleotide probe families of claim 284, wherein the collection comprises 4 probe families.

290. The collection of differentially labeled oligonucleotide probe families of claim 284, wherein the collection comprises more than 4 probe families.

291. The collection of differentially labeled oligonucleotide probe families of claim 284, wherein the probes comprise detectable moieties that are linked by a cleavable linker, that are photobleachable, or both.

292. A kit comprising a collection of at least two families of differentially labeled oligonucleotide probes, wherein the oligonucleotide probes contain a priming residue.

293. The kit of claim 292, wherein the priming residue is an abasic residue, deoxyinosine, or a residue containing a damaged base.

294. The kit of claim 292, wherein the probes comprise a scissile linkage between a nucleoside and an abasic residue.

295. The kit of claim 292, further comprising at least one selected from the group consisting of: ligase, a substance capable of cleaving the scissile linkage, phosphatase, polymerase, support, buffer, thermostable polymerase, nucleotides, reagents for preparing an emulsion, and reagents for preparing a gel.

296. A kit comprising an oligonucleotide probe comprising a priming residue, wherein said probe comprises, or can readily be modified to comprise, a scissile linkage, and wherein said probe is labeled with a detectable moiety.

297. The kit of claim 296, wherein the detectable moiety is a fluorescent dye.

298. The kit of claim 296, further comprising a substance capable of cleaving the scissile linkage.

299. The kit of claim 296, further comprising a ligase.

300. The kit of claim 296, further comprising a ligase and a substance capable of cleaving the scissile linkage.

301. The kit of claim 296, further comprising at least one selected from the group consisting of: ligase, a substance capable of cleaving the scissile linkage, phosphatase, polymerase, support, buffer, thermostable polymerase, nucleotides, reagents for preparing an emulsion, and reagents for preparing a gel.

302. The kit of claim 301, comprising a plurality of fluorescently labeled oligonucleotide probes, wherein the probes contain a scissile linkage between a nucleoside and a priming residue such that probes corresponding to different probe terminal nucleotides carry different spectrally resolved fluorescent dyes.

303. An oligonucleotide probe of the form 5'-O-P-O-X-O-P-O-(N)_kN_B*-3', wherein N represents any nucleotide or abasic residue, with the proviso that at least one N is a priming residue, N_B represents a moiety that cannot be extended by a ligase, * represents a detectable moiety, X represents a nucleotide, and k is 1-100, with the further proviso that the detectable moiety may be present on any internal nucleotide of (N)_k, instead of or in addition to being present on N_B.

304. The oligonucleotide probe of claim 303, wherein the probe comprises at least one nucleotide with reduced degeneracy.

305. The set of oligonucleotide probes of claim 303, wherein the set comprises a plurality of fluorescently labeled oligonucleotide probes such that probes corresponding to different probe nucleotides X carry different spectrally resolved fluorescent dyes.

306. An oligonucleotide probe of the form 5'-N_B*(N)_kC-O-P-O-X-3', wherein N represents any nucleotide or abasic residue, with the proviso that at least one N is a priming residue, N_B represents a moiety that cannot be extended by a ligase, * represents a detectable moiety, X represents a nucleotide, and k is 1-100, with the further proviso that the detectable moiety may be present on any internal nucleotide of (N)_k, instead of or in addition to being present on N_B.

307. The oligonucleotide probe of claim 306, wherein the probe comprises at least one nucleotide with reduced degeneracy.

308. The set of oligonucleotide probes of claim 306, wherein the set comprises a plurality of fluorescently labeled oligonucleotide probes such that probes corresponding to different probe nucleotides X carry different spectrally resolved fluorescent dyes.

309. An oligonucleotide probe in a form selected from the group consisting of: 3'-XNNRINI-5', 3'-XNNRIII-5', 3'-XNNRNII-5', 3'-XNNIRII-5', 3'-XNNRNI-5', 3'-XNNRII-5', and 3'-XNNIRI-5', wherein X and N represent any nucleotide and "R" represents a priming residue, and wherein at least one residue between the priming residue and the 5' end of the oligonucleotide carries a label corresponding to a particular X.

310. The probe of claim 309, wherein R represents a deoxyribose residue.

311. A method of ligating a first polynucleotide to a second polynucleotide, the method comprising the steps of:

(a) providing a first polynucleotide immobilized in or on a semi-solid support;

(c) maintaining said first and second polynucleotides in the presence of a ligase and under conditions suitable for ligation, wherein at least one of said polynucleotides contains a priming residue.

312. The method of claim 311, wherein the first polynucleotide is directly attached to the semi-solid support by covalent or non-covalent linkage.

313. The method of claim 311, wherein the first polynucleotide is attached to a support immobilized in or on the semi-solid support.

314. The method of claim 313, wherein the semi-solid support is a gel and the support is a microparticle.

315. The method of claim 313, wherein the semi-solid support is a gel and the support is a magnetic microparticle.

316. A method of cleaving a polynucleotide, the method comprising the steps of:

(a) providing a polynucleotide immobilized in or on a semi-solid support, wherein the polynucleotide comprises a priming residue;

(b) contacting the polynucleotide with a cleaving agent; and

317. The method of claim 316, wherein the priming residue is an abasic residue, deoxyinosine, or a residue containing a damaged base.

318. The method of claim 316, wherein the polynucleotide is directly attached to the semi-solid support by covalent or non-covalent linkage.

319. The method of claim 316, wherein the polynucleotide is attached to a support immobilized in or on the semi-solid support.

320. The method of claim 319, wherein the semi-solid support is a gel and the support is a microparticle.

321. The method of claim 320, wherein the semi-solid support is a gel and the support is a magnetic microparticle.

322. A collection of components for preparing a population of microparticles, the collection comprising:

(a) a population of microparticles, wherein a single microparticle has attached at least a first population of primers and a second population of primers, wherein the primers of the first population differ in sequence from the primers of the second population; and

(b) a library of nucleic acid fragments, wherein each nucleic acid fragment comprises a first and a second nucleic acid segment of interest, wherein the first and second primers correspond to a universal sequence located outside the first and second nucleic acid segments of interest.

323. The collection of components of claim 322, wherein the first and second nucleic acid segments of interest are the 5' and 3' tags of a paired tag.

324. The collection of components of claim 322, wherein the nucleic acid fragments comprise internal adaptors comprising one or more primer binding sites for amplification primers to allow for PCR amplification of each nucleic acid segment.

325. The collection of components of claim 324, further comprising a primer complementary to the primer binding site of the internal adaptor.

326. A microparticle having attached thereto a first population of substantially identical nucleic acid sequences comprising a first nucleic acid segment of interest and a second population of substantially identical nucleic acid sequences comprising a second nucleic acid segment of interest.

327. The microparticle of claim 326, wherein said first and second nucleic acid segments of interest are a 5' tag and a 3' tag of a paired tag.

328. The microparticle of claim 326, wherein said first and second nucleic acid segments of interest are a 5' tag and a 3' tag a predetermined distance apart in a naturally occurring contiguous nucleic acid.

329. The microparticle of claim 327 or 328, wherein said first and second nucleic acid segments are amplified from a single larger nucleic acid fragment.

330. The microparticle of claim 329, wherein said amplification is carried out in a single chamber of a PCR emulsion.

331. The population of microparticles of claim 326, wherein a single microparticle is linked to a population of substantially identical first and second nucleic acid sequences that is at least partially different from a population of substantially identical first and second nucleic acid sequences linked to other single microparticles.

332. The population of claim 331, wherein the first population of nucleic acid sequences attached to a single microparticle comprises a first nucleic acid segment of interest and the second population of nucleic acid sequences attached to a single microparticle comprises a second nucleic acid segment of interest.

333. The population of claim 332, wherein the first and second nucleic acid segments of interest in the population of first and second nucleic acid sequences attached to a single microparticle are or comprise a 5' tag and a 3' tag of a paired tag.

334. The population of claim 332, wherein the first and second nucleic acid segments of interest in the population of first and second nucleic acid sequences attached to a single microparticle are amplified from a single larger nucleic acid fragment.

335. The population of claim 334, wherein said amplification is performed in a single chamber of a PCR emulsion.

336. The population of microparticles of claim 331, wherein the first and second populations of nucleic acid sequences are attached to individual microparticles by amplification in individual chambers of a PCR emulsion, wherein at least a portion of the individual chambers comprise a microparticle and a nucleic acid fragment comprising the first and second nucleic acid segments.

337. An array comprising a population of microparticles as described in any one of claims 331-336.

338. The array of claim 337, wherein the microparticles are immobilized in or on a semi-solid support.

339. A method of producing microparticles linked to a first population of nucleic acid sequences consisting of substantially identical nucleic acid sequences and a second population of nucleic acid sequences consisting of substantially identical nucleic acid sequences, the method comprising the steps of:

(a) providing microparticles linked to a first primer population and a second primer population;

(b) providing a nucleic acid fragment comprising first and second nucleic acid segments, wherein the first and second nucleic acid segments are flanked by binding regions for the first and second primers, the primer binding regions having sequences corresponding to the first and second primers attached to the microparticle and being separated by at least one other primer binding region;

(c) incubating the microparticle and nucleic acid fragment in the presence of suitable amplification reagents and primers to allow amplification, such that the first and second nucleic acid sequences are amplified and attached to the microparticle.

340. The method of claim 339, wherein the first and second primer binding regions each comprise an amplification primer binding region and a sequencing primer binding region.

341. The method of claim 339, wherein the at least one other primer binding region comprises binding regions for two amplification primers, such that each nucleic acid sequence is flanked by primer binding sites for a pair of amplification primers to allow PCR amplification of both nucleic acid sequences.

342. The method of claim 339, wherein the amplification is performed in a single chamber of a PCR emulsion.

343. The method of claim 339, wherein said first and second nucleic acid sequences each comprise a tag, wherein said tags are the 5' and 3' tags of a paired tag.

344. A method of generating a population of microparticles linked to distinct populations of nucleic acid sequences, the nucleic acid sequences within each population of nucleic acid sequences being substantially identical, the method comprising performing the method of claim 339, wherein the amplification is performed in a plurality of chambers of a PCR emulsion, wherein at least a portion of the chambers contain a microparticle and a nucleic acid fragment comprising first and second nucleic acid segments, and wherein the first and second nucleic acid segments of a single nucleic acid fragment differ from the first and second nucleic acid segments of other nucleic acid fragments.

345. A method of producing an array of microparticles, the method comprising immobilizing the population of microparticles of claim 344 in or on a semi-solid support.

346. A method of performing nucleic acid sequencing, the method comprising:

(a) obtaining sequence information for a first population of nucleic acid molecules attached to a microparticle; and

(b) obtaining sequence information for a second population of nucleic acid molecules attached to the same microparticle, wherein the first population of nucleic acid molecules differs in sequence, at least in part, from the second population of nucleic acid molecules.

347. The method of claim 346, wherein the sequence information of (a) and (b) is obtained sequentially.

348. The method of claim 346, wherein the first population of nucleic acid molecules comprises the 5' tag of a paired tag and the second population of nucleic acid molecules comprises the 3' tag of the paired tag.

349. A method of sequencing, the method comprising performing the method of claim 346 on a population of microparticles, wherein the sequence of the first and second nucleic acid molecules attached to each microparticle is at least partially different from the first and second nucleic acid molecules attached to other microparticles.

350. The method of claim 349, wherein the first populations of nucleic acid molecules on the individual microparticles are sequenced in parallel and the second populations of nucleic acid molecules on the individual microparticles are sequenced in parallel.

351. A composition comprising an aqueous solution of 1.0-3.0% SDS, 100-300 mM NaCl and 5-15 mM sodium bisulfate (NaHSO₄).

352. The composition of claim 351, having a pH of 2.0-3.0.

353. The composition of claim 351, comprising an aqueous solution of about 2% SDS, about 200 mM NaCl, and about 10 mM sodium bisulfate (NaHSO₄).

354. The composition of claim 353, having a pH of 2.0-3.0.
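
As a worked illustration of the composition recited in claims 351 and 353, the following sketch computes the solute masses for a chosen volume and checks whether a recipe falls within the ranges of claim 351. It is not part of the claims. It assumes that the SDS percentage is weight per volume (g per 100 mL) and that sodium bisulfate is weighed as the anhydrous salt (about 120.06 g/mol); neither is stated in the claims. The function names `grams_needed` and `within_claim_351` are hypothetical.

```python
# Molar masses in g/mol. NaHSO4 is taken as the anhydrous salt (an assumption;
# the monohydrate would need a different value).
MOLAR_MASS = {"NaCl": 58.44, "NaHSO4": 120.06}


def grams_needed(volume_ml, sds_percent_wv=2.0, nacl_mm=200.0, nahso4_mm=10.0):
    """Return grams of each solute for `volume_ml` of final solution.

    Defaults follow the "about" values of claim 353. The SDS percentage is
    interpreted as weight/volume, which the claims do not specify.
    """
    liters = volume_ml / 1000.0
    return {
        # % w/v means grams per 100 mL of solution.
        "SDS": sds_percent_wv / 100.0 * volume_ml,
        # mM -> mol/L, times litres, times g/mol.
        "NaCl": nacl_mm / 1000.0 * liters * MOLAR_MASS["NaCl"],
        "NaHSO4": nahso4_mm / 1000.0 * liters * MOLAR_MASS["NaHSO4"],
    }


def within_claim_351(sds_percent_wv, nacl_mm, nahso4_mm):
    """Check a recipe against the concentration ranges recited in claim 351."""
    return (
        1.0 <= sds_percent_wv <= 3.0
        and 100.0 <= nacl_mm <= 300.0
        and 5.0 <= nahso4_mm <= 15.0
    )


if __name__ == "__main__":
    for solute, grams in grams_needed(1000.0).items():
        print(f"{solute}: {grams:.2f} g per litre")
```

Under these assumptions, one litre at the claim 353 values calls for 20 g SDS, about 11.69 g NaCl and about 1.20 g NaHSO₄. The sketch does not model pH; the pH of 2.0-3.0 recited in claims 352 and 354 would have to be measured on the prepared solution.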