
CN105121661B - Methods for genome assembly and haplotype phasing - Google Patents


Info

Publication number
CN105121661B
Authority
CN
China
Prior art keywords
dna
reading
methods
read
sequencing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201480020008.2A
Other languages
Chinese (zh)
Other versions
CN105121661A (en)
Inventor
R. E. Green Jr.
L. F. Lareau
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of California San Diego UCSD
Original Assignee
University of California San Diego UCSD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of California San Diego UCSD
Priority to CN201810469575.6A (patent CN108624668B)
Publication of CN105121661A
Application granted
Publication of CN105121661B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/10 Design of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/60 In silico combinatorial chemistry
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2521/00 Reaction characterised by the enzymatic activity
    • C12Q2521/30 Phosphoric diester hydrolysing, i.e. nuclease
    • C12Q2521/301 Endonuclease
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2521/00 Reaction characterised by the enzymatic activity
    • C12Q2521/30 Phosphoric diester hydrolysing, i.e. nuclease
    • C12Q2521/319 Exonuclease
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2521/00 Reaction characterised by the enzymatic activity
    • C12Q2521/50 Other enzymatic activities
    • C12Q2521/501 Ligase
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2522/00 Reaction characterised by the use of non-enzymatic proteins
    • C12Q2522/10 Nucleic acid binding proteins
    • C12Q2522/101 Single or double stranded nucleic acid binding proteins
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2523/00 Reactions characterised by treatment of reaction samples
    • C12Q2523/10 Characterised by chemical treatment
    • C12Q2523/101 Crosslinking agents, e.g. psoralen
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2535/00 Reactions characterised by the assay type for determining the identity of a nucleotide base or a sequence of oligonucleotides
    • C12Q2535/122 Massive parallel sequencing
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2563/00 Nucleic acid detection characterized by the use of physical, structural and functional properties
    • C12Q2563/131 Nucleic acid detection characterized by the use of physical, structural and functional properties the label being a member of a cognate binding pair, i.e. extends to antibodies, haptens, avidin
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2565/00 Nucleic acid analysis characterised by mode or means of detection
    • C12Q2565/50 Detection characterised by immobilisation to a surface
    • C12Q2565/501 Detection characterised by immobilisation to a surface being an array of oligonucleotides

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Organic Chemistry (AREA)
  • Molecular Biology (AREA)
  • Biochemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Library & Information Science (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Computing Systems (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Medicinal Chemistry (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Saccharide Compounds (AREA)
  • Micro-Organisms Or Cultivation Processes Thereof (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)
  • General Chemical & Material Sciences (AREA)

Abstract

The present invention provides methods for greatly speeding up and improving de novo genome assembly. The methods disclosed herein use data analysis methods that make de novo assembly of genomes from one or more subjects rapid and inexpensive. The invention further provides that the methods disclosed herein can be used in a variety of applications, including haplotype phasing and metagenomic analysis.

Description

Methods for genome assembly and haplotype phasing

Cross-reference to related applications

This application claims the benefit of Provisional Application No. 61/759,941, filed February 1, 2013, and Provisional Application No. 61/892,355, filed October 17, 2013, the disclosures of which are incorporated herein by reference.

Technical field

The present invention provides methods for genome assembly and haplotype phasing that identify short-, intermediate- and long-range connections within a genome.

Background

It remains difficult, both in theory and in practice, to produce high-quality, highly contiguous genome sequences.

Summary of the invention

A persistent shortcoming of next-generation sequencing (NGS) data is the inability to span large repetitive regions of the genome, owing to short read lengths and relatively small insert sizes. This shortcoming significantly affects de novo assembly. Because the nature and placement of genomic rearrangements are uncertain, contigs separated by long repetitive regions cannot be joined or resequenced. Furthermore, because variants cannot be confidently associated with haplotypes over long distances, phasing information is difficult to determine. The present invention can address all of these problems simultaneously by generating extremely long-range read pairs (XLRPs) that, given suitable input DNA, span genomic distances of hundreds of kilobases and up to megabases. Such data are invaluable for overcoming the problems posed by large repetitive regions of the genome, including centromeres; they can reduce the cost of de novo assembly; and they yield resequencing data of sufficient completeness and accuracy for personalized medicine.

The use of reconstituted chromatin is important in forming associations between DNA segments that are very far apart but molecularly connected. The present invention enables distant segments to be brought together and covalently joined through chromatin conformation, thereby physically connecting previously distant portions of the DNA molecule. Subsequent processing allows the sequences of the associated segments to be determined, yielding read pairs whose separation on the genome extends up to the full length of the input DNA molecules. Because the read pairs derive from the same molecule, they also contain phase information.

In some embodiments, the present invention provides methods that can produce high-quality assemblies with less data than previously required. For example, the methods disclosed herein provide genome assembly from only two lanes of Illumina HiSeq data.

In other embodiments, the present invention provides methods that can generate chromosome-level phasing using a long-distance read pair approach. For example, the methods disclosed herein can phase 90% or more of the heterozygous single nucleotide polymorphisms (SNPs) of an individual with an accuracy of at least 99%. This accuracy is comparable to the phasing produced by substantially more expensive and laborious methods.

In some examples, methods that can produce genomic DNA fragments of up to megabase scale can be used with the methods disclosed herein. Long DNA fragments can be generated to confirm the ability of the present methods to generate read pairs spanning the longest fragments afforded by those extractions. In some cases, DNA fragments longer than 150 kbp can be extracted and used to generate XLRP libraries.

The present invention provides methods for greatly speeding up and improving de novo genome assembly. The methods disclosed herein use data analysis methods that allow rapid and inexpensive de novo assembly of genomes from one or more subjects. The invention further provides that the methods disclosed herein can be used in a variety of applications, including haplotype phasing and metagenomic analysis.

In certain embodiments, the invention provides a method for genome assembly comprising the steps of: generating a plurality of contigs; generating a plurality of read pairs from data produced by probing the physical layout of chromosomes, chromatin, or reconstituted chromatin; mapping or assembling the plurality of read pairs to the plurality of contigs; constructing an adjacency matrix of contigs using the read-mapping or assembly data; and analyzing the adjacency matrix to determine a path through the contigs that represents their order and/or orientation to the genome. In a further embodiment, the invention provides that at least about 90% of the read pairs are weighted by a function of each read's distance to the edge of its contig, so as to incorporate information about which read pairs indicate short-range contacts and which indicate longer-range contacts. In other embodiments, the adjacency matrix is rescaled to down-weight the high number of contacts on some contigs that represent promiscuous regions of the genome, such as conserved binding sites for one or more agents that regulate the scaffolding interactions of chromatin, for example the transcriptional repressor CTCF. In other embodiments, the invention provides a method for genome assembly of a human subject, wherein the plurality of contigs is generated from the human subject's DNA, and wherein the plurality of read pairs is generated by analyzing the human subject's chromosomes, chromatin, or reconstituted chromatin made from the subject's naked DNA.
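The adjacency-matrix step described above can be illustrated with a minimal sketch. It is an illustration only, not the implementation disclosed in the patent: the function name `build_adjacency` and the input format (one tuple of contig indices per mapped read pair) are assumptions made for this example, and no weighting or rescaling is applied.

```python
def build_adjacency(n_contigs, read_pairs):
    """Count read pairs that link each pair of distinct contigs.

    read_pairs: iterable of (contig_a, contig_b) index tuples, one per
    mapped read pair (a hypothetical input format for this sketch).
    """
    adj = [[0] * n_contigs for _ in range(n_contigs)]
    for contig_a, contig_b in read_pairs:
        if contig_a == contig_b:
            continue  # both reads on one contig: no inter-contig evidence
        adj[contig_a][contig_b] += 1
        adj[contig_b][contig_a] += 1  # keep the matrix symmetric
    return adj


# Six toy read pairs over three contigs.
mapped_pairs = [(0, 1), (1, 0), (1, 2), (0, 0), (2, 1), (0, 1)]
adjacency = build_adjacency(3, mapped_pairs)
```

In this toy input, contigs 0 and 1 share three linking pairs and contigs 1 and 2 share two, which is the kind of evidence a later ordering step would use.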

In a further embodiment, the invention provides that the plurality of contigs is generated by a shotgun sequencing method comprising: fragmenting long stretches of a subject's DNA into random fragments of indeterminate size; sequencing the fragments using high-throughput sequencing methods to generate a plurality of sequencing reads; and assembling the sequencing reads to form the plurality of contigs.

In certain embodiments, the invention provides that the plurality of read pairs is generated by probing the physical layout of chromosomes, chromatin, or reconstituted chromatin using a Hi-C based technique. In a further embodiment, the Hi-C based technique comprises: crosslinking chromosomes, chromatin, or reconstituted chromatin with a fixative agent, such as formaldehyde, to form DNA-protein crosslinks; cutting the crosslinked DNA-protein with one or more restriction enzymes to generate a plurality of DNA-protein complexes comprising sticky ends; filling in the sticky ends with nucleotides containing one or more markers, such as biotin, to create blunt ends that are then ligated together; fragmenting the plurality of DNA-protein complexes into fragments; pulling down junction-containing fragments by using the one or more markers; and sequencing the junction-containing fragments using high-throughput sequencing methods to generate the plurality of read pairs. In a further embodiment, the plurality of read pairs for the methods disclosed herein is generated from data produced by probing the physical layout of reconstituted chromatin.

In various embodiments, the invention provides that the plurality of read pairs is determined by probing the physical layout of chromosomes or chromatin isolated from cultured cells or primary tissue. In other embodiments, the plurality of read pairs can be determined by probing the physical layout of reconstituted chromatin formed by complexing naked DNA obtained from a sample of one or more subjects with isolated histones.

In other embodiments, the invention provides a method for determining haplotype phasing comprising the step of identifying one or more sites of heterozygosity in the plurality of read pairs, wherein phasing data for allelic variants can be determined by identifying read pairs that comprise a pair of heterozygous sites.
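One way to picture how read pairs covering two heterozygous sites yield phase information is the following sketch. It is a simplified illustration under stated assumptions, not the method claimed: the observation format `(site_i, allele_i, site_j, allele_j)` with alleles coded 0 or 1, the majority-vote rule, and the function name `phase_sites` are all hypothetical.

```python
from collections import defaultdict, deque


def phase_sites(observations):
    """Assign a relative phase (0 or 1) to each heterozygous site.

    Each observation is one read pair covering two heterozygous sites.
    Sites whose allele labels mostly co-occur (0 with 0, 1 with 1) get the
    same phase; sites whose labels mostly disagree get opposite phases.
    """
    votes = defaultdict(int)
    for site_i, allele_i, site_j, allele_j in observations:
        key = (min(site_i, site_j), max(site_i, site_j))
        votes[key] += 1 if allele_i == allele_j else -1

    graph = defaultdict(list)
    for (i, j), vote in votes.items():
        if vote != 0:  # ties carry no usable phase evidence
            graph[i].append((j, vote > 0))
            graph[j].append((i, vote > 0))

    phase = {}
    for seed in sorted(graph):
        if seed in phase:
            continue
        phase[seed] = 0  # each unconnected block gets its own arbitrary seed
        queue = deque([seed])
        while queue:
            node = queue.popleft()
            for neighbor, same in graph[node]:
                if neighbor not in phase:
                    phase[neighbor] = phase[node] if same else 1 - phase[node]
                    queue.append(neighbor)
    return phase


# Toy data: positions 100 and 50000 are linked in cis (with one discordant
# pair), positions 50000 and 120000 are linked in trans.
observations = [
    (100, 0, 50000, 0),
    (100, 1, 50000, 1),
    (100, 0, 50000, 1),
    (50000, 0, 120000, 1),
    (50000, 1, 120000, 0),
]
phases = phase_sites(observations)
```

Phases are only meaningful relative to the seed site of each connected block; a real phasing tool would also model sequencing and ligation errors rather than take a simple majority vote.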

In various embodiments, the invention provides a method for high-throughput bacterial genome assembly comprising the step of generating a plurality of read pairs by probing the physical layout of a plurality of microbial chromosomes using a modified Hi-C based method, the modified method comprising the steps of: collecting microbes from an environment; and adding a fixative agent, such as formaldehyde, to form crosslinks within each microbial cell; wherein read pairs mapping to different contigs indicate which contigs are from the same species.
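The idea that read pairs bridging two contigs indicate a shared species of origin can be sketched as a simple grouping of contigs by linkage. This is an assumption-laden illustration: the `min_links` threshold, the union-find grouping, and the name `cluster_contigs` are choices made for the example and are not specified in the source.

```python
from collections import Counter


def cluster_contigs(contigs, read_pairs, min_links=2):
    """Group contigs connected by at least `min_links` bridging read pairs."""
    links = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a != contig_b:
            links[tuple(sorted((contig_a, contig_b)))] += 1

    parent = {contig: contig for contig in contigs}

    def find(contig):
        # Follow parent pointers to the representative, compressing the path.
        while parent[contig] != contig:
            parent[contig] = parent[parent[contig]]
            contig = parent[contig]
        return contig

    for (contig_a, contig_b), count in links.items():
        if count >= min_links:
            parent[find(contig_a)] = find(contig_b)

    groups = {}
    for contig in contigs:
        groups.setdefault(find(contig), set()).add(contig)
    return sorted(groups.values(), key=sorted)


contig_names = ["c1", "c2", "c3", "c4"]
bridging_pairs = [
    ("c1", "c2"), ("c2", "c1"),                 # two links between c1 and c2
    ("c3", "c4"), ("c4", "c3"), ("c3", "c4"),   # three links between c3 and c4
    ("c1", "c3"),                               # a single, possibly spurious link
]
clusters = cluster_contigs(contig_names, bridging_pairs)
```

The threshold keeps a single stray link (here between c1 and c3) from merging two groups; the appropriate value would depend on sequencing depth and the background rate of inter-cell ligation.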

In some embodiments, the invention provides a method for genome assembly comprising: (a) generating a plurality of contigs; (b) determining a plurality of read pairs from data generated by probing the physical layout of chromosomes, chromatin, or reconstituted chromatin; (c) mapping the plurality of read pairs to the plurality of contigs; (d) constructing an adjacency matrix of contigs using the read-mapping data; and (e) analyzing the adjacency matrix to determine a path through the contigs that represents their order and/or orientation to the genome.
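Step (e) can be pictured with a greedy heuristic that accepts the strongest contig-to-contig links first. The source does not state which path-finding algorithm is used, so the greedy edge selection below, the function name `order_contigs`, and the list-of-lists matrix format are assumptions for illustration; contig orientation is not handled.

```python
def order_contigs(adj):
    """Greedily chain contigs into linear paths using the strongest links.

    adj: symmetric matrix (list of lists) of link weights between contigs.
    An edge is accepted only if both contigs still have a free end and the
    edge would not close a cycle, so every result is a simple path.
    """
    n = len(adj)
    edges = sorted(
        ((adj[i][j], i, j) for i in range(n) for j in range(i + 1, n) if adj[i][j] > 0),
        reverse=True,
    )
    degree = [0] * n
    component = list(range(n))
    neighbors = [[] for _ in range(n)]

    def find(x):
        while component[x] != x:
            x = component[x]
        return x

    for _, i, j in edges:
        if degree[i] < 2 and degree[j] < 2 and find(i) != find(j):
            component[find(i)] = find(j)
            degree[i] += 1
            degree[j] += 1
            neighbors[i].append(j)
            neighbors[j].append(i)

    paths, seen = [], set()
    for start in range(n):
        if start in seen or degree[start] == 2:
            continue  # begin walks only at path endpoints or isolated contigs
        path, previous, current = [], None, start
        while current is not None:
            path.append(current)
            seen.add(current)
            onward = [x for x in neighbors[current] if x != previous]
            previous, current = current, (onward[0] if onward else None)
        paths.append(path)
    return paths


# Toy matrix in which the strongest links are 0-2, 1-3 and 0-3.
link_matrix = [
    [0, 1, 10, 8],
    [1, 0, 0, 9],
    [10, 0, 0, 2],
    [8, 9, 2, 0],
]
scaffold_paths = order_contigs(link_matrix)
```

The result lists contigs in their inferred order along each scaffold; reading a path backwards describes the same scaffold.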

In a further embodiment, the invention provides a method of generating a plurality of read pairs by probing the physical layout of chromosomes, chromatin, or reconstituted chromatin using a Hi-C based technique. In a further embodiment, the Hi-C based technique comprises: (a) crosslinking chromosomes, chromatin, or reconstituted chromatin with a fixative agent to form DNA-protein crosslinks; (b) cutting the crosslinked DNA-protein with one or more restriction enzymes to generate a plurality of DNA-protein complexes comprising sticky ends; (c) filling in the sticky ends with nucleotides containing one or more markers to create blunt ends that are then ligated together; (d) shearing the plurality of DNA-protein complexes into fragments; (e) pulling down junction-containing fragments by using the one or more markers; and (f) sequencing the junction-containing fragments using high-throughput sequencing methods to generate the plurality of read pairs.

In certain embodiments, the plurality of read pairs is determined by probing the physical layout of chromosomes or chromatin isolated from cultured cells or primary tissue. In other embodiments, the plurality of read pairs is determined by probing the physical layout of reconstituted chromatin formed by complexing naked DNA obtained from a sample of one or more subjects with isolated histones.

In some embodiments, at least about 50%, about 60%, about 70%, about 80%, about 90%, about 95%, or about 99% or more of the plurality of read pairs are weighted by a function of the distance of the reads from the edges of their contigs, so as to reflect the higher probability of short contacts than of long contacts. In some embodiments, the adjacency matrix is rescaled to down-weight the high number of contacts on some contigs that represent promiscuous regions of the genome.
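The two adjustments described above, distance-based weighting and rescaling of promiscuous contigs, can be sketched as follows. The source does not give the weighting function or the rescaling formula, so both are assumptions chosen for illustration: a reciprocal decay with a hypothetical `scale` parameter, and division of each entry by the geometric mean of the two contigs' total contact counts.

```python
import math


def pair_weight(dist_a, dist_b, scale=10_000.0):
    """Assumed weight for one read pair linking two contigs.

    dist_a, dist_b: distance in bases of each read from the edge of its
    contig. Pairs whose reads sit close to the edges imply a short contact
    and receive a weight near 1; distant reads are down-weighted.
    """
    return 1.0 / (1.0 + (dist_a + dist_b) / scale)


def rescale(adj):
    """Down-weight contigs that contact everything (assumed normalization).

    Each entry is divided by the geometric mean of the two contigs' total
    contact counts, so a promiscuous contig no longer dominates the matrix.
    """
    totals = [sum(row) for row in adj]
    n = len(adj)
    scaled = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if adj[i][j] and totals[i] and totals[j]:
                scaled[i][j] = adj[i][j] / math.sqrt(totals[i] * totals[j])
    return scaled


# Contig 0 behaves promiscuously: it has many contacts with both others.
raw = [
    [0, 50, 50],
    [50, 0, 10],
    [50, 10, 0],
]
scaled = rescale(raw)
```

Before rescaling, the link between contigs 0 and 1 is five times stronger than the link between contigs 1 and 2; after rescaling the ratio is smaller, which is the intended effect of down-weighting a promiscuous contig.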

In certain embodiments, the promiscuous regions of the genome include one or more conserved binding sites for one or more agents that regulate the scaffolding interactions of chromatin. In some examples, the agent is the transcriptional repressor CTCF.

In some embodiments, the invention provides a method for genome assembly of a human subject, wherein the plurality of contigs is generated from the human subject's DNA, and wherein the plurality of read pairs is generated by analyzing the human subject's chromosomes, chromatin, or reconstituted chromatin made from the subject's naked DNA.

In other embodiments, the invention provides a method for determining haplotype phasing comprising identifying one or more sites of heterozygosity in the plurality of read pairs, wherein phasing data for allelic variants can be determined by identifying read pairs that comprise a pair of heterozygous sites.

In other embodiments, the invention provides a method for metagenomic assembly, wherein the plurality of read pairs is generated by probing the physical layout of a plurality of microbial chromosomes using a modified Hi-C based method, the modified method comprising the steps of: collecting microbes from an environment; and adding a fixative agent, such as formaldehyde, to form crosslinks within each microbial cell; and wherein read pairs mapping to different contigs indicate which contigs are from the same species.

In some embodiments, the invention provides a method of assembling a plurality of contigs originating from a single DNA molecule, comprising generating a plurality of read pairs from the single DNA molecule and assembling the contigs using the read pairs, wherein at least 1% of the read pairs span a distance greater than 50 kB on the single DNA molecule and the read pairs are generated within 14 days. In some embodiments, at least 10% of the read pairs span a distance greater than 50 kB on the single DNA molecule. In other embodiments, at least 1% of the read pairs span a distance greater than 100 kB on the single DNA molecule. In a further embodiment, the read pairs are generated within 7 days.
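The span criteria above (for example, at least 1% of read pairs spanning more than 50 kB) can be checked against a library with a short calculation. The sketch assumes each read pair is represented by the mapped start coordinates of its two reads on the same molecule; that representation and the name `fraction_spanning` are hypothetical.

```python
def fraction_spanning(read_pairs, min_span):
    """Return the fraction of read pairs separated by more than min_span bases.

    read_pairs: list of (position_a, position_b) coordinates on one molecule.
    """
    if not read_pairs:
        return 0.0
    long_pairs = sum(
        1 for position_a, position_b in read_pairs
        if abs(position_b - position_a) > min_span
    )
    return long_pairs / len(read_pairs)


# Four toy pairs with separations of 300, 60,000, 400 and 120,000 bases.
toy_pairs = [(0, 300), (1_000, 61_000), (5_000, 5_400), (20_000, 140_000)]
over_50kb = fraction_spanning(toy_pairs, 50_000)
over_100kb = fraction_spanning(toy_pairs, 100_000)
```

A library would satisfy the 50 kB criterion of the embodiment above when `fraction_spanning(pairs, 50_000)` is at least 0.01.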

In other embodiments, the invention provides a method of assembling a plurality of contigs originating from a single DNA molecule, comprising generating a plurality of read pairs from the single DNA molecule and assembling the contigs using the read pairs, wherein at least 1% of the read pairs span a distance greater than 30 kB on the single DNA molecule. In some embodiments, at least 10% of the read pairs span a distance greater than 30 kB on the single DNA molecule. In other embodiments, at least 1% of the read pairs span a distance greater than 50 kB on the single DNA molecule.

In other embodiments, the invention provides a method of haplotype phasing comprising generating a plurality of read pairs from a single DNA molecule and assembling a plurality of contigs of the DNA molecule using the read pairs, wherein at least 1% of the read pairs span a distance greater than 50 kB on the single DNA molecule and the haplotype phasing is performed with greater than 70% accuracy. In some embodiments, at least 10% of the read pairs span a distance greater than 50 kB on the single DNA molecule. In other embodiments, at least 1% of the read pairs span a distance greater than 100 kB on the single DNA molecule. In a further embodiment, the haplotype phasing is performed with greater than 90% accuracy.
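The source states phasing accuracy thresholds without defining how accuracy is measured. One plausible reading, offered here only as an assumption, is the fraction of phased sites that agree with a trusted reference phasing, allowing for the fact that the two haplotype labels are interchangeable. The function name `phasing_accuracy` and the dictionary format are hypothetical.

```python
def phasing_accuracy(predicted, truth):
    """Fraction of shared sites whose predicted phase matches the truth.

    predicted, truth: dicts mapping site position -> phase label (0 or 1).
    Haplotype labels are arbitrary, so the better of the direct and the
    globally flipped comparison is reported.
    """
    shared = [site for site in predicted if site in truth]
    if not shared:
        return 0.0
    agree = sum(1 for site in shared if predicted[site] == truth[site])
    return max(agree, len(shared) - agree) / len(shared)


predicted_phase = {1: 0, 2: 0, 3: 1, 4: 1}
reference_phase = {1: 1, 2: 1, 3: 0, 4: 1}
accuracy = phasing_accuracy(predicted_phase, reference_phase)
```

In the toy data, three of the four sites agree once the labels are flipped, so the call would fall between the 70% and 90% thresholds mentioned above. This whole-block measure ignores switch errors within a block, which stricter evaluations count separately.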

In a further embodiment, the invention provides a method of haplotype phasing comprising generating a plurality of read pairs from a single DNA molecule in vitro and assembling a plurality of contigs of the DNA molecule using the read pairs, wherein at least 1% of the read pairs span a distance greater than 30 kB on the single DNA molecule and the haplotype phasing is performed with greater than 70% accuracy. In some embodiments, at least 10% of the read pairs span a distance greater than 30 kB on the single DNA molecule. In other embodiments, at least 1% of the read pairs span a distance greater than 50 kB on the single DNA molecule. In other embodiments, the haplotype phasing is performed with greater than 90% accuracy. In other embodiments, the haplotype phasing is performed with greater than 70% accuracy.

In some embodiments, the invention provides a method of generating a first read pair from a first DNA molecule, comprising: (a) crosslinking the first DNA molecule in vitro, wherein the first DNA molecule comprises a first DNA fragment and a second DNA fragment; (b) linking the first DNA fragment to the second DNA fragment, thereby forming a linked DNA fragment; and (c) sequencing the linked DNA fragment, thereby obtaining the first read pair.

In some embodiments, a plurality of association molecules, e.g., from reconstituted chromatin, are crosslinked to the first DNA molecule. In some examples, the association molecules comprise amino acids. In further examples, the association molecules are peptides or proteins. In certain embodiments, the first DNA molecule is crosslinked with a fixative agent. In some examples, the fixative agent is formaldehyde. In some embodiments, the first DNA fragment and the second DNA fragment are generated by severing the first DNA molecule. In certain embodiments, the method further comprises assembling a plurality of contigs of the first DNA molecule using the first read pair. In some embodiments, each of the first DNA fragment and the second DNA fragment is connected to at least one affinity tag, and the ligated DNA fragment is captured using the affinity tags.

In further embodiments, the method further comprises: (a) providing a plurality of association molecules, e.g., from reconstituted chromatin, to at least a second DNA molecule; (b) crosslinking the association molecules to the second DNA molecule, thereby forming a second complex in vitro; (c) severing the second complex, thereby generating a third DNA fragment and a fourth DNA fragment; (d) ligating the third DNA fragment to the fourth DNA fragment, thereby forming a second ligated DNA fragment; and (e) sequencing the second ligated DNA fragment, thereby obtaining a second read pair. In some examples, less than 40% of the DNA fragments from the DNA molecules are ligated to DNA fragments from any other DNA molecule. In further examples, less than 20% of the DNA fragments from the DNA molecules are ligated to DNA fragments from any other DNA molecule.

In other embodiments, the invention provides a method of generating a first read pair from a first DNA molecule comprising a predetermined sequence, the method comprising: (a) providing one or more DNA-binding molecules to the first DNA molecule, wherein the one or more DNA-binding molecules bind to the predetermined sequence; (b) crosslinking the first DNA molecule in vitro, wherein the first DNA molecule comprises a first DNA fragment and a second DNA fragment; (c) ligating the first DNA fragment to the second DNA fragment, thereby forming a first ligated DNA fragment; and (d) sequencing the first ligated DNA fragment, thereby obtaining the first read pair; wherein the probability that the predetermined sequence appears in the read pair is affected by the binding of the DNA-binding molecule to the predetermined sequence.

In some embodiments, the DNA-binding molecule is a nucleic acid that can hybridize to the predetermined sequence. In some examples, the nucleic acid is RNA. In other examples, the nucleic acid is DNA. In other embodiments, the DNA-binding molecule is a small molecule. In some examples, the small molecule binds to the predetermined sequence with a binding affinity of less than 100 μM. In further examples, the small molecule binds to the predetermined sequence with a binding affinity of less than 1 μM. In further embodiments, the DNA-binding molecule is immobilized on a surface or a solid support.

In some embodiments, the probability that the predetermined sequence appears in the read pair is decreased. In other embodiments, the probability that the predetermined sequence appears in the read pair is increased.

In other embodiments, the invention provides an in vitro library comprising a plurality of read pairs, each read pair comprising at least a first sequence element and a second sequence element, wherein the first and second sequence elements originate from a single DNA molecule, and wherein at least 1% of the read pairs comprise first and second sequence elements that are at least 50 kB apart on the single DNA molecule.

In some embodiments, at least 10% of the read pairs comprise first and second sequence elements that are at least 50 kB apart on the single DNA molecule. In other embodiments, at least 1% of the read pairs comprise first and second sequence elements that are at least 100 kB apart on the single DNA molecule.
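The library thresholds above can be checked computationally once read pairs have been mapped. The following is a minimal illustrative sketch, not part of the disclosed method: it assumes each read pair is given as a hypothetical tuple `(molecule_1, position_1, molecule_2, position_2)`, and it assumes 1 kB corresponds to 1,000 base pairs.

```python
from typing import Iterable, Tuple

# Hypothetical record layout: (molecule_1, position_1, molecule_2, position_2).
ReadPair = Tuple[str, int, str, int]


def fraction_spanning(pairs: Iterable[ReadPair], min_span_bp: int) -> float:
    """Return the fraction of all read pairs whose two ends map to the same
    molecule and lie at least `min_span_bp` base pairs apart."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    long_range = sum(
        1
        for mol_1, pos_1, mol_2, pos_2 in pairs
        if mol_1 == mol_2 and abs(pos_2 - pos_1) >= min_span_bp
    )
    return long_range / len(pairs)


# Toy data: two of the four pairs are separated by at least 50,000 bp.
example_pairs = [
    ("mol", 0, "mol", 60_000),
    ("mol", 100, "mol", 500),
    ("mol", 0, "mol", 120_000),
    ("mol", 10, "mol", 20),
]
share_50kb = fraction_spanning(example_pairs, 50_000)
share_100kb = fraction_spanning(example_pairs, 100_000)
```

Under these assumptions, a library would satisfy the "at least 1% of read pairs at least 50 kB apart" condition when `share_50kb >= 0.01`.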

In further embodiments, less than 20% of the read pairs comprise one or more predetermined sequences. In further embodiments, less than 10% of the read pairs comprise one or more predetermined sequences. In further embodiments, less than 5% of the read pairs comprise one or more predetermined sequences.

In some embodiments, the predetermined sequence is determined by one or more nucleic acids that can hybridize to the predetermined sequence. In some examples, the one or more nucleic acids are RNA. In other examples, the one or more nucleic acids are DNA. In further embodiments, the one or more nucleic acids are immobilized on a surface or a solid support.

In other embodiments, the predetermined sequence is determined by one or more small molecules. In some examples, the one or more small molecules bind to the predetermined sequence with a binding affinity of less than 100 μM. In further examples, the one or more small molecules bind to the predetermined sequence with a binding affinity of less than 1 μM.

In some embodiments, the invention provides a composition comprising a DNA fragment and a plurality of association molecules, e.g., from reconstituted chromatin, wherein: (a) the association molecules are crosslinked to the DNA fragment in an in vitro complex; and (b) the in vitro complex is immobilized on a solid support.

In other embodiments, the invention provides a composition comprising a DNA fragment, a plurality of association molecules, and a DNA-binding molecule, wherein: (a) the DNA-binding molecule is bound to a predetermined sequence of the DNA fragment; and (b) the association molecules are crosslinked to the DNA fragment.

In some embodiments, the DNA-binding molecule is a nucleic acid that can hybridize to the predetermined sequence. In some examples, the nucleic acid is RNA. In other examples, the nucleic acid is DNA. In further examples, the nucleic acid is immobilized on a surface or a solid support.

In other embodiments, the DNA-binding molecule is a small molecule. In some examples, the small molecule binds to the predetermined sequence with a binding affinity of less than 100 μM. In other examples, the small molecule binds to the predetermined sequence with a binding affinity of less than 1 μM.

Incorporation by Reference

All publications, patents, and patent applications mentioned in this specification are incorporated by reference to the same extent as if each individual publication, patent, or patent application were specifically and individually indicated to be incorporated by reference. All publications, patents, and patent applications mentioned in this specification are hereby incorporated by reference in their entirety, together with any references cited therein.

Brief Description of the Drawings

The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the invention will be obtained by reference to the following detailed description, which sets forth illustrative embodiments in which the principles of the invention are utilized, and to the accompanying drawings, of which:

Figure 1 illustrates genome assembly using high-throughput sequencing reads. The genome to be assembled is shown (top). Typically, genomes contain many repeat sequences that are difficult to assemble. Random, high-throughput sequence data from the genome are collected (middle) and assembled into "contigs" in the regions of the genome that are unique (bottom). Contig assembly generally stops at the many repeat sequences. The final output is a set of thousands of contigs whose order and orientation relative to one another are unknown. In the figure, the contigs are arbitrarily numbered from longest to shortest.

Figures 2A-D illustrate a Hi-C-based protocol of the invention: (A) shows where DNA is crosslinked and processed to create biotinylated junction fragments for sequencing; and (B-D) provide contact map data on human chromosome 14 for a variety of restriction enzymes. As shown, most contacts are local along the chromosome.

Figures 3A-C provide a method of the invention that uses Hi-C sequence data to assist genome assembly: (A) illustrates where DNA is crosslinked and processed using a Hi-C-based protocol; (B) demonstrates where read-pair data are mapped to assembled contigs generated from random shotgun sequencing and assembly; and (C) illustrates that, after filtering and weighting, an adjacency matrix summarizing the read-pair data between all contigs can be constructed. This matrix can be reordered to indicate the correct assembly path. As shown, most read pairs map within a contig, which makes it possible to learn the distribution of contact distances (see, e.g., Figure 6). Read pairs that map to different contigs provide data about which contigs are adjacent in a correct genome assembly.
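As a minimal illustrative sketch of the adjacency matrix described for Figure 3C, the following counts read pairs that link different contigs. The input layout `(contig_of_read_1, contig_of_read_2, mapping_quality)`, the `min_mapq` filter, and the use of unweighted counts are all assumptions made for illustration; the specification does not prescribe a particular filtering or weighting scheme.

```python
from typing import List, Sequence, Tuple

# Hypothetical record layout: (contig_of_read_1, contig_of_read_2, mapping_quality).
MappedPair = Tuple[str, str, int]


def contig_adjacency(
    contigs: Sequence[str],
    read_pairs: Sequence[MappedPair],
    min_mapq: int = 20,
) -> List[List[int]]:
    """Build a symmetric matrix whose entry [i][j] counts the read pairs
    linking contig i to contig j, ignoring low-quality mappings."""
    index = {name: i for i, name in enumerate(contigs)}
    size = len(contigs)
    matrix = [[0] * size for _ in range(size)]
    for contig_a, contig_b, mapq in read_pairs:
        if mapq < min_mapq:
            continue  # filtering step: drop unreliable mappings
        if contig_a == contig_b:
            continue  # pairs within one contig carry no ordering information
        i, j = index[contig_a], index[contig_b]
        matrix[i][j] += 1
        matrix[j][i] += 1
    return matrix


example_links = [
    ("c1", "c2", 60),
    ("c1", "c2", 30),
    ("c2", "c3", 60),
    ("c1", "c1", 60),  # intra-contig pair, skipped
    ("c1", "c3", 5),   # low mapping quality, skipped
]
adjacency = contig_adjacency(["c1", "c2", "c3"], example_links)
```

In this toy example, `c2` is linked to both `c1` and `c3` while `c1` and `c3` share no reliable links, which would be consistent with the ordering c1, c2, c3.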

Figure 4 illustrates an exemplary protocol of the invention: DNA fragments are first generated and prepared; chromatin is then assembled in vitro and biotinylated; the chromatin/DNA complex is then fixed with formaldehyde and pulled down with streptavidin beads; the complex is then digested with a restriction enzyme to generate sticky ends, which are filled in with biotinylated dCTP and interior, sulfated GTP; following blunt-end ligation, the chromatin/DNA complex undergoes protease digestion, exonuclease digestion, and shearing; the DNA fragments are then pulled down via biotin and ligated to sequencing adapters; finally, the DNA fragments are size-selected and sequenced.

Figures 5A-B illustrate the ambiguities in genome assembly and alignment that arise from repetitive regions of the genome. (A) Uncertainty in linkage results from read pairs that cannot bridge a repetitive region. (B) Uncertainty in the placement of a segment results from read pairs that cannot span the bordering repeats.

Figure 6 illustrates the distribution of genomic distances between read pairs from a human XLRP library. The maximum distances achievable with other technologies are indicated for comparison.

Figure 7 illustrates the phasing accuracy for sample NA12878, whose haplotypes are well characterized. The distances shown are those between the SNPs being phased.

Figure 8 illustrates various components of an exemplary computer system according to various embodiments of the invention.

Figure 9 is a block diagram illustrating the architecture of an exemplary computer system that can be used in connection with various embodiments of the invention.

Figure 10 is a diagram illustrating an exemplary computer network that can be used in connection with various embodiments of the invention.

Figure 11 is a block diagram illustrating the architecture of another exemplary computer system that can be used in connection with various embodiments of the invention.

Detailed Description

As used herein and in the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to "a contig" includes a plurality of such contigs, and reference to "probing the physical layout of chromosomes" includes reference to one or more methods for probing the physical layout of chromosomes and equivalents thereof known to those skilled in the art, and so forth.

Also, the use of "and" means "and/or" unless stated otherwise. Similarly, "comprise," "include," "contain," and "have" are interchangeable and are not intended to be limiting.

It is to be further understood that where descriptions of various embodiments use the term "comprising," those skilled in the art would understand that, in some specific instances, an embodiment can alternatively be described using the language "consisting essentially of" or "consisting of."

The term "sequencing read" as used herein refers to a fragment of DNA for which the sequence has been determined.

The term "contig" as used herein refers to a contiguous region of DNA sequence. Contigs can be determined by any of a number of methods known in the art, such as by comparing sequencing reads for overlapping sequences, and/or by comparing sequencing reads against a database of known sequences, in order to identify which sequencing reads have a high probability of being contiguous.
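To make the idea of determining contigs from overlapping sequencing reads concrete, the following toy sketch greedily merges the two reads with the longest exact suffix/prefix overlap until no overlap of at least `min_len` bases remains. It is an illustration only, under simplifying assumptions (error-free reads, a single strand, exact matching); the function names and the `min_len` parameter are hypothetical, and practical assemblers use considerably more sophisticated methods.

```python
from typing import List


def overlap(left: str, right: str, min_len: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of
    `right`, or 0 if that length is below `min_len`."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_reads(reads: List[str], min_len: int = 3) -> List[str]:
    """Greedily merge the pair of sequences with the largest overlap until
    no pair overlaps by at least `min_len` bases."""
    contigs = list(reads)
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    k = overlap(left, right, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [
            seq for idx, seq in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]


assembled = merge_reads(["ATGCGT", "GCGTAC", "TACGGA"])
```

Reads that share no sufficiently long overlap with any other read are returned unchanged as separate contigs, mirroring how assembly stalls where unique overlap information is missing.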

The term "subject" as used herein can refer to any eukaryotic or prokaryotic organism.

The term "naked DNA" as used herein can refer to DNA that is substantially free of complexed proteins. For example, it can refer to DNA complexed with less than about 50%, about 40%, about 30%, about 20%, about 10%, about 5%, or about 1% of the endogenous proteins found in the cell nucleus.

The term "reconstituted chromatin" as used herein can refer to chromatin formed by complexing isolated nuclear proteins with naked DNA.

The term "read pair" as used herein can refer to two or more elements that are linked together to provide sequence information. In some cases, the number of read pairs can refer to the number of mappable read pairs. In other cases, the number of read pairs can refer to the total number of read pairs generated.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice of the disclosed methods and compositions, exemplary methods and materials are now described.

The invention provides methods for generating extremely long-range read pairs and for using those data to advance all of the aforementioned pursuits. In some embodiments, the invention provides methods that produce a highly contiguous and accurate human genome assembly with only about 300 million read pairs. In other embodiments, the invention provides methods that phase 90% or more of the heterozygous variants in a human genome with 99% or greater accuracy. Furthermore, the range of the read pairs generated by the invention can be extended to span much larger genomic distances. The assembly is produced from a standard shotgun library in addition to an extremely long-range read pair library. In other embodiments, the invention provides software capable of using both of these sets of sequencing data. Phased variants are produced from a single long-range read pair library: reads from the library are mapped to a reference genome and then used to assign variants to one of the individual's two parental chromosomes. Finally, the invention allows even larger DNA fragments to be extracted using known techniques, so as to generate exceptionally long-range read pairs.
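The assignment of variants to one of two chromosomes from read-pair links can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed software: each observation is a hypothetical tuple `(snp_a, allele_a, snp_b, allele_b)` recording the alleles (coded 0 or 1) seen at two heterozygous SNPs by the two ends of one read pair, conflicting observations are resolved by majority vote, and phase is propagated by breadth-first search.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (snp_a, allele_a, snp_b, allele_b), alleles coded 0/1.
Observation = Tuple[int, int, int, int]


def phase_snps(observations: Iterable[Observation]) -> Dict[int, int]:
    """Return, for each linked SNP, the allele carried by one haplotype;
    the other haplotype carries the opposite allele. SNPs in separate
    connected components are phased independently of one another."""
    votes: Dict[Tuple[int, int], int] = defaultdict(int)
    for snp_a, allele_a, snp_b, allele_b in observations:
        if snp_a == snp_b:
            continue
        key = (min(snp_a, snp_b), max(snp_a, snp_b))
        # +1 if both ends show the same allele code, -1 otherwise.
        votes[key] += 1 if allele_a == allele_b else -1

    neighbours = defaultdict(list)
    for (snp_a, snp_b), vote in votes.items():
        if vote != 0:  # tied evidence gives no phase information
            neighbours[snp_a].append((snp_b, vote > 0))
            neighbours[snp_b].append((snp_a, vote > 0))

    phase: Dict[int, int] = {}
    for seed in sorted(neighbours):
        if seed in phase:
            continue
        phase[seed] = 0  # arbitrary choice for the first SNP of a component
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            for other, same in neighbours[current]:
                if other not in phase:
                    phase[other] = phase[current] if same else 1 - phase[current]
                    queue.append(other)
    return phase


example_observations = [
    (100, 0, 5_000, 0),
    (100, 1, 5_000, 1),
    (100, 0, 5_000, 1),   # one conflicting pair, outvoted two to one
    (5_000, 0, 90_000, 1),
]
haplotype = phase_snps(example_observations)
```

Longer-range read pairs add edges between more distant SNPs, which tends to join otherwise separate components into larger phased blocks.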

The mechanism by which these repeats obstruct assembly and alignment is quite simple, and it is ultimately a consequence of ambiguity (Figure 5). In the case of large repetitive regions, the difficulty is one of span. If a read or read pair is not long enough to span a repetitive region, the regions bordering the repetitive element cannot be connected with confidence. In the case of smaller repetitive elements, the problem is primarily one of placement. When a region is flanked by two repetitive elements, which is common in genomes, determining its exact placement becomes difficult, if not impossible, because the flanking elements resemble all other elements of their class. In both cases, it is the lack of distinguishing information within the repeats that makes identifying, and therefore placing, a particular repeat challenging. What is needed is the ability to experimentally establish connections between the unique segments that are hemmed in or separated by repetitive regions.

The methods of the invention greatly advance the field of genomics by overcoming the substantial barriers posed by these repetitive regions, and can thereby enable important advances in many domains of genomic analysis. To perform a de novo assembly with previous technologies, one must either settle for an assembly fragmented into many small scaffolds, or commit substantial time and resources to producing a large-insert library, or use other approaches to generate a more contiguous assembly. Such approaches may include acquiring very deep sequencing coverage, constructing BAC or fosmid libraries, optical mapping, or, most likely, some combination of these and other techniques. The intense resource and time requirements put such approaches out of reach for most small laboratories and limit the study of non-model organisms. Because the methods described herein can produce very long-range read pairs, de novo assembly can be achieved with a single sequencing run. This would cut assembly costs by orders of magnitude and shorten the time required from months or years to weeks. In some cases, the methods disclosed herein allow a plurality of read pairs to be generated in less than 14 days, less than 13 days, less than 12 days, less than 11 days, less than 10 days, less than 9 days, less than 8 days, less than 7 days, less than 6 days, less than 5 days, less than 4 days, or in a range between any two of the foregoing periods. For example, the methods can allow a plurality of read pairs to be generated in about 10 days to 14 days. Building genomes for even the most niche of organisms would become routine, phylogenetic analyses would suffer no lack of comparisons, and projects such as Genome 10k could be realized.

Similarly, structural and phasing analyses for medical purposes also remain challenging. There is astounding heterogeneity among cancers, among individuals with the same type of cancer, and even within the same tumor. Teasing out causative effects from consequential ones requires very high precision and throughput at a low per-sample cost. In the domain of personalized medicine, one of the gold standards of genomic care is a sequenced genome with all variants thoroughly characterized and phased, including large and small structural rearrangements and novel mutations. Achieving this with previous technologies demands an effort akin to that required for a de novo assembly, which is currently too expensive and laborious to be a routine medical procedure. The disclosed methods can rapidly produce complete, accurate genomes at low cost and can thereby yield many highly sought-after capabilities in the study and treatment of human disease.

Finally, applying the methods disclosed herein to phasing can combine the convenience of statistical approaches with the accuracy of familial analysis, saving money, labor, and samples compared with using either approach alone. De novo variant phasing, a highly desirable phasing analysis that is prohibitive with previous technologies, can be performed readily using the methods disclosed herein. This is particularly important because the vast majority of human variants are rare (minor allele frequency of less than 5%). Phasing information is valuable for population genetic studies, which gain significant advantages from networks of highly connected haplotypes (collections of variants assigned to a single chromosome) relative to unlinked genotypes. Haplotype information can enable higher-resolution studies of historical changes in population size, migrations, and exchange between subpopulations, and allows specific variants to be traced back to particular parents and grandparents. This in turn clarifies the genetic transmission of disease-associated variants, as well as the interplay between variants when several occur in a single individual. The methods of the invention can ultimately enable the preparation, sequencing, and analysis of extremely long-range read pair (XLRP) libraries.

In some embodiments of the invention, a tissue or DNA sample from a subject can be provided, and the method can return an assembled genome, alignments with called variants (including large structural variants), phased variant calls, or any additional analyses. In other embodiments, the methods disclosed herein can provide XLRP libraries directly for the individual.

In various embodiments of the invention, the methods disclosed herein can generate extremely long-range read pairs separated by large distances. The upper limit of this distance can be raised as the ability to collect DNA samples of large size improves. In some cases, the read pairs can span genomic distances of up to 50 kbp, 60 kbp, 70 kbp, 80 kbp, 90 kbp, 100 kbp, 125 kbp, 150 kbp, 175 kbp, 200 kbp, 225 kbp, 250 kbp, 300 kbp, 400 kbp, 500 kbp, 600 kbp, 700 kbp, 800 kbp, 900 kbp, 1000 kbp, 1500 kbp, 2000 kbp, 2500 kbp, 3000 kbp, 4000 kbp, 5000 kbp, or more. In some examples, the read pairs can span genomic distances of up to 500 kbp. In other examples, the read pairs can span genomic distances of up to 2000 kbp. The methods disclosed herein can integrate and build upon standard techniques in molecular biology, and are further well suited to increases in efficiency, specificity, and genomic coverage. In some cases, the read pairs can be generated in less than 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 60, or 90 days. In some examples, the read pairs can be generated in less than about 14 days. In further examples, the read pairs can be generated in less than about 10 days. In some cases, the methods of the invention can provide greater than about 5%, about 10%, about 15%, about 20%, about 30%, about 40%, about 50%, about 60%, about 70%, about 80%, about 90%, about 95%, about 99%, or about 100% of the read pairs with at least about 50%, about 60%, about 70%, about 80%, about 90%, about 95%, about 99%, or about 100% accuracy in correctly ordering and/or orienting the plurality of contigs. For example, the methods can provide about 90% to 100% accuracy in correctly ordering and/or orienting the plurality of contigs.

In other embodiments, the methods disclosed herein can be used with currently employed sequencing technologies. For example, the methods can be used in combination with well-tested and/or widely deployed sequencing instruments. In further embodiments, the methods disclosed herein can be used with technologies and approaches derived from currently employed sequencing technologies.

The methods of the invention dramatically simplify de novo genome assembly for a wide range of organisms. With previous technologies, such assemblies are currently limited by the short inserts of economical mate-pair libraries. Although it is possible to generate read pairs at genomic distances of up to the 40-50 kbp accessible with fosmids, these are expensive, cumbersome, and too short to span the longest repetitive stretches, including those within centromeres, which in humans range from 300 kbp to 5 Mbp in size. The methods disclosed herein can provide read pairs capable of spanning large distances (e.g., megabases or longer) and thereby overcome these scaffold integrity challenges. Accordingly, producing chromosome-level assemblies can become routine by using the methods of the invention. More laborious avenues for assembly, which currently cost research laboratories incredible amounts of time and money and prohibit expansive genomic catalogs, may become unnecessary, freeing up resources for more meaningful analyses. Similarly, the acquisition of long-range phasing information can provide tremendous additional power to population genomic, phylogenetic, and disease studies. The methods disclosed herein enable accurate phasing for large numbers of individuals, thereby extending the breadth and depth of our ability to probe genomes at the population and deep-time levels.

In the domain of personalized medicine, the XLRP read pairs generated by the methods disclosed herein represent a meaningful advance toward accurate, low-cost, phased, and rapidly produced personal genomes. Current methods cannot phase variants over long distances, which impedes characterization of the phenotypic impact of compound heterozygous genotypes. In addition, structural variants of substantial relevance to genomic disease are difficult to identify and characterize accurately with current technologies, because they are large compared with the reads and read-pair inserts used to study them. Read pairs spanning tens of kilobases to megabases or longer can help alleviate this difficulty, thereby allowing highly parallel and personalized analyses of structural variation.

Basic evolutionary and biomedical research is being driven by technological advances in high-throughput sequencing. Whereas whole-genome sequencing and assembly used to be the province of large genome sequencing centers, commercially available sequencers are now inexpensive enough that most research universities have one or more of these machines. Generating large quantities of DNA sequence data is now relatively cheap. However, it remains difficult, in theory and in practice, to produce high-quality, highly contiguous genome sequences with current technology. Furthermore, because most organisms of interest for analysis, including humans, are diploid, each individual carries two haploid copies of the genome. At heterozygous sites (e.g., where the allele inherited from the mother differs from the allele inherited from the father), it is difficult to know which set of alleles came from which parent (known as haplotype phasing). This information can be used in a variety of evolutionary and biomedical studies, such as disease and trait association studies.

In various embodiments, the invention provides methods for genome assembly that combine DNA preparation techniques with paired-end sequencing for high-throughput discovery of short-, intermediate-, and long-range connections within a given genome. The invention further provides methods that use these connections to assist genome assembly, haplotype phasing, and/or metagenomic studies. While the methods provided herein can be used to determine the assembly of a subject's genome, it should also be understood that they can be used to determine the assembly of portions of the subject's genome, such as chromosomes, or the assembly of the subject's chromatin of varying lengths.

In some embodiments, the invention provides one or more methods disclosed herein that comprise generating a plurality of contigs from sequenced fragments of target DNA obtained from a subject. Long stretches of target DNA can be fragmented by cutting the DNA with one or more restriction enzymes, by shearing the DNA, or by a combination of the two. The resulting fragments are sequenced using high-throughput sequencing methods to obtain a plurality of sequencing reads. Examples of high-throughput sequencing methods that can be used with the methods of the invention include, but are not limited to, 454 pyrosequencing developed by Roche Diagnostics, "clusters" sequencing developed by Illumina, SOLiD and ion semiconductor sequencing developed by Life Technologies, and DNA nanoball sequencing developed by Complete Genomics. Overlapping ends of different sequencing reads can then be assembled to form contigs. Alternatively, the fragmented target DNA can be cloned into vectors. Cells or organisms are then transfected with the DNA vectors to form a library. After the transfected cells or organisms have replicated, the vectors are isolated and sequenced to generate a plurality of sequencing reads. Overlapping ends of different sequencing reads can then be assembled to form contigs.

As shown in Figure 1, genome assembly, especially with high-throughput sequencing technology, can be problematic. Assemblies often consist of thousands or tens of thousands of short contigs. The order and orientation of these contigs are generally unknown, which limits the usefulness of the genome assembly. Techniques exist to order and orient these scaffolds, but they are generally expensive and labor intensive, and they often fail to detect very long-range interactions.

Samples containing the target DNA used to generate contigs can be obtained from a subject in a number of ways, including by collecting bodily fluids (e.g., blood, urine, serum, lymph, saliva, anal and vaginal secretions, perspiration, and semen), by taking tissue, or by collecting cells or organisms. The sample obtained may consist of a single type of cell or organism or of multiple types. DNA can be extracted and prepared from the subject's sample. For example, the sample may be treated to lyse cells containing polynucleotides using known lysis buffers, sonication techniques, electroporation, and the like. The target DNA may be further purified to remove contaminants, such as proteins, by ethanol extraction, cesium gradients, and/or column chromatography.

In other embodiments of the invention, a method for extracting very high molecular weight DNA is provided. In some cases, the data from an XLRP library can be improved by increasing the fragment size of the input DNA. In some examples, extracting megabase-sized DNA fragments from a cell can produce read pairs separated by megabases in the genome. In some cases, the read pairs produced can provide sequence information over a span of greater than about 10 kB, about 50 kB, about 100 kB, about 200 kB, about 500 kB, about 1 Mb, about 2 Mb, about 5 Mb, about 10 Mb, or about 100 Mb. In some examples, the read pairs can provide sequence information over a span of greater than about 500 kB. In further examples, the read pairs can provide sequence information over a span of greater than about 2 Mb. In some cases, the very high molecular weight DNA can be extracted by very gentle cell lysis (Teague, B. et al. (2010) Proc. Natl. Acad. Sci. USA 107(24), 10848–53) and agarose plug embedding (Schwartz, D.C. and Cantor, C.R. (1984) Cell, 37(1), 67–75). In other cases, commercially available machines capable of purifying DNA molecules up to megabases in length can be used to extract very high molecular weight DNA.

In various embodiments, the invention provides one or more methods disclosed herein that comprise a step of probing the physical layout of chromosomes within living cells. Examples of techniques for probing the physical layout of chromosomes by sequencing include the "C" family of techniques, such as chromosome conformation capture ("3C"), circularized chromosome conformation capture ("4C"), carbon-copy chromosome capture ("5C"), and Hi-C based methods; and chromatin immunoprecipitation (ChIP) based methods, such as ChIP-loop and ChIP-PET (paired-end tag). These techniques use the fixation of chromatin in living cells to preserve spatial relationships within the nucleus. Subsequent processing and sequencing of the products allows a researcher to recover a matrix of proximity associations among genomic regions. With further analysis, these associations can be used to produce a three-dimensional geometric map of the chromosomes as they are physically arranged in the nuclei of living cells. Such techniques describe the discrete spatial organization of chromosomes in living cells and provide an accurate view of the functional interactions among chromosomal loci. One problem that plagues these functional studies is the presence of nonspecific interactions: associations that appear in the data solely because of chromosomal proximity. In the present invention, these nonspecific intrachromosomal interactions are captured by the methods presented herein to provide valuable information for assembly.

In some embodiments, intrachromosomal interactions correlate with chromosomal connectivity. In some cases, intrachromosomal data can aid genome assembly. In some cases, the chromatin is reconstituted in vitro. This can be advantageous because chromatin, and in particular histones, its major protein component, is important for fixation in the most common "C" family techniques for detecting chromatin conformation and structure by sequencing: 3C, 4C, 5C, and Hi-C.

Figure 2 summarizes the chromatin conformation capture technique. Briefly, crosslinks are created between genomic regions that are in close physical proximity. Crosslinking of proteins (such as histones) to DNA molecules, e.g., genomic DNA, within chromatin can be accomplished according to suitable methods described in further detail elsewhere herein or otherwise known in the art. In some cases, two or more nucleotide sequences can be crosslinked via proteins bound to one or more of the nucleotide sequences. One approach is to expose the chromatin to ultraviolet irradiation (Gilmour et al., Proc. Natl. Acad. Sci. USA 81:4275-4279, 1984). Polynucleotide segments can also be crosslinked using other approaches, such as chemical or physical (e.g., optical) crosslinking. Suitable chemical crosslinking agents include, but are not limited to, formaldehyde and psoralen (Solomon et al., Proc. Natl. Acad. Sci. USA 82:6470-6474, 1985; Solomon et al., Cell 53:937-947, 1988). For example, crosslinking can be performed by adding 2% formaldehyde to a mixture comprising the DNA molecule and chromatin proteins. Other examples of agents that can be used to crosslink DNA include, but are not limited to, UV light, mitomycin C, nitrogen mustard, melphalan, 1,3-butadiene diepoxide, cis-diamminedichloroplatinum(II), and cyclophosphamide. Suitably, the crosslinking agent forms crosslinks that bridge relatively short distances (for example, about …), thereby selecting intimate interactions that can be reversed.

In some embodiments, the DNA molecule can be immunoprecipitated before or after crosslinking. In some cases, the DNA molecule can be fragmented. Fragments can be contacted with a binding partner, such as an antibody that specifically recognizes and binds acetylated histones, e.g., H3. Examples of such antibodies include, but are not limited to, anti-acetylated histone H3, available from Upstate Biotechnology (Lake Placid, N.Y.). Polynucleotides can then be collected from the immunoprecipitate. Before the chromatin is fragmented, the acetylated histones can be crosslinked to adjacent polynucleotide sequences. The mixture is then treated to fractionate the polynucleotides in the mixture. Fractionation techniques are known in the art and include, for example, shearing techniques that generate smaller genomic fragments. Fragmentation can be accomplished using established methods for fragmenting chromatin, including, for example, sonication, shearing, and/or the use of restriction enzymes. The restriction enzyme can have a restriction site 1, 2, 3, 4, 5, or 6 bases long. Examples of restriction enzymes include, but are not limited to, AatII, Acc65I, AccI, AciI, AclI, AcuI, AfeI, AflII, AflIII, AgeI, AhdI, AleI, AluI, AlwI, AlwNI, ApaI, ApaLI, ApeKI, ApoI, AscI, AseI, AsiSI, AvaI, AvaII, AvrII, BaeGI, BaeI, BamHI, BanI, BanII, BbsI, BbvCI, BbvI, BccI, BceAI, BcgI, BciVI, BclI, BfaI, BfuAI, BfuCI, BglI, BglII, BlpI, BmgBI, BmrI, BmtI, BpmI, Bpu10I, BpuEI, BsaAI, BsaBI, BsaHI, BsaI, BsaJI, BsaWI, BsaXI, BseRI, BseYI, BsgI, BsiEI, BsiHKAI, BsiWI, BslI, BsmAI, BsmBI, BsmFI, BsmI, BsoBI, Bsp1286I, BspCNI, BspDI, BspEI, BspHI, BspMI, BspQI, BsrBI, BsrDI, BsrFI, BsrGI, BsrI, BssHII, BssKI, BssSI, BstAPI, BstBI, BstEII, BstNI, BstUI, BstXI, BstYI, BstZ17I, Bsu36I, BtgI, BtgZI, BtsCI, BtsI, Cac8I, ClaI, CspCI, CviAII, CviKI-1, CviQI, DdeI, DpnI, DpnII, DraI, DraIII, DrdI, EaeI, EagI, EarI, EciI, Eco53kI, EcoNI, EcoO109I, EcoP15I, EcoRI, EcoRV, FatI, FauI, Fnu4HI, FokI, FseI, FspI, HaeII, HaeIII, HgaI, HhaI, HincII, HindIII, HinfI, HinP1I, HpaI, HpaII, HphI, Hpy166II, Hpy188I, Hpy188III, Hpy99I, HpyAV, HpyCH4III, HpyCH4IV, HpyCH4V, KasI, KpnI, MboI, MboII, MfeI, MluI, MlyI, MmeI, MnlI, MscI, MseI, MslI, MspA1I, MspI, MwoI, NaeI, NarI, Nb.BbvCI, Nb.BsmI, Nb.BsrDI, Nb.BtsI, NciI, NcoI, NdeI, NgoMIV, NheI, NlaIII, NlaIV, NmeAIII, NotI, NruI, NsiI, NspI, Nt.AlwI, Nt.BbvCI, Nt.BsmAI, Nt.BspQI, Nt.BstNBI, Nt.CviPII, PacI, PaeR7I, PciI, PflFI, PflMI, PhoI, PleI, PmeI, PmlI, PpuMI, PshAI, PsiI, PspGI, PspOMI, PspXI, PstI, PvuI, PvuII, RsaI, RsrII, SacI, SacII, SalI, SapI, Sau3AI, Sau96I, SbfI, ScaI, ScrFI, SexAI, SfaNI, SfcI, SfiI, SfoI, SgrAI, SmaI, SmlI, SnaBI, SpeI, SphI, SspI, StuI, StyD4I, StyI, SwaI, T, TaqαI, TfiI, TliI, TseI, Tsp45I, Tsp509I, TspMI, TspRI, Tth111I, XbaI, XcmI, XhoI, XmaI, XmnI, and ZraI. The resulting fragments can vary in size. The resulting fragments may also carry a single-stranded overhang at the 5' or 3' end.
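The restriction-enzyme fragmentation described above can be illustrated with a small in silico digest. This is only a sketch: the function name `digest` and its parameters are hypothetical, it models the top strand only (so single-stranded overhangs are not represented), and it assumes the well-known recognition sites of MboI (^GATC) and HindIII (A^AGCTT).

```python
def digest(sequence, site="GATC", cut_offset=0):
    """Split `sequence` at every occurrence of the recognition `site`.

    cut_offset is the position inside the site where the top strand is cut:
    0 for MboI (^GATC), 1 for HindIII (A^AGCTT). Hypothetical helper.
    """
    sequence = sequence.upper()
    # Collect every cut coordinate, allowing overlapping site matches.
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    # Slice the sequence between consecutive cut coordinates.
    fragments = []
    previous = 0
    for cut in cuts:
        if cut > previous:
            fragments.append(sequence[previous:cut])
            previous = cut
    fragments.append(sequence[previous:])
    return fragments


mbo_fragments = digest("AAGATCTTTTGATCCC")                # MboI-like digest
hind_fragments = digest("CCAAGCTTGG", "AAGCTT", cut_offset=1)  # HindIII-like digest
```

Because the fragments are plain slices, concatenating them always reproduces the input, which is a convenient sanity check when trying other sites.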

In some embodiments, fragments of about 100 to 5000 nucleotides can be obtained using sonication techniques. Alternatively, fragments of about 100 to 1000, about 150 to 1000, about 150 to 500, about 200 to 500, or about 200 to 400 nucleotides can be obtained. The sample can be prepared for sequencing of the crosslinked, coupled sequence segments. In some cases, a single, short stretch of polynucleotide can be created, for example by ligating two sequence segments that were crosslinked intramolecularly. Sequence information can be obtained from the sample using any suitable sequencing technique described in further detail herein or otherwise known in the art, such as a high-throughput sequencing method. For example, ligation products can be subjected to paired-end sequencing to obtain sequence information from each end of a fragment. The pairing of sequence segments can be represented in the resulting sequence information, associating haplotype information over the linear distance that separates the two sequence segments along the polynucleotide.

One feature of the data generated by Hi-C is that most read pairs, when mapped back to the genome, are found in close linear proximity. That is, most read pairs are found close to one another in the genome. In the resulting data sets, the probability of intrachromosomal contacts is on average much higher than that of interchromosomal contacts, as expected if chromosomes occupy distinct territories. Moreover, although the probability of interaction decays rapidly with linear distance, even loci separated by more than 200 Mb on the same chromosome are more likely to interact than loci on different chromosomes. When detecting long-range intrachromosomal contacts, and especially interchromosomal contacts, this "background" of short- and intermediate-range intrachromosomal contacts is background noise to be factored out in Hi-C analysis.

Notably, Hi-C experiments in eukaryotes have shown that, in addition to species-specific and cell type-specific chromatin interactions, there are two canonical interaction patterns. One pattern, distance-dependent decay (DDD), is a general trend of decay in interaction frequency as a function of genomic distance. The second pattern, the cis-trans ratio (CTR), is a significantly higher interaction frequency between loci on the same chromosome than between loci on different chromosomes, even when the loci on the same chromosome are separated by tens of megabases of sequence. These patterns may reflect general polymer dynamics, in which proximal loci have a higher probability of interacting at random, as well as specific features of nuclear organization such as the formation of chromosome territories, the phenomenon in which interphase chromosomes tend to occupy distinct volumes of the nucleus with little mixing. Although the exact details of these two patterns may vary between species, cell types, and cellular conditions, they are ubiquitous and prominent. The patterns are so strong and consistent that they are used to assess experiment quality and are usually normalized out of the data in order to reveal detailed interactions. In the methods disclosed herein, however, genome assembly can take advantage of the three-dimensional structure of the genome. The very features that make the canonical Hi-C interaction patterns a hindrance to the analysis of specific looping interactions, namely their ubiquity, strength, and consistency, can be used as a powerful tool for estimating the genomic positions of contigs.
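To make the two patterns concrete, the following minimal sketch computes a cis-trans ratio and a coarse distance-dependent decay profile from mapped read pairs. The input format (tuples of chromosome and position for each read) and the function names are assumptions made for illustration; they are not taken from the source.

```python
import math
from collections import Counter


def cis_trans_ratio(pairs):
    """Ratio of same-chromosome (cis) to different-chromosome (trans) pairs."""
    cis = sum(1 for chrom_a, _, chrom_b, _ in pairs if chrom_a == chrom_b)
    trans = len(pairs) - cis
    return cis / trans if trans else float("inf")


def distance_decay(pairs):
    """Count cis pairs per order-of-magnitude bin of genomic separation.

    Bin k holds pairs separated by 10**k to 10**(k+1) - 1 bases.
    """
    bins = Counter()
    for chrom_a, pos_a, chrom_b, pos_b in pairs:
        if chrom_a == chrom_b and pos_a != pos_b:
            bins[int(math.log10(abs(pos_a - pos_b)))] += 1
    return dict(sorted(bins.items()))


# Toy data: (chromosome, position, chromosome, position) per read pair.
pairs = [
    ("chr1", 1_000, "chr1", 1_500),
    ("chr1", 2_000, "chr1", 2_900),
    ("chr1", 5_000, "chr1", 45_000),
    ("chr2", 10_000, "chr2", 2_010_000),
    ("chr1", 7_000, "chr2", 9_000),
]
ctr = cis_trans_ratio(pairs)
decay = distance_decay(pairs)
```

On real data one would expect the counts to fall as the bin index rises (DDD) and the ratio to be well above one (CTR); the toy data above only shows the bookkeeping.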

In particular embodiments, examining the physical distance between the reads of intrachromosomal read pairs reveals several features of the data that are useful for genome assembly. First, shorter-range interactions are more common than longer-range interactions (see, e.g., Figure 6). In other words, each read of a read pair is more likely to be paired with a nearby region of the actual genome than with a distant one. Second, there is a long tail of intermediate- and long-range interactions. In other words, read pairs carry information about intrachromosomal arrangement at kilobase (kB) or even megabase (Mb) distances. For example, read pairs can provide sequence information over a span of greater than about 10 kB, about 50 kB, about 100 kB, about 200 kB, about 500 kB, about 1 Mb, about 2 Mb, about 5 Mb, about 10 Mb, or about 100 Mb. These features of the data simply indicate that regions of the genome that are nearby on the same chromosome are more likely to be in close physical proximity, an expected result because they are chemically linked to one another through the DNA backbone. It can therefore be inferred that genome-wide chromatin interaction data sets, such as those generated by Hi-C, will provide long-range information about the grouping and linear organization of sequences along entire chromosomes.

Although the experimental methods for Hi-C are simple and relatively inexpensive, current protocols for genome assembly and haplotyping require 10^6 to 10^8 cells, a fairly large amount of material that may be impossible to obtain, particularly from certain human patient samples. By contrast, the methods disclosed herein include methods that allow accurate and predictive results for genotype assembly, haplotype phasing, and metagenomics with significantly less material from cells. For example, less than about 0.1 μg, about 0.2 μg, about 0.3 μg, about 0.4 μg, about 0.5 μg, about 0.6 μg, about 0.7 μg, about 0.8 μg, about 0.9 μg, about 1.0 μg, about 1.2 μg, about 1.4 μg, about 1.6 μg, about 1.8 μg, about 2.0 μg, about 2.5 μg, about 3.0 μg, about 3.5 μg, about 4.0 μg, about 4.5 μg, about 5.0 μg, about 6.0 μg, about 7.0 μg, about 8.0 μg, about 9.0 μg, about 10 μg, about 15 μg, about 20 μg, about 30 μg, about 40 μg, about 50 μg, about 60 μg, about 70 μg, about 80 μg, about 90 μg, about 100 μg, about 150 μg, about 200 μg, about 300 μg, about 400 μg, about 500 μg, about 600 μg, about 700 μg, about 800 μg, about 900 μg, or about 1000 μg of DNA can be used with the methods disclosed herein. In some examples, the DNA used in the methods disclosed herein can be extracted from fewer than about 1,000,000, about 500,000, about 100,000, about 50,000, about 10,000, about 5,000, about 1,000, about 500, or about 100 cells.

In general, methods for probing the physical layout of chromosomes, such as Hi-C based techniques, use chromatin formed within a cell or organism, such as chromatin isolated from cultured cells or primary tissue. The present invention allows such techniques to be used not only with chromatin isolated from a cell or organism but also with reconstituted chromatin. Reconstituted chromatin differs from chromatin formed within a cell or organism in several respects. First, for many samples, naked DNA can be collected using a variety of noninvasive to invasive methods, such as collecting bodily fluids, swabbing the buccal or rectal area, or taking epithelial samples. Second, reconstituting chromatin substantially prevents the formation of interchromosomal and other long-range interactions that generate artifacts in genome assembly and haplotype phasing. In some cases, according to the methods and compositions of the invention, a sample may have less than about 20, 15, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0.5, 0.4, 0.3, 0.2, or 0.1% interchromosomal or intermolecular crosslinking, or less. In some examples, the sample may have less than about 5% interchromosomal or intermolecular crosslinking. In some examples, the sample may have less than about 3% interchromosomal or intermolecular crosslinking. In further examples, the sample may have less than about 1% interchromosomal or intermolecular crosslinking. Third, the frequency of sites capable of crosslinking, and thus the frequency of intramolecular crosslinks within a polynucleotide, can be adjusted. For example, the ratio of DNA to histones can be varied so that the nucleosome density can be set to a desired value. In some cases, the nucleosome density is reduced below the physiological level. Accordingly, the distribution of crosslinks can be altered to favor longer-range interactions. In some embodiments, subsamples with different crosslinking densities can be prepared to cover both short- and long-range associations. For example, the crosslinking conditions can be adjusted such that at least about 1%, about 2%, about 3%, about 4%, about 5%, about 6%, about 7%, about 8%, about 9%, about 10%, about 11%, about 12%, about 13%, about 14%, about 15%, about 16%, about 17%, about 18%, about 19%, about 20%, about 25%, about 30%, about 40%, about 45%, about 50%, about 60%, about 70%, about 80%, about 90%, about 95%, or about 100% of the crosslinks occur between DNA segments that are at least about 50 kb, about 60 kb, about 70 kb, about 80 kb, about 90 kb, about 100 kb, about 110 kb, about 120 kb, about 130 kb, about 140 kb, about 150 kb, about 160 kb, about 180 kb, about 200 kb, about 250 kb, about 300 kb, about 350 kb, about 400 kb, about 450 kb, or about 500 kb apart on the sample DNA molecule.

In various embodiments, the invention provides methods for mapping a plurality of read pairs to a plurality of contigs. Several publicly available computer programs map reads to contig sequences. These read-mapping programs also report how unique a particular read mapping is within the genome. From the population of reads that map uniquely, and with high confidence, within a single contig, the distribution of distances between the two reads of each read pair can be inferred. These are the data shown in Figure 6. For read pairs whose reads map confidently to different contigs, the mapping data imply a connection between the two contigs in question. They also imply a distance between the two contigs that is proportional to the distribution of distances learned from the analysis described above. Thus, each read pair whose reads map to different contigs implies a connection between those two contigs in a correct assembly. The connections inferred from all such mapped read pairs can be summarized in an adjacency matrix in which each contig is represented by both a row and a column. Read pairs that connect contigs are recorded as non-zero values in the row and column corresponding to the contigs to which the reads of the pair were mapped. Most read pairs will map within a single contig; from these, the distribution of distances between the reads of a pair can be learned, and the read pairs that map to different contigs are used to construct the adjacency matrix of contigs.
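The bookkeeping described above can be sketched as follows. The input format (each read given as a contig name and a position, as a read mapper might report after filtering for unique, high-confidence hits) and the name `summarize_pairs` are assumptions for illustration only.

```python
def summarize_pairs(mapped_pairs, contigs):
    """Split read pairs into within-contig distances and between-contig links.

    mapped_pairs: iterable of ((contig, position), (contig, position)).
    contigs: ordered list of contig names; defines the matrix rows and columns.
    Returns (within_distances, adjacency), where adjacency[i][j] counts the
    read pairs joining contig i and contig j.
    """
    index = {name: i for i, name in enumerate(contigs)}
    size = len(contigs)
    adjacency = [[0] * size for _ in range(size)]
    within_distances = []
    for (contig_a, pos_a), (contig_b, pos_b) in mapped_pairs:
        if contig_a == contig_b:
            # Same contig: the separation feeds the distance distribution.
            within_distances.append(abs(pos_a - pos_b))
        else:
            # Different contigs: evidence that the two contigs are linked.
            i, j = index[contig_a], index[contig_b]
            adjacency[i][j] += 1
            adjacency[j][i] += 1
    return within_distances, adjacency


contigs = ["ctg1", "ctg2", "ctg3"]
mapped_pairs = [
    (("ctg1", 100), ("ctg1", 900)),
    (("ctg1", 4_800), ("ctg2", 150)),
    (("ctg2", 300), ("ctg1", 4_950)),
    (("ctg2", 9_700), ("ctg3", 40)),
    (("ctg3", 2_000), ("ctg3", 2_350)),
]
within_distances, adjacency = summarize_pairs(mapped_pairs, contigs)
```

The matrix is kept symmetric because a read pair carries no inherent direction between the two contigs it joins.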

In various embodiments, the invention provides methods comprising constructing an adjacency matrix of contigs using the read-mapping data from the read-pair data. In some embodiments, the adjacency matrix uses a weighting scheme for read pairs that reflects the tendency of short-range interactions to outnumber long-range interactions (see, e.g., FIG. 3). Read pairs spanning shorter distances are generally more common than read pairs spanning longer distances. A function describing the probability of a particular distance can be fitted to the read pairs that map within a single contig in order to learn this distribution. One important feature of a read pair that maps to different contigs is therefore the position at which its reads map on those contigs. For a read pair whose reads both map near an end of their respective contigs, the inferred distance between the contigs can be short, and so the distance between the joined reads is small. Because shorter distances between the reads of a pair are more common than longer ones, this configuration provides stronger evidence that the two contigs are adjacent than reads mapping far from the contig edges would. The connections in the adjacency matrix are therefore further weighted by the distance of the reads from the edge of the contig. In further embodiments, the adjacency matrix can be further rescaled to down-weight the large number of contacts on some contigs that represent promiscuous regions of the genome. These regions can be identified by the high proportion of reads mapping to them, and they are a priori more likely to contain spurious read mappings that could misinform assembly. In further embodiments, this rescaling can be guided by searching for one or more conserved binding sites for one or more agents that regulate chromatin scaffold interactions, such as the transcriptional repressor CTCF, endocrine receptors, cohesin, or covalently modified histones.
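The edge-distance weighting and the rescaling of promiscuous contigs can be illustrated with the sketch below. It rests on assumptions that are not in the source: the probability of a separation `d` is modelled as a power law `d ** -alpha`, the implied separation is approximated as the sum of each read's distance to its contig end (ignoring any gap), and promiscuous contigs are down-weighted by dividing each entry by the geometric mean of the two contigs' total contacts. The names `pair_weight` and `downweight_promiscuous` are hypothetical.

```python
def pair_weight(dist_a, dist_b, alpha=1.0, min_dist=1.0):
    """Weight for one inter-contig read pair.

    dist_a, dist_b: distance (bp) of each read from the relevant end of its
    contig. The implied separation is approximated as their sum (assumption),
    and the weight follows an assumed power law with exponent alpha.
    """
    implied = max(dist_a + dist_b, min_dist)
    return implied ** (-alpha)


def downweight_promiscuous(matrix):
    """Rescale contacts so contigs with very many contacts count for less.

    Each entry is divided by the geometric mean of the row totals of the two
    contigs involved. This normalization is an illustrative choice only.
    """
    totals = [sum(row) for row in matrix]
    n = len(matrix)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if matrix[i][j] and totals[i] and totals[j]:
                out[i][j] = matrix[i][j] / (totals[i] * totals[j]) ** 0.5
    return out


near_edges = pair_weight(100, 100)        # both reads close to contig ends
far_from_edges = pair_weight(10_000, 10_000)
raw = [[0, 10, 10], [10, 0, 0], [10, 0, 0]]
rescaled = downweight_promiscuous(raw)
```

A pair whose reads sit 100 bp from their contig ends receives a larger weight than one whose reads sit 10 kbp from the ends, which matches the reasoning in the paragraph above.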

In some embodiments, the invention provides one or more of the methods disclosed herein, comprising the step of analyzing the adjacency matrix to determine a path through the contigs that represents their order and/or orientation in the genome. In other embodiments, the path through the contigs can be chosen so that each contig is visited exactly once. In further embodiments, the path through the contigs is chosen so that the path through the adjacency matrix maximizes the sum of the edge weights visited. In this way, the most probable contig connections are proposed for the correct assembly. In further embodiments, the path through the contigs is chosen so that each contig is visited exactly once and the edge weights of the adjacency matrix are maximized.
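A path that visits each contig exactly once and maximizes the summed edge weights can be found by exhaustive search when the number of contigs is small. The sketch below is a hedged illustration only: the source does not specify a search algorithm, exhaustive enumeration is factorial in the number of contigs, contig orientation is ignored, and the function name `best_contig_order` is hypothetical.

```python
from itertools import permutations


def best_contig_order(matrix):
    """Return (order, score) for the highest-weight path through all contigs.

    matrix: symmetric adjacency matrix of edge weights.
    Exhaustive search; practical only for a handful of contigs.
    """
    n = len(matrix)
    best_order, best_score = None, float("-inf")
    for order in permutations(range(n)):
        # A path and its reverse score the same, so evaluate only one of them.
        if n > 1 and order[0] > order[-1]:
            continue
        score = sum(matrix[order[k]][order[k + 1]] for k in range(n - 1))
        if score > best_score:
            best_order, best_score = order, score
    return list(best_order), best_score


weights = [
    [0, 1, 9, 0],
    [1, 0, 8, 7],
    [9, 8, 0, 1],
    [0, 7, 1, 0],
]
order, score = best_contig_order(weights)
```

For larger assemblies a heuristic would be needed in place of full enumeration; which heuristic is appropriate is left open here.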

In a diploid genome, it is often important to know which allelic variants are linked on the same chromosome. This is known as haplotype phasing. Short reads from high-throughput sequence data rarely allow direct observation of which allelic variants are linked. Computational inference of haplotype phase is unreliable over long distances. The invention provides one or more methods that use the allelic variants observed on read pairs to determine which allelic variants are linked.
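The idea of using alleles observed on the two reads of a pair to decide which variants are linked can be sketched as a simple vote-and-propagate procedure. This is an illustrative assumption, not the method claimed in the source: sites are biallelic with alleles coded 0 and 1, each observation is a hypothetical tuple `(site_i, allele_i, site_j, allele_j)` taken from one read pair, and conflicting evidence around cycles is not resolved.

```python
from collections import defaultdict, deque


def phase_sites(observations):
    """Assign a phase bit to each heterozygous site.

    phase[site] == 0 means allele 0 at that site lies on haplotype A;
    phase[site] == 1 means allele 1 lies on haplotype A.
    """
    # Net support per site pair: positive if the same allele code was seen on
    # both reads (same phase), negative if different codes were seen.
    votes = defaultdict(int)
    for si, ai, sj, aj in observations:
        if si == sj:
            continue
        key = (si, sj) if si < sj else (sj, si)
        votes[key] += 1 if ai == aj else -1

    graph = defaultdict(list)
    for (i, j), v in votes.items():
        if v != 0:  # ties carry no information
            flip = 0 if v > 0 else 1
            graph[i].append((j, flip))
            graph[j].append((i, flip))

    # Propagate phase through each connected block of sites.
    phase = {}
    for start in sorted(graph):
        if start in phase:
            continue
        phase[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, flip in graph[u]:
                if v not in phase:
                    phase[v] = phase[u] ^ flip
                    queue.append(v)
    return phase


observed = [
    (100, 0, 5000, 0),
    (100, 1, 5000, 1),
    (100, 0, 5000, 1),   # one discordant observation, outvoted
    (5000, 0, 90000, 1),
    (5000, 1, 90000, 0),
]
phases = phase_sites(observed)
```

Sites connected through read pairs end up in one phase block; sites never observed together on a read pair remain unphased relative to each other.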

In various embodiments, the methods and compositions of the invention enable haplotype phasing of diploid or polyploid genomes with respect to a plurality of allelic variants. The methods described herein can thus determine which allelic variants are linked, based on variant information from the read pairs and/or from contigs assembled using the read pairs. Examples of allelic variants include, but are not limited to, those known from 1000genomes, UK10K, HapMap, and other projects for discovering genetic variation among humans. With validated haplotype phasing data, the association of a disease with a particular gene can be revealed more easily, for example by finding unlinked, inactivating mutations in both copies of SH3TC2, leading to Charcot-Marie-Tooth neuropathy (Lupski JR, Reid JG, Gonzaga-Jauregui C et al., N. Engl. J. Med. 362:1181–91, 2010), and unlinked, inactivating mutations in both copies of ABCG5, leading to hypercholesterolemia 9 (Rios J, Stein E, Shendure J et al., Hum. Mol. Genet. 19:4313–18, 2010).

On average, humans are heterozygous at 1 site in 1,000. In some cases, a single lane of data from a high-throughput sequencing method can generate at least about 150,000,000 read pairs. Each read can be about 100 base pairs long. From these parameters, about one-tenth of all reads from a human sample are estimated to cover a heterozygous site. On average, therefore, about one-hundredth of all read pairs from a human sample are estimated to cover a pair of heterozygous sites. Accordingly, about 1,500,000 read pairs (one-hundredth of 150,000,000) provide phasing data using a single lane. With approximately 3 billion bases in the human genome, and one in one thousand heterozygous, an average human genome has approximately 3 million heterozygous sites. With about 1,500,000 read pairs each representing a pair of heterozygous sites, the average coverage of each heterozygous site to be phased using a single lane on a typical high-throughput sequencing machine is about 1X. A diploid human genome can therefore be phased reliably and completely with one lane of high-throughput sequencing data relating sequence variants from a sample prepared using the methods disclosed herein. In some embodiments, a lane of data can be a set of DNA sequence read data. In further embodiments, a lane of data can be a set of DNA sequence read data from a single run of a high-throughput sequencing instrument.
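The estimate above can be reproduced with exact arithmetic. The sketch below assumes, as one reading of the paragraph, that the chance a read covers a heterozygous site is approximated by read length divided by the spacing between heterozygous sites, and that each informative read pair contributes one observation to each of its two sites. The function name `phasing_coverage` and its parameter names are hypothetical.

```python
from fractions import Fraction


def phasing_coverage(read_pairs=150_000_000, read_length=100,
                     het_spacing=1_000, genome_size=3_000_000_000):
    """Reproduce the single-lane phasing estimate with exact fractions."""
    # Approximate chance that one read covers a heterozygous site: 100/1000.
    p_read = Fraction(read_length, het_spacing)
    # Both reads of a pair must cover a heterozygous site.
    p_pair = p_read ** 2
    informative_pairs = read_pairs * p_pair
    het_sites = Fraction(genome_size, het_spacing)
    # Assumption: each informative pair observes two heterozygous sites.
    coverage = 2 * informative_pairs / het_sites
    return {
        "p_read": p_read,
        "p_pair": p_pair,
        "informative_pairs": informative_pairs,
        "het_sites": het_sites,
        "coverage": coverage,
    }


estimate = phasing_coverage()
```

With the default values taken from the paragraph, the informative pairs come to 1,500,000, the heterozygous sites to 3,000,000, and the per-site coverage to 1, in line with the "about 1X" figure.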

Because the human genome consists of two homologous sets of chromosomes, understanding the true genetic makeup of an individual requires delineation of the maternal and paternal copies, or haplotypes, of the genetic material. Obtaining a haplotype in an individual is useful in several ways. First, haplotypes are clinically useful for predicting the outcome of donor-recipient matching in organ transplantation, and they are increasingly used as a means of detecting disease associations. Second, in genes that show compound heterozygosity, haplotypes indicate whether two deleterious variants are located on the same allele, which greatly affects the prediction of whether inheritance of these variants is harmful. Third, haplotypes from groups of individuals have provided information on population structure and on the evolutionary history of humans. Lastly, the recently described widespread allelic imbalance in gene expression suggests that genetic or epigenetic differences between alleles may contribute to quantitative differences in expression. An understanding of haplotype structure will delineate the mechanisms by which variants contribute to allelic imbalance.

In certain embodiments, the methods disclosed herein comprise an in vitro technique for fixing and capturing associations among distant regions of a genome, as needed for long-range linkage and phasing. In some cases, the method comprises constructing and sequencing an XLRP library to deliver read pairs that are very far apart in the genome. In some cases, the interactions initially arise from random associations within a single DNA fragment. In some embodiments, the genomic distance between segments can be inferred, because segments that are near each other in a DNA molecule interact more often and with higher probability, while interactions between distant portions of the molecule are less frequent. There is consequently a systematic relationship between the number of pairs connecting two loci and their proximity on the input DNA. As shown in Figure 2, the invention can produce read pairs capable of spanning the largest DNA fragments in an extraction. The input DNA for this library had a maximum length of 150 kbp, which is the longest meaningful read-pair span we observed in the sequencing data. This indicates that the invention can link loci that are even farther apart in the genome if larger input DNA fragments are provided. Complete genome assembly can be achieved by applying improved assembly software tools that are specifically adapted to the type of data produced by the present method.
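The systematic relationship between the number of connecting pairs and separation can be summarized by fitting a curve to binned counts. The sketch below assumes a power-law form, count = scale × distance^slope, fitted by least squares in log-log space; the source states only that such a relationship exists, so the functional form, the function name `fit_power_law`, and the toy data are all assumptions.

```python
import math


def fit_power_law(distances, counts):
    """Least-squares fit of log(count) = log(scale) + slope * log(distance).

    distances, counts: equal-length sequences of positive numbers, for
    example bin centres (bp) and the number of read pairs in each bin.
    Returns (slope, scale).
    """
    xs = [math.log(d) for d in distances]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    scale = math.exp(mean_y - slope * mean_x)
    return slope, scale


# Synthetic counts that fall off exactly as 1/distance (not real data).
bin_centres = [1_000, 10_000, 100_000]
bin_counts = [1_000_000 / d for d in bin_centres]
slope, scale = fit_power_law(bin_centres, bin_counts)
```

A fitted curve of this kind could then be inverted to turn an observed number of connecting pairs into a rough distance estimate, under the same assumptions.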

Very high phasing accuracy can be achieved using data produced by the methods and compositions of the invention. Compared with previous methods, the methods described herein can phase a higher proportion of the variants. Phasing can be achieved while maintaining a high level of accuracy. The phase information can be extended to longer ranges, for example greater than about 200 kbp, about 300 kbp, about 400 kbp, about 500 kbp, about 600 kbp, about 700 kbp, about 800 kbp, about 900 kbp, about 1 Mbp, about 2 Mbp, about 3 Mbp, about 4 Mbp, about 5 Mbp, or about 10 Mbp. In some embodiments, more than 90% of the heterozygous SNPs for a human sample can be phased at an accuracy greater than 99% using fewer than about 250 million reads or read pairs, e.g., by using only 1 lane of Illumina HiSeq data. In other cases, more than about 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the heterozygous SNPs for a human sample can be phased at an accuracy greater than about 70%, 80%, 90%, 95%, or 99% using fewer than about 250 million or 500 million reads or read pairs, e.g., by using only 1 or 2 lanes of Illumina HiSeq data. For example, more than 95% or 99% of the heterozygous SNPs for a human sample can be phased at an accuracy greater than 95% or 99% using fewer than about 250 million or 500 million reads or read pairs. In further cases, additional variants can be captured by increasing the read length to about 200 bp, 250 bp, 300 bp, 350 bp, 400 bp, 450 bp, 500 bp, 600 bp, 800 bp, 1000 bp, 1500 bp, 2 kbp, 3 kbp, 4 kbp, 5 kbp, 10 kbp, 20 kbp, 50 kbp, or 100 kbp.

In other embodiments of the invention, the data from an XLRP library can be used to confirm the phasing capability of the long-range read pairs. As shown in Figure 6, the accuracy of those results is on par with the best technologies currently available, while extending to significantly longer distances. The current sample preparation protocol for a particular sequencing method recognizes, for phasing, variants located within a read length, e.g., 150 bp, of a targeted restriction site. In one example, from an XLRP library built for NA12878, a benchmark sample for assembly, 44% of the 1,703,909 heterozygous SNPs present were phased with an accuracy greater than 99%. In some cases, this proportion can be expanded to nearly all variable sites with a judicious choice of restriction enzyme or with a combination of different enzymes.

In some embodiments, the compositions and methods described herein allow the investigation of metagenomes, for example those found in the human gut. Accordingly, the partial or whole genomic sequences of some or all of the organisms that inhabit a given ecological environment can be investigated. Examples include random sequencing of all gut microbes, of the microbes found on certain areas of skin, and of the microbes that live in toxic waste sites. The composition of the microbial populations in these environments can be determined using the compositions and methods described herein, as can the interrelated biochemistry encoded by their respective genomes. The methods described herein enable metagenomic studies of complex biological environments, for example those that comprise more than 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 250, 300, 400, 500, 600, 700, 800, 900, 1000, 5000, 10000 or more organisms and/or variants of organisms.

The high accuracy required for cancer genome sequencing can be achieved using the methods and systems described herein. Inaccurate reference genomes can make base calling difficult when sequencing cancer genomes. Heterogeneous samples and small amounts of starting material, for example samples obtained by biopsy, introduce additional challenges. Further, the detection of large structural variants and/or loss of heterozygosity is often crucial for cancer genome sequencing, as is the ability to distinguish somatic variants from base-calling errors.

The systems and methods described herein can generate accurate long sequences from complex samples containing 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20 or more different genomes. Mixed samples of normal, benign, and/or tumor origin can be analyzed, optionally without the need for a normal control. In some embodiments, starting samples of as little as 100 ng, or even as few as hundreds of genome equivalents, are used to generate accurate long sequences. The systems and methods described herein can detect large structural variants and rearrangements. Phased variant calls can be obtained over long sequences spanning about 1 kbp, about 2 kbp, about 5 kbp, about 10 kbp, about 20 kbp, about 50 kbp, about 100 kbp, about 200 kbp, about 500 kbp, about 1 Mbp, about 2 Mbp, about 5 Mbp, about 10 Mbp, about 20 Mbp, about 50 Mbp, or about 100 Mbp or more nucleotides. For example, phased variant calls can be obtained over long sequences spanning about 1 Mbp or about 2 Mbp.

Haplotypes determined using the methods and systems described herein can be assigned to computational resources, for example computational resources on a network, such as a cloud system. Short variant calls can be corrected, if necessary, using relevant information stored in the computational resources. Structural variants can be detected based on the combined information from short variant calls and the information stored in the computational resources. To improve accuracy, problematic parts of the genome can be reassembled, such as segmental duplications, regions prone to structural variation, the highly variable and medically relevant MHC region, centromeric and telomeric regions, and other heterochromatic regions, including but not limited to those with repeat regions, low sequence accuracy, high variant rates, Alu repeats, segmental duplications, or any other relevant problematic parts known in the art.

A sample type can be assigned to the sequence information either locally or in a networked computational resource, such as a cloud. Where the source of the information is known, for example when it is from a cancer or from normal tissue, that source can be assigned to the sample as part of the sample type. Other examples of sample type generally include, but are not limited to, tissue type, sample collection method, presence of infection, type of infection, processing method, and size of the sample. Where a complete or partial comparison genome sequence is available, such as a normal genome to be compared with a cancer genome, the differences between the sample data and the comparison genome sequence can be determined and optionally output.

The methods of the invention can be used to analyze the genetic information of selective genomic regions of interest, as well as of genomic regions that may interact with the selective region of interest. The amplification methods disclosed herein can be used in the devices, kits, and methods known in the art for genetic analysis, such as, but not limited to, those found in U.S. Pat. Nos. 6,449,562, 6,287,766, 7,361,468, 7,414,117, 6,225,109, and 6,110,709. In some cases, the amplification methods of the invention can be used to amplify target nucleic acids for DNA hybridization studies to determine the presence or absence of polymorphisms. The polymorphisms, or alleles, can be associated with diseases or conditions such as genetic disease. In other cases, the polymorphisms can be associated with susceptibility to diseases or conditions, for example polymorphisms associated with addiction, degenerative and age-related conditions, cancer, and the like. In still other cases, the polymorphisms can be associated with beneficial traits, such as improved coronary health, resistance to diseases such as HIV or malaria, or resistance to degenerative diseases such as osteoporosis, Alzheimer's disease, or dementia.

The compositions and methods of the invention can be used for diagnostic, prognostic, therapeutic, patient stratification, drug development, treatment selection, and screening purposes. An advantage of the invention is that many different target molecules can be analyzed at one time from a single biomolecular sample using the methods of the invention. This allows, for example, several diagnostic tests to be performed on one sample.

The compositions and methods of the invention can be used in genomics. The methods described herein can rapidly provide answers suited to this application. The methods and compositions described herein can be used in the process of finding biomarkers that may be used for diagnosis or prognosis and as indicators of health and disease. The methods and compositions described herein can be used to screen for drugs, e.g., in drug development, selection of treatment, determination of treatment efficacy, and/or identification of targets for pharmaceutical development. The ability to test gene expression in screening assays involving a drug is very important because proteins are the final gene product in the body. In some embodiments, the methods and compositions described herein measure both protein and gene expression simultaneously, which provides the most information regarding the particular screening being performed.

The compositions and methods of the invention can be used in gene expression analysis. The methods described herein discriminate between nucleotide sequences. The difference between target nucleotide sequences can be, for example, a single nucleic acid base difference, a nucleic acid deletion, a nucleic acid insertion, or a rearrangement. Sequence differences involving more than one base can also be detected. The methods of the invention are able to detect infectious diseases, genetic diseases, and cancer. They are also useful in environmental monitoring, forensics, and food science. Examples of genetic analyses that can be performed on nucleic acids include SNP detection, STR detection, RNA expression analysis, promoter methylation, gene expression, virus detection, viral subtyping, and drug resistance.

The present methods can be applied to the analysis of biomolecular samples obtained or derived from a patient in order to determine whether a diseased cell type is present in the sample, the stage of the disease, the prognosis for the patient, the ability of the patient to respond to a particular treatment, or the best treatment for the patient. The present methods can also be applied to identify biomarkers for a particular disease.

In some embodiments, the methods described herein are used in the diagnosis of a condition. As used herein, the term "diagnose" or "diagnosis" of a condition may include predicting or diagnosing the condition, determining predisposition to the condition, monitoring treatment of the condition, diagnosing a therapeutic response of the disease, or prognosis of the condition, of its progression, or of the response to a particular treatment of the condition. For example, a blood sample can be assayed according to any of the methods described herein to determine the presence and/or quantity of markers of a disease or malignant cell type in the sample, thereby diagnosing or staging the disease or cancer.

In some embodiments, the methods and compositions described herein are used for the diagnosis and prognosis of a condition.

The methods described herein are particularly applicable to numerous immunologic, proliferative, and malignant diseases and disorders. Immunologic diseases and disorders include allergic diseases and disorders, disorders of immune function, and autoimmune diseases and conditions. Allergic diseases and disorders include, but are not limited to, allergic rhinitis, allergic conjunctivitis, allergic asthma, atopic eczema, atopic dermatitis, and food allergy. Immunodeficiencies include, but are not limited to, severe combined immunodeficiency (SCID), hypereosinophilic syndrome, chronic granulomatous disease, leukocyte adhesion deficiency types I and II, hyper-IgE syndrome, Chediak-Higashi syndrome, neutrophilia, neutropenia, aplasia, agammaglobulinemia, hyper-IgM syndrome, DiGeorge/velocardiofacial syndrome, and interferon gamma-TH1 pathway defects. Autoimmune and immune dysregulation disorders include, but are not limited to, rheumatoid arthritis, diabetes, systemic lupus erythematosus, Graves' disease, Graves' ophthalmopathy, Crohn's disease, multiple sclerosis, psoriasis, systemic sclerosis, goiter and lymphomatous goiter (Hashimoto's thyroiditis, lymphocytic goiter), alopecia areata, autoimmune myocarditis, lichen sclerosus, autoimmune uveitis, Addison's disease, atrophic gastritis, myasthenia gravis, idiopathic thrombocytopenic purpura, hemolytic anemia, primary biliary cirrhosis, Wegener's granulomatosis, polyarteritis nodosa, and inflammatory bowel disease, as well as allograft rejection and tissue destruction from allergic reactions to infectious microorganisms or to environmental antigens.

Proliferative diseases and disorders that can be evaluated by the methods of the invention include, but are not limited to, hemangiomatosis in newborns; secondary progressive multiple sclerosis; chronic progressive myelodegenerative disease; neurofibromas; ganglioneuromatosis; keloid formation; Paget's disease of bone; fibrocystic disease (e.g., of the breast or uterus); sarcoidosis; Peyronie's and Dupuytren's fibrosis; cirrhosis; atherosclerosis; and vascular restenosis.

Malignant diseases and disorders that can be evaluated by the methods of the invention include both hematologic malignancies and solid tumors.

Hematologic malignancies are especially amenable to the methods of the invention when the sample is a blood sample, because such malignancies involve changes in blood-borne cells. Such malignancies include non-Hodgkin's lymphoma, Hodgkin's lymphoma, non-B cell lymphomas and other lymphomas, acute or chronic leukemias, polycythemias, thrombocythemias, multiple myeloma, myelodysplastic disorders, myeloproliferative disorders, myelofibroses, atypical immune lymphoproliferations, and plasma cell disorders.

Plasma cell disorders that can be evaluated by the methods of the invention include multiple myeloma, amyloidosis, and primary macroglobulinemia.

Examples of solid tumors include, but are not limited to, rectal cancer, breast cancer, lung cancer, prostate cancer, brain tumors, central nervous system tumors, bladder tumors, melanomas, liver cancer, osteosarcoma and other bone cancers, testicular and ovarian carcinomas, head and neck tumors, and cervical neoplasms.

The methods of the invention can also detect genetic diseases. This can be carried out by prenatal or postnatal screening for chromosomal and genetic aberrations or for genetic diseases. Examples of detectable genetic diseases include 21-hydroxylase deficiency, cystic fibrosis, fragile X syndrome, Turner syndrome, Duchenne muscular dystrophy, Down syndrome or other trisomies, heart disease, single-gene diseases, human leukocyte antigen (HLA) typing, phenylketonuria, sickle cell anemia, Tay-Sachs disease, thalassemia, Klinefelter syndrome, Huntington's disease, autoimmune diseases, lipidosis, obesity defects, hemophilia, inborn errors of metabolism, and diabetes.

The methods described herein can be used to diagnose pathogen infections, for example infections by intracellular bacteria and viruses, by determining the presence and/or quantity of markers of the bacterium or virus, respectively, in the sample.

The methods of the invention can detect a wide variety of infectious diseases. These can be caused by bacterial, viral, parasitic, and fungal infectious agents. The resistance of various infectious agents to drugs can also be determined using the invention.

Bacterial infectious agents that can be detected by the invention include Escherichia coli, Salmonella, Shigella, Klebsiella, Pseudomonas, Listeria monocytogenes, Mycobacterium tuberculosis, Mycobacterium avium-intracellulare, Yersinia, Francisella, Pasteurella, Brucella, Clostridia, Bordetella pertussis, Bacteroides, Staphylococcus aureus, Streptococcus pneumoniae, B-hemolytic strep., Corynebacteria, Legionella, Mycoplasma, Ureaplasma, Chlamydia, Neisseria gonorrhoeae, Neisseria meningitidis, Haemophilus influenzae, Enterococcus faecalis, Proteus vulgaris, Proteus mirabilis, Helicobacter pylori, Treponema pallidum, Borrelia burgdorferi, Borrelia recurrentis, Rickettsial pathogens, Nocardia, and Actinomycetes.

Fungal infectious agents that can be detected by the invention include Cryptococcus neoformans, Blastomyces dermatitidis, Histoplasma capsulatum, Coccidioides immitis, Paracoccidioides brasiliensis, Candida albicans, Aspergillus fumigatus, Phycomycetes (Rhizopus), Sporothrix schenckii, Chromomycosis, and Maduromycosis.

Viral infectious agents that can be detected by the invention include human immunodeficiency virus, human T-cell lymphotropic virus, hepatitis viruses (e.g., hepatitis B virus and hepatitis C virus), Epstein-Barr virus, cytomegalovirus, human papillomaviruses, orthomyxoviruses, paramyxoviruses, adenoviruses, coronaviruses, rhabdoviruses, polioviruses, togaviruses, bunyaviruses, arenaviruses, rubella viruses, and reoviruses.

Parasitic agents detectable by the present invention include Plasmodium falciparum, Plasmodium malariae, Plasmodium vivax, Plasmodium ovale, Onchocerca volvulus, Leishmania, Trypanosoma spp., Schistosoma spp., Entamoeba histolytica, Cryptosporidium, Giardia spp., Trichomonas spp., Balantidium coli, Wuchereria bancrofti, Toxoplasma spp., Enterobius vermicularis, Ascaris lumbricoides, Trichuris trichiura, Dracunculus medinensis, trematodes, Diphyllobothrium latum, Taenia spp., Pneumocystis carinii, and Necator americanus.

The present invention can also be used to detect drug resistance in infectious agents. For example, vancomycin-resistant Enterococcus faecium, methicillin-resistant Staphylococcus aureus, penicillin-resistant Streptococcus pneumoniae, multidrug-resistant Mycobacterium tuberculosis, and AZT-resistant (azidothymidine-resistant) human immunodeficiency virus can all be identified with the present invention.

Accordingly, the target molecules detected using the compositions and methods of the invention can be either patient markers (e.g., cancer markers) or markers of infection with a foreign agent, such as bacterial or viral markers.

The compositions and methods of the invention can be used to identify and/or quantify a target molecule whose abundance is indicative of a biological state or disease condition, for example, blood markers that are upregulated or downregulated as a result of a disease state.

In some embodiments, the methods and compositions of the present invention can be used to assess cytokine expression. The low detection threshold (high sensitivity) of the methods described herein would facilitate early detection of cytokines (e.g., cytokines that serve as biomarkers of a condition), the diagnosis or prognosis of a disease such as cancer, and the identification of subclinical conditions.

The different samples from which the target polynucleotides are derived can comprise multiple samples from the same individual, samples from different individuals, or combinations thereof. In some embodiments, a sample comprises a plurality of polynucleotides from a single individual. In some embodiments, a sample comprises a plurality of polynucleotides from two or more individuals. An individual is any organism, or portion thereof, from which target polynucleotides can be derived, non-limiting examples of which include plants, animals, fungi, protists, monera, viruses, mitochondria, and chloroplasts. Sample polynucleotides can be isolated from a subject, for example from a cell sample, tissue sample, or organ sample derived from the subject, including, for example, cultured cell lines, biopsies, blood samples, or cell-containing blood samples. The subject can be an animal, including but not limited to cows, pigs, mice, rats, chickens, cats, and dogs, and is usually a mammal, such as a human. Samples can also be obtained artificially, for example by chemical synthesis. In some embodiments, the sample comprises DNA. In some embodiments, the sample comprises genomic DNA. In some embodiments, the sample comprises mitochondrial DNA, chloroplast DNA, plasmid DNA, bacterial artificial chromosomes, yeast artificial chromosomes, oligonucleotide tags, or combinations thereof. In some embodiments, the sample comprises DNA generated by primer extension reactions using any suitable combination of primers and a DNA polymerase, including but not limited to polymerase chain reaction (PCR), reverse transcription, and combinations thereof. Where the template for the primer extension reaction is RNA, the product of reverse transcription is referred to as complementary DNA (cDNA). Primers useful in primer extension reactions can comprise sequences specific to one or more targets, random sequences, partially random sequences, and combinations thereof. Reaction conditions suitable for primer extension reactions are known in the art. In general, sample polynucleotides comprise any polynucleotide present in a sample, which may or may not include target polynucleotides.

In some embodiments, nucleic acid template molecules (e.g., DNA or RNA) are isolated from a biological sample containing a variety of other components, such as proteins, lipids, and non-template nucleic acids. Nucleic acid template molecules can be obtained from any cellular material, whether from an animal, plant, bacterium, fungus, or any other cellular organism. Biological samples for use in the present invention include viral particles or preparations. Nucleic acid template molecules can be obtained directly from an organism or from a biological sample obtained from an organism, e.g., from blood, urine, cerebrospinal fluid, seminal fluid, saliva, sputum, stool, and tissue. Any tissue or body fluid specimen can be used as a source of nucleic acid for use in the invention. Nucleic acid template molecules can also be isolated from cultured cells, such as a primary cell culture or a cell line. The cells or tissues from which template nucleic acids are obtained can be infected with a virus or other intracellular pathogen. A sample can also be total RNA extracted from a biological specimen, a cDNA library, or viral or genomic DNA. A sample can also be isolated DNA from a non-cellular origin, e.g., amplified or isolated DNA from the freezer.

Methods for the extraction and purification of nucleic acids are known in the art. For example, nucleic acids can be purified by organic extraction with phenol, phenol/chloroform/isoamyl alcohol, or similar formulations, including TRIzol and TriReagent. Other non-limiting examples of extraction techniques include: (1) organic extraction followed by ethanol precipitation, e.g., using a phenol/chloroform organic reagent (Ausubel et al., 1993), with or without the use of an automated nucleic acid extractor, e.g., the Model 341 DNA Extractor available from Applied Biosystems (Foster City, Calif.); (2) stationary phase adsorption methods (U.S. Pat. No. 5,234,809; Walsh et al., 1991); and (3) salt-induced nucleic acid precipitation methods (Miller et al., 1988), such precipitation methods being typically referred to as "salting-out" methods. Another example of nucleic acid isolation and/or purification uses magnetic particles to which nucleic acids can bind specifically or non-specifically, followed by isolation of the beads using a magnet, and washing and eluting the nucleic acids from the beads (see, e.g., U.S. Pat. No. 5,705,628). In some embodiments, the above isolation methods may be preceded by an enzyme digestion step to help eliminate unwanted protein from the sample, e.g., digestion with proteinase K or other similar proteases. See, e.g., U.S. Pat. No. 7,001,724. If desired, RNase inhibitors may be added to the lysis buffer. For certain cell or sample types, it may be desirable to add a protein denaturation/digestion step to the protocol. Purification methods may be directed to isolating DNA, RNA, or both. When DNA and RNA are isolated together during or after an extraction procedure, further steps may be employed to purify one or both separately from the other. Sub-fractions of the extracted nucleic acids can also be generated, for example by purification according to size, sequence, or other physical or chemical characteristics. In addition to the initial nucleic acid isolation step, purification of nucleic acids can be performed after any step in the methods of the invention, for example to remove excess or unwanted reagents, reactants, or products.

Nucleic acid template molecules can be obtained as described in U.S. Patent Application Publication No. US2002/0190663A1, published October 9, 2003. Generally, nucleic acids can be extracted from a biological sample by a variety of techniques, such as those described by Maniatis et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor, N.Y., pp. 280-281 (1982). In some cases, the nucleic acids can first be extracted from the biological sample and then cross-linked in vitro. In some cases, naturally associated proteins (e.g., histones) can be further removed from the nucleic acids.

In other embodiments, the invention can be readily applied to any high molecular weight double-stranded DNA, including, for example, DNA isolated from tissues, cell cultures, body fluids, animal tissue, plants, bacteria, fungi, viruses, and the like.

In some embodiments, each of the plurality of independent samples can independently comprise at least about 1 ng, 2 ng, 5 ng, 10 ng, 20 ng, 30 ng, 40 ng, 50 ng, 75 ng, 100 ng, 150 ng, 200 ng, 250 ng, 300 ng, 400 ng, 500 ng, 1 μg, 1.5 μg, 2 μg, 5 μg, 10 μg, 20 μg, 50 μg, 100 μg, 200 μg, 500 μg, or 1000 μg, or more, of nucleic acid material. In some embodiments, each of the plurality of independent samples can independently comprise less than about 1 ng, 2 ng, 5 ng, 10 ng, 20 ng, 30 ng, 40 ng, 50 ng, 75 ng, 100 ng, 150 ng, 200 ng, 250 ng, 300 ng, 400 ng, 500 ng, 1 μg, 1.5 μg, 2 μg, 5 μg, 10 μg, 20 μg, 50 μg, 100 μg, 200 μg, 500 μg, or 1000 μg of nucleic acid.

In some embodiments, end repair is performed to generate blunt-ended, 5'-phosphorylated nucleic acid ends using commercial kits, such as those available from Epicentre Biotechnologies (Madison, WI).

An adapter oligonucleotide includes any oligonucleotide having a sequence, at least a portion of which is known, that can be joined to a target polynucleotide. Adapter oligonucleotides can comprise DNA, RNA, nucleotide analogues, non-canonical nucleotides, labeled nucleotides, modified nucleotides, or combinations thereof. Adapter oligonucleotides can be single-stranded, double-stranded, or partially double-stranded. In general, a partially double-stranded adapter comprises one or more single-stranded regions and one or more double-stranded regions. Double-stranded adapters can comprise two separate oligonucleotides hybridized to one another (also referred to as an "oligonucleotide duplex"), and hybridization may leave one or more blunt ends, one or more 3' overhangs, one or more 5' overhangs, one or more bulges resulting from mismatched and/or unpaired nucleotides, or any combination of these. In some embodiments, a single-stranded adapter comprises two or more sequences that are able to hybridize with one another. When two such hybridizable sequences are contained in a single-stranded adapter, hybridization yields a hairpin structure (hairpin adapter). When two hybridized regions of an adapter are separated from one another by a non-hybridized region, a "bubble" structure results. Adapters comprising a bubble structure can consist of a single adapter oligonucleotide comprising internal hybridizations, or can comprise two or more adapter oligonucleotides hybridized to one another. Internal sequence hybridization, such as between two hybridizable sequences in an adapter, can produce a double-stranded structure in a single-stranded adapter oligonucleotide. Adapters of different kinds can be used in combination, such as a hairpin adapter and a double-stranded adapter, or adapters of different sequences. Hybridizable sequences in a hairpin adapter may or may not include one or both ends of the oligonucleotide. When neither end is included in the hybridizable sequences, both ends are "free" or "overhanging". When only one end is hybridizable to another sequence in the adapter, the other end forms an overhang, such as a 3' overhang or a 5' overhang. When both the 5'-terminal nucleotide and the 3'-terminal nucleotide are included in the hybridizable sequences, such that the 5'-terminal nucleotide and the 3'-terminal nucleotide are complementary and hybridize with one another, the end is referred to as "blunt". Different adapters can be joined to target polynucleotides in sequential reactions or simultaneously. For example, the first and second adapters can be added to the same reaction. Adapters can be manipulated prior to combining with target polynucleotides. For example, terminal phosphates can be added or removed.
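The hairpin end geometries described above (blunt, 3' overhang, 5' overhang, both ends free) can be illustrated with a minimal Python sketch. The function names, the minimum stem and loop lengths, and the example sequence are all hypothetical choices made for illustration; the check considers only perfect Watson-Crick pairing and is not a thermodynamic folding model.

```python
# Illustrative only: classify the ends of a single-stranded hairpin adapter.
# All names, thresholds, and sequences here are assumptions, not from the source.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def terminal_stem_length(seq: str, min_loop: int = 3) -> int:
    """Length of the stem formed when the 5' end pairs with the 3' end."""
    max_stem = (len(seq) - min_loop) // 2
    stem = 0
    for k in range(1, max_stem + 1):
        # The first k bases must pair, in antiparallel fashion, with the last k.
        if seq[:k] == reverse_complement(seq[-k:]):
            stem = k
        else:
            break
    return stem


def describe_hairpin(seq: str, min_stem: int = 4, max_overhang: int = 6):
    """Find the longest perfect stem, allowing unpaired bases at either end."""
    best = None
    for free5 in range(max_overhang + 1):
        for free3 in range(max_overhang + 1):
            core = seq[free5:len(seq) - free3]
            stem = terminal_stem_length(core)
            if stem >= min_stem and (best is None or stem > best[0]):
                best = (stem, free5, free3)
    if best is None:
        return None  # no hairpin found under these assumptions
    stem, free5, free3 = best
    if free5 == 0 and free3 == 0:
        ends = "blunt"
    elif free5 and free3:
        ends = "both ends free"
    elif free5:
        ends = "5' overhang"
    else:
        ends = "3' overhang"
    return {"stem": stem, "five_prime_free": free5,
            "three_prime_free": free3, "ends": ends}


# Hypothetical adapter: 6-nt stem, 4-nt loop, complementary 6-nt stem.
HAIRPIN = "GCGGAT" + "TTTT" + "ATCCGC"
blunt_result = describe_hairpin(HAIRPIN)
overhang_result = describe_hairpin("AA" + HAIRPIN)
```

Under these assumptions, `blunt_result` reports a blunt hairpin, while prepending two unpaired bases yields a 5' overhang.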

Adapters can contain one or more of a variety of sequence elements, including but not limited to one or more amplification primer annealing sequences or complements thereof, one or more sequencing primer annealing sequences or complements thereof, one or more barcode sequences, one or more common sequences shared among multiple different adapters or subsets of different adapters, one or more restriction enzyme recognition sites, one or more overhangs complementary to one or more target polynucleotide overhangs, one or more probe binding sites (e.g., for attachment to a sequencing platform, such as a flow cell for massively parallel sequencing, such as those developed by Illumina, Inc.), one or more random or near-random sequences (e.g., one or more nucleotides selected at random from a set of two or more different nucleotides at one or more positions, with each of the different nucleotides selected at one or more positions represented in a pool of adapters comprising the random sequence), and combinations thereof. Two or more sequence elements can be non-adjacent to one another (e.g., separated by one or more nucleotides), adjacent to one another, partially overlapping, or completely overlapping. For example, an amplification primer annealing sequence can also serve as a sequencing primer annealing sequence. Sequence elements can be located at or near the 3' end, at or near the 5' end, or in the interior of the adapter oligonucleotide. When an adapter oligonucleotide is capable of forming secondary structure, such as a hairpin, sequence elements can be located partially or completely outside the secondary structure, partially or completely inside the secondary structure, or between sequences participating in the secondary structure. For example, when an adapter oligonucleotide comprises a hairpin structure, sequence elements can be located partially or completely inside or outside the hybridizable sequences (the "stem"), including in the sequence between the hybridizable sequences (the "loop"). In some embodiments, a first adapter oligonucleotide of a plurality of first adapter oligonucleotides having different barcode sequences comprises a sequence element common to all first adapter oligonucleotides in the plurality. In some embodiments, all second adapter oligonucleotides comprise a sequence element common to all second adapter oligonucleotides that is different from the common sequence element shared by the first adapter oligonucleotides. A difference in sequence elements can be any difference such that at least a portion of the different adapters do not completely align, for example due to changes in sequence length, deletion or insertion of one or more nucleotides, or a change in the nucleotide composition at one or more nucleotide positions (such as a base change or base modification). In some embodiments, an adapter oligonucleotide comprises a 5' overhang, a 3' overhang, or both, that is complementary to one or more target polynucleotides. Complementary overhangs can be one or more nucleotides in length, including but not limited to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or more nucleotides in length. For example, a complementary overhang can be about 1, 2, 3, 4, 5, or 6 nucleotides in length. Complementary overhangs can comprise a fixed sequence. Complementary overhangs can comprise a random sequence of one or more nucleotides, such that one or more nucleotides are selected at random from a set of two or more different nucleotides at one or more positions, with each of the different nucleotides selected at one or more positions represented in a pool of adapters with complementary overhangs comprising the random sequence. In some embodiments, an adapter overhang is complementary to a target polynucleotide overhang produced by restriction endonuclease digestion. In some embodiments, an adapter overhang comprises an adenine or a thymine.
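The idea of a random or near-random sequence element, in which every permitted nucleotide at each random position is represented somewhere in the adapter pool, can be sketched by expanding a degenerate template into its full pool. This is an illustrative sketch only: the template `GATCNN`, the function name, and the small subset of IUPAC ambiguity codes used are assumptions, not sequences or methods taken from the source.

```python
# Illustrative only: enumerate the adapter pool implied by a degenerate template.
from itertools import product

# A small subset of the standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "N": "ACGT",
}


def expand_pool(template: str) -> list[str]:
    """Return every concrete sequence matching a degenerate template."""
    choices = [IUPAC[base] for base in template]
    return ["".join(combo) for combo in product(*choices)]


# Hypothetical overhang: a fixed GATC followed by two fully random positions.
pool = expand_pool("GATCNN")
pool_size = len(pool)  # 4 choices at each of 2 random positions

# Every nucleotide is represented at each random position in the pool.
fifth_position = {seq[4] for seq in pool}
sixth_position = {seq[5] for seq in pool}
```

A partially random position (for example `R`, meaning A or G) shrinks the pool accordingly, which corresponds to selecting from a set of only two different nucleotides at that position.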

Adapter oligonucleotides can have any suitable length, at least sufficient to accommodate the one or more sequence elements of which they are comprised. In some embodiments, adapters are about, less than about, or more than about 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 90, 100, 200, or more nucleotides in length. In some examples, the adapters can be about 10 to about 50 nucleotides in length. In further examples, the adapters can be about 20 to about 40 nucleotides in length.

As used herein, the term "barcode" refers to a known nucleic acid sequence that allows some feature of a polynucleotide with which the barcode is associated to be identified. In some embodiments, the feature of the polynucleotide to be identified is the sample from which the polynucleotide is derived. In some embodiments, barcodes can be at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or more nucleotides in length. For example, barcodes can be at least 10, 11, 12, 13, 14, or 15 nucleotides in length. In some embodiments, barcodes can be shorter than 10, 9, 8, 7, 6, 5, or 4 nucleotides in length. For example, barcodes can be shorter than 10 nucleotides in length. In some embodiments, barcodes associated with some polynucleotides are of a different length than barcodes associated with other polynucleotides. In general, barcodes are of sufficient length and comprise sequences that are sufficiently different to allow the identification of samples based on the barcodes with which they are associated. In some embodiments, a barcode, and the sample source with which it is associated, can be identified accurately after the mutation, insertion, or deletion of one or more nucleotides in the barcode sequence, such as the mutation, insertion, or deletion of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more nucleotides. In some examples, 1, 2, or 3 nucleotides can be mutated, inserted, and/or deleted. In some embodiments, each barcode in a plurality of barcodes differs from every other barcode in the plurality at at least two nucleotide positions, such as at least 2, 3, 4, 5, 6, 7, 8, 9, 10, or more positions. In some examples, each barcode can differ from every other barcode at at least 2, 3, 4, or 5 positions. In some embodiments, both the first site and the second site comprise at least one of a plurality of barcode sequences. In some embodiments, the barcode for the second site is selected independently of the barcode for the first adapter oligonucleotide. In some embodiments, first sites and second sites having barcodes are paired, such that sequences of the pair comprise the same or different one or more barcodes. In some embodiments, the methods of the invention further comprise identifying the sample from which a target polynucleotide is derived based on a barcode sequence to which the target polynucleotide is joined. In general, a barcode may comprise a nucleic acid sequence that, when joined to a target polynucleotide, serves as an identifier of the sample from which the target polynucleotide was derived.
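The barcode properties described above, namely that barcodes differ from one another at several positions and can still be identified after a small number of changes, can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: the three 6-nt barcodes and sample names are hypothetical, only substitutions are handled (Hamming distance), and insertions or deletions, which the text also contemplates, would need an edit-distance comparison instead.

```python
# Illustrative only: substitution-tolerant sample identification by barcode.
# Barcodes, sample names, and function names are assumptions for this sketch.
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes) -> int:
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def assign_sample(observed: str, barcode_to_sample: dict, max_mismatches: int = 1):
    """Return the sample whose barcode is the unique match within tolerance.

    Returns None when no barcode, or more than one barcode, is close enough.
    """
    hits = [sample for barcode, sample in barcode_to_sample.items()
            if hamming(observed, barcode) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcode set.
BARCODES = {"ACGTAC": "sample_A", "TGCATG": "sample_B", "GATCCA": "sample_C"}

set_distance = min_pairwise_distance(BARCODES)      # how distinct the set is
exact = assign_sample("TGCATG", BARCODES)           # perfect read
one_error = assign_sample("ACGTAA", BARCODES)       # one substitution
unassigned = assign_sample("AAAAAA", BARCODES)      # too far from every barcode
```

As a design note, a set whose minimum pairwise distance is d can unambiguously correct up to (d - 1) // 2 substitutions, which is why barcode sets are usually chosen to differ at several positions rather than just one.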

In eukaryotes, genomic DNA is packaged into chromatin, which is organized into chromosomes within the nucleus. The basic structural unit of chromatin is the nucleosome, which consists of 146 base pairs (bp) of DNA wrapped around a histone octamer. The histone octamer consists of two copies each of the core histone H2A-H2B dimers and H3-H4 dimers. Nucleosomes are regularly spaced along the DNA in an arrangement commonly referred to as "beads on a string".

The assembly of core histones and DNA into nucleosomes is mediated by chaperone proteins and associated assembly factors. Nearly all of these factors are core histone-binding proteins. Some histone chaperones, such as nucleosome assembly protein-1 (NAP-1), exhibit a preference for binding to histones H3 and H4. It has also been observed that newly synthesized histones are acetylated and then deacetylated after assembly into chromatin. The factors that mediate histone acetylation or deacetylation therefore play an important role in the chromatin assembly process.

In general, two in vitro methods have been developed for reconstituting or assembling chromatin. One method is ATP-independent, while the second is ATP-dependent. The ATP-independent method for reconstituting chromatin involves the DNA and core histones plus either a protein such as NAP-1 or salt to act as a histone chaperone. This method results in a random arrangement of histones on the DNA that does not accurately mimic the native core nucleosome particle in the cell. These particles are often referred to as mononucleosomes because they do not form regularly ordered, extended nucleosome arrays and the DNA sequence used is usually no longer than 250 bp (Kundu, T. K. et al., Mol. Cell 6:551-561, 2000). To generate an extended array of ordered nucleosomes on a longer DNA sequence, the chromatin must be assembled through an ATP-dependent process.

The ATP-dependent assembly of periodic nucleosome arrays, which resemble those seen in native chromatin, requires the DNA sequence, core histone particles, a chaperone protein, and ATP-utilizing chromatin assembly factors. ACF (ATP-utilizing chromatin assembly and remodeling factor) and RSF (remodeling and spacing factor) are two widely studied assembly factors used to generate extended, ordered arrays of nucleosomes into chromatin in vitro (Fyodorov, D. V., and Kadonaga, J. T., Methods Enzymol. 371:499-515, 2003; Kundu, T. K. et al., Mol. Cell 6:551-561, 2000).

In particular embodiments, the methods of the invention can be readily applied to any type of fragmented double-stranded DNA, including but not limited to, for example, cell-free DNA isolated from plasma, serum, and/or urine; apoptotic DNA from cells and/or tissues; DNA fragmented enzymatically in vitro (for example, by DNase I and/or restriction endonucleases); and/or DNA fragmented by mechanical forces (hydro-shearing, sonication, nebulization, etc.).

Nucleic acid obtained from biological samples can be fragmented to produce fragments suitable for analysis. Template nucleic acids may be fragmented or sheared to a desired length using a variety of mechanical, chemical, and/or enzymatic methods. DNA may be randomly sheared by sonication (e.g., the Covaris method), by brief exposure to a DNase, or by using a mixture of one or more restriction enzymes, or a transposase or nicking enzyme. RNA may be fragmented by brief exposure to an RNase, by heat plus magnesium, or by shearing. The RNA may be converted to cDNA. If fragmentation is employed, the RNA may be converted to cDNA before or after fragmentation. In some embodiments, nucleic acid from a biological sample is fragmented by sonication. In other embodiments, nucleic acid is fragmented by a hydroshear instrument. Generally, individual nucleic acid template molecules can be from about 2 kb to about 40 kb. In various embodiments, nucleic acids can be fragments of about 6 kb to 10 kb. Nucleic acid molecules may be single-stranded, double-stranded, or double-stranded with single-stranded regions (for example, stem-loop structures).

In some embodiments, cross-linked DNA molecules may be subjected to a size selection step. Size selection of the nucleic acids may be performed to select cross-linked DNA molecules below or above a certain size. Size selection may further be affected by the frequency of cross-links and/or by the fragmentation method, for example by choosing a frequently cutting or a rarely cutting restriction enzyme. In some embodiments, a composition may be prepared comprising cross-linked DNA molecules in the range of about 1 kb to 5 Mb, about 5 kb to 5 Mb, about 5 kb to 2 Mb, about 10 kb to 2 Mb, about 10 kb to 1 Mb, about 20 kb to 1 Mb, about 20 kb to 500 kb, about 50 kb to 500 kb, about 50 kb to 200 kb, about 60 kb to 200 kb, about 60 kb to 150 kb, about 80 kb to 150 kb, about 80 kb to 120 kb, or about 100 kb to 120 kb, or any range bounded by any of these values (e.g., about 150 kb to 1 Mb).
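The effect of choosing a frequent or a rare cutter, and the size selection step itself, can be made concrete with a back-of-the-envelope Python sketch. It assumes a simplified model that is not stated in the source: bases are independent and equally frequent, so a recognition site of length n is expected about once every 4^n bp. Real genomes deviate from this, and the function names and example lengths are hypothetical.

```python
# Illustrative only: expected cut spacing and a simple size-selection filter.
# The uniform random-sequence model and all names here are assumptions.


def expected_site_spacing(site_length: int) -> int:
    """Expected distance in bp between occurrences of an n-bp recognition site,
    assuming independent, equally frequent A, C, G, and T."""
    return 4 ** site_length


def size_select(lengths_bp, min_bp=None, max_bp=None):
    """Keep only molecules whose length falls within the chosen bounds."""
    kept = []
    for length in lengths_bp:
        if min_bp is not None and length < min_bp:
            continue
        if max_bp is not None and length > max_bp:
            continue
        kept.append(length)
    return kept


four_cutter = expected_site_spacing(4)    # frequent cutter
six_cutter = expected_site_spacing(6)     # intermediate
eight_cutter = expected_site_spacing(8)   # rare cutter

# Hypothetical molecule lengths, filtered to an example 50 kb to 200 kb window.
molecules = [500, 60_000, 150_000, 2_000_000]
selected = size_select(molecules, min_bp=50_000, max_bp=200_000)
```

Passing only `min_bp` or only `max_bp` corresponds to selecting molecules above or below a single size threshold.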

In some embodiments, sample polynucleotides are fragmented into a population of fragmented DNA molecules of one or more specific size ranges. In some embodiments, fragments can be generated from at least about 1, about 2, about 5, about 10, about 20, about 50, about 100, about 200, about 500, about 1000, about 2000, about 5000, about 10,000, about 20,000, about 50,000, about 100,000, about 200,000, about 500,000, about 1,000,000, about 2,000,000, about 5,000,000, about 10,000,000 or more genome equivalents of starting DNA. Fragmentation can be accomplished by methods known in the art, including chemical, enzymatic, and mechanical fragmentation. In some embodiments, the fragments have an average length of about 10 to about 10,000, about 10 to about 20,000, about 10 to about 30,000, about 10 to about 40,000, about 10 to about 50,000, about 10 to about 60,000, about 10 to about 70,000, about 10 to about 80,000, about 10 to about 90,000, about 10 to about 100,000, about 10 to about 150,000, about 10 to about 200,000, about 10 to about 300,000, about 10 to about 400,000, about 10 to about 500,000, about 10 to about 600,000, about 10 to about 700,000, about 10 to about 800,000, about 10 to about 900,000, about 10 to about 1,000,000, about 10 to about 2,000,000, about 10 to about 5,000,000, about 10 to about 10,000,000 or more nucleotides. In some embodiments, the average length of the fragments is about 1 kb to about 10 Mb. In some embodiments, the average length of the fragments is about 1 kb to 5 Mb, about 5 kb to 5 Mb, about 5 kb to 2 Mb, about 10 kb to 2 Mb, about 10 kb to 1 Mb, about 20 kb to 1 Mb, about 20 kb to 500 kb, about 50 kb to 500 kb, about 50 kb to 200 kb, about 60 kb to 200 kb, about 60 kb to 150 kb, about 80 kb to 150 kb, about 80 kb to 120 kb, or about 100 kb to 120 kb, or any range bounded by any of these values (e.g., about 60 to 120 kb). In some embodiments, the average length of the fragments is less than about 10 Mb, less than about 5 Mb, less than about 1 Mb, less than about 500 kb, less than about 200 kb, less than about 100 kb, or less than about 50 kb. In other embodiments, the average length of the fragments is greater than about 5 kb, greater than about 10 kb, greater than about 50 kb, greater than about 100 kb, greater than about 200 kb, greater than about 500 kb, greater than about 1 Mb, greater than about 5 Mb, or greater than about 10 Mb. In some embodiments, the fragmentation is accomplished mechanically and comprises subjecting the sample DNA molecules to sonication. In some embodiments, the fragmentation comprises treating the sample DNA molecules with one or more enzymes under conditions suitable for the one or more enzymes to generate double-stranded nucleic acid breaks. Examples of enzymes useful for generating DNA fragments include sequence-specific and non-sequence-specific nucleases. Non-limiting examples of nucleases include DNase I, Fragmentase, restriction endonucleases, variants thereof, and combinations thereof. For example, digestion with DNase I in the absence of Mg++ and in the presence of Mn++ can induce random double-stranded breaks in DNA. In some embodiments, the fragmentation comprises treating the sample DNA molecules with one or more restriction endonucleases. Fragmentation can produce fragments having 5' overhangs, 3' overhangs, blunt ends, or a combination thereof. In some embodiments, such as when the fragmentation comprises the use of one or more restriction endonucleases, cleavage of the sample DNA molecules leaves ends with a predictable sequence. In some embodiments, the method includes a step of size-selecting the fragments by standard methods, such as column purification or isolation from an agarose gel.
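The two ideas in the preceding paragraph, that a restriction endonuclease leaves ends of predictable sequence and that fragments can then be size-selected, can be illustrated in silico. The following is a minimal sketch rather than part of the disclosed method: the function names, the toy sequence, and the choice of a GATC site cut immediately 5' of the G are assumptions made only for illustration, and only the top strand is modeled.

```python
import re


def digest(sequence, site, cut_offset):
    """Split `sequence` (top strand only) at every occurrence of `site`.

    `cut_offset` is the number of bases into the site at which the top
    strand is cut (0 means the cut falls immediately 5' of the site).
    """
    # A lookahead is used so that overlapping sites are also found.
    pattern = f"(?={re.escape(site)})"
    cuts = [m.start() + cut_offset for m in re.finditer(pattern, sequence)]
    bounds = [0] + cuts + [len(sequence)]
    # Drop empty fragments (e.g. when the sequence starts with the site).
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments, min_len, max_len):
    """Keep fragments whose length lies within [min_len, max_len]."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


# Hypothetical toy sequence containing two GATC sites.
toy = "AAAAGATCTTTTTTGATCCC"
fragments = digest(toy, "GATC", 0)
selected = size_select(fragments, 8, 10)
```

In this sketch every fragment after the first begins with the recognition site, which is the in-silico counterpart of a predictable fragment end; a real workflow would also model the bottom strand and the resulting overhang.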

In some embodiments, the 5' and/or 3' terminal nucleotide sequences of the fragmented DNA are not modified prior to ligation. For example, fragmentation by a restriction endonuclease can be used to leave a predictable overhang, followed by ligation with a nucleic acid end comprising an overhang complementary to the predictable overhang on the DNA fragment. In another example, cleavage by an enzyme leaves a predictable blunt end, and the blunt-ended DNA fragments can then be ligated to nucleic acids comprising a blunt end, such as adapters, oligonucleotides, or polynucleotides. In some embodiments, the fragmented DNA molecules are blunt-end polished (or "end repaired") to produce DNA fragments having blunt ends prior to ligation to adapters. The blunt-end polishing step can be accomplished by incubation with a suitable enzyme, such as a DNA polymerase having both 3' to 5' exonuclease activity and 5' to 3' polymerase activity, for example T4 polymerase. In some embodiments, end repair can be followed by the addition of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more nucleotides, such as one or more adenines, one or more thymines, one or more guanines, or one or more cytosines, to produce an overhang. For example, end repair can be followed by the addition of 1, 2, 3, 4, 5, or 6 nucleotides. DNA fragments having an overhang can be joined to one or more nucleic acids having a complementary overhang, such as oligonucleotides, adapter oligonucleotides, or polynucleotides, for example in a ligation reaction. For example, a single adenine can be added to the 3' ends of end-repaired DNA fragments using a template-independent polymerase, followed by ligation to one or more adapters each having a thymine at a 3' end. In some embodiments, nucleic acids, such as oligonucleotides or polynucleotides, can be joined to blunt-ended double-stranded DNA molecules that have been modified by extension of the 3' end with one or more nucleotides followed by 5' phosphorylation. In some cases, extension of the 3' end can be performed with a polymerase, such as Klenow polymerase or any of the suitable polymerases provided herein, or with a terminal deoxynucleotidyl transferase, in the presence of one or more dNTPs in a suitable buffer containing magnesium. In some embodiments, target polynucleotides having blunt ends are joined to one or more adapters comprising a blunt end. Phosphorylation of the 5' ends of the DNA fragment molecules can be performed, for example, with T4 polynucleotide kinase in a suitable buffer containing ATP and magnesium. The fragmented DNA molecules may optionally be treated to dephosphorylate the 5' ends or 3' ends, for example by using enzymes known in the art, such as phosphatases.

The term "ligation", as used herein with respect to two polynucleotides (such as an adapter oligonucleotide and a target polynucleotide), refers to the covalent joining of two separate DNA segments to produce a single, larger polynucleotide with a contiguous backbone. Methods for joining two DNA segments are known in the art and include, but are not limited to, enzymatic and non-enzymatic (e.g., chemical) methods. Examples of non-enzymatic ligation reactions include the non-enzymatic ligation techniques described in US Patent Nos. 5,780,613 and 5,476,930, the contents of which are incorporated herein by reference. In some embodiments, an adapter oligonucleotide is joined to a target polynucleotide by a ligase, for example a DNA ligase or an RNA ligase. Multiple ligases, each having characterized reaction conditions, are known in the art. They include, but are not limited to: NAD+-dependent ligases, including tRNA ligase, Taq DNA ligase, Thermus filiformis DNA ligase, E. coli DNA ligase, Tth DNA ligase, Thermus scotoductus DNA ligase (type I and type II), thermostable ligase, Ampligase thermostable DNA ligase, VanC-type ligase, 9°N DNA ligase, Tsp DNA ligase, and novel ligases discovered by bioprospecting; ATP-dependent ligases, including T4 RNA ligase, T4 DNA ligase, T3 DNA ligase, T7 DNA ligase, Pfu DNA ligase, DNA ligase I, DNA ligase III, DNA ligase IV, and novel ligases discovered by bioprospecting; and wild-type, mutant isoforms, and genetically engineered variants thereof.

Ligation can take place between DNA segments having hybridizable sequences, such as complementary overhangs. Ligation can also take place between two blunt ends. Generally, a 5' phosphate is utilized in a ligation reaction. The 5' phosphate can be provided by the target polynucleotide, the adapter oligonucleotide, or both. 5' phosphates can be added to or removed from the DNA segments to be joined, as needed. Methods for adding or removing 5' phosphates are known in the art and include, but are not limited to, enzymatic and chemical methods. Enzymes useful for adding and/or removing 5' phosphates include kinases, phosphatases, and polymerases. In some embodiments, both of the two ends joined in a ligation reaction (e.g., an adapter end and a target polynucleotide end) provide a 5' phosphate, such that two covalent linkages are made in joining the two ends. In some embodiments, only one of the two ends joined in a ligation reaction (e.g., only one of an adapter end and a target polynucleotide end) provides a 5' phosphate, such that only one covalent linkage is made in joining the two ends.

In some embodiments, only one strand at one or both ends of a target polynucleotide is joined to an adapter oligonucleotide. In some embodiments, both strands at one or both ends of a target polynucleotide are joined to an adapter oligonucleotide. In some embodiments, 3' phosphates are removed prior to ligation. In some embodiments, an adapter oligonucleotide is added to both ends of a target polynucleotide, wherein one or both strands at each end are joined to an adapter oligonucleotide. When both strands at both ends are joined to an adapter oligonucleotide, joining can be followed by a cleavage reaction that leaves a 5' overhang that can serve as a template for the extension of the corresponding 3' end, which 3' end may or may not include one or more nucleotides derived from the adapter oligonucleotide. In some embodiments, a target polynucleotide is joined to a first adapter oligonucleotide on one end and to a second adapter oligonucleotide on the other end. In some embodiments, the two ends of a target polynucleotide are joined to opposite ends of a single adapter oligonucleotide. In some embodiments, the target polynucleotide and the adapter oligonucleotide to which it is joined comprise blunt ends. In some embodiments, a separate ligation reaction can be carried out for each sample, using a different first adapter oligonucleotide comprising at least one barcode sequence for each sample, such that no barcode sequence is joined to the target polynucleotides of more than one sample. A DNA segment or target polynucleotide that has an adapter oligonucleotide joined to it is considered "tagged" by the joined adapter.
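Because each barcode sequence is joined to the target polynucleotides of only one sample, sequence reads can later be assigned back to their sample by reading the barcode. The following is a minimal sketch of that assignment, under assumptions not stated in the text: the barcode is taken to sit at the start of each read, all barcodes have the same length, and only exact matches are accepted. The function name, barcodes, and sample names are hypothetical.

```python
def demultiplex(reads, barcode_to_sample):
    """Assign reads to samples by an exact-match barcode prefix.

    Returns (reads_by_sample, unassigned). The barcode is trimmed from
    each assigned read. Using a dict for `barcode_to_sample` guarantees
    that one barcode maps to at most one sample.
    """
    lengths = {len(barcode) for barcode in barcode_to_sample}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes barcodes of equal length")
    k = lengths.pop()

    reads_by_sample = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = []
    for read in reads:
        sample = barcode_to_sample.get(read[:k])
        if sample is None:
            unassigned.append(read)
        else:
            reads_by_sample[sample].append(read[k:])
    return reads_by_sample, unassigned


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample_1", "TTAG": "sample_2"}
reads = ["ACGTGGGCCC", "TTAGAAATTT", "CCCCGGGGAA", "ACGTTTTTTT"]
by_sample, leftover = demultiplex(reads, barcodes)
```

Practical demultiplexers usually tolerate one or more sequencing errors in the barcode; exact matching is used here only to keep the sketch short.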

In some cases, the ligation reaction can be performed at a DNA segment or target polynucleotide concentration of about 0.1 ng/μL, about 0.2 ng/μL, about 0.3 ng/μL, about 0.4 ng/μL, about 0.5 ng/μL, about 0.6 ng/μL, about 0.7 ng/μL, about 0.8 ng/μL, about 0.9 ng/μL, about 1.0 ng/μL, about 1.2 ng/μL, about 1.4 ng/μL, about 1.6 ng/μL, about 1.8 ng/μL, about 2.0 ng/μL, about 2.5 ng/μL, about 3.0 ng/μL, about 3.5 ng/μL, about 4.0 ng/μL, about 4.5 ng/μL, about 5.0 ng/μL, about 6.0 ng/μL, about 7.0 ng/μL, about 8.0 ng/μL, about 9.0 ng/μL, about 10 ng/μL, about 15 ng/μL, about 20 ng/μL, about 30 ng/μL, about 40 ng/μL, about 50 ng/μL, about 60 ng/μL, about 70 ng/μL, about 80 ng/μL, about 90 ng/μL, about 100 ng/μL, about 150 ng/μL, about 200 ng/μL, about 300 ng/μL, about 400 ng/μL, about 500 ng/μL, about 600 ng/μL, about 800 ng/μL, or about 1000 ng/μL. For example, the ligation can be performed at a DNA segment or target polynucleotide concentration of about 100 ng/μL, about 150 ng/μL, about 200 ng/μL, about 300 ng/μL, about 400 ng/μL, or about 500 ng/μL.

In some cases, the ligation reaction can be performed at a DNA segment or target polynucleotide concentration of about 0.1 to 1000 ng/μL, about 1 to 1000 ng/μL, about 1 to 800 ng/μL, about 10 to 800 ng/μL, about 10 to 600 ng/μL, about 100 to 600 ng/μL, or about 100 to 500 ng/μL.
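The concentrations above are given by mass, whereas the number of ligatable ends depends on the molar concentration, which in turn depends on fragment length. The following sketch converts between the two. It assumes an average molar mass of about 660 g/mol per base pair of double-stranded DNA, a commonly used approximation that is not taken from the text; the function names are hypothetical.

```python
# Assumed average molar mass of one base pair of dsDNA, in g/mol.
# This is a widely used approximation, not a value stated in the text.
AVG_BP_MASS_G_PER_MOL = 660.0


def ng_per_ul_to_nM(conc_ng_per_ul, length_bp):
    """Convert a dsDNA mass concentration (ng/uL) to nanomolar.

    1 ng/uL equals 1e-3 g/L, so
    molarity (nM) = conc * 1e6 / (660 * length_bp).
    """
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS_G_PER_MOL * length_bp)


def ends_nM(conc_ng_per_ul, length_bp):
    """Each linear double-stranded fragment contributes two ends."""
    return 2 * ng_per_ul_to_nM(conc_ng_per_ul, length_bp)


# Example: 500 bp fragments at 100 ng/uL.
molecules_nM = ng_per_ul_to_nM(100, 500)
```

At a fixed mass concentration, shorter fragments present proportionally more ends, which is one reason a mass-based range can correspond to very different end concentrations.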

In some cases, the ligation reaction can be performed for more than about 5 minutes, about 10 minutes, about 20 minutes, about 30 minutes, about 40 minutes, about 50 minutes, about 60 minutes, about 90 minutes, about 2 hours, about 3 hours, about 4 hours, about 5 hours, about 6 hours, about 8 hours, about 10 hours, about 12 hours, about 18 hours, about 24 hours, about 36 hours, about 48 hours, or about 96 hours. In other cases, the ligation reaction can be performed for less than about 5 minutes, about 10 minutes, about 20 minutes, about 30 minutes, about 40 minutes, about 50 minutes, about 60 minutes, about 90 minutes, about 2 hours, about 3 hours, about 4 hours, about 5 hours, about 6 hours, about 8 hours, about 10 hours, about 12 hours, about 18 hours, about 24 hours, about 36 hours, about 48 hours, or about 96 hours. For example, the ligation reaction can be performed for about 30 minutes to 90 minutes. In some embodiments, joining an adapter to a target polynucleotide produces a joined product polynucleotide having a 3' overhang comprising a nucleotide sequence derived from the adapter.

In some embodiments, after joining at least one adapter oligonucleotide to a target polynucleotide, the 3' end of one or more target polynucleotides is extended using the one or more joined adapter oligonucleotides as template. For example, an adapter comprising two hybridized oligonucleotides that is joined only to the 5' end of a target polynucleotide allows the unjoined 3' end of the target to be extended using the joined strand of the adapter as template, concurrently with or following displacement of the unjoined strand. Both strands of an adapter comprising two hybridized oligonucleotides may be joined to a target polynucleotide such that the joined product has a 5' overhang, and the complementary 3' end can be extended using the 5' overhang as template. As a further example, a hairpin adapter oligonucleotide can be joined to the 5' end of a target polynucleotide. In some embodiments, the 3' end of the target polynucleotide that is extended comprises one or more nucleotides from an adapter oligonucleotide. For target polynucleotides to which adapters are joined on both ends, extension can be carried out for both 3' ends of a double-stranded target polynucleotide having 5' overhangs. This 3' end extension, or "fill-in" reaction, generates a complementary sequence, or "complement", to the adapter oligonucleotide template that is hybridized to the template, thus filling in the 5' overhang to produce a double-stranded sequence region. Where both ends of a double-stranded target polynucleotide have 5' overhangs that are filled in by extension of the complementary strands' 3' ends, the product is completely double-stranded. Extension can be carried out by any suitable polymerase known in the art, such as a DNA polymerase, many of which are commercially available. DNA polymerases can comprise DNA-dependent DNA polymerase activity, RNA-dependent DNA polymerase activity, or both DNA-dependent and RNA-dependent DNA polymerase activity. DNA polymerases can be thermostable or non-thermostable. Examples of DNA polymerases include, but are not limited to, Taq polymerase, Tth polymerase, Tli polymerase, Pfu polymerase, Pfuturbo polymerase, Pyrobest polymerase, Pwo polymerase, KOD polymerase, Bst polymerase, Sac polymerase, Sso polymerase, Poc polymerase, Pab polymerase, Mth polymerase, Pho polymerase, ES4 polymerase, VENT polymerase, DEEPVENT polymerase, EX-Taq polymerase, LA-Taq polymerase, Expand polymerases, Platinum Taq polymerases, Hi-Fi polymerase, Tbr polymerase, Tfl polymerase, Tru polymerase, Tac polymerase, Tne polymerase, Tma polymerase, Tih polymerase, Tfi polymerase, Klenow fragment, and variants, modified products, and derivatives thereof. 3' end extension can be performed before or after pooling of target polynucleotides from independent samples.

In certain embodiments, the present invention provides methods for enriching and analyzing target nucleic acids. In some cases, the method used for enrichment is in a solution-based format. In some cases, the target nucleic acid can be labeled with a labeling agent. In other cases, the target nucleic acid can be cross-linked to one or more association molecules that are labeled with a labeling agent. Examples of labeling agents include, but are not limited to, biotin, polyhistidine tags, and chemical tags (e.g., the alkynes and azides used in Click Chemistry methods). Further, the labeled target nucleic acid can be captured, and thereby enriched, using a capturing agent. The capturing agent can be streptavidin and/or avidin, an antibody, a chemical moiety (e.g., an alkyne or an azide), or any biological, chemical, physical, or enzymatic agent known in the art for affinity purification.

In some cases, immobilized or non-immobilized nucleic acid probes can be used to capture the target nucleic acids. For example, the target nucleic acids can be enriched from a sample by hybridization to probes on a solid support or in solution. In some examples, the sample can be a genomic sample. In some examples, the probes can be amplicons. The amplicons can comprise a predetermined sequence. Further, the hybridized target nucleic acids can be washed and/or eluted off the probes. The target nucleic acid can be a DNA, RNA, cDNA, or mRNA molecule.

In some cases, the enrichment method can comprise contacting a sample comprising the target nucleic acid with the probes and binding the target nucleic acid to a solid support. In some cases, the sample can be fragmented using chemical, physical, or enzymatic methods to yield the target nucleic acids. In some cases, the probes can hybridize specifically to the target nucleic acids. In some cases, the target nucleic acids can have an average size of about 50 to 5000, about 50 to 2000, about 100 to 2000, about 100 to 1000, about 200 to 1000, about 200 to 800, about 300 to 800, about 300 to 600, or about 400 to 600 nucleotide residues. The target nucleic acids can be further separated from the unbound nucleic acids in the sample. The solid support can be washed and/or eluted to provide the enriched target nucleic acids. In some examples, the enrichment steps can be repeated about 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 times. For example, the enrichment steps can be repeated about 1, 2, or 3 times.

In some cases, the enrichment method can comprise providing probe-derived amplicons, wherein the probes used for amplification are attached to a solid support. The solid support can comprise support-immobilized nucleic acid probes to capture specific target nucleic acids from a sample. The probe-derived amplicons can hybridize to the target nucleic acids. Following hybridization to the probe amplicons, the target nucleic acids in the sample can be enriched by capturing (e.g., via capturing agents such as biotin or antibodies) and by washing and/or eluting the hybridized target nucleic acids from the captured probes (Figure 4). The target nucleic acid sequences can be further amplified using, for example, PCR methods to produce an amplified pool of enriched PCR products.

In some cases, the solid support can be a microarray, a slide, a chip, a microwell, a column, a tube, a particle, or a bead. In some examples, the solid support can be coated with streptavidin and/or avidin. In other examples, the solid support can be coated with an antibody. Further, the solid support can comprise a glass, metal, ceramic, or polymeric material. In some embodiments, the solid support can be a nucleic acid microarray (e.g., a DNA microarray). In other embodiments, the solid support can be a paramagnetic bead.

In some cases, the enrichment method can comprise digestion with a second restriction enzyme, self-ligation (e.g., self-circularization), and re-digestion with the original restriction endonuclease. In particular examples, only the ligation products will be linearized and available for adapter ligation and sequencing. In other cases, the ligation junction sequence itself can be used for hybridization-based enrichment using a bait probe complementary to the junction sequence.

In particular embodiments, the present invention provides methods for amplifying the enriched DNA. In some cases, the enriched DNA is a read pair. The read pair can be obtained by the methods of the present invention.

In some embodiments, one or more amplification and/or replication steps are used to prepare the library to be sequenced. Any amplification method known in the art may be used. Examples of amplification techniques that can be used include, but are not limited to, quantitative PCR, quantitative fluorescent PCR (QF-PCR), multiplex fluorescent PCR (MF-PCR), real-time PCR (RT-PCR), single-cell PCR, restriction fragment length polymorphism PCR (PCR-RFLP), PCK-RFLPIRT-PCR-IRFLP, hot-start PCR, nested PCR, in situ PCR, in situ rolling circle amplification (RCA), bridge PCR, ligation-mediated PCR, Qβ replicase amplification, inverse PCR, picotiter PCR, and emulsion PCR. Other suitable amplification methods include the ligase chain reaction (LCR), transcription amplification, self-sustained sequence replication, selective amplification of target polynucleotide sequences, consensus sequence primed PCR (CP-PCR), arbitrarily primed PCR (AP-PCR), degenerate oligonucleotide-primed PCR (DOP-PCR), and nucleic acid based sequence amplification (NABSA). Other amplification methods that can be used herein include those described in US Patent Nos. 5,242,794, 5,494,810, 4,988,617, and 6,582,938.

In particular embodiments, PCR is used to amplify DNA molecules after they have been dispensed into individual partitions. In some cases, one or more specific priming sequences within amplification adapters are used for PCR amplification. The amplification adapters can be ligated to the fragmented DNA molecules either before or after dispensing into individual partitions. Polynucleotides comprising amplification adapters with suitable priming sequences on both ends can be PCR-amplified exponentially. Polynucleotides with only one suitable priming sequence, for example because of imperfect ligation efficiency of the amplification adapters comprising the priming sequences, may undergo only linear amplification. Further, polynucleotides can be excluded from amplification (e.g., PCR amplification) altogether if no adapters comprising suitable priming sequences are ligated. In some embodiments, the number of PCR cycles varies between 10 and 30, but can be as low as 9, 8, 7, 6, 5, 4, 3, 2 or fewer, or as high as 40, 45, 50, 55, 60 or more. As a result, after PCR amplification, exponentially amplifiable fragments carrying amplification adapters with suitable priming sequences can be present at a much higher (1000-fold or more) concentration than linearly amplifiable or non-amplifiable fragments. Benefits of PCR, as compared with whole-genome amplification techniques (such as amplification with randomized primers or multiple displacement amplification (MDA) using phi29 polymerase), include, but are not limited to: more uniform relative sequence coverage, because each fragment can be copied at most once per cycle and because amplification is controlled by the thermocycling program; a substantially lower rate of chimeric molecule formation than, for example, MDA (Lasken et al., 2007, BMC Biotechnology), which matters because chimeric molecules present non-biological sequences in the assembly graph and so pose significant challenges for accurate sequence assembly, potentially resulting in a higher rate of misassemblies or in highly ambiguous and fragmented assemblies; reduced sequence-specific bias, which can result from the binding of the randomized primers commonly used in MDA, in contrast to the use of specific priming sites with a specific sequence; higher reproducibility in the amount of final amplified DNA product, which can be controlled by the choice of the number of PCR cycles; and higher fidelity of replication with the polymerases commonly used in PCR than with common whole-genome amplification techniques known in the art.
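The concentration difference between exponentially and linearly amplifiable fragments can be checked with a simple idealized model. The sketch below assumes perfect per-cycle efficiency by default and no reagent depletion, so the numbers are an upper bound rather than a prediction; the function names are hypothetical.

```python
def copies_exponential(cycles, efficiency=1.0):
    """Copies per starting molecule when both ends carry a priming site.

    Every copy can itself be copied, so the count multiplies by
    (1 + efficiency) each cycle.
    """
    return (1 + efficiency) ** cycles


def copies_linear(cycles, efficiency=1.0):
    """Copies per starting molecule when only one priming site is present.

    Only the original template can be primed, so each cycle adds at
    most one new copy.
    """
    return 1 + efficiency * cycles


def first_cycle_with_fold(fold, max_cycles=60):
    """Smallest cycle count at which exponential/linear reaches `fold`."""
    for n in range(1, max_cycles + 1):
        if copies_exponential(n) / copies_linear(n) >= fold:
            return n
    return None


# Cycle count at which the idealized enrichment first reaches 1000-fold.
threshold_cycle = first_cycle_with_fold(1000)
```

Under these idealized assumptions a 1000-fold difference is already reached within the 10 to 30 cycle range given above, which is consistent with the "1000-fold or more" figure; real reactions plateau, so the actual ratio at high cycle numbers will be lower than the model suggests.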

In some embodiments, the fill-in reaction is followed by, or performed as part of, amplification of one or more target polynucleotides using a first primer and a second primer, wherein the first primer comprises a sequence that is hybridizable to at least a portion of the complement of one or more of the first adapter oligonucleotides, and further wherein the second primer comprises a sequence that is hybridizable to at least a portion of the complement of one or more of the second adapter oligonucleotides. Each of the first and second primers can be of any suitable length, such as about, less than about, or more than about 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 90, 100 or more nucleotides, any portion or all of which may be complementary to the corresponding target sequence (e.g., about, less than about, or more than about 5, 10, 15, 20, 25, 30, 35, 40, 45, 50 or more nucleotides). For example, about 10 to 50 nucleotides can be complementary to the corresponding target sequence.

"Amplification" refers to any process by which the copy number of a target sequence is increased. In some cases, a replication reaction may produce only a single complementary copy (replica) of a polynucleotide. Methods for primer-directed amplification of target polynucleotides are known in the art and include, but are not limited to, methods based on the polymerase chain reaction (PCR). Conditions favorable to the amplification of target sequences by PCR are known in the art, can be optimized at a variety of steps in the process, and depend on characteristics of the elements in the reaction, such as target type, target concentration, length of the sequence to be amplified, sequence of the target and/or of one or more primers, primer length, primer concentration, polymerase used, reaction volume, and the ratio of one or more elements to one or more other elements, some or all of which can be altered. In general, PCR involves the steps of denaturation of the target to be amplified (if double-stranded), hybridization of one or more primers to the target, and extension of the primers by a DNA polymerase, with the steps repeated (or "cycled") in order to amplify the target sequence. Steps in this process can be optimized for various outcomes, such as to increase yield, to reduce the formation of spurious products, and/or to increase or decrease the specificity of primer annealing. Methods of optimization are known in the art and include adjusting the type or amount of elements in the amplification reaction and/or the conditions of a given step in the process, such as the temperature of a particular step, the duration of a particular step, and/or the number of cycles.

In some embodiments, an amplification reaction can comprise at least about 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200, or more cycles. In some examples, an amplification reaction can comprise at least about 20, 25, 30, 35, or 40 cycles. In some embodiments, an amplification reaction comprises no more than about 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200, or more cycles. A cycle can contain any number of steps, such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more steps. A step can comprise any temperature or temperature gradient suitable for achieving the purpose of that step, including but not limited to 3' end extension (e.g., adapter fill-in), primer annealing, primer extension, and strand denaturation. A step can be of any duration, including but not limited to about, less than about, or more than about 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 70, 80, 90, 100, 120, 180, 240, 300, 360, 420, 480, 540, 600, 1200, 1800, or more seconds, including indefinitely until manually interrupted. Any number of cycles comprising different steps can be combined in any order. In some embodiments, different cycles comprising different steps are combined such that the total number of cycles in the combination is about, less than about, or more than about 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200, or more cycles. In some embodiments, amplification is performed after the fill-in reaction.
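
The way cycles made of different steps combine into a total cycle count can be illustrated with a small sketch. The class names, temperatures, and durations below are hypothetical values chosen only for illustration; they are not taken from this description, which allows any temperature and duration suited to a step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    purpose: str          # e.g. "denaturation", "annealing", "extension"
    temperature_c: float  # illustrative value only
    seconds: float        # illustrative value only


@dataclass
class CycleBlock:
    steps: List[Step]     # the steps that make up one cycle
    repeats: int          # how many times this cycle is run


def total_cycles(program: List[CycleBlock]) -> int:
    """Total number of cycles across all combined blocks."""
    return sum(block.repeats for block in program)


def total_seconds(program: List[CycleBlock]) -> float:
    """Total programmed hold time, ignoring ramping between temperatures."""
    return sum(
        block.repeats * sum(step.seconds for step in block.steps)
        for block in program
    )


# Hypothetical program: one fill-in cycle followed by 30 amplification cycles.
program = [
    CycleBlock([Step("adapter fill-in", 72.0, 120.0)], repeats=1),
    CycleBlock(
        [
            Step("denaturation", 95.0, 30.0),
            Step("annealing", 60.0, 30.0),
            Step("extension", 72.0, 60.0),
        ],
        repeats=30,
    ),
]
```

With this hypothetical program, `total_cycles(program)` counts the fill-in cycle together with the amplification cycles, which is one way to read "the total number of cycles in the combination".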

In some embodiments, the amplification reaction can be carried out on at least about 1 ng, 2 ng, 3 ng, 4 ng, 5 ng, 6 ng, 7 ng, 8 ng, 9 ng, 10 ng, 12 ng, 14 ng, 16 ng, 18 ng, 20 ng, 25 ng, 30 ng, 40 ng, 50 ng, 100 ng, 200 ng, 300 ng, 400 ng, 500 ng, 600 ng, 800 ng, or 1000 ng of target DNA molecules. In other embodiments, the amplification reaction can be carried out on less than about 1 ng, 2 ng, 3 ng, 4 ng, 5 ng, 6 ng, 7 ng, 8 ng, 9 ng, 10 ng, 12 ng, 14 ng, 16 ng, 18 ng, 20 ng, 25 ng, 30 ng, 40 ng, 50 ng, 100 ng, 200 ng, 300 ng, 400 ng, 500 ng, 600 ng, 800 ng, or 1000 ng of target DNA molecules.

Amplification can be performed before or after pooling of target polynucleotides from independent samples.

The methods of the invention involve determining the amount of amplifiable nucleic acid present in a sample. Any known method can be used to quantify amplifiable nucleic acid; an exemplary method is the polymerase chain reaction (PCR), specifically quantitative polymerase chain reaction (qPCR). qPCR is a technique based on the polymerase chain reaction that is used to amplify and simultaneously quantify a target nucleic acid molecule. qPCR allows both detection and quantification (as an absolute number of copies, or as a relative amount when normalized to DNA input or to additional normalizing genes) of a specific sequence in a DNA sample. The procedure follows the general principle of the polymerase chain reaction, with the additional feature that the amplified DNA is quantified in real time as it accumulates in the reaction after each amplification cycle. qPCR is described, for example, in Kurnit et al. (U.S. Patent No. 6,033,854), Wang et al. (U.S. Patent Nos. 5,567,583 and 5,348,853), Ma et al. (The Journal of American Science, 2(3), 2006), Heid et al. (Genome Research 986-994, 1996), Sambrook and Russell (Quantitative PCR, Cold Spring Harbor Protocols, 2006), and Higuchi (U.S. Patent Nos. 6,171,785 and 5,994,056). The contents of these are incorporated herein by reference in their entirety.

Other quantification methods include the use of fluorescent dyes that intercalate into double-stranded DNA, and modified DNA oligonucleotide probes that fluoresce when hybridized to complementary DNA. These methods can be used broadly but are also specifically suited to real-time PCR, which is described in further detail as an example. In the first method, a DNA-binding dye binds to all double-stranded (ds) DNA in the PCR, causing the dye to fluoresce. An increase in DNA product during PCR therefore leads to an increase in fluorescence intensity, which is measured at each cycle, allowing DNA concentrations to be quantified. The reaction is prepared like a standard PCR reaction, with the addition of a fluorescent (ds)DNA dye. The reaction is run in a thermal cycler, and after each cycle the level of fluorescence is measured with a detector; the dye fluoresces only when bound to (ds)DNA (i.e., the PCR product). With reference to a standard dilution, the (ds)DNA concentration in the PCR can be determined. As with other real-time PCR methods, the values obtained do not have absolute units associated with them. Comparing a measured DNA/RNA sample to a standard dilution gives a fraction or ratio of the sample relative to the standard, allowing relative comparisons between different tissues or experimental conditions. To ensure accuracy in the quantification and/or expression of a target gene, it can be normalized to a stably expressed gene. Copy numbers of unknown genes can similarly be normalized to genes of known copy number.

The second method uses a sequence-specific RNA- or DNA-based probe to quantify only the DNA containing the probe sequence; use of the reporter probe therefore significantly increases specificity and allows quantification even in the presence of some non-specific DNA amplification. This allows for multiplexing, i.e., assaying several genes in the same reaction by using specific probes with differently colored labels, provided that all genes are amplified with similar efficiency.

This method is commonly carried out with a DNA-based probe that has a fluorescent reporter (e.g., 6-carboxyfluorescein) at one end and a fluorescence quencher (e.g., 6-carboxy-tetramethylrhodamine) at the opposite end. The close proximity of the reporter to the quencher prevents detection of its fluorescence. Breakdown of the probe by the 5' to 3' exonuclease activity of a polymerase (e.g., Taq polymerase) breaks the reporter-quencher proximity and thus allows unquenched emission of fluorescence, which can be detected. An increase in the product targeted by the reporter probe at each PCR cycle results in a proportional increase in fluorescence, due to breakdown of the probe and release of the reporter. The reaction is prepared like a standard PCR reaction, with the addition of the reporter probe. As the reaction commences, both probe and primers anneal to the DNA target during the annealing stage of the PCR. Polymerization of a new DNA strand is initiated from the primers, and once the polymerase reaches the probe, its 5'-3' exonuclease activity degrades the probe, physically separating the fluorescent reporter from the quencher and resulting in an increase in fluorescence. Fluorescence is detected and measured in a real-time PCR thermal cycler, and the geometric increase in fluorescence corresponding to the exponential increase of the product is used to determine the threshold cycle in each reaction.

Relative concentrations of DNA present during the exponential phase of the reaction are determined by plotting fluorescence against cycle number on a logarithmic scale (so that an exponentially increasing quantity gives a straight line). A threshold for detection of fluorescence above background is determined. The cycle at which the fluorescence from a sample crosses the threshold is called the threshold cycle, Ct. Because the quantity of DNA doubles every cycle during the exponential phase, relative amounts of DNA can be calculated; for example, a sample whose Ct is 3 cycles earlier than that of another sample has 2^3 = 8 times more template. Amounts of nucleic acid (e.g., RNA or DNA) are then determined by comparing the results to a standard curve produced by real-time PCR of serial dilutions (e.g., undiluted, 1:4, 1:16, 1:64) of a known amount of nucleic acid.
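
The Ct arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming ideal doubling per cycle (an efficiency factor of 2) and a simple least-squares line of Ct against log10 of quantity for the standard curve; the function names and the Ct values in the example are hypothetical and are not taken from this description.

```python
import math


def fold_difference(ct_sample, ct_other, efficiency=2.0):
    """Template amount of `sample` relative to `other`.

    Assumes the product multiplies by `efficiency` every cycle during the
    exponential phase, so an earlier Ct means more starting template.
    """
    return efficiency ** (ct_other - ct_sample)


def fit_standard_curve(quantities, cts):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the starting quantity."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical standards: undiluted, 1:4, 1:16, 1:64 of a known amount,
# with idealized Ct values two cycles apart (one 4-fold dilution = 2 doublings).
standard_quantities = [1.0, 1 / 4, 1 / 16, 1 / 64]
standard_cts = [20.0, 22.0, 24.0, 26.0]
slope, intercept = fit_standard_curve(standard_quantities, standard_cts)
```

Under these idealized standards, a sample with a Ct of 23 would be estimated at one eighth of the undiluted standard. Real reactions rarely amplify with perfect efficiency, which is why the slope is fitted rather than assumed.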

In certain embodiments, the qPCR reaction involves a dual-fluorophore approach that takes advantage of fluorescence resonance energy transfer (FRET), e.g., LIGHTCYCLER hybridization probes, in which two oligonucleotide probes anneal to the amplicon (see, e.g., U.S. Patent No. 6,174,670). The oligonucleotides are designed to hybridize in a head-to-tail orientation, with the fluorophores separated by a distance compatible with efficient energy transfer. Other examples of labeled oligonucleotides that are structured to emit a signal when bound to a nucleic acid or incorporated into an extension product include: SCORPIONS probes (e.g., Whitcombe et al., Nature Biotechnology 17:804-807, 1999, and U.S. Patent No. 6,326,145), Sunrise (or AMPLIFLUOR) primers (e.g., Nazarenko et al., Nuc. Acids Res. 25:2516-2521, 1997, and U.S. Patent No. 6,117,635), and LUX primers and MOLECULAR BEACONS probes (e.g., Tyagi et al., Nature Biotechnology 14:303-308, 1996, and U.S. Patent No. 5,989,823).

In other embodiments, the qPCR reaction uses the fluorescent Taqman methodology and an instrument capable of measuring fluorescence in real time (e.g., an ABI Prism 7700 Sequence Detector). The Taqman reaction uses a hybridization probe labeled with two different fluorescent dyes. One dye is a reporter dye (6-carboxyfluorescein), and the other is a quenching dye (6-carboxy-tetramethylrhodamine). When the probe is intact, fluorescence energy transfer occurs and the fluorescent emission of the reporter dye is absorbed by the quenching dye. During the extension phase of the PCR cycle, the fluorescent hybridization probe is cleaved by the 5'-3' nucleolytic activity of the DNA polymerase. Once the probe is cleaved, the reporter dye emission is no longer transferred efficiently to the quenching dye, resulting in an increase in the reporter dye fluorescence emission. Any nucleic acid quantification method, including real-time methods or single-point detection methods, can be used to quantify the amount of nucleic acid in a sample. Detection can be performed by several different methodologies (e.g., staining; hybridization with a labeled probe; incorporation of biotinylated primers followed by avidin-enzyme conjugate detection; incorporation of 32P-labeled deoxynucleotide triphosphates, such as dCTP or dATP, into the amplified segment), as well as by any other suitable detection method known in the art for nucleic acid quantification. Quantification may or may not include an amplification step.

In some embodiments, the invention provides labels for identifying and quantifying ligated DNA fragments. In some cases, the ligated DNA fragments can be labeled to assist in downstream applications, such as array hybridization. For example, the ligated DNA fragments can be labeled using random priming or nick translation.

A wide variety of labels (e.g., reporters) can be used to label the nucleotide sequences described herein, including but not limited to during the amplification step. Suitable labels include radionuclides, enzymes, fluorescent, chemiluminescent, or chromogenic agents, as well as ligands, cofactors, inhibitors, magnetic particles, and the like. Examples of such labels are given in U.S. Patent No. 3,817,837; U.S. Patent No. 3,850,752; U.S. Patent No. 3,939,350; U.S. Patent No. 3,996,345; U.S. Patent No. 4,277,437; U.S. Patent No. 4,275,149; and U.S. Patent No. 4,366,241, which are incorporated by reference in their entirety.

Additional labels include, but are not limited to, β-galactosidase, invertase, green fluorescent protein, luciferase, chloramphenicol acetyltransferase, β-glucuronidase, exo-glucanase, and glucoamylase. Fluorescent labels can also be used, as well as fluorescent reagents specifically synthesized with particular chemical properties. Fluorescence can be measured in a variety of ways. For example, some fluorescent labels exhibit a change in excitation or emission spectra, some exhibit resonance energy transfer in which one fluorescent reporter loses fluorescence while a second gains fluorescence, some exhibit a loss (quenching) or appearance of fluorescence, and some report rotational movements.

Further, to obtain sufficient material for labeling, multiple amplifications can be pooled instead of increasing the number of amplification cycles per reaction. Alternatively, labeled nucleotides can be incorporated in the last several cycles of the amplification reaction, e.g., 30 cycles of PCR (no label) + 10 cycles of PCR (plus label).

In particular embodiments, the invention provides probes that can attach to the ligated DNA fragments. As used herein, the term "probe" refers to a molecule (e.g., an oligonucleotide, whether occurring naturally as in a purified restriction digest or produced synthetically, recombinantly, or by PCR amplification) that is capable of hybridizing to another molecule of interest (e.g., another oligonucleotide). When probes are oligonucleotides, they may be single-stranded or double-stranded. Probes are useful in the detection, identification, and isolation of particular targets (e.g., gene sequences). In some cases, a probe can be associated with a label so that it is detectable in any detection system, including but not limited to enzymatic (e.g., ELISA, as well as enzyme-based histochemical assays), fluorescent, radioactive, and luminescent systems.

With respect to arrays and microarrays, the term "probe" is used to refer to any hybridizable material that is affixed to the array for the purpose of detecting a nucleotide sequence that has hybridized to the probe. In some cases, the probes can be about 10 bp to 500 bp, about 10 bp to 250 bp, about 20 bp to 250 bp, about 20 bp to 200 bp, about 25 bp to 200 bp, about 25 bp to 100 bp, about 30 bp to 100 bp, or about 30 bp to 80 bp. In some cases, the probes can be greater than about 10 bp, about 20 bp, about 30 bp, about 40 bp, about 50 bp, about 60 bp, about 70 bp, about 80 bp, about 90 bp, about 100 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 400 bp, or about 500 bp in length. For example, the probes can be about 20 to about 50 bp in length. Examples of and rationale for probe design can be found in WO95/11995, EP 717,113, and WO97/29212.

In some cases, one or more probes can be designed so that they hybridize close to the sites that are digested by a restriction endonuclease. For example, the probes can be within about 10 bp, about 20 bp, about 30 bp, about 40 bp, about 50 bp, about 60 bp, about 70 bp, about 80 bp, about 90 bp, about 100 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 400 bp, or about 500 bp of the restriction endonuclease recognition site.

In other cases, a single, unique probe can be designed within about 10 bp, about 20 bp, about 30 bp, about 40 bp, about 50 bp, about 60 bp, about 70 bp, about 80 bp, about 90 bp, about 100 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 400 bp, or about 500 bp on each side of the site digested by the restriction endonuclease. The probes are designed so that they can hybridize on either side of the restriction endonuclease digestion site. For example, a single probe on each side of the primary restriction enzyme recognition site can be used.

In further cases, 2, 3, 4, 5, 6, 7, 8, or more probes can be designed on each side of the restriction endonuclease recognition site, and these can then be used to investigate the same ligation event. For example, 2 or 3 probes can be designed on each side of the restriction endonuclease recognition site. In some examples, multiple probes (2, 3, 4, 5, 6, 7, 8, or more) can be used per primary restriction enzyme recognition site, which can help reduce the problem of obtaining false negative results from individual probes.

As used herein, the term "probe set" refers to a suite or collection of probes that can hybridize to one or more of the primary restriction enzyme recognition sites for a primary restriction enzyme in a genome.

In some cases, a set of probes can be complementary in sequence to the nucleic acid sequences adjacent to one or more of the primary restriction enzyme recognition sites for a restriction endonuclease in genomic DNA. For example, the probe set can be complementary in sequence to the about 10 bp to 500 bp, about 10 bp to 250 bp, about 20 bp to 250 bp, about 20 bp to 200 bp, about 25 bp to 200 bp, about 25 bp to 100 bp, about 30 bp to 100 bp, or about 30 bp to 80 bp of nucleotides adjacent to one or more of the primary restriction endonuclease recognition sites in genomic DNA. The probe set may be complementary in sequence to one side (e.g., either side) or to both sides of the restriction endonuclease recognition site. Accordingly, the probes may be complementary in sequence to the nucleic acid sequence adjacent to each side of one or more of the primary restriction endonuclease recognition sites in the genomic DNA. Further, the probe set can be complementary in sequence to nucleic acid sequence that lies less than about 10 bp, about 20 bp, about 30 bp, about 40 bp, about 50 bp, about 60 bp, about 70 bp, about 80 bp, about 90 bp, about 100 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 400 bp, or about 500 bp from one or more of the primary restriction enzyme recognition sites in the genome.
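
One way to picture probes that flank a restriction enzyme recognition site is the sketch below. It is an illustrative assumption rather than a method taken from this description: it places exactly one probe immediately adjacent to each side of every site, uses a hypothetical default probe length of 25 nt, handles only the bases A, C, G, and T, and applies none of the filters a real design would need (uniqueness in the genome, melting temperature, repeats). The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sites(genome, recognition_seq):
    """0-based start positions of every occurrence, overlaps included."""
    positions = []
    start = genome.find(recognition_seq)
    while start != -1:
        positions.append(start)
        start = genome.find(recognition_seq, start + 1)
    return positions


def flanking_probes(genome, recognition_seq, probe_len=25):
    """One candidate probe directly upstream and downstream of each site.

    Returns tuples (site_start, side, window_start, window_end, probe_seq),
    where probe_seq is the reverse complement of the genomic window, so it
    can hybridize to the strand given in `genome`. Windows that would run
    past either end of the sequence are skipped.
    """
    probes = []
    for site in find_sites(genome, recognition_seq):
        site_end = site + len(recognition_seq)
        windows = (
            ("upstream", site - probe_len, site),
            ("downstream", site_end, site_end + probe_len),
        )
        for side, start, end in windows:
            if start < 0 or end > len(genome):
                continue
            probes.append(
                (site, side, start, end, reverse_complement(genome[start:end]))
            )
    return probes


# Toy sequence containing one AAGCTT site (the HindIII recognition sequence).
toy_genome = "ACGTACGTAAGCTTGGCCAATT"
toy_probes = flanking_probes(toy_genome, "AAGCTT", probe_len=8)
```

Designing several probes per side, as the text also contemplates, would amount to emitting additional windows at increasing offsets from the site instead of a single adjacent one.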

In some cases, two or more probes can be designed to be capable of hybridizing to the sequence adjacent to one or more of the restriction endonuclease recognition sites in genomic DNA. The probes may overlap or partially overlap.

The probes, array of probes, or set of probes can be immobilized on a support. Supports (e.g., solid supports) can be made of a variety of materials, such as glass, silica, plastic, nylon, or nitrocellulose. Supports are preferably rigid and have a planar surface. Supports can have from about 1 to 10,000,000 resolved loci. For example, a support can have about 10 to 10,000,000, about 10 to 5,000,000, about 100 to 5,000,000, about 100 to 4,000,000, about 1000 to 4,000,000, about 1000 to 3,000,000, about 10,000 to 3,000,000, about 10,000 to 2,000,000, about 100,000 to 2,000,000, or about 100,000 to 1,000,000 resolved loci. The density of resolved loci can be at least about 10, about 100, about 1000, about 10,000, about 100,000, or about 1,000,000 resolved loci per square centimeter. In some cases, each resolved locus can be occupied by greater than 95% of a single type of oligonucleotide. In other cases, each resolved locus can be occupied by pooled mixtures of probes or by a set of probes. In further cases, some resolved loci are occupied by pooled mixtures of probes or a set of probes, and other resolved loci are occupied by greater than 95% of a single type of oligonucleotide.

In some cases, the number of probes for a given nucleotide sequence on the array can be in large excess relative to the DNA sample to be hybridized to that array. For example, the array can have about 10, about 100, about 1000, about 10,000, about 100,000, about 1,000,000, about 10,000,000, or about 100,000,000 times the number of probes relative to the amount of DNA in the input sample.

In some cases, an array can have about 10, about 100, about 1000, about 10,000, about 100,000, about 1,000,000, about 10,000,000, about 100,000,000, or about 1,000,000,000 probes.

Arrays of probes or sets of probes can be synthesized on a support in a step-by-step manner, or can be attached in presynthesized form. One method of synthesis is VLSIPS™ (as described in U.S. Patent No. 5,143,854 and European Patent EP 476,014), which entails the use of light to direct the synthesis of oligonucleotide probes in high-density, miniaturized arrays. Algorithms for the design of masks that reduce the number of synthesis cycles are described in U.S. Patent No. 5,571,639 and U.S. Patent No. 5,593,839. Arrays can also be synthesized in a combinatorial fashion by delivering monomers to cells of a support through mechanically constrained flow paths, as described in European Patent EP 624,059. Arrays can also be synthesized by spotting reagents onto a support using an inkjet printer (see, e.g., European Patent EP 728,520).

In some embodiments, the invention provides methods for hybridizing the ligated DNA fragments onto an array. A "substrate" or an "array" is an intentionally created collection of nucleic acids that can be prepared either synthetically or biosynthetically and screened for biological activity in a variety of different formats (e.g., libraries of soluble molecules; libraries of oligonucleotides tethered to resin beads, silica chips, or other solid supports). Additionally, the term "array" includes those libraries of nucleic acids that can be prepared by spotting nucleic acids of essentially any length (e.g., from 1 to about 1000 nucleotide monomers in length) onto a substrate.

Array technology and the various associated techniques and applications are described generally in numerous textbooks and documents. These include, for example, Lemieux et al., 1998, Molecular Breeding 4, 277-289; Schena and Davis, Parallel Analysis with Biological Chips, in PCR Methods Manual (eds. M. Innis, D. Gelfand, J. Sninsky); Schena and Davis, 1999, Genes, Genomes and Chips, in DNA Microarrays: A Practical Approach (ed. M. Schena), Oxford University Press, Oxford, UK, 1999; The Chipping Forecast (Nature Genetics special issue; January 1999 Supplement); Mark Schena (ed.), Microarray Biochip Technology (Eaton Publishing Company); Cortes, 2000, The Scientist 14[17]:25; Gwynn and Page, Microarray analysis: the next revolution in molecular biology, Science, August 6, 1999; and Eakins and Chu, 1999, Trends in Biotechnology, 17, 217-218.

In general, any library can be arranged in an orderly manner into an array by spatially separating the members of the library. Examples of suitable libraries for arraying include nucleic acid libraries (including DNA, cDNA, and oligonucleotide libraries, among others), peptide, polypeptide, and protein libraries, as well as libraries comprising any molecules, such as ligand libraries.

The library can be fixed or immobilized onto a solid phase (e.g., a solid substrate) to limit diffusion and admixing of the members. In some cases, libraries of DNA-binding ligands can be prepared. In particular, the libraries can be immobilized to a substantially planar solid phase, including membranes and non-porous substrates such as plastic and glass. Furthermore, the library can be arranged in a way that facilitates indexing (i.e., reference or access to a particular member). In some examples, the members of the library can be applied as spots in a grid formation. Common assay systems can be adapted for this purpose. For example, an array can be immobilized on the surface of a microplate, either with multiple members in a well or with a single member in each well. Furthermore, the solid substrate can be a membrane, such as a nitrocellulose or nylon membrane (for example, membranes used in blotting experiments). Alternative substrates include glass or silica-based substrates. Thus, the library can be immobilized by any suitable method known in the art, for example by charge interactions, or by chemical coupling to the walls or bottom of the wells or to the surface of the membrane. Other means of arranging and fixing can be used, for example pipetting, drop-touch, piezoelectric means, ink-jet and bubble-jet technology, electrostatic application, and the like. In the case of silicon-based chips, photolithography can be used to arrange and fix the libraries on the chip.

The library can be arrayed by being "spotted" onto the solid substrate; this can be done by hand or by using robotics to deposit the members. In general, arrays can be described as macroarrays or microarrays, the difference being the size of the spots. Macroarrays can contain spot sizes of about 300 microns or larger and can be easily imaged with existing gel and blot scanners. The spot sizes in microarrays can be less than 200 microns in diameter, and these arrays usually contain thousands of spots. Microarrays may therefore require specialized robotics and imaging equipment, which may need to be custom made. Instrumentation is described generally in a review by Cortese, 2000, The Scientist 14[11]:26.

Techniques for producing immobilized libraries of DNA molecules have been described in the art. Generally, most prior art methods describe how to synthesize libraries of single-stranded nucleic acid molecules, using, for example, masking techniques to build up various permutations of sequences at discrete positions on the solid substrate. U.S. Patent No. 5,837,832 describes an improved method for producing DNA arrays immobilized on silicon substrates based on very large scale integration technology. In particular, U.S. Patent No. 5,837,832 describes a strategy called "tiling" to synthesize specific sets of probes at spatially defined locations on a substrate, which can be used to produce the immobilized DNA libraries of the present invention. U.S. Patent No. 5,837,832 also provides references to earlier techniques that may also be used. In other cases, arrays can also be built using photo-deposition chemistry.

Arrays of peptides (or peptidomimetics) can also be synthesized on a surface in a manner that places each distinct library member (e.g., a unique peptide sequence) at a discrete, predefined location in the array. The identity of each library member is determined by its spatial location in the array. The locations in the array where binding interactions occur between a predetermined molecule (e.g., a target or probe) and reactive library members are determined, thereby identifying the sequences of the reactive library members on the basis of spatial location. These methods are described in U.S. Patent No. 5,143,854; WO90/15070 and WO92/10092; Fodor et al. (1991) Science, 251:767; and Dower and Fodor (1991) Ann. Rep. Med. Chem., 26:271.
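As an illustration of the spatial addressing described above, the following minimal Python sketch assigns each library member to a grid position and recovers the identity of reactive members from the positions at which binding was observed. The function names, the row-major layout, and the example sequences are hypothetical and are not taken from the cited references.

```python
def build_array(members, n_cols):
    """Assign each library member to a (row, column) position, row by row.

    The row-major layout is an assumption made for illustration only.
    """
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    return {divmod(i, n_cols): member for i, member in enumerate(members)}


def identify_hits(layout, hit_positions):
    """Return the members located at the positions where binding was detected."""
    return [layout[pos] for pos in hit_positions]


# Hypothetical four-member library spotted on a 2 x 2 grid.
layout = build_array(["AAA", "CCC", "GGG", "TTT"], n_cols=2)
hits = identify_hits(layout, [(0, 1), (1, 1)])
```

Because identity is encoded only by position, no sequence information needs to be read from the spot itself.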

To aid detection, labels can be used (as discussed above), for example any readily detectable reporter, such as a fluorescent, bioluminescent, phosphorescent, or radioactive reporter. Such reporters, their detection, their coupling to targets/probes, and the like are discussed elsewhere in this document. Labeling of probes and targets is also disclosed in Shalon et al., 1996, Genome Res 6(7):639-45.

Examples of some commercially available microarray formats are set out in Table 1 below (see also Marshall and Hodgson, 1998, Nature Biotechnology, 16(1), 27-31).

Table 1

Table 1 (continued)

To generate data from array-based assays, a signal can be detected that indicates the presence or absence of hybridization between a probe and a nucleotide sequence. Direct or indirect labeling techniques can also be used. For example, direct labeling incorporates fluorescent dyes directly into the nucleotide sequences that hybridize to the array-associated probes (e.g., dyes are incorporated into the nucleotide sequence by enzymatic synthesis in the presence of labeled nucleotides or PCR primers). Direct labeling schemes can yield strong hybridization signals, for example by using families of fluorescent dyes with similar chemical structures and characteristics, and can be simple to implement. In cases comprising direct labeling of nucleic acids, cyanine or Alexa analogs can be used in multiple-fluor comparative array analyses. In other embodiments, indirect labeling schemes can be used to incorporate epitopes into the nucleic acids either before or after hybridization to the microarray probes. One or more staining procedures and reagents can be used to label the hybridized complex (e.g., a fluorescent molecule that binds to the epitopes, thereby providing a fluorescent signal by virtue of the conjugation of the dye molecule to the epitope of the hybridized species).

In various embodiments, suitable sequencing methods described herein or otherwise known in the art are used to obtain sequence information from nucleic acid molecules within a sample. Sequencing can be accomplished through classic Sanger sequencing methods, which are known in the art. Sequencing can also be accomplished using high-throughput systems, some of which allow a sequenced nucleotide to be detected immediately after or upon its incorporation into a growing strand, i.e., detection of sequence in real time or substantially real time. In some cases, high-throughput sequencing generates at least 1,000, at least 5,000, at least 10,000, at least 20,000, at least 30,000, at least 40,000, at least 50,000, at least 100,000, or at least 500,000 sequence reads per hour, where the sequencing reads can be at least about 50, about 60, about 70, about 80, about 90, about 100, about 120, about 150, about 180, about 210, about 240, about 270, about 300, about 350, about 400, about 450, about 500, about 600, about 700, about 800, about 900, or about 1000 bases per read.

In some embodiments, high-throughput sequencing involves the use of technology available from Illumina's Genome Analyzer IIx, MiSeq personal sequencer, or HiSeq systems, such as those using the HiSeq 2500, HiSeq 1500, HiSeq 2000, or HiSeq 1000 machines. These machines use reversible terminator-based sequencing-by-synthesis chemistry. These machines can complete 200 billion DNA reads or more in eight days. Smaller systems can be used for runs of 3 days, 2 days, 1 day, or less.
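The throughput figures quoted in this section can be put on a common per-hour basis with simple arithmetic. The sketch below is illustrative only; the function names are hypothetical, and the inputs are the figure stated above (200 billion reads in eight days) together with an assumed read length of 100 bases chosen from the ranges listed earlier.

```python
def reads_per_hour(total_reads, run_days):
    """Average number of reads produced per hour over a run of run_days days."""
    if run_days <= 0:
        raise ValueError("run_days must be positive")
    return total_reads / (run_days * 24)


def bases_per_hour(reads_hr, read_length):
    """Bases produced per hour, given reads per hour and bases per read."""
    return reads_hr * read_length


# 200 billion reads over an eight-day run, as stated in the text.
rate = reads_per_hour(200e9, 8)
# Assumed read length of 100 bases (illustrative choice, not a platform specification).
base_rate = bases_per_hour(rate, 100)
```

On these assumptions the average rate is roughly one billion reads per hour, well above the "at least 500,000 sequence reads per hour" threshold given above.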

In some embodiments, high-throughput sequencing involves the use of technology available from the ABI SOLiD System. This genetic analysis platform enables massively parallel sequencing of clonally amplified DNA fragments linked to beads. The sequencing methodology is based on sequential ligation with dye-labeled oligonucleotides.

Next-generation sequencing can comprise ion semiconductor sequencing (e.g., using technology from Life Technologies (Ion Torrent)). Ion semiconductor sequencing takes advantage of the fact that an ion can be released when a nucleotide is incorporated into a strand of DNA. To perform ion semiconductor sequencing, a high-density array of micro-machined wells can be formed. Each well can hold a single DNA template. Beneath the well can be an ion-sensitive layer, and beneath the ion-sensitive layer can be an ion sensor. When a nucleotide is added to the DNA, H+ can be released, which can be measured as a change in pH. The H+ ion can be converted to a voltage and recorded by the semiconductor sensor. The array chip can be flooded sequentially with one nucleotide after another. No scanning, light, or cameras are required. In some cases, an Ion Proton™ sequencer is used to sequence nucleic acid. In some cases, an Ion PGM™ sequencer, the Ion Torrent Personal Genome Machine (PGM), is used. The PGM can complete 10 million reads in two hours.
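The sequential flooding of the chip with one nucleotide at a time can be sketched as follows. This is a simplified model under stated assumptions, not a description of any instrument's software: the flow order "TACG" is arbitrary, the signal for each flow is taken to be exactly the number of bases incorporated during that flow (so a run of identical bases yields a proportionally larger signal), and noise is ignored. The function names are hypothetical.

```python
from itertools import cycle


def simulate_flows(synthesized, flow_order="TACG"):
    """Return (nucleotide, incorporation count) for each flow.

    `synthesized` is the strand being built, read 5' to 3'. A flow whose
    nucleotide does not match the next base incorporates nothing (count 0).
    """
    if any(base not in flow_order for base in synthesized):
        raise ValueError("sequence contains a base absent from the flow order")
    signals = []
    pos = 0
    flows = cycle(flow_order)
    while pos < len(synthesized):
        base = next(flows)
        run = 0
        # Incorporate every consecutive matching base during this single flow.
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
    return signals


def call_bases(signals):
    """Reconstruct the sequence from idealized per-flow incorporation counts."""
    return "".join(base * run for base, run in signals)


signals = simulate_flows("TTACGG")
decoded = call_bases(signals)
```

In this idealized model the flow signals determine the sequence exactly; in practice the measured voltage is noisy, which is why long runs of identical bases are harder to call.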

In some embodiments, high-throughput sequencing involves the use of technology available from Helicos BioSciences Corporation (Cambridge, Massachusetts), such as the Single Molecule Sequencing by Synthesis (SMSS) method. SMSS is unique because it allows the entire human genome to be sequenced in up to 24 hours. SMSS is described in part in U.S. Published Application Nos. 20060024711; 20060024678; 20060012793; 20060012784; and 20050100932.

In some embodiments, high-throughput sequencing involves the use of technology available from 454 Lifesciences, Inc. (Branford, Connecticut), such as the PicoTiterPlate device, which includes a fiber optic plate that transmits the chemiluminescent signal generated by the sequencing reaction to be recorded by a CCD camera in the instrument. This use of fiber optics allows for the detection of a minimum of 20 million base pairs in 4.5 hours.

Methods for using bead amplification followed by fiber optic detection are described in Margulies, M. et al., "Genome sequencing in microfabricated high-density picolitre reactors", Nature, doi:10.1038/nature03959; and in U.S. Published Application Nos. 20020012930; 20030068629; 20030100102; 20030148344; 20040248161; 20050079510; 20050124022; and 20060078909.

In some embodiments, high-throughput sequencing is performed using a Clonal Single Molecule Array (Solexa, Inc.) or sequencing-by-synthesis (SBS) utilizing reversible terminator chemistry. These technologies are described in part in U.S. Patent Nos. 6,969,488; 6,897,023; 6,833,246; 6,787,308; U.S. Published Application Nos. 20040106110; 20030064398; 20030022207; and Constans, A., The Scientist 2003, 17(13):36.

Next-generation sequencing technology can comprise the Single Molecule Real Time (SMRT™) technology of Pacific Biosciences. In SMRT, each of the four DNA bases can be attached to one of four different fluorescent dyes. These dyes can be phospholinked. A single DNA polymerase can be immobilized with a single molecule of template single-stranded DNA at the bottom of a zero-mode waveguide (ZMW). A ZMW can be a confinement structure that enables observation of the incorporation of a single nucleotide by DNA polymerase against a background of fluorescent nucleotides that rapidly diffuse in and out of the ZMW (in microseconds). It can take several milliseconds to incorporate a nucleotide into a growing strand. During this time, the fluorescent label can be excited and produce a fluorescent signal, and the fluorescent tag can be cleaved off. The ZMW can be illuminated from below. Attenuated light from an excitation beam can penetrate the lower 20-30 nm of each ZMW. A microscope with a detection limit of 20 zeptoliters (10^-21 liters) can be created. The tiny detection volume can provide a 1000-fold improvement in the reduction of background noise. Detection of the corresponding fluorescence of the dye can indicate which base was incorporated. The process can be repeated.

In some cases, the next-generation sequencing is nanopore sequencing (see, e.g., Soni GV and Meller A. (2007) Clin Chem 53:1996-2001). A nanopore can be a small hole, on the order of about one nanometer in diameter. Immersion of a nanopore in a conducting fluid and application of a potential across it can result in a slight electrical current due to conduction of ions through the nanopore. The amount of current that flows is sensitive to the size of the nanopore. As a DNA molecule passes through a nanopore, each nucleotide on the DNA molecule can obstruct the nanopore to a different degree. Thus, the change in the current passing through the nanopore as the DNA molecule passes through it can represent a reading of the DNA sequence. The nanopore sequencing technology can be from Oxford Nanopore Technologies, e.g., a GridION system. A single nanopore can be inserted in a polymer membrane across the top of a microwell. Each microwell can have an electrode for individual sensing. The microwells can be fabricated into an array chip, with 100,000 or more microwells per chip (e.g., more than 200,000, 300,000, 400,000, 500,000, 600,000, 700,000, 800,000, 900,000, or 1,000,000). An instrument (or node) can be used to analyze the chip. Data can be analyzed in real time. One or more instruments can be operated at a time. The nanopore can be a protein nanopore, e.g., the protein alpha-hemolysin, a heptameric protein pore. The nanopore can be a solid-state nanopore, e.g., a nanometer-sized hole formed in a synthetic membrane (e.g., SiNx or SiO2). The nanopore can be a hybrid pore (e.g., a protein pore integrated into a solid-state membrane). The nanopore can be a nanopore with an integrated sensor (e.g., tunneling electrode detectors, capacitive detectors, or graphene-based nanogap or edge state detectors; see, e.g., Garaj et al. (2010) Nature vol. 467, doi:10.1038/nature09379). A nanopore can be functionalized for analyzing a specific type of molecule (e.g., DNA, RNA, or protein). Nanopore sequencing can comprise "strand sequencing", in which intact DNA polymers are passed through a protein nanopore and sequenced in real time as the DNA translocates through the pore. An enzyme can separate the strands of double-stranded DNA and feed a strand through the nanopore. The DNA can have a hairpin at one end, and the system can read both strands. In some cases, nanopore sequencing is "exonuclease sequencing", in which individual nucleotides are cleaved from a DNA strand by a processive exonuclease and the nucleotides are passed through a protein nanopore. The nucleotides can transiently bind to a molecule in the pore (e.g., cyclodextrin). A characteristic disruption in current can be used to identify bases.
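The idea that each nucleotide obstructs the pore to a different degree, so that the current trace encodes the sequence, can be illustrated with a toy base caller. This is a deliberately simplified sketch: the reference current levels are hypothetical values chosen for illustration, each event is assumed to correspond to exactly one nucleotide, and the trace is assumed to be already segmented into per-event mean currents. Real signals are noisier and depend on several neighbouring bases at once.

```python
# Hypothetical mean blockade currents (picoamperes) for each base; illustrative only.
REFERENCE_LEVELS_PA = {"A": 50.0, "C": 40.0, "G": 60.0, "T": 30.0}


def call_base(mean_current_pa, levels=REFERENCE_LEVELS_PA):
    """Return the base whose reference level is closest to the observed current."""
    return min(levels, key=lambda base: abs(levels[base] - mean_current_pa))


def call_events(event_currents_pa, levels=REFERENCE_LEVELS_PA):
    """Translate a series of per-event mean currents into a base sequence."""
    return "".join(call_base(current, levels) for current in event_currents_pa)


# Four hypothetical events, each near one of the reference levels.
read = call_events([52.0, 40.5, 61.2, 30.1])
```

Nearest-level classification is the simplest possible decoder; it is used here only to make the mapping from current to sequence concrete.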

Nanopore sequencing technology from GENIA can be used. An engineered protein pore can be embedded in a lipid bilayer membrane. "Active Control" technology can be used to enable efficient nanopore-membrane assembly and control of DNA movement through the channel. In some cases, the nanopore sequencing technology is from NABsys. Genomic DNA can be fragmented into strands with an average length of about 100 kb. The 100 kb fragments can be made single-stranded and subsequently hybridized with a 6-mer probe. The genomic fragments with probes can be driven through a nanopore, which can create a current-versus-time tracing. The current tracing can provide the positions of the probes on each genomic fragment. The genomic fragments can be lined up to create a probe map for the genome. The process can be carried out in parallel for a library of probes. A genome-length probe map can be generated for each probe. Errors can be fixed with a process termed "moving window sequencing by hybridization (mwSBH)". In some cases, the nanopore sequencing technology is from IBM/Roche. An electron beam can be used to make a nanopore-sized opening in a microchip. An electrical field can be used to pull or thread DNA through the nanopore. A DNA transistor device in the nanopore can comprise alternating nanometer-sized layers of metal and dielectric. Discrete charges in the DNA backbone can be trapped by electrical fields inside the DNA nanopore. Turning the gate voltages on and off can allow the DNA sequence to be read.
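The probe-map step described above can be sketched in Python. The sketch makes several simplifying assumptions that are not in the source: probe binding is modelled as an exact string match against the fragment sequence (strand complementarity and mismatches are ignored), each fragment's offset in the genome is assumed to be known already, and the function names are hypothetical.

```python
def probe_positions(fragment, probe):
    """Return every start index at which the probe matches the fragment.

    Overlapping matches are reported, since a short probe can bind at
    overlapping sites.
    """
    if not probe:
        raise ValueError("probe must be non-empty")
    positions = []
    start = fragment.find(probe)
    while start != -1:
        positions.append(start)
        start = fragment.find(probe, start + 1)
    return positions


def genome_probe_map(fragments, probe):
    """Merge per-fragment probe positions into genome coordinates.

    `fragments` is a list of (genome_offset, sequence) pairs; the offsets
    are assumed known here, whereas in practice they would come from
    aligning the fragments against one another.
    """
    sites = set()
    for offset, sequence in fragments:
        sites.update(offset + pos for pos in probe_positions(sequence, probe))
    return sorted(sites)


# Two hypothetical overlapping fragments probed with a 6-mer.
probe_map = genome_probe_map([(0, "ACGTACGTAC"), (4, "ACGTACTT")], "ACGTAC")
```

Repeating this for every probe in a library would give one genome-length map per probe, as the text describes.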

Next-generation sequencing can comprise DNA nanoball sequencing (as performed, e.g., by Complete Genomics; see, e.g., Drmanac et al. (2010) Science 327:78-81). DNA can be isolated, fragmented, and size selected. For example, DNA can be fragmented (e.g., by sonication) to a mean length of about 500 bp. Adapters (Ad1) can be attached to the ends of the fragments. The adapters can be used to hybridize to anchors for sequencing reactions. DNA with adapters bound to each end can be PCR amplified. The adapter sequences can be modified so that complementary single-strand ends bind to each other, forming circular DNA. The DNA can be methylated to protect it from cleavage by a type IIS restriction enzyme used in a subsequent step. An adapter (e.g., the right adapter) can have a restriction recognition site, and the restriction recognition site can remain non-methylated. The non-methylated restriction recognition site in the adapter can be recognized by a restriction enzyme (e.g., AcuI), and the DNA can be cleaved by AcuI 13 bp to the right of the right adapter to form linear double-stranded DNA. A second round of right and left adapters (Ad2) can be ligated onto either end of the linear DNA, and all DNA with both adapters bound can be PCR amplified. Ad2 sequences can be modified to allow them to bind to each other and form circular DNA. The DNA can be methylated, but a restriction enzyme recognition site can remain non-methylated on the left Ad1 adapter. A restriction enzyme (e.g., AcuI) can be applied, and the DNA can be cleaved 13 bp to the left of Ad1 to form a linear DNA fragment. A third round of right and left adapters (Ad3) can be ligated to the right and left flanks of the linear DNA, and the resulting fragment can be PCR amplified. The adapters can be modified so that they bind to each other and form circular DNA. A type III restriction enzyme (e.g., EcoP15) can be added; EcoP15 can cleave the DNA 26 bp to the left of Ad3 and 26 bp to the right of Ad2. This cleavage can remove a large segment of DNA and linearize the DNA once again. A fourth round of right and left adapters (Ad4) can be ligated to the DNA, and the DNA can be amplified (e.g., by PCR) and modified so that the adapters bind to each other and form the completed circular DNA template.

Rolling circle replication (e.g., using Phi 29 DNA polymerase) can be used to amplify small fragments of DNA. The four adapter sequences can contain palindromic sequences that can hybridize, and a single strand can fold onto itself to form a DNA nanoball (DNB), which can be approximately 200-300 nanometers in diameter on average. A DNA nanoball can be attached (e.g., by adsorption) to a microarray (sequencing flow cell). The flow cell can be a silicon wafer coated with silicon dioxide, titanium, hexamethyldisilazane (HMDS), and a photoresist material. Sequencing can be performed by unchained sequencing, by ligating fluorescent probes to the DNA. The color of the fluorescence at an interrogated position can be visualized by a high-resolution camera. The identity of the nucleotide sequences between the adapter sequences can be determined.

In some embodiments, high-throughput sequencing can be performed using AnyDot.chips (Genovoxx, Germany). In particular, AnyDot.chips allow a 10-fold to 50-fold enhancement of nucleotide fluorescence signal detection. AnyDot.chips and methods for using them are described in part in International Published Application Nos. WO 02088382, WO 03020968, WO 03031947, WO 2005044836, PCT/EP 05/05657, PCT/EP 05/05655; and German Patent Application Nos. DE 101 49 786, DE 102 14 395, DE 103 56 837, DE 10 2004 009 704, DE 10 2004 025 696, DE 10 2004 025 746, DE 10 2004 025 694, DE 10 2004 025 695, DE 10 2004 025 744, DE 10 2004 025 745, and DE 10 2005 012 301.

Other high-throughput sequencing systems include those disclosed in Venter, J., et al., Science, 16 February 2001; Adams, M. et al., Science, 24 March 2000; and M. J. Levene et al., Science 299:682-686, January 2003; as well as U.S. Published Application Nos. 20030044781 and 2006/0078937. Overall, such systems involve sequencing a target nucleic acid molecule having a plurality of bases by the temporal addition of bases via a polymerization reaction that is measured on a molecule of nucleic acid, i.e., the activity of a nucleic acid polymerizing enzyme on the template nucleic acid molecule to be sequenced is followed in real time. The sequence can then be deduced by identifying which base is being incorporated into the growing complementary strand of the target nucleic acid by the catalytic activity of the nucleic acid polymerizing enzyme at each step in the sequence of base additions. A polymerase on the target nucleic acid molecule complex is provided in a position suitable to move along the target nucleic acid molecule and extend the oligonucleotide primer at an active site. A plurality of labeled types of nucleotide analogs are provided proximate to the active site, with each distinguishable type of nucleotide analog being complementary to a different nucleotide in the target nucleic acid sequence. The growing nucleic acid strand is extended by using the polymerase to add a nucleotide analog to the nucleic acid strand at the active site, where the nucleotide analog being added is complementary to the nucleotide of the target nucleic acid at the active site. The nucleotide analog added to the oligonucleotide primer as a result of the polymerizing step is identified. The steps of providing labeled nucleotide analogs, polymerizing the growing nucleic acid strand, and identifying the added nucleotide analog are repeated so that the nucleic acid strand is further extended and the sequence of the target nucleic acid is determined.

In particular embodiments, the present invention further provides kits comprising one or more components of the invention. The kits can be used for any application apparent to those skilled in the art, including those described above. The kits can comprise, for example, a plurality of association molecules, a fixative agent, a restriction endonuclease, a ligase, and/or a combination thereof. In some cases, the association molecules can be proteins, including, for example, histones. In some cases, the fixative agent can be formaldehyde or any other DNA crosslinking agent.

In some cases, the kit can further comprise a plurality of beads. The beads can be paramagnetic and/or coated with a capturing agent. For example, the beads can be coated with streptavidin and/or an antibody.

In some cases, the kit can comprise adapter oligonucleotides and/or sequencing primers. Further, the kit can comprise a device capable of amplifying the read pairs using the adapter oligonucleotides and/or sequencing primers.

In some cases, the kit can also comprise other reagents, including but not limited to lysis buffers, ligation reagents (e.g., dNTPs, polymerase, polynucleotide kinase, and/or ligase buffer), and PCR reagents (e.g., dNTPs, polymerase, and/or PCR buffer).

The kit can also include instructions for using the components of the kit and/or for generating the read pairs.

The computer system 500 illustrated in FIG. 8 may be understood as a logical apparatus that can read instructions from media 511 and/or a network port 505, which can optionally be connected to a server 509 having fixed media. The system, such as shown in FIG. 8, can include a CPU 501, disk drives 503, optional input devices such as a keyboard 515 and/or mouse 516, and an optional monitor 507. Data communication with a server at a local or remote location can be achieved through the indicated communication medium. The communication medium can include any means of transmitting and/or receiving data. For example, the communication medium can be a network connection, a wireless connection, or an Internet connection. Such a connection can provide for communication over the World Wide Web. It is envisioned that data relating to the present invention can be transmitted over such networks or connections for reception and/or review by a party, as illustrated in FIG. 8.

FIG. 9 is a block diagram illustrating a first example architecture of a computer system 100 that can be used in connection with example embodiments of the present invention. As depicted in FIG. 9, the example computer system can include a processor 102 for processing instructions. Non-limiting examples of processors include the Intel Xeon™ processor, AMD Opteron™ processor, Samsung 32-bit RISC ARM 1176JZ(F)-S v1.0™ processor, ARM Cortex-A8 Samsung S5PC100™ processor, ARM Cortex-A8 Apple A4™ processor, Marvell PXA 930™ processor, or a functionally equivalent processor. Multiple threads of execution can be used for parallel processing. In some embodiments, multiple processors or processors with multiple cores can also be used, whether in a single computer system, in a cluster, or distributed across systems over a network comprising a plurality of computers, cell phones, and/or personal data assistant devices.

As illustrated in FIG. 9, a high-speed cache 104 can be connected to, or incorporated in, the processor 102 to provide high-speed memory for instructions or data that have been recently, or are frequently, used by the processor 102. The processor 102 is connected to a north bridge 106 by a processor bus 108. The north bridge 106 is connected to random access memory (RAM) 110 by a memory bus 112 and manages access to the RAM 110 by the processor 102. The north bridge 106 is also connected to a south bridge 114 by a chipset bus 116. The south bridge 114 is, in turn, connected to a peripheral bus 118. The peripheral bus can be, for example, PCI, PCI-X, PCI Express, or another peripheral bus. The north bridge and south bridge are often referred to as a processor chipset and manage data transfer between the processor, RAM, and peripheral components on the peripheral bus 118. In some alternative architectures, the functionality of the north bridge can be incorporated into the processor instead of using a separate north bridge chip.

In some embodiments, the system 100 can include an accelerator card 122 attached to the peripheral bus 118. The accelerator can include field programmable gate arrays (FPGAs) or other hardware for accelerating certain processing. For example, an accelerator can be used for adaptive data restructuring or to evaluate algebraic expressions used in extended set processing.

Software and data are stored in external storage 124 and can be loaded into RAM 110 and/or cache 104 for use by the processor. The system 100 includes an operating system for managing system resources; non-limiting examples of operating systems include Linux, Windows™, MacOS™, BlackBerry OS™, iOS™, and other functionally equivalent operating systems, as well as application software running on top of the operating system for managing data storage and optimization in accordance with example embodiments of the present invention.

In this example, the system 100 also includes network interface cards (NICs) 120 and 121 connected to the peripheral bus. These provide network interfaces to external storage, such as network attached storage (NAS), and to other computer systems that can be used for distributed parallel processing.

FIG. 10 is a diagram showing a network 200 with a plurality of computer systems 202a and 202b, a plurality of cell phones and personal data assistants 202c, and network attached storage (NAS) 204a and 204b. In example embodiments, systems 202a, 202b, and 202c can manage data storage and optimize data access for data stored in the NAS 204a and 204b. A mathematical model can be used for the data and evaluated using distributed parallel processing across computer systems 202a and 202b and cell phone and personal data assistant systems 202c. Computer systems 202a and 202b and cell phone and personal data assistant systems 202c can also provide parallel processing for adaptive data restructuring of the data stored in the NAS 204a and 204b. FIG. 10 illustrates an example only, and a wide variety of other computer architectures and systems can be used in conjunction with the various embodiments of the present invention. For example, a blade server can be used to provide parallel processing. Processor blades can be connected through a backplane to provide parallel processing. Storage can also be connected to the backplane, or attached as NAS through a separate network interface.

In some example embodiments, processors can maintain separate memory spaces and transmit data through network interfaces, a backplane, or other connectors for parallel processing by other processors. In other embodiments, some or all of the processors can use a shared virtual address memory space.

FIG. 11 is a block diagram of a multiprocessor computer system 300 using a shared virtual address memory space in accordance with an example embodiment. The system includes a plurality of processors 302a-f that can access a shared memory subsystem 304. The system incorporates a plurality of programmable hardware memory algorithm processors (MAPs) 306a-f in the memory subsystem 304. Each MAP 306a-f can comprise a memory 308a-f and one or more field programmable gate arrays (FPGAs) 310a-f. The MAP provides a configurable functional unit, and particular algorithms or portions of algorithms can be provided to the FPGAs 310a-f for processing in close coordination with a respective processor. For example, the MAPs can be used to evaluate algebraic expressions regarding the data model and to perform adaptive data restructuring in example embodiments. In this example, each MAP is globally accessible by all of the processors for these purposes. In one configuration, each MAP can use direct memory access (DMA) to access its associated memory 308a-f, allowing it to execute tasks independently of, and asynchronously from, the respective microprocessor 302a-f. In this configuration, a MAP can feed results directly to another MAP for pipelining and parallel execution of algorithms.

The above computer architectures and systems are examples only. A wide variety of other computer, cell phone, and personal data assistant architectures and systems can be used in connection with example embodiments, including systems using any combination of general-purpose processors, co-processors, FPGAs and other programmable logic devices, systems on chip (SOCs), application specific integrated circuits (ASICs), and other processing and logic elements. In some embodiments, all or part of the computer system can be implemented in software or hardware. Any variety of data storage media can be used in connection with example embodiments, including random access memory, hard drives, flash memory, tape drives, disk arrays, network attached storage (NAS), and other local or distributed data storage devices and systems.

In example embodiments, the computer system can be implemented using software modules executing on any of the above or other computer architectures and systems. In other embodiments, the functions of the system can be implemented partially or completely in firmware, in programmable logic devices such as the field programmable gate arrays (FPGAs) shown in FIG. 11, in systems on chip (SOCs), in application specific integrated circuits (ASICs), or in other processing and logic elements. For example, the set processor and optimizer can be hardware accelerated through the use of a hardware accelerator card, such as the accelerator card 122 shown in FIG. 9.

The following examples are intended to illustrate but not limit the invention. They are representative of those that might be used; other procedures known to those skilled in the art may alternatively be used.

Examples

Example 1. Methods for generating chromatin in vitro

Two approaches to reconstituting chromatin deserve particular attention: one uses ATP-independent random deposition of histones onto DNA, and the other uses ATP-dependent assembly of periodic nucleosomes. The invention allows either approach to be used with one or more of the methods disclosed herein. Examples of both approaches to generating chromatin can be found in Lusser et al. ("Strategies for the reconstitution of chromatin", Nature Methods (2004), 1(1):19-26), which is incorporated herein by reference in its entirety, including the references cited therein.

Example 2. Genome assembly using Hi-C based techniques

A genome from a human subject was fragmented into pseudo-contigs of 500 kb. Using a Hi-C based method, a plurality of read pairs was generated by probing the physical layout of chromosomes within living cells. A variety of Hi-C based methods can be used to generate read pairs, including the method presented in Lieberman-Aiden et al. ("Comprehensive mapping of long-range interactions reveals folding principles of the human genome", Science (2009), 326(5950):289-293), which is incorporated herein by reference in its entirety, including the references cited therein. Read pairs were mapped to all pseudo-contigs, and those pairs that mapped to two separate pseudo-contigs were used to construct an adjacency matrix based on the mapping data. About 50%, about 60%, about 70%, about 80%, about 90%, about 95%, or about 99% of the read pairs were weighted by a function of each read's distance to the edge of its pseudo-contig, so as to incorporate mathematically the empirically known higher probability of short contacts than of long contacts. Then, for each pseudo-contig, the adjacency matrix was analyzed to determine a path through the pseudo-contigs by finding the single best neighbor pseudo-contig, defined as the one with the highest sum of weights. With these methods, more than 97% of all pseudo-contigs identified their correct neighbor. Additional experiments can be performed to test the impact of shorter contigs and of alternative weighting and path-finding schemes.
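The adjacency-matrix step described above can be sketched as follows. This is a minimal illustration only: the weighting function (the reciprocal of the combined distance of the two reads to their pseudo-contig edges), the input tuple layout, and the function names are assumptions made for the sketch and are not specified in the text.

```python
from collections import defaultdict


def edge_weight(dist_a, dist_b):
    """Hypothetical weight: pairs whose reads sit close to the contig
    edges (short contacts) count more than pairs far from the edges."""
    return 1.0 / (dist_a + dist_b + 1)


def build_adjacency(read_pairs):
    """read_pairs: iterable of (contig_a, dist_a, contig_b, dist_b), where
    dist_* is the distance of each read to the edge of its pseudo-contig.
    Only pairs that map to two different pseudo-contigs are used."""
    adj = defaultdict(lambda: defaultdict(float))
    for contig_a, dist_a, contig_b, dist_b in read_pairs:
        if contig_a == contig_b:
            continue
        w = edge_weight(dist_a, dist_b)
        adj[contig_a][contig_b] += w
        adj[contig_b][contig_a] += w
    return adj


def best_neighbors(adj):
    """For each pseudo-contig, return the neighbor with the highest
    sum of weights."""
    return {contig: max(nbrs, key=nbrs.get) for contig, nbrs in adj.items()}
```

Calling `best_neighbors(build_adjacency(pairs))` gives one candidate neighbor per pseudo-contig; chaining these candidates into a path is left out of the sketch.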

Alternatively, genome assembly using Hi-C data can include computational methods that exploit the signal of genomic proximity in Hi-C data sets for ultra-long-range scaffolding of de novo genome assemblies. Examples of such computational methods that can be used with the methods disclosed herein include the ligating adjacent chromatin method of Burton et al. (Nature Biotechnology 31:1119-1125 (2013)) and the DNA triangulation method of Kaplan et al. (Nature Biotechnology 31:1143-47 (2013)); these references, and any references cited therein, are incorporated herein by reference in their entirety. Furthermore, it should be understood that these computational methods can be used in combination, including with the other genome assembly methods presented herein.

For example, a method based on the ligating adjacent chromatin approach of Burton et al. can be used with the methods disclosed herein. It comprises the steps of (a) clustering contigs into chromosome groups, (b) ordering the contigs within one or more chromosome groups, and (c) assigning a relative orientation to each contig. For step (a), contigs are placed into groups using hierarchical clustering. A graph is built in which each node initially represents one contig and each edge between nodes has a weight equal to the number of Hi-C read pairs linking the two contigs. The contigs are merged using hierarchical agglomerative clustering with an average-linkage metric until the number of groups is reduced to the expected number of distinct chromosomes (counting only groups with more than one contig). Repetitive contigs (contigs whose average link density with other contigs, normalized by the number of restriction fragment sites, is more than twice the average link density) and contigs with too few restriction fragment sites are not clustered. After clustering, however, each of these contigs is assigned to a group if its average link density with that group is more than four times its average link density with any other group. For step (b), a graph is built as in the clustering step, but with edge weights equal to the inverse of the number of Hi-C links between the contigs, normalized by the number of restriction fragment sites in each contig. Short contigs are excluded from this graph. A minimum spanning tree is computed for the graph, and the longest path in this tree, the "trunk", is found. The spanning tree is then modified to lengthen the trunk by adding contigs adjacent to the trunk, in ways that keep the total edge weight heuristically low. After a lengthened trunk is found for each group, it is converted into a full ordering as follows. The trunk is removed from the spanning tree, leaving a set of "branches" containing all contigs not in the trunk. These branches are reinserted into the trunk, longest first, with insertion sites chosen to maximize the number of links between adjacent contigs in the ordering. Short fragments are not reinserted; as a result, many small contigs that were clustered are left out of the final assembly. For step (c), the orientation of each contig within its ordering is determined by taking into account the exact positions at which the Hi-C links align on each contig. The likelihood that a Hi-C link connects two reads at a genomic distance of x is assumed to be roughly 1/x for x ≥ ~100 kb. A weighted directed acyclic graph (WDAG) is built representing all possible ways to orient the contigs in the given order. Each edge in the WDAG corresponds to a pair of adjacent contigs in one of their four possible combined orientations, and the edge weight is set to the log-likelihood of observing the set of Hi-C link distances between the two contigs, assuming they are immediately adjacent with the given orientation. For each contig, a quality score for its orientation is calculated as follows. The log-likelihood of the observed set of Hi-C links between this contig, in its current orientation, and its neighbors is found. The contig is then flipped and the log-likelihood is calculated again. Because of how the orientations are computed, the first log-likelihood is guaranteed to be the higher of the two. The difference between the log-likelihoods is taken as the quality score.
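Step (a) above, agglomerative clustering with average linkage on Hi-C link counts, can be illustrated with the following sketch. It is a simplified illustration and not the published implementation: it omits the normalization by restriction fragment sites, the special handling of repetitive contigs, and the rule that only groups with more than one contig are counted. The function names and the `links` layout (a mapping from a `frozenset` of two contig names to a read-pair count) are assumptions.

```python
def average_link(group_a, group_b, links):
    """Average number of Hi-C read pairs per contig pair between two groups."""
    total = sum(links.get(frozenset((a, b)), 0)
                for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))


def cluster_contigs(contigs, links, n_groups):
    """Merge the two most strongly linked groups until only n_groups
    (the expected number of chromosomes) remain."""
    groups = [[c] for c in contigs]
    while len(groups) > n_groups:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                score = average_link(groups[i], groups[j], links)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]  # merge group j into group i
        del groups[j]
    return groups
```

The nested search over all group pairs is quadratic per merge, which is adequate for a toy example; a real assembly with many thousands of contigs would need a more efficient linkage update.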

Alternatively, a DNA triangulation method similar to that of Kaplan et al. can also be used in the methods disclosed herein to assemble a genome from contigs and read pairs. DNA triangulation is based on the use of high-throughput, in vivo, genome-wide chromatin interaction data to infer genomic position. For DNA triangulation, the CTR pattern is first quantified by partitioning the genome into 100-kb bins, each representing a large virtual contig, and calculating for each placed contig its average interaction frequency with each chromosome. To evaluate localization over long ranges, the interaction data of a contig with the 1 Mb flanking it on each side is omitted. The average interaction frequency strongly separates intra-chromosomal from inter-chromosomal interactions and is highly predictive of which chromosome a contig belongs to. Next, a simple multiclass model, a naive Bayes classifier, is trained to predict the chromosome of each contig based on its average interaction frequency with each chromosome. The assembled portion of the genome is used to fit a probabilistic single-parameter exponential decay model describing the relationship between Hi-C interaction frequency and genomic distance (the DDD model). In each round, a contig is removed from its chromosome, along with a flanking region of 1 Mb on each side. The most likely position of each contig is then estimated based on its interaction profile and the decay model. The prediction error is quantified as the absolute value of the distance between the predicted position and the actual position.
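The chromosome-assignment idea above can be sketched in a few lines. Note the substitution: instead of the naive Bayes classifier described in the text, this sketch simply picks the chromosome with the highest average interaction frequency, which is the signal that classifier is built on. The bin identifiers, data layout, and function names are hypothetical.

```python
def average_interaction(contig_counts, chrom_bins):
    """contig_counts: {bin_id: Hi-C contact count with the contig}.
    chrom_bins: {chromosome: [bin_id, ...]} for the already-placed bins.
    Returns the mean contact count per bin for each chromosome."""
    return {
        chrom: sum(contig_counts.get(b, 0) for b in bins) / len(bins)
        for chrom, bins in chrom_bins.items()
    }


def predict_chromosome(contig_counts, chrom_bins):
    """Assign the contig to the chromosome it contacts most often on average."""
    freqs = average_interaction(contig_counts, chrom_bins)
    return max(freqs, key=freqs.get)
```

Averaging per bin, and not summing, keeps long chromosomes from winning simply because they contribute more bins.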

The predictability of each contig can be further improved by combining DNA triangulation with long-insert libraries. Knowing the chromosome assignment and approximate location of each contig greatly reduces the computational complexity of long-insert scaffolding, because each contig needs to be paired only with contigs in its vicinity. This resolves ambiguous contig joins and reduces assembly errors in which contigs located in distant regions of a chromosome, or on different chromosomes, are incorrectly joined.

Example 3. Methods for haplotype phasing

Because the read pairs generated by the methods disclosed herein are generally derived from intra-chromosomal contacts, any read pair that contains a site of heterozygosity also carries information about its phasing. Using this information, reliable phasing over short, intermediate, and even long (megabase) distances can be performed rapidly and accurately. Experiments designed to phase data from one of the 1000 Genomes trios (a set of mother/father/offspring genomes) reliably inferred phasing. In addition, haplotype reconstruction using proximity ligation, similar to Selvaraj et al. (Nature Biotechnology 31:1111-1118 (2013)), can also be used with the haplotype phasing methods disclosed herein.

For example, haplotype reconstruction using a proximity-ligation based method can also be used in the methods disclosed herein for phasing a genome. This approach combines proximity ligation and DNA sequencing with a probabilistic algorithm for haplotype assembly. First, proximity-ligation sequencing is performed using a chromosome capture technique such as Hi-C. These techniques capture DNA fragments from two distant genomic loci that loop together in three-dimensional space. After shotgun DNA sequencing of the resulting DNA library, the paired-end sequencing reads have "insert sizes" ranging from several hundred base pairs to tens of millions of base pairs. Thus, the short DNA fragments generated in a Hi-C experiment can yield small haplotype blocks, and the long fragments can ultimately link these small blocks together. With enough sequencing coverage, this approach can link variants in discontinuous blocks and assemble every such block into a single haplotype. These data are then combined with a probabilistic algorithm for haplotype assembly. The probabilistic algorithm uses a graph in which nodes correspond to heterozygous variants and edges correspond to overlapping sequence fragments that may link the variants. This graph may contain spurious edges resulting from sequencing errors or trans interactions. A max-cut algorithm is then used to predict parsimonious solutions that are maximally consistent with the haplotype information provided by the set of input sequencing reads. Because proximity ligation generates larger graphs than conventional genome sequencing or mate-pair sequencing, the computing time and number of iterations are modified so that haplotypes can be predicted with reasonable speed and high accuracy. The resulting data can then be used to guide local phasing using Beagle software and sequencing data from the genome project, to generate chromosome-spanning haplotypes with high resolution and accuracy.
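The variant graph described above can be illustrated with a small sketch. This is not the max-cut algorithm named in the text: as a deliberately simpler stand-in, it propagates phase greedily along graph edges by breadth-first search, so unlike max-cut it is not robust to spurious edges. The fragment representation (one dictionary per read pair, mapping a heterozygous variant identifier to the observed allele, 0 or 1) and the function names are assumptions.

```python
from collections import defaultdict, deque


def build_graph(fragments):
    """Edge weight between two variants: +1 for each read pair that shows
    the same allele at both sites, -1 for each that shows opposite alleles."""
    weights = defaultdict(int)
    for frag in fragments:
        items = sorted(frag.items())
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                (v1, a1), (v2, a2) = items[i], items[j]
                weights[(v1, v2)] += 1 if a1 == a2 else -1
    return weights


def phase(fragments):
    """Return {variant: allele carried on haplotype A}. The first variant of
    each connected block is arbitrarily fixed to allele 0."""
    nbrs = defaultdict(list)
    for (v1, v2), w in build_graph(fragments).items():
        if w != 0:  # ignore edges with perfectly conflicting evidence
            nbrs[v1].append((v2, w))
            nbrs[v2].append((v1, w))
    hap = {}
    for start in sorted(nbrs):
        if start in hap:
            continue
        hap[start] = 0
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for u, w in nbrs[v]:
                if u not in hap:
                    hap[u] = hap[v] if w > 0 else 1 - hap[v]
                    queue.append(u)
    return hap
```

Each connected component of the graph corresponds to one haplotype block; long-range read pairs add the edges that merge small blocks into larger ones.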

Example 4. Methods for metagenomic assembly

Microbes are collected from an environment and fixed with a fixative agent, such as formaldehyde, to form cross-links within each microbial cell. A plurality of contigs is generated from the microbes using high-throughput sequencing. A plurality of read pairs is generated using Hi-C based techniques. Read pairs that map to different contigs indicate which contigs come from the same species.
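The grouping of contigs by species described above can be sketched as a connected-components problem. This is an assumption-laden illustration: the `min_links` threshold for ignoring weakly supported links, the input layout (one tuple of two contig names per read pair), and the function name are not taken from the text.

```python
from collections import Counter


def bin_contigs(read_pairs, min_links=2):
    """Group contigs joined by at least min_links read pairs, using
    union-find. Returns a sorted list of sorted contig-name lists."""
    links = Counter(frozenset(p) for p in read_pairs if p[0] != p[1])
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in read_pairs:  # register every contig that was seen
        find(a)
        find(b)
    for pair, count in links.items():
        if count >= min_links:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    bins = {}
    for contig in list(parent):
        bins.setdefault(find(contig), set()).add(contig)
    return sorted(sorted(members) for members in bins.values())
```

Requiring more than one supporting read pair is a simple guard against chance ligation between DNA from different cells; the appropriate threshold would depend on sequencing depth.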

Example 5. Methods for producing extremely long-range read pairs (XLRPs)

Using a commercially available kit, DNA is extracted to fragment sizes of up to 150 kbp. The DNA is assembled into a reconstituted chromatin structure in vitro using a commercial kit from Active Motif. The chromatin is biotinylated, fixed with formaldehyde, and immobilized onto streptavidin beads. The DNA fragments are digested with a restriction enzyme and incubated overnight. The resulting sticky ends are filled in with alpha-thio-dGTP and biotinylated dCTP to generate blunt ends. The blunt ends are ligated with T4 ligase. The reconstituted chromatin is digested with a protease to recover the ligated DNA. The DNA is extracted from the beads and subjected to exonuclease digestion to remove biotin from unligated ends. The recovered DNA is sheared and the ends are filled in with dNTPs. The biotinylated fragments are purified by pull-down with streptavidin beads. In some cases, adapters are ligated and the fragments are PCR amplified for high-throughput sequencing.

Example 6. Methods for producing a high-quality human genome assembly

Given that the invention can generate read pairs spanning considerable genomic distances, the use of this information for genome assembly can be tested. The invention can significantly improve the linkage of de novo assemblies, potentially to chromosome-length scaffolds. An assessment can be made of how complete an assembly can be produced and how much data is required when using the invention. To evaluate the efficacy of the method in producing data that is valuable for assembly, a standard Illumina shotgun library and an XLRP library can be built and sequenced. In one case, data from one Illumina HiSeq lane each of a standard shotgun library and an XLRP library are used. The data generated by each method are examined and compared with various existing assemblers. Optionally, new software is also written that is specifically tailored to the unique data produced by the invention. Optionally, a well-characterized human sample is used as a reference against which to compare assemblies produced by the method, in order to assess their accuracy and completeness. Using the knowledge gained in the preceding analyses, an assembler is produced to increase the efficient and effective use of the XLRP and shotgun data. Using the methods described herein, a genome assembly is generated with a quality equal to or better than that of the December 2002 mouse genome draft.

One sample that can be used for this analysis is NA12878. DNA is extracted from sample cells using a variety of published techniques designed to maximize DNA fragment length. A standard Illumina TruSeq shotgun library and an XLRP library are each built. A single HiSeq lane of 2x150 bp sequence is obtained for each library, which can yield approximately 150 million read pairs per library. The shotgun data are assembled into contigs using a whole-genome assembly algorithm. Examples of such algorithms include Meraculous, as described in Chapman et al. (PLOS ONE 6(8):e2350 (2011)), and SGA, as described in Simpson et al. (Genome Research 22(3):549–56 (2012)). The XLRP library reads are aligned to the contigs produced by the initial assembly. The alignments are used to further link the contigs. Once the effectiveness of XLRP for linking contigs has been established, the Meraculous assembler can be extended to integrate the shotgun and XLRP libraries simultaneously in a single assembly process. Meraculous provides a strong foundation for the assembler; optionally, an all-in-one assembler is produced to meet the specific needs of the invention. The human genome assembled by the invention is compared with any known sequence to evaluate the quality of the genome assembly.

Example 7. Methods for phasing heterozygous SNPs in a human sample with high accuracy from a small data set

In one experiment, approximately 44% of the heterozygous variants in a test human sample data set were phased. All or nearly all of the phased variants lying within one read length of a restriction site were captured. In silico analysis indicates that more variants can be captured for phasing by using longer read lengths and by using one or more combinations of restriction enzymes for digestion. Using a combination of restriction enzymes with different restriction sites increases the proportion of the genome (and thus of heterozygous sites) that lies within range of one of the two restriction sites participating in each read pair. In silico analysis showed that, using various combinations of two restriction enzymes, the methods of the invention can phase more than 95% of known heterozygous positions. Additional enzymes and greater read lengths further increase the fraction of heterozygous sites that are observed and phased, up to complete coverage and phasing.
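The kind of in silico estimate mentioned above can be sketched as follows: given a sequence, a set of restriction-site motifs, and a read length, compute the fraction of positions that lie within one read length of any site. The function name, the use of plain recognition-site strings, and the choice to count the site itself as covered are assumptions made for this sketch; it does not model cut positions or strand.

```python
import re


def covered_fraction(sequence, sites, read_length):
    """Fraction of positions in `sequence` lying within `read_length`
    bases of (or inside) any occurrence of the motifs in `sites`."""
    if not sequence:
        return 0.0
    covered = [False] * len(sequence)
    for site in sites:
        # A lookahead pattern also finds overlapping occurrences of the motif.
        for match in re.finditer("(?=%s)" % re.escape(site), sequence):
            lo = max(0, match.start() - read_length)
            hi = min(len(sequence), match.start() + len(site) + read_length)
            for i in range(lo, hi):
                covered[i] = True
    return sum(covered) / len(sequence)
```

Comparing the result for one motif against the result for two motifs, or for a doubled `read_length`, reproduces the qualitative trend described in the text: more enzymes and longer reads bring a larger share of positions within reach of a restriction site.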

The heterozygous site coverage achievable with various combinations of two restriction enzymes is calculated. The top three combinations, in terms of heterozygous sites in read proximity, are tested with the protocol. For each of these combinations, an XLRP library is produced and sequenced. The resulting reads are aligned to a human reference genome and compared with the known haplotypes of the sample to determine the accuracy of the protocol. Using only one lane of Illumina HiSeq data, up to 90% or more of the heterozygous SNPs of a human sample are phased with an accuracy of 99% or greater. In addition, further variants are captured by increasing the read length to 300 bp, which effectively doubles the region around each observable restriction site that is covered by reads. Additional restriction enzyme combinations are applied to increase coverage and accuracy.

Example 8. Extraction and effects of high molecular weight DNA

DNA of up to 150 kbp was extracted with commercially available kits. FIG. 7 shows that XLRP libraries can be generated with captured read pairs up to the maximum fragment length of the extracted DNA. Accordingly, the methods disclosed herein can be expected to generate read pairs from even longer stretches of DNA. There are numerous well-developed processes for the recovery of high molecular weight DNA, and these can be used with the methods or protocols disclosed herein. Extraction methods are used to produce DNA of large fragment lengths, XLRP libraries are created from these fragments, and the read pairs produced can be evaluated. For example, large molecular weight DNA can be extracted by (1) gentle lysis of the cells according to Teague et al. (Proc. Nat. Acad. Sci. USA 107(24):10848–53 (2010)) or Zhou et al. (PLOS Genetics, 5(11):e1000711 (2009)); and (2) agarose gel plugs according to Wing et al. (The Plant Journal: for Cell and Molecular Biology, 4(5):893–8 (1993)), which references are incorporated herein by reference in their entirety, including any references cited therein, or by using the Aurora System from Boreal Genomics. These methods are capable of generating long DNA fragments beyond what is routinely required for next-generation sequencing; however, other suitable methods known in the art can be substituted to achieve similar results. The Aurora System provides unexpected results and can isolate and concentrate DNA from tissue or other sample preparations to lengths of up to, or beyond, a megabase. DNA extractions are prepared using each of these methods, starting from a single GM12878 cell culture to control for possible differences at the sample level. The size distribution of the fragments is evaluated by pulsed-field gel electrophoresis according to Herschleb et al. (Nature Protocols 2(3):677–84 (2007)). Using the foregoing methods, extremely long stretches of DNA can be extracted and used to build XLRP libraries. The XLRP libraries are then sequenced and aligned. The resulting read data are analyzed by comparing the genomic distance between read pairs with the fragment sizes observed on the gel.

Example 9. Reducing read pairs from undesired genomic regions

RNA complementary to the undesired genomic regions is produced by in vitro transcription and added to the reconstituted chromatin prior to crosslinking. When the complementary RNA binds to one or more undesired genomic regions, the RNA binding decreases the crosslinking efficiency at those regions. The abundance of DNA from these regions in the cross-linked complexes is thereby reduced. The reconstituted chromatin is biotinylated, immobilized, and used as described above. In some cases, the RNA is designed to target repetitive regions in the genome.

Example 10. Increasing read pairs from desired chromatin regions

DNA from the desired chromatin regions is produced in double-stranded form for genome assembly or haplotyping. The representation of DNA from undesired regions is reduced accordingly. Double-stranded DNA from the desired chromatin regions is generated using primers tiled across those regions at intervals of several kilobases. In other embodiments of the method, the tiling interval is varied so that desired regions of different sizes achieve the desired replication efficiency. Primer binding sites across the desired regions are contacted with the primers, optionally by melting the DNA. New strands of DNA are synthesized from the tiled primers. Undesired regions are reduced or eliminated, for example by targeting them with an endonuclease specific for single-stranded DNA. The remaining desired regions are optionally amplified. The prepared sample is processed using the sequencing library preparation methods described elsewhere herein. In some embodiments, read pairs spanning distances up to the length of each desired chromatin region are generated from each of those regions.
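As a rough illustration of tiling primers across a desired region at a fixed interval, the sketch below lists candidate primer start coordinates. The function name, the half-open coordinate convention, and the example spacing are assumptions made for illustration; actual primer design would also depend on sequence content, which is not modeled here.

```python
from typing import List


def tile_primer_sites(region_start: int, region_end: int, spacing_bp: int) -> List[int]:
    """Candidate primer start coordinates across [region_start, region_end),
    placed every `spacing_bp` bases beginning at the region start."""
    if spacing_bp <= 0:
        raise ValueError("spacing_bp must be positive")
    if region_end <= region_start:
        raise ValueError("region_end must be greater than region_start")
    return list(range(region_start, region_end, spacing_bp))
```

For a 10 kb region tiled every 3 kb, `tile_primer_sites(0, 10_000, 3_000)` yields four sites; a smaller `spacing_bp` could be chosen for shorter regions to vary the tiling interval.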

While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that these embodiments are provided by way of example only. Numerous variations, changes, and substitutions will occur to those skilled in the art without departing from the scope of the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is intended that the following claims define the scope of the invention and that structures and methods within the scope of these claims and their equivalents be covered thereby.

Claims (22)

1. A method for genome assembly, comprising:
obtaining a plurality of contigs;
obtaining reconstituted chromatin, the reconstituted chromatin comprising naked DNA complexed with isolated nuclear proteins;
generating a plurality of read pairs from data produced by probing the physical layout of the reconstituted chromatin;
mapping the plurality of read pairs to the plurality of contigs, thereby producing read-mapping data;
arranging the contigs using the read-mapping data; and
determining a path through the contigs, wherein the path represents the order and/or orientation of the contigs with respect to the genome.
2. The method of claim 1, wherein the plurality of contigs is generated by a shotgun sequencing method, the shotgun sequencing method comprising:
fragmenting long stretches of a subject's DNA into random fragments of indeterminate size;
sequencing the fragments using a high-throughput sequencing method to generate a plurality of sequencing reads; and
assembling the sequencing reads to form the plurality of contigs.
3. The method of claim 1 or claim 2, wherein the plurality of read pairs is generated by probing the physical layout of the reconstituted chromatin using a technique comprising:
crosslinking the reconstituted chromatin with a fixative to form DNA-protein crosslinks;
cleaving the crosslinked DNA-protein with one or more restriction enzymes to generate a plurality of DNA-protein complexes comprising sticky ends;
filling in the sticky ends with nucleotides containing one or more markers to create blunt ends, which are then ligated together;
fragmenting the plurality of DNA-protein complexes into fragments;
pulling down junction-containing fragments by using the one or more markers; and
sequencing the junction-containing fragments using a high-throughput sequencing method to generate the plurality of read pairs.
4. The method of claim 1, wherein the plurality of read pairs is generated by probing the physical layout of reconstituted chromatin, the reconstituted chromatin being formed by complexing naked DNA obtained from one or more subject samples with isolated histones.
5. The method of claim 1, wherein, for the plurality of read pairs, at least 80% of the read pairs are weighted using a function of the distance from the read to the edge of the contig, so as to reflect a higher probability of short-range contacts than of long-range contacts.
6. The method of claim 1, wherein the method provides a genome assembly of a human subject, wherein the plurality of contigs is generated from DNA of the human subject, and wherein the plurality of read pairs is generated by using reconstituted chromatin made from naked DNA of the subject.
7. The method of claim 1, wherein the method further comprises:
identifying one or more heterozygous sites in the plurality of read pairs; and
identifying read pairs that contain a pair of heterozygous sites, wherein phasing data for allelic variants can be determined from the identification of the pair of heterozygous sites.
8. The method of claim 1, wherein the plurality of read pairs is obtained from a single DNA molecule, wherein at least 1% of the read pairs span a distance of at least 30 kb on the single DNA molecule, and wherein the read pairs are generated within 14 days.
9. The method of claim 8, wherein at least 10% of the read pairs span a distance of at least 50 kb on the single DNA molecule.
10. The method of claim 8, wherein at least 1% of the read pairs span a distance of at least 100 kb on the single DNA molecule.
11. The method of claim 1, wherein the read pairs are generated within 7 days.
12. The method of claim 1, wherein the plurality of read pairs is obtained from a single DNA molecule in vitro, and wherein at least 1% of the read pairs span a distance of at least 30 kb on the single DNA molecule.
13. The method of claim 12, wherein at least 10% of the read pairs span a distance of at least 30 kb on the single DNA molecule.
14. The method of claim 13, wherein at least 1% of the read pairs span a distance of at least 50 kb on the single DNA molecule.
15. The method of claim 7, wherein the plurality of read pairs is obtained from a single DNA molecule, wherein at least 1% of the read pairs span a distance of at least 30 kb on the single DNA molecule, and wherein the phasing of allelic variants is performed with an accuracy greater than 70%.
16. The method of claim 15, wherein at least 10% of the read pairs span a distance of at least 50 kb on the single DNA molecule.
17. The method of claim 15, wherein at least 1% of the read pairs span a distance of at least 100 kb on the single DNA molecule.
18. The method of any one of claims 15 to 17, wherein the phasing of allelic variants is performed with an accuracy greater than 90%.
19. The method of claim 7, wherein the plurality of read pairs is obtained from a single DNA molecule in vitro, wherein at least 1% of the read pairs span a distance of at least 30 kb on the single DNA molecule, and wherein the phasing of allelic variants is performed with an accuracy greater than 70%.
20. The method of claim 19, wherein at least 10% of the read pairs span a distance of at least 30 kb on the single DNA molecule.
21. The method of claim 19, wherein at least 1% of the read pairs span a distance of at least 50 kb on the single DNA molecule.
22. The method of any one of claims 19 to 21, wherein the phasing of allelic variants is performed with an accuracy greater than 90%.
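The mapping, arranging, and path-determination steps recited in claim 1 can be pictured with a deliberately simplified sketch. It assumes each read pair has already been reduced to the pair of contig names its two reads map to, counts the links between contigs, and greedily joins the most strongly linked contigs into chains. The function names and the greedy joining rule are illustrative assumptions, not the claimed method; contig orientation and the distance-based weighting of claim 5 are not modeled.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def link_counts(read_pair_contigs: Iterable[Tuple[str, str]]) -> Counter:
    """Count read pairs whose two reads map to different contigs."""
    counts: Counter = Counter()
    for a, b in read_pair_contigs:
        if a != b:  # pairs within one contig carry no ordering information here
            counts[tuple(sorted((a, b)))] += 1
    return counts


def greedy_contig_paths(counts: Counter) -> List[List[str]]:
    """Join contigs along the strongest links, keeping every contig to at most
    two neighbours and refusing links that would close a cycle."""
    parent: Dict[str, str] = {}
    degree: Counter = Counter()
    adjacency: Dict[str, List[str]] = {}

    def find(x: str) -> str:
        # Union-find lookup with path halving, used to detect cycles.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Strongest links first; ties broken by contig names for determinism.
    for (a, b), _n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            parent[find(a)] = find(b)
            degree[a] += 1
            degree[b] += 1
            adjacency.setdefault(a, []).append(b)
            adjacency.setdefault(b, []).append(a)

    # Walk each chain starting from one of its two end contigs.
    paths: List[List[str]] = []
    seen = set()
    for start in sorted(adjacency):
        if start in seen or len(adjacency[start]) != 1:
            continue
        path, prev, cur = [start], None, start
        seen.add(start)
        while True:
            nxt = [n for n in adjacency[cur] if n != prev]
            if not nxt:
                break
            prev, cur = cur, nxt[0]
            path.append(cur)
            seen.add(cur)
        paths.append(path)
    return paths
```

Each returned list is one chain of contigs in inferred order; contigs with no inter-contig links are simply absent from the output.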
CN201480020008.2A 2013-02-01 2014-01-31 Methods for genome assembly and haplotype phasing Active CN105121661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810469575.6A CN108624668B (en) 2013-02-01 2014-01-31 Methods for genome assembly and haplotype phasing

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US201361759941P 2013-02-01 2013-02-01
US61/759,941 2013-02-01
US201361892355P 2013-10-17 2013-10-17
US61/892,355 2013-10-17
PCT/US2014/014184 WO2014121091A1 (en) 2013-02-01 2014-01-31 Methods for genome assembly and haplotype phasing

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201810469575.6A Division CN108624668B (en) 2013-02-01 2014-01-31 Methods for genome assembly and haplotype phasing

Publications (2)

Publication Number Publication Date
CN105121661A CN105121661A (en) 2015-12-02
CN105121661B true CN105121661B (en) 2018-06-08

Family

ID=51262991

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201480020008.2A Active CN105121661B (en) 2013-02-01 2014-01-31 Methods for genome assembly and haplotype phasing
CN201810469575.6A Active CN108624668B (en) 2013-02-01 2014-01-31 Methods for genome assembly and haplotype phasing

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201810469575.6A Active CN108624668B (en) 2013-02-01 2014-01-31 Methods for genome assembly and haplotype phasing

Country Status (8)

Country Link
US (3) US10089437B2 (en)
EP (2) EP2951319B1 (en)
JP (3) JP6466855B2 (en)
CN (2) CN105121661B (en)
AU (2) AU2014212152B2 (en)
CA (2) CA3209385A1 (en)
GB (2) GB2519255B (en)
WO (1) WO2014121091A1 (en)

Families Citing this family (50)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9792405B2 (en) 2013-01-17 2017-10-17 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
EP2994749B1 (en) 2013-01-17 2025-03-05 Illumina, Inc. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
US10068054B2 (en) 2013-01-17 2018-09-04 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
US9679104B2 (en) 2013-01-17 2017-06-13 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
US10847251B2 (en) 2013-01-17 2020-11-24 Illumina, Inc. Genomic infrastructure for on-site or cloud-based DNA and RNA processing and analysis
US10691775B2 (en) 2013-01-17 2020-06-23 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
EP2951319B1 (en) * 2013-02-01 2021-03-10 The Regents of the University of California Methods for genome assembly and haplotype phasing
EP3080605B1 (en) * 2013-12-11 2019-02-20 The Regents of the University of California Method for labeling dna fragments to reconstruct physical linkage and phase
AU2015296029B2 (en) * 2014-08-01 2022-01-27 Dovetail Genomics, Llc Tagging nucleic acids for sequence assembly
NZ734854A (en) 2015-02-17 2022-11-25 Dovetail Genomics Llc Nucleic acid sequence assembly
US9940266B2 (en) 2015-03-23 2018-04-10 Edico Genome Corporation Method and system for genomic visualization
GB2554572B (en) * 2015-03-26 2021-06-23 Dovetail Genomics Llc Physical linkage preservation in DNA storage
WO2016164313A1 (en) * 2015-04-06 2016-10-13 The Regents Of The University Of California Methods and compositions for long-range haplotype phasing
AU2016341198B2 (en) * 2015-10-19 2023-03-09 Dovetail Genomics, Llc Methods for genome assembly, haplotype phasing, and target independent nucleic acid detection
US10068183B1 (en) 2017-02-23 2018-09-04 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on a quantum processing platform
US20170270245A1 (en) 2016-01-11 2017-09-21 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing
EP3402883A4 (en) * 2016-01-12 2019-09-18 Seqwell, Inc. Compositions and methods for sequencing nucleic acids
WO2018037289A2 (en) * 2016-02-10 2018-03-01 Energin.R Technologies 2009 Ltd. Systems and methods for computational demultiplexing of genomic barcoded sequences
SG11201807117WA (en) * 2016-02-23 2018-09-27 Dovetail Genomics Llc Generation of phased read-sets for genome assembly and haplotype phasing
CN105839196B (en) * 2016-05-11 2018-04-17 北京百迈客生物科技有限公司 A Hi-C high-throughput sequencing library construction method for eukaryotic DNA
CN109477101B (en) 2016-05-13 2022-11-18 多弗泰尔基因组学有限责任公司 Recovery of long-range linkage information from preserved samples
CN106055925B (en) * 2016-05-24 2018-09-18 中国水产科学研究院 Method and apparatus for assembling genome sequences based on transcriptome paired-end sequencing data
MX2024002491A (en) * 2016-06-07 2024-03-15 Illumina Inc Bioinformatics systems, apparatus, and methods for performing secondary and/or tertiary processing.
WO2018086045A1 (en) * 2016-11-10 2018-05-17 深圳华大基因研究院 Method for performing quantitative analysis on subgroup in specific group
AU2017363139B2 (en) 2016-11-16 2023-09-21 Catalog Technologies, Inc. Nucleic acid-based data storage
US10650312B2 (en) 2016-11-16 2020-05-12 Catalog Technologies, Inc. Nucleic acid-based data storage
CN106754868A (en) * 2016-11-29 2017-05-31 武汉菲沙基因信息有限公司 A kind of method of the DNA fragmentation interacted in capture Matrix attachment region
CN118441026A (en) 2016-12-19 2024-08-06 生物辐射实验室股份有限公司 Drop-on labeled DNA with maintained adjacency
EP3612646A1 (en) 2017-04-18 2020-02-26 Dovetail Genomics, LLC Nucleic acid characteristics as guides for sequence assembly
US10176296B2 (en) * 2017-05-17 2019-01-08 International Business Machines Corporation Algebraic phasing of polyploids
KR102035285B1 (en) * 2017-05-30 2019-10-22 단국대학교 산학협력단 Contig Profile Update Method and Contig Formation Method for DNA shotgun sequencing or RNA transcriptome assembly
CN107704725B (en) * 2017-08-11 2020-12-01 浙江工业大学 A method for assembling discontinuous multi-domain protein structures
JP7297774B2 (en) * 2017-11-09 2023-06-26 ダブテイル ゲノミクス エルエルシー Analysis of structural variation
AU2019214956B2 (en) 2018-01-31 2025-06-26 Dovetail Genomics, Llc Sample prep for DNA linkage recovery
JP7364604B2 (en) 2018-03-16 2023-10-18 カタログ テクノロジーズ, インコーポレイテッド Chemical methods for nucleic acid-based data storage
CN108985009B (en) * 2018-08-29 2022-06-07 北京希望组生物科技有限公司 Method for obtaining gene haplotype sequence and application thereof
CN109055491A (en) * 2018-09-18 2018-12-21 武汉菲沙基因信息有限公司 A Hi-C high-throughput sequencing library construction method suitable for plants
GB2596982B8 (en) * 2019-04-28 2025-06-25 Univ California Methods for library preparation to enrich informative DNA fragments using enzymatic digestion
US11610651B2 (en) 2019-05-09 2023-03-21 Catalog Technologies, Inc. Data structures and operations for searching, computing, and indexing in DNA-based data storage
EP3990920B1 (en) 2019-06-27 2025-02-12 Dovetail Genomics, LLC Methods and compositions for proximity ligation
EP4041920A1 (en) 2019-10-11 2022-08-17 Catalog Technologies, Inc. Nucleic acid security and authentication
CN111192627B (en) * 2019-12-15 2022-09-06 南京理工大学 Ribonucleic acid contact map prediction method based on base embedding and direct correlation analysis
AU2021271639A1 (en) 2020-05-11 2022-12-08 Catalog Technologies, Inc. Programs and functions in DNA-based data storage
CN111564182B (en) * 2020-05-12 2024-02-09 西藏自治区农牧科学院水产科学研究所 High-weight recovery of fish of the genus of Glehnian chromosome-level assembly of (2)
CN111627492B (en) * 2020-05-25 2023-04-28 中国人民解放军军事科学院军事医学研究院 Cancer genome Hi-C data simulation method, device and electronic equipment
GB202008269D0 (en) * 2020-06-02 2020-07-15 Oxford Biodynamics Ltd Detecting a chromosome marker
CN113215141A (en) * 2021-02-23 2021-08-06 华南农业大学 Bacterial HI-C genome and plasmid conformation capture method
CN117580963A (en) * 2021-05-05 2024-02-20 斯坦福大学托管董事会 Methods and systems for analyzing nucleic acid molecules
CN115810395B (en) * 2022-12-05 2023-09-26 武汉贝纳科技有限公司 T2T assembly method based on high-throughput sequencing animal and plant genome
CN116606910B (en) * 2023-07-21 2023-10-13 中国农业科学院农业基因组研究所 Metagenomic GutHi-C library building method suitable for microbial population and application

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103443338A (en) * 2011-02-02 2013-12-11 华盛顿大学商业化中心 Massively parallel continguity mapping

Family Cites Families (112)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
NL154598B (en) 1970-11-10 1977-09-15 Organon Nv PROCEDURE FOR DETERMINING AND DETERMINING LOW MOLECULAR COMPOUNDS AND PROTEINS THAT CAN SPECIFICALLY BIND THESE COMPOUNDS AND TEST PACKAGING.
US3817837A (en) 1971-05-14 1974-06-18 Syva Corp Enzyme amplification assay
US3939350A (en) 1974-04-29 1976-02-17 Board Of Trustees Of The Leland Stanford Junior University Fluorescent immunoassay employing total reflection for activation
US3996345A (en) 1974-08-12 1976-12-07 Syva Company Fluorescence quenching with immunological pairs in immunoassays
US4277437A (en) 1978-04-05 1981-07-07 Syva Company Kit for carrying out chemically induced fluorescence immunoassay
US4275149A (en) 1978-11-24 1981-06-23 Syva Company Macromolecular environment control in specific receptor assays
US4366241A (en) 1980-08-07 1982-12-28 Syva Company Concentrating zone method in heterogeneous immunoassays
US5242794A (en) 1984-12-13 1993-09-07 Applied Biosystems, Inc. Detection of specific sequences in nucleic acids
US4988617A (en) 1988-03-25 1991-01-29 California Institute Of Technology Method of detecting a nucleotide change in nucleic acids
US5234809A (en) 1989-03-23 1993-08-10 Akzo N.V. Process for isolating nucleic acid
US5143854A (en) 1989-06-07 1992-09-01 Affymax Technologies N.V. Large scale photolithographic solid phase synthesis of polypeptides and receptor binding screening thereof
US5494810A (en) 1990-05-03 1996-02-27 Cornell Research Foundation, Inc. Thermostable ligase-mediated DNA amplifications system for the detection of genetic disease
ATE199054T1 (en) 1990-12-06 2001-02-15 Affymetrix Inc A Delaware Corp COMPOUNDS AND THEIR USE IN A BINARY SYNTHESIS STRATEGY
US5994056A (en) 1991-05-02 1999-11-30 Roche Molecular Systems, Inc. Homogeneous methods for nucleic acid amplification and detection
EP0916396B1 (en) 1991-11-22 2005-04-13 Affymetrix, Inc. (a Delaware Corporation) Combinatorial strategies for polymer synthesis
US5567583A (en) 1991-12-16 1996-10-22 Biotronics Corporation Methods for reducing non-specific priming in DNA detection
US6033854A (en) 1991-12-16 2000-03-07 Biotronics Corporation Quantitative PCR using blocking oligonucleotides
US5348853A (en) 1991-12-16 1994-09-20 Biotronics Corporation Method for reducing non-specific priming in DNA amplification
WO1994024143A1 (en) 1993-04-12 1994-10-27 Northwestern University Method of forming oligonucleotides
US5837832A (en) 1993-06-25 1998-11-17 Affymetrix, Inc. Arrays of nucleic acid probes on biological chips
DE69433180T2 (en) 1993-10-26 2004-06-24 Affymetrix, Inc., Santa Clara FIELDS OF NUCLEIC ACID PROBE ON ORGANIC CHIPS
US6110709A (en) 1994-03-18 2000-08-29 The General Hospital Corporation Cleaved amplified modified polymorphic sequence detection methods
US5571639A (en) 1994-05-24 1996-11-05 Affymax Technologies N.V. Computer-aided engineering system for design of sequence arrays and lithographic masks
US5705628A (en) 1994-09-20 1998-01-06 Whitehead Institute For Biomedical Research DNA purification and isolation using magnetic particles
US5795716A (en) 1994-10-21 1998-08-18 Chee; Mark S. Computer-aided visualization and analysis system for sequence evaluation
US5599695A (en) 1995-02-27 1997-02-04 Affymetrix, Inc. Printing molecular library arrays using deprotection agents solely in the vapor phase
US5780613A (en) 1995-08-01 1998-07-14 Northwestern University Covalent lock for self-assembled oligonucleotide constructs
EP0937159A4 (en) 1996-02-08 2004-10-20 Affymetrix Inc Chip-based speciation and phenotypic characterization of microorganisms
US5786146A (en) 1996-06-03 1998-07-28 The Johns Hopkins University School Of Medicine Method of detection of methylated nucleic acid using agents which modify unmethylated cytosine and distinguishing modified methylated and non-methylated nucleic acids
EP1179600B1 (en) 1996-06-04 2005-05-11 University Of Utah Research Foundation Monitoring hybridization during PCR
US6117635A (en) 1996-07-16 2000-09-12 Intergen Company Nucleic acid amplification oligonucleotides with molecular energy transfer labels and methods based thereon
US6449562B1 (en) 1996-10-10 2002-09-10 Luminex Corporation Multiplexed analysis of clinical specimens apparatus and method
WO1998041651A1 (en) 1997-03-18 1998-09-24 Hsc Research & Development Limited Partnership Method for preparing chromatin
US6969488B2 (en) 1998-05-22 2005-11-29 Solexa, Inc. System and apparatus for sequential processing of analytes
JP2003502005A (en) 1997-10-28 2003-01-21 ザ リージェンツ オブ ザ ユニバーシティ オブ カリフォルニア DNA base mismatch detection using flow cytometry
US5989823A (en) 1998-09-18 1999-11-23 Nexstar Pharmaceuticals, Inc. Homogeneous detection of a target through nucleic acid ligand-ligand beacon interaction
GB9812768D0 (en) 1998-06-13 1998-08-12 Zeneca Ltd Methods
US20040106110A1 (en) 1998-07-30 2004-06-03 Solexa, Ltd. Preparation of polynucleotide arrays
US20030022207A1 (en) 1998-10-16 2003-01-30 Solexa, Ltd. Arrayed polynucleotides and their use in genome analysis
US6787308B2 (en) 1998-07-30 2004-09-07 Solexa Ltd. Arrayed biomolecules and their use in sequencing
JP4808315B2 (en) * 1998-12-02 2011-11-02 ブリストル−マイヤーズ スクウィブ カンパニー DNA / protein fusion and its use
US8367322B2 (en) 1999-01-06 2013-02-05 Cornell Research Foundation, Inc. Accelerating identification of single nucleotide polymorphisms and alignment of clones in genomic sequencing
US7056661B2 (en) 1999-05-19 2006-06-06 Cornell Research Foundation, Inc. Method for sequencing nucleic acid molecules
US6225109B1 (en) 1999-05-27 2001-05-01 Orchid Biosciences, Inc. Genetic analysis device
US7244559B2 (en) 1999-09-16 2007-07-17 454 Life Sciences Corporation Method of sequencing a nucleic acid
US7211390B2 (en) 1999-09-16 2007-05-01 454 Life Sciences Corporation Method of sequencing a nucleic acid
AU7537200A (en) 1999-09-29 2001-04-30 Solexa Ltd. Polynucleotide sequencing
US6582938B1 (en) 2001-05-11 2003-06-24 Affymetrix, Inc. Amplification of nucleic acids
GB0002389D0 (en) 2000-02-02 2000-03-22 Solexa Ltd Molecular arrays
US6448717B1 (en) 2000-07-17 2002-09-10 Micron Technology, Inc. Method and apparatuses for providing uniform electron beams from field emission displays
AU2001293163A1 (en) 2000-09-27 2002-04-08 Lynx Therapeutics, Inc. Method for determining relative abundance of nucleic acid sequences
US7001724B1 (en) 2000-11-28 2006-02-21 Applera Corporation Compositions, methods, and kits for isolating nucleic acids using surfactants and proteases
DE10120797B4 (en) 2001-04-27 2005-12-22 Genovoxx Gmbh Method for analyzing nucleic acid chains
GB0114853D0 (en) 2001-06-18 2001-08-08 Medical Res Council Happier Mapping
DE10239504A1 (en) 2001-08-29 2003-04-24 Genovoxx Gmbh Parallel sequencing of nucleic acid fragments, useful e.g. for detecting mutations, comprises sequential single-base extension of immobilized fragment-primer complex
WO2003031947A2 (en) 2001-10-04 2003-04-17 Genovoxx Gmbh Device for sequencing nucleic acid molecules
DE10149786B4 (en) 2001-10-09 2013-04-25 Dmitry Cherkasov Surface for studies of populations of single molecules
US20050124022A1 (en) 2001-10-30 2005-06-09 Maithreyan Srinivasan Novel sulfurylase-luciferase fusion proteins and thermostable sulfurylase
US6902921B2 (en) 2001-10-30 2005-06-07 454 Corporation Sulfurylase-luciferase fusion proteins and thermostable sulfurylase
CA2481312A1 (en) 2002-03-08 2003-09-18 The Babraham Institute Tagging and recovery of elements associated with target molecules
DE10214395A1 (en) 2002-03-30 2003-10-23 Dmitri Tcherkassov Parallel sequencing of nucleic acid segments, useful for detecting single-nucleotide polymorphisms, by single-base extensions with labeled nucleotide
US7563600B2 (en) 2002-09-12 2009-07-21 Combimatrix Corporation Microarray synthesis and assembly of gene-length polynucleotides
US7414117B2 (en) 2002-12-26 2008-08-19 Ngk Insulators, Ltd. Nucleotide derivative and DNA microarray
CA2513889A1 (en) 2003-01-29 2004-08-19 454 Corporation Double ended sequencing
CA2528901A1 (en) 2003-07-02 2005-01-20 Dsm Ip Assets B.V. Improved test system for the determination of the presence of an antibiotic in a fluid
GB0316075D0 (en) 2003-07-09 2003-08-13 Molecular Sensing Plc Protease detection assay
DE10356837A1 (en) 2003-12-05 2005-06-30 Dmitry Cherkasov New conjugates useful for modifying nucleic acid chains comprise nucleotide or nucleoside molecules coupled to a label through water-soluble polymer linkers
WO2005044836A2 (en) 2003-11-05 2005-05-19 Genovoxx Gmbh Macromolecular nucleotide compounds and methods for using the same
US7169560B2 (en) 2003-11-12 2007-01-30 Helicos Biosciences Corporation Short cycle methods for sequencing polynucleotides
DE102004009704A1 (en) 2004-02-27 2005-09-15 Dmitry Cherkasov New conjugates useful for labeling nucleic acids comprise a label coupled to nucleotide or nucleoside molecules through polymer linkers
DE102004025696A1 (en) 2004-05-26 2006-02-23 Dmitry Cherkasov Ultra-high parallel analysis process to analyse nucleic acid chains in which a sample solid is bound and substrate material
DE102004025695A1 (en) 2004-05-26 2006-02-23 Dmitry Cherkasov Optical fluorescent parallel process to analyse nucleic acid chains in which a sample solid is bound with a primer-matrix complex
DE102004025744A1 (en) 2004-05-26 2005-12-29 Dmitry Cherkasov Surface of a solid support, useful for multiple parallel analysis of nucleic acids by optical methods, having low non-specific binding of labeled components
DE102004025694A1 (en) 2004-05-26 2006-02-23 Dmitry Cherkasov Optical fluorescent ultra-high parallel process to analyse nucleic acid chains in which a sample solid is bound with a primer-matrix complex
DE102004025746A1 (en) 2004-05-26 2005-12-15 Dmitry Cherkasov Parallel sequencing of nucleic acids by optical methods, by cyclic primer-matrix extension, using a solid phase with reduced non-specific binding of labeled components
DE102004025745A1 (en) 2004-05-26 2005-12-15 Cherkasov, Dmitry Surface of solid phase, useful for parallel, optical analysis of many nucleic acids, has reduced non-specific binding of labeled components
US7361468B2 (en) 2004-07-02 2008-04-22 Affymetrix, Inc. Methods for genotyping polymorphisms in humans
US20060024711A1 (en) 2004-07-02 2006-02-02 Helicos Biosciences Corporation Methods for nucleic acid amplification and sequence determination
US7276720B2 (en) 2004-07-19 2007-10-02 Helicos Biosciences Corporation Apparatus and methods for analyzing samples
US20060012793A1 (en) 2004-07-19 2006-01-19 Helicos Biosciences Corporation Apparatus and methods for analyzing samples
US20060024678A1 (en) 2004-07-28 2006-02-02 Helicos Biosciences Corporation Use of single-stranded nucleic acid binding proteins in sequencing
US7425415B2 (en) 2005-04-06 2008-09-16 City Of Hope Method for detecting methylated CpG islands
JP2006301289A (en) 2005-04-20 2006-11-02 Tokyo Ohka Kogyo Co Ltd Negative resist composition and resist pattern forming method
US20090233291A1 (en) 2005-06-06 2009-09-17 454 Life Sciences Corporation Paired end sequencing
CA2611743C (en) 2005-06-15 2019-12-31 Callida Genomics, Inc. Nucleic acid analysis by forming and tracking aliquoted fragments of a target polynucleotide
DK1899488T3 (en) 2005-07-04 2015-12-21 Univ Erasmus Medical Ct CHROMOSOM CONFORMATIONS "CAPTURE-ON-CHIP" (4C) -ASSAY
US20070172839A1 (en) 2006-01-24 2007-07-26 Smith Douglas R Asymmetrical adapters and methods of use thereof
US8071296B2 (en) 2006-03-13 2011-12-06 Agency For Science, Technology And Research Nucleic acid interaction analysis
WO2007136874A2 (en) 2006-05-18 2007-11-29 President And Fellows Of Harvard College Genomic library construction
US9273309B2 (en) 2006-08-24 2016-03-01 University Of Massachusetts Mapping of genomic interactions
US8278112B2 (en) 2006-12-21 2012-10-02 The Regents Of The University Of California Site-specific installation of methyl-lysine analogues into recombinant histones
ES2634266T3 (en) 2007-01-11 2017-09-27 Erasmus University Medical Center Capture of circular chromosomal conformation (4C)
WO2008097887A2 (en) 2007-02-02 2008-08-14 Emory University Methods of direct genomic selection using high density oligonucleotide microarrays
WO2009052214A2 (en) 2007-10-15 2009-04-23 Complete Genomics, Inc. Sequence analysis using decorated nucleic acids
EP2053132A1 (en) 2007-10-23 2009-04-29 Roche Diagnostics GmbH Enrichment and sequence analysis of geomic regions
US8592150B2 (en) 2007-12-05 2013-11-26 Complete Genomics, Inc. Methods and compositions for long fragment read sequencing
US20090298064A1 (en) 2008-05-29 2009-12-03 Serafim Batzoglou Genomic Sequencing
US8076070B2 (en) * 2008-08-06 2011-12-13 University Of Southern California Genome-wide chromosome conformation capture
WO2010036323A1 (en) * 2008-09-25 2010-04-01 University Of Massachusetts Medical School Method of identifing interactions between genomic loci
US9524369B2 (en) * 2009-06-15 2016-12-20 Complete Genomics, Inc. Processing and analysis of complex nucleic acid sequence data
WO2011106546A1 (en) * 2010-02-25 2011-09-01 Teva Pharmaceutical Industries Ltd. A process for the preparation of rosuvastatin intermediate
US8841075B1 (en) * 2010-04-13 2014-09-23 Cleveland State University Homologous pairing capture assay and related methods and applications
US20110287947A1 (en) 2010-05-18 2011-11-24 University Of Southern California Tethered Conformation Capture
RU2603082C2 (en) 2010-07-09 2016-11-20 Сергентис Б.В. Methods of sequencing of three-dimensional structure of the analyzed genome region
WO2012047726A1 (en) 2010-09-29 2012-04-12 The Broad Institute, Inc. Methods for chromatin immuno-precipitations
US20120197533A1 (en) 2010-10-11 2012-08-02 Complete Genomics, Inc. Identifying rearrangements in a sequenced genome
WO2012150317A1 (en) 2011-05-05 2012-11-08 Institut National De La Sante Et De La Recherche Medicale (Inserm) Linear dna amplification
CA2833165A1 (en) * 2011-04-14 2012-10-18 Complete Genomics, Inc. Processing and analysis of complex nucleic acid sequence data
US8663919B2 (en) 2011-05-18 2014-03-04 Life Technologies Corporation Chromosome conformation analysis
JP6168722B2 (en) 2012-01-31 2017-07-26 ブラザー工業株式会社 Image forming apparatus
US9411930B2 (en) * 2013-02-01 2016-08-09 The Regents Of The University Of California Methods for genome assembly and haplotype phasing
EP2951319B1 (en) 2013-02-01 2021-03-10 The Regents of the University of California Methods for genome assembly and haplotype phasing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells; Brock A. Peters, et al.; Nature; 2012-07-12; Vol. 487 (No. 190); pp. 190-195 *

Also Published As

Publication number Publication date
GB201501001D0 (en) 2015-03-04
CN105121661A (en) 2015-12-02
JP2016506733A (en) 2016-03-07
HK1218433A1 (en) 2017-02-17
JP7028807B2 (en) 2022-03-02
GB2547875B (en) 2017-12-13
US20190080050A1 (en) 2019-03-14
AU2014212152A1 (en) 2015-08-06
AU2020202992B2 (en) 2023-02-23
GB201520448D0 (en) 2016-01-06
AU2020202992A1 (en) 2020-05-28
EP2951319A1 (en) 2015-12-09
EP2951319B1 (en) 2021-03-10
US10089437B2 (en) 2018-10-02
US20220172799A1 (en) 2022-06-02
GB2519255B (en) 2016-01-06
CN108624668A (en) 2018-10-09
JP2019088295A (en) 2019-06-13
CA2899020C (en) 2023-10-03
US20150363550A1 (en) 2015-12-17
CA2899020A1 (en) 2014-08-07
GB2519255A (en) 2015-04-15
JP6466855B2 (en) 2019-02-06
CN108624668B (en) 2022-12-02
JP2022065109A (en) 2022-04-26
GB2547875A (en) 2017-09-06
CA3209385A1 (en) 2014-08-07
EP2951319A4 (en) 2016-12-21
US11081209B2 (en) 2021-08-03
AU2014212152B2 (en) 2020-02-06
WO2014121091A1 (en) 2014-08-07
EP3885446A1 (en) 2021-09-29

Similar Documents

Publication Publication Date Title
US20220172799A1 (en) Methods for genome assembly and haplotype phasing
US12180535B2 (en) Tagging nucleic acids for sequence assembly
CN108368542B (en) Methods for genome assembly, haplotype phasing, and target-independent nucleic acid detection
AU2021232750B2 (en) Methods for labeling DNA fragments to reconstruct physical linkage and phase
HK1218433B (en) Methods for genome assembly and haplotype phasing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant