
CN110556165B - Method for rapidly identifying transgene or gene editing material and insertion site thereof - Google Patents


Info

Publication number
CN110556165B
Authority
CN
China
Prior art keywords
sequence
expression vector
genome
reading
dna
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910863735.XA
Other languages
Chinese (zh)
Other versions
CN110556165A (en
Inventor
舒庆尧
吴三玲
谭瑗瑗
高其康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201910863735.XA (CN110556165B)
Publication of CN110556165A
Priority to US17/594,728 (US20220205034A1)
Priority to PCT/CN2020/110191 (WO2021047363A1)
Priority to EP20863661.3A (EP3919629A4)
Application granted
Publication of CN110556165B
Legal status: Active
Anticipated expiration

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6806 Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10 Ploidy or copy number detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • C12Q1/6895 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for plants, fungi or algae

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Organic Chemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for rapidly identifying transgenic or gene-edited material and its insertion site using whole genome re-sequencing data. The method comprises: extracting genomic DNA; obtaining paired-end sequencing data of the whole genome; judging whether the sequence of the expression vector containing the T-DNA sequence inserted into the plant to be tested is known; judging whether a transgenic or gene editing event exists in the plant to be tested and whether a backbone sequence transfer event has occurred; and determining the insertion site of the T-DNA sequence. By combining bioinformatics analysis, the method identifies whether a transgenic or gene editing event has occurred whether the expression vector is known or unknown. When the expression vector is known, it not only gives, quickly and accurately, the precise location, orientation, copy number and flanking sequence information of the target sequence inserted into the genome, but also identifies whether a backbone sequence, that is, a sequence other than the target sequence, has been inserted into the genome, and locates it in the same way.

Figure 201910863735

Description

Method for rapidly identifying transgene or gene editing material and insertion site thereof
Technical Field
The invention relates to the technical field of bioinformatics, in particular to a method for rapidly identifying a transgenic or gene editing material and an insertion site thereof by using whole genome re-sequencing data.
Background
Transgenic or gene-edited material is material in which a target gene in an animal or plant has been "edited" or modified by transgenic or gene editing technology, according to the principles of genetic engineering and human intent, so as to directionally alter the genetic characteristics of the organism.
The traditional methods for detecting exogenous gene fragments introduced by transgenesis or gene editing mainly include: 1. in situ hybridization, which verifies the integration, integrity and approximate copy number of the transgene; 2. real-time fluorescent quantitative PCR, which accurately calculates the copy number of the transgene; 3. chromosome walking, which confirms the precise insertion site of the transgene; 4. fluorescence in situ hybridization, which shows the chromosomal location of the transgene integration.
However, these methods all suffer from long experimental periods, a single type of identification result, low throughput and similar drawbacks. With the development of sequencing technology, falling sequencing prices and the spread of transgenic and gene editing technology, a method based on whole genome re-sequencing that rapidly identifies the insertion sites of transgenic or gene editing events, and whether and where non-target fragments have been inserted, is becoming more and more important.
The invention patent document with publication number CN103270175B discloses a method and a system for detecting the insertion site of a transgenic exogenous fragment, wherein the method comprises: determining exogenous single-side short fragments and genome single-side short fragments by alignment, and determining the insertion site of the exogenous fragment in the genome sequence from the intersection of the two.
The invention patent document with publication number CN105631242B discloses a method for rapidly identifying transgenic events by using whole genome sequencing data, which combines bioinformatics analysis means to identify transgenic events by mapping sequencing reads to genome and expression vector sequences respectively, and identifies the position, direction, copy number and flanking sequence information of an exogenous fragment inserted into the genome and the homozygous and heterozygous states of a transgenic sample by a three-step positioning method.
However, neither of the above patent documents considers that insertion of an exogenous sequence may cause recombination or disorder of the sequences around the genomic insertion site, which reduces the sensitivity of the method and increases the probability of false negatives and missed detections; nor do they consider the false positives that arise when the vector sequence is partially homologous to the genomic sequence. In addition, those methods map the reads against the wild-type genome sequence of the plant to be tested, which makes the process complex, time-consuming and laborious.
Disclosure of Invention
The invention provides a method for rapidly identifying transgenic or gene-edited material and its insertion site using whole genome re-sequencing data. Combined with bioinformatics analysis, the method identifies whether a transgenic or gene editing event has occurred whether the expression vector is known or unknown. When the expression vector is known, it can quickly and accurately give the precise location, orientation, copy number and flanking sequence information of the target sequence inserted into the genome, and can also identify whether a backbone (skeleton) sequence, that is, a sequence other than the target sequence, has been inserted into the genome, and locate it in the same way.
The specific technical scheme is as follows:
a method for rapidly identifying a transgenic or gene-editing material and its insertion site using whole genome re-sequencing data, comprising:
(1) extracting the genomic DNA of a plant sample to be tested that has been treated by transgenic or gene editing technology;
(2) performing whole genome re-sequencing on the genomic DNA to obtain paired-end sequencing data of the whole genome;
(3) judging whether the sequence of the expression vector containing the T-DNA sequence inserted into the plant to be tested is known, and proceeding according to the result:
if the expression vector sequence inserted into the plant to be tested is known, mapping the paired-end sequencing data to the expression vector containing the T-DNA sequence as a template to obtain a mapping database;
if the expression vector sequence inserted into the plant to be tested is unknown, mapping the paired-end sequencing data to a generic vector library as a template to obtain a mapping database;
(4) counting, respectively, the number of reads matching the expression vector sequence or the generic vector library, the number of reads matching the backbone sequence, the base coverage and average sequencing depth of the T-DNA sequence, the average sequencing depth of the genome of the plant sample to be tested, the sequence length of the expression vector and the read length, and judging, according to the following formulas, whether a transgenic or gene editing event exists in the plant to be tested, whether a backbone sequence transfer event has occurred, and the copy number of the inserted sequence;
the criteria for judging whether a transgenic or gene editing event exists are:
(4-1) expression vector sequence known: VRN ≥ Gdepth/2 + VectorLen/ReadLen, and Tcov ≥ 0.9;
wherein VRN represents the number of reads matching the expression vector sequence; Gdepth represents the average sequencing depth of the genome of the plant to be tested; VectorLen represents the sequence length of the expression vector; ReadLen represents the read length; Tcov represents the base coverage of the T-DNA sequence; Tdepth represents the average sequencing depth of the T-DNA; and Bdepth represents the average sequencing depth of the backbone sequence;
(4-2) expression vector sequence unknown: FRN ≥ (Gdepth/2) × 10;
wherein FRN represents the number of reads matching the generic vector library; Gdepth represents the average sequencing depth of the genome of the plant to be tested;
the criterion for judging whether a backbone sequence transfer event exists is: BRN ≥ Gdepth/3;
wherein BRN represents the number of reads matching the backbone sequence; Gdepth represents the average sequencing depth of the genome of the plant to be tested;
(4-3) determination of copy number: the copy number of the inserted T-DNA is Tdepth/Gdepth; the copy number of the inserted backbone sequence is Bdepth/Gdepth;
wherein Tdepth represents the average sequencing depth of the T-DNA, Gdepth represents the average sequencing depth of the genome of the plant to be tested, and Bdepth represents the average sequencing depth of the backbone sequence;
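The criteria in (4-1) to (4-3) can be sketched in Python as follows. This is only an illustrative sketch: the class name `MappingStats`, the function names and the sample numbers are assumptions made for the example and do not come from the patent; only the inequalities and ratios are taken from the text above.

```python
from dataclasses import dataclass


@dataclass
class MappingStats:
    """Hypothetical container for the statistics counted in step (4)."""
    vrn: int         # VRN: reads matching the expression vector sequence
    brn: int         # BRN: reads matching the backbone sequence
    tcov: float      # Tcov: base coverage of the T-DNA sequence (0 to 1)
    tdepth: float    # Tdepth: average sequencing depth of the T-DNA
    bdepth: float    # Bdepth: average sequencing depth of the backbone
    gdepth: float    # Gdepth: average sequencing depth of the genome
    vector_len: int  # VectorLen: length of the expression vector
    read_len: int    # ReadLen: read length


def has_event_known_vector(s: MappingStats) -> bool:
    # (4-1): VRN >= Gdepth/2 + VectorLen/ReadLen and Tcov >= 0.9
    return s.vrn >= s.gdepth / 2 + s.vector_len / s.read_len and s.tcov >= 0.9


def has_event_unknown_vector(frn: int, gdepth: float) -> bool:
    # (4-2): FRN >= (Gdepth/2) x 10
    return frn >= (gdepth / 2) * 10


def has_backbone_transfer(s: MappingStats) -> bool:
    # Backbone transfer criterion: BRN >= Gdepth/3
    return s.brn >= s.gdepth / 3


def copy_numbers(s: MappingStats) -> tuple:
    # (4-3): T-DNA copies = Tdepth/Gdepth, backbone copies = Bdepth/Gdepth
    return s.tdepth / s.gdepth, s.bdepth / s.gdepth


# Illustrative values only (not measured data).
stats = MappingStats(vrn=5000, brn=1200, tcov=0.98, tdepth=57.0,
                     bdepth=28.5, gdepth=57.0, vector_len=12314, read_len=150)
print(has_event_known_vector(stats), has_backbone_transfer(stats),
      copy_numbers(stats))
```

Keeping the three decisions in separate functions mirrors the way the text states them as independent criteria.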
(5) taking the expression vector containing the T-DNA sequence as the reference sequence, and extracting from the mapping database the paired-end read pairs that meet the following condition: the pair must comprise a single-end read I that completely matches the expression vector sequence and a single-end read II that does not match, or does not completely match, the expression vector sequence; then performing local homology alignment of single-end read II against the expression vector sequence containing the T-DNA sequence and against the wild-type genome sequence of the plant to be tested, respectively;
determining the insertion site of the T-DNA sequence according to the following cases:
(5-1) if one end of single-end read II matches the expression vector sequence and the other end matches the wild-type genome sequence, and there are at least three single-end reads II with the same matching start position on the genome or on the expression vector, judging that these reads carry a candidate insertion site, the insertion site being the genome matching position closest to the position matched on the expression vector sequence;
(5-2) if single-end read II does not match the expression vector sequence but matches the wild-type genome sequence, and there are at least three single-end reads II with the same matching start position on the genome, judging that these reads carry an insertion site, the insertion site being the start position of the match on the genome;
(5-3) if single-end read II matches neither the expression vector sequence nor the wild-type genome sequence, assembling the paired-end read pairs corresponding to such reads II into a fragment according to their sequence overlaps, joining the fragment to the T-DNA sequence or to the inserted expression vector fragment with N, and repeating step (3) and step (5) with the joined sequence as the reference sequence until single-end read II can be matched to the wild-type genome sequence of the plant to be tested and judged by step (5-1) or step (5-2).
As shown in FIG. 1, an expression vector sequence includes both the target T-DNA sequence and sequences other than the target T-DNA, which are defined here as the backbone sequence. Whether it is the T-DNA sequence or the backbone sequence that is inserted into the wild-type genome of the plant to be tested, there are no more than three cases. In each case, for a single-end read I that matches the expression vector sequence, there are two possibilities for the paired single-end read II that does not match, or does not completely match, the expression vector sequence.
The first case is precise insertion: the genomic sequence is joined directly to the inserted sequence with no other sequence in between. Read II is either a read that spans the "breakpoint" between the inserted sequence and the genomic sequence (located by 5-1) or a read that comes entirely from the genomic sequence (located by 5-2).
The second case is insertion accompanied by insertion or perturbation of a small fragment, defined here as less than or equal to 90 bp (the length is adjustable) according to the read length and the sensitivity of the method. Read II is again either a read spanning the "breakpoint" between the inserted sequence and the genomic sequence (located by 5-1) or a read that comes entirely from the genomic sequence (located by 5-2).
The third case is insertion accompanied by recombination or perturbation of a large fragment, defined here as greater than 100 bp (the length is adjustable) according to the read length. Read II is either a read spanning the "breakpoint" between the inserted sequence and the rearranged (perturbed) sequence or a read that comes entirely from the rearranged (perturbed) sequence; both can be used for the alignment in 5-3.
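The routing of single-end read II to sub-steps (5-1) to (5-3), and the requirement of at least three supporting reads, might be sketched as below. The function names, the boolean inputs and the `(chromosome, position)` tuples are hypothetical; the alignment itself is assumed to have been done elsewhere, and only the branching logic and the threshold of three reads come from the text.

```python
from collections import Counter

MIN_SUPPORT = 3  # at least three reads II sharing the same matching start


def classify_read2(hits_vector: bool, hits_genome: bool) -> str:
    """Route read II to a sub-step from its local alignment results.

    hits_vector: part of read II aligns to the expression vector
    hits_genome: part of read II aligns to the wild-type genome
    """
    if hits_vector and hits_genome:
        return "5-1"  # read spans the insert/genome breakpoint
    if hits_genome:
        return "5-2"  # read comes entirely from the genome
    return "5-3"      # no genome match: assemble, extend reference, repeat


def supported_sites(genome_starts, min_support=MIN_SUPPORT):
    """Keep matching start positions backed by enough reads II."""
    counts = Counter(genome_starts)
    return sorted(site for site, n in counts.items() if n >= min_support)


# Hypothetical start positions reported for five reads II.
starts = [("Chr3", 100)] * 3 + [("Chr1", 5)] * 2
print(classify_read2(True, True), supported_sites(starts))
```

Reads that partially match the vector but not the genome are sent to 5-3 here, following the third case described above.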
Further, in step (2), whole genome re-sequencing is performed on the genomic DNA, and quality control is performed on the raw sequencer output to obtain quality-controlled paired-end sequencing data of the whole genome;
the specific quality control steps are:
(2-a) removing reads containing adapters;
(2-b) removing reads in which the proportion of N is greater than 20%;
(2-c) removing the paired-end read pair when low-quality bases at the 3' end of one of its single-end reads exceed one third of the length of that read; a low-quality base is a base with a quality value less than or equal to 20.
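As a rough illustration of filters (2-a) to (2-c), the sketch below applies them to reads held in memory. It is not the Trimmomatic implementation; the function names are hypothetical, adapter detection is reduced to a plain substring test, "low-quality bases at the 3' end" is interpreted as the trailing run of bases with quality ≤ 20, and a pair is dropped when either read fails any filter. All of these are assumptions.

```python
def n_fraction(seq: str) -> float:
    """Proportion of N bases in a read (an empty read counts as all N)."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def low_quality_3prime_run(quals, cutoff=20) -> int:
    """Length of the run of bases with quality <= cutoff at the 3' end."""
    run = 0
    for q in reversed(quals):
        if q > cutoff:
            break
        run += 1
    return run


def read_passes(seq, quals, adapter=None) -> bool:
    if adapter and adapter in seq:                    # (2-a) adapter present
        return False
    if n_fraction(seq) > 0.20:                        # (2-b) more than 20% N
        return False
    if low_quality_3prime_run(quals) > len(seq) / 3:  # (2-c) 3' low quality
        return False
    return True


def pair_passes(read1, read2, adapter=None) -> bool:
    """A pair is kept only if both reads, given as (seq, quals), pass."""
    return (read_passes(*read1, adapter=adapter)
            and read_passes(*read2, adapter=adapter))


good = ("ACGTACGTACGT", [30] * 12)
bad_tail = ("ACGTACGTACGT", [30] * 7 + [10] * 5)
print(pair_passes(good, good), pair_passes(good, bad_tail))
```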
Further, in step (3), the generic vector library is a library of complete vector sequences, as complete as possible, collected from a public database (NCBI, https://www.ncbi.nlm.nih.gov/).
Further, in step (5-1), the segment of single-end read II that matches the expression vector sequence is compared with the wild-type genome sequence of the plant sample to be tested;
if the segment is not homologous to the genome, the candidate insertion site on single-end read II is judged to be a real insertion site; if it is homologous, the candidate insertion site on single-end read II is judged to be a false positive insertion site.
Further, in step (5-2), if single-end read II is the left-end read of the paired-end read pair, the insertion site on single-end read II is taken as the maximum position value; if single-end read II is the right-end read of the paired-end read pair, the insertion site on single-end read II is taken as the minimum position value.
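One possible reading of this rule is sketched below: when read II lies to the left of read I, the insertion must be at the right-hand edge of the genomic alignments, so the largest coordinate is taken, and the reverse holds for a right-end read. The function name and the `(start, end)` interval representation are assumptions made for illustration.

```python
def insertion_site(intervals, read2_is_left: bool) -> int:
    """Pick the insertion coordinate from genomic alignments of reads II.

    intervals: (start, end) genome coordinates of each supporting read II
    read2_is_left: True if read II is the left-end read of its pair
    """
    positions = [p for interval in intervals for p in interval]
    # Left-end reads border the insert on their right: take the maximum.
    # Right-end reads border the insert on their left: take the minimum.
    return max(positions) if read2_is_left else min(positions)


print(insertion_site([(100, 249), (120, 249)], read2_is_left=True))
print(insertion_site([(300, 449), (300, 430)], read2_is_left=False))
```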
Further, in steps (5-1) to (5-3), the matching criteria are: base similarity ≥ 95%, mismatched bases ≤ 5%, and gaps ≤ 5%.
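These thresholds could be checked with a small helper such as the one below. The function name is hypothetical, and the choice of the alignment length as the denominator for all three percentages is an assumption, since the text does not state it. Integer arithmetic is used so that the 95% and 5% boundaries are tested exactly.

```python
def meets_match_criteria(aligned_len: int, mismatches: int, gaps: int) -> bool:
    """Similarity >= 95%, mismatches <= 5%, gaps <= 5% of the alignment."""
    if aligned_len <= 0:
        return False
    matches = aligned_len - mismatches - gaps
    return (matches * 100 >= 95 * aligned_len
            and mismatches * 100 <= 5 * aligned_len
            and gaps * 100 <= 5 * aligned_len)


print(meets_match_criteria(100, 3, 2), meets_match_criteria(100, 4, 2))
```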
Further, the method of the present invention further comprises step (6): designing primer sequences before and after the insertion site obtained in the step (5), and verifying by a PCR technology.
Compared with the prior art, the invention has the following beneficial effects:
(1) the method combines with bioinformatics analysis means, identifies whether the transgene or the gene editing event occurs under the condition that an expression vector is known or unknown, can quickly and accurately give the accurate positioning, direction, copy number and flanking sequence information of the target sequence inserted into the genome under the condition that the expression vector is known, and can also identify whether a skeleton sequence, namely a sequence except the target sequence is inserted into the genome and give the same positioning.
(2) The method takes into account that insertion of an exogenous sequence may cause recombination or disorder of the sequences around the genomic insertion site, which would otherwise reduce sensitivity and raise the probability of false negatives and missed detections; it also takes into account the false positives that arise when the vector sequence is partially homologous to the genomic sequence.
(3) Compared with the traditional experimental method, the method has the advantages of short time consumption, economy, repeatability and the like, can determine the influence of the transgenic or gene editing event on the plant genome in a short time, is favorable for the safety risk evaluation of the transgenic plant, and promotes the application of the transgenic or gene editing technology in agriculture.
Drawings
FIG. 1 is a schematic diagram of the method for rapid identification of transgenic or gene-editing material and its insertion site using whole genome re-sequencing data in accordance with the present invention;
FIG. 2 is a schematic flow chart of the method for rapidly identifying a transgene or gene editing material and its insertion site by using whole genome re-sequencing data according to the present invention.
FIG. 3 is a schematic diagram showing the determination of the presence or absence of a transgene or gene editing event and the occurrence and copy number of a backbone sequence insertion event in example 1.
The outermost layer of the graph is the length scale of the expression vector, the second layer from outside to inside is the structural annotation of the expression vector sequence, the third layer from outside to inside is the sequencing reading coverage area, and the fourth layer is the number of times of sequencing to each base of the expression vector sequence, namely the sequencing depth. From the figure it can be seen that not only the T-DNA sequence but also the backbone sequence is inserted into the genome.
FIG. 4 is a diagram showing the details of the example 1 in which the expression vector is inserted into the genome of a rice plant to be tested.
FIG. 5 is a schematic diagram showing the determination of the presence or absence of a transgene or gene editing event and the occurrence and copy number of a backbone sequence insertion event in example 2.
FIG. 6 is a diagram showing the details of the example 2 in which the expression vector is inserted into the genome of a soybean plant to be tested.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. These examples are intended to illustrate the invention only and are not intended to limit its scope. Experimental procedures not specified in the examples were carried out under conventional conditions or under the conditions described in Molecular Cloning: A Laboratory Manual (J. Sambrook and D.W. Russell, translated by Huang Peitang et al., Science Press, 3rd edition, August 2002).
The terms referred to in the present invention are defined as follows:
Read: each sequence of A, T, C and G bases output by the sequencer.
Paired-end read pair: the genome is randomly sheared by sonication into DNA fragments of different lengths, fragments of a fixed size are recovered, purified and amplified by PCR, adapters are added to both ends of the fragments to fix them to a substrate, and each fragment is sequenced from both ends towards the middle, giving the sequences at the two ends of the same DNA fragment.
Single-end read: either one of the two reads in a paired-end read pair.
Average sequencing depth: the ratio of the total number of bases sequenced within a sequence interval to the length of that interval. For the T-DNA and the backbone sequence, it is the ratio of the total number of bases sequenced within the interval (excluding regions homologous to the genome) to the length of the interval (excluding regions homologous to the genome).
Base coverage of the T-DNA sequence: the ratio of the length of the T-DNA interval covered by sequencing to the length of the T-DNA sequence.
Complete match: for a read 150 bp long, the whole read matches the reference sequence, allowing 2-3 mismatched bases and gaps within 10 bp.
Incomplete match: for a read 150 bp long, no more than 90 bp of the read matches the reference sequence.
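The definitions of average sequencing depth and base coverage can be made concrete with a short sketch. It assumes a per-base depth list for the interval is already available; the function names and the 0-based, end-exclusive `excluded` intervals (used for regions homologous to the genome) are illustrative assumptions.

```python
def average_depth(per_base_depth, excluded=()):
    """Total sequenced bases divided by interval length.

    excluded: (start, end) index ranges, 0-based and end-exclusive,
    that are left out of both the base total and the length.
    """
    skip = set()
    for start, end in excluded:
        skip.update(range(start, end))
    kept = [d for i, d in enumerate(per_base_depth) if i not in skip]
    return sum(kept) / len(kept) if kept else 0.0


def base_coverage(per_base_depth):
    """Fraction of positions covered by at least one read."""
    if not per_base_depth:
        return 0.0
    covered = sum(1 for d in per_base_depth if d > 0)
    return covered / len(per_base_depth)


depth = [10, 20, 30, 40]
print(average_depth(depth), average_depth(depth, excluded=[(0, 2)]))
print(base_coverage([0, 5, 5, 5]))
```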
Example 1
A method for rapidly identifying a gene editing material and an insertion site thereof by using whole genome re-sequencing data comprises the following specific steps:
(1) Experimental material and extraction of genomic DNA from the plant to be tested: the gene-edited rice sample R07 uses pHUN4c12S as the vector, with a total length of 12314 bp; the gene editing target is MTL (LOC_Os03g27610), which encodes a phospholipase.
First, a target sequence (20 nt) was designed in the third exon of the MTL gene according to the MTL gene sequence and ligated into the vector pHUN4c12S to obtain the plasmid pHUN4c12S-MTL; the plasmid was then transformed into competent Agrobacterium cells; finally, pHUN4c12S-MTL was introduced into the japonica rice variety Sn rice No. 1 by Agrobacterium-mediated transformation. After screening, differentiation and rooting, T0 transgenic regenerated seedlings were obtained and grown on to give the T1 plant R07. The plant to be tested was transplanted to the field and grown for about 30 days, leaves were then taken, genomic DNA was extracted with a conventional DNA extraction kit, and whole genome re-sequencing was performed. The rice reference genome sequence is RGAP v7 (http://rice.plantbiology.msu.edu/annotation_pseudo.shtml).
(2) Performing whole genome resequencing on the genome DNA of the plant to be detected, and performing quality control on original off-line data to obtain double-end sequencing data of the whole genome;
the whole genome double-end sequencing is carried out on the gene editing rice to be tested transferred into the vector pHUN4c12S by a high-throughput sequencer (Illumina sequencer, Beijing Baimaike Biotechnology Ltd.), the insertion size is 300bp, the reading length is 150bp, the original off-line data is 48G, and the average sequencing depth is 57 layers.
Performing quality control on original offline data, wherein the software of the quality control is Trimmomatic, the version is 0.36, and the quality control standard is as follows:
(2-a) removing reads that contain adapter sequence;
(2-b) removing reads in which the proportion of N is greater than 20%;
(2-c) removing the paired-end read pair when the low-quality bases (base quality ≤ 20) at the 3' end of a single-end read (i.e., one read of the pair) exceed one-third of the read length; the parameters are set to TRAILING:20 and MINLEN:100, giving the quality-controlled data.
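The three quality-control rules (2-a) to (2-c) can be sketched in Python as follows. This is only an illustrative sketch and not Trimmomatic's implementation: the function names are hypothetical, adapter detection is reduced to a plain substring search, and "low-quality bases at the 3' end" is assumed to mean the run of consecutive trailing bases with quality ≤ 20.

```python
from typing import Optional, Sequence, Tuple

# A read is modelled as (bases, per-base Phred qualities); this layout is an assumption.
Read = Tuple[str, Sequence[int]]


def n_fraction(seq: str) -> float:
    """Fraction of bases called as N (an empty read counts as all N)."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def low_quality_tail(quals: Sequence[int], threshold: int = 20) -> int:
    """Number of consecutive bases at the 3' end with quality <= threshold."""
    count = 0
    for q in reversed(quals):
        if q > threshold:
            break
        count += 1
    return count


def read_passes(read: Read, adapter: Optional[str] = None) -> bool:
    """Apply rules (2-a), (2-b) and (2-c) to one single-end read."""
    seq, quals = read
    if adapter and adapter in seq:              # (2-a) adapter present
        return False
    if n_fraction(seq) > 0.20:                  # (2-b) more than 20% N
        return False
    if low_quality_tail(quals) > len(seq) / 3:  # (2-c) low-quality 3' tail too long
        return False
    return True


def pair_passes(read1: Read, read2: Read, adapter: Optional[str] = None) -> bool:
    """A pair is kept only if both of its reads pass; otherwise the whole pair is removed."""
    return read_passes(read1, adapter) and read_passes(read2, adapter)
```

Dropping the whole pair when either read fails mirrors rule (2-c), which removes the paired-end reads rather than the single failing read.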
(3) Judging whether an expression vector sequence containing a T-DNA sequence inserted into a plant to be detected is known or not, and carrying out the following operations according to a judgment result:
if the sequence of the expression vector inserted into the plant to be detected is known, the expression vector is used as a template, and the double-end sequencing data is mapped with the expression vector to obtain a mapping database;
if the sequence of the expression vector inserted into the plant to be detected is unknown, mapping the double-end sequencing data with the universal vector library as a template to obtain a mapping database;
the judgment result is as follows: the sequence of the expression vector inserted into the plant to be tested is known, so the expression vector pHUN4c12S-MTL containing the T-DNA sequence is used as the template, and the quality-controlled paired-end sequencing data from whole genome re-sequencing are mapped against it to obtain the mapping database.
The specific mapping analysis steps are as follows: the software used is bowtie2, version 2.2.1, with default parameters (which define the mapping standard); the resulting alignment file is in SAM format and is converted with samtools into a binary BAM file (the mapping database). To reduce the interference of sequencing duplicates (two reads that match the expression vector at the same position and in the same direction and are completely identical) with the subsequent localization result, a step that removes duplicate alignments is introduced during format conversion.
The tool used to remove duplicates is the MarkDuplicates.jar module of the picard software, with the parameter REMOVE_DUPLICATES set to true and the other parameters at their defaults.
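The duplicate definition given above (same position, same direction, identical sequence) can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not a description of how picard works internally, and the `Alignment` record and function name are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    """Minimal, hypothetical view of one aligned read."""
    name: str
    ref: str       # reference name, e.g. the expression vector
    pos: int       # leftmost aligned position on the reference
    reverse: bool  # True if aligned to the reverse strand
    seq: str       # read sequence


def remove_duplicates(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Keep the first alignment for each (ref, pos, strand, sequence) key."""
    seen = set()
    kept = []
    for aln in alignments:
        key = (aln.ref, aln.pos, aln.reverse, aln.seq)
        if key in seen:
            continue  # same position, direction and sequence: a duplicate
        seen.add(key)
        kept.append(aln)
    return kept
```

Reads that share a position but differ in strand or sequence are kept, since they do not meet the definition of a duplicate used here.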
(4) Respectively counting, in the mapping database, the number of reads matching the expression vector sequence (VRN), the number of reads matching the backbone sequence (BRN), the base coverage (Tcov) and average sequencing depth (Tdepth) of the T-DNA sequence, the average sequencing depth of the sample genome (Gdepth), the sequence length of the expression vector (VectorLen) and the read length (ReadLen), and judging according to the following formulas whether a transgenic event or gene editing event exists in the plant to be tested and whether a backbone sequence transfer event has occurred;
the criterion for determining whether a transgenic or gene editing event is present is:
VRN ≥ Gdepth/2 + VectorLen/ReadLen, and Tcov ≥ 0.9;
wherein VRN represents the number of reads matching the expression vector sequence; Gdepth represents the average sequencing depth of the genome of the plant to be tested; VectorLen represents the sequence length of the expression vector; ReadLen represents the read length; Tcov represents the base coverage of the T-DNA sequence;
the criterion for judging whether a backbone sequence transfer event is present is: BRN ≥ Gdepth/3;
wherein BRN represents the number of reads matching the backbone sequence; Gdepth represents the average sequencing depth of the genome of the plant to be tested;
the detection results are as follows: the number of reads matching the expression vector sequence (VRN) is 2725, the number of reads matching the backbone sequence (BRN) is 1053, the base coverage of the T-DNA sequence (Tcov) is 0.997, the average sequencing depth of the T-DNA (Tdepth) is 23×, the average sequencing depth of the backbone sequence is 27×, the average sequencing depth of the sample genome (Gdepth) is 57×, the expression vector sequence length (VectorLen) is 12314 bp, and the read length (ReadLen) is 150 bp.
(4-1) VRN ≥ 111 = (57/2 + 12314/150); Tcov ≥ 0.9;
(4-2) BRN ≥ 19;
(4-3) determination of copy number: the copy number of the inserted T-DNA is 23/57 ≈ 1/2; the number of inserted backbone sequences is 27/57 ≈ 1/2;
according to the above criteria, the sample is judged to contain a transgenic or gene editing event and a backbone sequence transfer event, in single copy.
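The step (4) criteria can be written as a few small functions; the sketch below simply restates the formulas above and checks them against the Example 1 figures. The function names are illustrative assumptions and are not part of the method.

```python
def event_threshold(gdepth: float, vector_len: int, read_len: int) -> float:
    """Minimum VRN for an event: Gdepth/2 + VectorLen/ReadLen."""
    return gdepth / 2 + vector_len / read_len


def has_insertion_event(vrn: int, tcov: float, gdepth: float,
                        vector_len: int, read_len: int) -> bool:
    """Transgenic or gene editing event: VRN >= threshold and Tcov >= 0.9."""
    return vrn >= event_threshold(gdepth, vector_len, read_len) and tcov >= 0.9


def has_backbone_transfer(brn: int, gdepth: float) -> bool:
    """Backbone sequence transfer event: BRN >= Gdepth/3."""
    return brn >= gdepth / 3


def depth_ratio(region_depth: float, gdepth: float) -> float:
    """Copy-number estimate: region depth (Tdepth or backbone depth) over Gdepth."""
    return region_depth / gdepth


# Example 1 values: VRN=2725, BRN=1053, Tcov=0.997, Gdepth=57, VectorLen=12314, ReadLen=150
print(event_threshold(57, 12314, 150))                    # VRN threshold before rounding
print(has_insertion_event(2725, 0.997, 57, 12314, 150))
print(has_backbone_transfer(1053, 57))
print(depth_ratio(23, 57), depth_ratio(27, 57))
```

The unrounded VRN threshold is about 110.6, which the text rounds to 111; both depth ratios are close to 1/2.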
(5) Taking the expression vector containing the T-DNA sequence as the reference sequence, qualifying paired-end read pairs are extracted with samtools from the mapping database (BAM file); each pair must comprise a single-ended read I that is perfectly matched to the expression vector sequence and another single-ended read II that is unmatched or not perfectly matched to the expression vector sequence; single-ended read II is then subjected to local homology alignment against the expression vector sequence containing the T-DNA sequence and against the wild-type genome sequence of the plant to be tested, respectively. The software used for local homology alignment is blast, version 2.2.28+; the outfmt parameter is set to 6 so that the output is in tab-separated m8 format for subsequent processing, the num_descriptions or max_target_seqs parameter is set to 1, and the evalue parameter is set to 1e-5.
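The pair-selection rule in step (5) can be sketched as below, using the document's definitions of "perfectly matched" (full-length match, 2-3 mismatches and gaps within 10 bp allowed) and "not perfectly matched" (no more than 90 bp of a 150 bp read matches). The `MateAlignment` record and its fields are hypothetical summaries of an alignment; they are not samtools output fields, and the sketch does not show how they would be derived from a BAM file.

```python
from typing import NamedTuple, Optional, Tuple


class MateAlignment(NamedTuple):
    """Hypothetical summary of how one read of a pair aligns to the vector."""
    aligned_bp: int   # read bases spanned by the alignment; 0 if unmapped
    mismatches: int   # mismatched bases within the alignment
    gap_bp: int       # total gap length within the alignment


def is_perfect(m: MateAlignment, read_len: int = 150,
               max_mismatches: int = 3, max_gap_bp: int = 10) -> bool:
    """Read I: full-length match, tolerating a few mismatches and short gaps."""
    return (m.aligned_bp == read_len
            and m.mismatches <= max_mismatches
            and m.gap_bp <= max_gap_bp)


def is_unmatched_or_partial(m: MateAlignment, max_matched_bp: int = 90) -> bool:
    """Read II: unmapped, or no more than 90 bp matching the vector."""
    return m.aligned_bp <= max_matched_bp


def split_pair(mate1: MateAlignment, mate2: MateAlignment
               ) -> Optional[Tuple[MateAlignment, MateAlignment]]:
    """Return (read I, read II) if the pair qualifies, otherwise None."""
    if is_perfect(mate1) and is_unmatched_or_partial(mate2):
        return mate1, mate2
    if is_perfect(mate2) and is_unmatched_or_partial(mate1):
        return mate2, mate1
    return None
```

Pairs in which both reads match the vector completely, or neither does, return `None`, because they carry no information about a vector-genome junction.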
Determining the insertion site of the T-DNA sequence according to the following different conditions;
(5-1) if one end of single-ended read II matches the expression vector sequence and the other end matches the wild-type genome sequence, and at least three single-ended reads II share the same matching start position on the genome or on the expression vector, single-ended read II is judged to carry a candidate insertion site; the segment of read II that matches the expression vector sequence is then extracted and compared with the wild-type genome of the transgenic sample to be tested; if no homologous sequence exists, the genome matching position closest to the expression vector sequence is taken as the true insertion site; if a homologous sequence exists, the candidate insertion site on single-ended read II is judged to be a false positive. The matching criteria are: base similarity ≥ 95%, mismatched bases ≤ 5%, and gaps ≤ 5%.
(5-2) if single-ended read II cannot be matched to the expression vector sequence but can be matched to the wild-type genome sequence, and at least three single-ended reads II share the same matching start position on the genome, single-ended read II is judged to carry an insertion site, and the insertion site is the start position of the match to the genome; the matching criteria are: base similarity ≥ 95%, mismatched bases ≤ 5%, and gaps ≤ 5%.
If single-ended read II is the left-end read of the paired-end read pair, the insertion site on it is taken as the maximum site value; if single-ended read II is the right-end read of the pair, the insertion site on it is taken as the minimum site value.
(5-3) if single-ended read II can be matched neither to the expression vector sequence nor to the wild-type genome sequence, the paired-end read pair corresponding to single-ended read II is first assembled into one fragment according to the sequence overlap, the fragment is joined to the T-DNA sequence or the inserted expression vector fragment with N, and steps (3) and (5) are repeated with the joined fragment as the reference sequence, until single-ended read II can be matched to the wild-type genome sequence of the plant to be tested and can be judged by step (5-1) or step (5-2).
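The branching among cases (5-1), (5-2) and (5-3), the requirement of at least three supporting reads, and the left/right rule can be sketched in Python as follows. The function names are hypothetical, and reading the left/right rule as the maximum or minimum of read II's matched genome coordinates is an interpretation of the text rather than something it states explicitly.

```python
from collections import Counter
from typing import Iterable, List, Optional


def classify_read2(matches_vector: bool, matches_genome: bool) -> Optional[str]:
    """Map the match pattern of single-ended read II onto cases 5-1, 5-2, 5-3."""
    if matches_vector and matches_genome:
        return "5-1"   # junction read: part vector, part genome
    if matches_genome:
        return "5-2"   # genome only
    if not matches_vector:
        return "5-3"   # neither: needs assembly and re-mapping
    return None        # vector only: not one of the three cases


def supported_sites(start_positions: Iterable[int], min_support: int = 3) -> List[int]:
    """Start positions shared by at least `min_support` reads II."""
    counts = Counter(start_positions)
    return sorted(pos for pos, n in counts.items() if n >= min_support)


def pick_site(genome_coords: Iterable[int], read2_is_left: bool) -> int:
    """Left-end read II: take the maximum coordinate; right-end: the minimum."""
    coords = list(genome_coords)
    return max(coords) if read2_is_left else min(coords)
```

For example, `supported_sites([100, 100, 100, 250, 250])` keeps only position 100, because position 250 is backed by just two reads.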
According to the method of step (5), the judgment results are shown in Table 1, and the genome schematic diagrams of the inserted T-DNA sequences are shown in FIGS. 3 and 4.
As can be seen from FIGS. 3 and 4, this example is the case in step (5-1), and the insertion site is located by single-ended reading II matching with the expression vector sequence and the wild-type genomic sequence.
TABLE 1 final output results of positioning of rice plants to be tested
Figure GDA0003311088100000091
Example 2
(1) Extracting the genomic DNA of the experimental material and the plant to be tested: the soybean transgenic sample L22 uses pSOY19 as the vector (total length 9557 bp); the target gene carried by the vector, g10-epsps, is a glyphosate tolerance gene.
According to the sequence of the g10-epsps gene, the target sequence was ligated into the vector pSOY19 to obtain the pSOY19-g10-epsps expression plasmid vector; the plasmid was then transferred into Agrobacterium competent cells; finally, pSOY19-g10-epsps was introduced into the recipient Huachun No. 3 by Agrobacterium-mediated transformation. The transgenic soybean L22 was obtained after screening, differentiation and rooting.
After the plant had been grown normally to a seedling age of 100 days, leaves were collected, genomic DNA was extracted with a conventional DNA extraction kit, and whole genome re-sequencing was performed. The soybean reference genome is the cultivar Williams 82 sequence Gmax-275_v2.fa, downloaded from https://phytozome.jgi.doe.gov/pz/portal.html.
(2) Performing whole genome resequencing on the genome DNA of the plant to be detected, and performing quality control on original off-line data to obtain double-end sequencing data of the whole genome;
whole-genome paired-end sequencing of the transgenic soybean sample to be tested, transformed with the vector pSOY19, was carried out on a high-throughput sequencer (Illumina XTen, Novogene Co., Ltd.) with an insert size of 300 bp and a read length of 150 bp; 52G of raw off-machine data was obtained, at an average sequencing depth of 23×.
Performing quality control on original offline data, wherein the software of the quality control is Trimmomatic, the version is 0.36, and the quality control standard is as follows:
(2-a) removing reads that contain adapter sequence;
(2-b) removing reads in which the proportion of N is greater than 20%;
(2-c) removing the paired-end read pair when the low-quality bases (base quality ≤ 20) at the 3' end of a single-end read (i.e., one read of the pair) exceed one-third of the read length; the parameters are set to TRAILING:20 and MINLEN:100, giving the quality-controlled data.
(3) Judging whether an expression vector sequence containing a T-DNA sequence inserted into a plant to be detected is known or not, and carrying out the following operations according to a judgment result:
if the sequence of the expression vector inserted into the plant to be detected is known, the expression vector is used as a template, and the double-end sequencing data is mapped with the expression vector to obtain a mapping database;
if the sequence of the expression vector inserted into the plant to be detected is unknown, the generic vector library is used as a template, and the generic vector library and the double-end sequencing data are mapped to obtain a mapping database;
the judgment result is as follows: the sequence of an expression vector inserted into a plant to be detected is known, so that the expression vector pSOY19-g10-epsps containing a T-DNA sequence is used as a template, and double-end sequencing data obtained after whole genome re-sequencing quality control is subjected to mapping analysis to obtain a mapping database.
The specific mapping analysis steps are as follows: the software used is bowtie2, version 2.2.1, with default parameters (which define the mapping standard); the resulting alignment file is in SAM format and is converted with samtools into a binary BAM file (the mapping database). To reduce the interference of sequencing duplicates (two reads that match the expression vector at the same position and in the same direction and are completely identical) with the subsequent localization result, a step that removes duplicate alignments is introduced during format conversion.
The tool used to remove duplicates is the MarkDuplicates.jar module of the picard software, with the parameter REMOVE_DUPLICATES set to true and the other parameters at their defaults.
(4) Respectively counting, in the mapping database, the number of reads matching the expression vector sequence (VRN), the number of reads matching the backbone sequence (BRN), the base coverage (Tcov) and average sequencing depth (Tdepth) of the T-DNA sequence, the average sequencing depth of the sample genome (Gdepth), the sequence length of the expression vector (VectorLen) and the read length (ReadLen), and judging according to the following formulas whether a transgenic event or gene editing event exists in the plant to be tested and whether a backbone sequence transfer event has occurred;
the criterion for determining whether a transgenic or gene editing event is present is:
VRN ≥ Gdepth/2 + VectorLen/ReadLen, and Tcov ≥ 0.9;
wherein VRN represents the number of reads matching the expression vector sequence; Gdepth represents the average sequencing depth of the genome of the plant to be tested; VectorLen represents the sequence length of the expression vector; ReadLen represents the read length; Tcov represents the base coverage of the T-DNA sequence;
the detection results are as follows: the number of reads matching the expression vector sequence (VRN) is 293, the number of reads matching the backbone sequence (BRN) is 0, the base coverage of the T-DNA sequence (Tcov) is 0.986, the average sequencing depth of the T-DNA (Tdepth) is 14×, the average sequencing depth of the sample genome (Gdepth) is 23×, the sequence length of the expression vector (VectorLen) is 9557 bp, and the read length (ReadLen) is 150 bp.
(4-1) VRN ≥ 75 = (23/2 + 9557/150); Tcov ≥ 0.9;
(4-2) BRN < 8;
(4-3) determination of copy number: the copy number of the inserted T-DNA is 14/23 ≈ 1/2, i.e., the T-DNA is single copy;
according to the above criteria, the sample is judged to contain a transgenic or gene editing event in single copy, with no backbone sequence transfer event.
(5) Taking the expression vector containing the T-DNA sequence as the reference sequence, qualifying paired-end read pairs are extracted with samtools from the mapping database (BAM file); each pair must comprise a single-ended read I that is perfectly matched to the expression vector sequence and another single-ended read II that is unmatched or not perfectly matched to the expression vector sequence; single-ended read II is then subjected to local homology alignment against the expression vector sequence containing the T-DNA sequence and against the wild-type genome sequence of the plant to be tested, respectively. The software used for local homology alignment is blast, version 2.2.28+; the outfmt parameter is set to 6 so that the output is in tab-separated m8 format for subsequent processing, the num_descriptions or max_target_seqs parameter is set to 1, and the evalue parameter is set to 1e-5.
Determining the insertion site of the T-DNA sequence according to the following different conditions;
(5-1) if one end of single-ended read II matches the expression vector sequence and the other end matches the wild-type genome sequence, and at least three single-ended reads II share the same matching start position on the genome or on the expression vector, single-ended read II is judged to carry a candidate insertion site; the segment of read II that matches the expression vector sequence is then extracted and compared with the wild-type genome of the transgenic sample to be tested; if no homologous sequence exists, the genome matching position closest to the expression vector sequence is taken as the true insertion site; if a homologous sequence exists, the candidate insertion site on single-ended read II is judged to be a false positive. The matching criteria are: base similarity ≥ 95%, mismatched bases ≤ 5%, and gaps ≤ 5%.
(5-2) if single-ended read II cannot be matched to the expression vector sequence but can be matched to the wild-type genome sequence, and at least three single-ended reads II share the same matching start position on the genome, single-ended read II is judged to carry an insertion site, and the insertion site is the start position of the match to the genome; the matching criteria are: base similarity ≥ 95%, mismatched bases ≤ 5%, and gaps ≤ 5%.
If single-ended read II is the left-end read of the paired-end read pair, the insertion site on it is taken as the maximum site value; if single-ended read II is the right-end read of the pair, the insertion site on it is taken as the minimum site value.
(5-3) if single-ended read II can be matched neither to the expression vector sequence nor to the wild-type genome sequence, the paired-end read pair corresponding to single-ended read II is first assembled into one fragment according to the sequence overlap, the fragment is joined to the T-DNA sequence or the inserted expression vector fragment with N, and steps (3) and (5) are repeated with the joined fragment as the reference sequence, until single-ended read II can be matched to the wild-type genome sequence of the plant to be tested and can be judged by step (5-1) or step (5-2).
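The matching criteria used in steps (5-1) and (5-2) can be applied to blast tabular output (outfmt 6) with a few lines of Python. The sketch assumes the default 12-column layout of that format; because the format reports the number of gap openings rather than total gap length, the gap criterion is approximated here as gap openings divided by alignment length. The function names are illustrative.

```python
from typing import Dict, Union

# Default column order of blast tabular output (outfmt 6).
COLUMNS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore")
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}

Hit = Dict[str, Union[str, int, float]]


def parse_hit(line: str) -> Hit:
    """Parse one tab-separated outfmt 6 line into a typed dictionary."""
    hit: Hit = {}
    for name, value in zip(COLUMNS, line.rstrip("\n").split("\t")):
        if name in INT_COLUMNS:
            hit[name] = int(value)
        elif name in FLOAT_COLUMNS:
            hit[name] = float(value)
        else:
            hit[name] = value
    return hit


def hit_passes(hit: Hit) -> bool:
    """Similarity >= 95%, mismatches <= 5%, gaps <= 5% of the alignment length."""
    length = hit["length"]
    return (hit["pident"] >= 95.0
            and hit["mismatch"] / length <= 0.05
            and hit["gapopen"] / length <= 0.05)
```

Hits that pass this filter would then feed the case analysis of steps (5-1) to (5-3), using `sstart` and `send` as the matched coordinates.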
According to the method of step (5), the judgment results are shown in Table 2, and the genome schematic diagrams of the inserted T-DNA sequences are shown in FIGS. 5 and 6.
As can be seen from FIGS. 5 and 6, this embodiment is the case of steps (5-1) and (5-3), and the insertion site is located by steps (5-1) and (5-3).
TABLE 2 final output results of positioning soybean plants to be tested
Figure GDA0003311088100000121
Example 3
(1) Extracting the genomic DNA of the experimental material and the plant to be tested: taking the rice gene editing sample R14 as an example, pHUN4c12S is the vector (total length 12314 bp); the gene editing target gene is OsLCT1 (LOC_Os06g38120), which encodes a low-affinity ion transporter.
First, a target sequence (CCCGGCAGCGCACCGATGTTGCT) was designed in the third exon of the gene according to the OsLCT1 gene sequence and ligated into the vector pHUN4c12S to obtain the pHUN4c12S-LCT1 plasmid; the plasmid was then transferred into Agrobacterium competent cells; finally, pHUN4c12S-LCT1 was introduced into the japonica rice variety sweet japonica No. 1 by Agrobacterium-mediated transformation. T0 transgenic regenerated seedlings were obtained after screening, differentiation and rooting, and were grown on to obtain the T1 plant R14. The plant to be tested was grown in the field for about 30 days, leaves were collected, and genomic DNA was extracted with a conventional kit; a hygromycin resistance gene marker assay showed that the plant does not contain the hygromycin resistance gene, and whole genome re-sequencing was then performed. The rice reference genome sequence is RGAP v7 (http://rice.plantbiology.msu.edu/annotation_pseudo.shtml).
(2) Performing whole genome resequencing on the genome DNA of the plant to be detected, and performing quality control on original off-line data to obtain double-end sequencing data of the whole genome;
whole-genome paired-end sequencing of the gene-edited rice to be tested was carried out on a high-throughput sequencer (Illumina sequencer, Beijing Baimaike Biotechnology Ltd.) with an insert size of 300 bp and a read length of 150 bp; 32G of raw off-machine data was obtained, at an average sequencing depth of 25×.
Performing quality control on original offline data, wherein the software of the quality control is Trimmomatic, the version is 0.36, and the quality control standard is as follows:
(2-a) removing reads that contain adapter sequence;
(2-b) removing reads in which the proportion of N is greater than 20%;
(2-c) removing the paired-end read pair when the low-quality bases (base quality ≤ 20) at the 3' end of a single-end read (i.e., one read of the pair) exceed one-third of the read length; the parameters are set to TRAILING:20 and MINLEN:100, giving the quality-controlled data.
(3) Judging whether an expression vector sequence containing a T-DNA sequence inserted into a plant to be detected is known or not, and carrying out the following operations according to a judgment result:
if the sequence of the expression vector inserted into the plant to be detected is known, the expression vector is used as a template, and the double-end sequencing data is mapped with the expression vector to obtain a mapping database;
if the sequence of the expression vector inserted into the plant to be detected is unknown, mapping the double-end sequencing data with the universal vector library as a template to obtain a mapping database;
here, the analysis is first carried out as if the expression vector and the T-DNA sequence were unknown, and is then verified by analysis with the known expression vector and T-DNA sequence. The judgment result is as follows: the sequence of the expression vector inserted into the plant to be tested is treated as unknown, so the pan-vector library is used as the template, and the quality-controlled paired-end sequencing data from whole genome re-sequencing are mapped against it to obtain the mapping database.
The specific mapping analysis steps are as follows: the software used is bowtie2, version 2.2.1, with default parameters (which define the mapping standard); the resulting alignment file is in SAM format and is converted with samtools into a binary BAM file (the mapping database). To reduce the interference of sequencing duplicates (two reads that match the reference at the same position and in the same direction and are completely identical) with the subsequent localization result, a step that removes duplicate alignments is introduced during format conversion.
The tool used to remove duplicates is the MarkDuplicates.jar module of the picard software, with the parameter REMOVE_DUPLICATES set to true and the other parameters at their defaults.
(4) Respectively counting the number of read sequences (FRN) matched with the generic vector library in the mapping database, the average sequencing depth (Gdepth) of a sample genome and the reading length (ReadLen), and judging whether a transgenic event or a gene editing event exists in a plant to be detected according to the following formula;
the criterion for determining whether a transgenic event or gene editing event is present (pan-vector library) is: FRN ≥ (Gdepth/2) × 10;
the detection results are as follows: the number of reads aligned to the pan-vector library (FRN) is 710, the average sequencing depth of the sample genome (Gdepth) is 25×, and the read length (ReadLen) is 150 bp.
(4-a) FRN ≥ 125 = (25/2) × 10;
According to the above criterion, the sample is judged to contain a transgenic or gene editing event.
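When the vector sequence is unknown, the decision reduces to a single threshold on FRN. The sketch below restates that formula and checks it with the Example 3 figures; the function names are illustrative only.

```python
def library_threshold(gdepth: float) -> float:
    """Minimum FRN for an event when mapping to the pan-vector library: (Gdepth/2) x 10."""
    return gdepth / 2 * 10


def has_event_unknown_vector(frn: int, gdepth: float) -> bool:
    """Transgenic or gene editing event: FRN >= (Gdepth/2) x 10."""
    return frn >= library_threshold(gdepth)


# Example 3 values: FRN=710, Gdepth=25
print(library_threshold(25))              # 125.0
print(has_event_unknown_vector(710, 25))  # True
```

Unlike the known-vector criterion, this test uses neither the vector length nor the T-DNA coverage, since neither is available when the vector is unknown.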
Using the methods of Example 1 and Example 3 in combination, a total of 78 rice negative control materials, 16 rice gene editing materials, 5 transgenic rice materials and 9 transgenic soybean materials were tested. Among them, 16 materials showed small perturbations near the genome insertion site (case 5-2), 5 transgenic materials showed large perturbations or rearrangements near the genome insertion site (case 5-3), and 15 materials showed backbone sequence insertion. Of the 45 insertion events in total, all except those in 3 soybean samples, which were not verified by experimental PCR, were consistent with our analysis results, an accuracy of 93.7%. The method was 100% accurate in determining whether a transgenic or gene editing event is present.

Claims (6)

1. A method for rapidly identifying transgenic or gene editing materials and insertion sites thereof, characterized by comprising:
(1) extracting the genomic DNA of a plant sample to be tested that has been treated by transgenic technology or gene editing technology;
(2) performing whole genome re-sequencing on the genomic DNA to obtain paired-end sequencing data of the whole genome;
(3) judging whether the sequence of the expression vector containing the T-DNA sequence inserted into the plant to be tested is known, and performing the following operations according to the judgment result:
if the sequence of the expression vector inserted into the plant to be tested is known, the expression vector containing the T-DNA sequence is used as a template and mapped with the paired-end sequencing data to obtain a mapping database;
if the sequence of the expression vector inserted into the plant to be tested is unknown, the pan-vector library is used as a template and mapped with the paired-end sequencing data to obtain a mapping database;
(4) respectively counting, in the mapping database, the number of reads matching the expression vector sequence or the number of reads matching the pan-vector library, the number of reads matching the backbone sequence, the base coverage and average sequencing depth of the T-DNA sequence, the average sequencing depth of the genome of the plant sample to be tested, the sequence length of the expression vector, and the read length, and judging according to the following formulas whether a transgenic event or gene editing event exists in the plant to be tested, whether a backbone sequence transfer event has occurred, and the copy number of the inserted sequence;
the criteria for determining whether a transgenic event or gene editing event exists are:
(4-1) expression vector sequence known: VRN ≥ Gdepth/2 + VectorLen/ReadLen, and Tcov ≥ 0.9;
wherein VRN represents the number of reads matching the expression vector sequence; Gdepth represents the average sequencing depth of the genome of the plant to be tested; VectorLen represents the sequence length of the expression vector; ReadLen represents the read length; Tcov represents the base coverage of the T-DNA sequence, Tdepth represents the average sequencing depth of the T-DNA, and Bdepth represents the average sequencing depth of the backbone sequence;
(4-2) expression vector sequence unknown: FRN ≥ (Gdepth/2) × 10;
wherein FRN represents the number of reads matching the pan-vector library; Gdepth represents the average sequencing depth of the genome of the plant to be tested;
the criterion for judging whether a backbone sequence transfer event exists is: BRN ≥ Gdepth/3;
wherein BRN represents the number of reads matching the backbone sequence; Gdepth represents the average sequencing depth of the genome of the plant to be tested;
(4-3) determination of copy number: inserted T-DNA copy number = Tdepth/Gdepth; inserted backbone sequence number = Bdepth/Gdepth;
Tdepth represents the average sequencing depth of the T-DNA, Gdepth represents the average sequencing depth of the genome of the plant to be tested, and Bdepth represents the average sequencing depth of the backbone sequence;
(5) taking the expression vector containing the T-DNA sequence as the reference sequence, extracting qualifying paired-end read pairs according to the data of the mapping database, each pair necessarily comprising a single-end read I that is perfectly matched to the expression vector sequence and another single-end read II that is unmatched or not perfectly matched to the expression vector sequence; then subjecting single-end read II to local homology alignment against the expression vector sequence containing the T-DNA sequence and against the wild-type genome sequence of the plant to be tested, respectively;
determining the insertion site of the T-DNA sequence according to the following different cases;
(5-1) if one end of single-end read II matches the expression vector sequence and the other end matches the wild-type genome sequence, and at least three single-end reads II share the same matching start position on the wild-type genome or on the expression vector, single-end read II is judged to carry a candidate insertion site, and the genome matching position closest to the expression vector sequence is the insertion site;
(5-2) if single-end read II cannot be matched to the expression vector sequence but can be matched to the wild-type genome sequence, and at least three single-end reads II share the same matching start position on the wild-type genome, single-end read II is judged to carry an insertion site, and the insertion site is the start position of the match to the genome;
(5-3) if single-end read II can be matched neither to the expression vector sequence nor to the wild-type genome sequence, the paired-end read pair corresponding to single-end read II is first assembled into one fragment according to the sequence overlap, the fragment is joined to the T-DNA sequence or the inserted expression vector fragment with N, and steps (3) and (5) are repeated with the joined fragment as the reference sequence, until single-end read II can be matched to the wild-type genome sequence of the plant to be tested and can be judged by step (5-1) or (5-2);
the N represents an unknown A, T, C or G base or a sequence of any permutation and combination thereof;
the read refers to each sequence, composed of an arrangement of "ATCG", obtained by a sequencer; the paired-end read pair refers to the sequences from the two ends of the same DNA fragment, obtained by randomly breaking the genome into DNA fragments of different lengths with a sonicator, recovering and purifying DNA fragments of a fixed size by PCR, fixing these fragments on a substrate by adding adapters to both ends, and then sequencing from both ends towards the middle of the fragment; the single-end read refers, relative to a paired-end read pair, to either read of the pair; perfectly matched means that, for a read of 150 bp length, the sequence can be fully matched to the reference sequence, allowing 2-3 mismatched bases and gaps within 10 bp; not perfectly matched means that, for a read of 150 bp length, no more than 90 bp of sequence can match the reference sequence.
2. The method for rapidly identifying transgenic or gene editing materials and insertion sites thereof according to claim 1, characterized in that, in step (2), whole genome re-sequencing is performed on the genomic DNA and quality control is performed on the raw off-machine data to obtain quality-controlled paired-end sequencing data of the whole genome;
the specific quality control method is as follows:
(2-a) removing reads that contain adapters;
(2-b) removing reads in which the proportion of N is greater than 20%;
(2-c) when the number of low-quality bases at the 3' end of a single-end read exceeds one-third of the length of that read, removing the paired-end reads to which the single-end read belongs; the low-quality bases are bases with quality ≤ 20.
3. 
The method for rapid identification of transgene or gene editing material and insertion site thereof according to claim 1, wherein in step (5-1), the single-end reading II is matched with the expression vector sequence. The interval sequence is compared with the wild-type genome sequence of the plant sample to be tested; 若不是同源序列,则判定该单端读序II上所具有的候选插入位点为真实存在的插入位点;若是同源序列,则判定该单端读序II上所具有的候选插入位点为假阳性插入位点。If it is not a homologous sequence, it is determined that the candidate insertion site on the single-ended reading sequence II is a real insertion site; if it is a homologous sequence, it is determined that the single-ended reading sequence II has a candidate insertion site. Dots are false positive insertion sites. 4.如权利要求1所述的数据快速鉴定转基因或基因编辑材料及其插入位点的方法,其特征在于,步骤(5-2)中,若单端读序II为双端读序对的左端序列,则判定的单端读序II上所具有的插入位点为位点最大值;若单端读序II为双端读序对的右端序列,则判定的单端读序II上所具有的插入位点为位点最小值。4. The method for rapid identification of transgene or gene editing material and insertion site thereof according to claim 1, wherein in step (5-2), if single-end reading II is a pair of double-end readings If the single-end reading sequence II is the right-end sequence of the double-end reading sequence, then the single-end reading sequence II is determined to have an insertion site on the left-end sequence. The insertion site with is the site minimum. 5.如权利要求1所述的数据快速鉴定转基因或基因编辑材料及其插入位点的方法,其特征在于,步骤(5-1)~(5-3)中,所述匹配的标准为:碱基相似度≥95%,错配碱基≤5,空位≤5。5. The method for rapidly identifying transgenic or gene editing materials and their insertion sites with data according to claim 1, wherein in steps (5-1) to (5-3), the matching criteria are: Base similarity ≥95%, mismatched bases ≤5, and gaps ≤5. 6.如权利要求1所述的快速鉴定转基因或基因编辑材料及其插入位点的方法,其特征在于,还包括步骤(6):根据步骤(5)得到的插入位点设计前后引物序列,通过PCR技术进行验证。6. 
the method for rapid identification of transgene or gene editing material and insertion site thereof as claimed in claim 1, is characterized in that, also comprises step (6): the primer sequence before and after the insertion site design obtained according to step (5), Validated by PCR technique.
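For illustration only, the numerical criteria of step (4) in claim 1 and the matching criteria of claim 5 can be written as a short Python sketch. The class name, function names, default values and example figures are assumptions introduced here and do not appear in the claims; the sketch only restates the inequalities and ratios given above and does not perform read mapping or depth calculation.

```python
from dataclasses import dataclass


@dataclass
class SampleStats:
    """Summary statistics for one sample; fields mirror the symbols of claim 1."""
    gdepth: float          # Gdepth: average sequencing depth of the plant genome
    vrn: int = 0           # VRN: reads matching the expression vector sequence
    frn: int = 0           # FRN: reads matching the pan-vector library
    brn: int = 0           # BRN: reads matching the backbone sequence
    vector_len: int = 0    # VectorLen: length of the expression vector sequence
    read_len: int = 150    # ReadLen: read length (150 bp is an assumed default)
    tcov: float = 0.0      # Tcov: base coverage of the T-DNA sequence (0..1)
    tdepth: float = 0.0    # Tdepth: average sequencing depth of the T-DNA
    bdepth: float = 0.0    # Bdepth: average sequencing depth of the backbone


def has_insertion_event(s: SampleStats, vector_known: bool = True) -> bool:
    """Criterion (4-1) if the vector sequence is known, otherwise (4-2)."""
    if vector_known:
        threshold = s.gdepth / 2 + s.vector_len / s.read_len
        return s.vrn >= threshold and s.tcov >= 0.9
    return s.frn >= (s.gdepth / 2) * 10


def has_backbone_transfer(s: SampleStats) -> bool:
    """Backbone transfer criterion: BRN >= Gdepth / 3."""
    return s.brn >= s.gdepth / 3


def copy_numbers(s: SampleStats) -> tuple[float, float]:
    """Criterion (4-3): (Tdepth / Gdepth, Bdepth / Gdepth)."""
    return s.tdepth / s.gdepth, s.bdepth / s.gdepth


def is_match(similarity: float, mismatches: int, gaps: int) -> bool:
    """Claim 5: similarity >= 95 %, at most 5 mismatches and at most 5 gaps."""
    return similarity >= 0.95 and mismatches <= 5 and gaps <= 5


if __name__ == "__main__":
    # Hypothetical sample: 30x genome depth, 12 kb vector, 150 bp reads.
    sample = SampleStats(gdepth=30.0, vrn=400, brn=2, vector_len=12000,
                         tcov=0.98, tdepth=30.0, bdepth=0.0)
    print(has_insertion_event(sample))    # threshold is 15 + 80 = 95 reads
    print(has_backbone_transfer(sample))  # threshold is 10 reads
    print(copy_numbers(sample))
```

With these hypothetical values the vector-known threshold is 95 reads, so 400 matching reads with 98% T-DNA coverage satisfy criterion (4-1), while 2 backbone reads fall below the backbone threshold of 10.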
CN201910863735.XA 2019-09-12 2019-09-12 Method for rapidly identifying transgene or gene editing material and insertion site thereof Active CN110556165B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201910863735.XA CN110556165B (en) 2019-09-12 2019-09-12 Method for rapidly identifying transgene or gene editing material and insertion site thereof
US17/594,728 US20220205034A1 (en) 2019-09-12 2020-08-20 Method for quickly identifying clean transgenic or gene-edited plants and insertion sites by using whole genome re-sequencing data
PCT/CN2020/110191 WO2021047363A1 (en) 2019-09-12 2020-08-20 Method for using whole genome re-sequencing data to quickly identify transgenic or gene editing material and insertion sites thereof
EP20863661.3A EP3919629A4 (en) 2019-09-12 2020-08-20 METHOD FOR USING WHOLE GENOME RE-SEQUENCING DATA TO RAPIDLY IDENTIFY TRANSGENIC OR GENOME EDITING MATERIAL AND ITS INSERTION SITES

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910863735.XA CN110556165B (en) 2019-09-12 2019-09-12 Method for rapidly identifying transgene or gene editing material and insertion site thereof

Publications (2)

Publication Number Publication Date
CN110556165A CN110556165A (en) 2019-12-10
CN110556165B CN110556165B (en) 2022-03-18

Family

ID=68740136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910863735.XA Active CN110556165B (en) 2019-09-12 2019-09-12 Method for rapidly identifying transgene or gene editing material and insertion site thereof

Country Status (4)

Country Link
US (1) US20220205034A1 (en)
EP (1) EP3919629A4 (en)
CN (1) CN110556165B (en)
WO (1) WO2021047363A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110556165B (en) * 2019-09-12 2022-03-18 浙江大学 Method for rapidly identifying transgene or gene editing material and insertion site thereof
CN113957130B (en) * 2021-09-27 2023-12-22 江汉大学 Method for identifying transgenic event based on high-throughput sequencing and probe enrichment
CN115620810B (en) * 2022-12-19 2023-03-28 北京诺禾致源科技股份有限公司 Method and device for detecting exogenous insertion information based on third-generation gene sequencing data
CN116469468B (en) * 2023-06-12 2023-09-19 北京齐禾生科生物科技有限公司 Editing gene carrier residue detection method and system based on Bayes model
CN117106875B (en) * 2023-10-23 2024-02-06 中国科学院昆明植物研究所 Method for estimating plant genome size and/or repeatability based on low-depth sequencing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103270175A (en) * 2011-01-20 2013-08-28 深圳华大基因科技有限公司 Method and system for detecting the insertion sites of transgenic foreign fragments
CN103725773A (en) * 2012-10-10 2014-04-16 杭州普望生物技术有限公司 Technology for identifying HBV (hepatitis B virus) gene integration sites and recurrently targeted genes in host genome
CN105631242A (en) * 2015-12-25 2016-06-01 中国农业大学 Method for identifying transgenic events through whole genome sequencing data
CN107630079A (en) * 2016-07-19 2018-01-26 中国农业科学院作物科学研究所 The method for determining the sequence of exogenous dna fragment, insertion position and marginal sequence in genetically modified organism
CN108034706A (en) * 2018-01-16 2018-05-15 浙江大学 The method that transgenic line insertion point is quickly determined using weight sequencing technologies

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011143231A2 (en) * 2010-05-10 2011-11-17 The Broad Institute High throughput paired-end sequencing of large-insert clone libraries
WO2013144663A2 (en) * 2012-03-27 2013-10-03 Rudjer Boskovic Institute Method of determination of neutral dna sequences in the genome, system for targeting sequences obtained thereby and methods for use thereof
MX364309B (en) * 2013-04-17 2019-04-22 Pioneer Hi Bred Int Methods for characterizing dna sequence composition in a genome.
CN108073791B (en) * 2017-12-12 2019-02-05 元码基因科技(苏州)有限公司 Method based on two generation sequencing datas detection target gene structure variation
CN110556165B (en) * 2019-09-12 2022-03-18 浙江大学 Method for rapidly identifying transgene or gene editing material and insertion site thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
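"Fine mapping and candidate gene analysis of the rice purple pericarp gene Pb"; 王彩霞, 舒庆尧; 《科学通报》; 20071130; full text *
"Obtaining T-DNA insertion sites of transgenic plants using re-sequencing technology"; 徐纪明, 胡晗, 毛文轩, 毛传澡; 《遗传》; 20180726; full text *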

Also Published As

Publication number Publication date
EP3919629A1 (en) 2021-12-08
WO2021047363A1 (en) 2021-03-18
US20220205034A1 (en) 2022-06-30
CN110556165A (en) 2019-12-10
EP3919629A4 (en) 2022-06-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant