CN110556165B

CN110556165B - Method for rapidly identifying transgene or gene editing material and insertion site thereof

Info

Publication number: CN110556165B
Application number: CN201910863735.XA
Authority: CN
Inventors: 舒庆尧; 吴三玲; 谭瑗瑗; 高其康
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-09-12
Filing date: 2019-09-12
Publication date: 2022-03-18
Anticipated expiration: 2039-09-12
Also published as: EP3919629A1; WO2021047363A1; US20220205034A1; CN110556165A; EP3919629A4

Abstract

The invention discloses a method for rapidly identifying transgenic or gene editing materials and their insertion sites by using whole genome resequencing data. The method includes: extracting genomic DNA; obtaining paired-end sequencing data of the whole genome; Whether the sequence of the expression vector containing the T-DNA sequence is known; determine whether there is a transgenic event or gene editing event in the plant to be tested, and whether there is a backbone sequence transfer event; Determine the insertion site of the T-DNA sequence; the present invention The method combines bioinformatics analysis methods to identify whether there is a transgene or gene editing event when the expression vector is known or unknown. When the expression vector is known, not only can the target sequence be inserted into the genome quickly and accurately. Accurate positioning, orientation, copy number, and flanking sequence information can also identify whether there is a backbone sequence, that is, a sequence other than the target sequence, inserted into the genome, and give the same positioning.

Description

Method for rapidly identifying transgene or gene editing material and insertion site thereof

Technical Field

The invention relates to the technical field of bioinformatics, in particular to a method for rapidly identifying a transgenic or gene editing material and an insertion site thereof by using whole genome re-sequencing data.

Background

A method for quickly identifying transgenic material or gene editing material and its insertion site is a technique for directionally modifying the genetic characteristics of living beings by using transgenic technology or gene editing technology to "edit" or modify the target gene in animal or plant body according to the gene engineering principle and the will of people.

The traditional detection method for exogenous gene segments of transgenosis or gene editing mainly comprises the following steps: 1. in situ hybridization: verifying the integration, integrity and approximate gene copy number of the transgene; 2. real-time fluorescent quantitative PCR: accurately calculating the copy number of the transgenes; 3. chromosome walking technology: confirm the precise insertion site of the transgene; 4. fluorescence in situ hybridization: the chromosomal location of the transgene integration was demonstrated.

However, these methods all have the disadvantages of long experimental period, single identification content, low flux and the like. With the development of sequencing technology and the optimization of price, the popularization of transgenic or gene editing technology, and a method for rapidly identifying the insertion sites of transgenic or gene editing events and whether non-target fragments are inserted and the insertion sites based on a whole genome re-sequencing method become more and more important.

The invention patent document with the publication number of CN103270175B discloses a method and a system for detecting the insertion site of a transgenic exogenous fragment, wherein the method comprises the following steps: and determining the exogenous single-side short fragment and the genome single-side short fragment through comparison, and determining the insertion site of the exogenous fragment in the genome sequence according to the intersection of the exogenous single-side short fragment and the genome single-side short fragment.

The invention patent document with publication number CN105631242B discloses a method for rapidly identifying transgenic events by using whole genome sequencing data, which combines bioinformatics analysis means to identify transgenic events by mapping sequencing reads to genome and expression vector sequences respectively, and identifies the position, direction, copy number and flanking sequence information of an exogenous fragment inserted into the genome and the homozygous and heterozygous states of a transgenic sample by a three-step positioning method.

However, none of the above patent documents considers that the insertion of an exogenous sequence causes recombination or disorder of sequences around the genomic insertion site, thereby reducing the sensitivity of the method and increasing the probability of false negatives and missed detections; also, the case where false positives arise when the vector sequence is partially homologous to the genomic sequence is not considered. In addition, the method uses the mapping of the reading sequence and the wild genome sequence of the plant to be detected, and has the disadvantages of complex process, time consumption and labor consumption.

Disclosure of Invention

The invention provides a method for rapidly identifying a transgene or gene editing material and an insertion site thereof by using whole genome re-sequencing data, which is combined with a bioinformatics analysis means, identifies whether a transgene or gene editing event occurs under the condition that an expression vector is known or unknown, can rapidly and accurately give out accurate positioning, direction, copy number and flanking sequence information of a target sequence inserted into a genome under the condition that the expression vector is known, and can also identify whether a skeleton sequence, namely a sequence except the target sequence is inserted into the genome and give out the same positioning.

The specific technical scheme is as follows:

a method for rapidly identifying a transgenic or gene-editing material and its insertion site using whole genome re-sequencing data, comprising:

(1) extracting the genome DNA of the plant sample to be detected after being processed by a transgenic technology or a gene editing technology;

(2) performing whole genome re-sequencing on the genome DNA to obtain double-end sequencing data of a whole genome;

(3) judging whether an expression vector sequence containing a T-DNA sequence inserted into a plant to be detected is known or not, and carrying out the following operations according to a judgment result:

if the sequence of the expression vector inserted into the plant to be detected is known, mapping the expression vector containing the T-DNA sequence as a template with the double-end sequencing data to obtain a mapping database;

if the expression vector sequence inserted into the plant to be detected is unknown, mapping the expression vector sequence with the double-end sequencing data by using a generic vector library as a template to obtain a mapping database;

(4) respectively counting the number of reads of a matched expression vector sequence or the number of reads of a matched generic vector library, the number of reads of a matched skeleton sequence, the base coverage rate and the average sequencing depth of a T-DNA sequence, the average sequencing depth of a plant sample genome to be detected, the sequence length of an expression vector and the length of the reads, and judging whether a transgenic event or a gene editing event exists in the plant to be detected, whether a skeleton sequence transfer event occurs and the copy number of an inserted sequence according to the following formula;

criteria for determining the presence or absence of a transgenic or gene editing event are:

(4-1) known expression vector sequences: VRN is more than or equal to Gdepth/2+ VectorLen/ReadLen, and Tcov is more than or equal to 0.9;

wherein VRN represents the number of reads matching the expression vector sequence; gdepth represents the average sequencing depth of the genome of the plant to be detected; VectorLen denotes the sequence length of the expression vector; ReadLen represents the read length; tcov represents the base coverage of the T-DNA sequence, Tdepth represents the average sequencing depth of the T-DNA, and Bdepth represents the average sequencing depth of the framework sequence;

(4-2) unknown sequence of expression vector: FRN is not less than (Gdepth/2) multiplied by 10;

wherein FRN represents the number of reads that match the generic library; gdepth represents the average sequencing depth of the genome of the plant to be detected;

the standard for judging whether the skeleton sequence transfer event exists is as follows: BRN is more than or equal to Gdepth/3;

wherein BRN represents the number of reads matching the backbone sequence; gdepth represents the average sequencing depth of the genome of the plant to be detected;

(4-3) determination of copy number: the copy number of the inserted T-DNA is Tdepth/Gdepth; the number of inserted skeleton sequences is Bdepth/Gdepth;

tdepth represents the average sequencing depth of the T-DNA, Gdepth represents the average sequencing depth of the genome of a plant to be detected, and Bdepth represents the average sequencing depth of a skeleton sequence;

(5) taking an expression vector containing a T-DNA sequence as a reference sequence, and extracting a double-ended read pair meeting the conditions according to the data of a mapping database, wherein the double-ended read pair must comprise a single-ended read I completely matched with the expression vector sequence and a single-ended read II not matched with or not completely matched with the expression vector sequence; then, carrying out local homology comparison analysis on the single-ended reading sequence II, an expression vector sequence containing a T-DNA sequence and a wild type genome sequence of a plant to be detected respectively;

determining the insertion site of the T-DNA sequence according to the following different conditions;

(5-1) if one end of the single-ended reading sequence II can be matched with the sequence of the expression vector, the other end of the single-ended reading sequence II can be matched with the sequence of the wild genome, and at least three single-ended reading sequences II with the same matching starting position of the genome or the expression vector exist, judging that the single-ended reading sequences II have candidate insertion sites, and the genome matching position closest to the sequence position of the expression vector is the insertion site.

(5-2) if the single-ended reading sequence II can not be matched with the sequence of the expression vector but can be matched with the sequence of the wild genome, and at least three single-ended reading sequences II with the same matching initial position of the genome exist, judging that the single-ended reading sequences II have insertion sites, and the insertion sites are the initial positions matched with the genome.

(5-3) if the single-ended reading sequence II can not be matched with the sequence of the expression vector and can not be matched with the sequence of the wild genome, assembling and splicing the corresponding double-ended reading sequence pair of the single-ended reading sequence II into a fragment according to the sequence overlapping characteristic, connecting the fragment with a T-DNA sequence or an inserted expression vector fragment by using N, and repeating the step (3) and the step (5) by using the fragment as a reference sequence until the single-ended reading sequence II can be matched with the wild genome sequence of the plant to be detected and can be judged by the step (5-1) or the step (5-2).

As shown in FIG. 1, an expression vector sequence includes both the T-DNA sequence of interest and sequences other than the T-DNA of interest, which we define as a backbone sequence. Whether the T-DNA sequence or the framework sequence is inserted into the wild type genome of the plant to be tested, there are no more than three cases: the first is precise insertion, i.e., the genomic sequence is tightly linked to the inserted sequence, no other sequences in between, and for one single-ended read I that matches the expression vector sequence, there are two possibilities for the other single-ended read II that does not match or does not match completely to the expression vector sequence, one is a read sequence that spans the "breakpoint" between the inserted sequence and the genomic sequence (that can be located by 5-1), and one is a read sequence that comes entirely from the genomic sequence (that can be located by 5-2); the second is insertion with concomitant insertion or perturbation of the mini-fragment, where the mini-fragment is defined as less than or equal to 90bp (where the length is adjustable) depending on the length of the read and the sensitivity of the method, where single-ended read II also exists with two possibilities, one is a read sequence spanning the "breakpoint" between the inserted sequence and the genomic sequence (that can be located by 5-1), and one is a read sequence entirely from the genomic sequence (that can be located by 5-2); the third is insertion and accompanying recombination or perturbation of large fragments, defined here as greater than 100bp (here adjustable in length) in terms of read length, where single-ended read II also presents two possibilities, one for read sequences spanning the "breakpoint" between the inserted sequence and the rearranged (perturbed) sequence, and one completely from the rearranged (perturbed) sequence, both of which can be used for 5-3 alignment.

Further, in the step (2), performing whole genome resequencing on the genome DNA, and performing quality control on original offline data to obtain quality-controlled double-end sequencing data of the whole genome;

the quality control method comprises the following specific steps:

(2-a) removing the read sequence with the linker;

(2-b) reads with a proportion of N removed greater than 20%;

(2-c) removing the double-ended read corresponding to the single-ended read when the 3' end of the single-ended read contains more low-quality bases than one third of the length proportion of the single-ended read; the low-quality base is a base with the mass less than or equal to 20.

Further, in step (3), the generic database is a database of complete sequence of vectors collected from the public database (NCBI, https:// www.ncbi.nlm.nih.gov /) as complete as possible.

Further, in the step (5-1), comparing an interval sequence matched with the expression vector sequence in the single-end reading sequence II with a wild type genome sequence of a plant sample to be detected;

if not, determining that the candidate insertion site on the single-ended reading sequence II is a real insertion site; if the sequence is homologous, the candidate insertion site on the single-ended reading sequence II is determined to be a false positive insertion site.

Further, in the step (5-2), if the single-ended reading sequence II is the left-end sequence of the double-ended reading sequence pair, the insertion site on the single-ended reading sequence II is determined to be the maximum site; and if the single-ended reading sequence II is the right-end sequence of the double-ended reading sequence pair, judging that the insertion site on the single-ended reading sequence II is the site minimum value.

Further, in the steps (5-1) to (5-3), the matching criteria are: the similarity of the basic groups is more than or equal to 95 percent, the mismatched basic groups are less than or equal to 5 percent, and the vacancy is less than or equal to 5 percent.

Further, the method of the present invention further comprises step (6): designing primer sequences before and after the insertion site obtained in the step (5), and verifying by a PCR technology.

Compared with the prior art, the invention has the following beneficial effects:

(1) the method combines with bioinformatics analysis means, identifies whether the transgene or the gene editing event occurs under the condition that an expression vector is known or unknown, can quickly and accurately give the accurate positioning, direction, copy number and flanking sequence information of the target sequence inserted into the genome under the condition that the expression vector is known, and can also identify whether a skeleton sequence, namely a sequence except the target sequence is inserted into the genome and give the same positioning.

(2) The method not only considers that the peripheral sequence of the genome insertion site is recombined or disordered due to the insertion of the exogenous sequence, but also reduces the sensitivity of the method and improves the false negative and missed detection probability; it is also considered that when the vector sequence has partial homology with the genomic sequence, false positives arise.

(3) Compared with the traditional experimental method, the method has the advantages of short time consumption, economy, repeatability and the like, can determine the influence of the transgenic or gene editing event on the plant genome in a short time, is favorable for the safety risk evaluation of the transgenic plant, and promotes the application of the transgenic or gene editing technology in agriculture.

Drawings

FIG. 1 is a schematic diagram of the method for rapid identification of transgenic or gene-editing material and its insertion site using whole genome re-sequencing data in accordance with the present invention;

wherein an expression vector sequence includes both the target T-DNA sequence and sequences other than the target T-DNA, we define as framework sequences. Whether the T-DNA sequence or the framework sequence is inserted into the wild type genome of the plant to be tested, there are no more than three cases: the first is precise insertion, i.e., the genomic sequence is tightly linked to the inserted sequence, no other sequences in between, and for one single-ended read I that matches the expression vector sequence, there are two possibilities for the other single-ended read II that does not match or does not match completely to the expression vector sequence, one is a read sequence that spans the "breakpoint" between the inserted sequence and the genomic sequence (that can be located by 5-1), and one is a read sequence that comes entirely from the genomic sequence (that can be located by 5-2); the second is insertion with concomitant insertion or perturbation of the mini-fragment, where the mini-fragment is defined as less than or equal to 90bp (where the length is adjustable) depending on the length of the read and the sensitivity of the method, where single-ended read II also exists with two possibilities, one is a read sequence spanning the "breakpoint" between the inserted sequence and the genomic sequence (that can be located by 5-1), and one is a read sequence entirely from the genomic sequence (that can be located by 5-2); the third is insertion and accompanying recombination or perturbation of large fragments, defined here as greater than 100bp (here adjustable in length) in terms of read length, where single-ended read II also presents two possibilities, one for read sequences spanning the "breakpoint" between the inserted sequence and the rearranged (perturbed) sequence, and one completely from the rearranged (perturbed) sequence, both of which can be used for 5-3 alignment.

FIG. 2 is a schematic flow chart of the method for rapidly identifying a transgene or gene editing material and its insertion site by using whole genome re-sequencing data according to the present invention.

FIG. 3 is a schematic diagram showing the determination of the presence or absence of a transgene or gene editing event and the occurrence and copy number of a backbone sequence insertion event in example 1.

The outermost layer of the graph is the length scale of the expression vector, the second layer from outside to inside is the structural annotation of the expression vector sequence, the third layer from outside to inside is the sequencing reading coverage area, and the fourth layer is the number of times of sequencing to each base of the expression vector sequence, namely the sequencing depth. From the figure it can be seen that not only the T-DNA sequence but also the backbone sequence is inserted into the genome.

FIG. 4 is a diagram showing the details of the example 1 in which the expression vector is inserted into the genome of a rice plant to be tested.

FIG. 5 is a schematic diagram showing the determination of the presence or absence of a transgene or gene editing event and the occurrence and copy number of a backbone sequence insertion event in example 2.

FIG. 6 is a diagram showing the details of the example 2 in which the expression vector is inserted into the genome of a soybean plant to be tested.

Detailed Description

The invention will be further illustrated with reference to the following specific examples. These examples are intended to illustrate the invention only and are not intended to limit the scope of the invention. The experimental procedures not specified in the examples were carried out under conventional conditions or under the conditions described in molecular cloning protocols (J. SammBruke and D.W. Lassel, Huang Peyer, published by scientific Press, third edition 8.2002).

The terms referred to in the present invention are defined as follows:

reading the sequence: each sequence assembled from the "ATCG" array was obtained by a sequencer.

Double-ended read pair: randomly breaking a genome into DNA fragments with different lengths by an ultrasonic instrument, recovering and purifying the DNA fragments with fixed sizes by PCR, fixing the DNA fragments on a substrate by adding joints at two ends of the DNA fragments, and sequencing from two ends to the middle of the fragments to obtain sequences at two ends of the same DNA fragment.

Single-ended reading: with respect to paired-end read pairs, it is meant any read in a pair of paired-end read pairs.

Average sequencing depth: the ratio of the total amount of bases obtained by sequencing the sequence interval to the length of the sequence interval is determined as the ratio of the total amount of bases obtained by sequencing the sequence interval (excluding the region homologous to the genome) to the length of the sequence interval (excluding the region homologous to the genome) for the average sequencing depth of the T-DNA and the backbone sequence.

Base coverage of T-DNA sequence: the ratio of the length of the T-DNA interval to the length of the T-DNA sequence is determined by sequencing.

And (3) complete matching: for reads of 150bp length, the sequence can be perfectly matched to the reference sequence, allowing mismatches of 2-3 bases, gaps within 10 bp.

Not perfectly matched: for reads of 150bp length, there is no more than 90bp of sequence that can match the reference sequence.

Example 1

A method for rapidly identifying a gene editing material and an insertion site thereof by using whole genome re-sequencing data comprises the following specific steps:

(1) extracting the genomic DNA of the experimental material and the plant to be detected: the rice gene editing sample R07 takes pHUN4c12S as a vector, the total length is 12314bp, the gene editing target gene is MTL (LOC _ Os03g27610), and the gene encodes phospholipase.

Firstly, designing a target sequence (20nt) in the third exon of the MTL gene according to the sequence of the MTL gene, and connecting the target sequence into a vector pHUN4c12S to obtain a pHUN4c12S-MTL plasmid; then, the plasmid is transferred into an agrobacterium-infected state; and finally, transferring pHUN4c12S-MTL into the japonica rice variety Sn rice No. 1 by an agrobacterium-mediated transgenic method. Screening, differentiating and rooting to obtain T₀And (3) continuously planting the transgenic regenerated seedlings to obtain T1 plants R07, transplanting the plants to be tested to a field for planting for about 30 days, taking leaves, extracting the genome DNA of the plants to be tested by using a conventional DNA extraction kit, and carrying out whole-gene recombination sequencing. The rice reference genomic sequence is RGAP v7(htt p:// rice. plant biology. msu. edu/annotation _ pseudo. shtml).

(2) Performing whole genome resequencing on the genome DNA of the plant to be detected, and performing quality control on original off-line data to obtain double-end sequencing data of the whole genome;

the whole genome double-end sequencing is carried out on the gene editing rice to be tested transferred into the vector pHUN4c12S by a high-throughput sequencer (Illumina sequencer, Beijing Baimaike Biotechnology Ltd.), the insertion size is 300bp, the reading length is 150bp, the original off-line data is 48G, and the average sequencing depth is 57 layers.

Performing quality control on original offline data, wherein the software of the quality control is Trimmomatic, the version is 0.36, and the quality control standard is as follows:

(2-a) removing the read with the linker;

(2-b) removing reads with a proportion of N in reads greater than 20%;

(2-c) removing the paired-end reads when the 3' end of the single-end read (i.e., one of the paired-end reads) contains more than one-third of the low-quality (base-quality ═ 20) bases in the ratio of the length of the read; setting parameters as tailing: 20, MINLENEN: and 100, obtaining quality data after quality control.

if the sequence of the expression vector inserted into the plant to be detected is known, the expression vector is used as a template, and the double-end sequencing data is mapped with the expression vector to obtain a mapping database;

if the sequence of the expression vector inserted into the plant to be detected is unknown, mapping the double-end sequencing data with the universal vector library as a template to obtain a mapping database;

the judgment result is as follows: the sequence of the expression vector inserted into the plant to be detected is known, so that the expression vector pHUN4c12S-MTL containing a T-DNA sequence is used as a template, and double-end sequencing data obtained after whole genome re-sequencing quality control is subjected to mapping analysis with the double-end sequencing data to obtain a mapping database.

The specific mapping analysis steps are as follows: the software used is bowtie2, the version is 2.2.1, the default parameters (the definite mapping standard) are adopted, the obtained comparison result information file is in the sam format, the sam format file is converted into a binary bam format file (namely a mapping database), and the conversion tool is samtools. In order to reduce the interference of the sequencing repeated sequence (two sequences which are completely identical in the same position and the same direction of the matched expression vector) on the subsequent positioning result, a step of removing the repeated sequence alignment result is introduced in the format conversion process.

The tool used to REMOVE duplicate sequences is the subroutine module MarkDuplicates. jar in the picard software, with the parameter REMOVE _ DUPLICATES set to true and the other parameters default.

(4) Respectively counting the number of reads (VRN) of a matched expression vector sequence, the number of reads (BRN) of a matched skeleton sequence, the base coverage rate (TCov) and the average sequencing depth (Tdepth) of a T-DNA sequence, the average sequencing depth (Gdepth) of a sample genome, the sequence length (VectorLen) of an expression vector and the reading length (ReadLen) in the mapping database, and judging whether a transgenic event or a gene editing event exists in a plant to be detected and whether a skeleton sequence is transferred into the event according to the following formula;

VRN is more than or equal to Gdepth/2+ VectorLen/ReadLen, and Tcov is more than or equal to 0.9;

wherein VRN represents the number of reads matching the expression vector sequence; gdepth represents the average sequencing depth of the genome of the plant to be detected; VectorLen denotes the sequence length of the expression vector; ReadLen represents the read length; tcov represents the base coverage of the T-DNA sequence;

the detection result is as follows: the number of reads (VRN) of the matched expression vector sequence is 2725, the number of reads (BRN) of the matched skeleton sequence is 1053, the base coverage rate (TCov) of the T-DNA sequence is 0.997, the average sequencing depth (Tdepth) of the T-DNA is 23 layers, the average sequencing depth of the skeleton sequence is 27 layers, the average sequencing depth (Gdepth) of the sample genome is 57 layers, the length (VectorLen) of the expression vector sequence is 12314bp, and the length (ReadLen) of the reads is 150 bp.

(4-1)VRN≥111＝(57/2+12314/150)；Tcov≥0.9；

(4-2)BRN≥19；

(4-3) determination of copy number: the copy number of the inserted T-DNA is 23/57 and 1/2; the number of inserted skeleton sequences is 27/57 ≈ 1/2;

and judging whether the sample has a transgenic or gene editing event or has a skeleton sequence transfer event and single copy according to the judgment standard.

(5) Taking an expression vector containing a T-DNA sequence as a reference sequence, extracting qualified doublets by samtools software according to data of a mapping database (bam format file)An end-read pair, which must comprise a single-ended read I that is perfectly matched to the expression vector sequence and another single-ended read II that is not matched or not perfectly matched to the expression vector sequence; and then carrying out local homology comparison analysis on the single-ended reading sequence II, the expression vector sequence containing the T-DNA sequence and the wild type genome sequence of the plant to be detected respectively. The software used for local homology alignment analysis was blast, version of blast was 2.2.28+, and the outfmt parameter was set to 6 to ensure that the output was m8 format separated by tab bonds, for subsequent processing, num _ descriptions or max _ target _ seqs parameter was set to 1, evalue parameter was set to 1e^-5。

(5-1) if one end of the single-ended reading sequence II can be matched with an expression vector sequence, the other end of the single-ended reading sequence II can be matched with a wild type genome sequence, and at least three single-ended reading sequences II with the same matching initial position of the genome or the expression vector exist, judging that the single-ended reading sequence II has a candidate insertion site, extracting a reading sequence II interval section matched with the expression vector sequence, comparing the reading sequence II interval section with the wild type genome of a transgenic sample to be detected, and if no homologous sequence exists, setting the genome matching site closest to the expression vector sequence position as a real insertion site; if a homologous sequence exists, the candidate insertion site on the single-ended reading II is determined to be a false positive insertion site. The matching criteria were: the similarity of the basic groups is more than or equal to 95 percent, the mismatched basic groups are less than or equal to 5 percent, and the vacancy is less than or equal to 5 percent.

(5-2) if the single-ended reading sequence II can not be matched with the sequence of the expression vector but can be matched with the sequence of the wild genome, and at least three single-ended reading sequences II with the same matching initial position of the genome exist, judging that the single-ended reading sequences II have insertion sites, and the insertion sites are the initial positions matched with the genome; the matching criteria were: the similarity of the basic groups is more than or equal to 95 percent, the mismatched basic groups are less than or equal to 5 percent, and the vacancy is less than or equal to 5 percent.

If the single-ended reading sequence II is the left-end sequence of the double-ended reading sequence pair, the insertion site on the single-ended reading sequence II is judged to be the maximum site; and if the single-ended reading sequence II is the right-end sequence of the double-ended reading sequence pair, judging that the insertion site on the single-ended reading sequence II is the site minimum value.

According to the method of step (5), the judgment results are shown in Table 1, and the genome schematic diagrams of the inserted T-DNA sequences are shown in FIGS. 3 and 4.

As can be seen from FIGS. 3 and 4, this example is the case in step (5-1), and the insertion site is located by single-ended reading II matching with the expression vector sequence and the wild-type genomic sequence.

TABLE 1 final output results of positioning of rice plants to be tested

Example 2

(1) Extracting the genomic DNA of the experimental material and the plant to be detected: the soybean transgenic sample L22 takes pSOY19 as a vector, the total length is 9557bp, and the target gene g10-epsps contained in the vector has the function of a glyphosate tolerance gene.

According to the sequence of the g10-epsps gene, the target sequence is connected into a vector pSOY19 to obtain a pSOY19-g10-epsps expression plasmid vector; then, the plasmid is transferred into an agrobacterium-infected state; finally, pSOY19-g10-epsps was transferred into recipient Huachun No. 3 by Agrobacterium-mediated transgenic method. After screening, differentiation and rooting, transgenic soybean L22 is obtained.

After the strain is normally cultured to 100 days of seedling age, taking leaves, extracting the genome DNA of the plant to be detected by using a conventional DNA extraction kit, and performing whole genome re-sequencing. The soybean reference genome is a cultivar Williams 82 sequence Gmax-275 _ v2.fa, and the download address is https:// phytozome.jgi.doe.gov/pz/portal.html.

the transgenic soybean sample to be tested, which is transferred into the vector pSOY19, is subjected to whole genome double-end sequencing by a high-throughput sequencer (Illumina XTen, Nuo and induced science and technology Limited), the insertion size is 300bp, the reading length is 150bp, the obtained original off-machine data is 52G, and the average sequencing depth is 23 layers.

(2-a) removing the read with the linker;

(2-b) removing reads with a proportion of N in reads greater than 20%;

if the sequence of the expression vector inserted into the plant to be detected is unknown, the generic vector library is used as a template, and the generic vector library and the double-end sequencing data are mapped to obtain a mapping database;

the judgment result is as follows: the sequence of an expression vector inserted into a plant to be detected is known, so that the expression vector pSOY19-g10-epsps containing a T-DNA sequence is used as a template, and double-end sequencing data obtained after whole genome re-sequencing quality control is subjected to mapping analysis to obtain a mapping database.

the detection result is as follows: the number of reads (VRN) of the matched expression vector sequence is 293, the number of reads (BRN) of the matched skeleton sequence is 0, the base coverage rate (TCov) of the T-DNA sequence is 0.986, the average sequencing depth (Tdepth) of the T-DNA is 14 layers, the average sequencing depth (Gdepth) of the sample genome is 23 layers, the sequence length (VectorLen) of the expression vector is 9557bp, and the reading length (ReadLen) is 150 bp.

(4-1)VRN≥75＝(23/2+9557/150)；Tcov≥0.9；

(4-2)BRN＜8；

(4-3) determination of copy number: the copy number of the inserted T-DNA is 14/23 and 1/2, and the T-DNA is single copy;

and according to the judgment standard, judging that the sample has a transgenic or gene editing event, and has single copy but no skeleton sequence transfer event.

(5) Taking an expression vector containing a T-DNA sequence as a reference sequence, extracting a double-ended reading pair meeting the conditions through samtools software according to data of a mapping database (bam format file), wherein the double-ended reading pair must comprise a single-ended reading I completely matched with the expression vector sequence and a single-ended reading II which is not matched with or is not completely matched with the expression vector sequence; and then carrying out local homology comparison analysis on the single-ended reading sequence II, the expression vector sequence containing the T-DNA sequence and the wild type genome sequence of the plant to be detected respectively. The software used for local homology alignment analysis was blast, version of blast was 2.2.28+, and the outfmt parameter was set to 6 to ensure that the output was m8 format separated by tab bonds, for subsequent processing, num _ descriptions or max _ target _ seqs parameter was set to 1, evalue parameter was set to 1e^-5。

According to the method of step (5), the judgment results are shown in Table 2, and the genome schematic diagrams of the inserted T-DNA sequences are shown in FIGS. 5 and 6.

As can be seen from FIGS. 5 and 6, this embodiment is the case of steps (5-1) and (5-3), and the insertion site is located by steps (5-1) and (5-3).

TABLE 2 final output results of positioning soybean plants to be tested

Example 3

(1) Extracting the genomic DNA of the experimental material and the plant to be detected: taking a rice gene editing sample R14 as an example, pHUN4c12S is a vector, the total length is 12314bp, a gene editing target gene is OsLCT1(LOC _ Os06g38120), and the gene encodes a low-affinity ion transporter.

Firstly, designing a target sequence in the third exon of the gene according to the sequence of the OsLCT1 gene (CCCGGCAGCGCACCGATGTTGCT), and connecting the target sequence into a vector pHUN4c12S to obtain a pHUN4c12S-LCT1 plasmid; then, the plasmid is transferred into an agrobacterium-infected state; and finally, the pHUN4c12S-LCT1 is transferred into the japonica rice variety sweet japonica No. 1 by an agrobacterium-mediated transgenic method. And screening, differentiating and rooting to obtain T0 transgenic regenerated seedling. Continuously planting to obtain a T1 plant R14, planting the plant to be detected in a field for about 30 days, taking leaves, extracting the genome DNA of the plant to be detected by using a conventional kit, detecting that the plant does not contain a hygromycin resistance gene by using a hygromycin resistance gene marker, and then performing whole genome re-sequencing. The rice reference genomic sequence was RGAP v7(http:// rice. plant. biology. msu. edu/annotation _ pseudo. shtml).

the whole genome double-end sequencing is carried out on the gene editing rice to be tested by a high-throughput sequencer (Illumina sequencer, beijing Baimaike biotechnology limited), the insertion size is 300bp, the reading length is 150bp, the original off-line data 32G is obtained, and the average sequencing depth is 25 layers.

(2-a) removing the read with the linker;

(2-b) removing reads with a proportion of N in reads greater than 20%;

here, the expression vector and the T-DNA sequence were analyzed without knowing them, and then verified by analysis using the known expression vector and T-DNA sequence. The judgment result is as follows: the sequence of the expression vector inserted into the plant to be detected is unknown, so that the pan-vector library is used as a template, and the double-end sequencing data obtained after the whole genome re-sequencing quality control is subjected to mapping analysis with the double-end sequencing data to obtain a mapping database.

(4) Respectively counting the number of read sequences (FRN) matched with the generic vector library in the mapping database, the average sequencing depth (Gdepth) of a sample genome and the reading length (ReadLen), and judging whether a transgenic event or a gene editing event exists in a plant to be detected according to the following formula;

the criteria for determining the presence or absence of a transgenic event or gene editing event are (generic vector library): FRN is more than or equal to Gdepth/2 x 10;

the detection result is as follows: the number of reads (FRN) aligned to the generic library was 710, the average sequencing depth (Gdepth) of the sample genome was 25 layers, and the read length (ReadLen) was 150 bp.

(4-a)FRN≥125＝(25/2)×10；

And judging whether the sample has a transgene or a gene editing event according to the judgment standard.

By adopting the method combining the embodiment 1 and the embodiment 3, 78 rice negative control materials, 16 rice gene editing materials, 5 transgenic rice materials and 9 transgenic soybean materials are tested in total, wherein when 16 materials are inserted, the situation that small disturbance exists near the genome insertion site is 5-2, the situation that 5 transgenic materials have large disturbance or rearrangement exists near the genome insertion site is 5-3, and the skeleton sequence insertion phenomenon exists in 15 materials. The total of 45 insertion events, except 3 soybean samples, which were not verified by experimental PCR, were consistent with our analysis results, with an accuracy of 93.7%. The method is 100% accurate in determining whether a transgene or gene editing event is present.

Claims

1. a method for rapidly identifying transgenic or gene editing materials and insertion sites thereof, comprising:

(1) Extracting the genomic DNA of the plant sample to be tested after being processed by transgenic technology or gene editing technology;

(2) performing whole-genome re-sequencing on the genomic DNA to obtain paired-end sequencing data of the whole genome;

(3) Judging whether the expression vector sequence containing the T-DNA sequence inserted in the plant to be tested is known, and performing the following operations according to the judgment result:

If the sequence of the expression vector inserted in the plant to be tested is known, the expression vector containing the T-DNA sequence is used as a template to map with the paired-end sequencing data to obtain a mapping database;

If the sequence of the expression vector inserted in the plant to be tested is unknown, the pan-vector library is used as a template to map with the paired-end sequencing data to obtain a mapping database;

(4) respectively counting the number of reads matching the expression vector sequence or the number of reads matching the pan-vector library in the mapping database, the number of reads matching the backbone sequence, the base coverage of the T-DNA sequence and the average sequencing depth, The average sequencing depth of the sample genome of the plant to be tested, the sequence length of the expression vector, and the reading length, and the following formulas are used to determine whether there is a transgenic event or gene editing event in the plant to be tested, whether there is a backbone sequence transfer event, and insertion. sequence copy number;

The criteria for determining whether there is a transgenic event or a gene editing event are:

(4-1) The sequence of the expression vector is known: VRN≥Gdepth/2+VectorLen/ReadLen, and Tcov≥0.9;

Among them, VRN represents the number of reads matching the expression vector sequence; Gdepth represents the average sequencing depth of the plant genome to be tested; VectorLen represents the sequence length of the expression vector; ReadLen represents the read length; Tcov represents the base coverage of the T-DNA sequence, Tdepth represents the average sequencing depth of T-DNA, and Bdepth represents the average sequencing depth of backbone sequences;

(4-2) The sequence of the expression vector is unknown: FRN≥(Gdepth/2)×10;

Among them, FRN represents the number of reads matching the pan-vector library; Gdepth represents the average sequencing depth of the plant genome to be tested;

The criterion for judging whether there is a backbone sequence transfer event is: BRN≥Gdepth/3;

Among them, BRN represents the number of reads matching the backbone sequence; Gdepth represents the average sequencing depth of the genome of the plant to be tested;

(4-3) Determination of copy number: inserted T-DNA copy number=Tdepth/Gdepth; inserted backbone sequence number=Bdepth/Gdepth;

Tdepth represents the average sequencing depth of T-DNA, Gdepth represents the average sequencing depth of the plant genome to be tested, and Bdepth represents the average sequencing depth of the backbone sequence;

(5) Taking the expression vector containing the T-DNA sequence as the reference sequence, according to the data of the mapping database, extract the qualified double-ended reading pair, and the double-ended reading pair must contain a single sequence that completely matches the expression vector sequence. End reading I, and another single-end reading II that does not match or does not completely match the expression vector sequence; then single-end reading II is respectively associated with the expression vector sequence containing the T-DNA sequence and the wild type of the plant to be tested. Genome sequence for local homology alignment analysis;

The insertion site of the T-DNA sequence is determined according to the following different situations;

(5-1) If one end of the single-end read sequence II can match the expression vector sequence, the other end can match the wild-type genome sequence, and there are at least three single-end reads with the same matching start position of the wild-type genome or expression vector Sequence II, it is determined that there is a candidate insertion site on the single-end reading sequence II, and the genome matching position closest to the sequence position of the expression vector is the insertion site;

(5-2) If the single-end reading sequence II cannot match the expression vector sequence, but can match the wild-type genome sequence, and there are at least three single-end reading sequences II with the same matching starting position of the wild-type genome, it is determined that the There is an insertion site on the single-end reading sequence II, and the insertion site is the starting position matched with the genome;

(5-3) If the single-end reading II can neither match the expression vector sequence nor the wild-type genome sequence, then it is necessary to firstly analyze the double-end reading corresponding to the single-end reading II according to the sequence overlapping characteristics. The assembly is spliced into a fragment, the fragment is connected with the T-DNA sequence or the inserted expression vector fragment with N, and steps (3) and (5) are repeated as a reference sequence, until the single-end reading sequence II can match the test. The wild-type genome sequence of the plant can be judged by step (5-1) or (5-2);

The N represents an unknown ATCG base or permutation sequence;

The read sequence refers to each sequence that is arranged and combined by "ATCG" obtained by a sequencer; the paired-end read pair refers to the random fragmentation of the genome into DNA fragments of different lengths by a sonicator, and recovery and recovery by PCR. Purify DNA fragments of a fixed size, fix the two ends of these DNA fragments on the substrate by adding adapters, and then sequence from the two ends to the middle of the fragments, and the obtained sequences from the two ends of the same DNA fragment; the single-end reading sequence Refers to any reading sequence in a pair of double-ended reading pairs relative to a pair of double-ended reading sequences; the perfect match means that for a reading sequence with a length of 150 bp, the sequence can be completely matched with the reference sequence, allowing 2 -A mismatch of 3 bases and a gap within 10 bp; the incomplete match refers to a sequence of no more than 90 bp that can match the reference sequence for a read sequence of 150 bp in length.

2. The method for rapidly identifying transgenic or gene editing materials and insertion sites thereof as claimed in claim 1, wherein in step (2), whole genome re-sequencing is performed on the genomic DNA, and the original off-machine is performed. The data is subjected to quality control to obtain the paired-end sequencing data of the whole genome after quality control;

The specific method of the quality control is as follows:

(2-a) removing the reading sequence with the linker;

(2-b) Remove reads with a ratio of N greater than 20%;

(2-c) when the number of low-quality bases contained at the 3' end of the single-end reading exceeds one-third of the length ratio of the reading, remove the paired-end reading corresponding to the single-end reading; The low-quality bases are bases with a quality of ≤20.

3. The method for rapid identification of transgene or gene editing material and insertion site thereof according to claim 1, wherein in step (5-1), the single-end reading II is matched with the expression vector sequence. The interval sequence is compared with the wild-type genome sequence of the plant sample to be tested;

If it is not a homologous sequence, it is determined that the candidate insertion site on the single-ended reading sequence II is a real insertion site; if it is a homologous sequence, it is determined that the single-ended reading sequence II has a candidate insertion site. Dots are false positive insertion sites.

4. The method for rapid identification of transgene or gene editing material and insertion site thereof according to claim 1, wherein in step (5-2), if single-end reading II is a pair of double-end readings If the single-end reading sequence II is the right-end sequence of the double-end reading sequence, then the single-end reading sequence II is determined to have an insertion site on the left-end sequence. The insertion site with is the site minimum.

5. The method for rapidly identifying transgenic or gene editing materials and their insertion sites with data according to claim 1, wherein in steps (5-1) to (5-3), the matching criteria are: Base similarity ≥95%, mismatched bases ≤5, and gaps ≤5.

6. the method for rapid identification of transgene or gene editing material and insertion site thereof as claimed in claim 1, is characterized in that, also comprises step (6): the primer sequence before and after the insertion site design obtained according to step (5), Validated by PCR technique.