Disclosure of Invention
The invention provides a method for rapidly identifying a transgene or gene editing material and an insertion site thereof by using whole genome re-sequencing data, which is combined with a bioinformatics analysis means, identifies whether a transgene or gene editing event occurs under the condition that an expression vector is known or unknown, can rapidly and accurately give out accurate positioning, direction, copy number and flanking sequence information of a target sequence inserted into a genome under the condition that the expression vector is known, and can also identify whether a skeleton sequence, namely a sequence except the target sequence is inserted into the genome and give out the same positioning.
The specific technical scheme is as follows:
a method for rapidly identifying a transgenic or gene-editing material and its insertion site using whole genome re-sequencing data, comprising:
(1) extracting the genome DNA of the plant sample to be detected after being processed by a transgenic technology or a gene editing technology;
(2) performing whole genome re-sequencing on the genome DNA to obtain double-end sequencing data of a whole genome;
(3) judging whether an expression vector sequence containing a T-DNA sequence inserted into a plant to be detected is known or not, and carrying out the following operations according to a judgment result:
if the sequence of the expression vector inserted into the plant to be detected is known, mapping the expression vector containing the T-DNA sequence as a template with the double-end sequencing data to obtain a mapping database;
if the expression vector sequence inserted into the plant to be detected is unknown, mapping the expression vector sequence with the double-end sequencing data by using a generic vector library as a template to obtain a mapping database;
(4) respectively counting the number of reads of a matched expression vector sequence or the number of reads of a matched generic vector library, the number of reads of a matched skeleton sequence, the base coverage rate and the average sequencing depth of a T-DNA sequence, the average sequencing depth of a plant sample genome to be detected, the sequence length of an expression vector and the length of the reads, and judging whether a transgenic event or a gene editing event exists in the plant to be detected, whether a skeleton sequence transfer event occurs and the copy number of an inserted sequence according to the following formula;
criteria for determining the presence or absence of a transgenic or gene editing event are:
(4-1) known expression vector sequences: VRN is more than or equal to Gdepth/2+ VectorLen/ReadLen, and Tcov is more than or equal to 0.9;
wherein VRN represents the number of reads matching the expression vector sequence; gdepth represents the average sequencing depth of the genome of the plant to be detected; VectorLen denotes the sequence length of the expression vector; ReadLen represents the read length; tcov represents the base coverage of the T-DNA sequence, Tdepth represents the average sequencing depth of the T-DNA, and Bdepth represents the average sequencing depth of the framework sequence;
(4-2) unknown sequence of expression vector: FRN is not less than (Gdepth/2) multiplied by 10;
wherein FRN represents the number of reads that match the generic library; gdepth represents the average sequencing depth of the genome of the plant to be detected;
the standard for judging whether the skeleton sequence transfer event exists is as follows: BRN is more than or equal to Gdepth/3;
wherein BRN represents the number of reads matching the backbone sequence; gdepth represents the average sequencing depth of the genome of the plant to be detected;
(4-3) determination of copy number: the copy number of the inserted T-DNA is Tdepth/Gdepth; the number of inserted skeleton sequences is Bdepth/Gdepth;
tdepth represents the average sequencing depth of the T-DNA, Gdepth represents the average sequencing depth of the genome of a plant to be detected, and Bdepth represents the average sequencing depth of a skeleton sequence;
(5) taking an expression vector containing a T-DNA sequence as a reference sequence, and extracting a double-ended read pair meeting the conditions according to the data of a mapping database, wherein the double-ended read pair must comprise a single-ended read I completely matched with the expression vector sequence and a single-ended read II not matched with or not completely matched with the expression vector sequence; then, carrying out local homology comparison analysis on the single-ended reading sequence II, an expression vector sequence containing a T-DNA sequence and a wild type genome sequence of a plant to be detected respectively;
determining the insertion site of the T-DNA sequence according to the following different conditions;
(5-1) if one end of the single-ended reading sequence II can be matched with the sequence of the expression vector, the other end of the single-ended reading sequence II can be matched with the sequence of the wild genome, and at least three single-ended reading sequences II with the same matching starting position of the genome or the expression vector exist, judging that the single-ended reading sequences II have candidate insertion sites, and the genome matching position closest to the sequence position of the expression vector is the insertion site.
(5-2) if the single-ended reading sequence II can not be matched with the sequence of the expression vector but can be matched with the sequence of the wild genome, and at least three single-ended reading sequences II with the same matching initial position of the genome exist, judging that the single-ended reading sequences II have insertion sites, and the insertion sites are the initial positions matched with the genome.
(5-3) if the single-ended reading sequence II can not be matched with the sequence of the expression vector and can not be matched with the sequence of the wild genome, assembling and splicing the corresponding double-ended reading sequence pair of the single-ended reading sequence II into a fragment according to the sequence overlapping characteristic, connecting the fragment with a T-DNA sequence or an inserted expression vector fragment by using N, and repeating the step (3) and the step (5) by using the fragment as a reference sequence until the single-ended reading sequence II can be matched with the wild genome sequence of the plant to be detected and can be judged by the step (5-1) or the step (5-2).
As shown in FIG. 1, an expression vector sequence includes both the T-DNA sequence of interest and sequences other than the T-DNA of interest, which we define as a backbone sequence. Whether the T-DNA sequence or the framework sequence is inserted into the wild type genome of the plant to be tested, there are no more than three cases: the first is precise insertion, i.e., the genomic sequence is tightly linked to the inserted sequence, no other sequences in between, and for one single-ended read I that matches the expression vector sequence, there are two possibilities for the other single-ended read II that does not match or does not match completely to the expression vector sequence, one is a read sequence that spans the "breakpoint" between the inserted sequence and the genomic sequence (that can be located by 5-1), and one is a read sequence that comes entirely from the genomic sequence (that can be located by 5-2); the second is insertion with concomitant insertion or perturbation of the mini-fragment, where the mini-fragment is defined as less than or equal to 90bp (where the length is adjustable) depending on the length of the read and the sensitivity of the method, where single-ended read II also exists with two possibilities, one is a read sequence spanning the "breakpoint" between the inserted sequence and the genomic sequence (that can be located by 5-1), and one is a read sequence entirely from the genomic sequence (that can be located by 5-2); the third is insertion and accompanying recombination or perturbation of large fragments, defined here as greater than 100bp (here adjustable in length) in terms of read length, where single-ended read II also presents two possibilities, one for read sequences spanning the "breakpoint" between the inserted sequence and the rearranged (perturbed) sequence, and one completely from the rearranged (perturbed) sequence, both of which can be used for 5-3 alignment.
Further, in the step (2), performing whole genome resequencing on the genome DNA, and performing quality control on original offline data to obtain quality-controlled double-end sequencing data of the whole genome;
the quality control method comprises the following specific steps:
(2-a) removing the read sequence with the linker;
(2-b) reads with a proportion of N removed greater than 20%;
(2-c) removing the double-ended read corresponding to the single-ended read when the 3' end of the single-ended read contains more low-quality bases than one third of the length proportion of the single-ended read; the low-quality base is a base with the mass less than or equal to 20.
Further, in step (3), the generic database is a database of complete sequence of vectors collected from the public database (NCBI, https:// www.ncbi.nlm.nih.gov /) as complete as possible.
Further, in the step (5-1), comparing an interval sequence matched with the expression vector sequence in the single-end reading sequence II with a wild type genome sequence of a plant sample to be detected;
if not, determining that the candidate insertion site on the single-ended reading sequence II is a real insertion site; if the sequence is homologous, the candidate insertion site on the single-ended reading sequence II is determined to be a false positive insertion site.
Further, in the step (5-2), if the single-ended reading sequence II is the left-end sequence of the double-ended reading sequence pair, the insertion site on the single-ended reading sequence II is determined to be the maximum site; and if the single-ended reading sequence II is the right-end sequence of the double-ended reading sequence pair, judging that the insertion site on the single-ended reading sequence II is the site minimum value.
Further, in the steps (5-1) to (5-3), the matching criteria are: the similarity of the basic groups is more than or equal to 95 percent, the mismatched basic groups are less than or equal to 5 percent, and the vacancy is less than or equal to 5 percent.
Further, the method of the present invention further comprises step (6): designing primer sequences before and after the insertion site obtained in the step (5), and verifying by a PCR technology.
Compared with the prior art, the invention has the following beneficial effects:
(1) the method combines with bioinformatics analysis means, identifies whether the transgene or the gene editing event occurs under the condition that an expression vector is known or unknown, can quickly and accurately give the accurate positioning, direction, copy number and flanking sequence information of the target sequence inserted into the genome under the condition that the expression vector is known, and can also identify whether a skeleton sequence, namely a sequence except the target sequence is inserted into the genome and give the same positioning.
(2) The method not only considers that the peripheral sequence of the genome insertion site is recombined or disordered due to the insertion of the exogenous sequence, but also reduces the sensitivity of the method and improves the false negative and missed detection probability; it is also considered that when the vector sequence has partial homology with the genomic sequence, false positives arise.
(3) Compared with the traditional experimental method, the method has the advantages of short time consumption, economy, repeatability and the like, can determine the influence of the transgenic or gene editing event on the plant genome in a short time, is favorable for the safety risk evaluation of the transgenic plant, and promotes the application of the transgenic or gene editing technology in agriculture.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. These examples are intended to illustrate the invention only and are not intended to limit the scope of the invention. The experimental procedures not specified in the examples were carried out under conventional conditions or under the conditions described in molecular cloning protocols (J. SammBruke and D.W. Lassel, Huang Peyer, published by scientific Press, third edition 8.2002).
The terms referred to in the present invention are defined as follows:
reading the sequence: each sequence assembled from the "ATCG" array was obtained by a sequencer.
Double-ended read pair: randomly breaking a genome into DNA fragments with different lengths by an ultrasonic instrument, recovering and purifying the DNA fragments with fixed sizes by PCR, fixing the DNA fragments on a substrate by adding joints at two ends of the DNA fragments, and sequencing from two ends to the middle of the fragments to obtain sequences at two ends of the same DNA fragment.
Single-ended reading: with respect to paired-end read pairs, it is meant any read in a pair of paired-end read pairs.
Average sequencing depth: the ratio of the total amount of bases obtained by sequencing the sequence interval to the length of the sequence interval is determined as the ratio of the total amount of bases obtained by sequencing the sequence interval (excluding the region homologous to the genome) to the length of the sequence interval (excluding the region homologous to the genome) for the average sequencing depth of the T-DNA and the backbone sequence.
Base coverage of T-DNA sequence: the ratio of the length of the T-DNA interval to the length of the T-DNA sequence is determined by sequencing.
And (3) complete matching: for reads of 150bp length, the sequence can be perfectly matched to the reference sequence, allowing mismatches of 2-3 bases, gaps within 10 bp.
Not perfectly matched: for reads of 150bp length, there is no more than 90bp of sequence that can match the reference sequence.
Example 1
A method for rapidly identifying a gene editing material and an insertion site thereof by using whole genome re-sequencing data comprises the following specific steps:
(1) extracting the genomic DNA of the experimental material and the plant to be detected: the rice gene editing sample R07 takes pHUN4c12S as a vector, the total length is 12314bp, the gene editing target gene is MTL (LOC _ Os03g27610), and the gene encodes phospholipase.
Firstly, designing a target sequence (20nt) in the third exon of the MTL gene according to the sequence of the MTL gene, and connecting the target sequence into a vector pHUN4c12S to obtain a pHUN4c12S-MTL plasmid; then, the plasmid is transferred into an agrobacterium-infected state; and finally, transferring pHUN4c12S-MTL into the japonica rice variety Sn rice No. 1 by an agrobacterium-mediated transgenic method. Screening, differentiating and rooting to obtain T0And (3) continuously planting the transgenic regenerated seedlings to obtain T1 plants R07, transplanting the plants to be tested to a field for planting for about 30 days, taking leaves, extracting the genome DNA of the plants to be tested by using a conventional DNA extraction kit, and carrying out whole-gene recombination sequencing. The rice reference genomic sequence is RGAP v7(htt p:// rice. plant biology. msu. edu/annotation _ pseudo. shtml).
(2) Performing whole genome resequencing on the genome DNA of the plant to be detected, and performing quality control on original off-line data to obtain double-end sequencing data of the whole genome;
the whole genome double-end sequencing is carried out on the gene editing rice to be tested transferred into the vector pHUN4c12S by a high-throughput sequencer (Illumina sequencer, Beijing Baimaike Biotechnology Ltd.), the insertion size is 300bp, the reading length is 150bp, the original off-line data is 48G, and the average sequencing depth is 57 layers.
Performing quality control on original offline data, wherein the software of the quality control is Trimmomatic, the version is 0.36, and the quality control standard is as follows:
(2-a) removing the read with the linker;
(2-b) removing reads with a proportion of N in reads greater than 20%;
(2-c) removing the paired-end reads when the 3' end of the single-end read (i.e., one of the paired-end reads) contains more than one-third of the low-quality (base-quality ═ 20) bases in the ratio of the length of the read; setting parameters as tailing: 20, MINLENEN: and 100, obtaining quality data after quality control.
(3) Judging whether an expression vector sequence containing a T-DNA sequence inserted into a plant to be detected is known or not, and carrying out the following operations according to a judgment result:
if the sequence of the expression vector inserted into the plant to be detected is known, the expression vector is used as a template, and the double-end sequencing data is mapped with the expression vector to obtain a mapping database;
if the sequence of the expression vector inserted into the plant to be detected is unknown, mapping the double-end sequencing data with the universal vector library as a template to obtain a mapping database;
the judgment result is as follows: the sequence of the expression vector inserted into the plant to be detected is known, so that the expression vector pHUN4c12S-MTL containing a T-DNA sequence is used as a template, and double-end sequencing data obtained after whole genome re-sequencing quality control is subjected to mapping analysis with the double-end sequencing data to obtain a mapping database.
The specific mapping analysis steps are as follows: the software used is bowtie2, the version is 2.2.1, the default parameters (the definite mapping standard) are adopted, the obtained comparison result information file is in the sam format, the sam format file is converted into a binary bam format file (namely a mapping database), and the conversion tool is samtools. In order to reduce the interference of the sequencing repeated sequence (two sequences which are completely identical in the same position and the same direction of the matched expression vector) on the subsequent positioning result, a step of removing the repeated sequence alignment result is introduced in the format conversion process.
The tool used to REMOVE duplicate sequences is the subroutine module MarkDuplicates. jar in the picard software, with the parameter REMOVE _ DUPLICATES set to true and the other parameters default.
(4) Respectively counting the number of reads (VRN) of a matched expression vector sequence, the number of reads (BRN) of a matched skeleton sequence, the base coverage rate (TCov) and the average sequencing depth (Tdepth) of a T-DNA sequence, the average sequencing depth (Gdepth) of a sample genome, the sequence length (VectorLen) of an expression vector and the reading length (ReadLen) in the mapping database, and judging whether a transgenic event or a gene editing event exists in a plant to be detected and whether a skeleton sequence is transferred into the event according to the following formula;
criteria for determining the presence or absence of a transgenic or gene editing event are:
VRN is more than or equal to Gdepth/2+ VectorLen/ReadLen, and Tcov is more than or equal to 0.9;
wherein VRN represents the number of reads matching the expression vector sequence; gdepth represents the average sequencing depth of the genome of the plant to be detected; VectorLen denotes the sequence length of the expression vector; ReadLen represents the read length; tcov represents the base coverage of the T-DNA sequence;
the standard for judging whether the skeleton sequence transfer event exists is as follows: BRN is more than or equal to Gdepth/3;
wherein BRN represents the number of reads matching the backbone sequence; gdepth represents the average sequencing depth of the genome of the plant to be detected;
the detection result is as follows: the number of reads (VRN) of the matched expression vector sequence is 2725, the number of reads (BRN) of the matched skeleton sequence is 1053, the base coverage rate (TCov) of the T-DNA sequence is 0.997, the average sequencing depth (Tdepth) of the T-DNA is 23 layers, the average sequencing depth of the skeleton sequence is 27 layers, the average sequencing depth (Gdepth) of the sample genome is 57 layers, the length (VectorLen) of the expression vector sequence is 12314bp, and the length (ReadLen) of the reads is 150 bp.
(4-1)VRN≥111=(57/2+12314/150);Tcov≥0.9;
(4-2)BRN≥19;
(4-3) determination of copy number: the copy number of the inserted T-DNA is 23/57 and 1/2; the number of inserted skeleton sequences is 27/57 ≈ 1/2;
and judging whether the sample has a transgenic or gene editing event or has a skeleton sequence transfer event and single copy according to the judgment standard.
(5) Taking an expression vector containing a T-DNA sequence as a reference sequence, extracting qualified doublets by samtools software according to data of a mapping database (bam format file)An end-read pair, which must comprise a single-ended read I that is perfectly matched to the expression vector sequence and another single-ended read II that is not matched or not perfectly matched to the expression vector sequence; and then carrying out local homology comparison analysis on the single-ended reading sequence II, the expression vector sequence containing the T-DNA sequence and the wild type genome sequence of the plant to be detected respectively. The software used for local homology alignment analysis was blast, version of blast was 2.2.28+, and the outfmt parameter was set to 6 to ensure that the output was m8 format separated by tab bonds, for subsequent processing, num _ descriptions or max _ target _ seqs parameter was set to 1, evalue parameter was set to 1e-5。
Determining the insertion site of the T-DNA sequence according to the following different conditions;
(5-1) if one end of the single-ended reading sequence II can be matched with an expression vector sequence, the other end of the single-ended reading sequence II can be matched with a wild type genome sequence, and at least three single-ended reading sequences II with the same matching initial position of the genome or the expression vector exist, judging that the single-ended reading sequence II has a candidate insertion site, extracting a reading sequence II interval section matched with the expression vector sequence, comparing the reading sequence II interval section with the wild type genome of a transgenic sample to be detected, and if no homologous sequence exists, setting the genome matching site closest to the expression vector sequence position as a real insertion site; if a homologous sequence exists, the candidate insertion site on the single-ended reading II is determined to be a false positive insertion site. The matching criteria were: the similarity of the basic groups is more than or equal to 95 percent, the mismatched basic groups are less than or equal to 5 percent, and the vacancy is less than or equal to 5 percent.
(5-2) if the single-ended reading sequence II can not be matched with the sequence of the expression vector but can be matched with the sequence of the wild genome, and at least three single-ended reading sequences II with the same matching initial position of the genome exist, judging that the single-ended reading sequences II have insertion sites, and the insertion sites are the initial positions matched with the genome; the matching criteria were: the similarity of the basic groups is more than or equal to 95 percent, the mismatched basic groups are less than or equal to 5 percent, and the vacancy is less than or equal to 5 percent.
If the single-ended reading sequence II is the left-end sequence of the double-ended reading sequence pair, the insertion site on the single-ended reading sequence II is judged to be the maximum site; and if the single-ended reading sequence II is the right-end sequence of the double-ended reading sequence pair, judging that the insertion site on the single-ended reading sequence II is the site minimum value.
(5-3) if the single-ended reading sequence II can not be matched with the sequence of the expression vector and can not be matched with the sequence of the wild genome, assembling and splicing the corresponding double-ended reading sequence pair of the single-ended reading sequence II into a fragment according to the sequence overlapping characteristic, connecting the fragment with a T-DNA sequence or an inserted expression vector fragment by using N, and repeating the step (3) and the step (5) by using the fragment as a reference sequence until the single-ended reading sequence II can be matched with the wild genome sequence of the plant to be detected and can be judged by the step (5-1) or the step (5-2).
According to the method of step (5), the judgment results are shown in Table 1, and the genome schematic diagrams of the inserted T-DNA sequences are shown in FIGS. 3 and 4.
As can be seen from FIGS. 3 and 4, this example is the case in step (5-1), and the insertion site is located by single-ended reading II matching with the expression vector sequence and the wild-type genomic sequence.
TABLE 1 final output results of positioning of rice plants to be tested
Example 2
(1) Extracting the genomic DNA of the experimental material and the plant to be detected: the soybean transgenic sample L22 takes pSOY19 as a vector, the total length is 9557bp, and the target gene g10-epsps contained in the vector has the function of a glyphosate tolerance gene.
According to the sequence of the g10-epsps gene, the target sequence is connected into a vector pSOY19 to obtain a pSOY19-g10-epsps expression plasmid vector; then, the plasmid is transferred into an agrobacterium-infected state; finally, pSOY19-g10-epsps was transferred into recipient Huachun No. 3 by Agrobacterium-mediated transgenic method. After screening, differentiation and rooting, transgenic soybean L22 is obtained.
After the strain is normally cultured to 100 days of seedling age, taking leaves, extracting the genome DNA of the plant to be detected by using a conventional DNA extraction kit, and performing whole genome re-sequencing. The soybean reference genome is a cultivar Williams 82 sequence Gmax-275 _ v2.fa, and the download address is https:// phytozome.jgi.doe.gov/pz/portal.html.
(2) Performing whole genome resequencing on the genome DNA of the plant to be detected, and performing quality control on original off-line data to obtain double-end sequencing data of the whole genome;
the transgenic soybean sample to be tested, which is transferred into the vector pSOY19, is subjected to whole genome double-end sequencing by a high-throughput sequencer (Illumina XTen, Nuo and induced science and technology Limited), the insertion size is 300bp, the reading length is 150bp, the obtained original off-machine data is 52G, and the average sequencing depth is 23 layers.
Performing quality control on original offline data, wherein the software of the quality control is Trimmomatic, the version is 0.36, and the quality control standard is as follows:
(2-a) removing the read with the linker;
(2-b) removing reads with a proportion of N in reads greater than 20%;
(2-c) removing the paired-end reads when the 3' end of the single-end read (i.e., one of the paired-end reads) contains more than one-third of the low-quality (base-quality ═ 20) bases in the ratio of the length of the read; setting parameters as tailing: 20, MINLENEN: and 100, obtaining quality data after quality control.
(3) Judging whether an expression vector sequence containing a T-DNA sequence inserted into a plant to be detected is known or not, and carrying out the following operations according to a judgment result:
if the sequence of the expression vector inserted into the plant to be detected is known, the expression vector is used as a template, and the double-end sequencing data is mapped with the expression vector to obtain a mapping database;
if the sequence of the expression vector inserted into the plant to be detected is unknown, the generic vector library is used as a template, and the generic vector library and the double-end sequencing data are mapped to obtain a mapping database;
the judgment result is as follows: the sequence of an expression vector inserted into a plant to be detected is known, so that the expression vector pSOY19-g10-epsps containing a T-DNA sequence is used as a template, and double-end sequencing data obtained after whole genome re-sequencing quality control is subjected to mapping analysis to obtain a mapping database.
The specific mapping analysis steps are as follows: the software used is bowtie2, the version is 2.2.1, the default parameters (the definite mapping standard) are adopted, the obtained comparison result information file is in the sam format, the sam format file is converted into a binary bam format file (namely a mapping database), and the conversion tool is samtools. In order to reduce the interference of the sequencing repeated sequence (two sequences which are completely identical in the same position and the same direction of the matched expression vector) on the subsequent positioning result, a step of removing the repeated sequence alignment result is introduced in the format conversion process.
The tool used to REMOVE duplicate sequences is the subroutine module MarkDuplicates. jar in the picard software, with the parameter REMOVE _ DUPLICATES set to true and the other parameters default.
(4) Respectively counting the number of reads (VRN) of a matched expression vector sequence, the number of reads (BRN) of a matched skeleton sequence, the base coverage rate (TCov) and the average sequencing depth (Tdepth) of a T-DNA sequence, the average sequencing depth (Gdepth) of a sample genome, the sequence length (VectorLen) of an expression vector and the reading length (ReadLen) in the mapping database, and judging whether a transgenic event or a gene editing event exists in a plant to be detected and whether a skeleton sequence is transferred into the event according to the following formula;
criteria for determining the presence or absence of a transgenic or gene editing event are:
VRN is more than or equal to Gdepth/2+ VectorLen/ReadLen, and Tcov is more than or equal to 0.9;
wherein VRN represents the number of reads matching the expression vector sequence; gdepth represents the average sequencing depth of the genome of the plant to be detected; VectorLen denotes the sequence length of the expression vector; ReadLen represents the read length; tcov represents the base coverage of the T-DNA sequence;
the detection result is as follows: the number of reads (VRN) of the matched expression vector sequence is 293, the number of reads (BRN) of the matched skeleton sequence is 0, the base coverage rate (TCov) of the T-DNA sequence is 0.986, the average sequencing depth (Tdepth) of the T-DNA is 14 layers, the average sequencing depth (Gdepth) of the sample genome is 23 layers, the sequence length (VectorLen) of the expression vector is 9557bp, and the reading length (ReadLen) is 150 bp.
(4-1)VRN≥75=(23/2+9557/150);Tcov≥0.9;
(4-2)BRN<8;
(4-3) determination of copy number: the copy number of the inserted T-DNA is 14/23 and 1/2, and the T-DNA is single copy;
and according to the judgment standard, judging that the sample has a transgenic or gene editing event, and has single copy but no skeleton sequence transfer event.
(5) Taking an expression vector containing a T-DNA sequence as a reference sequence, extracting a double-ended reading pair meeting the conditions through samtools software according to data of a mapping database (bam format file), wherein the double-ended reading pair must comprise a single-ended reading I completely matched with the expression vector sequence and a single-ended reading II which is not matched with or is not completely matched with the expression vector sequence; and then carrying out local homology comparison analysis on the single-ended reading sequence II, the expression vector sequence containing the T-DNA sequence and the wild type genome sequence of the plant to be detected respectively. The software used for local homology alignment analysis was blast, version of blast was 2.2.28+, and the outfmt parameter was set to 6 to ensure that the output was m8 format separated by tab bonds, for subsequent processing, num _ descriptions or max _ target _ seqs parameter was set to 1, evalue parameter was set to 1e-5。
Determining the insertion site of the T-DNA sequence according to the following different conditions;
(5-1) if one end of the single-ended reading sequence II can be matched with an expression vector sequence, the other end of the single-ended reading sequence II can be matched with a wild type genome sequence, and at least three single-ended reading sequences II with the same matching initial position of the genome or the expression vector exist, judging that the single-ended reading sequence II has a candidate insertion site, extracting a reading sequence II interval section matched with the expression vector sequence, comparing the reading sequence II interval section with the wild type genome of a transgenic sample to be detected, and if no homologous sequence exists, setting the genome matching site closest to the expression vector sequence position as a real insertion site; if a homologous sequence exists, the candidate insertion site on the single-ended reading II is determined to be a false positive insertion site. The matching criteria were: the similarity of the basic groups is more than or equal to 95 percent, the mismatched basic groups are less than or equal to 5 percent, and the vacancy is less than or equal to 5 percent.
(5-2) if the single-ended reading sequence II can not be matched with the sequence of the expression vector but can be matched with the sequence of the wild genome, and at least three single-ended reading sequences II with the same matching initial position of the genome exist, judging that the single-ended reading sequences II have insertion sites, and the insertion sites are the initial positions matched with the genome; the matching criteria were: the similarity of the basic groups is more than or equal to 95 percent, the mismatched basic groups are less than or equal to 5 percent, and the vacancy is less than or equal to 5 percent.
If the single-ended reading sequence II is the left-end sequence of the double-ended reading sequence pair, the insertion site on the single-ended reading sequence II is judged to be the maximum site; and if the single-ended reading sequence II is the right-end sequence of the double-ended reading sequence pair, judging that the insertion site on the single-ended reading sequence II is the site minimum value.
(5-3) if the single-ended reading sequence II can not be matched with the sequence of the expression vector and can not be matched with the sequence of the wild genome, assembling and splicing the corresponding double-ended reading sequence pair of the single-ended reading sequence II into a fragment according to the sequence overlapping characteristic, connecting the fragment with a T-DNA sequence or an inserted expression vector fragment by using N, and repeating the step (3) and the step (5) by using the fragment as a reference sequence until the single-ended reading sequence II can be matched with the wild genome sequence of the plant to be detected and can be judged by the step (5-1) or the step (5-2).
According to the method of step (5), the judgment results are shown in Table 2, and the genome schematic diagrams of the inserted T-DNA sequences are shown in FIGS. 5 and 6.
As can be seen from FIGS. 5 and 6, this embodiment is the case of steps (5-1) and (5-3), and the insertion site is located by steps (5-1) and (5-3).
TABLE 2 final output results of positioning soybean plants to be tested
Example 3
(1) Extracting the genomic DNA of the experimental material and the plant to be detected: taking a rice gene editing sample R14 as an example, pHUN4c12S is a vector, the total length is 12314bp, a gene editing target gene is OsLCT1(LOC _ Os06g38120), and the gene encodes a low-affinity ion transporter.
Firstly, designing a target sequence in the third exon of the gene according to the sequence of the OsLCT1 gene (CCCGGCAGCGCACCGATGTTGCT), and connecting the target sequence into a vector pHUN4c12S to obtain a pHUN4c12S-LCT1 plasmid; then, the plasmid is transferred into an agrobacterium-infected state; and finally, the pHUN4c12S-LCT1 is transferred into the japonica rice variety sweet japonica No. 1 by an agrobacterium-mediated transgenic method. And screening, differentiating and rooting to obtain T0 transgenic regenerated seedling. Continuously planting to obtain a T1 plant R14, planting the plant to be detected in a field for about 30 days, taking leaves, extracting the genome DNA of the plant to be detected by using a conventional kit, detecting that the plant does not contain a hygromycin resistance gene by using a hygromycin resistance gene marker, and then performing whole genome re-sequencing. The rice reference genomic sequence was RGAP v7(http:// rice. plant. biology. msu. edu/annotation _ pseudo. shtml).
(2) Performing whole genome resequencing on the genome DNA of the plant to be detected, and performing quality control on original off-line data to obtain double-end sequencing data of the whole genome;
the whole genome double-end sequencing is carried out on the gene editing rice to be tested by a high-throughput sequencer (Illumina sequencer, beijing Baimaike biotechnology limited), the insertion size is 300bp, the reading length is 150bp, the original off-line data 32G is obtained, and the average sequencing depth is 25 layers.
Performing quality control on original offline data, wherein the software of the quality control is Trimmomatic, the version is 0.36, and the quality control standard is as follows:
(2-a) removing the read with the linker;
(2-b) removing reads with a proportion of N in reads greater than 20%;
(2-c) removing the paired-end reads when the 3' end of the single-end read (i.e., one of the paired-end reads) contains more than one-third of the low-quality (base-quality ═ 20) bases in the ratio of the length of the read; setting parameters as tailing: 20, MINLENEN: and 100, obtaining quality data after quality control.
(3) Judging whether an expression vector sequence containing a T-DNA sequence inserted into a plant to be detected is known or not, and carrying out the following operations according to a judgment result:
if the sequence of the expression vector inserted into the plant to be detected is known, the expression vector is used as a template, and the double-end sequencing data is mapped with the expression vector to obtain a mapping database;
if the sequence of the expression vector inserted into the plant to be detected is unknown, mapping the double-end sequencing data with the universal vector library as a template to obtain a mapping database;
here, the expression vector and the T-DNA sequence were analyzed without knowing them, and then verified by analysis using the known expression vector and T-DNA sequence. The judgment result is as follows: the sequence of the expression vector inserted into the plant to be detected is unknown, so that the pan-vector library is used as a template, and the double-end sequencing data obtained after the whole genome re-sequencing quality control is subjected to mapping analysis with the double-end sequencing data to obtain a mapping database.
The specific mapping analysis steps are as follows: the software used is bowtie2, the version is 2.2.1, the default parameters (the definite mapping standard) are adopted, the obtained comparison result information file is in the sam format, the sam format file is converted into a binary bam format file (namely a mapping database), and the conversion tool is samtools. In order to reduce the interference of the sequencing repeated sequence (two sequences which are completely identical in the same position and the same direction of the matched expression vector) on the subsequent positioning result, a step of removing the repeated sequence alignment result is introduced in the format conversion process.
The tool used to REMOVE duplicate sequences is the subroutine module MarkDuplicates. jar in the picard software, with the parameter REMOVE _ DUPLICATES set to true and the other parameters default.
(4) Respectively counting the number of read sequences (FRN) matched with the generic vector library in the mapping database, the average sequencing depth (Gdepth) of a sample genome and the reading length (ReadLen), and judging whether a transgenic event or a gene editing event exists in a plant to be detected according to the following formula;
the criteria for determining the presence or absence of a transgenic event or gene editing event are (generic vector library): FRN is more than or equal to Gdepth/2 x 10;
the detection result is as follows: the number of reads (FRN) aligned to the generic library was 710, the average sequencing depth (Gdepth) of the sample genome was 25 layers, and the read length (ReadLen) was 150 bp.
(4-a)FRN≥125=(25/2)×10;
And judging whether the sample has a transgene or a gene editing event according to the judgment standard.
By adopting the method combining the embodiment 1 and the embodiment 3, 78 rice negative control materials, 16 rice gene editing materials, 5 transgenic rice materials and 9 transgenic soybean materials are tested in total, wherein when 16 materials are inserted, the situation that small disturbance exists near the genome insertion site is 5-2, the situation that 5 transgenic materials have large disturbance or rearrangement exists near the genome insertion site is 5-3, and the skeleton sequence insertion phenomenon exists in 15 materials. The total of 45 insertion events, except 3 soybean samples, which were not verified by experimental PCR, were consistent with our analysis results, with an accuracy of 93.7%. The method is 100% accurate in determining whether a transgene or gene editing event is present.