CN113436679B - Method and system for determining mutation rate of nucleic acid sample to be tested - Google Patents
- Publication number
- CN113436679B (application CN202010207884.3A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6844—Nucleic acid amplification reactions
- C12Q1/6858—Allele-specific amplification
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
Abstract
The invention provides a method for determining the mutation rate of a nucleic acid sample to be tested. The method comprises: sequencing the nucleic acid sample to be tested to obtain a sequencing result; comparing the sequencing result with a reference genome sequence of the nucleic acid sample to be tested to obtain a comparison result; determining and correcting structural variation and single-nucleotide and/or small-fragment variation, respectively, based on the matched sequencing reads and the average length of the library fragments; comparing the unmatched sequencing reads with genome reference sequences of other species, and determining the proportion of sequencing reads originating from each of the other species and from unknown sources; splicing the unmatched sequencing reads, comparing the splicing result with the reference genome sequence of the nucleic acid sample to be tested to determine exogenous variation, and determining possible source species based on the splicing result; and summarizing the structural variation, the single-nucleotide and/or small-fragment variation, and the exogenous variation to determine the mutation rate of the nucleic acid sample to be tested.
Description
Technical Field
The present invention relates to the field of bioinformatics, and in particular to a method and system for determining the mutation rate of a nucleic acid sample to be tested, a computer-readable storage medium, and an electronic device.
Background
Viruses typically have stable structures, simple genomes, broad-spectrum infectivity, and efficient packaging capability, and have therefore become widely used engineered vectors for DNA delivery and expression. Exploiting the immunogenicity of viruses, researchers have also used inactivated, attenuated, or engineered viruses as effective vaccines. Furthermore, exploiting the ability of viruses to lyse host cells during replication, researchers have engineered oncolytic viruses that retain replication and packaging capability and specifically kill tumor cells. As virus-related research has progressed, various viruses such as adenovirus, lentivirus, and herpes simplex virus-1 have become targets of engineering, and various virus products are in clinical use.
Although viruses have the above advantages as engineering vectors, their pathogenicity and susceptibility to mutation also increase safety risks. Detecting exogenous contamination and self-variation is therefore an important part of quality control during engineering and production. For adenoviruses, for example, the FDA requires that the level of replication-competent adenovirus (RCA) in non-replicating adenovirus products be less than 1 RCA per 3×10^10 viral particles (VP). At present, exogenous and variant fragments in a virus sample are detected mainly by low-throughput methods, namely PCR and first-generation sequencing of specific regions. These methods require primers to be designed for each fragment to be detected according to the expected variant types, so they can hardly cover all exogenous and variant fragments in the sample; they are also constrained by the specificity and length limits of the PCR reaction, which makes highly homologous fragments and long fragments difficult to detect. Deep sequencing, by contrast, builds a library from randomly fragmented sample DNA, detects all fragments in the sample at high throughput, captures the sequence context around each fragment, and, combined with suitable analysis methods, can effectively detect exogenous and variant fragments in a virus sample.
In summary, detecting exogenous and variant fragments in virus samples is an important part of quality control, but conventional detection methods suffer from low throughput, incomplete coverage, and difficulty in detecting highly homologous fragments and long fragments. Based on high-throughput deep sequencing and related analysis techniques, the inventors have established a detection and analysis flow, from the sample to be tested to the analysis report, that effectively and comprehensively detects exogenous contamination and self-variation in the sample.
Disclosure of Invention
The present application has been made based on the findings and knowledge of the inventors regarding the following facts and problems:
In the quality control of engineered virus modification and production, in order to detect exogenous contamination and self-variation, the inventors established, based on high-throughput deep sequencing and related analysis techniques, a detection and analysis flow from the sample to be tested to the analysis report, which effectively and comprehensively detects and analyzes exogenous contamination and self-variation in the sample.
In a first aspect of the invention, the invention provides a method for determining the mutation rate of a nucleic acid sample to be tested. According to an embodiment of the invention, the method comprises: (1) sequencing the nucleic acid sample to be tested to obtain a sequencing result, wherein the lowest effective depth of the sequencing is 10-100, the data size of the sequencing result is determined based on the length of the reference genome, the lowest effective sequencing depth, and a preset minimum detectable mutation rate, and the sequencing result consists of a plurality of sequencing reads; (2) comparing the sequencing result with a reference genome sequence of the nucleic acid sample to be tested to obtain a comparison result, wherein the comparison result comprises matched sequencing reads and unmatched sequencing reads, and determining the average length of the sequenced library fragments based on the matched sequencing reads; (3) determining and correcting structural variation and single-nucleotide and/or small-fragment variation, respectively, based on the matched sequencing reads and the average length of the library fragments; (4) splicing the unmatched sequencing reads, and comparing the splicing result with the reference genome sequence of the nucleic acid sample to be tested to determine exogenous variation; and (5) summarizing the structural variation, the single-nucleotide and/or small-fragment variation, and the exogenous variation to determine the mutation rate of the nucleic acid sample to be tested. According to an embodiment of the invention, the nucleic acid sample to be tested comprises a viral genome. The method according to the embodiment of the invention can effectively and comprehensively detect and analyze exogenous contamination and self-variation in the nucleic acid sample to be tested.
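The text names the three inputs used to size the sequencing run but gives no formula. The sketch below shows one plausible reading, stated here as an assumption rather than as the patented calculation: reads carrying a variant present at the minimum detectable mutation rate must still reach the lowest effective depth, so the total depth is the effective depth divided by that rate. The function and parameter names are illustrative.

```python
def required_data_volume(genome_length_bp, min_effective_depth, min_detectable_rate):
    """Estimate the number of sequenced bases needed.

    Assumed model (not stated in the source): variant-supporting reads at the
    minimum detectable rate must reach the minimum effective depth.
    """
    if not 0 < min_detectable_rate <= 1:
        raise ValueError("min_detectable_rate must be in (0, 1]")
    total_depth = min_effective_depth / min_detectable_rate
    return genome_length_bp * total_depth


# Hypothetical example: 36 kb genome, effective depth 50, variants down to 1%.
bases = required_data_volume(36_000, 50, 0.01)
print(f"{bases:.3g} bases")
```

Under this model, halving the minimum detectable mutation rate doubles the required data volume.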
According to an embodiment of the present invention, the method may further comprise at least one of the following additional technical features:
According to an embodiment of the present invention, before step (2) is performed, the sequencing result is subjected to quality assessment and screening, and the minimum detectable mutation rate is re-determined based on the screening result; if the re-determined minimum detectable mutation rate is lower than a predetermined threshold, the amount of the nucleic acid sample in step (1) is increased.
According to an embodiment of the present invention, in step (3), the structural variation is determined using Pindel, and half of the preset minimum detectable mutation rate is used as the mutation-rate screening threshold.
According to an embodiment of the invention, after the structural variation and the single-nucleotide and/or small-fragment variation are determined, the sequencing reads involved in each variation are subjected to a secondary alignment, detected variations of the same type are merged, and false-positive detection results caused by low-quality bases, alignment errors, and the like are corrected.
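The screening rule can be sketched as a simple filter. This is a minimal illustration, not Pindel's own output format or filtering logic: the `StructuralCall` record and the definition of the per-call rate as variant-supporting reads over all reads at the locus are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class StructuralCall:
    """Hypothetical record for one structural variation call."""
    kind: str
    start: int
    end: int
    variant_reads: int
    reference_reads: int

    @property
    def rate(self):
        # Assumed definition: fraction of reads at the locus supporting the variant.
        total = self.variant_reads + self.reference_reads
        return self.variant_reads / total if total else 0.0


def screen_calls(calls, preset_min_rate):
    """Keep calls whose rate reaches half of the preset minimum detectable rate."""
    threshold = preset_min_rate / 2
    return [c for c in calls if c.rate >= threshold]


calls = [
    StructuralCall("DEL", 1200, 1300, variant_reads=6, reference_reads=94),
    StructuralCall("INV", 5000, 5200, variant_reads=3, reference_reads=97),
]
kept = screen_calls(calls, preset_min_rate=0.1)
```

Using half of the preset rate as the cut-off leaves headroom for calls whose measured rate is underestimated before the later correction step.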
According to an embodiment of the present invention, in step (3), the secondary alignment is performed using different software than the alignment in step (2).
According to an embodiment of the invention, in step (3), common variation types are excluded based on public data and historical detection data.
According to an embodiment of the invention, in step (3), the detection of the single-nucleotide and/or small-fragment variations is performed using Mutect2.
According to an embodiment of the invention, in step (4), a possible source species is determined based on the splice result.
According to an embodiment of the invention, the unmatched sequencing reads are further aligned with genome reference sequences of other species, and the proportion of sequencing reads originating from each of the other species and from unknown sources is determined.
According to an embodiment of the invention, the genome of the other species comprises a human genome and/or a mycoplasma genome.
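The proportion calculation can be sketched as below. The input is assumed to be one source label per unmatched read, produced by whatever aligner is used, with `None` marking reads that matched none of the other-species references; the function name and labels are illustrative.

```python
from collections import Counter


def source_proportions(read_sources):
    """Return the fraction of unmatched reads per source label.

    `read_sources` holds one label per read; None means no reference matched.
    """
    counts = Counter(src if src is not None else "unknown" for src in read_sources)
    total = sum(counts.values())
    return {src: n / total for src, n in counts.items()} if total else {}


proportions = source_proportions(["human", "human", "mycoplasma", None])
print(proportions)
```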
According to an embodiment of the invention, the method further comprises performing PCR validation on the structural variation and/or the exogenous variation.
In a second aspect of the present invention, the present invention proposes a computer-readable storage medium having a computer program stored thereon. According to an embodiment of the present invention, the program when executed by a processor implements the method for determining the mutation rate of a nucleic acid sample to be tested as described above.
In a third aspect of the invention, the invention provides an electronic device. According to an embodiment of the invention, the electronic device comprises a memory and a processor, wherein the processor reads executable program code stored in the memory and runs a program corresponding to the executable program code, so as to implement the method for determining the mutation rate of a nucleic acid sample to be tested as described above.
In a fourth aspect of the invention, the invention provides a system for determining the mutation rate of a nucleic acid sample to be tested. According to an embodiment of the invention, the system comprises: a sequencing device for sequencing the nucleic acid sample to be tested to obtain a sequencing result, wherein the lowest effective depth of the sequencing is 10-100, the data size of the sequencing result is determined based on the length of the reference genome, the lowest effective sequencing depth, and a preset minimum detectable mutation rate, and the sequencing result consists of a plurality of sequencing reads; a comparison device connected with the sequencing device, for comparing the sequencing result with a reference genome sequence of the nucleic acid sample to be tested to obtain a comparison result, wherein the comparison result comprises matched sequencing reads and unmatched sequencing reads, and the average length of the sequenced library fragments is determined based on the matched sequencing reads; a matched sequencing read analysis device connected with the comparison device, for determining and correcting structural variation and single-nucleotide and/or small-fragment variation, respectively, based on the matched sequencing reads and the average length of the library fragments; an unmatched sequencing read analysis device connected with the comparison device, for splicing the unmatched sequencing reads and comparing the splicing result with the reference genome sequence of the nucleic acid sample to be tested to determine exogenous variation; and an output device connected with the matched sequencing read analysis device and the unmatched sequencing read analysis device, for summarizing the structural variation, the single-nucleotide and/or small-fragment variation, and the exogenous variation to determine the mutation rate of the nucleic acid sample to be tested. According to an embodiment of the invention, the nucleic acid sample to be tested is a viral genome. The system according to the embodiment of the invention is suitable for executing the above method for determining the mutation rate of a nucleic acid sample to be tested, and effectively and comprehensively detects exogenous contamination and self-variation in the sample.
According to an embodiment of the present invention, the above system may further include at least one of the following technical features:
According to an embodiment of the invention, the system further comprises a lowest mutation rate determining device, which is connected with the sequencing device and the comparison device and is used for performing quality evaluation and screening on the sequencing result in advance and re-determining the minimum detectable mutation rate based on the screening result; if the re-determined minimum detectable mutation rate is lower than a predetermined threshold, the amount of the nucleic acid sample in the sequencing device is increased, and if it is not lower than the predetermined threshold, the sequencing result is input to the comparison device.
According to an embodiment of the invention, the system further comprises an unmatched sequencing read source analysis device connected to the comparison device, for aligning the unmatched sequencing reads with genome reference sequences of other species, determining the proportion of sequencing reads originating from each of the other species and from unknown sources, and inputting the result to the output device.
According to an embodiment of the invention, the system further comprises a splice result source analysis device connected to the unmatched sequencing read analysis device for determining possible source species based on the splice result, the result being input to the output device.
According to an embodiment of the present invention, the system further comprises a PCR device, which is connected to the matched sequencing read analysis device, the unmatched sequencing read source analysis device and the unmatched sequencing read analysis device, and is configured to perform PCR verification on the structural variation and/or the exogenous variation, and input the result to the output device.
Drawings
FIG. 1 is a schematic diagram of a virus variation detection analysis flow;
FIGS. 2A-2F show simulation tests of the accuracy of different tools in detecting deletion variations of different lengths at different proportions;
FIGS. 3A-3F show simulation tests of the accuracy of different tools in detecting inversion variations of different lengths at different proportions;
FIGS. 4A-4F show simulation tests of the accuracy of different tools in detecting insertion variations of different lengths at different proportions;
FIGS. 5A-5F show simulation tests of the accuracy of different tools in detecting copy number variations of different lengths at different proportions;
FIG. 6 is an exogenous insertion/substitution variation detection flow;
FIGS. 7A-7D show simulation tests of the accuracy of splicing-based detection of exogenous substitution variations of different lengths at different proportions;
FIGS. 8A-8C show experimental tests of the accuracy of detecting deletion and inversion variations of different lengths at different proportions;
FIG. 9 is an adenovirus sample experimental test and PCR validation;
FIG. 10 is a Pindel variant high resolution correction flow;
FIG. 11 is a schematic diagram showing the structure of a system for determining the mutation rate of a nucleic acid sample to be tested according to an embodiment of the present invention;
FIG. 12 is a schematic diagram showing a system for determining the mutation rate of a nucleic acid sample to be tested according to another embodiment of the present invention;
FIG. 13 is a schematic diagram showing a system for determining the mutation rate of a nucleic acid sample to be tested according to another embodiment of the present invention;
FIG. 14 is a schematic diagram showing a system for determining the mutation rate of a nucleic acid sample to be tested according to another embodiment of the present invention;
FIG. 15 is a schematic diagram showing a system for determining the mutation rate of a nucleic acid sample according to another embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The invention provides a system for determining the mutation rate of a nucleic acid sample to be tested. Referring to FIG. 11, the system includes: a sequencing device 100 for sequencing the nucleic acid sample to be tested to obtain a sequencing result, wherein the lowest effective depth of the sequencing is 10-100, the data size of the sequencing result is determined based on the length of the reference genome, the lowest effective sequencing depth, and a preset minimum detectable mutation rate, and the sequencing result consists of a plurality of sequencing reads; a comparison device 200 connected with the sequencing device 100, for comparing the sequencing result with a reference genome sequence of the nucleic acid sample to be tested to obtain a comparison result, wherein the comparison result comprises matched sequencing reads and unmatched sequencing reads, and the average length of the sequenced library fragments is determined based on the matched sequencing reads; a matched sequencing read analysis device 300 connected with the comparison device 200, for determining and correcting structural variation and single-nucleotide and/or small-fragment variation, respectively, based on the matched sequencing reads and the average length of the library fragments; an unmatched sequencing read analysis device 400 connected with the comparison device 200, for splicing the unmatched sequencing reads and comparing the splicing result with the reference genome sequence of the nucleic acid sample to be tested to determine exogenous variation; and an output device 500 connected with the matched sequencing read analysis device 300 and the unmatched sequencing read analysis device 400, for summarizing the structural variation, the single-nucleotide and/or small-fragment variation, and the exogenous variation to determine the mutation rate of the nucleic acid sample to be tested. According to an embodiment of the invention, the nucleic acid sample to be tested is a viral genome. The system according to the embodiment of the invention is suitable for executing the above method for determining the mutation rate of a nucleic acid sample to be tested, and effectively and comprehensively detects exogenous contamination and self-variation in the sample.
According to an embodiment of the present invention, the above system may further include at least one of the following technical features:
According to an embodiment of the invention, referring to FIG. 12, the system further comprises a lowest mutation rate determining device 600, which is connected to the sequencing device 100 and the comparison device 200 and is configured to perform quality evaluation and screening on the sequencing result in advance and to re-determine the minimum detectable mutation rate based on the screening result; if the re-determined minimum detectable mutation rate is lower than a predetermined threshold, the amount of the nucleic acid sample in the sequencing device 100 is increased, and if it is not lower than the predetermined threshold, the sequencing result is input to the comparison device 200.
According to an embodiment of the present invention, referring to FIG. 13, the system further comprises an unmatched sequencing read source analysis device 700, which is connected to the comparison device 200 and is used for aligning the unmatched sequencing reads with genome reference sequences of other species, determining the proportion of sequencing reads originating from each of the other species and from unknown sources, and inputting the result to the output device 500.
According to an embodiment of the invention, referring to fig. 14, the system further comprises a splice result source analysis device 800, said splice result source analysis device 800 being connected to the unmatched sequencing read analysis device 400 for determining possible source species based on said splice result, the result being input to said output device 500.
According to an embodiment of the present invention, referring to fig. 15, the system further comprises a PCR device 900, wherein the PCR device 900 is connected to the matched sequencing read analysis device 300, the unmatched sequencing read source analysis device 700 and the unmatched sequencing read analysis device 400, and is used for performing PCR verification on the structural variation and/or the exogenous variation, and the result is input to the output device 500.
The detection and analysis flow established by the inventors for the quality control of engineered virus modification and production, which is based on high-throughput deep sequencing and related analysis techniques and runs from the sample to be tested to the analysis report, proceeds as follows:
1) A reference genome sequence of the virus to be tested is obtained through literature research, first-generation sequencing, or other means; after virus purification, high-quality viral genomic DNA is extracted.
2) A library is constructed from the extracted viral genome and sequenced by deep sequencing to obtain high-throughput sequencing data of sufficient depth. The minimum effective sequencing depth is an empirical value of 10-100; the total data volume required can be estimated from the length of the genome sequence, the minimum effective sequencing depth, and the preset minimum detectable mutation rate.
3) Sequencing quality is evaluated with sequencing data quality evaluation software (such as FastQC), focusing on base quality distribution and adapter contamination. The data are preprocessed with sequencing data preprocessing software (such as Cutadapt), with the adapter type, base quality threshold, and sequence length threshold chosen according to the quality evaluation results. Quality evaluation is performed again after preprocessing to confirm its effect. The minimum detectable mutation rate is re-estimated from the total data volume after preprocessing, and additional test sample is added if the preset value is not reached.
4) The preprocessed data are aligned to the reference genome with sequence alignment software (such as BWA) to obtain an alignment (comparison) result, and suboptimal alignments are retained. The alignment result is de-duplicated and unmatched sequencing reads are extracted with alignment result processing software (such as SAMBLASTER); the alignment result is then sorted with alignment result processing software (such as Sambamba), and the average length of the library fragments is estimated.
5) Structural variation analysis is performed with structural variation detection software (such as Pindel) based on the alignment result and the average length of the library fragments. A suitable range of variation lengths to detect is selected according to the length of the reference genome sequence, and half of the preset minimum detectable mutation rate is used as the mutation-rate screening threshold.
6) The variation detection results and the detected mutation rates are corrected with high-resolution variation detection correction software: based on re-alignment of the data involved in each detected variation, detected variations of the same type are merged, false-positive detection results caused by low-quality bases, alignment errors, and the like are eliminated, and common variation types are excluded according to public data and historical detection data.
7) Single-nucleotide and small-fragment variation analysis is performed on the alignment result with single-nucleotide and small-fragment variation detection software (such as Mutect2), and common variation types are excluded according to published virus polymorphism data and historical detection data. The results are compared with the structural variation detection results with the aim of reducing false negatives: single-nucleotide and small-fragment variations missed by the structural variation detection are added, and the estimated mutation rates of single-nucleotide and small-fragment variations also detected by the structural variation detection are corrected.
8) Unmatched sequencing reads are aligned to possible contaminant genomes (such as the human genome and mycoplasma genomes), and the proportion of each contamination source and of unknown sources in the sample is counted. The total data volume originating from the virus under test and the minimum detectable mutation rate are re-estimated according to the proportion of contaminant sequences, and additional test sample is added if the preset minimum detectable mutation rate is not reached.
9) Unmatched sequencing reads are spliced (assembled) with splicing software (such as SPAdes), and the k-mer parameters are adjusted to obtain the best splicing rate and splice length. Exogenous substitution variations are detected from the exogenous fragment splicing result with substitution variation detection software: the spliced fragments are compared with the virus reference genome, spliced fragments in which a stretch of half a read length at each end can be matched to the reference genome are selected, possible exogenous insertion variations are analyzed according to the matching positions, and the mutation rate is estimated from the data depth of the reference genome and the data depth of the spliced fragment.
10) The spliced fragments are searched with sequence similarity search software (such as BLAST), and their possible source species and gene information are analyzed.
11) For the detected structural variations and exogenous fragment variations, corresponding PCR experiments are designed for verification, and the PCR fragments are recovered for verification by first-generation sequencing.
12) The above analysis results are integrated to generate the final analysis report.
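Step 4 estimates the average library fragment length from the matched reads without saying how. A minimal sketch, assuming the alignments are available as SAM text lines and taking the mean of the template length field (TLEN, column 9) over properly paired primary alignments, is shown below. The function name is illustrative, and the flow itself may rely on whatever statistic its alignment tools report.

```python
def mean_fragment_length(sam_lines):
    """Mean template length over properly paired primary alignments.

    Each pair is counted once by keeping only the mate with positive TLEN.
    Returns None when no usable pair is present.
    """
    lengths = []
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        proper_pair = flag & 0x2
        secondary_or_supplementary = flag & (0x100 | 0x800)
        if proper_pair and not secondary_or_supplementary and tlen > 0:
            lengths.append(tlen)
    return sum(lengths) / len(lengths) if lengths else None


example = [
    "@SQ\tSN:ref\tLN:40000",
    "r1\t99\tref\t100\t60\t150M\t=\t250\t300\tACGT\tFFFF",
    "r1\t147\tref\t250\t60\t150M\t=\t100\t-300\tACGT\tFFFF",
]
print(mean_fragment_length(example))
```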
The flow chart of the invention is shown in FIG. 1.
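The screening of spliced fragments in step 9 can be illustrated with a toy example. The sketch below is not the substitution variation detection software referred to in the flow: it swaps in exact string matching for real alignment, and it assumes the mutation rate is the ratio of spliced-fragment depth to reference depth, which the source does not specify. All names are hypothetical.

```python
def find_exogenous_segment(contig, reference, anchor_len):
    """Locate a foreign segment flanked by reference-matching contig ends.

    Returns (ref_left_end, ref_right_start, foreign_seq) or None. Equal
    positions indicate an insertion; a gap between them indicates that the
    reference bases in between were replaced (substitution).
    """
    if len(contig) < 2 * anchor_len:
        return None
    left = reference.find(contig[:anchor_len])
    right = reference.find(contig[-anchor_len:])
    if left < 0 or right < 0:
        return None
    # Extend the left match base by base into the contig.
    i = anchor_len
    while (i < len(contig) - anchor_len and left + i < len(reference)
           and contig[i] == reference[left + i]):
        i += 1
    # Extend the right match backwards without crossing the left match.
    j, r = len(contig) - anchor_len, right
    while j > i and r > left + i and contig[j - 1] == reference[r - 1]:
        j -= 1
        r -= 1
    if r < left + i:
        return None
    return left + i, r, contig[i:j]


def estimate_rate(contig_depth, reference_depth):
    """Assumed estimate: spliced-fragment depth relative to reference depth."""
    return min(1.0, contig_depth / reference_depth)


reference = "ACGGTCATGCCTAAGTCGGA"
contig = "GTCATGC" + "TTTTTT" + "CTAAGTC"  # foreign bases between two flanks
print(find_exogenous_segment(contig, reference, anchor_len=5))
```

With paired-end 150 bp reads, `anchor_len` would correspond to the half read length mentioned in step 9.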
Embodiments of the present invention will be described in further detail below, examples of which are illustrated in the accompanying drawings. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Example 1: Simulation test of the virus variation detection and analysis procedure
Experiment one
Simulation test of the accuracy of various tools in detecting deletion, inversion, insertion, and copy number variations of various lengths
In each simulation, a random 40000 bp sequence was generated as the reference genome sequence, and the tool ART was used to simulate library construction and generate 40000 pairs of Illumina PE150 reads as non-mutated sequencing data. Deletion variations with lengths of 1, 10, 100, 200, and 1000 bp were then randomly generated in the reference genome sequence, and ART was again used to generate 40000 pairs of Illumina PE150 reads as mutated sequencing data. The non-mutated and mutated sequencing data were mixed at mutation rates of 0.1, 0.2, and 0.5, and the resulting variations were detected with Mutect, FreeBayes, Pindel, Delly, GRIDSS, and Lumpy, respectively. The simulation was repeated 200 times to evaluate the accuracy of each tool in detecting deletion variations of different lengths.
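The genome generation, deletion, and mixing steps of this simulation can be sketched as follows. Read simulation itself was done with ART and is not reproduced here; the read pools are stand-in lists, and all function names are illustrative assumptions.

```python
import random


def random_genome(length, rng):
    """Random reference sequence over A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def apply_deletion(genome, del_len, rng):
    """Delete del_len bases at a random position; return (mutant, start)."""
    start = rng.randrange(0, len(genome) - del_len + 1)
    return genome[:start] + genome[start + del_len:], start


def mix_pairs(wild_pairs, mutant_pairs, rate, total, rng):
    """Draw `total` read pairs, a fraction `rate` of them from the mutant pool."""
    n_mut = round(total * rate)
    return rng.sample(mutant_pairs, n_mut) + rng.sample(wild_pairs, total - n_mut)


rng = random.Random(0)
genome = random_genome(40_000, rng)
mutant, start = apply_deletion(genome, 100, rng)
# Stand-ins for simulated read pairs from each genome.
wild_pool = [("wild", i) for i in range(1000)]
mutant_pool = [("mutant", i) for i in range(1000)]
mixed = mix_pairs(wild_pool, mutant_pool, rate=0.1, total=1000, rng=rng)
```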
In each simulation, a random 40000 bp sequence was generated as the reference genome sequence, and the tool Art was used to simulate library construction and generate 40000 pairs of Illumina PE150 reads as the non-mutated sequencing data. Inversion variations with lengths of 10, 100, 200 and 1000 bp were each randomly generated in the reference genome sequence, and Art was again used to generate 40000 pairs of Illumina PE150 reads as the mutated sequencing data. The non-mutated and mutated sequencing data were mixed at mutation rates of 0.1, 0.2 and 0.5, respectively, and the resulting variation was detected with the tools Mutect2, FreeBayes, Pindel, Delly, Gridss and Lumpy, respectively. The simulation was repeated 200 times to evaluate the accuracy of each tool in detecting inversion variations of different lengths.
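For illustration, an inversion can be introduced into a sequence as sketched below. The sketch assumes the usual definition of an inversion, in which the affected segment is replaced in place by its reverse complement; the text does not state how the inverted sequences were built, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def apply_inversion(seq, start, length):
    """Replace seq[start:start+length] by its reverse complement."""
    end = start + length
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


original = "TTACCGAA"
inverted = apply_inversion(original, 2, 4)
```

The sequence length is unchanged, and applying the same inversion twice restores the original sequence.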
In each simulation, a random 40000 bp sequence was generated as the reference genome sequence, and the tool Art was used to simulate library construction and generate 40000 pairs of Illumina PE150 reads as the non-mutated sequencing data. Insertion variations with lengths of 1, 10, 100 and 200 bp were each randomly generated in the reference genome sequence, and Art was again used to generate 40000 pairs of Illumina PE150 reads as the mutated sequencing data. The non-mutated and mutated sequencing data were mixed at mutation rates of 0.1, 0.2 and 0.5, respectively, and the resulting variation was detected with the tools Mutect2, FreeBayes, Pindel, Delly, Gridss and Lumpy, respectively. The simulation was repeated 200 times to evaluate the accuracy of each tool in detecting insertion variations of different lengths.
In each simulation, a random 40000 bp sequence was generated as the reference genome sequence, and the tool Art was used to simulate library construction and generate 40000 pairs of Illumina PE150 reads as the non-mutated sequencing data. 2X and 3X copy number variations with lengths of 25, 50, 100, 200 and 1000 bp were each randomly generated in the reference genome sequence, and Art was again used to generate 40000 pairs of Illumina PE150 reads as the mutated sequencing data. The non-mutated and mutated sequencing data were mixed at mutation rates of 0.1, 0.2 and 0.5, respectively, and the resulting variation was detected with the tools Mutect2, Pindel, Delly and Gridss, respectively. The simulation was repeated 200 times to evaluate the accuracy of each tool in detecting copy number variations of different lengths.
The detection results are shown in Figures 2A-2F, 3A-3F, 4A-4F and 5A-5F. Mutect2 and FreeBayes are tools for detecting single nucleotide variation and small insertion/deletion variation. In this test they detected deletion, insertion and inversion variations up to 10 bp in length, Mutect2 also detected 2X copy number variations up to 50 bp in length, and the variation rates estimated by these tools were close to the actual variation rates. Delly, Lumpy and Gridss are tools for detecting structural variation. In this test they detected deletion and inversion variations of 100 bp and longer, and Gridss, owing to its local assembly function, also detected some of the 100 bp and 200 bp insertion variations. Delly detected 2X copy number variations of 200 bp and longer, and Gridss detected 3X copy number variations as short as 25 bp. None of Delly, Lumpy and Gridss estimates the variation rate. Pindel detected variations of all tested lengths, and its estimated variation rate was close to the actual variation rate. Taking the 30X coverage that corresponds to a simulated variation rate of 0.1 as the detection limit, the performance of each tool was consistent with its performance on the simulated data with variation rates of 0.2 or 0.5. In conclusion, Pindel can comprehensively detect the various types of variation at various lengths and can serve as the main tool for detecting virus variation; Mutect2 and FreeBayes can supplement it for the detection of single nucleotide variation and small insertion/deletion variation; longer exogenous insertion variations cannot be detected with tools based on alignment to a known reference and need to be detected with assembly-based tools.
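The 30X figure quoted as the detection limit can be checked with a short calculation. The sketch assumes that each simulated data set contains 40000 read pairs of 2 x 150 bp over the 40000 bp reference, which is the reading under which the stated 30X at a variation rate of 0.1 is reproduced; the function names are illustrative.

```python
def mean_coverage(read_pairs, read_length, genome_length):
    """Mean per-base coverage of paired-end data over the genome."""
    return read_pairs * 2 * read_length / genome_length


def variant_coverage(total_coverage, variation_rate):
    """Coverage contributed by reads from the mutated sequence."""
    return total_coverage * variation_rate


total = mean_coverage(read_pairs=40000, read_length=150, genome_length=40000)
per_rate = {rate: variant_coverage(total, rate) for rate in (0.1, 0.2, 0.5)}
```

Under these assumptions the total coverage is 300X, so the three mixtures correspond to roughly 30X, 60X and 150X of variant-supporting coverage.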
Experiment two
Simulation test of the accuracy of each tool in detecting exogenous fragment variations of various lengths
In each simulation, a random 40000 bp sequence was generated as the reference genome sequence, and the tool Art was used to simulate library construction and generate 40000 pairs of Illumina PE150 reads as the non-mutated sequencing data. Substitution variations with deletion lengths of 0, 200, 500, 1000 and 10000 bp and insertion lengths of 200, 500, 1000 and 10000 bp were each randomly generated in the reference genome sequence, and Art was again used to generate 40000 pairs of Illumina PE150 reads as the mutated sequencing data. The non-mutated and mutated sequencing data were mixed at mutation rates of 0.1, 0.2 and 0.5, respectively, and the unmatched sequencing reads were assembled with the tool Spades to detect the resulting variation. The exogenous insertion/substitution variation detection flow is shown in Fig. 6. The simulation was repeated 200 times to evaluate the accuracy of the assembly tool Spades in detecting substitution variations with inserted fragments of different lengths.
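A substitution variation of this kind can be sketched as a deletion followed by insertion of an exogenous fragment at the same position; a deletion length of 0 then gives a pure insertion. The helper names and the use of a random sequence as the exogenous fragment are assumptions for illustration only.

```python
import random


def random_fragment(length, rng):
    """Random sequence standing in for an exogenous fragment."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def apply_substitution(seq, start, deletion_length, insert_seq):
    """Delete `deletion_length` bases at `start` and insert `insert_seq` there."""
    return seq[:start] + insert_seq + seq[start + deletion_length:]


rng = random.Random(1)  # fixed seed so the sketch is reproducible
reference = random_fragment(40000, rng)
exogenous = random_fragment(500, rng)
mutant = apply_substitution(reference, start=10000, deletion_length=200,
                            insert_seq=exogenous)
```

Reads that originate entirely inside the inserted fragment have no counterpart in the reference, which is why they end up among the unmatched reads that are passed to the assembler.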
The detection results are shown in Figs. 7A to 7D. For substitution variations with inserted and deleted fragments of different lengths, the Spades tool assembled the exogenous fragments and aligned them accurately to the positions at which the substitution variations occurred, and the variation rate estimated by the tool was close to the actual variation rate. Taking the 30X coverage that corresponds to a simulated variation rate of 0.1 as the detection limit, the detection accuracy was consistent with that on the simulated data with variation rates of 0.2 or 0.5. This shows that, for longer exogenous fragment insertion/substitution variations, the assembly-based Spades tool can accurately detect the exogenous fragments and the positions at which the variations occur.
Example 2 Adenovirus experimental test of the virus variation detection analysis procedure
Experiment one
Experimental test of the accuracy of the virus variation detection analysis procedure in detecting deletion and inversion variations of various lengths
An adenovirus packaging plasmid about 40000 bp in length was constructed as the non-mutated vector. Deletion and inversion variations of different lengths were introduced at different positions of the non-mutated vector, as shown in Table 1. The non-mutated vector and the mutated vectors were mixed at mutation rates of 0.001, 0.01 and 0.1, respectively, and Illumina PE150 deep sequencing was performed at a sequencing depth of 1G. The variation in each sample was detected with the virus variation detection analysis procedure and compared with the mutated vector and the actual variation rate to evaluate the accuracy of the procedure.
The detection results are shown in Figs. 8A to 8C. The virus variation detection analysis procedure accurately detected the deletion and inversion variations of different lengths at the different positions tested. Taking the coverage of about 30X that corresponds to an actual variation rate of 0.001 as the detection limit, the detection accuracy was consistent with that at variation rates of 0.01 or 0.1.
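The "about 30X" coverage at a variation rate of 0.001 can be related to the data volume with a rough calculation. The sketch assumes that a sequencing depth of 1G means about 10^9 sequenced bases and that all of them map to the roughly 40000 bp vector; both points are assumptions, and the function name is illustrative.

```python
def expected_variant_coverage(total_bases, genome_length, variation_rate):
    """Mean coverage expected from molecules carrying the variation."""
    return total_bases / genome_length * variation_rate


coverage_by_rate = {
    rate: expected_variant_coverage(1e9, 40000, rate)
    for rate in (0.001, 0.01, 0.1)
}
```

Under these assumptions the lowest mixing ratio yields roughly 25X of variant-supporting coverage, the same order as the figure quoted in the text.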
Table 1:
| | Type | Length (bp) | Start position (bp) |
|---|---|---|---|
| Variation A | Deletion | 15 | 1177 |
| Variation B | Deletion | 347 | 1837 |
| Variation C | Deletion | 1042 | 2907 |
| Variation D | Inversion | 12 | 6084 |
| Variation E | Inversion | 320 | 5047 |
| Variation F | Inversion | 2079 | 4504 |
Experiment two
Experimental test of the virus variation detection analysis procedure on virus sample SYN
The genome of virus sample SYN was extracted and subjected to Illumina PE150 deep sequencing at a sequencing depth of 1G. The variation in the sample was detected with the virus variation detection analysis procedure, and the detection results were verified by PCR and first-generation sequencing.
Analysis of the sequencing results detected exogenous fragment 1 and exogenous fragment 2. Corresponding PCR primers were designed for each, and the gel electrophoresis results, shown in Fig. 9, were consistent with the lengths and positions of the detected exogenous fragments. The PCR fragments were recovered for first-generation sequencing, and the sequences were consistent with the detected exogenous fragments.
Example 3 Test of the high-resolution correction procedure for variations detected by Pindel
Experiment one
Test of correcting the detection results of virus sample SYN2 with the high-resolution correction procedure for variations detected by Pindel
The high-resolution correction procedure for variations detected by Pindel is shown in Fig. 10. Pindel results with a variation depth greater than 10 were screened, giving 254 results in total. After high-resolution correction, 34 microsatellite instability variations and 8 other variations were detected, 2 of the latter being combined variations. Compared with the historical detection results, all of the microsatellite instability variations were previously known, 5 of the other variations were previously known, and 3 were newly detected. The results are shown in Table 2. The visual correction procedure effectively improved the accuracy of the Pindel detection results.
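The first screening step, keeping only Pindel results with a variation depth greater than 10, can be sketched as below. The `PindelCall` record and its fields are hypothetical and do not reflect Pindel's actual output format; the later correction steps shown in Fig. 10 are not reproduced here.

```python
from dataclasses import dataclass

MIN_VARIATION_DEPTH = 10  # threshold stated in the text (strictly greater than)


@dataclass
class PindelCall:
    position: int          # start position of the call on the reference
    variation_type: str    # e.g. "deletion", "insertion", "inversion"
    length: int            # length of the variation in bp
    variation_depth: int   # number of reads supporting the variation


def screen_by_depth(calls, min_depth=MIN_VARIATION_DEPTH):
    """Keep calls supported by more than `min_depth` reads."""
    return [c for c in calls if c.variation_depth > min_depth]


# Hypothetical calls, for illustration only.
calls = [
    PindelCall(1177, "deletion", 15, 42),
    PindelCall(2500, "insertion", 1, 10),
    PindelCall(6084, "inversion", 12, 11),
]
screened = screen_by_depth(calls)
```

A call with exactly 10 supporting reads is dropped, since the stated criterion is a depth greater than 10.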
Table 2:
Example 4 Lentivirus experimental test of the virus variation detection analysis procedure
Experiment one
Experimental test of the virus variation detection analysis procedure on lentivirus samples
A lentiviral vector carrying a fragment of a certain length was constructed as the non-mutated vector. Deletion and inversion variations of different lengths were introduced at different positions of the non-mutated vector. The non-mutated vector and each mutated vector were mixed at a mutation rate of 0.01, and Illumina PE150 deep sequencing was performed at a sequencing depth of 1G. The variation in each sample was detected with the virus variation detection analysis procedure and compared with the mutated vector and the actual variation rate to evaluate the accuracy of the procedure.
The virus variation detection analysis procedure accurately detected the deletion and inversion variations of different lengths at different positions, and the detection results were consistent with the actual variation rate of 0.01.
Example 5 Adeno-associated virus experimental test of the virus variation detection analysis procedure
Experiment one
Experimental test of the virus variation detection analysis procedure on adeno-associated virus samples
An adeno-associated virus vector carrying a fragment of a certain length was constructed as the non-mutated vector. Deletion and inversion variations of different lengths were introduced at different positions of the non-mutated vector. The non-mutated vector and each mutated vector were mixed at a mutation rate of 0.01, and Illumina PE150 deep sequencing was performed at a sequencing depth of 1G. The variation in each sample was detected with the virus variation detection analysis procedure and compared with the mutated vector and the actual variation rate to evaluate the accuracy of the procedure.
The virus variation detection analysis procedure accurately detected the deletion and inversion variations of different lengths at different positions, and the detection results were consistent with the actual variation rate of 0.01.
Example 6 Herpes simplex virus experimental test of the virus variation detection analysis procedure
Experiment one
Experimental test of the virus variation detection analysis procedure on herpes simplex virus samples
A herpes simplex virus vector carrying a fragment of a certain length was constructed as the non-mutated vector. Deletion and inversion variations of different lengths were introduced at different positions of the non-mutated vector. The non-mutated vector and each mutated vector were mixed at a mutation rate of 0.01, and Illumina PE150 deep sequencing was performed at a sequencing depth of 1G. The variation in each sample was detected with the virus variation detection analysis procedure and compared with the mutated vector and the actual variation rate to evaluate the accuracy of the procedure.
The virus variation detection analysis procedure accurately detected the deletion and inversion variations of different lengths at different positions, and the detection results were consistent with the actual variation rate of 0.01.
In this specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, provided they do not contradict one another.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.
Claims (20)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010207884.3A CN113436679B (en) | 2020-03-23 | 2020-03-23 | Method and system for determining mutation rate of nucleic acid sample to be tested |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113436679A CN113436679A (en) | 2021-09-24 |
| CN113436679B true CN113436679B (en) | 2024-05-10 |
Family
ID=77753267
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010207884.3A Active CN113436679B (en) | 2020-03-23 | 2020-03-23 | Method and system for determining mutation rate of nucleic acid sample to be tested |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113436679B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115631792A (en) * | 2022-11-18 | 2023-01-20 | 湖南师范大学 | Sequencing-based hybrid fish gene recombination analysis method and device |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102251059A (en) * | 2011-07-12 | 2011-11-23 | 武汉百泰基因工程有限公司 | Hepatitis B virus lamivudine resistant RNA quantitative detection kit, detection method, primers and probes |
| CN104781421A (en) * | 2012-09-04 | 2015-07-15 | 夸登特健康公司 | Systems and methods for detecting rare mutations and copy number variations |
| CN105483118A (en) * | 2015-12-21 | 2016-04-13 | 浙江大学 | Gene editing technique taking Argonaute nuclease as core |
| CN105861710A (en) * | 2016-05-20 | 2016-08-17 | 北京科迅生物技术有限公司 | Sequencing joint and preparation method and application thereof in ultra-low frequency mutation detection |
| CN106909806A (en) * | 2015-12-22 | 2017-06-30 | 广州华大基因医学检验所有限公司 | The method and apparatus of fixed point detection variation |
| CN107209814A (en) * | 2015-01-13 | 2017-09-26 | 10X基因组学有限公司 | For making structure variation and the visual system and method for phase information |
| WO2020005159A1 (en) * | 2018-06-25 | 2020-01-02 | Lucence Life Sciences Pte. Ltd. | Method for detection and quantification of genetic alterations |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP1568765A1 (en) * | 2004-02-23 | 2005-08-31 | GSF-Forschungszentrum für Umwelt und Gesundheit GmbH | Method for genetic diversification in gene conversion active cells |
| JP7113838B2 (en) * | 2016-11-16 | 2022-08-05 | イルミナ インコーポレイテッド | Enabling method and system for array variant calling |
Non-Patent Citations (3)
| Title |
|---|
| Oncogenic Mutations of p110a Isoform of PI 3-Kinase Upregulate Its Protein Kinase Activity; Christina M. Buchanan et al.; PLOS ONE; Vol. 8 (No. 8); entire document * |
| Oncolytic adenovirus programmed by synthetic gene circuit for cancer immunotherapy; Huiya Huang et al.; Nature Communications; entire document * |
| Research status of polyploid identification and breeding in mulberry; 焦锋 et al.; 《西北农业学报》; Vol. 24 (No. 12); entire document * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |