CN113436679B

CN113436679B - Method and system for determining mutation rate of nucleic acid sample to be tested

Info

Publication number: CN113436679B
Application number: CN202010207884.3A
Authority: CN
Inventors: 谢震; 黄慧雅; 廖微曦; 曹玉冰; 郭亚琨
Original assignee: Beijing Syngentech Co ltd; Tsinghua University
Current assignee: Beijing Syngentech Co ltd; Tsinghua University
Priority date: 2020-03-23
Filing date: 2020-03-23
Publication date: 2024-05-10
Anticipated expiration: 2040-03-23
Also published as: CN113436679A

Abstract

The invention provides a method for determining the mutation rate of a nucleic acid sample to be tested. The method comprises the following steps: sequencing the nucleic acid sample to be tested to obtain a sequencing result; aligning the sequencing result with a reference genome sequence of the nucleic acid sample to be tested to obtain an alignment result; determining and correcting structural variation and single nucleotide and/or small fragment variation, respectively, based on the matched sequencing reads and the average length of the library fragments; aligning the unmatched sequencing reads to genome reference sequences of other species, and determining the proportion of sequencing reads originating from each other species and from unknown sources; splicing the unmatched sequencing reads, aligning the splicing result with the reference genome sequence of the nucleic acid sample to be tested to determine exogenous variation, and determining possible source species based on the splicing result; and summarizing the structural variation, the single nucleotide and/or small fragment variation and the exogenous variation to determine the mutation rate of the nucleic acid sample to be tested.

Description

Method and system for determining mutation rate of nucleic acid sample to be tested

Technical Field

The present invention relates to the field of bioinformatics, and in particular to a method and system for determining the mutation rate of a nucleic acid sample to be tested, a computer-readable storage medium, and an electronic device.

Background

Viruses often have stable structures, simple genomes, broad-spectrum infectivity and efficient packaging capability, and have therefore become widely used engineered vectors for DNA delivery and expression. Exploiting the immunogenicity of viruses, researchers have used inactivated, attenuated or engineered viruses as effective vaccines. Furthermore, exploiting the fact that viruses lyse host cells during amplification, researchers have engineered viruses into oncolytic viruses that retain replication and packaging capability and specifically kill tumor cells. With the development of virus-related research, various viruses such as adenovirus, lentivirus and herpes simplex virus-1 are currently targets of engineering, and various virus products have been applied in clinical treatment.

Although viruses have the above advantages as engineering vectors, their pathogenicity and susceptibility to mutation also increase safety risks. Detection of exogenous contamination and self-variation is an important aspect of quality control during engineering and production. In the case of adenoviruses, the FDA requires that the level of replication-competent adenovirus (RCA) in non-replicating adenovirus products be less than 1 RCA per 3×10^10 viral particles (VP). At present, exogenous and variant fragments in a virus sample are detected mainly by low-throughput methods, namely PCR detection and first-generation sequencing of specific regions. These methods require primers to be designed for the fragments to be detected according to the possible variant types, so they can hardly cover all exogenous and variant fragments in the sample; they are also constrained by the specificity and length limits of the PCR reaction, so highly homologous fragments and long fragments are difficult to detect. Deep sequencing builds a library by randomly fragmenting the sample to be tested, detects all fragments in the sample at high throughput, captures rich neighborhood information around each fragment and, combined with suitable analysis techniques, can effectively detect exogenous and variant fragments in a virus sample.

In summary, detecting exogenous and variant fragments in virus samples is an important part of quality control, but conventional detection methods still suffer from low throughput, incomplete coverage, and difficulty in detecting highly homologous fragments and long fragments. Based on high-throughput deep sequencing and related analysis techniques, the inventors have established a detection and analysis workflow that runs from the sample to be tested to the analysis report, so as to effectively and comprehensively detect and analyze exogenous contamination and self-variation in the sample.

Disclosure of Invention

The present application has been made based on the findings and knowledge of the inventors regarding the following facts and problems:

In the quality control of engineered virus modification and production, in order to detect exogenous contamination and self-variation, the inventors have established a detection and analysis workflow from the sample to be tested to the analysis report, based on high-throughput deep sequencing and related analysis techniques, which effectively and comprehensively detects and analyzes exogenous contamination and self-variation in the sample.

In a first aspect of the invention, the invention provides a method for determining the mutation rate of a nucleic acid sample to be tested. According to an embodiment of the invention, the method comprises: (1) sequencing a nucleic acid sample to be tested so as to obtain a sequencing result, wherein the lowest effective depth of the sequencing is 10-100, the data size of the sequencing result is determined based on the length of a reference genome, the lowest effective sequencing depth and a preset lowest variation rate of detectable variation, and the sequencing result consists of a plurality of sequencing reads; (2) comparing the sequencing result with a reference genome sequence of the nucleic acid sample to be tested so as to obtain a comparison result, wherein the comparison result comprises matched sequencing reads and unmatched sequencing reads, and determining the average length of the sequenced library fragments based on the matched sequencing reads; (3) determining and correcting structural variation and single nucleotide and/or small fragment variation, respectively, based on the matched sequencing reads and the average length of the library fragments; (4) splicing the unmatched sequencing reads, and comparing the splicing result with the reference genome sequence of the nucleic acid sample to be tested so as to determine exogenous variation; (5) summarizing the structural variation, the single nucleotide and/or small fragment variation and the exogenous variation so as to determine the mutation rate of the nucleic acid sample to be tested. According to an embodiment of the invention, the nucleic acid sample to be tested comprises a viral genome. The method provided by the embodiment of the invention can effectively and comprehensively detect and analyze exogenous contamination and self-variation in the nucleic acid sample to be tested.

According to an embodiment of the present invention, the method may further comprise at least one of the following additional technical features:

According to an embodiment of the present invention, before step (2) is performed, the sequencing result is subjected to quality assessment and screening in advance, and the lowest mutation rate of the detectable mutation is re-determined based on the screening result; if the lowest mutation rate of the detectable mutation is lower than a predetermined threshold, the amount of the nucleic acid sample in step (1) is increased.

According to an embodiment of the present invention, in step (3), the structural variation is determined using Pindel, and half of the predetermined lowest variation rate of the detectable variation is employed as the variation rate screening threshold.

According to an embodiment of the invention, after determining the structural variation, the single nucleotide variation and/or the small fragment variation, the sequencing reads involved in the variation are subjected to secondary comparison, and the detection variations of the same type are combined and false positive detection results due to low quality bases, comparison result errors and the like are corrected.

According to an embodiment of the present invention, in step (3), the secondary alignment is performed using different software than the alignment in step (2).

According to the embodiment of the invention, in the step (3), common mutation types are excluded according to the public data and the historical detection data.

According to an embodiment of the invention, in step (3), the detection of the single nucleotide and/or small fragment variations is performed using Mutect2.

According to an embodiment of the invention, in step (4), a possible source species is determined based on the splice result.

According to embodiments of the invention, the unmatched sequencing reads are further aligned with genomic reference sequences of other species, and the sequencing read proportions of each other species source and unknown source are determined.

According to an embodiment of the invention, the genome of the other species comprises a human genome and/or a mycoplasma genome.

According to an embodiment of the invention, the method further comprises performing PCR validation on the structural variation and/or exogenous variation.

In a second aspect of the present invention, the present invention proposes a computer-readable storage medium having a computer program stored thereon. According to an embodiment of the present invention, the program when executed by a processor implements the method for determining the mutation rate of a nucleic acid sample to be tested as described above.

In a third aspect of the invention, the invention provides an electronic device. According to an embodiment of the invention, the electronic device comprises a memory, a processor; wherein the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory for implementing the method of determining the mutation rate of a nucleic acid sample to be tested as described above.

In a fourth aspect of the invention, the invention provides a system for determining the mutation rate of a nucleic acid sample to be tested. According to an embodiment of the invention, the system comprises: a sequencing device for sequencing the nucleic acid sample to be tested so as to obtain a sequencing result, wherein the lowest effective depth of the sequencing is 10-100, the data size of the sequencing result is determined based on the length of a reference genome, the lowest effective sequencing depth and a preset lowest variation rate of detectable variation, and the sequencing result consists of a plurality of sequencing reads; a comparison device, connected with the sequencing device, for comparing the sequencing result with a reference genome sequence of the nucleic acid sample to be tested so as to obtain a comparison result, wherein the comparison result comprises matched sequencing reads and unmatched sequencing reads, and the average length of the sequenced library fragments is determined based on the matched sequencing reads; a matched sequencing read analysis device, connected with the comparison device, for determining and correcting structural variation and single nucleotide and/or small fragment variation, respectively, based on the matched sequencing reads and the average length of the library fragments; an unmatched sequencing read analysis device, connected with the comparison device, for splicing the unmatched sequencing reads and comparing the splicing result with the reference genome sequence of the nucleic acid sample to be tested so as to determine exogenous variation; and an output device, connected with the matched sequencing read analysis device and the unmatched sequencing read analysis device, for summarizing the structural variation, the single nucleotide and/or small fragment variation and the exogenous variation so as to determine the mutation rate of the nucleic acid sample to be tested. According to an embodiment of the invention, the nucleic acid sample to be tested is a viral genome. The system according to the embodiment of the invention is suitable for executing the above method for determining the mutation rate of a nucleic acid sample to be tested, and effectively and comprehensively detects and analyzes exogenous contamination and self-variation in the sample.

According to an embodiment of the present invention, the above system may further include at least one of the following technical features:

According to an embodiment of the invention, the system further comprises:

a lowest mutation rate determining device, which is connected with the sequencing device and the comparison device and is used for carrying out quality evaluation and screening on the sequencing result in advance and re-determining the lowest mutation rate of the detectable mutation based on the screening result; if the lowest mutation rate of the detectable mutation is lower than a preset threshold value, the quantity of the nucleic acid sample in the sequencing device is increased, and if the lowest mutation rate of the detectable mutation is not lower than the preset threshold value, the sequencing result is input into the comparison device.

According to an embodiment of the invention, the system further comprises an unmatched sequencing read source analysis device connected to the alignment device for aligning the unmatched sequencing reads with other species genome reference sequences and determining the sequencing read proportions of each other species source and unknown source, and inputting the results to the output device.

According to an embodiment of the invention, the system further comprises a splice result source analysis device connected to the unmatched sequencing read analysis device for determining possible source species based on the splice result, the result being input to the output device.

According to an embodiment of the present invention, the system further comprises a PCR device, which is connected to the matched sequencing read analysis device, the unmatched sequencing read source analysis device and the unmatched sequencing read analysis device, and is configured to perform PCR verification on the structural variation and/or the exogenous variation, and input the result to the output device.

Drawings

FIG. 1 is a schematic diagram of a virus variation detection analysis flow;

FIGS. 2A-2F are simulation tests of the accuracy of different tools in detecting deletion variations of different lengths in different proportions;

FIGS. 3A-3F are simulation tests of the accuracy of different tools in detecting inversion variations of different lengths in different proportions;

FIGS. 4A-4F are simulation tests of the accuracy of different tools to detect insertion variations of different lengths in different proportions;

FIGS. 5A-5F are simulation tests of the accuracy of different tools to detect copy number variations of different lengths in different proportions;

FIG. 6 is an exogenous insertion/substitution variation detection flow;

FIGS. 7A-7D are simulation tests of accuracy for detecting exogenous substitution variations of different lengths in different proportions based on stitching;

FIGS. 8A-8C are experimental tests of the accuracy of detecting deletion and inversion variations of different lengths in different proportions;

FIG. 9 is an adenovirus sample experimental test and PCR validation;

FIG. 10 is a Pindel variant high resolution correction flow;

FIG. 11 is a schematic diagram showing the structure of a system for determining the mutation rate of a nucleic acid sample to be tested according to an embodiment of the present invention;

FIG. 12 is a schematic diagram showing a system for determining the mutation rate of a nucleic acid sample to be tested according to another embodiment of the present invention;

FIG. 13 is a schematic diagram showing a system for determining the mutation rate of a nucleic acid sample to be tested according to another embodiment of the present invention;

FIG. 14 is a schematic diagram showing a system for determining the mutation rate of a nucleic acid sample to be tested according to another embodiment of the present invention;

FIG. 15 is a schematic diagram showing a system for determining the mutation rate of a nucleic acid sample according to another embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.

The invention provides a system for determining the mutation rate of a nucleic acid sample to be tested. Referring to fig. 11, the system includes: a sequencing device 100, wherein the sequencing device 100 is used for sequencing a nucleic acid sample to be tested so as to obtain a sequencing result, the lowest effective depth of the sequencing is 10-100, the data size of the sequencing result is determined based on the length of a reference genome, the lowest effective sequencing depth and a preset lowest mutation rate of detectable mutation, and the sequencing result is composed of a plurality of sequencing reads; a comparison device 200, wherein the comparison device 200 is connected with the sequencing device 100 and is used for comparing the sequencing result with a reference genome sequence of the nucleic acid sample to be tested so as to obtain a comparison result, the comparison result comprises matched sequencing reads and unmatched sequencing reads, and the average length of the sequenced library fragments is determined based on the matched sequencing reads; a matched sequencing read analysis device 300, wherein the matched sequencing read analysis device 300 is connected with the comparison device 200 and is used for determining and correcting structural variation and single nucleotide and/or small fragment variation, respectively, based on the matched sequencing reads and the average length of the library fragments; an unmatched sequencing read analysis device 400, wherein the unmatched sequencing read analysis device 400 is connected with the comparison device 200, splices the unmatched sequencing reads, and compares the splicing result with the reference genome sequence of the nucleic acid sample to be tested so as to determine exogenous variation; and an output device 500, wherein the output device 500 is connected with the matched sequencing read analysis device 300 and the unmatched sequencing read analysis device 400 and is used for summarizing the structural variation, the single nucleotide and/or small fragment variation and the exogenous variation so as to determine the mutation rate of the nucleic acid sample to be tested. According to an embodiment of the invention, the nucleic acid sample to be tested is a viral genome. The system according to the embodiment of the invention is suitable for executing the above method for determining the mutation rate of a nucleic acid sample to be tested, and effectively and comprehensively detects and analyzes exogenous contamination and self-variation in the sample.

According to an embodiment of the invention, referring to fig. 12, the system further comprises a lowest mutation rate determining device 600, wherein the lowest mutation rate determining device 600 is connected to the sequencing device 100 and the comparison device 200, and is configured to perform quality evaluation and screening on the sequencing result in advance and to re-determine the lowest mutation rate of the detectable mutation based on the screening result; if the lowest mutation rate of the detectable mutation is lower than a predetermined threshold, the amount of the nucleic acid sample in the sequencing device 100 is increased, and if it is not lower than the predetermined threshold, the sequencing result is input to the comparison device 200.

According to an embodiment of the present invention, referring to fig. 13, the system further comprises an unmatched sequencing read source analysis device 700, wherein the unmatched sequencing read source analysis device 700 is connected to the alignment device 200, and is used for aligning the unmatched sequencing read with genome reference sequences of other species, determining sequencing read proportions of sources of the other species and unknown sources, and inputting the results to the output device 500.

According to an embodiment of the invention, referring to fig. 14, the system further comprises a splice result source analysis device 800, said splice result source analysis device 800 being connected to the unmatched sequencing read analysis device 400 for determining possible source species based on said splice result, the result being input to said output device 500.

According to an embodiment of the present invention, referring to fig. 15, the system further comprises a PCR device 900, wherein the PCR device 900 is connected to the matched sequencing read analysis device 300, the unmatched sequencing read source analysis device 700 and the unmatched sequencing read analysis device 400, and is used for performing PCR verification on the structural variation and/or the exogenous variation, and the result is input to the output device 500.

The specific flow is as follows:

1) Obtaining a reference genome sequence of the detected virus through literature investigation, first generation sequencing and other modes; high quality viral genomic DNA is extracted after virus purification.

2) Performing library construction and sequencing on the extracted viral genome by a deep sequencing technology to obtain high-throughput sequencing data of sufficient depth. The lowest effective sequencing depth is an empirical value of 10-100; the total data volume required can be estimated from the length of the genome sequence, the lowest effective sequencing depth and the preset lowest detectable mutation rate.

3) Sequencing quality is evaluated using sequencing data quality evaluation software (e.g., FastQC), mainly to assess the base quality distribution and adapter contamination; the data are preprocessed using sequencing data preprocessing software (e.g., Cutadapt), with the adapter types, base quality threshold and sequence length threshold selected according to the quality evaluation results; quality evaluation is carried out again after preprocessing to confirm its effect; and the lowest detectable mutation rate is re-estimated from the total data volume after preprocessing, with additional test sample added if the preset value is not reached.

4) Aligning the preprocessed data to the reference genome using sequence alignment software (e.g., BWA) to obtain the alignment result, retaining suboptimal alignments; removing duplicates from the alignment result and extracting the unmatched sequencing reads using alignment result processing software (e.g., Samblaster); and sorting the alignment result using alignment result processing software (e.g., Sambamba) and estimating the average length of the library fragments.

5) Structural variation analysis is performed using structural variation detection software (e.g., Pindel) based on the alignment result and the average length of the library fragments; a suitable range of variation lengths to detect is selected according to the length of the reference genome sequence, and half of the preset lowest detectable mutation rate is used as the mutation rate screening threshold.

6) Correcting the mutation detection result and the detected mutation rate by using high-resolution mutation detection correction software, combining the same type of detected mutation based on the re-comparison result of the detected mutation related data, eliminating false positive detection results generated due to low-quality bases, comparison result errors and the like, and eliminating common mutation types according to the public data and the historical detection data.

7) Single nucleotide and small fragment variation analysis is performed on the alignment result using single nucleotide and small fragment variation detection software (e.g., Mutect2), and common variation types are excluded according to published virus polymorphism data and historical detection data. The result is then compared with the structural variation detection result with the aim of reducing false negatives: single nucleotide and small fragment variations that were missed by the structural variation detection are added, and the estimated variation rates of single nucleotide and small fragment variations that were also found by the structural variation detection are corrected.

8) Aligning the unmatched sequencing reads to possible contaminating genomes (e.g., the human genome and mycoplasma genomes), respectively, and counting the proportion of each contamination source and of unknown sources in the sample; re-estimating the total data volume originating from the tested virus and the lowest detectable mutation rate according to the proportion of contaminating sequences, and adding test sample if the preset lowest detectable mutation rate is not reached.

9) Splicing the unmatched sequencing reads using splicing software (e.g., Spades), and adjusting the k-mer parameters to obtain the best splicing rate and splice length. Exogenous substitution variations are then detected from the exogenous fragment splicing results using substitution variation detection software: the spliced fragments are compared with the virus reference genome, spliced fragments in which a half-read-length segment at each of the two ends can be matched to the reference genome are screened out, possible exogenous insertion variations are analyzed according to the matching positions, and the variation rate is estimated from the sequencing depth of the reference genome and the sequencing depth of the spliced fragment.

10) Searching the spliced fragments using sequence similarity search software (e.g., Blast), and analyzing the possible source species and gene information of the spliced fragments.

11) Designing corresponding PCR experiments to verify the detected structural variations and exogenous fragment variations, and recovering the PCR fragments for verification by first-generation sequencing.

12) Integrating the analysis results to generate a final analysis report.
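Steps 2), 3) and 8) size the sequencing run from the reference genome length, the lowest effective sequencing depth and the preset lowest detectable mutation rate, and later re-estimate the detectable rate from the data that survive filtering. The text does not state the formula. The sketch below assumes the simplest reading, namely that a variant present at rate r must itself be covered at the lowest effective depth, so the total depth required is depth / r. All function names and the default PE150 read layout are illustrative assumptions.

```python
import math


def required_bases(genome_length: int, min_effective_depth: float,
                   min_mutation_rate: float) -> float:
    """Total sequenced bases so that a variant at min_mutation_rate is still
    covered at min_effective_depth (assumed relation, not stated in the text)."""
    if not 0 < min_mutation_rate <= 1:
        raise ValueError("min_mutation_rate must be in (0, 1]")
    return genome_length * min_effective_depth / min_mutation_rate


def required_read_pairs(total_bases: float, read_length: int = 150) -> int:
    """Number of paired-end read pairs needed to supply total_bases."""
    return math.ceil(total_bases / (2 * read_length))


def detectable_mutation_rate(genome_length: int, min_effective_depth: float,
                             usable_bases: float) -> float:
    """Lowest detectable mutation rate for the bases left after preprocessing
    and contamination removal (the inverse of required_bases)."""
    return genome_length * min_effective_depth / usable_bases


def needs_more_sample(genome_length: int, min_effective_depth: float,
                      usable_bases: float, preset_rate: float) -> bool:
    """True when the re-estimated rate is worse (higher) than the preset one."""
    rate = detectable_mutation_rate(genome_length, min_effective_depth, usable_bases)
    return rate > preset_rate
```

Under this assumption, a 40000 bp reference with a lowest effective depth of 30 and a lowest detectable mutation rate of 0.1 needs about 12 Mbp, i.e. 300X total coverage, which is the scale of data used in the simulations of Example 1.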

The flow chart of the invention is shown in figure 1.
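Step 4) estimates the average library fragment length from the matched reads. A minimal sketch of one way to do this, assuming the alignments are available as SAM text and that the fragment length is read from the TLEN column of properly paired primary alignments. The FLAG bits and column order follow the SAM format; the function name and the choice to count each pair once through its positive TLEN are illustrative assumptions, not the workflow's actual implementation.

```python
from statistics import mean
from typing import Iterable

# SAM FLAG bits as defined by the SAM format.
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def mean_fragment_length(sam_lines: Iterable[str]) -> float:
    """Average template length over properly paired primary alignments."""
    lengths = []
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        tlen = int(fields[8])             # column 9: signed template length
        if not flag & PROPER_PAIR:
            continue
        if flag & (UNMAPPED | SECONDARY | DUPLICATE | SUPPLEMENTARY):
            continue
        if tlen > 0:                      # leftmost mate only, one count per pair
            lengths.append(tlen)
    if not lengths:
        raise ValueError("no properly paired alignments found")
    return mean(lengths)
```

In practice the same statistic is usually taken from the aligner or from the alignment-processing tool; the point here is only which records contribute to the estimate.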

Embodiments of the present invention will be described in further detail below, examples of which are illustrated in the accompanying drawings. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.

Example 1 simulation test of Virus variant detection analysis procedure

Experiment one

Simulation testing of the accuracy of various tool detection of deletion, inversion, insertion, copy number variation of various lengths

In each simulation, a 40000 bp sequence is randomly generated as the reference genome sequence, and the tool Art is used to simulate the library construction process and generate 40000 pairs of Illumina PE150 sequencing reads as non-mutated sequencing data; deletion variations with lengths of 1, 10, 100, 200 and 1000 bp, respectively, are randomly generated in the reference genome sequence, and Art is used to generate 40000 pairs of Illumina PE150 sequencing reads as mutated sequencing data; the non-mutated and mutated sequencing data are mixed at mutation rates of 0.1, 0.2 and 0.5, respectively, and the resulting variations are detected using the tools Mutect2, FreeBayes, Pindel, Delly, Gridss and Lumpy, respectively. The simulation is repeated 200 times to evaluate the accuracy of each tool in detecting deletion variations of different lengths.

In each simulation, a 40000 bp sequence is randomly generated as the reference genome sequence, and the tool Art is used to simulate the library construction process and generate 40000 pairs of Illumina PE150 sequencing reads as non-mutated sequencing data; inversion variations with lengths of 10, 100, 200 and 1000 bp, respectively, are randomly generated in the reference genome sequence, and Art is used to generate 40000 pairs of Illumina PE150 sequencing reads as mutated sequencing data; the non-mutated and mutated sequencing data are mixed at mutation rates of 0.1, 0.2 and 0.5, respectively, and the resulting variations are detected using the tools Mutect2, FreeBayes, Pindel, Delly, Gridss and Lumpy, respectively. The simulation is repeated 200 times to evaluate the accuracy of each tool in detecting inversion variations of different lengths.

In each simulation, a 40000 bp sequence is randomly generated as the reference genome sequence, and the tool Art is used to simulate the library construction process and generate 40000 pairs of Illumina PE150 sequencing reads as non-mutated sequencing data; insertion variations with lengths of 1, 10, 100 and 200 bp, respectively, are randomly generated in the reference genome sequence, and Art is used to generate 40000 pairs of Illumina PE150 sequencing reads as mutated sequencing data; the non-mutated and mutated sequencing data are mixed at mutation rates of 0.1, 0.2 and 0.5, respectively, and the resulting variations are detected using the tools Mutect2, FreeBayes, Pindel, Delly, Gridss and Lumpy, respectively. The simulation is repeated 200 times to evaluate the accuracy of each tool in detecting insertion variations of different lengths.

In each simulation, a 40000 bp sequence is randomly generated as the reference genome sequence, and the tool Art is used to simulate the library construction process and generate 40000 pairs of Illumina PE150 sequencing reads as non-mutated sequencing data; 2X and 3X copy number variations with lengths of 25, 50, 100, 200 and 1000 bp, respectively, are randomly generated in the reference genome sequence, and Art is used to generate 40000 pairs of Illumina PE150 sequencing reads as mutated sequencing data; the non-mutated and mutated sequencing data are mixed at mutation rates of 0.1, 0.2 and 0.5, respectively, and the resulting variations are detected using the tools Mutect2, Pindel, Delly and Gridss, respectively. The simulation is repeated 200 times to evaluate the accuracy of each tool in detecting copy number variations of different lengths.
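The simulated inputs described above combine a random reference, a mutated copy of it, and a read set split between the two according to the mutation rate. The following sketch builds such references and computes the split. Read simulation itself (done with Art in the experiments) is outside its scope, and the function names, the use of tandem copies to model copy number variation, and the assumption that the mixture keeps the total number of read pairs fixed are all illustrative choices rather than details taken from the experiments.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def random_reference(length: int, rng: random.Random) -> str:
    """Uniformly random DNA sequence of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def apply_deletion(ref: str, start: int, length: int) -> str:
    return ref[:start] + ref[start + length:]


def apply_inversion(ref: str, start: int, length: int) -> str:
    """Replace a segment by its reverse complement."""
    segment = ref[start:start + length]
    return ref[:start] + segment.translate(COMPLEMENT)[::-1] + ref[start + length:]


def apply_insertion(ref: str, start: int, inserted: str) -> str:
    return ref[:start] + inserted + ref[start:]


def apply_copy_number(ref: str, start: int, length: int, copies: int) -> str:
    """Tandem copies: the segment occurs `copies` times in total."""
    segment = ref[start:start + length]
    return ref[:start] + segment * copies + ref[start + length:]


def split_read_pairs(total_pairs: int, mutation_rate: float) -> tuple:
    """(non-mutated pairs, mutated pairs) for a fixed total read budget."""
    mutated = round(total_pairs * mutation_rate)
    return total_pairs - mutated, mutated
```

For example, with 40000 read pairs and a mutation rate of 0.1, `split_read_pairs` assigns 4000 pairs to the mutated genome; on a 40000 bp reference with PE150 reads that is 30X of variant-supporting coverage out of 300X in total.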

The detection results are shown in FIGS. 2A-2F, 3A-3F, 4A-4F and 5A-5F. Mutect2 and FreeBayes are tools for detecting single nucleotide variations and small insertion/deletion variations; in the test they detected deletion, insertion and inversion variations up to 10 bp in length, Mutect2 additionally detected 2X copy number variations up to 50 bp in length, and the variation rates estimated by these tools were close to the actual variation rates. Delly, Lumpy and Gridss are tools for detecting structural variations; in the test they detected deletion and inversion variations of at least 100 bp, and Gridss, owing to its partial splicing function, also detected some insertion variations of 100 and 200 bp. Delly detected 2X copy number variations of at least 200 bp, and Gridss detected 3X copy number variations of at least 25 bp. None of Delly, Lumpy and Gridss can estimate the variation rate. Pindel detected variations of all tested lengths, and its estimated variation rate was close to the actual variation rate. Taking the 30X coverage corresponding to a simulated variation rate of 0.1 as the detection limit, the performance of each tool was consistent with its performance on the simulated data with variation rates of 0.2 or 0.5. In conclusion, Pindel can comprehensively detect various types of variation over a range of lengths and can serve as the main tool for detecting virus variation; Mutect2 and FreeBayes can serve as supplements for detecting single nucleotide variations and small insertion/deletion variations; longer exogenous insertion variations cannot be detected with tools based on alignment to a known sequence and need to be detected with splicing-based tools.

Experiment two

Simulation test of the accuracy of each tool in detecting exogenous fragment variations of various lengths

In each simulation, a 40000 bp sequence was randomly generated as the reference genome sequence, and the tool Art was used to simulate the library building process and generate 40000 pairs of Illumina PE150 sequencing reads as non-mutated sequencing data. Substitution variations with deletion lengths of 0, 200, 500, 1000 and 10000 bp and insertion lengths of 200, 500, 1000 and 10000 bp were each randomly introduced into the reference genome sequence, and Art was again used to generate 40000 pairs of Illumina PE150 sequencing reads as the mutated sequencing data. The non-mutated and mutated sequencing data were mixed at mutation rates of 0.1, 0.2 and 0.5, and the unmatched sequencing reads were spliced with the tool Spades to detect the resulting variations. The detection flow for exogenous insertion/substitution variations is shown in FIG. 6. The simulation was repeated 200 times to evaluate the accuracy of the splicing tool Spades in detecting substitution variations with inserted fragments of different lengths.
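The construction of one simulated substitution variation can be sketched as follows. This is a minimal illustration, not the script used for the experiment: the helper names `random_sequence` and `make_substitution_variant` are hypothetical, the exogenous fragment is assumed to be a random sequence, and the read simulation performed with Art is not reproduced.

```python
import random


def random_sequence(length: int, rng: random.Random) -> str:
    """Return a random DNA sequence of the given length."""
    return "".join(rng.choices("ACGT", k=length))


def make_substitution_variant(reference: str, del_len: int, ins_len: int,
                              rng: random.Random):
    """Replace `del_len` reference bases with `ins_len` exogenous bases.

    A `del_len` of 0 gives a pure insertion. Returns the variant sequence,
    the 0-based start position and the inserted fragment.
    """
    start = rng.randrange(0, len(reference) - del_len + 1)
    insert = random_sequence(ins_len, rng)  # assumed random exogenous fragment
    variant = reference[:start] + insert + reference[start + del_len:]
    return variant, start, insert


rng = random.Random(0)  # fixed seed so the sketch is reproducible
reference = random_sequence(40_000, rng)
variant, start, insert = make_substitution_variant(
    reference, del_len=500, ins_len=1000, rng=rng)
print(len(reference), len(variant), start)
```

Reads simulated from `reference` and from `variant` could then be mixed in the desired proportion before alignment and splicing.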

The detection results are shown in FIGS. 7A to 7D. For substitution variations with inserted and deleted fragments of different lengths, Spades spliced the exogenous fragments and aligned them accurately to the positions where the substitution variations occurred, and the variation rates it estimated were close to the actual variation rates. Taking the 30X coverage corresponding to a variation rate of 0.1 in the simulated data as the detection limit, the detection accuracy was consistent with that on the simulated data with variation rates of 0.2 and 0.5. This shows that, for longer exogenous fragment insertion/substitution variations, the splicing-based tool Spades can accurately detect both the exogenous fragment and the position of the variation.

Example 2 Adenovirus experimental test of the virus variation detection analysis procedure

Experiment one

Experimental test of the accuracy of the virus variation detection analysis procedure in detecting deletion and inversion variations of various lengths

An adenovirus packaging plasmid about 40000 bp in length was constructed as the non-mutated vector. Deletion and inversion variations of different lengths were introduced at different positions of the non-mutated vector, as shown in Table 1. The non-mutated vector and each mutated vector were mixed at mutation rates of 0.001, 0.01 and 0.1, and Illumina PE150 deep sequencing was performed at a sequencing depth of 1G. The variations in each sample were detected through the virus variation detection analysis procedure and compared with the known mutated vectors and the actual variation rates to evaluate the accuracy of the procedure.

The detection results are shown in FIGS. 8A to 8C. The virus variation detection analysis procedure accurately detected the deletion and inversion variations of different lengths at the different positions tested. Taking the coverage of about 30X corresponding to an actual variation rate of 0.001 as the detection limit, the detection accuracy was consistent with that at variation rates of 0.01 and 0.1.
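The relationship between sequencing data volume, genome length and the lowest variation rate of interest can be sketched as below. The formula is one plausible reading of how the data volume depends on the reference genome length, the minimum effective depth and the minimum detectable variation rate; it is not stated explicitly in this document. Treating "1G" as 10^9 sequenced bases is likewise an assumption, and both function names are illustrative.

```python
def required_bases(genome_len: int, min_effective_depth: float,
                   min_variation_rate: float) -> float:
    """Bases needed so that a variant at the minimum rate reaches the minimum depth."""
    return genome_len * min_effective_depth / min_variation_rate


def expected_variant_depth(total_bases: float, genome_len: int,
                           variation_rate: float) -> float:
    """Mean depth of variant-supporting reads for a given data volume."""
    return total_bases / genome_len * variation_rate


# About 40000 bp vector, 30X target for variant reads, lowest rate 0.001.
print(required_bases(40_000, 30, 0.001))
# Assumed reading of "1G" as 1e9 bases.
print(expected_variant_depth(1e9, 40_000, 0.001))
```

Under these assumptions, about 1.2 x 10^9 bases are needed for 30X of variant coverage at a rate of 0.001, and 10^9 bases give about 25X, which is of the same order as the "about 30X" detection limit described for this experiment.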

Table 1:

Variation	Type	Length (bp)	Initial position (bp)
Variation A	Deletion	15	1177
Variation B	Deletion	347	1837
Variation C	Deletion	1042	2907
Variation D	Inversion	12	6084
Variation E	Inversion	320	5047
Variation F	Inversion	2079	4504

Experiment two

Experimental test of the virus variation detection analysis procedure on virus sample SYN

The genome of virus sample SYN was extracted and subjected to Illumina PE150 deep sequencing at a sequencing depth of 1G. The variations in the sample were detected through the virus variation detection analysis procedure, and the detection results were verified by PCR and first-generation sequencing.

Analysis of the sequencing results detected exogenous fragment 1 and exogenous fragment 2. PCR primers were designed for each fragment, and the gel electrophoresis results, shown in FIG. 9, were consistent with the lengths and positions of the detected exogenous fragments. The PCR fragments were recovered and subjected to first-generation sequencing, and their sequences were consistent with the detected exogenous fragments.

Example 3 Pindel variant high-resolution correction flow test

Experiment one

Test of the Pindel variant high-resolution correction flow in correcting the detection results for virus sample SYN2

The Pindel variant high-resolution correction flow is shown in FIG. 10. Pindel results with a variation depth greater than 10 were screened, giving 254 results in total. After high-resolution correction, 34 microsatellite instability variants and 8 other variants were detected, 2 of the latter being combined variants. Compared with the historical detection results, all of the microsatellite instability variants were previously known, and of the other variants 5 were previously known and 3 were newly detected. The results are shown in Table 2. The correction flow effectively improves the accuracy of the Pindel detection results.
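The screening step that precedes the correction can be sketched as a simple filter. The `VariantCall` record below is a hypothetical, simplified structure and is not Pindel's native output format. The depth cutoff of 10 follows the text, and the optional rate cutoff of half the minimum detectable variation rate follows claim 4. How the flow computes the variation rate of a call is not specified here, so the ratio of supporting reads to total reads is an assumption.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    # Hypothetical record; field names are illustrative only.
    kind: str            # e.g. "deletion", "inversion"
    start: int
    end: int
    variant_depth: int   # reads supporting the variant
    total_depth: int     # all reads covering the site

    @property
    def variation_rate(self) -> float:
        # Assumed definition: supporting reads over total reads.
        return self.variant_depth / self.total_depth if self.total_depth else 0.0


def screen_calls(calls, min_variant_depth=10, min_detectable_rate=None):
    """Keep calls with depth above the cutoff and, optionally, with a
    variation rate of at least half the minimum detectable rate."""
    kept = []
    for call in calls:
        if call.variant_depth <= min_variant_depth:
            continue
        if (min_detectable_rate is not None
                and call.variation_rate < min_detectable_rate / 2):
            continue
        kept.append(call)
    return kept


calls = [
    VariantCall("deletion", 100, 115, 25, 1000),
    VariantCall("deletion", 200, 215, 8, 1000),
    VariantCall("inversion", 300, 620, 12, 10000),
]
print([c.kind for c in screen_calls(calls, min_detectable_rate=0.01)])
```

The calls that pass this screen would then go on to the secondary alignment and merging steps of the correction flow.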

Table 2:

Example 4 Lentivirus experimental test of the virus variation detection analysis procedure

Experiment one

Experimental test of the virus variation detection analysis procedure on lentivirus samples

A lentiviral vector of a given length was constructed as the non-mutated vector. Deletion and inversion variations of different lengths were introduced at different positions of the non-mutated vector. The non-mutated vector and each mutated vector were mixed at a mutation rate of 0.01, and Illumina PE150 deep sequencing was performed at a sequencing depth of 1G. The variations in each sample were detected through the virus variation detection analysis procedure and compared with the known mutated vectors and the actual variation rate to evaluate the accuracy of the procedure.

The virus variation detection analysis procedure accurately detected the deletion and inversion variations of different lengths at different positions, and the detection accuracy was consistent at the variation rate of 0.01.

Example 5 Adeno-associated virus experimental test of the virus variation detection analysis procedure

Experiment one

Experimental test of the virus variation detection analysis procedure on adeno-associated virus samples

An adeno-associated virus vector of a given length was constructed as the non-mutated vector. Deletion and inversion variations of different lengths were introduced at different positions of the non-mutated vector. The non-mutated vector and each mutated vector were mixed at a mutation rate of 0.01, and Illumina PE150 deep sequencing was performed at a sequencing depth of 1G. The variations in each sample were detected through the virus variation detection analysis procedure and compared with the known mutated vectors and the actual variation rate to evaluate the accuracy of the procedure.

Example 6 Herpes simplex virus experimental test of the virus variation detection analysis procedure

Experiment one

Experimental test of the virus variation detection analysis procedure on herpes simplex virus samples

A herpes simplex virus vector of a given length was constructed as the non-mutated vector. Deletion and inversion variations of different lengths were introduced at different positions of the non-mutated vector. The non-mutated vector and each mutated vector were mixed at a mutation rate of 0.01, and Illumina PE150 deep sequencing was performed at a sequencing depth of 1G. The variations in each sample were detected through the virus variation detection analysis procedure and compared with the known mutated vectors and the actual variation rate to evaluate the accuracy of the procedure.

In this specification, reference to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those embodiments or examples, provided that they do not contradict one another.

While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims

1. A method for determining the mutation rate of a nucleic acid sample to be tested, comprising:

(1) sequencing the nucleic acid sample to be tested to obtain a sequencing result, wherein the minimum effective depth of the sequencing is 10 to 100, the data volume of the sequencing result is determined based on the length of the reference genome, the minimum effective sequencing depth, and a predetermined minimum mutation rate of detectable mutations, and the sequencing result is composed of multiple sequencing reads;

(2) comparing the sequencing result with the reference genome sequence of the nucleic acid sample to be tested to obtain a comparison result, wherein the comparison result includes matched sequencing reads and unmatched sequencing reads, and determining the average length of the sequenced library construction fragments based on the matched sequencing reads;

(3) determining and correcting structural variations, single nucleotide variations and/or small fragment variations based on the matched sequencing reads and the average length of the library fragments;

(4) splicing the unmatched sequencing reads, and comparing the splicing results with the reference genome sequence of the nucleic acid sample to be tested to determine the exogenous variation;

(5) summarizing the structural variation, the single nucleotide and/or small fragment variation, and the exogenous variation to determine the variation rate of the nucleic acid sample to be tested.

2. The method according to claim 1, characterized in that the nucleic acid sample to be tested includes a viral genome.

3. The method according to claim 1, characterized in that, before performing step (2), the sequencing results are quality assessed and screened in advance, and based on the screening results, the minimum detectable mutation rate is re-determined, and if the minimum detectable mutation rate is lower than a predetermined threshold, the amount of the nucleic acid sample is increased in step (1).

4. The method according to claim 1, characterized in that, in step (3), Pindel is used to determine the structural variation, and half of the predetermined minimum variation rate of detectable variation is used as the variation rate screening threshold.

5. The method according to claim 1, characterized in that, after determining the structural variation, the single nucleotide variation and/or the small fragment variation, a secondary alignment is performed on the sequencing reads involved in the structural variation, the single nucleotide variation and/or the small fragment variation, detected variations of the same type are merged, and false positive detection results caused by low-quality bases and errors in the alignment results are corrected.

6. The method according to claim 5, characterized in that, in step (3), the secondary comparison and the comparison in step (2) use different software.

7. The method according to claim 1, characterized in that, in step (3), common variant types are excluded based on public data and historical detection data.

8. The method according to claim 1, characterized in that in step (3), Mutect2 is used to detect the single nucleotide and/or small fragment variation.

9. The method according to claim 1, characterized in that the unmatched sequencing reads are further compared with genome reference sequences of other species, and the ratio of sequencing reads from other species and unknown sources is determined.

10. The method according to claim 9, characterized in that the genomes of other species include a human genome and/or a mycoplasma genome.

11. The method according to claim 1, characterized in that in step (4), the possible source species is determined based on the splicing result.

12. The method according to claim 1, characterized in that it further comprises PCR verification of the structural variation and/or exogenous variation.

13. A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the method for determining the mutation rate of a nucleic acid sample to be tested as described in any one of claims 1 to 12 is implemented.

14. An electronic device, characterized in that it comprises a memory and a processor;

the processor reads the executable program code stored in the memory and runs a program corresponding to the executable program code, so as to implement the method for determining the mutation rate of the nucleic acid sample to be tested as described in any one of claims 1-12.

15. A system for determining the mutation rate of a nucleic acid sample to be tested, comprising:

A sequencing device, wherein the sequencing device is used to sequence a nucleic acid sample to be detected so as to obtain a sequencing result, wherein the minimum effective depth of the sequencing is 10 to 100, the data volume of the sequencing result is determined based on the length of the reference genome, the minimum effective sequencing depth, and a predetermined minimum mutation rate of detectable mutations, and the sequencing result is composed of a plurality of sequencing reads;

A comparison device, connected to the sequencing device, for comparing the sequencing result with the reference genome sequence of the nucleic acid sample to be tested, so as to obtain a comparison result, wherein the comparison result includes matched sequencing reads and unmatched sequencing reads, and determining the average length of the sequenced library construction fragments based on the matched sequencing reads;

A matching sequencing read analysis device, which is connected to the comparison device and is used to determine and correct structural variations, single nucleotide and/or small fragment variations based on the matching sequencing reads and the average length of the library construction fragments;

An unmatched sequencing read analysis device, which is connected to the comparison device, splices the unmatched sequencing reads, and compares the splicing result with the reference genome sequence of the nucleic acid sample to be tested, so as to determine the exogenous variation;

An output device, which is connected to the matched sequencing read analysis device and the unmatched sequencing read analysis device, and is used to summarize the structural variation, the single nucleotide and/or small fragment variation, and the exogenous variation so as to determine the variation rate of the nucleic acid sample to be tested.

16. The system according to claim 15, characterized in that the nucleic acid sample to be tested includes a viral genome.

17. The system according to claim 15, further comprising:

A minimum mutation rate determination device, which is connected to the sequencing device and the comparison device, and is used to pre-evaluate and screen the sequencing results and, based on the screening results, redetermine the minimum mutation rate of detectable mutations, wherein if the minimum mutation rate of detectable mutations is lower than a predetermined threshold, the amount of the nucleic acid sample is increased in the sequencing device, and if the minimum mutation rate of detectable mutations is not lower than the predetermined threshold, the sequencing results are input into the comparison device.

18. The system according to claim 15, characterized in that it further comprises an unmatched sequencing read source analysis device, which is connected to the comparison device and is used to compare the unmatched sequencing reads with genome reference sequences of other species and determine the ratio of sequencing reads from other species and unknown sources, the results being input into the output device.

19. The system according to claim 15, characterized in that it further comprises a splicing result source analysis device, which is connected to the unmatched sequencing read analysis device and is used to determine the possible source species based on the splicing result, the result being input into the output device.

20. The system according to claim 18, characterized in that it further comprises a PCR device, which is connected to the matched sequencing read analysis device, the unmatched sequencing read analysis device, and the unmatched sequencing read source analysis device, and is used to perform PCR verification on the structural variation and/or exogenous variation, the results being input into the output device.