CN119132404A

CN119132404A - Base recognition method based on double-end sequencing, sequencing data quality assessment method, program product and equipment

Info

Publication number: CN119132404A
Application number: CN202411116287.4A
Authority: CN
Inventors: 姚天然; 王谷丰; 包原野; 赵陆洋
Original assignee: Shenzhen Sailu Medical Technology Co ltd
Current assignee: Shenzhen Sailu Medical Technology Co ltd
Priority date: 2024-08-14
Filing date: 2024-08-14
Publication date: 2024-12-13

Abstract

The invention discloses a base identification method based on double-end sequencing, a sequencing data quality evaluation method, a program product and equipment. The method comprises: obtaining a double-end sequencing file, and obtaining from it the first sequence data and the second sequence data corresponding to each sequencing fragment; determining the overlapping region of each sequencing fragment based on the first sequence data and the second sequence data corresponding to that fragment; and, for each base position in the overlapping region, determining the correct base type corresponding to the base position based on the comparison result of the first base type obtained from the first sequence data and the second base type obtained from the second sequence data, and on the comparison result of the estimated base quality score corresponding to the first base type and the estimated base quality score corresponding to the second base type with a preset quality score threshold.

Description

Base identification method based on double-end sequencing, sequencing data quality evaluation method, program product and equipment

Technical Field

The invention relates to the technical field of gene sequencing, and in particular to a base class identification method based on double-end sequencing, a sequencing data quality evaluation method based on double-end sequencing, a computer program product and computer equipment.

Background

In double-end sequencing, if the length of an insert (insert fragment) is less than twice the sequencing length, i.e., the read length, the two reads overlap in the middle of the insert. Within this overlapping region, the sequencer in effect detects each base of the same insert twice. In the prior art, if the two detection results for a base are the same, the detected base type is taken as the correct base type; if the two detection results differ, the base whose mutation frequency or allele frequency at that position is greater than or equal to 99% is taken as the correct base type of the locus. In other words, the method treats every locus on the genome of the sample as homozygous. This contradicts common knowledge, because even a healthy human genome contains a large number of heterozygous sites; for example, the standard human genome HG001 (NA12878) has at least two million heterozygous sites. If the sample is unhealthy tissue, such as a tumor or a sample carrying a genetic disorder (aborted fetal tissue, etc.), the number of heterozygous sites is higher still. The prior-art base type recognition method cannot determine the correct base type at a heterozygous site, because no base at such a site has a mutation frequency of 99% or more. Therefore, when the two detection results for a base differ, taking the base with a mutation frequency or allele frequency of 99% or more as the correct base type is not applicable to heterozygous sites in the genome.

In high-throughput sequencing, the sequencer outputs a quality value for each detected base (base call), called the base quality score (quality score) or Q value (Q score). Each base has its own base quality score, which represents the sequencer's estimate of the error rate of that base call. It is important that the sequencer output accurate base quality scores, because almost all downstream analyses of high-throughput sequencing data rely on them: algorithms for data quality control, sequence alignment, mutation detection (short indels, copy number, structural mutation) and so on are all computed from base quality scores. In practice, however, the score is produced as follows: the sequencer first collects an optical or electrical signal from the sensor to detect a base, and then derives the corresponding base quality score from an empirical relationship between signal intensity and base quality score. The sequencer therefore cannot directly calculate the error rate of base recognition; it can only estimate it. The base quality score that can be read for each base from the sequencing file after sequencing is completed is thus the estimated base quality score of that base. As a result, in most cases the base quality score output by the sequencer does not accurately reflect the recognition error rate, and a data quality inspection technique is required to check the quality of the sequencing data output by the sequencer.

Existing data quality inspection techniques assume by default that the two pieces of information output by the sequencer, namely the base type and the base quality score, are true. They therefore check neither the accuracy and stability of the base quality scores in the sequencing result FASTQ file (i.e., the reported base quality scores) nor the error rate and preference of the sequenced base types in that file (i.e., the reported base types). Detection errors in the sequenced base type (also called false detections or mismatches, i.e., the reported base type does not match the true base type) are common, so the prior art cannot accurately evaluate whether the quality of the sequencing data meets requirements, which affects subsequent use of the sequencing data.

Disclosure of Invention

In order to solve the existing technical problems, the embodiment of the invention provides a sequencing data quality evaluation method and equipment based on double-end sequencing, which can accurately identify base types and accurately evaluate whether sequencing data quality meets the requirements.

According to a first aspect, a base class identification method based on double-end sequencing is provided, comprising: obtaining a double-end sequencing file, and obtaining from the double-end sequencing file first sequence data and second sequence data corresponding to each sequencing fragment, wherein the first sequence data are base sequence data obtained by sequencing from a first end to a second end of the sequencing fragment, and the second sequence data are base sequence data obtained by sequencing from the second end to the first end of the sequencing fragment; determining an overlapping region of each sequencing fragment based on the first sequence data and the second sequence data corresponding to that sequencing fragment; for each base position in the overlapping region, obtaining from the first sequence data a first base type and the estimated base quality score corresponding to the first base type, and obtaining from the second sequence data a second base type and the estimated base quality score corresponding to the second base type; and determining the correct base type corresponding to the base position based on the comparison result of the first base type and the second base type corresponding to the base position, and on the comparison result of the estimated base quality score corresponding to the first base type and the estimated base quality score corresponding to the second base type with a preset quality score threshold.
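The decision step of the first aspect can be illustrated with a minimal Python sketch. The text above states only that the correct base type depends on whether the two base calls agree and on how their estimated quality scores compare with a preset threshold. The specific rule used below (keep the call whose score reaches the threshold when the other does not, otherwise leave the position undetermined), the default threshold of 20, and the names `reverse_complement`, `resolve_base` and `resolve_overlap` are assumptions made for illustration, not the claimed method.

```python
from typing import List, Optional, Sequence

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Bring a read sequenced from the opposite end onto the read 1 strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def resolve_base(base1: str, q1: int, base2: str, q2: int,
                 q_threshold: int = 20) -> Optional[str]:
    """Pick a base for one overlap position, or None if undetermined.

    Assumed rule (hypothetical): agreement wins outright; on disagreement,
    a call is trusted only if its score reaches the threshold and the
    other call's score does not.
    """
    if base1 == base2:
        return base1
    ok1 = q1 >= q_threshold
    ok2 = q2 >= q_threshold
    if ok1 and not ok2:
        return base1
    if ok2 and not ok1:
        return base2
    return None  # both trusted or both untrusted: cannot decide


def resolve_overlap(bases1: str, quals1: Sequence[int],
                    bases2: str, quals2: Sequence[int],
                    q_threshold: int = 20) -> List[Optional[str]]:
    """Resolve every position of an overlap.

    bases2/quals2 must already be on the read 1 strand and aligned
    position by position with bases1/quals1.
    """
    if not (len(bases1) == len(quals1) == len(bases2) == len(quals2)):
        raise ValueError("overlap segments must have equal length")
    return [resolve_base(b1, q1, b2, q2, q_threshold)
            for b1, q1, b2, q2 in zip(bases1, quals1, bases2, quals2)]
```

Returning `None` for undecidable positions keeps the sketch from forcing a call at sites where neither read can be preferred; how such positions are actually handled is not specified in this section.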

According to a second aspect, a sequencing data quality assessment method based on double-end sequencing is provided, comprising: determining the correct base type corresponding to each base position in the overlapping region of each sequencing fragment using the above base class identification method based on double-end sequencing; for each sequencing fragment, acquiring base information of the bases in the overlapping region from the double-end sequencing file; and assessing the quality of the sequencing data in the double-end sequencing file based on the correct base types corresponding to the bases in the overlapping region and the base information of the bases in the overlapping region.

In a third aspect, a computer program product is provided, which comprises a computer program, wherein the computer program, when executed by a processor, implements the base class identification method based on double-ended sequencing according to any embodiment of the present application, or implements the sequencing data quality assessment method based on double-ended sequencing according to any embodiment of the present application.

In a fourth aspect, a computer device is provided, including a memory and a processor, where the memory stores a computer program that, when executed by the processor, causes the processor to perform the double-ended sequencing-based base class identification method according to any of the embodiments of the present application, or to perform the double-ended sequencing-based sequencing data quality assessment method according to any of the embodiments of the present application.

According to the embodiments of the application, the first sequence data and the second sequence data of the same sequencing fragment are obtained from a double-end sequencing file, and from them the first base type, the estimated base quality score corresponding to the first base type, the second base type and the estimated base quality score corresponding to the second base type can be determined for each base position in the overlapping region. The correct base type corresponding to each base position can then be accurately identified based on the comparison result of the first base type and the second base type corresponding to the base position, and on the comparison result of the two estimated base quality scores with a preset quality score threshold. Further, the data quality in the double-end sequencing file can be evaluated based on at least one of the correct base type corresponding to each base, the sequenced base type corresponding to the base position and the corresponding sequencing quality score, so that whether the quality of the sequencing data meets requirements can be accurately evaluated.

Drawings

FIG. 1 is a schematic diagram of a genetic sequencer according to an embodiment;

FIG. 2 is a schematic diagram of the correlation between reported Q values and true Q values for Salus Evo sequencers and Illumina sequencers in one embodiment;

FIG. 3 is a schematic diagram of the sequencing of an insert in double-ended sequencing in one embodiment;

FIG. 4 is a graph showing the correlation between the nominal Q value and the logarithm of the error rate at different cycle numbers for the same duplex base AC in an exemplary embodiment of an Illumina sequencer;

FIG. 5 is a schematic representation of mismatched bases at certain fixed positions of the genome in one embodiment;

FIG. 6 is a flow chart of a base class identification method based on double-ended sequencing in one embodiment;

FIG. 7 is a flow chart of a sequencing data quality assessment method based on double-ended sequencing in an embodiment;

FIG. 8 is a schematic diagram of false detection types in an embodiment;

FIG. 9 is a graphical representation of the preference of mismatched bases in two sequencers in one embodiment;

FIG. 10 is a flow chart of a method of sequencing data quality assessment based on double ended sequencing in another embodiment;

FIG. 11 is a schematic diagram of a target line fitted under different sequencing platforms in an embodiment;

FIG. 12 is a schematic diagram of a fitted straight line fitted under different sequencing platforms in an embodiment;

FIG. 13 is a schematic diagram of saliency values corresponding to candidate features according to an embodiment;

FIG. 14 is a diagram of preliminary statistics of effective information in an embodiment;

FIG. 15 is a schematic diagram of information after grouping in one embodiment;

FIG. 16 is a graphical representation of the raw base quality score for each group in one embodiment;

FIG. 17 is a schematic diagram of a base class identification device based on double-ended sequencing in one embodiment;

FIG. 18 is a schematic diagram of a sequencing data quality assessment device based on double-ended sequencing in an embodiment;

FIG. 19 is a schematic diagram showing the structure of a gene sequencer according to an embodiment.

Detailed Description

The technical scheme of the invention is further elaborated below by referring to the drawings in the specification and the specific embodiments.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the scope of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

In the following description, reference is made to the expression "some embodiments" which describe a subset of all possible embodiments, but it should be understood that "some embodiments" may be the same subset or a different subset of all possible embodiments and may be combined with each other without conflict.

Gene sequencing refers to analyzing the base sequence of the DNA fragments of the sample to be tested, i.e., the arrangement of adenine (A), thymine (T), cytosine (C) and guanine (G). At present, gene sequencing commonly uses a fluorescent labeling method: the optical system of the gene sequencer uses a laser to excite the fluorescent labels on the sequencing chip and collects the resulting fluorescence signals. The four bases are bound to different fluorescent labels and therefore emit in four different fluorescence wavebands, which allows the bases to be identified.

In second-generation sequencing technology, taking an Illumina sequencer as an example, different fluorescent molecules have different fluorescence emission wavelengths and emit fluorescent signals of the corresponding wavelengths when irradiated by laser. After laser irradiation, a filter selectively removes light of non-specific wavelengths so that fluorescent signals of specific wavelengths are obtained, and the base type can be identified by analyzing these signals. The workflow mainly comprises sample preparation, cluster generation, sequencing and data analysis.

Sample preparation: the DNA sample to be sequenced is extracted and purified, followed by DNA fragmentation and adapter ligation. In an alternative example, the DNA sample is cleaved into smaller DNA fragments, typically using ultrasound or restriction enzymes. Adapters, which contain specific sequences used in the subsequent ligation and sequencing reactions, are then ligated to both ends of each DNA fragment.

Cluster generation: the DNA fragments are amplified and immobilized so that each DNA fragment subsequently forms a base cluster. In an alternative example, the DNA fragments are amplified by polymerase chain reaction (PCR), bridge amplification or the like, so that millions of copies of each DNA fragment are formed, and the amplified DNA fragments are immobilized on a fixation plate. Each DNA fragment forms a separate cluster on the fixation plate.

Sequencing: each base cluster on the flow cell (Flowcell) is sequenced and read. Fluorescently labeled dNTPs and sequencing primers are added during sequencing. One end of the dNTP molecule carries an azide group, which blocks further polymerization as the sequenced strand extends, ensuring that only one base is added per cycle and that one sequencing read-out is generated per cycle; this is sequencing by synthesis. In each cycle, one base is identified for each base cluster through the fluorescently labeled dNTPs. The sequencing signal responses of the different base types correspond to fluorescent signals of specific colors, so the base of each base cluster in the current cycle can be determined from the fluorescence color emitted under laser scanning. In one cycle, tens of millions of base clusters are sequenced simultaneously on the flow cell; one fluorescent spot represents the fluorescence emitted by one base cluster, and one base cluster corresponds to one read in the FASTQ file. In the sequencing stage, fluorescent images of the flow cell surface are captured by an infrared camera. The fluorescent images undergo image processing and fluorescent spot localization to detect base clusters, and a template is constructed from the base cluster detection results of the several fluorescent images corresponding to the sequencing signal responses of the different base types, giving the positions of all base cluster template points (clusters) on the flow cell. Fluorescence intensities are then extracted from the filtered images according to the template and corrected. Finally, a score is calculated from the maximum intensity at each base cluster template point, and the FASTQ base sequence file is output.
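As an illustration of the relationship between one read and its per-base scores in the output file, the following minimal Python sketch decodes a single FASTQ record. It assumes the common four-line record layout and the Phred+33 quality encoding; whether a given platform's files follow these conventions is an assumption, and the function name `parse_fastq_record` is hypothetical.

```python
from typing import List, Sequence, Tuple


def parse_fastq_record(lines: Sequence[str],
                       offset: int = 33) -> Tuple[str, str, List[int]]:
    """Decode one four-line FASTQ record into (read name, bases, Q values).

    Assumes Phred+33 encoding: each quality character encodes
    Q = ord(char) - 33.
    """
    if len(lines) != 4:
        raise ValueError("a FASTQ record has exactly four lines")
    header, seq, plus, qual = (ln.rstrip("\n") for ln in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("malformed FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    scores = [ord(ch) - offset for ch in qual]
    return header[1:], seq, scores


# One read of four bases with reported Q values 40, 40, 20 and 0.
name, bases, quals = parse_fastq_record(["@read1/1", "ACGT", "+", "II5!"])
```

In double-end sequencing the same decoding would be applied to the matching records of the read 1 and read 2 files, giving the base types and estimated base quality scores of both reads of a fragment.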

The gene sequencer may further comprise an optical platform, which may include an operation table and a camera. The sequencing chip can be placed on the operation table. The gene sequencer uses a laser to excite the fluorescent labels on the sequencing chip and collects the fluorescence signals; the four bases are bound to different fluorescent labels and so emit in four different fluorescence wavebands, yielding fluorescent images of the four base types. The camera photographs the sequencing chip and captures, on its charge-coupled device (CCD), a fluorescent image of the fluorescence signals generated on the sequencing chip. One fluorescent image contains a plurality of fluorescent spots, each representing the fluorescence emitted by one base cluster.

The imaging system of the gene sequencer may be a four-channel imaging system or a two-channel imaging system. In a two-channel imaging system, each camera needs to be exposed twice at the same position of the sequencing chip. In a four-channel imaging system, the camera of each channel shoots once at the same position of the sample, yielding one fluorescent image for each of the four base types: a fluorescent image of the A base type, a fluorescent image of the C base type, a fluorescent image of the G base type and a fluorescent image of the T base type. Because a filter selectively removes light of non-specific wavelengths after laser irradiation so that fluorescent signals of specific wavelengths are obtained, each base type corresponds to a different fluorescent signal. In the same cycle, a base cluster is therefore far brighter in the fluorescent image of its own base type than in the images of the other base types, and the base clusters that light up in the different channels theoretically do not repeat.

After the fluorescence image is obtained by the gene sequencer, the collected image is subjected to gene image reconstruction, gene image registration and gene base identification (gene basecall), so that a gene sequence is obtained.

Gene image reconstruction increases the resolution and sharpness of the fluorescent image in order to reduce crosstalk between samples. It includes, but is not limited to, conventional operations such as deconvolution.

Gene image registration corrects the fluorescent images of the four base types so that they can be superimposed and the fluorescence brightness of the four channels at the same position can be extracted, which facilitates subsequent base identification. It includes, but is not limited to, image registration within the same channel and global or local affine registration.

Gene base identification determines, from the registered images, which of the four bases A, C, G and T each base cluster in the image belongs to. After base identification, the sample to be tested has been converted from digital images into A, C, G, T base sequence information, i.e., the DNA sequence result of the sample, which is used for subsequent analysis and evaluation.
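The idea of deciding a base from registered four-channel images can be sketched as follows. This is a deliberately simplified illustration that assumes one already-corrected intensity per channel for a base cluster in each cycle, given in the order A, C, G, T, and picks the brightest channel; real base callers also apply crosstalk and phasing corrections. The names `call_base` and `call_read` are hypothetical.

```python
from typing import Iterable, Sequence

BASES = "ACGT"  # assumed channel order of the intensity tuples


def call_base(intensities: Sequence[float]) -> str:
    """Return the base whose channel is brightest for one cluster in one cycle."""
    if len(intensities) != 4:
        raise ValueError("expected one intensity per channel (A, C, G, T)")
    brightest = max(range(4), key=lambda i: intensities[i])
    return BASES[brightest]


def call_read(cycles: Iterable[Sequence[float]]) -> str:
    """Concatenate the per-cycle calls of one base cluster into a read."""
    return "".join(call_base(c) for c in cycles)


# Two cycles for one cluster: brightest in channel A, then in channel G.
read = call_read([(900.0, 50.0, 40.0, 30.0), (20.0, 30.0, 800.0, 60.0)])
```

Taking the maximum reflects the statement that a cluster is much brighter in the image of its own base type than in the other channels.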

Data analysis: the sequencing data are analyzed and interpreted based on the image data and the sequence information. The sequence information is aligned with the reference genome for mutation identification.

The process of sequencing one sample to be tested is called one Run. The sequencing of one sample consists of a plurality of cycles (Cycle), where one cycle corresponds to one reaction period, i.e., to the identification of one base type on the sequencing chip. Sequencing is performed by synthesis (sequencing while synthesizing), and in one cycle tens of millions of base clusters are sequenced simultaneously.

One sample to be tested includes a plurality of DNA fragments, and each DNA fragment is extended by one base per cycle during the above sequencing, so the length of the DNA base sequences of the sample determines the number of cycles. In each cycle, the gene sequencer obtains one fluorescent image for each of the four base types A, C, G and T, so sequencing a sample yields fluorescent images of the A, C, G and T channels over a plurality of cycles.

It should be noted that the foregoing describes the sequencing procedure using Illumina sequencing technology as an example of massively parallel sequencing (MPS): the DNA molecules to be detected are amplified by a specific amplification technique so that each DNA fragment (single-stranded library molecule) forms a base cluster, and template points of the base clusters are constructed on the sequencing chip according to the base cluster detection results, so that operations such as base recognition can subsequently be performed at those template points, improving the efficiency and accuracy of base recognition. It can be understood that the base recognition method based on fluorescently labeled dNTP gene sequencing provided by the embodiments of the present application relies on locating and identifying the base type of the base clusters formed after single-stranded library molecules are amplified on the sequencing chip, where each base cluster is a base signal acquisition unit. The method is therefore not limited to any particular amplification technique for the single-stranded library molecules. That is, the sequencing data quality evaluation method based on double-end sequencing provided by the embodiments of the present application is also applicable to locating and identifying the base type of base signal acquisition units on the sequencing chip in other massively parallel sequencing technologies. For example, the base signal acquisition unit may be a base cluster obtained by bridge amplification in Illumina sequencing technology, or a nanosphere obtained by rolling circle amplification (RCA, Rolling Circle Amplification), and the present application is not limited in this respect. In the following embodiments, for ease of understanding, the base cluster is used as the example of a base signal acquisition unit.

Referring to FIG. 1, a schematic diagram of a gene sequencer according to an embodiment is shown. The gene sequencer may further comprise an operation table and a camera. The sequencing chip can be placed on the operation table, and a plurality of base clusters are arranged in an array or randomly distributed on the gene sequencing chip. Through the staining reagent, base clusters of different types are each bound to one of the different fluorescent labels in the sequencing reaction. The fluorescent labels emit fluorescent signals when irradiated by laser, and a filter selectively removes light of non-specific wavelengths so that fluorescent signals of specific wavelengths are obtained. Fluorescent molecules in different fluorescent labels have different fluorescence emission wavelengths, so different base clusters correspond to different fluorescent signals. Fluorescent images are acquired by the camera and analyzed to identify the base type of each base cluster. The camera may be an optical microscope.

In one gene sequencing process, the sequencing of one gene sample to be tested is called one Run. The gene sample is broken into M base sequences to be tested, also called short chains, each comprising N base clusters, and in one cycle the sequencing reaction is carried out on the sequencing chip simultaneously on the base clusters at the top ends of the M short chains. On the sequencing chip, each base cluster being sequenced corresponds to one position, and in one cycle tens of millions of base clusters are sequenced simultaneously. N determines the number of cycles: the larger N is, the more cycles are needed. In different cycles, the base clusters of the M base sequences to be tested are sequenced in turn. For example, if a gene sample to be tested is broken into tens of thousands of short chains, each 100 bases long, then 100 cycles of sequencing reactions are required to identify the base types; in each cycle, the sequencing reaction is carried out on the sequencing chip on the top-end base clusters of these short chains. After one gene sequencing run is completed, a plurality of sequenced sequences, i.e., a plurality of sequenced short chains, can be obtained from the sequencing result file; one sequenced sequence is called a short chain, a read, or a sequencing fragment.

The error rate of base recognition and the base quality score are related by $E = 10^{-Q/10}$, where E is the sequencer's recognition error rate for the base and Q is the base quality score. For example, if the Q value output by the sequencer for a certain base is 30, the sequencer's recognition error rate for that base is $10^{-30/10} = 0.001$.
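The conversion in both directions can be checked numerically with a short Python sketch. The function names `q_to_error_rate` and `error_rate_to_q` are chosen here for illustration only.

```python
import math


def q_to_error_rate(q: float) -> float:
    """Recognition error rate E implied by a base quality score Q."""
    return 10 ** (-q / 10)


def error_rate_to_q(e: float) -> float:
    """Base quality score Q implied by a recognition error rate E."""
    if not 0 < e <= 1:
        raise ValueError("error rate must lie in (0, 1]")
    return -10 * math.log10(e)


# Q = 30 corresponds to one expected error in a thousand base calls.
e30 = q_to_error_rate(30)
q_back = error_rate_to_q(0.001)
```

The inverse direction is the one used when an observed error rate, for instance one measured in an overlapping region, is expressed as a quality score for comparison with the reported Q value.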

The most widely used technique for correcting base quality scores is the base grouping method. The prior art regards the following four features of a base as the features that significantly affect the accuracy of its base quality score, and groups (bins) the bases in the sample by them: the order of the read (read order), the sequencing cycle number (cycle), the duplex base (dinucleotide context) formed by the base and the base 1 bp upstream of it, and the estimated base quality score reported by the sequencer (qual). However, these four features are preset and do not suit all samples. Because of differences between samples, instruments and other influencing factors, some of these features may not be the main factors affecting the base quality score, which reduces the accuracy of the grouping and hence the correction effect. Grouping bases by these four classes of features is based on the characteristics of Illumina NovaSeq sequencers. For sequencers of other brands, the salient features may differ from those of Illumina platform sequencers, so instead of grouping bases by a fixed set of four features, suitable salient features should be selected according to the specific characteristics of the sequencer. For example, as shown in FIG. 2, the Q value of the Salus Evo sequencer is insensitive to read order; the red line is Salus Evo sequencer data, the blue line is Illumina sequencer data, and the black dashed line is the diagonal. When grouped by read order, the data points of the Salus Evo sequencer lie closer to the diagonal, indicating that its nominal base quality scores are very close to the actual base quality scores. Using the same salient features as Illumina NovaSeq for grouping would make the grouping inaccurate and impair the final correction effect.

Such methods first group all the bases in the sample that take part in the calculation according to the above four features, and then calculate the error rate and the corresponding base quality score of each group by comparing the base type of each base with the correct base at its genomic position. The base quality score of each group obtained in this way contains an error and is called the original base quality score (raw Q). To reduce the error, the original base quality scores of several groups are then fitted, using locally weighted regression (lowess), to a new base quality score called the fitted base quality score (fitted Q), which is regarded as the true base quality score of the bases in the group. Finally, the base quality scores of all bases in each group are set to the fitted base quality score of that group.
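The grouping and raw Q calculation described above can be sketched in Python as follows. The record layout (dictionaries with the keys `read_order`, `cycle`, `context`, `qual`, `called` and `correct`) and the function name `raw_q_by_group` are hypothetical. The add-one smoothing of the error rate is an assumption introduced here to avoid taking the logarithm of zero in groups without errors, and the subsequent lowess fitting step is omitted.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple

GroupKey = Tuple[int, int, str, int]  # (read order, cycle, context, reported Q)


def raw_q_by_group(bases: Iterable[Mapping]) -> Dict[GroupKey, float]:
    """Group bases by the four features and compute a raw Q per group."""
    totals: Dict[GroupKey, int] = defaultdict(int)
    errors: Dict[GroupKey, int] = defaultdict(int)
    for b in bases:
        key = (b["read_order"], b["cycle"], b["context"], b["qual"])
        totals[key] += 1
        if b["called"] != b["correct"]:
            errors[key] += 1
    raw_q = {}
    for key, n in totals.items():
        # Smoothed error rate (assumption): (errors + 1) / (n + 2).
        error_rate = (errors[key] + 1) / (n + 2)
        raw_q[key] = -10 * math.log10(error_rate)
    return raw_q


# 98 bases in one group, one of them miscalled.
good = {"read_order": 1, "cycle": 5, "context": "AC", "qual": 30,
        "called": "C", "correct": "C"}
bad = dict(good, called="T")
result = raw_q_by_group([good] * 97 + [bad])
```

In this example the group reports Q30 but its smoothed error rate is 2/100, so its raw Q is about 17, which is the kind of discrepancy between reported and observed quality that the correction is meant to expose.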

Calculating the error rate of a group requires counting the erroneous bases in the group, which in turn requires determining the correct base type of each base in the sequencing file. In the prior art, the correct base type can be determined by a correction technique based on known standard sites. As the name implies, the base at a base position is compared with a reference genome sequence; the reference here is a list of known standard sites whose content is a series of genome coordinates and the correct base type at each coordinate. The coordinate corresponding to the base position is located in the reference genome, and the base at that coordinate is taken as the correct base type at the base position. GATK-BQSR currently uses the known human standard sites in the NCBI dbSNP database. This approach has two drawbacks. First, determining the correct base type from known standard sites is only applicable to a few very widely studied species, such as humans; most other species have no known standard sites, which makes GATK-BQSR essentially unusable on sequencing data from samples of other species. Second, the method treats every base type in the sequencing result that does not match the known standard sites as a detection error of the sequencer. This contradicts common knowledge, because during a high-throughput sequencing experiment various external factors, such as high-concentration formaldehyde reagents, strong mechanical forces, inaccurate endonucleases and DNA synthetases, and improper sample storage conditions, can change base types in the sample, so that certain sites in the sample no longer match the base types of the known standard sites even though the sequencer made no detection error. Attributing both kinds of mismatch to detection errors of the sequencer underestimates the base quality score.

In double-ended sequencing, if the length of an insert (insert fragment) is less than twice the sequencing length/read length, read 1 and read 2 of the insert overlap. FIG. 3 schematically illustrates the sequencing of one insert in double-ended sequencing: if the insert is small enough, the results of read1 and read2 partially overlap.
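The overlap condition above can be illustrated with a short calculation. The following is a minimal sketch, assuming that both reads have the same length and ignoring adapters and trimming; the function name `overlap_length` is hypothetical and not part of the described method.

```python
def overlap_length(insert_len: int, read_len: int) -> int:
    """Number of insert bases covered by both read1 and read2.

    Assumes read1 starts at one end of the insert, read2 at the other,
    and both reads are `read_len` bases long (an illustrative assumption).
    """
    if insert_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")
    # Each read covers at most the whole insert.
    covered = min(insert_len, read_len)
    # Two segments of length `covered` anchored at opposite ends of the
    # insert share 2 * covered - insert_len bases (never fewer than zero).
    return max(0, 2 * covered - insert_len)


if __name__ == "__main__":
    for insert in (100, 200, 300, 400):
        print(insert, overlap_length(insert, read_len=150))
```

With 150-base reads, an insert of 300 bases or more gives no overlap, which matches the "less than twice the read length" condition in the text.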

When the fitted base quality scores are calculated, the model is fitted by locally weighted regression (lowess). Such a model cannot be extrapolated to function values outside the range of the variable, because each fit uses only values in a partial range. For example, when groups with estimated base quality scores of 11 to 30 are used, with error rates E(11), E(12), ..., E(30), the error rate E(31) of the group with an estimated base quality score of 31 cannot be fitted. Moreover, this approach does not take into account the consistency of the grouping of continuous variables, and using only values in a partial range results in overfitting in some cases. For example, FIG. 4 is a schematic diagram, in an embodiment of an Illumina sequencer, of the relationship between the reported Q value and the logarithm of the error rate at different cycle numbers for the same dinucleotide context AC. For the Salus Pro sequencer, when the dinucleotide context is AC and the read order is 1st read, the relationship between the reported Q value, i.e., the estimated base quality score (Qual), and the true error rate should be quadratic. Fitting different cycles by locally weighted regression gives a quadratic relationship at a large cycle number (140 cycles, shown on the left of FIG. 4), whereas overfitting occurs at a medium cycle number (88 cycles, shown on the right of FIG. 4).

In sequencing, the preference of mismatched bases (mismatch bias) refers to the tendency of a sequencer, when a base in a sequencing fragment (read) is erroneously detected as another base type, to make errors on particular base types rather than randomly. For example, a certain brand of sequencer may make 54 errors per hundred, and 32 of the errors may detect A as T. If the sequencer had no preference of mismatched bases, the probabilities of the twelve false detection types would be similar.

Currently, quality control software only checks the number or percentage of mismatches (the false detection number or false detection rate) between the base types output by the sequencer and the real base types, and does not consider the preference of mismatched bases. In fact, when two sequencers have the same number or percentage of mismatched bases, the data of the sequencer with the higher preference of mismatched bases is of worse quality and is more likely to cause errors in downstream analysis. This is because the sequencing errors of a sequencer with a high preference of mismatched bases tend to accumulate at certain fixed locations in the genome (as shown in FIG. 5), thereby causing false mutation detection (variant calling). The quality of the data therefore cannot be measured accurately if the preference of mismatched bases is ignored.

As shown in FIG. 6, which is a flowchart of a base class identification method based on double-ended sequencing in an embodiment, the method is applied to a gene sequencer and includes the following steps:

S11, acquiring a double-end sequencing file, and acquiring first sequence data and second sequence data corresponding to each sequencing fragment from the double-end sequencing file.

In this embodiment, the double-ended sequencing file is a sequencing result file output by the sequencer and includes the results of sequencing each sequencing fragment from both ends, namely first sequence data and second sequence data. The first sequence data is base sequence data obtained by sequencing from a first end to a second end of the sequencing fragment, and the second sequence data is base sequence data obtained by sequencing from the second end to the first end, for example the sequence data corresponding to read1 and to read2 in FIG. 3.

In this embodiment, the double-ended sequencing file is output directly by the sequencer and needs to be preprocessed. Preprocessing includes, but is not limited to, filtering out data that does not meet preset conditions, thereby improving the quality of the data. The processed double-ended sequencing file is then aligned to a reference genome using alignment software, so that each base position has a unique corresponding reference position in the reference genome. Once the alignment result is obtained, the first sequence data and the second sequence data corresponding to each sequencing fragment can be obtained.

Wherein the first sequence data and the second sequence data each comprise at least one of: the base position of each base, the base sequence of each sequencing fragment, the estimated base quality score of each base in each sequencing fragment, the sequencing cycle number corresponding to each base, the identity of the sequencing fragment, the length of the sequencing fragment, and the read order of the sequencing fragment. The base position indicates the position of the base in the sequencing fragment, the base sequence indicates the base type of each base in each sequencing fragment output by the sequencer, and the read order indicates the number and the sequencing direction of the sequencing fragment.

S12, determining the overlapping region of each sequencing fragment based on the first sequence data and the second sequence data corresponding to each sequencing fragment.

In this embodiment, the processed double-ended sequencing file is aligned to a reference genome using alignment software, so that each base position has a unique corresponding reference position in the reference genome, and the first sequence data and the second sequence data corresponding to each sequencing fragment are obtained. A reference position to which positions in both base sequences correspond belongs to the overlapping region, so the overlapping region can be obtained by traversing the reference positions.
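As an illustration of step S12, the following is a minimal sketch that derives the overlapping region from the reference intervals of the two aligned reads. It assumes gapless alignments described by inclusive (start, end) reference coordinates; the function name `overlap_region` and the coordinate convention are assumptions of this sketch, not details given in the text.

```python
from typing import Optional, Tuple

Span = Tuple[int, int]  # (start, end) reference coordinates, inclusive


def overlap_region(read1_span: Span, read2_span: Span) -> Optional[Span]:
    """Return the reference interval covered by both reads, or None.

    Both spans must refer to the same chromosome/contig and come from
    read1 and read2 of the same sequencing fragment.
    """
    start = max(read1_span[0], read2_span[0])
    end = min(read1_span[1], read2_span[1])
    # An empty intersection means the two reads do not overlap.
    return (start, end) if start <= end else None


if __name__ == "__main__":
    print(overlap_region((1000, 1149), (1050, 1199)))
    print(overlap_region((1000, 1149), (1200, 1349)))
```

Alignments containing insertions or deletions would need a per-base mapping from read position to reference position rather than a single interval.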

S13, for each base position in the overlapping region, acquiring from the first sequence data a first base type corresponding to the base position and the estimated base quality score corresponding to the first base type, and acquiring from the second sequence data a second base type corresponding to the base position and the estimated base quality score corresponding to the second base type.

S14, for each base position, determining the correct base type corresponding to the base position based on the result of comparing the first base type with the second base type at the base position, and on the result of comparing the estimated base quality scores corresponding to the first base type and to the second base type with a preset quality score threshold.

In the above embodiment, the first sequence data and the second sequence data of the same sequencing fragment are obtained from the double-ended sequencing file, and from them the first base type and its estimated base quality score, and the second base type and its estimated base quality score, are determined for each base position in the overlapping region. The correct base type corresponding to the base position is then determined based on the comparison of the first base type with the second base type and on the comparison of the two estimated base quality scores with a preset quality score threshold, which enables accurate identification of the base types in the overlapping region.

In some embodiments, determining, for each base position, the correct base type corresponding to the base position based on the result of comparing the first base type with the second base type at the base position, and on the result of comparing the estimated base quality scores corresponding to the first base type and to the second base type with a preset quality score threshold, comprises:

For each base position in the overlap region of each sequencing fragment, if the first base type corresponding to the base position in the first sequence data is the same as the second base type corresponding to the base position in the second sequence data, the correct base type at the base position is the first base type (which equals the second base type);

If the first base type is different from the second base type, and the estimated base quality scores corresponding to the first base type and to the second base type are both greater than a preset quality score threshold, taking the base type with the higher estimated base quality score as the correct base type corresponding to the base position;

If the first base type is different from the second base type, the estimated base quality scores corresponding to the first base type and to the second base type are both less than or equal to the preset quality score threshold, and a main allele exists at the reference position corresponding to the base position in the reference genome, taking the main allele as the correct base type corresponding to the base position;

If the first base type is different from the second base type, the estimated base quality scores corresponding to the first base type and to the second base type are both less than or equal to the preset quality score threshold, and no main allele exists at the reference position corresponding to the base position in the reference genome, taking the base type at the reference position as the correct base type corresponding to the base position.

In this example, the base at a base position in the overlap region corresponds to two base types, namely a first base type in the first sequence data and a second base type in the second sequence data. If the two base types are identical, the sequencing is considered correct, and the first base type (equivalently, the second base type) is taken as the correct base type at the base position. If they differ, either the sequencing is inaccurate or the position may be a heterozygous site, so it is further judged whether the estimated base quality scores corresponding to the first and second base types are both greater than the preset quality score threshold. If both are greater than the threshold, the sequencing results are considered reliable, and the base type with the higher estimated base quality score is taken as the correct base type at the base position. If both are less than or equal to the threshold, the base position may be a heterozygous site, and it is further judged whether a main allele exists at the corresponding reference position in the reference genome. If a main allele exists, it is taken as the correct base type at the base position; for example, if the sequencing depth at the reference position is 100x, i.e., the position was sequenced 100 times in total, and the sequencing results contain 60 A, 30 T, 8 C and 2 G bases, A is the main allele and is taken as the correct base type at the base position. If no main allele exists, the base type at the reference position is taken as the correct base type at the base position.

In this embodiment, the correct base is identified using the two sequencing results of the overlapping region in double-ended sequencing. When the two results differ, the base with a mutation frequency of 99% or more is not simply taken as the correct base. Instead, several conditions are analyzed together, including whether the two base types in the overlapping region are the same, the estimated base quality scores of the two sequencing results, and whether a main allele exists at the base position, and the correct base type is confirmed by combining these factors.
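The four rules of step S14 can be sketched as a single function. This is an illustrative sketch only: the function name `correct_base`, its parameters, and the threshold value used in the example are hypothetical. The text does not state what happens when exactly one quality score exceeds the threshold, or when both exceed it and are equal; the sketch returns `None` in the first case and, as an arbitrary choice, keeps the first base in the second.

```python
from typing import Optional


def correct_base(base1: str, q1: int, base2: str, q2: int,
                 q_threshold: int, main_allele: Optional[str],
                 reference_base: str) -> Optional[str]:
    """Pick the correct base type at one position of the overlap region.

    base1/q1: base type and estimated quality score from the first read.
    base2/q2: base type and estimated quality score from the second read.
    main_allele: main allele at the reference position, or None if absent.
    reference_base: base type of the reference genome at that position.
    """
    if base1 == base2:
        return base1  # both reads agree
    if q1 > q_threshold and q2 > q_threshold:
        # Both calls are above the threshold: keep the higher-scoring one
        # (ties resolved in favour of base1, an assumption of this sketch).
        return base1 if q1 >= q2 else base2
    if q1 <= q_threshold and q2 <= q_threshold:
        # Both calls are at or below the threshold: fall back to the main
        # allele if one exists, otherwise to the reference base.
        return main_allele if main_allele is not None else reference_base
    return None  # one score above, one below: not specified in the text


if __name__ == "__main__":
    print(correct_base("A", 35, "T", 30, 20, None, "G"))
    print(correct_base("A", 10, "T", 15, 20, "T", "G"))
```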

Referring to fig. 7, a flowchart of a sequencing data quality assessment method based on double-ended sequencing according to an embodiment of the application is provided. The sequencing data quality assessment method based on double-end sequencing is applied to a gene sequencer, and comprises the following steps of:

S15, acquiring base information of bases in the overlapping region from the double-end sequencing file for each sequencing fragment.

In this example, the base information includes, but is not limited to, one or more of the sequenced base type corresponding to the base position, the estimated base quality score corresponding to the sequenced base type, and the correct base type corresponding to the base position.

S16, evaluating the quality of sequencing data in the double-ended sequencing file based on the correct base type corresponding to the base in the overlapped region and the base information of the base in the overlapped region.

In this example, for the same base position in any one of the sequenced fragments, the correct base type corresponding to the base position is the true base type at that base position, the sequenced base type corresponding to the base position is the sequenced base type outputted by the sequencer, and the sequenced base type may not necessarily be the true base type, and false detection may occur. The fitted base quality score corresponding to the base is calculated based on the base information estimation of the base, and is more accurate than the estimated base quality score corresponding to the base.

In the above embodiment, the first sequence data and the second sequence data of the same sequencing fragment are obtained from the double-ended sequencing file, and the base information of the bases in the overlapping region is determined from them. The correct base type at each base position is determined based on this base information, and the quality of the sequencing data in the double-ended sequencing file is evaluated based on at least one of the correct base type corresponding to each base, the sequencing base type corresponding to the base position, and the fitted base quality score. Because the evaluation relies on at least one more accurate item of base information, whether the quality of the sequencing data meets the requirement can be evaluated accurately.

In some embodiments, assessing the quality of sequencing data in the double-ended sequencing file based on the correct base types corresponding to the bases in the overlap region and the base information of the bases in the overlap region comprises:

determining false detection data according to the correct base type and the sequencing base type corresponding to each base in the overlapping region, wherein the false detection data comprises each false detection type and the total number of false detections corresponding to each false detection type;

calculating the false detection proportion corresponding to each false detection type;

calculating the similarity between the false detection proportions of the double-end sequencing file and a preset average false detection proportion based on the false detection proportion corresponding to each false detection type;

If the similarity is within a preset similarity range, determining that the quality of the sequencing data in the double-end sequencing file meets the requirement;

if the similarity is not within the preset similarity range, determining that the quality of the sequencing data in the double-end sequencing file does not meet the requirement.

In this embodiment, a false detection type indicates that one correct base type is detected as another base type. FIG. 8 is a schematic diagram of the false detection types in an embodiment; as shown in FIG. 8, there are 12 false detection types in total. Under random false detection, the twelve false detection types should occur equally often, that is, each false detection type has the same probability of occurrence, 1/12 ≈ 0.08333. The closer the false detection proportion of each false detection type is to 0.08333, the smaller the preference of mismatched bases and the higher the quality of the sequencing data; conversely, the farther the false detection proportions are from 0.08333, the higher the preference of mismatched bases and the worse the quality of the sequencing data. FIG. 9 is a comparative schematic diagram of the preference of mismatched bases of two sequencers in an embodiment, in which red and cyan represent the two sequencers, the X-axis represents the twelve false detection types, and the Y-axis represents the percentage. It can be seen that the red sequencer has a higher proportion of A->C and T->G false detections (similarity of 0.732), while the cyan sequencer has a higher proportion of A->T false detections (similarity of 0.856), so the cyan sequencer has the smaller preference of mismatched bases.

Optionally, the false detection proportion corresponding to each false detection type is the ratio of the total number of false detections of that type to the total number of bases, and the formula for calculating the similarity is as follows:

Wherein R_i is the false detection proportion of the i-th false detection type, and similarity represents the similarity.

In the above embodiment, the false detection data is determined based on the correct base type and the sequencing base type corresponding to each base in the overlapping region of each sequencing fragment, and the similarity between the false detection proportions of the double-end sequencing file and the preset average false detection proportion is calculated, where the preset average false detection proportion represents the average probability of each false detection type under random false detection. The calculated similarity measures the preference of mismatched bases in the double-end sequencing file, so the quality of the sequencing data is evaluated according to the preference of mismatched bases and can be measured more accurately.
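The counting of false detection types and the comparison with the uniform proportion 1/12 can be sketched as follows. The similarity formula of the text is not reproduced in this section, so the sketch substitutes a stand-in measure, one minus the total variation distance from the uniform distribution; this choice, the normalization by the total number of false detections (so that the twelve proportions sum to 1), and all function names are assumptions of the sketch.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, Iterable, Tuple

BASES = "ACGT"
# Twelve ordered pairs "correct>sequenced" with correct != sequenced.
ERROR_TYPES = [f"{a}>{b}" for a, b in permutations(BASES, 2)]


def error_proportions(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """pairs: (correct_base, sequenced_base) for bases in overlap regions."""
    counts = Counter(f"{c}>{s}" for c, s in pairs if c != s)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no false detections observed")
    # Normalised by the number of false detections (sketch assumption).
    return {t: counts[t] / total for t in ERROR_TYPES}


def uniform_similarity(props: Dict[str, float]) -> float:
    """Stand-in similarity: 1 - total variation distance from uniform."""
    uniform = 1 / len(ERROR_TYPES)
    return 1.0 - 0.5 * sum(abs(p - uniform) for p in props.values())


if __name__ == "__main__":
    unbiased = [(t[0], t[2]) for t in ERROR_TYPES]  # one error of each type
    biased = [("A", "T")] * 32 + [("A", "C")] * 10 + [("G", "T")] * 2
    print(uniform_similarity(error_proportions(unbiased)))
    print(uniform_similarity(error_proportions(biased)))
```

Under this stand-in measure a value of 1 means no preference of mismatched bases, and smaller values mean stronger preference; the numeric values are not comparable with the similarities reported for FIG. 9.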

In some embodiments, the base information comprises at least the estimated base quality score of each base obtained from the double-ended sequencing file and candidate features associated with the estimated base quality scores. As shown in FIG. 10, which is a flowchart of a sequencing data quality assessment method based on double-ended sequencing in another embodiment of the application, S16 further comprises:

S161, determining, from the candidate features and based on the base information of the bases in the overlapping region, salient features whose degree of association with the estimated base quality score meets a preset condition;

In this embodiment, the candidate features are merely features related to the base quality score and do not necessarily have a major influence on it, so salient features need to be selected from the candidate features. The salient features are the features with a high degree of association with the base quality score. They are selected using the base information of the bases in the overlapping region of each sequencing fragment, i.e., they are determined from the actual sequencing results of the actual sample, so the selected salient features truly reflect the features affecting the base quality score. This reduces the influence of external factors, such as the sample and the sequencing platform, on the correction of the base quality score.

S162, grouping the base data in the double-end sequencing file based on the salient features, the correct base type corresponding to each base position and the estimated base quality scores to obtain a plurality of base data sets, acquiring the base information of the bases in each base data set, and calculating the original base quality score corresponding to each base data set based on that base information.

In this embodiment, since the salient features truly reflect the features affecting the base quality score, grouping the base data according to the salient features and the estimated base quality scores yields more accurate base data sets, and a more accurate original base quality score can then be calculated for each base data set from its base information.
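The grouping and original-score calculation of step S162 can be sketched as follows, using the standard Phred relation Q = -10 * log10(error rate). The record layout, the group key (estimated quality score, dinucleotide context, read order), the function name, and the pseudo-count that keeps error-free groups finite are all assumptions of this sketch rather than details given in the text.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple

Record = Tuple[Hashable, str, str]  # (group_key, sequenced_base, correct_base)


def raw_quality_by_group(records: Iterable[Record]) -> Dict[Hashable, float]:
    """Original (raw) base quality score of each base data set."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for key, sequenced, correct in records:
        totals[key] += 1
        errors[key] += sequenced != correct  # True counts as 1
    raw_q = {}
    for key, n in totals.items():
        # Pseudo-counts avoid log10(0) for error-free groups
        # (an assumption of this sketch).
        error_rate = (errors[key] + 1) / (n + 2)
        raw_q[key] = -10 * math.log10(error_rate)
    return raw_q


if __name__ == "__main__":
    key = (30, "AC", 1)  # hypothetical group: Q30, context AC, read 1
    data = [(key, "A", "A")] * 989 + [(key, "A", "T")] * 9
    print(raw_quality_by_group(data))
```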

S163, fitting to obtain a base quality score correction model based on the original base quality score corresponding to each base data set and the salient features in the base information of each base data set, calculating the fitted base quality score corresponding to each base data set based on the base quality score correction model, and determining the fitted base quality score corresponding to each base as the fitted base quality score of the base data set in which the base is located.

In this embodiment, after the base quality score correction model is obtained by fitting, the values of the salient features in the base information of each base data set are used as the independent variables of the model to obtain the fitted base quality score of each base data set. The fitted base quality score of every base in a base data set is the fitted base quality score of that base data set, and the estimated base quality score of each base is then replaced by its fitted base quality score. Because the base quality score correction model is fitted to the original base quality scores and the salient-feature values of all base data sets together, rather than to only part of the base data sets, the consistency of the grouping of continuous variables is taken into account, overfitting of the base quality score correction model is avoided, and the correction of the base quality scores is improved.

S164, evaluating the quality of sequencing data in the double-end sequencing file based on the fitted base quality score corresponding to each base data group.

In the above embodiment, the first sequence data and the second sequence data of the same sequencing fragment are obtained from the double-ended sequencing file. For each base position in the overlapping region, the first base type and the second base type, together with their respective estimated base quality scores, are determined from the first and second sequence data. The correct base type at the base position is then accurately identified from the comparison of the first base type with the second base type and from the comparison of the two estimated base quality scores with the preset quality score threshold. Based on the correct base type, the sequencing base type and the estimated base quality score corresponding to each base position, the fitted base quality score is obtained, and the quality of the sequencing data is accurately evaluated based on the fitted base quality score.

In some embodiments, the evaluating the quality of sequencing data in the double-ended sequencing file based on the fitted base quality score for each base data set comprises:

Forming fitting points from the estimated base quality score corresponding to each base data set and the fitted base quality score corresponding to each base data set, and fitting to obtain a target straight line;

calculating a weighted distance between the target straight line and a preset straight line;

If the weighted distance between the target straight line and the preset straight line is smaller than or equal to the preset weighted distance, determining that the quality of sequencing data in the double-end sequencing file meets the requirement;

if the weighted distance between the target straight line and the preset straight line is larger than the preset weighted distance, determining that the quality of the sequencing data in the double-end sequencing file does not meet the requirement.

In this embodiment, the accuracy and stability of the estimated base quality scores output by the sequencer can be measured by forming fitting points from the estimated base quality score and the fitted base quality score of each base data set, fitting them to obtain a target straight line, and calculating the weighted distance between the target straight line and the preset straight line. The preset straight line is the diagonal passing through the origin in the same coordinate system as the target straight line. The smaller the weighted distance between the target straight line and the preset straight line, the closer the fitted target straight line is to the diagonal and the higher the accuracy of the estimated base quality scores output by the sequencer; conversely, the farther the fitted target straight line is from the diagonal, the lower the accuracy of the estimated base quality scores output by the sequencer.

In an alternative implementation, the formula for calculating the weighted distance between the target line and the preset line is as follows:

Where D represents the weighted distance between the target line and the preset line, and A and B are the slope and intercept of the target line, respectively. Q is the set of estimated base quality score values in the double-ended sequencing file, N is the number of elements of Q, and R_n is the ratio of the number of bases with estimated base quality score value n to the total number of bases. For most sequencers, Q is the sequence of integers increasing from 1 to some positive integer. For the Illumina platform, Q = {11, 25, 37}, i.e., the total number of Q values is 3.
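The idea of a weighted distance between the target line and the diagonal can be sketched in code. The exact formula of the text is not reproduced in this section, so the sketch assumes a weighted root-mean-square deviation, D = sqrt(sum over n in Q of R_n * (A*n + B - n)^2); this form, the ordinary least-squares fit, and the function names are assumptions of the sketch and may differ from the formula of the method.

```python
import math
from typing import Dict, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares line y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def distance_to_diagonal(a: float, b: float,
                         share_by_q: Dict[int, float]) -> float:
    """Assumed weighted distance between y = a*x + b and y = x.

    share_by_q maps each estimated quality score n to R_n, the share of
    bases reported with that score.
    """
    return math.sqrt(sum(r * (a * n + b - n) ** 2
                         for n, r in share_by_q.items()))


if __name__ == "__main__":
    estimated = [11, 25, 37]        # reported Q values
    fitted = [9.0, 23.0, 35.0]      # hypothetical fitted Q values
    shares = {11: 0.1, 25: 0.3, 37: 0.6}
    a, b = fit_line(estimated, fitted)
    print(a, b, distance_to_diagonal(a, b, shares))
```

Weighting by R_n means that deviations at quality scores carried by many bases contribute more to D than deviations at rarely reported scores.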

FIG. 11 is a schematic diagram, in an embodiment, of the target straight lines fitted under different sequencing platforms. The lines of different colors are the fitted straight lines of two different sequencing platforms, and the red dotted line is the diagonal. The closer the fitted target straight line is to the diagonal, the smaller the distance D and the higher the accuracy of the Q value reported by the sequencer (i.e., the estimated base quality score). The calculated distance D is 0.94 for the red straight line and 1.30 for the blue straight line, which indicates that the Q value reported by the red sequencer platform is more accurate.

In the above embodiment, the first sequence data and the second sequence data of the same sequencing fragment are obtained from the double-ended sequencing file, and the base information of the bases in the overlapping region is determined from them. The salient features affecting the base quality score are determined based on this base information, and grouping the bases based on the salient features yields more accurate groups. The original base quality score corresponding to each group is then calculated from the base information of that group, so that a more accurate fitting result is obtained in the subsequent fitting step and the correction of the base quality scores is improved.

In some embodiments, the evaluating the quality of sequencing data in the double-ended sequencing file based on the fitted base quality score for each base data set further comprises:

Acquiring the total number of bases corresponding to each estimated base quality score, and determining target estimated base quality scores;

acquiring the base sequence data corresponding to each target estimated base quality score, wherein the total number of bases of each target estimated base quality score ranks within a preset number of top positions;

for the base sequence data corresponding to each target estimated base quality score, acquiring all sequencing cycle numbers under that target estimated base quality score and the fitted base quality score corresponding to each sequencing cycle number, forming fitting points from the sequencing cycle numbers and their corresponding fitted base quality scores, and fitting to obtain a fitted straight line corresponding to each target estimated base quality score;

Calculating the weighted distance between the fitted straight line corresponding to each target estimated base quality score and a preset horizontal line to obtain the weighted distance corresponding to each target estimated base quality score;

If the weighted distance corresponding to each target estimated base quality score is smaller than the preset distance, determining that the quality of the sequencing data in the double-ended sequencing file meets the requirement;

if the weighted distance corresponding to a target estimated base quality score is greater than or equal to the preset distance, determining that the quality of the sequencing data in the double-ended sequencing file does not meet the requirement.

In this embodiment, a target estimated base quality score is an estimated base quality score that accounts for a relatively large share of the bases in the sequencing result; subsequent calculations based on the target estimated base quality scores are therefore more likely to reflect the overall quality of the sequencing data. The target estimated base quality scores are then used as the classification criterion in the double-ended sequencing file to obtain the base sequence data corresponding to each target estimated base quality score, and from it the fitted straight line corresponding to each target estimated base quality score. The closer the fitted straight line corresponding to each target estimated base quality score is to the preset horizontal line, the more stable the estimated base quality scores in the double-ended sequencing file; conversely, the farther the fitted straight line is from the preset horizontal line, the less stable the estimated base quality scores in the double-ended sequencing file.
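Selecting the target estimated base quality scores amounts to ranking the reported scores by the number of bases that carry them and keeping the top few. A minimal sketch follows; the function name and the parameter `top_k` (standing in for the preset number of top positions) are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def target_quality_scores(estimated_qs: Iterable[int], top_k: int) -> List[int]:
    """Return the top_k estimated quality scores with the most bases.

    estimated_qs holds one estimated base quality score per base.
    """
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    counts = Counter(estimated_qs)
    # most_common sorts by descending count.
    return [q for q, _ in counts.most_common(top_k)]


if __name__ == "__main__":
    per_base_q = [37] * 5 + [25] * 3 + [11] * 1
    print(target_quality_scores(per_base_q, top_k=2))
```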

In an alternative implementation, the formula for calculating the weighted distance between the fitted straight line corresponding to each target estimated base quality score and the preset horizontal line is as follows:

D is the weighted distance between the fitted straight line corresponding to the target estimated base quality score and the preset horizontal line; Q_out is a given reported Q value, namely an estimated base quality score; C is the set of all sequencing cycle numbers under Q_out, most commonly the sequence of positive integers from 1 to 150; N is the number of elements in C; R_n is the proportion of bases with sequencing cycle number n relative to the total number of bases; and Q_n is the fitted base quality score corresponding to sequencing cycle number n.

FIG. 12 is a schematic diagram of the fitted straight lines under different sequencing platforms in one embodiment; the left and right graphs show the stability of the base quality scores output by two different sequencer platforms. In the figure, the X axis is the sequencing cycle number (cycle), the Y axis is the fitted base quality score, the curves of different colors are the fitted curves for different reported Q values, and the red dotted line is a horizontal line. The closer a fitted curve is to the horizontal line, the less the error between the reported Q value and the real Q value of the sequencer depends on the sequencing cycle number, and the higher the stability. It can be seen that the reported Q values of the sequencer platform in the left graph are more stable, whereas the Q values of the sequencer in the right graph drop rapidly as the cycle number increases, so the error grows.

In the above embodiment, the first sequence data and the second sequence data of the same sequencing fragment are obtained from the double-ended sequencing file, and the base information of the bases in the overlapping region is determined from them. The significant features affecting the base quality score are determined based on this base information, so that the bases can be grouped more accurately, and the original base quality score of each group can be calculated more accurately from the base information of that group. The subsequent fitting step therefore yields more accurate results, which improves the correction of the base quality score. After the more accurate base quality scores are obtained, the estimated base quality scores that account for the largest shares of bases in the double-ended sequencing file are selected as target estimated base quality scores. For each target estimated base quality score, fitting points are formed from the cyclic sequencing numbers and the fitted base quality scores corresponding to those cyclic sequencing numbers, and a fitted straight line is obtained. The weighted distance between each fitted straight line and a preset horizontal line is then calculated, and the stability of the base quality scores output by the sequencer is evaluated through these weighted distances. The sequencing data quality is thus analyzed quantitatively through the stability of the base quality scores and measured more accurately.

In some embodiments, the determining, based on the base information of the bases in the overlapping region, a salient feature from the candidate features that satisfies a preset condition in association with the estimated base quality score includes:

Respectively taking each candidate feature as a feature to be checked, acquiring target features corresponding to the feature to be checked, traversing the bases in the overlapping areas of all sequencing fragments, and counting the traversed bases based on the target features corresponding to the feature to be checked to obtain effective information corresponding to the feature to be checked, wherein the effective information comprises a plurality of pieces of data, and each piece of data comprises the target features;

Grouping a plurality of pieces of data in the effective information corresponding to the feature to be checked by taking the feature to be checked and the estimated base quality score as grouping standards, and calculating to obtain the original base quality score and the error rate of each group in the effective information corresponding to the feature to be checked;

Calculating a significance value between the feature to be checked and the estimated base quality score by using a statistical analysis method based on the target feature, the original base quality score and the error rate of each group in the effective information corresponding to the feature to be checked;

And determining the candidate features with the significance values lower than the preset significance values as significance features.

In this embodiment, the significance value indicates the degree of association between a candidate feature and the base quality score: the lower the significance value, the higher the degree of association, and the higher the significance value, the lower the degree of association. For example, the preset significance value is set to 0.05. Fig. 13 is a schematic diagram of the significance values corresponding to the candidate features in one embodiment, in which the reported Q value is the estimated base quality score. It can be seen from the figure that the significance values corresponding to features 2, 3, 5, 6, 9, 10 and 11 are all less than 0.05, so features 2, 3, 5, 6, 9, 10 and 11 are significant features.
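The screening rule can be illustrated with a short sketch. The significance values below are invented purely so that the selection matches the features named for fig. 13; they are not data from the embodiment, and the function name is hypothetical.

```python
def select_significant_features(significance_values, threshold=0.05):
    """Keep candidate features whose significance value is below the threshold."""
    return sorted(
        feature
        for feature, value in significance_values.items()
        if value < threshold
    )


# Hypothetical significance values keyed by candidate feature number.
significance_values = {
    1: 0.40, 2: 0.001, 3: 0.020, 4: 0.30, 5: 0.010, 6: 0.003,
    7: 0.12, 8: 0.25, 9: 0.004, 10: 0.030, 11: 0.002,
}
significant = select_significant_features(significance_values)
```

Note that the comparison is strict, matching the wording "lower than the preset significance value".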

Optionally, the significant features comprise at least one of: the sequencing cycle number at which the base is located; the content ratio of G bases and C bases in the sequencing fragment; the estimated base quality score of the base; the estimated base quality scores 2 bp upstream, 1 bp upstream, 2 bp downstream and 1 bp downstream of the base; the base types 2 bp upstream, 1 bp upstream, 2 bp downstream and 1 bp downstream of the base; and the correct base type at the base position where the base is located. Based on the base information of the bases in the overlapping regions of the sequencing fragments in the double-ended sequencing file, suitable significant features can be determined according to the differences between samples.

In the correction process, a fixed set of base features is not used. Instead, a plurality of candidate features are collected, the significance value of each candidate feature is calculated by a statistical method based on the base information of the bases in the overlapping regions of the sequencing fragments, and the significant features are screened out. Grouping by the significant features and the estimated base quality score then yields more accurate groups, and the original base quality score of each group can be calculated from the base information of that group. The subsequent fitting step therefore produces more accurate results, which improves the correction of the base quality score.

In this embodiment, the target features may be the same or different for each feature to be checked, and may be some or all of the candidate features; the significance value between the feature to be checked and the base quality score is calculated from the data of the target features. Take the estimated base quality score 1 bp upstream of the base as the feature to be checked. The correct base type of the base, the estimated base quality score 1 bp upstream of the base, the estimated base quality score of the base, the base type and the number of bases are selected as target features, where the base type is the base result output by the sequencer. Using the correct base type of the base, the estimated base quality score 1 bp upstream of the base, the estimated base quality score of the base and the base type as the statistical rule, the bases in the overlapping regions of all sequencing fragments are traversed, bases with identical values for these four fields are merged, and the number of bases in each case is counted, giving the table shown in fig. 14, which is a schematic diagram of the preliminarily counted effective information in one embodiment. The pieces of data in fig. 14 are then grouped by the estimated base quality score and the estimated base quality score 1 bp upstream of the base, and the original base quality score and error rate of each group are calculated, giving the data shown in fig. 15, on which the statistical analysis is then performed.
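The counting and grouping described in this example can be sketched as below. The dictionary keys (`correct`, `q_up1`, `q`, `called`) and the function names are illustrative assumptions; the embodiment does not specify a data layout.

```python
from collections import Counter, defaultdict


def tally_bases(bases):
    """Merge bases with identical (correct type, upstream Q, Q, called type)
    and count how many bases fall into each combination."""
    return Counter(
        (b["correct"], b["q_up1"], b["q"], b["called"]) for b in bases
    )


def group_counts(tally):
    """Group tallied rows by (Q, upstream Q).

    Returns {(q, q_up1): (erroneous_bases, total_bases)}, where a base is
    erroneous when the called type differs from the correct type.
    """
    groups = defaultdict(lambda: [0, 0])
    for (correct, q_up1, q, called), n in tally.items():
        group = groups[(q, q_up1)]
        if called != correct:
            group[0] += n
        group[1] += n
    return {key: tuple(value) for key, value in groups.items()}


# Hypothetical bases taken from overlapping regions.
bases = [
    {"correct": "A", "called": "A", "q": 30, "q_up1": 30},
    {"correct": "A", "called": "A", "q": 30, "q_up1": 30},
    {"correct": "A", "called": "C", "q": 30, "q_up1": 30},
    {"correct": "G", "called": "G", "q": 20, "q_up1": 30},
]
tally = tally_bases(bases)
groups = group_counts(tally)
```

Here `tally` plays the role of the preliminary table (one row per combination with its count), and `groups` holds the per-group counts from which the error rate and original base quality score are derived.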

Optionally, calculating the original base quality score and error rate of each group in the effective information corresponding to the feature to be checked includes: counting the number of erroneous bases and the total number of bases in each group; calculating the error rate of each group from the number of erroneous bases and the total number of bases in that group; and calculating the original base quality score of each group from its error rate:

Raw base quality score = -10 × log₁₀(error rate).

The significance value between the feature to be checked and the base quality score is then calculated using the Mann-Whitney U test if the feature to be checked is a two-level variable, and using analysis of variance if the feature to be checked is a variable with three or more levels.
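A minimal sketch of these two steps follows. The function names are hypothetical, and returning `None` for a group with no erroneous bases is an assumption, since the text does not say how a zero error rate is handled; the test-selection helper only names the test and does not run it.

```python
import math


def raw_quality_score(error_count, total_count):
    """Raw base quality score = -10 * log10(error rate) for one group."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    if error_count == 0:
        # log10(0) is undefined; how to treat this case is an assumption here.
        return None
    error_rate = error_count / total_count
    return -10 * math.log10(error_rate)


def choose_test(feature_values):
    """Pick the statistical test by the number of distinct feature levels."""
    levels = len(set(feature_values))
    if levels < 2:
        raise ValueError("a feature needs at least two levels to be tested")
    return "mann-whitney-u" if levels == 2 else "anova"


score = raw_quality_score(1, 1000)             # error rate 0.001
test_name = choose_test(["A", "C", "G", "T"])  # four levels
```

For example, a group with 1 erroneous base out of 1000 has an error rate of 0.001 and therefore a raw base quality score of about 30.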

In the above embodiment, the correction process does not use a fixed set of base features. Instead, a plurality of candidate features are collected, the significance value of each candidate feature is calculated by a statistical method based on the base information of the bases in the overlapping regions of the sequencing fragments, and the significant features are screened out. Base data sets are then obtained by grouping according to the significant features and the estimated base quality scores, and the original base quality score corresponding to each base data set is calculated from the base information of that set. The subsequent fitting step therefore yields a more accurate result, which improves the correction of the base quality score.

In some embodiments, the fitting to obtain the base quality score correction model based on the original base quality score corresponding to each base data set and the data of the significance feature in the base information of each base data set includes:

Taking each base data set as one sample point to form a plurality of sample points, taking the significant features as independent variables and the original base quality score as the dependent variable, and fitting a hierarchical linear model to the plurality of sample points to obtain the base quality score correction model.

In this embodiment, after the significant features are obtained, the base data in the double-ended sequencing file may be grouped by the significant features and the estimated base quality score to obtain a plurality of base data sets. For each base data set, the number of erroneous bases and the total number of bases are counted according to the above method, the error rate is calculated, and the original base quality score corresponding to the base data set is calculated from the error rate. Each base data set can thus be understood as one sample point, and the plurality of base data sets form a plurality of sample points. With the significant features as independent variables and the original base quality score as the dependent variable, the base quality score correction model, which is a hierarchical linear model, is obtained by fitting to the plurality of sample points. For example, if there are 11 significant features, these 11 significant features are the independent variables. The sample points are formed from all base data sets obtained from the double-ended sequencing file rather than from local data; that is, the data of all sets are used together, the model is not fitted to only part of the base data sets, and the consistency among base data sets along continuous variables is taken into account. This avoids overfitting of the base quality score correction model and improves the correction of the base quality scores.

In the above embodiment, the base quality score correction model is obtained by fitting to the original base quality score corresponding to each base data set and the data of the significant features in the base information of each base data set. The data of all base data sets are used together rather than fitting the model to only part of them, and the consistency among base data sets along continuous variables is taken into account, which avoids overfitting of the base quality score correction model and improves the correction of the base quality scores.

In another aspect of the application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the double-ended sequencing-based base class identification method according to any of the embodiments of the application. In the computer program product, an alternative implementation form of the program module architecture of the computer program for realizing the steps of the method can be a base class identification device based on double-ended sequencing. Referring to fig. 17, an embodiment of the present application provides a base class identification device based on double-ended sequencing, which includes: an obtaining module 1601 configured to obtain a double-ended sequencing file and to obtain, from the double-ended sequencing file, first sequence data and second sequence data corresponding to each sequencing fragment, where the first sequence data is base sequence data obtained by sequencing from a first end to a second end of the sequencing fragment, and the second sequence data is base sequence data obtained by sequencing from the second end to the first end of the sequencing fragment; and a determining module 1602 configured to determine an overlapping region of each sequencing fragment based on the first sequence data and the second sequence data corresponding to each sequencing fragment. The obtaining module 1601 is further configured to obtain, for each base position in the overlapping region, a first base class corresponding to each base position and an estimated base quality score corresponding to the first base class from the first sequence data, and a second base class corresponding to each base position and an estimated base quality score corresponding to the second base class from the second sequence data. The determining module 1602 is further configured to determine, for each base position, the correct base class corresponding to the base position based on a comparison result of the first base class and the second base class corresponding to the base position, and a comparison result of the estimated base quality score corresponding to the first base class and the estimated base quality score corresponding to the second base class with a preset quality score threshold.

Optionally, the determining module 1602 is further configured to:

In another aspect of the application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the sequencing data quality assessment method based on double-ended sequencing according to any of the embodiments of the application. In the computer program product, an alternative implementation form of the program module architecture of the computer program for realizing the steps of the method can be a sequencing data quality assessment device based on double-ended sequencing. Referring to fig. 18, an embodiment of the present application provides a sequencing data quality assessment device based on double-ended sequencing, which includes: an acquisition module 1601 configured to acquire a double-ended sequencing file and to acquire first sequence data and second sequence data corresponding to each sequencing fragment from the double-ended sequencing file, where the first sequence data is base sequence data obtained by sequencing from a first end to a second end of the sequencing fragment, and the second sequence data is base sequence data obtained by sequencing from the second end to the first end of the sequencing fragment; and a determining module 1602 configured to determine an overlapping region of each sequencing fragment based on the first sequence data and the second sequence data corresponding to each sequencing fragment. The acquisition module 1601 is further configured to obtain, for each base position in the overlapping region, a first base type corresponding to each base position and an estimated base quality score corresponding to the first base type from the first sequence data, and a second base type corresponding to each base position and an estimated base quality score corresponding to the second base type from the second sequence data. The determining module 1602 is further configured to determine, for each base position, the correct base type corresponding to the base position based on a comparison result of the first base type and the second base type corresponding to the base position, and a comparison result of the estimated base quality score corresponding to the first base type and the estimated base quality score corresponding to the second base type with a preset quality score threshold. The acquisition module 1601 is further configured to obtain, for each sequencing fragment, base information of the bases in the overlapping region from the double-ended sequencing file, and the device is further configured to assess the quality of the sequencing data in the double-ended sequencing file based on the correct base types corresponding to the bases in the overlapping region and the base information of the bases in the overlapping region.

It will be appreciated by those skilled in the art that the structure of the double-ended sequencing-based base class identification device in fig. 17 does not constitute a limitation on the double-ended sequencing-based base class identification device, and the structure of the double-ended sequencing-based sequencing data quality assessment device in fig. 18 does not constitute a limitation on the double-ended sequencing-based sequencing data quality assessment device, and the respective modules may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or independent of a controller in a computer device, or may be stored in software in a memory in the computer device, so that the controller may call and execute operations corresponding to the above modules. In other embodiments, more or fewer modules than shown may be included.

Referring to fig. 18, in another aspect of the embodiment of the present application, there is further provided a computer device 200, including a memory 3011 and a processor 3012, where the memory 3011 stores a computer program, and the computer program when executed by the processor causes the processor 3012 to perform the steps of the base class identification method based on double-ended sequencing and/or the steps of the sequencing data quality assessment method based on double-ended sequencing provided in any of the embodiments of the present application. Computer device 200 may include a gene sequencer, a computing device (e.g., a desktop computer, a laptop computer, a tablet computer, a handheld computer, a smart speaker, a server, etc.), a mobile phone (e.g., a smart phone, a wireless phone, etc.), a wearable device (e.g., a pair of smart glasses or a smart watch), or the like.

The processor 3012 is the control center of the computer device. It connects the various portions of the overall computer device through various interfaces and lines, and performs the various functions of the computer device and processes data by running or executing the software programs and/or modules stored in the memory 3011 and invoking the data stored in the memory 3011. Optionally, the processor 3012 may include one or more processing cores. Preferably, the processor 3012 may integrate an application processor and a modem processor, where the application processor primarily handles the operating system, user interfaces, application programs and the like, and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor may also not be integrated into the processor 3012.

The memory 3011 may be used to store software programs and modules, and the processor 3012 executes various functional applications and data processing by executing the software programs and modules stored in the memory 3011. The memory 3011 may mainly include a storage program area that may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like, and a storage data area that may store data created according to the use of the computer device, and the like. In addition, memory 3011 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 3011 may also include a memory controller to provide access to the memory 3011 by the processor 3012.

In another aspect of the embodiments of the present application, there is further provided a storage medium storing a computer program, where the computer program when executed by a processor causes the processor to perform the steps of the base class identification method based on double-ended sequencing and/or the steps of the sequencing data quality assessment method based on double-ended sequencing provided in any of the embodiments of the present application.

Those skilled in the art will appreciate that all or part of the processes of the methods provided in the above embodiments may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and that the program, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. The scope of the invention is to be determined by the appended claims.

Claims

1. A method for identifying base types based on double-end sequencing, comprising:

Obtaining a double-end sequencing file, and obtaining first sequence data and second sequence data corresponding to each sequencing fragment from the double-end sequencing file; wherein the first sequence data is base sequence data obtained by sequencing from the first end to the second end of the sequencing fragment, and the second sequence data is base sequence data obtained by sequencing from the second end to the first end of the sequencing fragment;

Determining the overlapping region of each sequencing fragment based on the first sequence data and the second sequence data corresponding to each sequencing fragment;

For each base position in the overlapping region, obtaining from the first sequence data a first base type corresponding to each base position and an estimated base quality score corresponding to the first base type, and obtaining from the second sequence data a second base type corresponding to each base position and an estimated base quality score corresponding to the second base type;

For each of the base positions, the correct base type corresponding to the base position is determined based on a comparison result of the first base type and the second base type corresponding to the base position, and a comparison result of the estimated base quality score corresponding to the first base type and the estimated base quality score corresponding to the second base type with a preset quality score threshold.

2. The method for identifying base types based on double-end sequencing according to claim 1, characterized in that, for each of the base positions, determining the correct base type corresponding to the base position based on a comparison result of the first base type and the second base type corresponding to the base position, and a comparison result of the estimated base quality score corresponding to the first base type and the estimated base quality score corresponding to the second base type with a preset quality score threshold, comprises:

For each base position in the overlapping region of each sequencing fragment, if the first base type corresponding to the base position in the first sequence data is the same as the second base type corresponding to the base position in the second sequence data, then the correct base type corresponding to the base position is the first base type or the second base type;

If the first base type is different from the second base type, and the obtained estimated base quality score corresponding to the first base type and the obtained estimated base quality score corresponding to the second base type are both greater than the preset quality score threshold, the base type with the higher estimated base quality score is used as the correct base type corresponding to the base position;

If the first base type is different from the second base type, the obtained estimated base quality score corresponding to the first base type and the obtained estimated base quality score corresponding to the second base type are both less than or equal to a preset quality score threshold, and when a major allele base exists at a reference position corresponding to the base position in the reference genome, the major allele base is used as the correct base type corresponding to the base position;

If the first base type is different from the second base type, the estimated base quality score corresponding to the first base type and the estimated base quality score corresponding to the second base type obtained are both less than or equal to a preset quality score threshold, and when there is no major allele base at the reference position corresponding to the base position in the reference genome, the base type at the reference position is used as the correct base type corresponding to the base position.

3. A method for evaluating sequencing data quality based on paired-end sequencing, characterized in that:

Based on the base type identification method based on double-end sequencing according to claim 1 or 2, determining the correct base type corresponding to each base position in the overlapping region of each sequencing fragment;

For each sequencing fragment, obtaining base information of the bases in the overlapping region from the double-end sequencing file;

The sequencing data quality in the double-end sequencing file is evaluated based on the correct base types corresponding to the bases in the overlapping region and the base information of the bases in the overlapping region.

4. The method for evaluating sequencing data quality based on double-end sequencing according to claim 3, characterized in that the evaluating the sequencing data quality in the double-end sequencing file based on the correct base types corresponding to the bases in the overlapping region and the base information of the bases in the overlapping region comprises:

Determining error detection data according to the correct base type and the corresponding sequenced base type of each base in the overlapping region, wherein the error detection data includes each error detection type and the total number of error detections corresponding to each error detection type;

Calculating the error detection ratio corresponding to each error detection type;

Based on the error detection ratio corresponding to each error detection type, calculating the similarity between the error detection ratios of the double-end sequencing file and the preset average error detection ratios;

If the similarity is within the preset similarity range, it is determined that the quality of the sequencing data in the double-end sequencing file meets the requirements;

If the similarity is not within the preset similarity range, it is determined that the quality of the sequencing data in the double-end sequencing file does not meet the requirements.

5. The method for evaluating sequencing data quality based on paired-end sequencing according to claim 3, wherein the base information comprises at least an estimated base quality score of each base obtained from the paired-end sequencing file and a candidate feature associated with the estimated base quality score; and evaluating the sequencing data quality in the paired-end sequencing file based on the correct base type corresponding to the base in the overlapping region and the base information of the base in the overlapping region further comprises:

Based on the base information of the bases in the overlapping region, determining, from the candidate features, a significant feature whose correlation with the estimated base quality score meets a preset condition;

Based on the significant features, the correct base types corresponding to the base positions and the estimated base quality scores, the base data in the double-end sequencing file are grouped to obtain a plurality of base data groups, base information of the bases in each base data group is obtained, and the original base quality score corresponding to each base data group is calculated based on the base information of the bases in each base data group;

Based on the original base quality score corresponding to each base data group and the significant features in the base information of each base data group, a base quality score correction model is fitted, and based on the base quality score correction model, a fitted base quality score corresponding to each base data group is calculated, and the fitted base quality score corresponding to each base is determined as the fitted base quality score corresponding to the base data group where the base is located;

The quality of the sequencing data in the double-end sequencing file is evaluated based on the fitted base quality score corresponding to each base data group.
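
The grouping and per-group scoring steps of claim 5 can be sketched as follows. This is a minimal illustration, not the claimed implementation. It assumes that the original base quality score of a group is the Phred-scaled empirical error rate, Q = -10 * log10(p), where an error is a called base that differs from the correct base type determined in the overlapping region. The record layout, the function name `empirical_quality_by_group` and the add-one smoothing are hypothetical choices made for the example.

```python
import math
from collections import defaultdict


def empirical_quality_by_group(records):
    """Group bases and compute an empirical quality score per group.

    records: iterable of (estimated_q, feature_value, called_base, correct_base).
    Returns {(estimated_q, feature_value): (empirical_q, error_rate)}.
    The tuple layout is an assumption made for this sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # group key -> [errors, total]
    for estimated_q, feature_value, called_base, correct_base in records:
        key = (estimated_q, feature_value)
        counts[key][1] += 1
        if called_base != correct_base:
            counts[key][0] += 1

    result = {}
    for key, (errors, total) in counts.items():
        # Add-one smoothing (an assumption) avoids log10(0) for error-free groups.
        error_rate = (errors + 1) / (total + 2)
        result[key] = (-10.0 * math.log10(error_rate), error_rate)
    return result
```

A correction model could then be fitted to these per-group scores using the significant features as predictors; the form of that model is not specified here and is left open in the sketch.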

6. The method for evaluating sequencing data quality based on paired-end sequencing according to claim 5, wherein evaluating the sequencing data quality in the paired-end sequencing file based on the fitted base quality score corresponding to each base data group comprises:

Forming a fitting point from the estimated base quality score corresponding to each base data group and the corresponding fitted base quality score, and fitting the fitting points to obtain a target straight line;

Calculating the weighted distance between the target straight line and a preset straight line;

If the weighted distance between the target straight line and the preset straight line is less than or equal to the preset weighted distance, it is determined that the quality of the sequencing data in the double-end sequencing file meets the requirements;

If the weighted distance between the target straight line and the preset straight line is greater than the preset weighted distance, it is determined that the quality of the sequencing data in the double-end sequencing file does not meet the requirements.
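
The line-fitting check of claim 6 can be sketched as follows. The claim does not define the weighted distance, so this example makes several assumptions: the target line is fitted by weighted least squares with group sizes as weights, the preset straight line is the identity line y = x (fitted score equals estimated score), and the weighted distance is the weighted mean vertical gap between the two lines at the observed estimated scores. All function names are hypothetical.

```python
def fit_weighted_line(xs, ys, ws):
    """Weighted least-squares fit; returns (slope, intercept)."""
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    return slope, mean_y - slope * mean_x


def weighted_distance(line, preset, xs, ws):
    """Weighted mean vertical gap between two lines at the points xs."""
    (a, b), (c, d) = line, preset
    gaps = (w * abs((a * x + b) - (c * x + d)) for w, x in zip(ws, xs))
    return sum(gaps) / sum(ws)


def meets_requirement(estimated_qs, fitted_qs, group_sizes, max_distance,
                      preset=(1.0, 0.0)):
    """True when the weighted distance is at most max_distance (claim 6)."""
    line = fit_weighted_line(estimated_qs, fitted_qs, group_sizes)
    return weighted_distance(line, preset, estimated_qs, group_sizes) <= max_distance
```

Weighting by group size keeps sparsely populated quality scores from dominating the decision; a different weighting scheme or a different preset line would fit the claim wording equally well.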

7. The method for evaluating sequencing data quality based on paired-end sequencing according to claim 5, wherein the step of evaluating the sequencing data quality in the paired-end sequencing file based on the fitted base quality score corresponding to each base data group further comprises:

Obtaining the total number of bases corresponding to each estimated base quality score, and determining target estimated base quality scores, wherein the target estimated base quality scores are the estimated base quality scores whose total numbers of bases rank within the top first preset number;

Obtaining base sequence data corresponding to each target estimated base quality score;

For the base sequence data corresponding to each target estimated base quality score, all cycle sequencing numbers and the fitted base quality score corresponding to each cycle sequencing number under each target estimated base quality score are obtained, a fitting point is formed by the cycle sequencing number and the fitted base quality score corresponding to the cycle sequencing number, and a fitting straight line corresponding to each target estimated base quality score is obtained by fitting;

Calculating the weighted distance between the fitting straight line corresponding to each target estimated base quality score and the preset horizontal line to obtain the weighted distance corresponding to each target estimated base quality score;

If the weighted distance corresponding to every target estimated base quality score is less than the preset distance, it is determined that the quality of the sequencing data in the double-end sequencing file meets the requirements;

If the weighted distance corresponding to any target estimated base quality score is greater than or equal to the preset distance, it is determined that the quality of the sequencing data in the double-end sequencing file does not meet the requirements.
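
The per-cycle check of claim 7 can be sketched as follows. The example assumes that the target estimated base quality scores are the most frequent scores, that the preset horizontal line for a target score sits at that score's own value, and that the weighted distance is the weighted mean vertical gap between the fitted line and the horizontal line over the observed cycles. These choices and all names are hypothetical, since the claim leaves them open.

```python
from collections import Counter


def top_quality_scores(estimated_qs, n):
    """Return the n estimated quality scores that cover the most bases."""
    return [q for q, _ in Counter(estimated_qs).most_common(n)]


def fit_line(points):
    """Ordinary least-squares fit over (cycle, fitted_q) points."""
    count = len(points)
    mean_x = sum(x for x, _ in points) / count
    mean_y = sum(y for _, y in points) / count
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx  # requires at least two distinct cycles
    return slope, mean_y - slope * mean_x


def distance_to_horizontal(points, weights, level):
    """Weighted mean gap between the fitted line and the line y = level."""
    slope, intercept = fit_line(points)
    gaps = (w * abs(slope * x + intercept - level)
            for (x, _), w in zip(points, weights))
    return sum(gaps) / sum(weights)


def cycles_are_stable(points_by_score, weights_by_score, max_distance):
    """True only if every target score stays below max_distance (claim 7)."""
    return all(
        distance_to_horizontal(points, weights_by_score[q], q) < max_distance
        for q, points in points_by_score.items()
    )
```

A fitted line that drifts away from the horizontal line indicates that the fitted quality of a nominally constant score changes with the sequencing cycle, which is the condition this check is meant to flag.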

8. The method for evaluating sequencing data quality based on paired-end sequencing according to claim 5, wherein the determining, based on the base information of the bases in the overlapping region, from the candidate features, a significant feature whose correlation with the estimated base quality score satisfies a preset condition comprises:

Taking each candidate feature as a feature to be checked, obtaining a target feature corresponding to the feature to be checked, traversing the bases in the overlapping regions of all sequencing fragments, and performing statistics on the traversed bases based on the target feature corresponding to the feature to be checked, to obtain valid information corresponding to the feature to be checked; wherein the valid information includes multiple pieces of data, each piece of data includes a target feature;

Taking the feature to be checked and the estimated base quality score as grouping criteria, grouping multiple pieces of data in the valid information corresponding to the feature to be checked, and calculating the original base quality score and error rate of each group in the valid information corresponding to the feature to be checked;

Based on the target feature, the original base quality score and the error rate of each group in the valid information corresponding to the feature to be checked, the significance value between the feature to be checked and the estimated base quality score is calculated by using a statistical analysis method;

The candidate features whose significance values are lower than the preset significance value are determined as significant features.
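
The feature-selection step of claim 8 can be sketched as follows. The claim names only "a statistical analysis method", so this example substitutes one specific technique as an assumption: a permutation test on the Pearson correlation between the value of a candidate feature and the gap between the original and the estimated base quality score of each group. The resulting p-value plays the role of the significance value. The function names, the default threshold of 0.05 and the number of shuffles are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(feature_values, quality_gaps, rounds=2000, seed=0):
    """Share of shuffles whose |correlation| reaches the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(feature_values, quality_gaps))
    shuffled = list(quality_gaps)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(shuffled)
        if abs(pearson(feature_values, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (rounds + 1)


def significant_features(candidates, alpha=0.05):
    """Keep candidate features whose significance value is below alpha.

    candidates: {feature_name: (feature_values, quality_gaps)} per group.
    """
    return [name for name, (values, gaps) in candidates.items()
            if permutation_p_value(values, gaps) < alpha]
```

A feature whose value has no bearing on the quality gap produces a large p-value and is dropped, while a feature that tracks the gap closely is retained for the grouping step.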

9. A computer program product, characterized in that it comprises a computer program, which, when executed by a processor, implements the base category identification method based on double-end sequencing as described in claim 1 or 2, or implements the sequencing data quality assessment method based on double-end sequencing as described in any one of claims 3 to 8.

10. A computer device, characterized in that it comprises a memory and a processor, the memory storing a computer program, and when the computer program is executed by the processor, the processor executes the base category identification method based on double-end sequencing as described in claim 1 or 2, or executes the sequencing data quality assessment method based on double-end sequencing as described in any one of claims 3 to 8.