[go: up one dir, main page]

CN119007834A - Base quality fraction correction method based on double-ended sequencing, program product, equipment and storage medium - Google Patents

Base quality fraction correction method based on double-ended sequencing, program product, equipment and storage medium Download PDF

Info

Publication number
CN119007834A
CN119007834A CN202411116286.XA CN202411116286A CN119007834A CN 119007834 A CN119007834 A CN 119007834A CN 202411116286 A CN202411116286 A CN 202411116286A CN 119007834 A CN119007834 A CN 119007834A
Authority
CN
China
Prior art keywords
base
quality score
sequencing
estimated
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411116286.XA
Other languages
Chinese (zh)
Inventor
姚天然
王谷丰
包原野
赵陆洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sailu Medical Technology Co ltd
Original Assignee
Shenzhen Sailu Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sailu Medical Technology Co ltd filed Critical Shenzhen Sailu Medical Technology Co ltd
Priority to CN202411116286.XA priority Critical patent/CN119007834A/en
Publication of CN119007834A publication Critical patent/CN119007834A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Theoretical Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Molecular Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Data Mining & Analysis (AREA)
  • Analytical Chemistry (AREA)
  • Immunology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biochemistry (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Microbiology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a base quality fraction correction method based on double-end sequencing, a program product, equipment and a storage medium, wherein the method comprises the following steps: obtaining a double-end sequencing file, and obtaining first sequence data and second sequence data corresponding to each sequencing fragment; acquiring base information of bases in an overlapping region of each sequencing fragment based on the first sequence data and the second sequence data, and determining a salient feature from the candidate features; based on the significance characteristics and the estimated base quality fractions, base data are subjected to base data sets to obtain a plurality of base data sets, and based on base information of bases in each base data set, the original base quality fraction corresponding to each base data set is calculated; based on the original base quality score and the significance characteristic corresponding to each base data set, a base quality score correction model is obtained through fitting, and based on the base quality score correction model, the fitted base quality score of each base in each base data set is obtained through calculation.

Description

Base quality fraction correction method based on double-ended sequencing, program product, equipment and storage medium
Technical Field
The invention relates to the technical field of genes, in particular to a base quality score correction method and a computer program product based on double-end sequencing, base quality score correction equipment based on double-end sequencing and a computer readable storage medium.
Background
In high throughput sequencing, the sequencer outputs, for each detected base (base call), a Quality value called the base Quality score (Quality score), also called the Q value (Q score), each base corresponding to a base Quality score, the base Quality score value representing an estimate of the error rate of the sequencer for the base identification. It is important that the sequencer outputs a base accurate mass fraction value because almost all downstream analyses for high throughput sequencing data rely on base mass fraction values. Algorithms including data quality control, sequence alignment, mutation detection (short indels, copy number, structural mutation), etc., are all based on numerical calculations of base quality scores. However, in reality, the output process of the base quality score value is that firstly the sequencer collects the optical signal or the electrical signal from the sensor to detect a certain base, and then the corresponding base quality score value is deduced through the empirical relationship between the signal intensity and the base quality score value. It can be seen that the sequencer is not able to directly calculate the error rate of base recognition, but can only estimate the base quality score value, i.e., the base quality score of each base can be read from the test file after sequencing is completed, as the estimated base quality score of each base. This results in the fact that in most cases the base quality score value output by the sequencer does not accurately reflect the recognition error rate, and therefore a base quality score correction (recaliberation) technique is required so that the base quality score value output by the sequencer reflects the true error rate.
The currently used base quality score correction technology adopts preset base characteristics as characteristics affecting the accuracy of the base quality score, then groups the bases in the sequencing file based on the preset base characteristics, calculates the original base quality score value of each group respectively, then fits the original base quality scores of a plurality of groups into new base quality scores, called fitting base quality scores, based on the preset base characteristics and the original base quality score values of each group by using a local weighted regression (locally weighted regression, lowess) model, but groups the bases in the sequencing file based on the preset base characteristics, and the correction effect of the base quality score is also affected due to the difference of samples, the difference of instruments of the sequencer, the influence of other factors in the testing process and the like.
Disclosure of Invention
In order to solve the existing technical problems, the embodiment of the invention provides a base quality score correction method computer program product based on double-end sequencing, base quality score correction equipment based on double-end sequencing and a computer readable storage medium, which can improve the correction effect of the base quality score of a base.
In a first aspect, there is provided a base quality score correction method based on double ended sequencing, comprising: obtaining a double-end sequencing file, and obtaining first sequence data and second sequence data corresponding to each sequencing fragment from the double-end sequencing file; wherein the first sequence data is base sequence data obtained by sequencing from a first end to a second end of the sequencing fragment, and the second sequence data is base sequence data obtained by sequencing from the second end to the first end of the sequencing fragment; acquiring base information of bases in an overlapping region of each sequencing fragment based on the first sequence data and the second sequence data corresponding to each sequencing fragment; the base information includes estimated base mass fractions obtained from the double ended sequencing file, and candidate features associated with base mass fractions; determining a salient feature with the association degree with the base quality fraction meeting a preset condition from the candidate features based on the base information of the bases in the overlapping region of each sequencing fragment; based on the significance characteristics and the estimated base quality fractions, base data groups are carried out on the base data in the double-end sequencing file, a plurality of base data groups are obtained, base information of bases in each base data group is obtained, and the original base quality fraction corresponding to each base data group is calculated and obtained based on the base information of the bases in each base data group; fitting to obtain a base quality score correction model based on the original base quality score corresponding to each base data set and the significance characteristics in the base information of each base data set, and calculating to obtain the fitted base quality score of each base in each base data set based on the base quality score correction model.
In a second aspect, a computer program product is provided, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the base quality score correction method based on double ended sequencing provided by an embodiment of the present application.
In a third aspect, there is provided a base quality score correction device based on double-ended sequencing, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the base quality score correction method based on double-ended sequencing provided by the embodiment of the application.
In a fourth aspect, a computer readable storage medium is provided, storing a computer program, which when executed by a processor, causes the processor to perform the steps of the base quality score correction method based on double-ended sequencing provided by the embodiment of the present application.
According to the embodiment of the application, the first sequence data and the second sequence data of the same sequencing fragment are obtained from the double-end sequencing file, the base information of the base in the overlapping area can be determined through the first sequence data and the second sequence data of the same sequencing fragment, the significance characteristic affecting the base quality score is determined based on the base information of the base in the overlapping area, the base data sets can be obtained by grouping based on the significance characteristic, then the original base quality score corresponding to each base data set can be obtained by calculating based on the base information of each base data set, so that in the subsequent fitting step, a more accurate fitting result can be obtained, the effect of correcting the base quality score is improved, in addition, in the fitting step, the data of all groups is used, and is not a model obtained by fitting partial base data sets, the consistency of continuous variable base data sets is comprehensively considered, so that the overfitting of the base quality score correction model is avoided, and the base quality score correction effect is improved.
Drawings
FIG. 1 is a schematic diagram of a genetic sequencer according to an embodiment;
FIG. 2 is a schematic diagram of correlation of report Q values and true Q values based on Salus Evo sequencers and llumina sequencers in one embodiment;
FIG. 3 is a schematic diagram of the sequencing of an insert in double-ended sequencing in one embodiment;
FIG. 4 is a graph showing the correlation of the nominal Q value to the logarithm of the error rate for different c-cycle numbers for the same duplex base AC in an exemplary embodiment of an Illumina sequencer;
FIG. 5 is a flow chart of a base quality score correction method based on double ended sequencing in one embodiment;
FIG. 6 is a graph showing the variation of the estimated base mass fraction versus the fitted base mass fraction after grouping based on the number of cycles in one embodiment;
FIG. 7 is a graph showing the variation of the estimated base mass fraction versus the fitted base mass fraction after grouping based on the GC content in the sequenced fragments in one embodiment;
FIG. 8 is a graph showing the variation of the relationship between the estimated base quality score and the fitted base quality score after grouping based on the estimated base quality score corresponding to the upstream 2bp in one embodiment;
FIG. 9 is a graph showing the variation of the relationship between the estimated base quality score and the fitted base quality score after grouping based on the estimated base quality score corresponding to 1bp upstream in one embodiment;
FIG. 10 is a graph showing the change in the relationship between the estimated base mass fraction and the fitted base mass fraction after grouping based on the base types in one embodiment;
FIG. 11 is a graph showing the change in the relationship between the estimated base quality score and the fitted base quality score after grouping based on the base class of 1bp upstream in one embodiment;
FIG. 12 is a graph showing the change in the relationship between the estimated base quality score and the fitted base quality score after grouping based on the base class of 2bp upstream in one embodiment;
FIG. 13 is a schematic diagram of saliency values corresponding to candidate features according to an embodiment;
FIG. 14 is a diagram of preliminary statistics of effective information in an embodiment;
FIG. 15 is a schematic diagram of information after grouping in one embodiment;
FIG. 16 is a graphical representation of raw base mass fraction for each group in one embodiment;
FIG. 17 is a schematic diagram of a base quality score correction device based on double-ended sequencing in one embodiment;
FIG. 18 is a schematic diagram showing the structure of a gene sequencer according to an embodiment.
Detailed Description
The technical scheme of the invention is further elaborated below by referring to the drawings in the specification and the specific embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the scope of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
In the following description, reference is made to the expression "some embodiments" which describe a subset of all possible embodiments, but it should be understood that "some embodiments" may be the same subset or a different subset of all possible embodiments and may be combined with each other without conflict.
Gene sequencing refers to analyzing the base sequence of DNA fragments of the data to be tested, i.e., the arrangement of adenine (A), thymine (T), cytosine (C) and guanine (G). At present, a fluorescent labeling method is commonly used for gene sequencing, a laser is used for exciting a fluorescent label on a sequencing chip by a gene sequencing optical system to generate fluorescence, fluorescence signals are collected, and four bases are combined with different fluorescent labels to generate four different fluorescence wave bands, so that bases are identified.
In the second generation sequencing technology, using an Illumina sequencer as an example, different fluorescent molecules with different fluorescence emission wavelengths can emit fluorescent signals with corresponding wavelengths when being irradiated by laser, and the fluorescent signals with specific wavelengths can be obtained by selectively filtering light rays with non-specific wavelengths through a filter after the laser irradiation, so that the base type can be identified by analyzing the fluorescent signals by obtaining the fluorescent signals. Mainly comprises sample preparation, cluster generation, sequencing and data analysis.
Sample preparation: the DNA sample to be sequenced is subjected to extraction and purification treatment, and then DNA fragmentation and aptamer ligation are performed. In alternative examples, the DNA sample is typically cleaved using ultrasound or restriction enzymes, and the DNA sample is cleaved into smaller, larger DNA fragments. Then, an aptamer comprising a specific sequence for subsequent ligation and sequencing reactions is ligated to both ends of the DNA fragment.
Cluster generation: the process is to amplify a DNA fragment to form an immobilized DNA fragment so that a DNA fragment can be formed into a base cluster later. In an alternative example, specifically, the DNA fragments are amplified by polymerase chain reaction (Polymerase Chain Reaction, PCR) or bridge amplification, etc., such that millions of replicas of each DNA fragment are formed, and the amplified DNA fragments are immobilized on a fixation plate. Each DNA fragment forms a separate cluster on the fixation plate.
Sequencing, namely sequencing and reading each base cluster on Flowcell, wherein a fluorescent marked dNTP sequencing primer is added in the sequencing, one end of the chemical formula of dNTP is connected with an azide group, polymerization can be prevented when the sequenced chain extends, one cycle (cycle) can be ensured to be prolonged by only one base, and a sequencing reading is correspondingly generated, namely sequencing while synthesizing. In one cycle, a base is identified by fluorescent labeling dNTPs for each base cluster, sequencing signal responses of different base types are respectively corresponding to fluorescent signals of specific colors, and the base corresponding to each base cluster in the current cycle can be judged according to the emitted fluorescent colors by laser scanning. In one cycle, tens of millions of base clusters are sequenced simultaneously at Flowcell, one fluorescent spot represents the fluorescence emitted by one base cluster, and one base cluster corresponds to one read in fastq. In the sequencing stage, fluorescent images on the surface of Flowcell are shot through an infrared camera, the fluorescent images are subjected to image processing and fluorescent point position positioning to detect base clusters, template construction is carried out according to base cluster detection results of a plurality of fluorescent images corresponding to sequencing signal responses of different base types, and positions of all base cluster template points (clusters) on Flowcell are constructed. And extracting fluorescence intensity from the filtered image according to the template, correcting the fluorescence intensity, and finally calculating a score according to the maximum intensity of the position of the template point of each base cluster to output fastq base sequence files.
The gene sequencer can also comprise an optical platform, the optical platform can comprise an operation table and a camera, wherein the sequencing chip can be arranged on the operation table, the gene sequencer uses laser to excite fluorescent markers on the sequencing chip to generate fluorescence, and collect fluorescent signals, and four bases are combined with different fluorescent markers to generate four different fluorescent wave bands. I.e. fluorescence images of four base types. The sequencing chip is photographed by a camera, a fluorescent image of a fluorescent signal generated on a Charge Coupled Device (CCD) on the testing chip is captured, a plurality of fluorescent points exist in one fluorescent image, and one fluorescent point in the fluorescent image represents fluorescence emitted by one base cluster.
The imaging mode of the gene sequencer can be a four-channel imaging system or a two-channel imaging system. For a two-channel imaging system, each camera needs to be exposed twice at the same location of the test chip. For a four-channel imaging system, the camera of each channel shoots once at the same position of the sample, and fluorescent images of four base types are respectively obtained. For example, a fluorescent image of the A base type, a fluorescent image representing the A base type, a fluorescent image of the C base type, a fluorescent image of the G base type, and a fluorescent image of the T base type are obtained, respectively. Since the light with a non-specific wavelength is selectively filtered by using the optical filter after the laser irradiation to obtain the fluorescent signal with a specific wavelength, each base type corresponds to a different fluorescent signal, in the same Cycle (Cycle) reaction, the same type of base cluster emits light with a far greater brightness than other types of bases in the corresponding type of base type, and the base clusters emitted by each channel theoretically do not have repetition.
After the fluorescence image is obtained by the gene sequencer, the collected image is subjected to gene image reconstruction, gene image registration and gene base identification (gene basecall), so that a gene sequence is obtained.
Wherein the genetic image reconstruction is used to increase the resolution of the fluorescent image to increase the sharpness of the image to reduce the cross-talk effects between samples. Gene image reconstruction includes, but is not limited to, conventional operations such as deconvolution.
The gene image registration is to correct the fluorescent images of four base types, so that the fluorescent images of four base types can be overlapped, and the fluorescent brightness of 4 channels at the same position can be extracted, thereby facilitating the subsequent base identification. Genetic image registration includes, but is not limited to, image registration of the same channel, global or local affine registration.
The gene recognition process is to judge whether the base cluster in the image belongs to one of A, C, G, T bases according to the registered image. After the data to be detected is subjected to gene identification, the data to be detected is converted into A, C, G, T base sequence information from a digital image, namely a DNA sequence result of a sample, so that the DNA sequence result is used for subsequent analysis and evaluation.
Data analysis: analysis and interpretation of sequencing data is performed based on the image data and the sequence information. Sequence information was aligned with the reference genome for mutation identification.
The process of sequencing one piece of data to be tested is called one-time Run, and the sequencing process of one piece of data to be tested consists of a plurality of cycles (cycles), wherein one Cycle corresponds to one reaction period, namely, corresponds to the identification of one base type in a sequencing chip. Sequencing, sequencing while synthesis, is performed. In one cycle, several tens of millions of base clusters are sequenced simultaneously.
One test data includes a plurality of DNA fragments, and each DNA fragment is added with one base during the above-mentioned sequencing, so that the length of the base sequence of the DNA of the test data determines the number of cycles. In each cycle, the gene sequencer can obtain one fluorescence image of each of four base types of ACGT, and when the data to be tested is sequenced, the gene sequencer can obtain the fluorescence images of ACGT channels of a plurality of cycles.
It should be noted that, the foregoing describes a sequencing procedure by using Illumina sequencing technology as an example of a large-scale parallel sequencing technology (MPS), and by amplifying a DNA molecule to be detected by a specific amplification technology, amplifying each DNA fragment (single-stranded library molecule) to form a base cluster, and constructing a template point of the base cluster on the sequencing chip according to a detection result of the base cluster, so that operations such as base recognition can be performed according to the template point of the base cluster in the following steps, thereby improving the base recognition efficiency and accuracy. It can be understood that the base recognition method based on fluorescence labeling dNTP gene sequencing provided by the embodiment of the present application is based on the positioning detection and base type recognition of the base cluster after the single-stranded library molecule is amplified on the sequencing chip, where each base cluster refers to a base signal acquisition unit, so that it is not limited to the amplification technology adopted for the single-stranded library molecule, that is, the base quality score correction method based on double-ended sequencing provided by the embodiment of the present application is also applicable to the positioning detection and base type recognition of the base signal acquisition unit for the sequencing chip in other large-scale parallel sequencing technologies, for example, the base signal acquisition unit may refer to the base cluster obtained by using bridge amplification technology in Illumina sequencing technology, and also includes nanospheres obtained by rolling circle amplification technology (RCA, rolling Circle Amplification), and the present application is not limited thereto. In the following examples, for the sake of understanding, a base signal acquisition unit will be described as an example of a base cluster.
Referring to FIG. 1, a schematic diagram of a gene sequencer according to an embodiment is shown. The gene sequencer can also comprise an operation table and a camera, wherein the sequencing chip can be arranged on the operation table, and a plurality of base clusters which are arranged according to an array or are randomly distributed on the gene sequencing chip. Through the staining reagent, different types of base clusters are respectively connected with one of different fluorescent markers in the sequencing reaction, the fluorescent markers emit fluorescent signals after being irradiated by laser, and the fluorescent signals with non-specific wavelengths are selectively filtered through a filter, so that the fluorescent signals with specific wavelengths are obtained. Fluorescent molecules in different fluorescent labels have different fluorescence emission wavelengths, such that different base clusters correspond to different fluorescent signals. Fluorescent images are acquired by a camera and analyzed to identify the base class of each base cluster. Wherein the camera may be an optical microscope.
In the process of one-time gene sequencing, the process of sequencing one gene sample to be tested is called one-time Run, one gene sample to be tested is broken into M base sequences to be tested, which can also be called short chains, each base sequence to be tested comprises N base clusters, and in one cycle, sequencing reaction is carried out on a sequencing chip on the base clusters at the top end of the M short chains at the same time. On a sequencing chip, each base cluster being sequenced corresponds to a position, and in one cycle, tens of millions of base clusters are sequenced simultaneously. N determines the number of cycles tested, the greater N the number of cycles. And under different cycles, sequencing the base clusters in the M base sequences to be tested respectively. For example, if a sample of the gene to be tested is broken into tens of thousands of short strands, each of which is 100 bases in length, then 100 cycles of sequencing reactions are required to identify the base type. At each cycle, the top base cluster of the ten thousand short chains was subjected to a sequencing reaction on a sequencing chip. After one-time gene sequencing is completed, a plurality of sequencing sequences which are subjected to sequencing can be obtained from a sequencing result file, namely a plurality of short chains which are subjected to sequencing can be obtained, and one sequencing sequence is a short chain or a read or a sequencing fragment.
The conversion relationship between the error rate of the base recognition and the base quality score value is as follows for any one base: q= -10 log 10 E, where E is the recognition error rate of the base by the sequencer, Q is the base quality score value, for example if the Q value output by the sequencer for a base is 30, it indicates that the recognition error rate of the base is 0.001.
The most widely used base quality score correction technique is the same at present in that the base grouping method, the prior art considers the following four features of bases to be features that significantly affect the accuracy of the base quality score, and groups the bases in a sample according to the following features (bins): namely the order in which the reads are located (read order), the number of sequencing cycles (cycle), the duplex base consisting of the base and 1bp upstream of the base (dinucleotide context), the reported estimated base mass fraction of the sequencer (qual). However, these four features are preset and not suitable for all samples, and may be due to different samples, different instruments and other influencing factors, and some features may not be the main factors influencing the base quality score, so that the grouping accuracy is influenced, and the correction effect is influenced. Grouping bases based on the four general classes of features described above is based on the characteristics of Illumina NovaSeq sequencers. For other brands of sequencers, the salient features may differ from the Illumina platform sequencer, and instead of grouping the bases using fixed four types of features, the appropriate salient features should be chosen for their specific characteristics. For example, as shown in FIG. 2, the Q value of Salus Evo sequencers is insensitive to read order (read order), where the red line is Salus Evo sequencer data and the blue line is llumina sequencer data, where the black dashed line is the diagonal line. It can be seen that the data points of Salus Evo sequencers are closer to the diagonal with the read order grouping, indicating that their nominal base mass fraction value is very close to the actual base mass fraction value. The same salient feature grouping as Illumina NovaSeq if used can cause the grouping to be inaccurate, affecting the final correction effect.
They first group all the bases in the sample that participate in the calculation according to the above four features, and then calculate the error rate and corresponding base quality score value for each group by comparing the base type of each base with the correct base at the genomic position, where the base quality score value for each group contains an error, called the original base quality score value (raw Q). To reduce the error, the original base quality score values of several groups are then fitted to a new base quality score value, called the fitted base quality score value (fitted Q), using local weighted regression (locally weighted regression, lowess), which is considered to be the true base quality score value for the bases within the group. Finally, the base quality score values for all bases in each group are set to the fitted base quality score values in that group.
When calculating the error rate of the grouping, the number of the error bases of the bases in the grouping needs to be counted, that is, the correct base type of each base in the sequencing file needs to be determined, the method for determining the correct base type base in the prior art can be based on the correction technology of the known standard sites as the name implies, the user compares the base at one base position with the reference genome sequence, the reference genome sequence is a list of the known standard sites, the content of the list is a series of genome coordinates and the correct base type of each coordinate position, the coordinate position corresponding to the base position is determined in the reference genome according to the base position, and the base at the corresponding coordinate position is determined as the correct base type at the base position. GATK-BQSR currently uses the known standard site of human in NCBI dbSNP database. However, the current method of determining the correct base type using known standard sites is only applicable to some species which are very widely studied, such as humans. Most other species do not have known standard sites. This limitation makes GATK-BQSR function essentially unusable on sequencing data from samples of other species. Secondly, the method considers that all base types which are not matched with the known standard sites in the sequencing result are detection errors of a sequencer, which is not matched with common knowledge, because various external factors, such as high-concentration formaldehyde reagent, high-strength mechanical force, inaccurate endonucleases and DNA synthetases, improper sample storage environment and the like, can change the base types in a sample in the process of high-throughput sequencing experiment, so that certain sites in the sample are not matched with the base types of the known standard sites, and the detection errors of the sequencer are not caused. Both of them are attributed to detection errors by the sequencer, which will underestimate the base quality score value.
In double ended sequencing, if the length of one insert (INSERT FRAGMENT) is less than twice the sequencing length/read length (READ LENGTH), then there will be overlap between insert reads 1 and 2. As shown in FIG. 3, a schematic representation of the sequencing of one insert in double-ended sequencing, if the insert is small enough, the results of read1 and read2 overlap in part, where each base is detected twice. Within the overlap region, the sequencer detects virtually twice for each base of the same insert. In the prior art, if the two detection results of the bases are the same, the detected base type base is the correct base type base, and if the two detection results of the bases are different, the base with the position mutation frequency or allele frequency (allele frequency) of more than or equal to 99% is used as the correct base type of the locus, namely, the method considers all loci on the genome in the sample to be homozygous. This is also inconsistent with common sense, since even a large number of heterozygous sites are present in a healthy human genome, for example the standard HG001 NA12878 human genome, which has at least two million heterozygous sites. If the sample is unhealthy tissue, such as a tumor or a sample containing a genetic disorder (aborted fetal tissue, etc.), the number of heterozygous sites will be higher. Using this method, the correct base type of the heterozygous site cannot be determined because there is no base with a mutation frequency of 99% or more at the heterozygous site position within the genome. Therefore, if the results of the two detection of the base are different, the correct base type of the site is not adapted to the position of the heterozygous site in the genome by using 99% or more of the base having the mutation frequency or allele frequency (allele frequency).
When fitting is calculated to get the fitted base quality scores, the model is fitted using a local weighted regression (lowess) method, the model cannot be fitted from the function value outside the variable value range, but only values in a partial range are used, for example when groups with estimated base quality scores of 11 to 30 are used, their error rates are E (11), E (12)..e (30), we cannot fit the error rate E (31) of groups with estimated base quality score values of 31. And this approach does not take into account consistency of continuous variable grouping, but uses values in a partial range, which in some cases results in overfitting, for example, fig. 4 is a schematic diagram of the correlation of the nominal Q value and the logarithm of the error rate for different c-cycle numbers in the case of the same duplex base AC in an embodiment of Illumina sequencer, and for Salus Pro sequencer, when duplex base (dinucleotide context) is AC and read order is 1st read, the relationship of the reported Q value, i.e., estimated base quality score (Qual), and true error rate should be a quadratic linear relationship. Fitting different cycles using local weighted regression gives a quadratic linear relationship with a large number of cycles (140 cycles if shown on the left in fig. 4), while the over-fitting occurs with a medium number of cycles (88 cycles on the right in fig. 4).
Referring to fig. 5, a flow chart of a base quality score correction method based on double-ended sequencing according to an embodiment of the application is shown. The base quality score correction method based on double-end sequencing is applied to a gene sequencer, and comprises the following steps of:
S11, acquiring a double-end sequencing file, and acquiring first sequence data and second sequence data corresponding to each sequencing fragment from the double-end sequencing file.
In this embodiment, the bidirectional sequencing file is a sequencing result file output by the sequencer, and the bidirectional sequencing file includes the results of respectively performing two-end sequencing on each sequencing fragment, namely, first sequence data and second sequence data, wherein the first sequence data is base sequence data obtained by sequencing from a first end to a second end of the sequencing fragment, and the second sequence data is base sequence data obtained by sequencing from the second end to the first end of the sequencing fragment, for example, sequencing sequence data corresponding to read1 in fig. 3 and sequencing sequence data corresponding to read 2.
In this embodiment, the double-ended sequencing file is a file directly output from the sequencer, and pretreatment is required for the double-ended sequencing file. Wherein preprocessing includes, but is not limited to, filtering out data that does not meet preset conditions, thereby improving the quality of the data. Comparing the processed double-ended sequencing file to a reference genome by using comparison software, wherein each base position is provided with a corresponding unique reference position in the reference genome, so that after the comparison result is obtained, the first sequence data and the second sequence data corresponding to each sequencing fragment can be obtained.
Wherein the first sequence data and the second sequence data comprise at least one of a base position of each base, a base sequence of each sequencing fragment, an estimated base fraction of each base in each sequencing fragment, a number of cycles of sequencing corresponding to each base, an identity of the sequencing fragment, a length of the sequencing fragment, and an order of the sequencing fragments (read order). Wherein the base position indicates the position of the base in the sequencing fragment, the base sequence indicates the base type of each base in each sequencing fragment output by the sequencer, and the sequence indicates the number and the sequencing direction of the sequencing fragment.
S12, based on the first sequence data and the second sequence data corresponding to each sequencing fragment, acquiring the base information of the base in the overlapping region of each sequencing fragment.
In this embodiment, when the first sequence data and the second sequence data are obtained for any one of the sequencing fragments, the same base located in the overlapping region of the sequencing fragments corresponds to two base sequences. Wherein the base information includes estimated base mass fractions obtained from the double ended sequencing file and data for candidate features related to the base mass fractions. Wherein candidate features include, but are not limited to: the sequence of the sequencing fragments, the number of sequencing cycles in which the bases are located, the content ratio of the G base and the C base in the sequencing fragments, the estimated base mass fraction of the base, the estimated base mass fraction corresponding to 2bp upstream of the base, the estimated base mass fraction corresponding to 1bp upstream of the base, the estimated base mass fraction corresponding to 2bp downstream of the base, the estimated base mass fraction corresponding to 1bp downstream of the base, the base type corresponding to 2bp upstream of the base, the base type corresponding to 1bp upstream of the base, the base type corresponding to 2bp downstream of the base, the base type corresponding to 1bp downstream of the base, and the correct base type at the base position. Where the candidate feature indicates a correlation with the base quality score, but possibly not a feature that primarily affects the base quality score, it is desirable to select a salient feature from the candidate features. The candidate features may be preset features, or all features may be regarded as candidate features by default.
S13, determining the significance characteristic which has the association degree with the estimated base quality score and meets the preset condition from the candidate characteristics based on the base information of the bases in the overlapping region of each sequencing fragment.
In this embodiment, the candidate features only represent features related to the base quality score, but may not mainly affect the base quality score, so that it is necessary to select salient features from the candidate features, the salient features represent features with high correlation degree with the base quality score, the salient features are selected by the base information of the base in the overlapping region of each sequencing fragment, and the salient features are determined according to the actual sequencing result of the sequencing fragment of the actual sample, so that the selected salient features can truly reflect features affecting the base quality score, thereby reducing the influence of external factors such as a sample, a sequencing platform, and the like on the correction of the base quality score.
S14, grouping the base data in the double-end sequencing file based on the significance characteristics and the estimated base quality scores to obtain a plurality of base data sets, acquiring base information of bases in each base data set, and calculating to obtain the original base quality score corresponding to each base data set based on the base information of the bases in each base data set.
In this embodiment, since the saliency features can truly reflect features affecting the base quality score, a base data set is performed according to the saliency features and the estimated base quality score, a more accurate base data set can be obtained, and then, based on the base information of each base data set, a more accurate original base quality score corresponding to each base data set can be calculated.
S15, fitting to obtain a base quality score correction model based on the original base quality score corresponding to each base data set and the data of the significance characteristics in the base information of each base data set, and calculating to obtain the fitted base quality score of each base in each base data set based on the base quality score correction model.
In this embodiment, after the base quality score correction model is obtained by the fitting method, the data of the significance characteristic in the base information of each base data set is used as the argument of the base quality score correction model, so that the fitted base quality score of each base data set can be obtained, then the fitted base quality score of the base in each base data set is the fitted base quality score of each base data set, and then the estimated base quality score of each base is replaced by the fitted base quality score. The base quality score correction model is obtained by fitting the original base quality score corresponding to each base data group and the data of the significance characteristics in the base information of each base data group, namely, the data of all groups are combined and used, and the model obtained by fitting the data of the base data groups is not part of the base data groups, and the consistency of continuous variable base data groups is comprehensively considered, so that the overfitting of the base quality score correction model is avoided, and the correction effect of the base quality score of the base is improved.
As shown in FIG. 6, in one embodiment, the relationship between the estimated base quality score and the fitted base quality score after the base data set is performed based on the cycle number is shown, different colors represent different sequencing cycle numbers, the left graph in FIG. 6 is a relationship between the estimated base quality score and the fitted base quality score before correction, and the right graph is a relationship between the estimated base quality score and the fitted base quality score after correction. As shown in fig. 7, in an embodiment, the relation between the estimated base quality score and the fitted base quality score after grouping based on the GC content in the sequenced fragments is shown, different colors represent GC content of different reads, red, blue and green represent high, medium and low GC content sequentially, the left graph in fig. 7 is a relation between the estimated base quality score and the fitted base quality score before correction, and the right graph is a relation between the estimated base quality score and the fitted base quality score after correction. As shown in fig. 8, fig. 8 is a schematic diagram showing the relationship between the estimated base quality score and the fitted base quality score after grouping based on the estimated base quality score of 2bp upstream in an embodiment, different colors represent different estimated base quality scores corresponding to 2bp upstream, the left graph in fig. 8 is a graph showing the relationship between the estimated base quality score and the fitted base quality score before correction, and the right graph is a graph showing the relationship between the estimated base quality score and the fitted base quality score after correction. As shown in fig. 9, fig. 9 is a schematic diagram showing a relationship between an estimated base quality score and a fitted base quality score after grouping based on an estimated base quality score corresponding to 1bp upstream in an embodiment, different colors represent different estimated base quality scores corresponding to 1bp upstream, a left graph in fig. 9 is a relationship between an estimated base quality score and a fitted base quality score before correction, and a right graph is a relationship between an estimated base quality score and a fitted base quality score after correction. As shown in fig. 10, in an embodiment, the relationship between the estimated base quality score and the fitted base quality score after grouping based on the base types is shown, different colors represent different base types, red, green, blue and purple are ACGT in sequence, the left graph in fig. 10 is a relationship between the estimated base quality score and the fitted base quality score before correction, and the right graph is a relationship between the estimated base quality score and the fitted base quality score after correction. As shown in FIG. 11, FIG. 11 is a schematic diagram showing the relationship between the estimated base quality score and the fitted base quality score after grouping based on the upstream 1bp base type in one embodiment, different colors represent different upstream 1bp base types, red, green, blue and purple are ACGT in turn, the left graph in FIG. 11 is a graph showing the relationship between the estimated base quality score and the fitted base quality score before correction, and the right graph is a graph showing the relationship between the estimated base quality score and the fitted base quality score after correction. As shown in fig. 12, fig. 12 is a schematic diagram showing a relationship between an estimated base quality score and a fitted base quality score after grouping based on an upstream 2bp base type in an embodiment, different colors represent different upstream 2bp base types, red, green, blue and purple are ACGT in sequence, a left graph in fig. 12 is a relationship between an estimated base quality score and a fitted base quality score before correction, and a right graph is a relationship between an estimated base quality score and a fitted base quality score after correction. In the above figures 6 to 12, the diagonal lines are represented by red dotted lines, each point in the figures representing a base data set (bin), the solid line being a quadratic fit curve of a different salient feature, the closer to the diagonal line in the figures, indicating that the estimated base mass fraction is closer to the fitted base mass fraction, it can be seen that before correction (left plot), the estimated base mass fraction deviates significantly from the original base mass fraction, and there is a significant correlation between the original base mass fraction value and the different salient feature of the base, for example, the left plot of figure 7 shows that if the estimated base mass fraction value of one base is 25, then when the base is in the high GC content region, then the original base mass fraction value for that base is about 20; When the base is in the middle GC content region, then the base's original base mass fraction value is about 26; when the base is in the low GC content region, the original base mass fraction value is about 30, and this condition disappears after correction. After correction of the base mass fraction of the base by the above comparison, the estimated base mass fraction of the base is adjusted to be similar to the fitted base mass fraction and independent of any base characteristics.
In the above embodiment, the first sequence data and the second sequence data of the same sequencing fragment are obtained from the double-ended sequencing file, the base information of the bases in the overlapping region can be determined by the first sequence data and the second sequence data of the same sequencing fragment, the significance characteristics affecting the base quality score are determined based on the base information of the bases in the overlapping region, more accurate base data sets can be obtained by grouping based on the significance characteristics, then the original base quality score corresponding to each base data set can be calculated based on the base information of each base data set, so that in the subsequent fitting step, more accurate fitting results can be obtained, thereby improving the effect of correcting the base quality score.
In some embodiments, the base information further includes the correct base type for which the base position corresponds; the obtaining of base information for bases in the overlapping region of each sequenced fragment comprises:
For a base position in the overlap region of each sequencing fragment, if the base position corresponds to the same first base type in the first sequence data as the second base type in the second sequence data, the correct base type at the base position is either the first base type or the second base type;
If the first base type is different from the second base type, and the obtained estimated base quality score corresponding to the first base type and the obtained estimated base quality score corresponding to the second base type are both larger than a preset quality score threshold, taking the base type corresponding to the base type with the higher estimated base quality score as the correct base type corresponding to the base position;
If the first base type is different from the second base type, the obtained estimated base quality score corresponding to the first base type and the obtained estimated base quality score corresponding to the second base type are both smaller than or equal to a preset quality score threshold, and when a main allele is present at a reference position corresponding to the base position in the reference genome, the main allele is used as a correct base type corresponding to the base position;
if the first base type is different from the second base type, the obtained estimated base quality score corresponding to the first base type and the obtained estimated base quality score corresponding to the second base type are both smaller than or equal to a preset quality score threshold, and when a main allele base does not exist at a reference position corresponding to the base position in the reference genome, the base type at the reference position is taken as a correct base type corresponding to the base position.
In this example, the base at one base position in the overlap region corresponds to two base types, namely, a first base type in the first sequence data and a second base type in the second sequence data, and if the two base types are identical, it means that the sequencing is correct, and the first base type or the second base type is set as the correct base type corresponding to the base position. If the two base types are different, the possible inaccurate sequencing is indicated or the position is possible to be the position of the heterozygous site, so that further judgment is needed to judge whether the obtained estimated base quality score corresponding to the first base type and the obtained estimated base quality score corresponding to the second base type are both larger than a preset quality score threshold value or not, if the obtained estimated base quality scores are both larger than the preset quality score threshold value, the sequencing result is correct, and the base type corresponding to the higher base type is taken as the correct base type corresponding to the base position; if the number of the base positions is smaller than the number of the possible heterozygous sites, it is further determined that a major allele exists at the reference position corresponding to the base position in the reference genome, if the major allele exists at the reference position corresponding to the base position in the reference genome, the presence of the major allele at the reference position corresponding to the base position in the reference genome is regarded as the correct base type corresponding to the base position, for example, the sequencing depth corresponding to the reference position is 100x, that is, 100 times in total at the base position, 60A, 30T, 8C and 2G are detected in the sequencing result, and then A is the only major allele, and then A is regarded as the correct base type at the base position, and if the presence of the major allele at the reference position corresponding to the base position in the reference genome is not detected after the further determination, the base type at the reference position corresponding to the base position in the reference genome is determined as the correct base type corresponding to the base position.
In the embodiment, the correct base is identified by using the double sequencing result of the overlapping region in double sequencing, when the double sequencing results are different, the base with the mutation frequency of more than 99% is not simply considered to be the correct base, but a series of conditions are analyzed, such as combining with analysis of whether the base types of double sequencing in the overlapping region are the same, analysis of the estimated base quality fraction corresponding to double sequencing, analysis of whether a main allele base exists at the base position, and the like, and comprehensive factors confirm the correct base types of the bases, so that the base quality fraction correction method based on double sequencing provided by the application is applicable to a plurality of types, has a large identification range and high identification precision, and basically is applicable to determining the correct base types corresponding to any species and any sample types without inputting a standard site list depending on the species, thereby improving the correction effect of the base quality fraction of the bases.
In some embodiments, the determining a significance signature from the candidate signatures that satisfies a preset condition with respect to the degree of association with the estimated base mass fraction based on base information of bases in the overlapping region of each sequencing fragment comprises:
calculating the salience value of each candidate feature based on the base information of the bases in the overlapping region of each sequencing fragment, wherein the salience value indicates the association degree of the candidate feature and the base quality score; and determining the candidate features with the significance values lower than the preset significance values as significance features.
In this embodiment, the significance value indicates the degree of association between the candidate feature and the base quality score, and the lower the significance value, the higher the degree of association, and the higher the significance value, the lower the degree of association. For example, the preset saliency value is set to 0.05, as shown in fig. 3, fig. 13 is a schematic diagram of saliency values corresponding to candidate features in an embodiment, the reported Q value is the estimated base quality score, it can be seen from the figure that saliency values corresponding to features 2,3,5,6,9,10,11 are all less than 0.05, and features 2,3,5,6,9,10,11 are salient features.
Optionally, the salient features include at least one of: the number of sequencing cycles in which the base is located, the content ratio of G base and C base in the sequencing fragment, the estimated base mass fraction of the base, the estimated base mass fraction corresponding to 2bp upstream of the base, the estimated base mass fraction corresponding to 1bp upstream of the base, the estimated base mass fraction corresponding to 2bp downstream of the base, the estimated base mass fraction corresponding to 1bp downstream of the base, the base type corresponding to 2bp upstream of the base, the base type corresponding to 1bp upstream of the base, the base type corresponding to 2bp downstream of the base, the base type corresponding to 1bp downstream of the base, the correct base type at the base position where the base is located. Based on the base information of the bases in the overlapping region of the sequenced fragments in the double-ended sequencing file, suitable significance characteristics can be determined according to the differences of the samples.
In the above embodiment, in the correction process, instead of using the base-fixed feature, a plurality of candidate features are collected, the salient features are found out by a statistical method, the salient values corresponding to the candidate features are calculated respectively based on the base information of the bases in the overlapping region of the sequencing fragments, the salient features are screened out, then the base data sets can be obtained by grouping according to the salient features and the estimated base quality scores, and then the original base quality score corresponding to each base data set can be calculated based on the base information of each base data set, so that in the subsequent fitting step, a more accurate fitting result can be obtained, thereby improving the effect of correcting the base quality score.
Optionally, the calculating, based on the base information of the bases in the overlapping region of each sequencing fragment, the significance value between each candidate feature and the base quality score sequentially includes:
Respectively taking one candidate feature as a feature to be checked, acquiring a target feature corresponding to the feature to be checked, traversing the bases in the overlapping area of all sequencing fragments, and counting the traversed bases based on the target feature corresponding to the feature to be checked to obtain effective information corresponding to the feature to be checked, wherein the effective information comprises a plurality of pieces of data, and each piece of data comprises the data of the target feature;
Grouping a plurality of pieces of data in the effective information corresponding to the feature to be checked by taking the feature to be checked and the estimated base quality score as grouping standards, and calculating to obtain the original base quality score and the error rate of each group in the effective information corresponding to the feature to be checked;
And calculating a significance value between the feature to be inspected and the base quality score by using a statistical analysis method based on the data of the target feature, the original base quality score and the error rate of each group in the effective information corresponding to the feature to be inspected.
In this embodiment, the target feature may be the same or different for each feature to be inspected, the target feature may be a part or all of the candidate features, and the significance value between the feature to be inspected and the base quality score is calculated based on the data of the target feature. Taking the estimated base quality score value of 1bp upstream of the base as an example as a feature to be checked, the correct base type of the base, the estimated base quality score of 1bp upstream of the base, the estimated base quality score of the base, the base type and the number of the base can be selected as target features, wherein the base type is a base result output by a sequencer, the correct base type of the base, the estimated base quality score of 1bp upstream of the base, the estimated base quality score of the base and the base type are taken as statistical rules, the bases in the overlapping area of all sequencing fragments are traversed, the correct base type of the same base, the estimated base quality score of 1bp upstream of the base, the estimated base quality score of the base and the base type are combined, and the number of each case is counted, so that a graph shown in fig. 14 is obtained, and fig. 14 is a schematic diagram of effective information of preliminary statistics in one embodiment; then, the data in the graph 14 are grouped by taking the estimated base quality fraction of 1bp upstream of the base and the estimated base quality fraction as grouping standards, the original base quality fraction and the error rate of each group in the effective information corresponding to the estimated base quality fraction of 1bp upstream of the base are obtained through calculation, the specific data are shown in the graph 15, and the data in the graph 15 are subjected to statistical analysis to obtain the original base quality fraction of each group shown in the graph 16.
Optionally, the calculating to obtain the original base quality score and the error rate of each group in the effective information corresponding to the feature to be checked includes: in each obtained group, counting the number of error bases and the total number of bases in each group, calculating the error rate of each group based on the number of error bases and the total number of bases in each group, calculating the quality score of the original bases corresponding to each group based on the error rate of each group,
Raw base quality score = -10 x log 10 (error rate),
And then calculating a significance value between the feature to be inspected and the base quality score by using a Mann-Whitney U test method if the feature to be inspected is a bi-level variable, and calculating the significance value between the feature to be inspected and the base quality score by using variance analysis if the feature to be inspected is a tri-level or more variable.
In the above embodiment, in the correction process, instead of using the base-fixed feature, a plurality of candidate features are collected, the salient features are found out by a statistical method, the salient values corresponding to the candidate features are calculated respectively based on the base information of the bases in the overlapping region of the sequencing fragments, the salient features are screened out, then the base data sets can be obtained by grouping according to the salient features and the estimated base quality scores, and then the original base quality score corresponding to each base data set can be calculated based on the base information of each base data set, so that in the subsequent fitting step, a more accurate fitting result can be obtained, thereby improving the effect of correcting the base quality score.
In some embodiments, the fitting to obtain the base quality score correction model based on the original base quality score corresponding to each base data set and the data of the significance feature in the base information of each base data set includes:
Taking each base data set as one sample point to form a plurality of sample points; the saliency characteristic is used as an independent variable and the original base quality fraction is used as an independent variable, and based on a plurality of sample points, a layered linear model is utilized to obtain a base quality fraction correction model through fitting.
In this embodiment, after obtaining the significance signature, the base data in the double-ended sequencing file of the significance signature and the estimated base quality score may be grouped to obtain a plurality of base data sets, and in each base data set, the number of erroneous bases and the total number of bases in each base data set are counted according to the above method, and the error rate of each base data set is calculated, and the original base quality score corresponding to each base data set is calculated based on the error rate of each base data set. Thus, each base data set can be understood as one sample point, a plurality of base data sets form a plurality of sample points, the significance characteristic is taken as an independent variable, the original base quality score is taken as a dependent variable, and a base quality score correction model is obtained by fitting based on the plurality of sample points, wherein the base quality score correction model is a layered linear model. For example, if the significance signature is 11 significance signatures, the 11 significance signatures are independent variables, and the plurality of sample points are sample points formed by all base data sets obtained based on the double-ended sequencing file, and are not local data, that is, data of all sets are used in combination, and are not a model obtained by fitting partial base data sets, and consistency of continuous variable base data sets is comprehensively considered, so that overfitting of a base quality score correction model is avoided, and correction effect of base quality scores of bases is improved.
In the above embodiment, the base quality score correction model is obtained by fitting the original base quality score corresponding to each base data set and the data of the significance characteristic in the base information of each base data set, that is, the data of all the base data sets are combined and not the model obtained by fitting the data of part of the base data sets, and the consistency of the base data sets of continuous variables is comprehensively considered, so that the overfitting of the base quality score correction model is avoided, and the correction effect of the base quality score of the base is improved.
In another aspect of the application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the base quality score correction method based on double-ended sequencing according to any of the embodiments of the application. Wherein, in the computer program product, an alternative implementation form of the program module architecture of the computer program for realizing the steps of the method can be a base quality score correction device based on double-ended sequencing. Referring to fig. 17, an embodiment of the present application provides a base mass fraction correction device based on double-ended sequencing, comprising: the obtaining module 1701 is configured to obtain a double-end sequencing file, and obtain first sequence data and second sequence data corresponding to each sequencing fragment from the double-end sequencing file; wherein the first sequence data is base sequence data obtained by sequencing from a first end to a second end of the sequencing fragment, and the second sequence data is base sequence data obtained by sequencing from the second end to the first end of the sequencing fragment; the obtaining module 1701 is further configured to obtain base information of a base in an overlapping region of each sequencing fragment based on the first sequence data and the second sequence data corresponding to each sequencing fragment; the base information includes an estimated base mass fraction obtained from the double ended sequencing file, and a candidate feature associated with the estimated base mass fraction; a determining module 1702 configured to determine, from the candidate features, a salient feature having a degree of association with the estimated base quality score satisfying a preset condition based on base information of bases in an overlapping region of each sequencing fragment; a calculation module 1703, configured to group the base data in the double-ended sequencing file based on the significance feature and the estimated base quality score, obtain a plurality of base data sets, obtain base information of bases in each base data set, and calculate an original base quality score corresponding to each base data set based on the base information of the bases in each base data set; the calculation module 1703 is further configured to obtain a base quality score correction model by fitting based on the original base quality score corresponding to each base data set and the significance feature in the base information of each base data set, and calculate a fitted base quality score of each base in each base data set based on the base quality score correction model.
Optionally, the acquiring module 1701 is further configured to:
for a base position in the overlap region of each sequencing fragment, if the base position corresponds to the same first base type in the first sequence data as the second base type in the second sequence data, the correct base type at the base position is either the first base type or the second base type;
If the first base type is different from the second base type, and the obtained estimated base quality score corresponding to the first base type and the obtained estimated base quality score corresponding to the second base type are both greater than a preset quality score threshold, taking the base type corresponding to the base type with the higher estimated base quality score as the correct base type corresponding to the base position;
If the first base type is different from the second base type, the obtained estimated base quality score corresponding to the first base type and the obtained estimated base quality score corresponding to the second base type are both smaller than or equal to a preset quality score threshold, and when a main allele exists at a reference position corresponding to the base position in the reference genome, the main allele is taken as a correct base type corresponding to the base position;
If the first base type is different from the second base type, the obtained estimated base quality score corresponding to the first base type and the obtained estimated base quality score corresponding to the second base type are both smaller than or equal to a preset quality score threshold, and when the main allele base does not exist at the reference position corresponding to the base position in the reference genome, the base type at the reference position is taken as the correct base type corresponding to the base position.
Optionally, the determining module 1702 is further configured to:
Calculating the significance value of each candidate feature based on the base information of the bases in the overlapping region of each sequencing fragment; wherein the significance value indicates the degree of association of the candidate feature with the estimated base quality score;
and determining the candidate features with the significance values lower than the preset significance values as significance features.
Optionally, the computing module 1703 is further configured to:
Taking each candidate feature as a feature to be checked, acquiring a target feature corresponding to the feature to be checked, traversing the bases in the overlapping area of all sequencing fragments, and counting the traversed bases based on the target feature corresponding to the feature to be checked to obtain effective information corresponding to the feature to be checked; wherein the effective information comprises a plurality of pieces of data, each piece of data comprising a target feature;
Taking the to-be-inspected feature and the estimated base quality score as base data set standards, performing base data set on a plurality of pieces of data in the effective information corresponding to the to-be-inspected feature, and calculating to obtain the original base quality score and error rate of each set in the effective information corresponding to the to-be-inspected feature;
And calculating a significance value between the feature to be checked and the estimated base quality score by using a statistical analysis method based on the data of the target feature, the original base quality score and the error rate of each group in the effective information corresponding to the feature to be checked.
Optionally, the salient features include at least one of: the number of sequencing cycles in which the base is located, the content ratio of G base and C base in the sequencing fragment, the estimated base mass fraction of the base, the estimated base mass fraction corresponding to 2bp upstream of the base, the estimated base mass fraction corresponding to 1bp upstream of the base, the estimated base mass fraction corresponding to 2bp downstream of the base, the estimated base mass fraction corresponding to 1bp downstream of the base, the base type corresponding to 2bp upstream of the base, the base type corresponding to 1bp upstream of the base, the base type corresponding to 2bp downstream of the base, the base type corresponding to 1bp downstream of the base, the correct base type at the base position where the base is located.
Optionally, the computing module 1703 is further configured to:
counting the number and total number of the error bases in each base data group, and calculating the error rate of each base data group based on the number and total number of the error bases in each base data group;
Calculating the quality fraction of the original base corresponding to each base data group based on the error rate of each base data group; the error rate is calculated as follows:
raw base quality score = -10×log 10 (error rate).
Optionally, the computing module 1703 is further configured to:
taking each base data set as one sample point to form a plurality of sample points;
And taking the significance characteristic as an independent variable and the original base quality score as an independent variable, and fitting by using a layered linear model based on a plurality of sample points to obtain a base quality score correction model.
It will be appreciated by those skilled in the art that the structure of the double-ended sequencing-based base quality score correction device in FIG. 17 does not constitute a limitation of the double-ended sequencing-based base quality score correction device, and that the respective modules may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or independent of a controller in a computer device, or may be stored in software in a memory in the computer device, so that the controller may call and execute operations corresponding to the above modules. In other embodiments, more or fewer modules than shown may be included in the base quality score correction device based on double ended sequencing.
Referring to fig. 18, in another aspect of the embodiment of the present application, there is further provided a base quality score correction apparatus 200 based on double-ended sequencing, including a memory 3011 and a processor 3012, wherein the memory 3011 stores a computer program, and the computer program when executed by the processor causes the processor 3012 to execute the steps of the base quality score correction method based on double-ended sequencing provided in any of the above embodiments of the present application. Base quality score correction device 200 based on double-ended sequencing may include a gene sequencer, a computing device (e.g., a desktop computer, a laptop computer, a tablet computer, a handheld computer, a smart speaker, a server, etc.), a mobile phone (e.g., a smart phone, a wireless phone, etc.), a wearable device (e.g., a pair of smart glasses or a smart watch), or the like.
Where the processor 3012 is a control center, various interfaces and lines are utilized to connect various portions of the overall computer device, perform various functions of the computer device and process data by running or executing software programs and/or modules stored in the memory 3011, and invoking data stored in the memory 3011. Optionally, the processor 3012 may include one or more processing cores; preferably, the processor 3012 may integrate an application processor and a modem processor, wherein the application processor primarily handles operating systems, user pages, applications, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 3012.
The memory 3011 may be used to store software programs and modules, and the processor 3012 executes various functional applications and data processing by executing the software programs and modules stored in the memory 3011. The memory 3011 may mainly include a storage program area that may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and a storage data area; the storage data area may store data created according to the use of the computer device, etc. In addition, memory 3011 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 3011 may also include a memory controller to provide access to the memory 3011 by the processor 3012.
In another aspect of the embodiments of the present application, there is also provided a storage medium storing a computer program, where the computer program when executed by a processor causes the processor to execute the steps of the base quality score correction method based on double-ended sequencing provided in any one of the above embodiments of the present application.
Those skilled in the art will appreciate that implementing all or part of the processes of the methods provided in the above embodiments may be accomplished by computer programs stored on a non-transitory computer readable storage medium, which when executed, may comprise processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (SYNCHLINK) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. The scope of the invention is to be determined by the appended claims.

Claims (10)

1. A base quality score correction method based on double-ended sequencing, comprising:
Obtaining a double-end sequencing file, and obtaining first sequence data and second sequence data corresponding to each sequencing fragment from the double-end sequencing file; wherein the first sequence data is base sequence data obtained by sequencing from a first end to a second end of the sequencing fragment, and the second sequence data is base sequence data obtained by sequencing from the second end to the first end of the sequencing fragment;
Acquiring base information of bases in an overlapping region of each sequencing fragment based on the first sequence data and the second sequence data corresponding to each sequencing fragment; the base information includes an estimated base mass fraction obtained from the double ended sequencing file, and a candidate feature associated with the estimated base mass fraction;
Determining a salient feature with the association degree with the estimated base quality score meeting a preset condition from the candidate features based on the base information of the bases in the overlapping region of each sequencing fragment;
Grouping the base data in the double-end sequencing file based on the significance characteristics and the estimated base quality scores to obtain a plurality of base data sets, acquiring base information of bases in each base data set, and calculating to obtain an original base quality score corresponding to each base data set based on the base information of the bases in each base data set;
Fitting to obtain a base quality score correction model based on the original base quality score corresponding to each base data set and the significance characteristics in the base information of each base data set, and calculating to obtain the fitted base quality score of each base in each base data set based on the base quality score correction model.
2. The method for correcting a base quality score based on double-ended sequencing according to claim 1, wherein the base information further comprises a correct base type corresponding to a base position; the obtaining of base information for bases in the overlapping region of each sequenced fragment comprises:
for a base position in the overlap region of each sequencing fragment, if the base position corresponds to the same first base type in the first sequence data as the second base type in the second sequence data, the correct base type at the base position is either the first base type or the second base type;
If the first base type is different from the second base type, and the obtained estimated base quality score corresponding to the first base type and the obtained estimated base quality score corresponding to the second base type are both greater than a preset quality score threshold, taking the base type corresponding to the base type with the higher estimated base quality score as the correct base type corresponding to the base position;
If the first base type is different from the second base type, the obtained estimated base quality score corresponding to the first base type and the obtained estimated base quality score corresponding to the second base type are both smaller than or equal to a preset quality score threshold, and when a main allele exists at a reference position corresponding to the base position in the reference genome, the main allele is taken as a correct base type corresponding to the base position;
If the first base type is different from the second base type, the obtained estimated base quality score corresponding to the first base type and the obtained estimated base quality score corresponding to the second base type are both smaller than or equal to a preset quality score threshold, and when the main allele base does not exist at the reference position corresponding to the base position in the reference genome, the base type at the reference position is taken as the correct base type corresponding to the base position.
3. The double-ended sequencing-based base quality score correction method of claim 1, wherein the determining a significance signature from the candidate signatures that satisfies a preset condition in association with the estimated base quality score based on base information of bases in an overlapping region of each sequencing fragment comprises:
Calculating the significance value of each candidate feature based on the base information of the bases in the overlapping region of each sequencing fragment; wherein the significance value indicates the degree of association of the candidate feature with the estimated base quality score;
and determining the candidate features with the significance values lower than the preset significance values as significance features.
4. The method for correcting a base quality score based on double-ended sequencing according to claim 3, wherein the calculating the significance value of each candidate feature based on the base information of the bases in the overlapping region of each sequencing fragment comprises:
Taking each candidate feature as a feature to be checked, acquiring a target feature corresponding to the feature to be checked, traversing the bases in the overlapping area of all sequencing fragments, and counting the traversed bases based on the target feature corresponding to the feature to be checked to obtain effective information corresponding to the feature to be checked; wherein the effective information comprises a plurality of pieces of data, each piece of data comprising a target feature;
Taking the to-be-inspected feature and the estimated base quality score as base data set standards, performing base data set on a plurality of pieces of data in the effective information corresponding to the to-be-inspected feature, and calculating to obtain the original base quality score and error rate of each set in the effective information corresponding to the to-be-inspected feature;
And calculating a significance value between the feature to be checked and the estimated base quality score by using a statistical analysis method based on the data of the target feature, the original base quality score and the error rate of each group in the effective information corresponding to the feature to be checked.
5. The double-ended sequencing-based base quality score correction method of claim 1, wherein the significance signature comprises at least one of: the number of sequencing cycles in which the base is located, the content ratio of G base and C base in the sequencing fragment, the estimated base mass fraction of the base, the estimated base mass fraction corresponding to 2bp upstream of the base, the estimated base mass fraction corresponding to 1bp upstream of the base, the estimated base mass fraction corresponding to 2bp downstream of the base, the estimated base mass fraction corresponding to 1bp downstream of the base, the base type corresponding to 2bp upstream of the base, the base type corresponding to 1bp upstream of the base, the base type corresponding to 2bp downstream of the base, the base type corresponding to 1bp downstream of the base, the correct base type at the base position where the base is located.
6. The double-ended sequencing-based base quality score correction method according to any one of claims 1 to 5, wherein the calculating based on base information of bases in each base data set to obtain an original base quality score corresponding to each base data set comprises:
counting the number and total number of the error bases in each base data group, and calculating the error rate of each base data group based on the number and total number of the error bases in each base data group;
Calculating the quality fraction of the original base corresponding to each base data group based on the error rate of each base data group; the error rate is calculated as follows:
raw base quality score = -10×log 10 (error rate).
7. The method for correcting a base quality score based on double-ended sequencing according to any one of claims 1 to 5, wherein the fitting to obtain a base quality score correction model based on the raw base quality score corresponding to each base data set and the significance characteristics in the base information of each base data set comprises:
taking each base data set as one sample point to form a plurality of sample points;
And taking the significance characteristic as an independent variable and the original base quality score as an independent variable, and fitting by using a layered linear model based on a plurality of sample points to obtain a base quality score correction model.
8. A computer program product, comprising: comprising a computer program, characterized in that it when executed by a processor implements the base quality score correction method based on double-ended sequencing according to any one of claims 1 to 7.
9. A base quality score correction device based on double ended sequencing, characterized in that it comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
10. A computer readable storage medium storing a computer program, which when executed by a processor causes the processor to perform the steps of the method according to any one of claims 1 to 7.
CN202411116286.XA 2024-08-14 2024-08-14 Base quality fraction correction method based on double-ended sequencing, program product, equipment and storage medium Pending CN119007834A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411116286.XA CN119007834A (en) 2024-08-14 2024-08-14 Base quality fraction correction method based on double-ended sequencing, program product, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411116286.XA CN119007834A (en) 2024-08-14 2024-08-14 Base quality fraction correction method based on double-ended sequencing, program product, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN119007834A true CN119007834A (en) 2024-11-22

Family

ID=93492985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411116286.XA Pending CN119007834A (en) 2024-08-14 2024-08-14 Base quality fraction correction method based on double-ended sequencing, program product, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN119007834A (en)

Similar Documents

Publication Publication Date Title
US10991453B2 (en) Alignment of nucleic acid sequences containing homopolymers based on signal values measured for nucleotide incorporations
CN108573125B (en) Method for detecting genome copy number variation and device comprising same
KR102667912B1 (en) Systems and methods for determining microsatellite instability
WO2025060855A1 (en) Base calling method, method for constructing training set for base calling, gene sequencer, and medium
JP5171254B2 (en) Automated analysis of multiple probe target interaction patterns: pattern matching and allele identification
CN112349346A (en) Method for detecting structural variations in genomic regions
CN117315654B (en) End-to-end gene sequencing method and device, gene sequencer and storage medium
CN117274614A (en) Base recognition method, sequencer and medium based on fluorescence labeling dNTP gene sequencing
CN116189763A (en) Single sample copy number variation detection method based on second generation sequencing
CN111180013B (en) Device for detecting blood disease fusion gene
CN113862351A (en) Kit and method for identifying extracellular RNA biomarkers in body fluid samples
CN113096737B (en) Method and system for automatically analyzing pathogen type
JP5403563B2 (en) Gene identification method and expression analysis method in comprehensive fragment analysis
CN115461817A (en) Genome sequencing and detection techniques
CN119007834A (en) Base quality fraction correction method based on double-ended sequencing, program product, equipment and storage medium
CN114093417B (en) Method and device for identifying chromosomal arm heterozygosity loss
CN119132404A (en) Base recognition method based on double-end sequencing, sequencing data quality assessment method, program product and equipment
CN117672343B (en) Sequencing saturation evaluation method and device, equipment and storage medium
CN117523559B (en) Base recognition method and device, gene sequencer and storage medium
US20250210139A1 (en) Method and apparatus for determining quality parameter in basecall
CN119170097B (en) IKZF1 gene exon deletion identification system and method based on high-throughput transcriptome sequencing
CN117877025A (en) Three-dimensional base recognition method and device, gene sequencer and storage medium
US20230368863A1 (en) Multiplexed Screening Analysis of Peptides for Target Binding
CN117976042A (en) Method for determining read quality score, sequencing method and device
CN118230820A (en) Metagene sequencing data-based drug-resistant gene species source identification method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination