CN108959851B - Illumina high-throughput sequencing data error correction method - Google Patents
- Publication number
- CN108959851B
- Authority
- CN
- China
- Prior art keywords
- illumina
- sequencing
- sequencing result
- semiconductor
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention provides an Illumina high-throughput sequencing data error correction method comprising the following steps: 1. semiconductor sequencing is performed on the Illumina sequencing sample in parallel, so that a semiconductor sequencing result is obtained for the same sample alongside its Illumina sequencing result; 2. the position of each sequencing read in a reference genome is determined by sequence alignment of the Illumina sequencing result and of the semiconductor sequencing result, respectively; 3. the sequencing results at the same position are analyzed. The method exploits the fact that base types are not easily miscalled in semiconductor high-throughput sequencing, and corrects errors in Illumina high-throughput sequencing data by logically analyzing the correspondence among the Illumina high-throughput sequencing result, the semiconductor high-throughput sequencing result, and the reference genome base sequence.
Description
Technical Field
The invention relates to an Illumina high-throughput sequencing data error correction method, and belongs to the field of molecular biological information detection.
Background
With the rapid development of biological detection technology, second-generation sequencing platforms such as Solexa from Illumina, 454 from 454 Life Sciences, and SOLiD from ABI are gradually being replaced by a new generation of sequencing platforms. These include the MiSeq, NextSeq, and HiSeq series from Illumina; the Ion Torrent, Ion Proton, and Ion PGM series from ABI; and the MinION from Oxford Nanopore Technologies, among others. Although the new generation of sequencing platforms makes the detection of biological information deeper, cheaper, and more efficient, the change in detection mechanism means that the original methods for interpreting high-throughput sequencing data must change accordingly.
Among the new generation of sequencing platforms, the Illumina platform is widely used to detect many kinds of molecular biological information because of its higher sequencing depth, lower error rate, and similar advantages. However, because Illumina sequencing identifies base types from differences in the color of light, its data contain a certain number of sequencing errors, and these errors appear predominantly as base type errors (one base reported as another). Such errors produce differences between the Illumina sequencing result and the reference genome base sequence. At the same time, some differences between the Illumina sequencing result and the reference genome base sequence are normal, because individuals differ, and these true differences are a major focus of subsequent research. It is therefore important to distinguish whether a difference between the Illumina sequencing result and the reference genome base sequence is a true difference or one caused by an Illumina sequencing error.
Semiconductor high-throughput sequencing identifies base types through chemical reactions, so base types are not easily misidentified. On this basis, the invention proposes correcting errors in Illumina high-throughput sequencing data by logically analyzing the correspondence among the Illumina high-throughput sequencing result, the semiconductor high-throughput sequencing result, and the reference genome base sequence.
Disclosure of Invention
The invention aims to provide an Illumina high-throughput sequencing data error correction method that can effectively identify and remove sequencing errors in the sequencing data of the Illumina platform.
The purpose of the invention is achieved by a method comprising the following steps:
step 1, semiconductor sequencing is performed on the Illumina sequencing sample in parallel, so that a semiconductor sequencing result for the sample is obtained alongside its Illumina sequencing result;
step 2, the position of each sequencing read in a reference genome is determined by sequence alignment of the Illumina sequencing result and of the semiconductor sequencing result;
step 3, the sequencing results at the same position are judged as follows:
if the Illumina sequencing result is the same as the reference genome base sequence, the Illumina sequencing result is correct;
if the Illumina sequencing result differs from the reference genome base sequence, there are three cases:
1) the semiconductor sequencing result is the same as the Illumina sequencing result: the Illumina sequencing result is correct;
2) the semiconductor sequencing result is the same as the reference genome base sequence: the Illumina sequencing result is wrong;
3) the semiconductor sequencing result differs from both the Illumina sequencing result and the reference genome base sequence: the Illumina sequencing result is uncertain.
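The judgment in step 3 can be sketched for a single reference position as follows. This is a minimal illustration rather than the patented implementation: the names `Verdict` and `judge_illumina_base` are hypothetical, and the sketch assumes each platform contributes exactly one base call at the position.

```python
from enum import Enum


class Verdict(Enum):
    """Possible outcomes for an Illumina base call (hypothetical labels)."""
    CORRECT = "correct"
    ERROR = "error"
    UNCERTAIN = "uncertain"


def judge_illumina_base(illumina: str, semiconductor: str, reference: str) -> Verdict:
    """Apply the step 3 rules to the three bases observed at one position."""
    illumina, semiconductor, reference = (
        base.upper() for base in (illumina, semiconductor, reference)
    )
    if illumina == reference:
        # No difference from the reference: the Illumina call is accepted.
        return Verdict.CORRECT
    if semiconductor == illumina:
        # Case 1: both platforms agree, so the difference is treated as real.
        return Verdict.CORRECT
    if semiconductor == reference:
        # Case 2: only the Illumina call deviates, so it is treated as an error.
        return Verdict.ERROR
    # Case 3: all three disagree, so no decision can be made.
    return Verdict.UNCERTAIN
```

For example, `judge_illumina_base("T", "A", "A")` falls under case 2 and returns `Verdict.ERROR`, while `judge_illumina_base("T", "T", "A")` falls under case 1 and returns `Verdict.CORRECT`.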
Compared with the prior art, the invention has the following beneficial effects. As a new-generation high-throughput sequencing technology, Illumina sequencing has been widely applied to the detection of many kinds of molecular biological information. However, the platform also produces a certain number of sequencing errors, which appear mainly as base type errors. The invention provides an Illumina high-throughput sequencing data error correction method that addresses this problem. The method exploits the fact that base types are not easily miscalled in semiconductor high-throughput sequencing, and corrects errors in Illumina high-throughput sequencing data by logically analyzing the correspondence among the Illumina high-throughput sequencing result, the semiconductor high-throughput sequencing result, and the reference genome base sequence.
Drawings
FIG. 1 is a flow chart of the method of the invention;
FIG. 2 is a schematic diagram of the distribution of read positions of correctly sequenced differing bases in the Illumina sequencing data;
FIG. 3 is a schematic diagram of the distribution of read positions of incorrectly sequenced differing bases in the Illumina sequencing data.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
With reference to FIGS. 1 to 3: as a new-generation high-throughput sequencing technology, Illumina sequencing has been widely applied to the detection of many kinds of molecular biological information. However, because it identifies base types from differences in the color of light, the Illumina sequencing result inevitably contains a certain number of base type errors. Semiconductor high-throughput sequencing identifies base types through chemical reactions, so base types are not easily misidentified. On this basis, the invention proposes an Illumina high-throughput sequencing data error correction method. The method corrects errors in Illumina high-throughput sequencing data by logically analyzing the correspondence among the Illumina high-throughput sequencing result, the semiconductor high-throughput sequencing result, and the reference genome base sequence.
The method of the invention comprises the following steps:
1. Semiconductor sequencing is performed on the Illumina sequencing sample in parallel, so that a semiconductor sequencing result for the sample is obtained alongside its Illumina sequencing result;
2. The position of each sequencing read in a reference genome is determined by sequence alignment of the Illumina sequencing result and of the semiconductor sequencing result, respectively;
3. The sequencing results at the same position are logically analyzed as follows:
if the Illumina sequencing result is the same as the reference genome base sequence, the Illumina sequencing result is correct;
if the Illumina sequencing result differs from the reference genome base sequence, there are three cases:
1) the semiconductor sequencing result is the same as the Illumina sequencing result: the Illumina sequencing result is correct;
2) the semiconductor sequencing result is the same as the reference genome base sequence: the Illumina sequencing result is wrong;
3) the semiconductor sequencing result differs from both the Illumina sequencing result and the reference genome base sequence: the Illumina sequencing result is uncertain.
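Before the per-position analysis of step 3 can run, the aligned reads from step 2 have to be brought onto common reference coordinates. The sketch below shows one way this hand-off might look. It is illustrative only: the function names `pile_up` and `shared_positions` are hypothetical, each alignment is assumed to be an ungapped `(start, sequence)` pair with a 0-based start, and the sketch only tallies the bases seen at each position, since the description does not specify how several reads covering the same position are combined.

```python
from collections import Counter, defaultdict


def pile_up(aligned_reads):
    """Tally, for each reference position, the bases that reads place there.

    aligned_reads: iterable of (start, sequence) pairs, where start is the
    0-based reference position of the read's first base. Ungapped alignment
    is assumed, so base i of a read lands on position start + i.
    """
    calls = defaultdict(Counter)
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence.upper()):
            calls[start + offset][base] += 1
    return calls


def shared_positions(illumina_reads, semiconductor_reads):
    """Return {position: (illumina_tally, semiconductor_tally)} for every
    reference position covered by reads from both platforms."""
    illumina_calls = pile_up(illumina_reads)
    semiconductor_calls = pile_up(semiconductor_reads)
    common = sorted(illumina_calls.keys() & semiconductor_calls.keys())
    return {pos: (illumina_calls[pos], semiconductor_calls[pos]) for pos in common}
```

Only positions covered by both platforms are returned, because the three-way comparison needs a semiconductor observation to judge an Illumina base that differs from the reference.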
4. Experimental verification
We performed Illumina sequencing and semiconductor (Ion Torrent) sequencing on the same human experimental sample. For the results from both platforms, the position of each sequencing read in the reference genome was then determined by sequence alignment.
The Illumina sequencing data comprise 4592877 sequencing reads with a read length of 100 bases, of which 1007117 reads contain exactly 1 base that differs from the reference genome base sequence. We analyzed these 1007117 differing bases. Using the proposed error correction method, a total of 11597 base sequencing errors were found, accounting for 1.15% of the differing bases.
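The reported percentage can be checked directly from the counts given above. The variable names in this short check are illustrative; the three counts are taken from the text.

```python
total_reads = 4_592_877       # Illumina reads, 100 bases each
differing_bases = 1_007_117   # reads with exactly one base differing from the reference
flagged_errors = 11_597       # differing bases judged to be sequencing errors

# Share of the differing bases that the method attributes to sequencing error.
error_fraction = flagged_errors / differing_bases
print(f"errors among differing bases: {error_fraction:.2%}")  # prints 1.15%

# Share of all reads that carry a single differing base, for context.
differing_read_fraction = differing_bases / total_reads
print(f"reads with one differing base: {differing_read_fraction:.2%}")
```

The 1.15% figure is therefore relative to the 1007117 differing bases, not to all sequenced bases.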
We divided the 1007117 differing bases into two categories, correctly sequenced and incorrectly sequenced, and counted their positions within the sequencing reads separately. Because whether a true difference is present does not depend on the sequencing read itself, the positions of correctly sequenced differing bases within the reads should follow a uniform distribution. By contrast, because sequencing errors accumulate over the course of a read, incorrectly sequenced differing bases are more likely to occur toward the end of the read. FIGS. 2 and 3 confirm this, which also demonstrates the effectiveness of the proposed Illumina high-throughput sequencing data error correction method.
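The position counts behind FIGS. 2 and 3 could be produced along the following lines. This is a sketch under stated assumptions: positions are 1-based offsets within a 100-base read, the function names `position_histogram` and `back_half_fraction` are hypothetical, and the back-half fraction is a simple summary chosen here for illustration, not a statistic used in the patent.

```python
from collections import Counter


def position_histogram(positions, read_length=100):
    """Count how many differing bases fall at each 1-based read position."""
    counts = Counter(positions)
    return [counts.get(i, 0) for i in range(1, read_length + 1)]


def back_half_fraction(histogram):
    """Fraction of all counted bases that lie in the second half of the read."""
    total = sum(histogram)
    if total == 0:
        raise ValueError("histogram is empty")
    half = len(histogram) // 2
    return sum(histogram[half:]) / total
```

Under the reasoning above, correctly sequenced differing bases should give a back-half fraction close to 0.5, while incorrectly sequenced ones should give a value noticeably above 0.5.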
Claims (1)
1. An Illumina high-throughput sequencing data error correction method, characterized by comprising the following steps:
step 1, semiconductor sequencing is performed on the Illumina sequencing sample in parallel, so that a semiconductor sequencing result for the sample is obtained alongside its Illumina sequencing result;
step 2, the position of each sequencing read in a reference genome is determined by sequence alignment of the Illumina sequencing result and of the semiconductor sequencing result;
step 3, the sequencing results at the same position are judged as follows:
if the Illumina sequencing result is the same as the reference genome base sequence, the Illumina sequencing result is correct;
if the Illumina sequencing result differs from the reference genome base sequence, there are three cases:
1) the semiconductor sequencing result is the same as the Illumina sequencing result: the Illumina sequencing result is correct;
2) the semiconductor sequencing result is the same as the reference genome base sequence: the Illumina sequencing result is wrong;
3) the semiconductor sequencing result differs from both the Illumina sequencing result and the reference genome base sequence: the Illumina sequencing result is uncertain.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810601099.9A | 2018-06-12 | 2018-06-12 | Illumina high-throughput sequencing data error correction method |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810601099.9A | 2018-06-12 | 2018-06-12 | Illumina high-throughput sequencing data error correction method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN108959851A | 2018-12-07 |
| CN108959851B | 2022-03-18 |
Family
ID=64488394
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201810601099.9A (Active) | Illumina high-throughput sequencing data error correction method | 2018-06-12 | 2018-06-12 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN108959851B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114420214A (en) * | 2022-01-28 | 2022-04-29 | 赛纳生物科技(北京)有限公司 | Quality evaluation method and screening method of nucleic acid sequencing data |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101886114A (en) * | 2009-05-14 | 2010-11-17 | 上海聚类生物科技有限公司 | Method for analyzing high-throughput sequencing data based on RMI (Read Mass Index) |
| CN102622534A (en) * | 2012-04-11 | 2012-08-01 | 哈尔滨工程大学 | A DNA high-throughput sequencing data correction method for gene expression detection |
| CN105886605A (en) * | 2015-03-05 | 2016-08-24 | 南京市妇幼保健院 | Amplification primer for detecting PKD2 gene mutation and detection method |
| CN105925675A (en) * | 2016-04-26 | 2016-09-07 | 序康医疗科技(苏州)有限公司 | Method for amplifying dna |
| CN106156543A (en) * | 2016-06-22 | 2016-11-23 | 厦门艾德生物医药科技股份有限公司 | A statistical method for tumor ctDNA information |
| CN107229842A (en) * | 2017-06-02 | 2017-10-03 | 肖传乐 | A third-generation sequencing read correction method based on local graphs |
| CN107849600A (en) * | 2015-06-09 | 2018-03-27 | 生命技术公司 | Methods, systems, compositions, kits, apparatus, and computer-readable media for molecular tagging |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9758780B2 (en) * | 2014-06-02 | 2017-09-12 | Drexel University | Whole genome mapping by DNA sequencing with linked-paired-end library |
Non-Patent Citations (3)
| Title |
|---|
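| Denoising DNA deep sequencing data: high-throughput sequencing errors and their correction; David Laehnemann; Briefings in Bioinformatics; 2016-12-31; Vol. 17 (No. 1); pp. 154-179 * |
| Research on analysis methods for Ion Torrent homopolymer sequencing; Song Fengfei (宋锋飞); Wanfang Data Knowledge Service Platform; 2016-05-05; pp. 1-64 * |
| Research on error analysis methods for high-throughput sequencing data; Dong Yansheng (董彦生); China Master's Theses Full-text Database, Basic Sciences; 2018-03-15 (No. 3); pp. 32-44 * |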
Also Published As
| Publication number | Publication date |
|---|---|
| CN108959851A (en) | 2018-12-07 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||