CN108959851B - Illumina high-throughput sequencing data error correction method - Google Patents

Illumina high-throughput sequencing data error correction method

Info

Publication number
CN108959851B
Authority
CN
China
Prior art keywords
illumina
sequencing
sequencing result
semiconductor
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810601099.9A
Other languages
Chinese (zh)
Other versions
CN108959851A (en)
Inventor
冯伟兴
贺波
陈多娇
王雪莹
南方伯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN201810601099.9A
Publication of CN108959851A
Application granted
Publication of CN108959851B
Active
Anticipated expiration

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides an error correction method for Illumina high-throughput sequencing data, comprising the following steps: 1. semiconductor sequencing is also performed on the Illumina sequencing sample, so that a semiconductor sequencing result is obtained alongside the Illumina sequencing result; 2. the position of each sequencing read in the reference genome is determined by sequence alignment, separately for the Illumina sequencing result and the semiconductor sequencing result; 3. the sequencing results at the same position are analyzed. The method addresses the base-type errors that occur in Illumina sequencing data. It exploits the fact that base types are rarely misidentified in semiconductor high-throughput sequencing, and corrects Illumina high-throughput sequencing data by logically analyzing the correspondence among the Illumina high-throughput sequencing result, the semiconductor high-throughput sequencing result, and the reference genome base sequence.

Description

Illumina high-throughput sequencing data error correction method
Technical Field
The invention relates to an Illumina high-throughput sequencing data error correction method, and belongs to the field of molecular biological information detection.
Background
With the rapid development of biological detection technology, second-generation sequencing platforms such as Illumina's Solexa, the 454 platform of 454 Life Sciences, and ABI's SOLiD are gradually being replaced by a new generation of sequencing platforms. These include the MiSeq, NextSeq, and HiSeq series from Illumina, the Ion Torrent, Ion Proton, and Ion PGM series from ABI, and the MinION from Oxford Nanopore Technologies, among others. Although the new generation of sequencing platforms makes the detection of biological information deeper, cheaper, and more efficient, the change in detection mechanism means that the original methods for interpreting high-throughput sequencing data must change accordingly.
Among the new-generation sequencing platforms, the Illumina platform is widely used to detect many kinds of molecular biological information because of its high sequencing depth and low error rate. However, because it identifies base types from differences in the color of the emitted (fluorescent) light, Illumina sequencing data contain a certain number of sequencing errors, which appear mainly as base-type errors, that is, one base read as another. Such errors cause differences between the Illumina sequencing result and the reference genome base sequence. At the same time, some differences between the Illumina sequencing result and the reference genome base sequence are normal, because individuals differ, and these true differences are a major focus of subsequent research. It is therefore important to determine whether a difference between the Illumina sequencing result and the reference genome base sequence is a true difference or one caused by an Illumina sequencing error.
Semiconductor high-throughput sequencing identifies base types through chemical reactions, so base types are rarely misidentified. On this basis, the invention proposes correcting Illumina high-throughput sequencing data by logically analyzing the correspondence among the Illumina high-throughput sequencing result, the semiconductor high-throughput sequencing result, and the reference genome base sequence.
Disclosure of Invention
The invention aims to provide an Illumina high-throughput sequencing data error correction method that can effectively identify and remove sequencing errors in data from that platform.
The purpose of the invention is achieved as follows. The method comprises the following steps:
step 1, semiconductor sequencing is also performed on the Illumina sequencing sample, so that a semiconductor sequencing result is obtained alongside the Illumina sequencing result;
step 2, the position of each sequencing read in the reference genome is determined by sequence alignment, separately for the Illumina sequencing result and the semiconductor sequencing result;
step 3, the sequencing results at the same position are judged as follows:
if the Illumina sequencing result is the same as the reference genome base sequence, the Illumina sequencing result is correct;
if the Illumina sequencing result differs from the reference genome base sequence, there are three cases:
1) the semiconductor sequencing result is the same as the Illumina sequencing result: the Illumina sequencing result is correct;
2) the semiconductor sequencing result is the same as the reference genome base sequence: the Illumina sequencing result is wrong;
3) the semiconductor sequencing result differs from both the Illumina sequencing result and the reference genome base sequence: the Illumina sequencing result is uncertain.
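The judgment in step 3 can be sketched as a small decision function. This is only an illustrative sketch, not the patent's implementation: the names `Verdict` and `judge_illumina_base` are hypothetical, and it assumes that a single base from each of the three sources is already available for the same reference position.

```python
from enum import Enum


class Verdict(Enum):
    CORRECT = "correct"
    ERROR = "error"
    UNCERTAIN = "uncertain"


def judge_illumina_base(illumina: str, semiconductor: str, reference: str) -> Verdict:
    """Judge one Illumina base call at a position covered by all three sources."""
    # Normalize case so that 'a' and 'A' compare equal (an added convenience).
    illumina, semiconductor, reference = (
        base.upper() for base in (illumina, semiconductor, reference)
    )
    # Illumina agrees with the reference: the Illumina result is correct.
    if illumina == reference:
        return Verdict.CORRECT
    # Case 1: the semiconductor result confirms the Illumina difference.
    if semiconductor == illumina:
        return Verdict.CORRECT
    # Case 2: the semiconductor result agrees with the reference instead.
    if semiconductor == reference:
        return Verdict.ERROR
    # Case 3: all three sources disagree.
    return Verdict.UNCERTAIN


if __name__ == "__main__":
    # Illumina reads G where the reference has A; the semiconductor read also has A.
    print(judge_illumina_base("G", "A", "A"))
```

Because the first check already handles agreement with the reference, the remaining three branches map one-to-one onto cases 1) to 3).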
Compared with the prior art, the invention has the following beneficial effects. As a new-generation high-throughput sequencing technology, Illumina sequencing is widely used to detect many kinds of molecular biological information. However, the platform still produces a certain number of sequencing errors, which appear mainly as base-type errors. The invention provides an Illumina high-throughput sequencing data error correction method to address this problem. The method exploits the fact that base types are rarely misidentified in semiconductor high-throughput sequencing, and corrects Illumina high-throughput sequencing data by logically analyzing the correspondence among the Illumina high-throughput sequencing result, the semiconductor high-throughput sequencing result, and the reference genome base sequence.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the position distribution of correctly sequenced differential bases in Illumina sequencing data;
FIG. 3 is a schematic diagram of the position distribution of erroneously sequenced differential bases in Illumina sequencing data.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
With reference to FIGS. 1 to 3: as a new-generation high-throughput sequencing technology, Illumina sequencing is widely used to detect many kinds of molecular biological information. However, because it identifies base types from differences in the color of the emitted (fluorescent) light, the Illumina sequencing result inevitably contains some base-type sequencing errors. Semiconductor high-throughput sequencing identifies base types through chemical reactions, so base types are rarely misidentified. On this basis, the invention proposes an Illumina high-throughput sequencing data error correction method. The method corrects Illumina high-throughput sequencing data by logically analyzing the correspondence among the Illumina high-throughput sequencing result, the semiconductor high-throughput sequencing result, and the reference genome base sequence.
The method of the invention comprises the following steps:
1. Semiconductor sequencing is also performed on the Illumina sequencing sample, so that a semiconductor sequencing result is obtained alongside the Illumina sequencing result;
2. The position of each sequencing read in the reference genome is determined by sequence alignment, separately for the Illumina sequencing result and the semiconductor sequencing result;
3. The sequencing results at the same position are logically analyzed as follows:
if the Illumina sequencing result is the same as the reference genome base sequence, the Illumina sequencing result is correct;
if the Illumina sequencing result differs from the reference genome base sequence, there are three cases:
1) the semiconductor sequencing result is the same as the Illumina sequencing result: the Illumina sequencing result is correct;
2) the semiconductor sequencing result is the same as the reference genome base sequence: the Illumina sequencing result is wrong;
3) the semiconductor sequencing result differs from both the Illumina sequencing result and the reference genome base sequence: the Illumina sequencing result is uncertain.
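As a rough illustration of how steps 2 and 3 might be applied to a whole read, the sketch below walks along one aligned Illumina read and tallies the outcome at each position. Everything here is an assumption made for illustration and is not taken from the patent: the function name `correct_read`, the gap-free alignment, the `semi_calls` dictionary (one semiconductor base per reference coordinate), the handling of positions without semiconductor coverage, the replacement of an erroneous base with the reference base, and leaving uncertain bases unchanged.

```python
from collections import Counter


def correct_read(read_seq, read_start, reference, semi_calls):
    """Classify each base of one aligned, gap-free Illumina read.

    read_start -- 0-based reference coordinate of the first read base
    reference  -- reference sequence as a string
    semi_calls -- dict: reference coordinate -> semiconductor base (hypothetical input)
    Returns (counts, corrected_sequence).
    """
    counts = Counter()
    corrected = []
    for offset, base in enumerate(read_seq):
        pos = read_start + offset
        ref_base = reference[pos]
        semi_base = semi_calls.get(pos)
        if base == ref_base:
            # Same as the reference: correct.
            counts["correct"] += 1
            corrected.append(base)
        elif semi_base is None:
            # Assumption: no semiconductor read covers this position, so no judgment.
            counts["no_semiconductor_coverage"] += 1
            corrected.append(base)
        elif semi_base == base:
            # Case 1: a true difference confirmed by the semiconductor result.
            counts["correct"] += 1
            corrected.append(base)
        elif semi_base == ref_base:
            # Case 2: Illumina error; assumption: restore the reference base.
            counts["error"] += 1
            corrected.append(ref_base)
        else:
            # Case 3: uncertain; assumption: leave the base unchanged.
            counts["uncertain"] += 1
            corrected.append(base)
    return counts, "".join(corrected)


if __name__ == "__main__":
    counts, fixed = correct_read("CGAAC", 1, "ACGTACGT", {3: "T"})
    print(dict(counts), fixed)
```

In practice the per-position bases would come from alignment files rather than plain strings and dictionaries; the toy inputs only keep the example self-contained.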
4. Experimental verification
We performed Illumina sequencing and semiconductor (Ion Torrent) sequencing on the same human experimental sample. For each platform, the position of each sequencing read in the reference genome was then determined by sequence alignment.
The Illumina sequencing data comprise 4592877 sequencing reads with a read length of 100 bases, of which 1007117 reads contain 1 base that differs from the reference genome base sequence. We analyzed these 1007117 differing bases. Using our proposed correction method, a total of 11597 base sequencing errors were found, accounting for 1.15% of the differing bases.
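The reported percentage can be checked directly from the counts given above. The variable names in this short sketch are chosen here for illustration only.

```python
# Counts reported in the experimental verification.
total_reads = 4_592_877
reads_with_one_difference = 1_007_117
errors_found = 11_597

# Share of differing bases judged to be Illumina sequencing errors.
error_fraction = errors_found / reads_with_one_difference
# Differing bases that were not judged to be errors.
remaining_differences = reads_with_one_difference - errors_found

print(f"{error_fraction:.2%}")   # 1.15%
print(remaining_differences)     # 995520
```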
We divided the 1007117 differing bases into two categories, correctly sequenced and incorrectly sequenced, and counted the positions of each category within the sequencing reads separately. Because a true differential base occurs independently of the sequencing read itself, the positions of correctly sequenced differential bases within the reads should follow a uniform distribution. By contrast, because of cumulative effects during sequencing, incorrectly sequenced differential bases are more likely to occur toward the end of a read. FIGS. 2 and 3 confirm both expectations, which also demonstrates the effectiveness of our proposed Illumina high-throughput sequencing data error correction method.
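A position tally of this kind could be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `position_profile` is hypothetical, positions are 0-based offsets within a 100-base read, and the positions in the usage example are made-up toy values, not the experimental data.

```python
def position_profile(positions, read_length=100, n_bins=10):
    """Return the fraction of differential bases falling in each read segment."""
    if read_length % n_bins != 0:
        raise ValueError("read_length must be divisible by n_bins")
    width = read_length // n_bins
    counts = [0] * n_bins
    for pos in positions:
        if not 0 <= pos < read_length:
            raise ValueError(f"position {pos} is outside the read")
        counts[pos // width] += 1
    total = len(positions)
    if total == 0:
        return [0.0] * n_bins
    return [count / total for count in counts]


if __name__ == "__main__":
    # Toy positions only: a roughly even spread versus a cluster near the read end.
    evenly_spread = [3, 14, 27, 38, 41, 55, 62, 76, 88, 97]
    end_heavy = [12, 71, 83, 85, 90, 92, 94, 96, 98, 99]
    print(position_profile(evenly_spread))
    print(position_profile(end_heavy))
```

A flat profile is what a uniform distribution would look like, while a profile that rises toward the last segments would correspond to errors concentrated at the end of the read.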

Claims (1)

1. An Illumina high-throughput sequencing data error correction method, characterized in that the method comprises the following steps:
step 1, semiconductor sequencing is also performed on the Illumina sequencing sample, so that a semiconductor sequencing result is obtained alongside the Illumina sequencing result;
step 2, the position of each sequencing read in the reference genome is determined by sequence alignment, separately for the Illumina sequencing result and the semiconductor sequencing result;
step 3, the sequencing results at the same position are judged as follows:
if the Illumina sequencing result is the same as the reference genome base sequence, the Illumina sequencing result is correct;
if the Illumina sequencing result differs from the reference genome base sequence, there are three cases:
1) the semiconductor sequencing result is the same as the Illumina sequencing result: the Illumina sequencing result is correct;
2) the semiconductor sequencing result is the same as the reference genome base sequence: the Illumina sequencing result is wrong;
3) the semiconductor sequencing result differs from both the Illumina sequencing result and the reference genome base sequence: the Illumina sequencing result is uncertain.
CN201810601099.9A 2018-06-12 2018-06-12 Illumina high-throughput sequencing data error correction method Active CN108959851B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810601099.9A CN108959851B (en) 2018-06-12 2018-06-12 Illumina high-throughput sequencing data error correction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810601099.9A CN108959851B (en) 2018-06-12 2018-06-12 Illumina high-throughput sequencing data error correction method

Publications (2)

Publication Number Publication Date
CN108959851A (en) 2018-12-07
CN108959851B (en) 2022-03-18

Family

ID=64488394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810601099.9A Active CN108959851B (en) 2018-06-12 2018-06-12 Illumina high-throughput sequencing data error correction method

Country Status (1)

Country Link
CN (1) CN108959851B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114420214A (en) * 2022-01-28 2022-04-29 赛纳生物科技(北京)有限公司 Quality evaluation method and screening method of nucleic acid sequencing data

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101886114A (en) * 2009-05-14 2010-11-17 上海聚类生物科技有限公司 Method for analyzing high-throughput sequencing data based on RMI (Read Mass Index)
CN102622534A (en) * 2012-04-11 2012-08-01 哈尔滨工程大学 A DNA high-throughput sequencing data correction method for gene expression detection
CN105886605A (en) * 2015-03-05 2016-08-24 南京市妇幼保健院 Amplification primer for detecting PKD2 gene mutation and detection method
CN105925675A (en) * 2016-04-26 2016-09-07 序康医疗科技(苏州)有限公司 Method for amplifying dna
CN106156543A (en) * 2016-06-22 2016-11-23 厦门艾德生物医药科技股份有限公司 A kind of tumor ctDNA information statistical method
CN107229842A (en) * 2017-06-02 2017-10-03 肖传乐 A kind of three generations's sequencing sequence bearing calibration based on Local map
CN107849600A (en) * 2015-06-09 2018-03-27 生命技术公司 For the method for molecular labeling, system, composition, kit, device and computer-readable media

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9758780B2 (en) * 2014-06-02 2017-09-12 Drexel University Whole genome mapping by DNA sequencing with linked-paired-end library

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Denoising DNA deep sequencing data— high-throughput sequencing errors and their correction;David Laehnemann;《Briefings in Bioinformatics》;20161231;第17卷(第1期);第154–179页 *
Ion_torrent多聚碱基测序分析方法研究;宋锋飞;《万方数据知识服务平台》;20160505;第1-64页 *
高通量测序数据误差分析方法研究;董彦生;《中国优秀硕士学位论文全文数据基础科学辑》;20180315(第3期);第32-44页 *

Also Published As

Publication number Publication date
CN108959851A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
Almeida et al. Bioinformatics tools to assess metagenomic data for applied microbiology
Bragg et al. Metagenomics using next-generation sequencing
EP3143537B1 (en) Rare variant calls in ultra-deep sequencing
Sinclair et al. Microbial community composition and diversity via 16S rRNA gene amplicons: evaluating the illumina platform
CN109346130B (en) A method for obtaining microhaplotypes and their typing directly from whole-genome resequencing data
CN110189796A (en) A sheep whole genome resequencing analysis method
CN110211633B (en) MGMT gene promoter methylation detection method, sequencing data processing method and processing device
CN107075565B (en) Method and device for typing individual single nucleotide polymorphism sites
Brozynska et al. Direct chloroplast sequencing: comparison of sequencing platforms and analysis tools for whole chloroplast barcoding
Ahmed et al. Identifying A-and P-site locations on ribosome-protected mRNA fragments using Integer Programming
CN109920480B (en) Method and device for correcting high-throughput sequencing data
CN112513292A (en) Method and device for detecting homologous sequence based on high-throughput sequencing
Owens et al. A novel post hoc method for detecting index switching finds no evidence for increased switching on the Illumina HiSeq X
CN117253539B (en) Methods and systems for detecting sample contamination in high-throughput sequencing based on germline mutations
Birzu et al. Hybridization breaks species barriers in long-term coevolution of a cyanobacterial population
CN108959851B (en) Illumina high-throughput sequencing data error correction method
KR20210105725A (en) A method and apparatus for determining true positive variation in nucleic acid sequencing analysis
CN110942806A (en) Blood type genotyping method and device and storage medium
CN118038979B (en) Methods for detecting mutation patterns and transposition imprints of transposon insertion into human genome
CN102154452A (en) Method and system for identifying cis-regulatory action and trans-regulatory action
CN114093417B (en) Method and device for identifying chromosomal arm heterozygosity loss
JP2012235723A (en) Large-scale base sequence analysis method, program, and apparatus
Benaglio et al. Ultra high throughput sequencing in human DNA variation detection: a comparative study on the NDUFA3-PRPF31 region
KR100450816B1 (en) Selection method of probe set for genotyping
CN111826428B (en) Microsatellite instability detection method and system based on second-generation sequencing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant