CN108959851B

CN108959851B - Illumina high-throughput sequencing data error correction method

Info

Publication number: CN108959851B
Application number: CN201810601099.9A
Authority: CN
Inventors: 冯伟兴; 贺波; 陈多娇; 王雪莹; 南方伯
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2018-06-12
Filing date: 2018-06-12
Publication date: 2022-03-18
Anticipated expiration: 2038-06-12
Also published as: CN108959851A

Abstract

The invention provides an error correction method for Illumina high-throughput sequencing data, comprising the following steps: 1. semiconductor sequencing is performed on the Illumina sequencing sample at the same time, so that a semiconductor sequencing result is obtained alongside the Illumina sequencing result for the sample; 2. the position of each sequencing read in a reference genome is determined by sequence alignment, separately for the Illumina sequencing result and the semiconductor sequencing result; 3. the sequencing results at the same position are analyzed. The method exploits the fact that base types are not easily miscalled in semiconductor high-throughput sequencing, and corrects Illumina high-throughput sequencing data by logically analyzing the correspondence among the Illumina high-throughput sequencing result, the semiconductor high-throughput sequencing result and the reference genome base sequence.

Description

Illumina high-throughput sequencing data error correction method

Technical Field

The invention relates to an error correction method for Illumina high-throughput sequencing data, and belongs to the field of molecular biological information detection.

Background

With the rapid development of biological detection technology, second-generation sequencing platforms such as Solexa from Illumina, 454 from 454 Life Sciences and SOLiD from ABI are gradually being replaced by a new generation of sequencing platforms. These include the MiSeq, NextSeq and HiSeq series from Illumina, the Ion Torrent, Ion Proton and Ion PGM series from ABI, and MinION from Oxford Nanopore Technologies, among others. Although the new generation of sequencing platforms makes the detection of biological information deeper, cheaper and more efficient, the change in detection mechanism means that the existing methods for interpreting high-throughput sequencing data have to change accordingly.

Among the new generation of sequencing platforms, the Illumina platform is widely used for detecting various kinds of molecular biological information because of its higher sequencing depth, lower error rate and similar advantages. However, because it identifies base types from differences in the color of light, Illumina sequencing data contain a certain number of sequencing errors, and these errors appear predominantly as base type errors. Such errors produce differences between the Illumina sequencing result and the reference genome base sequence. At the same time, some differences between the Illumina sequencing result and the reference genome base sequence are normal, because individuals differ, and these differences are an important focus of subsequent research. It is therefore important to distinguish whether a difference between the Illumina sequencing result and the reference genome base sequence is a true difference or one caused by an Illumina sequencing error.

Semiconductor high-throughput sequencing identifies base types through chemical reactions, so base types are not easily miscalled. On this basis, the invention proposes correcting Illumina high-throughput sequencing data by logically analyzing the correspondence among the Illumina high-throughput sequencing result, the semiconductor high-throughput sequencing result and the reference genome base sequence.

Disclosure of Invention

The invention aims to provide an error correction method for Illumina high-throughput sequencing data that can effectively identify and remove sequencing errors in the data produced by that platform.

The purpose of the invention is realized as follows: the method comprises the following steps:

step 1, performing semiconductor sequencing on the Illumina sequencing sample at the same time, so that a semiconductor sequencing result for the sample is obtained alongside its Illumina sequencing result;

step 2, determining the position of each sequencing read in a reference genome by sequence alignment, separately for the Illumina sequencing result and the semiconductor sequencing result;

step 3, judging the sequencing results at the same position as follows:

if the Illumina sequencing result is the same as the reference genome base sequence, the Illumina sequencing result is correct;

if the Illumina sequencing result differs from the reference genome base sequence, there are three cases:

1) the semiconductor sequencing result is the same as the Illumina sequencing result, in which case the Illumina sequencing result is correct;

2) the semiconductor sequencing result is the same as the reference genome base sequence, in which case the Illumina sequencing result is wrong;

3) the semiconductor sequencing result differs from both the Illumina sequencing result and the reference genome base sequence, in which case the Illumina sequencing result is uncertain.
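The judgment rules of step 3 can be written as a small decision function. The following is a minimal sketch, not the inventors' implementation: the names `Verdict` and `judge_base` are hypothetical, and it assumes that a single base call from each source is already available for the position being judged.

```python
from enum import Enum


class Verdict(Enum):
    CORRECT = "correct"
    ERROR = "error"
    UNCERTAIN = "uncertain"


def judge_base(illumina: str, semiconductor: str, reference: str) -> Verdict:
    """Judge one Illumina base call at a position shared by all three sources.

    Hypothetical helper: the names and the one-base-per-source input
    are assumptions made for illustration.
    """
    if illumina == reference:
        # Illumina agrees with the reference genome.
        return Verdict.CORRECT
    if semiconductor == illumina:
        # Case 1: both platforms report the same non-reference base.
        return Verdict.CORRECT
    if semiconductor == reference:
        # Case 2: semiconductor sequencing supports the reference base.
        return Verdict.ERROR
    # Case 3: the three bases are all different.
    return Verdict.UNCERTAIN
```

Under these assumptions, a call such as `judge_base("A", "T", "T")` falls under case 2 and is reported as an Illumina error, while `judge_base("A", "A", "T")` falls under case 1 and is kept as a true difference.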

Compared with the prior art, the invention has the following beneficial effects. As a new generation of high-throughput sequencing technology, Illumina sequencing has been widely applied to the detection of various kinds of molecular biological information. However, the platform also produces a certain number of sequencing errors, which appear mainly as base type errors. The invention provides an error correction method for Illumina high-throughput sequencing data that addresses this problem. The method exploits the fact that base types are not easily miscalled in semiconductor high-throughput sequencing, and corrects Illumina high-throughput sequencing data by logically analyzing the correspondence among the Illumina high-throughput sequencing result, the semiconductor high-throughput sequencing result and the reference genome base sequence.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic diagram of the position distribution of correctly sequenced differing bases in the Illumina sequencing data;

FIG. 3 is a schematic diagram of the position distribution of incorrectly sequenced differing bases in the Illumina sequencing data.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

With reference to FIGS. 1 to 3, Illumina sequencing, as a new generation of high-throughput sequencing technology, has been widely applied to the detection of various kinds of molecular biological information. However, because it identifies base types from differences in the color of light, the Illumina sequencing result inevitably contains some base type sequencing errors. Semiconductor high-throughput sequencing identifies base types through chemical reactions, so base types are not easily miscalled. On this basis, the invention provides an error correction method for Illumina high-throughput sequencing data. The method corrects Illumina high-throughput sequencing data by logically analyzing the correspondence among the Illumina high-throughput sequencing result, the semiconductor high-throughput sequencing result and the reference genome base sequence.

The method of the invention comprises the following steps:

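1. semiconductor sequencing is performed on the Illumina sequencing sample at the same time, so that a semiconductor sequencing result is obtained alongside the Illumina sequencing result for the sample;

2. the position of each sequencing read in a reference genome is determined by sequence alignment, separately for the Illumina sequencing result and the semiconductor sequencing result;

3. the sequencing results at the same position are logically analyzed as follows:

if the Illumina sequencing result is the same as the reference genome base sequence, the Illumina sequencing result is correct;

if the Illumina sequencing result differs from the reference genome base sequence, there are three cases:

1) the semiconductor sequencing result is the same as the Illumina sequencing result, in which case the Illumina sequencing result is correct;

2) the semiconductor sequencing result is the same as the reference genome base sequence, in which case the Illumina sequencing result is wrong;

3) the semiconductor sequencing result differs from both the Illumina sequencing result and the reference genome base sequence, in which case the Illumina sequencing result is uncertain.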
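The three steps of the method can be sketched end to end once aligned reads are available. The sketch below rests on several assumptions that the text does not state: reads are given as hypothetical `(start, sequence)` pairs with 0-based start positions and gap-free alignments, the semiconductor base at a position is taken as the majority vote of the semiconductor reads covering it, and a position with no semiconductor coverage (or a tied vote) is reported as uncertain. The function names are illustrative only.

```python
from collections import Counter, defaultdict


def pileup(aligned_reads):
    """Count the bases observed at each reference position.

    aligned_reads: iterable of (start, sequence) pairs, 0-based start,
    gap-free alignment (an assumption of this sketch).
    """
    columns = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    return columns


def majority_base(counter):
    """Return the most frequent base, or None if the column is empty or tied."""
    ranked = counter.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def classify_mismatches(reference, illumina_reads, semiconductor_reads):
    """List (read_index, read_offset, verdict) for every Illumina base
    that differs from the reference base at its aligned position."""
    semi = pileup(semiconductor_reads)
    results = []
    for idx, (start, seq) in enumerate(illumina_reads):
        for offset, base in enumerate(seq):
            pos = start + offset
            ref_base = reference[pos]
            if base == ref_base:
                continue  # same as the reference: judged correct, not listed
            semi_base = majority_base(semi.get(pos, Counter()))
            if semi_base is None:
                verdict = "uncertain"  # assumption: no usable semiconductor evidence
            elif semi_base == base:
                verdict = "correct"    # case 1: true difference
            elif semi_base == ref_base:
                verdict = "error"      # case 2: Illumina sequencing error
            else:
                verdict = "uncertain"  # case 3: all three differ
            results.append((idx, offset, verdict))
    return results
```

Keeping the read offset in each result is a deliberate choice: it is what a later analysis of where differing bases fall within the reads would need.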

4. Experimental verification

We performed Illumina sequencing and semiconductor (Ion Torrent) sequencing on the same human experimental sample. For each platform, the position of every sequencing read in the reference genome was then determined by sequence alignment.

The Illumina sequencing data comprise 4,592,877 sequencing reads with a read length of 100 bases, of which 1,007,117 reads contain one base that differs from the reference genome base sequence. We analyzed these 1,007,117 differing bases. Using the proposed error correction method, a total of 11,597 base sequencing errors were found, accounting for 1.15% of the differing bases.
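The reported percentage can be checked directly from the counts given in this paragraph. The short calculation below uses only those three numbers; the variable names are illustrative.

```python
# Counts as reported in the experimental verification.
total_reads = 4_592_877
reads_with_one_difference = 1_007_117
errors_found = 11_597

# Share of the analyzed differing bases that were judged to be sequencing errors.
error_share = errors_found / reads_with_one_difference

# Differing bases that were not flagged as sequencing errors.
not_flagged = reads_with_one_difference - errors_found

print(f"error share: {error_share:.2%}")
print(f"not flagged as errors: {not_flagged}")
```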

We divided the 1,007,117 differing bases into two categories, correctly sequenced and incorrectly sequenced, and counted the positions of each category within the sequencing reads separately. Because whether a true difference is present does not depend on the sequencing read itself, the positions of correctly sequenced differing bases within the reads should follow a uniform distribution. By contrast, because of the cumulative effect, incorrectly sequenced differing bases are more likely to occur toward the end of a sequencing read. FIGS. 2 and 3 confirm this, which also supports the effectiveness of the proposed error correction method for Illumina high-throughput sequencing data.
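The position analysis described here can be sketched as follows. The text reports only the visual comparison in the figures, so the two summary measures below (the fraction of bases in the second half of the read, and a Pearson chi-square statistic against a uniform expectation) are assumptions chosen for illustration, as are the function names. Offsets are taken to be 0-based positions within a read of 100 bases.

```python
from collections import Counter


def position_histogram(offsets, read_length=100):
    """Count how many differing bases fall at each 0-based read offset."""
    counts = Counter(offsets)
    return [counts.get(i, 0) for i in range(read_length)]


def back_half_fraction(hist):
    """Fraction of differing bases located in the second half of the read."""
    total = sum(hist)
    if total == 0:
        raise ValueError("histogram is empty")
    half = len(hist) // 2
    return sum(hist[half:]) / total


def chi_square_uniform(hist):
    """Pearson chi-square statistic against a uniform distribution over offsets."""
    total = sum(hist)
    if total == 0:
        raise ValueError("histogram is empty")
    expected = total / len(hist)
    return sum((observed - expected) ** 2 / expected for observed in hist)
```

If the reasoning in the paragraph holds, the correctly sequenced category would give a back-half fraction near 0.5 and a small statistic, while the incorrectly sequenced category would give a fraction above 0.5 and a larger statistic.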

Claims

1. An Illumina high-throughput sequencing data error correction method, characterized in that the method comprises the following steps:

step 3, judging the sequencing results at the same position as follows: