CN117711488B

CN117711488B - Gene haplotype detection method based on long-read sequencing and application thereof

Info

Publication number: CN117711488B
Application number: CN202311620961.8A
Authority: CN
Inventors: 黄铨飞; 苏恺婵; 刘情; 景丽芳; 刘腾飞; 李蓓蓓; 庾晓康; 王康丽
Original assignee: CapitalBio Genomics Co Ltd
Current assignee: CapitalBio Genomics Co Ltd
Priority date: 2023-11-29
Filing date: 2023-11-29
Publication date: 2024-07-02
Anticipated expiration: 2043-11-29
Also published as: CN117711488A

Abstract

The invention discloses a gene haplotype detection method based on long-read sequencing and an application thereof. The CYP2D6 gene sequence is amplified with specific primers and then amplified again with specific index tags so that different samples are labeled. The pooled sample libraries carrying different tags are sequenced on a Nanopore platform, and the raw sequencer output is demultiplexed by index tag with the index mismatch tolerance set to 1, which increases the proportion of reads that are successfully assigned. The linkage between point mutations is then inferred from the long reads, haplotypes are predicted from that linkage, and the haplotypes are finally typed accurately. Compared with existing methods, the method corrects the point mutations called from Nanopore sequencing data, which substantially improves the F1 value and gives more accurate and effective CYP2D6 genotyping.

Description

Gene haplotype detection method based on long-read sequencing and application thereof

Technical Field

The invention relates to the technical field of biology, in particular to a gene haplotype detection method based on long-read sequencing and an application thereof.

Background

Studies have found that about 25% of drugs are metabolized by a single enzyme, cytochrome P450 2D6 (CYP2D6), which is highly expressed in the liver. Polymorphism of the CYP2D6 gene causes different genotypes to exhibit different enzyme activities, which fall into four classes: PM (poor metabolizer), IM (intermediate metabolizer), EM (extensive, i.e. normal, metabolizer) and UM (ultrarapid metabolizer). Enzyme activity strongly affects drug metabolism. One example is the 8-aminoquinoline antimalarial primaquine (PQ), a prodrug that must be metabolized by CYP2D6 to become active. Primaquine is the only effective therapeutic for preventing relapse of Plasmodium vivax malaria, but patients with the PM or IM phenotype cannot metabolize primaquine to its active metabolite and are therefore at higher risk of relapse after treatment. When PQ is used to control malaria, it is therefore important to know the frequency of the PM and IM phenotypes in the target population.

Current methods for detecting CYP2D6 genotypes, such as SNP microarrays, qPCR and short-read sequencing, are relatively inexpensive but limited. For example, they identify only the common CYP2D6 genotypes by testing for known mutation types and are poorly suited to detecting new mutations. Short-read sequencing can find some new variants, but because the CYP2D6 and CYP2D7 gene sequences are highly similar, reads are easily misaligned in practice and erroneous variants are called. Furthermore, these methods infer alleles from the detected variants and known allele frequencies rather than directly obtaining the sequence of each individual allele. Rare or new alleles therefore further confound their results.

Long-read sequencing can address the challenges of current short-read sequencing to some extent. However, because of its high sequencing error rate (5%-15%), variant calling on long reads both reports false variants and misses true ones, and the mainstream variant callers in current use still produce many false positive variants and false negative sites, which can strongly affect CYP2D6 genotyping. Detection of SNPs and InDels is a basic task when genomic variation is studied with long-read data. Although many algorithms are available for SNP and InDel analysis, they were developed for second-generation sequencing data and therefore do not perform well on long-read data with its high error rate.

In the prior art, the bioinformatics analysis is mainly performed with the minimap2 + nanopolish software combination, which gives the best PPV and sensitivity among the compared combinations of alignment and variant-calling software: a PPV of 79.12%, a sensitivity of 96.43% and an F1 value of 0.8692. This combination still produces many false positive variants (hence the low PPV), mainly because of the high error rate of long-read sequencing. Such false positives can bias the subsequent haplotype prediction and thereby affect genotyping. There is therefore a need for an efficient long-read sequencing method that genotypes CYP2D6 with high accuracy.

Disclosure of Invention

The present invention aims to solve at least one of the above technical problems in the prior art. To this end, the invention provides a gene haplotype detection method based on long-read sequencing and an application thereof. In the method, the CYP2D6 gene sequence is amplified with specific primers and then amplified again with specific index tags so that different samples are labeled. The pooled sample libraries carrying different tags are then sequenced on a Nanopore platform, and the sample data are demultiplexed by index tag with the index mismatch tolerance set to 1, which increases the proportion of reads that are successfully assigned. The linkage between point mutations is then inferred from the long reads, haplotypes are predicted from that linkage, and the haplotypes are finally typed accurately. Compared with existing methods, the method corrects the point mutations called from Nanopore sequencing data, which substantially improves the F1 value and gives more accurate and effective CYP2D6 genotyping.

In a first aspect of the present invention, there is provided a method for detecting a gene haplotype, comprising the steps of:

(1) Performing PCR amplification on a sample to be tested using specific primers to obtain a target fragment, and then performing PCR amplification on the target fragment again using tag primers to obtain amplification products carrying different tags;

(2) Mixing the differently tagged amplification products from different samples in equal amounts, constructing a Nanopore sequencing library, performing long-read sequencing on the library, aligning the sequencing results to a human reference genome to obtain a sorted BAM file, performing variant calling to obtain a VCF file, and correcting the VCF file with a Bayesian correction model to obtain a corrected VCF file;

(3) Phasing the corrected VCF file and the sorted BAM file using the phase command, then executing the haplotag command on the phasing result to tag the data, and determining the gene haplotype from the tags.

In some embodiments of the invention, the human reference genome is a CYP2D6 gene reference sequence.

In some embodiments of the invention, the specific primers are set forth in SEQ ID NO: 1-2.

In the present invention, each specific primer comprises a binding portion that specifically targets the target sequence and a common sequence portion. The common sequence portion is used for the subsequent tag ligation.

In some embodiments of the invention, the common sequence is linked to the 5' end of the binding portion that specifically targets the target sequence.

In some embodiments of the invention, the tag primers are set forth in SEQ ID NO: 3 to 206.

In the present invention, each tag primer comprises a tag portion and a common sequence portion.

In some embodiments of the invention, the common sequence is attached to the 3' end of the tag portion.

In some embodiments of the invention, the Bayesian correction model is:

Wherein Gi represents the genotype of the target site, and G0, G1 and G2 represent wild type, heterozygous mutation and homozygous mutation, respectively;

A represents the frequency (AF) of the variant allele;

P(G0), P(G1) and P(G2) are the prior probabilities of the corresponding genotypes, and P(A|G0), P(A|G1) and P(A|G2) are each obtained by calculating the sample mean and sample standard deviation at the site and then fitting a normal distribution.
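Read together with the definitions above, the model corresponds to Bayes' theorem applied over the three genotypes. The following is a reconstruction under that assumption rather than the formula as filed; the symbols \mu_i and \sigma_i (the fitted mean and standard deviation of AF for genotype G_i) are introduced here only for illustration:

```latex
P(G_i \mid A) = \frac{P(A \mid G_i)\, P(G_i)}{\sum_{j=0}^{2} P(A \mid G_j)\, P(G_j)}, \qquad i \in \{0, 1, 2\},
\qquad
P(A \mid G_i) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\!\left( -\frac{(A - \mu_i)^2}{2\sigma_i^2} \right)
```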

By adding a model based on Bayes' formula to the bioinformatics analysis, the invention corrects the point mutations (including SNPs and small InDels) detected by long-read sequencing, then infers the linkage between point mutations from the long reads, and finally achieves accurate haplotype typing.

In some embodiments of the invention, the gene haplotype detection method is used for CYP2D6 genotyping.

For CYP2D6 genotyping, sequencing long PCR amplicons not only detects variants unambiguously, without interference from homologous pseudogenes, but also allows variant analysis to be performed directly on the long reads, which reduces the number of steps while maintaining high accuracy.

In some embodiments of the present invention, when the Bayesian correction model is applied, the following formula is used at the same time to check the site result:

wherein A represents the frequency (AF) of the variant allele;

μ represents the mean of the site in wild-type samples;

σ represents the standard deviation of the site in wild-type samples;

If the Z value is less than 1.96, the site genotype of the sample to be tested is the same as the wild type;

If the Z value is greater than or equal to 1.96, the site genotype of the sample to be tested is the site genotype in the VCF file obtained in step (2).
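The quantities defined above correspond to a standard score of the observed AF against the wild-type distribution at the site. A reconstruction under that assumption (1.96 is the two-sided 95% critical value of the standard normal distribution):

```latex
Z = \frac{A - \mu}{\sigma}
```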

In some embodiments of the invention, in step (3),

If phasing is possible, the phased VCF file is split into two haploid VCF files using a Perl script, genotype detection is performed on the two haploid VCF files using Stargazer software, and the haplotypes of the two haploids are finally combined as the final genotype result;

If phasing is not possible, genotype detection is performed on the corrected VCF file using Stargazer software to obtain the final genotype result.

In some embodiments of the invention, the method further comprises performing data processing after long read long sequencing, comprising:

Extracting the DNA sequence information with Guppy software and filtering out reads with q < 8; then, using the specific index tags as markers, demultiplexing the filtered data with a Python script with the index mismatch tolerance set to 1; removing the adapters and filtering out reads with q < 9; aligning the reads with the Minimap2 alignment software to obtain a SAM alignment file; processing the SAM file with Samtools software to obtain a sorted BAM file; and performing variant calling on the sorted BAM file with the mpileup and call commands of Bcftools software, using the multiallelic-caller algorithm, to obtain a VCF file.
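The demultiplexing step with a mismatch tolerance of 1 can be sketched as follows. This is a minimal illustration, not the Python script of the invention: it assumes the index has already been extracted from each read, counts substitutions only (Nanopore reads also contain insertions and deletions, so a real script may use an edit distance instead), and uses made-up 8-nt indexes in place of the tags contained in SEQ ID NO: 3 to 206.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read_index: str, index_to_sample: Dict[str, str],
                  max_mismatches: int = 1) -> Optional[str]:
    """Return the sample whose index is within max_mismatches of read_index.

    Returns None when no index matches or when more than one index matches,
    so ambiguous reads are discarded rather than guessed.
    """
    hits = [sample for index, sample in index_to_sample.items()
            if len(index) == len(read_index)
            and hamming(read_index, index) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical indexes, for illustration only.
INDEXES = {"ACGTACGT": "sample_A", "TTGCAAGC": "sample_B"}

print(assign_sample("ACGTACGT", INDEXES))  # exact match
print(assign_sample("ACGTACGA", INDEXES))  # one mismatch, still assigned
print(assign_sample("ACGAACGA", INDEXES))  # two mismatches, rejected
```

Allowing one mismatch rescues reads whose index carries a single sequencing error, at the cost of requiring that any two indexes differ at three or more positions so that no read can sit within one mismatch of two indexes.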

In some embodiments of the invention, PoreChop software is used to remove the adapters.

In some embodiments of the invention, NanoFilt software is used to filter out reads with q < 9.

In some embodiments of the invention, the Minimap2 alignment software is run in map-ont mode.

In some embodiments of the present invention, view, sort, and index commands are used sequentially when processing using Samtools software.

In the present invention, a flowchart of the gene haplotype detection method is shown in FIG. 1.

In a second aspect, the invention provides the use of the method for detecting a gene haplotype according to the first aspect of the invention in CYP2D6 enzyme activity typing.

In the present invention, after the CYP2D6 genotype of the sample to be tested has been determined by the gene haplotype detection method of the first aspect, CYP2D6 enzyme activity typing can be performed according to the correspondence between genotype and CYP2D6 enzyme activity known in the art.

The beneficial effects of the invention are as follows:

1. The gene haplotype detection method addresses the low accuracy of CYP2D6 genotyping in the prior art. By combining simple PCR amplification with long-read sequencing, it genotypes CYP2D6 more accurately, with a marked improvement in accuracy over the prior art.

2. The gene haplotype detection method introduces a Bayesian correction model to correct the data. It was also found that minimap2 + bcftools detects variants better than the minimap2 + nanopolish combination of the prior art and achieves close to 100% detection in practical validation.

Drawings

FIG. 1 is a flow chart of a method for detecting a gene haplotype according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to specific examples. The starting materials, reagents or apparatus used in the examples and comparative examples were either commercially available from conventional sources or may be obtained by prior art methods unless specifically indicated. Unless otherwise indicated, assays or testing methods are routine in the art.

Example 1 Design of specific primers for the CYP2D6 gene

In the present invention, a pair of CYP2D6 gene-specific primers was designed based on the CYP2D6 gene sequence (reference sequence NG_008376.4). The primer pair was repeatedly tested and optimized with human genomic DNA as the target sample; after optimization of the amplification system and the amplification program, it specifically amplifies the complete CYP2D6 gene sequence.

The obtained CYP2D6 gene-specific primers are shown in Table 1.

TABLE 1 CYP2D6 gene primer information

In SEQ ID NO: 1-2, the bold, underlined portion at the 5' end of each CYP2D6 gene primer is the common sequence used to attach the specific index tag in a subsequent step.

Example 2 CYP2D6 gene haplotype detection method

This example illustrates a CYP2D6 gene haplotype detection method (two-step PCR) based on the above primer pair. The CYP2D6 gene sequence in the human genome is first amplified at full length to obtain the required target fragment, the target fragment is then modified by attaching the corresponding specific index tag, and the sample libraries carrying different tags are pooled and subjected to long-read sequencing to obtain the CYP2D6 gene haplotype detection result and hence the typing information.

The method comprises the following specific steps:

(1) Acquisition of genomic DNA:

In this example, the genomic DNA is derived from a peripheral blood sample (existing genomic DNA can of course also be tested directly).

0.1-0.2 mL of peripheral blood was taken, and genomic DNA was extracted using a nucleic acid extraction method or product conventional in the art (such as the nucleic acid extraction or purification reagent manufactured by Dongguan Bo Aoshi gene technology Co., ltd., product number: S10040). The purity and concentration of the extracted genomic DNA were initially assessed with a NanoDrop 2000, and its integrity was verified by agarose gel electrophoresis. Genomic DNA that passed these checks was used for subsequent detection.

(2) PCR amplification:

PCR amplification was performed using the extracted genomic DNA as the template and the CYP2D6 gene-specific primers of Example 1, which carry the common sequence. The reaction system (25 μL) is shown in Table 2.

TABLE 2 PCR amplification system

Component	Amount
Sample DNA to be tested (genomic DNA)	10 ng
LA Taq enzyme	0.3 μL
10× amplification buffer	2.5 μL
dNTP	4 μL
CYP2D6 gene upstream and downstream specific primers (10 μM)	1 μL each
Nuclease-free water	to 25 μL

The amplification reaction conditions were: pre-denaturation at 94 °C for 1 min; 20 cycles of denaturation at 98 °C for 10 s, annealing at 65 °C for 60 s and extension at 72 °C for 6 min; final extension at 72 °C for 10 min. This gives the amplification product.

The amplification product was purified with magnetic beads as follows: 25 μL of the PCR amplification product was placed in a centrifuge tube, 12.5 μL of magnetic beads was added, and the mixture was left to stand for 5 min and then placed on a magnetic rack. The supernatant was removed, the beads were washed twice with 75% ethanol, and 17 μL of nuclease-free water was added to resuspend them. The supernatant was then transferred to a new centrifuge tube to give the purified PCR amplification product.

The 3' end of each specific index tag was linked to the common sequence (the oligonucleotides were synthesized by a commercial supplier).

The sequences of the specific index tags linked to the common sequence are shown in Table 3.

TABLE 3 Specific index tag sequence information

The purified PCR amplification product (as the template) was amplified with the above specific index tags linked to the common sequence (as primers). The reaction system (25 μL) is shown in Table 4.

TABLE 4 PCR amplification system

Component	Amount
Sample DNA to be tested (purified PCR amplification product)	16.2 μL
LA Taq enzyme	0.3 μL
10× amplification buffer	2.5 μL
dNTP	4 μL
Upstream and downstream specific index tags linked to the common sequence (10 μM)	1 μL each
Nuclease-free water	to 25 μL

The amplification reaction conditions were: pre-denaturation at 94 °C for 1 min; 15 cycles of denaturation at 98 °C for 10 s, annealing at 65 °C for 60 s and extension at 72 °C for 6 min; final extension at 72 °C for 10 min. This gives the second amplification product (i.e., the amplicon library).

The magnetic bead purification steps above were repeated to purify the amplicon library: after two washes with 75% ethanol, 50 μL of nuclease-free water was added for resuspension, and the supernatant was transferred to a new centrifuge tube to give the purified amplicon library, which was quantified with a Qubit.

(3) Long read long sequencing and result analysis:

Purified amplicon libraries derived from different test samples were mixed in equal proportions for long-read sequencing.

The initial input amount for sequencing was estimated from the amplicon length of the CYP2D6 gene, the instructions of the Nanopore library preparation and sequencing kits (EXP-NBD104 and SQK-LSK110) and the amount of data required for the bioinformatics analysis. Long-read sequencing was then performed according to the instructions of the Nanopore kits (EXP-NBD104, SQK-LSK110).

Bioinformatics analysis was performed on the long-read data obtained from the MinION run. The specific analysis steps are as follows:

The DNA sequences were extracted with Guppy software (v6.4.6) from the Fast5 files, which store the nanopore signal generated by the MinION run, and low-quality sequences (q < 8) were filtered out; the reads that passed form the final Fastq sequence file. Based on the specific index tags read from the sequences, a Python script with the index mismatch tolerance set to 1 was used to split the Fastq file into per-sample Fastq files, one for each specific index tag. The per-sample Fastq files were quality-checked with NanoPlot software (v1.40.2), and quality statistics were collected. Adapters were trimmed from the per-sample Fastq files with PoreChop software (v0.2.4), and low-quality sequences (q < 9) were filtered out with NanoFilt software (v2.8.0). The filtered Fastq files were aligned to the CYP2D6 reference sequence (NG_008376.4) using the map-ont mode of the Minimap2 alignment software (v2.17-r941) to obtain SAM alignment files. The SAM files were processed with the view, sort and index commands of Samtools software (v1.2), in that order, to obtain sorted BAM files. The sorted BAM files were quality-checked over the specifically amplified region with Bamdst software (v1.0.9), and statistics such as coverage were collected. Variant calling was performed on the sorted BAM files with the mpileup and call commands of Bcftools software (v1.12), using the parameter -m (multiallelic-caller algorithm), to obtain VCF files. Point mutations (including SNPs and small InDels) in the VCF files were corrected with a correction model to obtain corrected VCF files.
Genotype prediction was then performed on the corrected VCF files by the haplotype detection method, and the copy number was inferred in combination with the AF to obtain the final genotype.
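The alignment and variant-calling steps of this workflow can be sketched as a list of shell commands. This is an illustrative sketch, not the command lines used in the example: the file names are hypothetical, only options named in the text (map-ont, view, sort, index, mpileup, call -m) plus basic input/output options are used, and the `samtools sort -o` form assumes Samtools 1.3 or later, whereas the example cites v1.2, whose sort syntax differs.

```python
from typing import List


def build_pipeline_commands(reference: str, fastq: str, prefix: str) -> List[str]:
    """Return shell commands for alignment, sorting, indexing and variant calling."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    vcf = f"{prefix}.vcf"
    return [
        # Align Nanopore reads to the reference in map-ont mode, writing SAM.
        f"minimap2 -ax map-ont {reference} {fastq} > {sam}",
        # view: convert SAM to BAM.
        f"samtools view -b -o {bam} {sam}",
        # sort: coordinate-sort the BAM (modern -o syntax, see lead-in).
        f"samtools sort -o {sorted_bam} {bam}",
        # index: create the .bai index required by downstream tools.
        f"samtools index {sorted_bam}",
        # mpileup + call with the multiallelic caller (-m).
        f"bcftools mpileup -f {reference} {sorted_bam} | bcftools call -m -o {vcf}",
    ]


# Hypothetical file names, for illustration only.
commands = build_pipeline_commands("NG_008376.4.fasta", "sample1.filtered.fastq", "sample1")
for command in commands:
    print(command)
```

Building the commands as data, rather than running them inline, makes it easy to log exactly what was executed for each sample before handing the strings to a shell.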

The correction model is built on a Bayesian correction model. Because the CYP2D6 gene has a relatively high mutation rate, not every possible mutation site can be trained in a model. Model construction in this method therefore focuses on the 396 key mutation sites that PharmVar records for allele identification, since these sites directly affect allele determination.

The point mutations (including SNPs and small InDels) detected in Illumina second-generation sequencing data were taken as the standard, and the correction model was built from the variant frequencies observed at the same base positions in the long-read data. For sites with many mutations in the Illumina sequencing results, a Bayesian genotype-frequency model of the site was constructed; when a given AF is detected at a site, the probability that the site has a given genotype is calculated according to the following formula:

Wherein, in the formula:

Gi represents the genotype at a site; each site has three genotypes G0, G1 and G2, representing wild type, heterozygous mutation and homozygous mutation, respectively. A represents the frequency (AF) of the variant allele. P(G0), P(G1) and P(G2) are the prior probabilities, i.e. the population frequencies of the corresponding genotypes, taken from the East Asian population frequencies of the site genotypes in the gnomAD database (v2.1.1). P(A|G0), P(A|G1) and P(A|G2) are each obtained by calculating the sample mean and sample standard deviation at the site and then fitting a normal distribution.
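A numeric sketch of this genotype model follows. It assumes the model is Bayes' theorem with a normal likelihood for the AF under each genotype, as the definitions above suggest. The priors, means and standard deviations are invented for illustration and are not values from gnomAD or from the training data.

```python
import math
from typing import Dict, Tuple


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def genotype_posteriors(af: float, priors: Dict[str, float],
                        params: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Posterior P(Gi | A) for each genotype, given the observed allele frequency."""
    weighted = {g: priors[g] * normal_pdf(af, *params[g]) for g in priors}
    total = sum(weighted.values())
    if total == 0.0:
        raise ValueError("observed AF has zero likelihood under every genotype")
    return {g: w / total for g, w in weighted.items()}


# Hypothetical priors P(Gi) and fitted (mean, sd) of AF for one site.
PRIORS = {"G0": 0.49, "G1": 0.42, "G2": 0.09}
PARAMS = {"G0": (0.05, 0.03), "G1": (0.50, 0.08), "G2": (0.95, 0.04)}

posteriors = genotype_posteriors(0.46, PRIORS, PARAMS)
best = max(posteriors, key=posteriors.get)
print(best, posteriors)
```

With these illustrative parameters an observed AF of 0.46 lies half a standard deviation from the heterozygous mean and more than ten standard deviations from the other two, so the heterozygous genotype G1 receives essentially all of the posterior mass.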

For sites with few mutations in the Illumina sequencing results, for example sites with wild-type results, a site-level false positive (FP) filtering model was constructed, with the following formula:

Wherein, in the formula:

A represents the frequency (AF) of the variant allele. μ represents the mean of the site in wild-type samples. σ represents the standard deviation of the site in wild-type samples.

When a given AF is detected at a site, the Z value of the site can be calculated; when the Z value is < 1.96, the genotype of the site is considered to be wild type.
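The FP filtering rule can be sketched as below, assuming the Z value is the standard score (A − μ)/σ of the observed AF against the wild-type samples at the site. The wild-type mean and standard deviation used in the example are hypothetical.

```python
def z_value(af: float, wild_mean: float, wild_sd: float) -> float:
    """Standard score of the observed AF relative to wild-type samples."""
    if wild_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (af - wild_mean) / wild_sd


def filter_site(af: float, wild_mean: float, wild_sd: float,
                called_genotype: str, threshold: float = 1.96) -> str:
    """Return 'G0' (wild type) when Z is below the threshold, else keep the call."""
    if z_value(af, wild_mean, wild_sd) < threshold:
        return "G0"
    return called_genotype


# Hypothetical wild-type background at one site: mean AF 0.04, sd 0.02.
print(filter_site(0.06, 0.04, 0.02, called_genotype="G1"))  # Z is about 1, filtered
print(filter_site(0.45, 0.04, 0.02, called_genotype="G1"))  # Z is about 20.5, kept
```

The rule treats a low AF that is indistinguishable from the sequencing-error background of wild-type samples as a false positive call.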

In the model construction stage, because some sites in the training set have a low frequency in the East Asian population, the FP filtering model was applied first to determine the site genotype, and the Bayesian model was then used for further correction. In actual detection (for example on a test set or on real samples), however, the Bayesian model and the FP filtering model are applied together. That is, when the Z value is 1.96 or more, the result corrected by the FP filtering model and the Bayesian model is output. When the Z value is < 1.96, the corrected site genotype is output as wild type, which gives a wild-type (negative) result in the subsequent steps.

Of course, if in actual detection the Z value is < 1.96 and the site can still be corrected with the Bayesian model, Bayesian correction of the site genotype output by the variant caller is still required, to ensure that false positive and false negative results are filtered out. If the Bayesian model cannot be used for the site, no Bayesian correction is performed.

To test the effectiveness of the correction model, a separate sample was used for verification. The results are shown in Table 5.

Table 5 Performance validation of the Bayesian correction model

The meaning and formula of each metric in Table 5 are shown in Table 6.

Table 6 Meaning and formula of each metric

Thus, with the 396 sites recorded by PharmVar evaluated on the independent samples, correcting the sites with the established Bayesian correction model raised the F1 value from 0.9686 (uncorrected) to 0.9916 (corrected). A corrected F1 value above 0.9900 (close to the maximum of 1, a marked increase) indicates that the single-base-level results are already very close to those of Illumina sequencing, so that the allele results are not affected by variant-calling errors. The total number of false negative variants (FN) fell by 82.2% (267/325), and the total number of false positive variants (FP) fell by 59.2% (129/218).
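The two reduction percentages can be checked with a few lines of arithmetic. The sketch assumes that each fraction reads as "errors removed / errors before correction", which is consistent with the stated percentages.

```python
def percent_reduction(removed: int, before: int) -> float:
    """Percentage of the original errors that the correction removed."""
    return 100.0 * removed / before


fn_reduction = percent_reduction(267, 325)  # false negatives
fp_reduction = percent_reduction(129, 218)  # false positives

print(f"FN reduced by {fn_reduction:.1f}%, {325 - 267} remaining")
print(f"FP reduced by {fp_reduction:.1f}%, {218 - 129} remaining")
```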

For comparison, another set of sample data was analyzed with the minimap2 + nanopolish combination (for the specific procedure see Liau, Yusmiati, et al. Nanopore sequencing of the pharmacogene CYP2D6 allows simultaneous haplotyping and detection of duplications. Pharmacogenomics J. 14, 1033-1047 (2019)) without applying the Bayesian correction model. The results are shown in Table 7.

Table 7 Detection results of the minimap2 + nanopolish combination

The results show that the uncorrected minimap2 + nanopolish combination performs far worse than the method of the examples of the present invention.

In addition, at some sites that are frequent in CYP2D6 typing in the Chinese population (shown in Table 8), for example site NG_008376.4:5119 (a key site for distinguishing alleles *10 and *39), the software results for some samples contained false negative variants before correction. A false negative (FN) at this site directly affects the accuracy of allele detection (a true *10 is misjudged as *39). After correction with the Bayesian model, the F1 value at site NG_008376.4:5119 reaches 1.0000, which shows that the correction model of this example is highly accurate at a key site for CYP2D6 typing and helps improve the accuracy of the subsequent genotyping.

Table 8 Performance validation at site NG_008376.4:5119

Type	Ref	Alt	TP	FN	FP	TN	PPV	Sensitivity	F1
After correction	C	T	294	0	0	189	100%	100%	1.0000
Before correction	C	T	292	2	0	189	100%	99.32%	0.9966
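The "before correction" metrics for this site can be reproduced from the confusion-matrix counts. The sketch uses the standard definitions of PPV, sensitivity and F1, which are assumed here to match the formulas of Table 6.

```python
def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity (recall): TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of PPV and sensitivity: 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Counts before correction at NG_008376.4:5119.
tp, fn, fp = 292, 2, 0
print(f"PPV {100 * ppv(tp, fp):.2f}%")
print(f"Sensitivity {100 * sensitivity(tp, fn):.2f}%")
print(f"F1 {f1_score(tp, fp, fn):.4f}")
```

The two missed calls lower the sensitivity to 292/294 while the PPV stays at 100%, which is why F1 falls just short of 1 before correction.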

The haplotype detection method comprises the following steps: the phase command of WhatsHap software (v1.4) is used to phase the corrected VCF file against the sorted BAM file according to the detected point mutations (including SNPs and small InDels), giving a phased VCF file; the haplotag command of WhatsHap software (v1.4) is then used to tag each read as haplotype H1, H2 or None according to the phased VCF file and the sorted BAM file, finally giving a phasing list.
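The two WhatsHap steps can be sketched as command lists. This is an illustration under stated assumptions, not the invocation used in the example: the file names are hypothetical, `--indels` is included on the assumption that small InDels are to be phased along with SNPs, and the bgzip/tabix steps are added because haplotag reads a compressed, indexed VCF. Depending on the BAM header, `--ignore-read-groups` may also be needed.

```python
from typing import List


def build_whatshap_commands(reference: str, corrected_vcf: str,
                            sorted_bam: str, prefix: str) -> List[List[str]]:
    """Return argument lists for phasing and read haplotagging with WhatsHap."""
    phased_vcf = f"{prefix}.phased.vcf"
    phased_vcf_gz = f"{phased_vcf}.gz"
    return [
        # Phase the corrected variants using the aligned long reads.
        ["whatshap", "phase", "--indels", "--reference", reference,
         "-o", phased_vcf, corrected_vcf, sorted_bam],
        # Compress and index the phased VCF for the haplotag step.
        ["bgzip", phased_vcf],
        ["tabix", "-p", "vcf", phased_vcf_gz],
        # Tag each read with its haplotype and write the per-read list.
        ["whatshap", "haplotag", "--reference", reference,
         "-o", f"{prefix}.haplotagged.bam",
         "--output-haplotag-list", f"{prefix}.haplotag_list.tsv",
         phased_vcf_gz, sorted_bam],
    ]


# Hypothetical file names, for illustration only.
steps = build_whatshap_commands("NG_008376.4.fasta", "sample1.corrected.vcf",
                                "sample1.sorted.bam", "sample1")
for step in steps:
    print(" ".join(step))
```

Each list can be passed to `subprocess.run(step, check=True)` once the tools are installed.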

When the software result shows that phasing is possible (i.e., the phasing list contains H1 and H2), a Perl (v5.26.2) script splits the phased VCF file into two haploid VCF files (H1 and H2) according to the read names in the phasing list; genotype detection is performed on the two haploid VCF files with Stargazer software (v2.0.0), and the haplotypes of the two haploids are finally combined as the final genotype result. When the software result shows that phasing is not possible (i.e., the phasing list contains only None, "-"), Stargazer software (v2.0.0) is used to genotype the corrected VCF file directly to obtain the final genotype result.
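The splitting step can be illustrated with a simplified Python sketch. It is not the Perl script of the example: instead of working from the read names in the phasing list, it splits a single-sample VCF by the phased GT field (`a|b`), which is a substitute technique chosen for brevity. Records are copied unchanged, and the example genotypes and all positions other than 5119 are hypothetical.

```python
from typing import List, Tuple


def split_phased_vcf(lines: List[str]) -> Tuple[List[str], List[str]]:
    """Split single-sample VCF lines into haplotype 1 and haplotype 2 records.

    Header lines go to both outputs. A record is copied unchanged to a
    haplotype that carries a non-reference allele. Unphased heterozygous
    records are skipped because they cannot be assigned to a haplotype.
    """
    hap1: List[str] = []
    hap2: List[str] = []
    for line in lines:
        if line.startswith("#"):
            hap1.append(line)
            hap2.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT and sample columns
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        genotype = fields[9].split(":")[keys.index("GT")]
        if "|" in genotype:
            alleles = genotype.split("|")
        else:
            alleles = genotype.split("/")
            if len(set(alleles)) != 1:
                continue  # unphased heterozygous call
        if len(alleles) != 2:
            continue
        for allele, records in zip(alleles, (hap1, hap2)):
            if allele not in ("0", "."):
                records.append(line)
    return hap1, hap2


def record(pos: int, ref: str, alt: str, gt: str) -> str:
    """Build one minimal VCF data line (illustrative values)."""
    return "\t".join(["NG_008376.4", str(pos), ".", ref, alt,
                      "60", "PASS", ".", "GT", gt])


def positions(records: List[str]) -> List[int]:
    """Positions of the data lines in a list of VCF lines."""
    return [int(r.split("\t")[1]) for r in records if not r.startswith("#")]


EXAMPLE = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    record(5119, "C", "T", "1|0"),
    record(6000, "G", "C", "0|1"),  # hypothetical position
    record(7000, "A", "G", "1/1"),  # hypothetical position
    record(8000, "T", "C", "0/1"),  # hypothetical position, unphased
]

h1, h2 = split_phased_vcf(EXAMPLE)
print(positions(h1), positions(h2))
```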

Separate independent samples were used for simulated validation. The results are shown in Tables 9 and 10.

Table 9 Accuracy with and without haplotype splitting

Method	Number of accurate samples	Accuracy
Haplotype splitting	473	97.93%
No haplotype splitting	455	94.20%

Table 10 Samples with an inaccurate final genotype, with and without haplotype splitting

Sample name	No haplotype splitting	Haplotype splitting
sample1	*10/*39	*1/*10
sample2	*10/*39	*2/*10
sample3	*10/*106	*1/*10
sample4	*10/*39	*2/*10
sample5	-/-	*2/*36
sample6	-/-	*10/*36

Phasing with WhatsHap software raised the accuracy from 94.20% to 97.93%. In the comparison without haplotype splitting, the VCF file was phased by Stargazer software itself (through the Beagle software bundled with Stargazer), which indicates that WhatsHap phases more accurately than Beagle does here. WhatsHap software is therefore the recommended choice for phasing in Nanopore sequencing analysis.
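The accuracy figures in Table 9 can be cross-checked with simple arithmetic. The total sample count is not stated in the text; the sketch infers it from each row's count and percentage, so the value 483 is a derived assumption rather than a reported number.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Accuracy as a percentage of samples with the correct genotype."""
    return 100.0 * correct / total


# Infer the total from each row of Table 9 (count / reported accuracy).
total_from_split = round(473 / 0.9793)
total_from_unsplit = round(455 / 0.9420)

print(total_from_split, total_from_unsplit)
print(f"{accuracy_percent(473, total_from_split):.2f}%")
print(f"{accuracy_percent(455, total_from_unsplit):.2f}%")
print(f"additional correct samples with WhatsHap phasing: {473 - 455}")
```

Both rows point to the same inferred total, which suggests the two accuracy figures were computed on the same sample set.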

Example 3 method verification

To demonstrate the effectiveness of the above method, an additional set of new validation samples was tested according to the method. The results are shown in Tables 11 and 12.

TABLE 11 Genotyping accuracy of long-read sequencing

Total number of samples	Number of samples with concordant genotype	Accuracy
31	29	93.55%

Table 12 Sample results: genotypes predicted by qPCR and by long-read sequencing

The above results show that, with the qPCR results as the standard, the Nanopore sequencing and bioinformatics analysis of the above example achieved a genotyping accuracy of 93.55%.

The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them. Any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.

Claims

1. A gene haplotype detection method, comprising the following steps:

(3) Phasing the corrected VCF file and the sorted BAM file using the phase command, then executing the haplotag command on the phasing result to tag the data, and determining the gene haplotype from the tags;

Wherein the specific primers are set forth in SEQ ID NO: 1-2, and the tag primers are set forth in SEQ ID NO: 3 to 206.

2. The method of claim 1, wherein the Bayesian correction model is:

A represents the frequency (AF) of the variant allele;

P(G0), P(G1) and P(G2) are the prior probabilities of the corresponding site genotypes, and P(A|G0), P(A|G1) and P(A|G2) are each obtained by calculating the sample mean and sample standard deviation at the site and then fitting a normal distribution.

3. The method according to claim 1, wherein, when the Bayesian correction model is applied, the following formula is used at the same time to check the site result:

wherein A represents the frequency (AF) of the variant allele;

μ represents the mean of the site in wild-type samples;

σ represents the standard deviation of the site in wild-type samples;

if the Z value is less than 1.96, the genotype of the corresponding site of the sample to be tested is the same as the wild type;

if the Z value is greater than or equal to 1.96, the genotype of the corresponding site of the sample to be tested is the site genotype in the VCF file obtained in step (2) of claim 1.

4. The gene haplotype detection method according to claim 1, wherein in step (3), if phasing is possible, the phased VCF file is split into two haploid VCF files using a Perl script, genotype detection is performed on the two haploid VCF files using Stargazer software, and the haplotypes of the two haploids are finally combined as the final genotype result;

if phasing is not possible, genotype detection is performed on the corrected VCF file using Stargazer software to obtain the final genotype result.

5. The method of claim 1, further comprising performing data processing after long-read long-sequencing, comprising:

6. The method of claim 5, wherein the Minimap2 alignment software is run in map-ont mode.

7. The method according to claim 5, wherein when processing using Samtools software, processing is performed using view, sort, and index commands in sequence.

8. Use of the gene haplotype detection method according to any one of claims 1-7 in CYP2D6 enzyme activity typing.