CN117711488B - Gene haplotype detection method based on long-reading long-sequencing and application thereof - Google Patents

Info

Publication number
CN117711488B
Authority
CN
China
Prior art keywords
long
sequencing
software
sample
genotype
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311620961.8A
Other languages
Chinese (zh)
Other versions
CN117711488A (en
Inventor
黄铨飞
苏恺婵
刘情
景丽芳
刘腾飞
李蓓蓓
庾晓康
王康丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CapitalBio Genomics Co Ltd
Original Assignee
CapitalBio Genomics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CapitalBio Genomics Co Ltd
Priority to CN202311620961.8A
Publication of CN117711488A
Application granted
Publication of CN117711488B
Active
Anticipated expiration

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844 Nucleic acid amplification reactions
    • C12Q1/6858 Allele-specific amplification
    • C12Q1/6869 Methods for sequencing
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G16B20/50 Mutagenesis
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/156 Polymorphic or mutational markers
    • C12Q2600/172 Haplotypes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Organic Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Analytical Chemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Wood Science & Technology (AREA)
  • Genetics & Genomics (AREA)
  • Zoology (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biochemistry (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Pathology (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a gene haplotype detection method based on long-read sequencing and its application. The CYP2D6 gene sequence is amplified with specific primers and then amplified again with primers carrying a specific tag index, so that each sample is labeled. The sample libraries carrying different tags are pooled and sequenced on a Nanopore platform. The raw sequencing output is split by sample using the specific tag index as the marker, with the index mismatch tolerance set to 1, which effectively improves the yield of usable data from demultiplexing. The linkage between point mutations is then inferred from the long-read sequencing results, haplotypes are predicted from that linkage, and accurate haplotype typing is finally achieved. Compared with existing methods, the method corrects point mutations called from Nanopore sequencing, which greatly improves the F1 value and enables more accurate and effective CYP2D6 genotyping.

Description

Gene haplotype detection method based on long-reading long-sequencing and application thereof
Technical Field
The invention relates to the field of biotechnology, in particular to a gene haplotype detection method based on long-read sequencing and its application.
Background
About 25% of drugs have been found to be metabolized by a single enzyme, cytochrome P450 2D6 (CYP2D6), which is highly expressed in the liver. Polymorphism of the CYP2D6 gene causes different genotypes to exhibit different enzyme activities. CYP2D6 enzyme activity can be divided into four classes: PM (poor metabolizer), IM (intermediate metabolizer), EM (extensive, i.e. normal, metabolizer), and UM (ultrarapid metabolizer). Drug metabolism differs greatly between these activity classes. For example, the 8-aminoquinoline antimalarial primaquine (PQ) is a prodrug that requires CYP2D6 metabolism to become active. Primaquine is the only effective therapeutic for preventing Plasmodium vivax relapse, but patients with PM or IM phenotypes cannot metabolize primaquine to its active metabolite and are therefore at higher risk of Plasmodium vivax relapse after treatment. When controlling malaria with PQ, it is therefore important to know the frequency of PM and IM phenotypes in the target population.
Current methods for detecting CYP2D6 genotypes, such as SNP microarrays, qPCR, or short-read sequencing, are relatively inexpensive. However, these methods have limitations: they identify common CYP2D6 genotypes by detecting known mutation types and are poorly suited to detecting new mutations. Short-read sequencing can find some new variants, but because the CYP2D6 and CYP2D7 gene sequences are highly similar, reads are easily misaligned in practice and false variants are called. Furthermore, these methods infer alleles from the detected variants and known allele frequencies rather than directly obtaining the sequence of each individual allele. The presence of rare or new alleles therefore further confounds their results.
Long-read sequencing can address some of the challenges faced by current short-read sequencing technology. However, because of the high error rate of long-read sequencing (5%-15%), mutation detection suffers from both false mutations being called and true mutations being missed. Mainstream mutation detection software still produces many false positive mutations and false negative (missed) sites, which can greatly affect CYP2D6 genotyping. Furthermore, when genomic variation is studied with long-read sequencing data, detection of both SNPs and InDels is a basic requirement. Although many algorithms are available for SNP and InDel analysis, they were developed for second-generation sequencing data and therefore do not perform well on long-read sequencing data with a high error rate.
In the prior art, bioinformatic analysis is mainly performed with the minimap2 + nanopolish software combination, which gives the best PPV and sensitivity among the combinations of alignment software and mutation detection software that have been compared: the PPV is 79.12%, the sensitivity is 96.43%, and the F1 value is 0.8692. This combination still produces many false positive variants (a low PPV), mainly because of the high error rate of long-read sequencing. Such false positive variants bias the subsequent haplotype prediction and thus affect genotyping. There is therefore a need for an efficient long-read sequencing method capable of highly accurate CYP2D6 genotyping.
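As a quick check of the figures quoted above, the F1 value can be recomputed as the harmonic mean of PPV and sensitivity. This is a minimal sketch; the function name `f1_score` is illustrative and not taken from the patent.

```python
def f1_score(ppv: float, sensitivity: float) -> float:
    """Harmonic mean of PPV (precision) and sensitivity (recall)."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# Values quoted in the text for the minimap2 + nanopolish combination.
prior_art_f1 = f1_score(0.7912, 0.9643)
print(round(prior_art_f1, 4))  # 0.8692, matching the quoted F1 value
```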
Disclosure of Invention
The present invention aims to solve at least one of the above technical problems in the prior art. To this end, the invention provides a gene haplotype detection method based on long-read sequencing and its application. In this method, the CYP2D6 gene sequence is amplified with specific primers and then amplified again with primers carrying a specific tag index, so that each sample is labeled. The sample libraries carrying different tags are then pooled and sequenced on a Nanopore platform. The sample data are split using the specific tag index as the marker, with the index mismatch tolerance set to 1, which effectively improves the yield of usable data from demultiplexing. The linkage between point mutations is then inferred from the long-read sequencing results, haplotypes are predicted from that linkage, and accurate haplotype typing is finally achieved. Compared with existing methods, the method corrects point mutations called from Nanopore sequencing, which greatly improves the F1 value and enables more accurate and effective CYP2D6 genotyping.
In a first aspect of the present invention, there is provided a method for detecting a gene haplotype, comprising the steps of:
(1) Performing PCR amplification on a sample to be tested using specific primers to obtain a target fragment, and then performing PCR amplification on the target fragment again using tag primers to obtain amplification products carrying different tags;
(2) Mixing equal amounts of the amplification products carrying different tags from different samples to be tested, constructing a Nanopore sequencing library, performing long-read sequencing on the library, aligning the sequencing results to a human reference genome to obtain a sorted BAM file, performing mutation detection to obtain a VCF file, and correcting the VCF file with a Bayesian correction model to obtain a corrected VCF file;
(3) Phasing the corrected VCF file and the sorted BAM file using the phase command, then running the haplotag command on the phasing result to tag the data, and determining the gene haplotype from the tags.
In some embodiments of the invention, the human reference genome is a CYP2D6 gene reference sequence.
In some embodiments of the invention, the specific primers are set forth in SEQ ID NOs: 1-2.
In the present invention, each specific primer includes a binding portion that specifically targets the target sequence and a common sequence portion. The common sequence portion is used for subsequent tag attachment.
In some embodiments of the invention, the common sequence is linked to the 5' end of the binding portion that specifically targets the target sequence.
In some embodiments of the invention, the tag primers are set forth in SEQ ID NOs: 3-206.
In the present invention, each tag primer includes a tag portion and a common sequence portion.
In some embodiments of the invention, the common sequence is attached to the 3' end of the tag portion.
In some embodiments of the invention, the Bayesian correction model is:
$$P(G_i \mid A) = \frac{P(A \mid G_i)\,P(G_i)}{\sum_{j=0}^{2} P(A \mid G_j)\,P(G_j)}$$
wherein Gi represents the genotype of the target site, and G0, G1 and G2 represent wild type, heterozygous mutation and homozygous mutation, respectively;
A represents the frequency AF of the variant allele;
P(G0), P(G1) and P(G2) are the prior probabilities of the corresponding genotypes, and P(A|G0), P(A|G1) and P(A|G2) are each obtained by calculating the sample mean and sample standard deviation of the site and then fitting a normal distribution.
In the invention, Bayesian modeling is added to the bioinformatic analysis so that the point mutations (including SNPs and small InDels) detected by long-read sequencing can be corrected. The linkage between the point mutations is then inferred from the long-read sequencing results, and accurate haplotype genotyping is finally achieved.
In some embodiments of the invention, the gene haplotype detection method is used for CYP2D6 genotyping.
For CYP2D6 genotyping, sequencing the long PCR amplicon allows variants to be detected clearly without interference from homologous pseudogenes, and also allows mutation analysis to be carried out directly on the long reads, which reduces the number of steps while maintaining high accuracy.
In some embodiments of the present invention, when correcting with the Bayesian correction model, the following formula is used at the same time to verify the site result:
$$Z = \frac{A - \mu}{\sigma}$$
wherein A represents the frequency AF of the variant allele;
μ represents the mean of the site in the wild-type samples;
σ represents the standard deviation of the site in the wild-type samples;
If the Z value is less than 1.96, the locus genotype of the sample to be tested is identical to the wild type;
If the Z value is greater than or equal to 1.96, the locus genotype of the sample to be tested is the locus genotype in the VCF file obtained in step (2).
In some embodiments of the invention, in step (3),
If phasing is possible, the phased VCF file is split into two haploid VCF files using a Perl script, genotype detection is then performed on the two haploid VCF files using Stargazer software, and the haplotypes of the two haploids are finally combined as the final genotype result;
If phasing is not possible, genotype detection is performed on the corrected VCF file using Stargazer software to obtain the final genotype result.
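The splitting step can be illustrated with a short sketch. The patent uses a Perl script whose details are not given; the Python version below only shows the idea for a single-sample VCF with phased `a|b` genotypes. The function name, the haploid `GT` encoding written to each output, and the handling of unphased records are all assumptions made for illustration.

```python
def split_phased_vcf(lines):
    """Split single-sample VCF lines into two per-haplotype record lists.

    Returns (header, hap1, hap2, unphased). A record is copied to a
    haplotype only if that haplotype carries a non-reference allele.
    """
    header, hap1, hap2, unphased = [], [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.split("\t")
        keys = fields[8].split(":")    # FORMAT column
        values = fields[9].split(":")  # first (only) sample column
        gt_pos = keys.index("GT")
        gt = values[gt_pos]
        if "|" not in gt:
            unphased.append(line)      # cannot be assigned to a haplotype
            continue
        for allele, out in zip(gt.split("|"), (hap1, hap2)):
            if allele in ("0", "."):
                continue               # reference or missing on this haplotype
            new_values = list(values)
            new_values[gt_pos] = allele  # assumption: haploid GT encoding
            out.append("\t".join(fields[:9] + [":".join(new_values)]))
    return header, hap1, hap2, unphased


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "chr22\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0|1",
    "chr22\t200\t.\tC\tT\t50\tPASS\t.\tGT\t1|0",
    "chr22\t300\t.\tG\tA\t50\tPASS\t.\tGT\t1|1",
    "chr22\t400\t.\tT\tC\t50\tPASS\t.\tGT\t0/1",
]
header, hap1, hap2, unphased = split_phased_vcf(example)
```

In this toy input, the records at positions 200 and 300 go to the first haplotype, those at 100 and 300 go to the second, and the unphased record at 400 is set aside.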
In some embodiments of the invention, the method further comprises data processing after long-read sequencing, comprising:
Extracting the DNA sequence information with Guppy software and filtering out reads with q < 8; splitting the filtered data with a Python script, using the specific tag index as the marker and setting the index mismatch tolerance to 1; removing adapters and filtering out reads with q < 9; aligning the sequences with Minimap2 to obtain a SAM alignment file; processing the SAM file with Samtools to obtain a sorted BAM file; and performing mutation detection on the sorted BAM file with the mpileup and call commands of Bcftools, using the multiallelic-caller algorithm, to obtain a VCF file.
In some embodiments of the invention, PoreChop software is used to remove the adapters.
In some embodiments of the invention, NanoFilt software is used to filter out reads with q < 9.
In some embodiments of the invention, the Minimap2 alignment software is run in map-ont mode.
In some embodiments of the present invention, the view, sort, and index commands are used in sequence when processing with Samtools software.
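The processing chain above can be sketched as a list of shell commands, built here in Python without executing them. This is a sketch under stated assumptions rather than the patent's actual invocation: the file names are hypothetical, the only parameters taken from the text are the q < 9 filter, the map-ont mode, the view/sort/index sequence and the `-m` caller, and the `samtools sort -o` form assumes a recent Samtools release (older releases used a different sort syntax).

```python
import shlex


def build_commands(ref: str, fastq: str, prefix: str) -> list:
    """Return the shell commands for one demultiplexed sample, in order."""
    q = shlex.quote
    trimmed = f"{prefix}.trimmed.fastq"
    filtered = f"{prefix}.filtered.fastq"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    vcf = f"{prefix}.vcf"
    return [
        # Adapter removal, then quality filtering at q >= 9.
        f"porechop -i {q(fastq)} -o {q(trimmed)}",
        f"NanoFilt -q 9 < {q(trimmed)} > {q(filtered)}",
        # Alignment in map-ont mode, SAM output.
        f"minimap2 -a -x map-ont {q(ref)} {q(filtered)} > {q(sam)}",
        # view, sort, index in sequence.
        f"samtools view -b -o {q(bam)} {q(sam)}",
        f"samtools sort -o {q(sorted_bam)} {q(bam)}",
        f"samtools index {q(sorted_bam)}",
        # Variant calling with the multiallelic caller (-m).
        f"bcftools mpileup -f {q(ref)} {q(sorted_bam)} | "
        f"bcftools call -m -Ov -o {q(vcf)}",
    ]


for command in build_commands("CYP2D6_ref.fa", "sample01.fastq", "sample01"):
    print(command)
```

Building the commands as strings keeps the sketch runnable without the tools installed; in practice each line would be passed to a shell or a workflow manager.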
In the present invention, a flowchart of the gene haplotype detection method is shown in FIG. 1.
In a second aspect, the invention provides the use of the method for detecting a gene haplotype according to the first aspect of the invention in CYP2D6 enzyme activity typing.
In the present invention, after the CYP2D6 genotype of the sample to be tested has been determined by the gene haplotype detection method according to the first aspect of the invention, CYP2D6 enzyme activity typing can be performed according to the genotype-to-enzyme-activity correspondence known in the art.
The beneficial effects of the invention are as follows:
1. The gene haplotype detection method effectively solves the problem of low CYP2D6 genotyping accuracy in the prior art. By combining simple PCR amplification with long-read sequencing, it performs CYP2D6 genotyping more accurately, with a marked improvement in accuracy over the prior art.
2. The gene haplotype detection method introduces a Bayesian correction model to correct the data. It was also found that minimap2 + bcftools gives better detection than the minimap2 + nanopolish combination of the prior art, and a detection rate of essentially 100% was achieved in practical verification.
Drawings
FIG. 1 is a flow chart of a method for detecting a gene haplotype according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples. The starting materials, reagents or apparatus used in the examples and comparative examples were either commercially available from conventional sources or may be obtained by prior art methods unless specifically indicated. Unless otherwise indicated, assays or testing methods are routine in the art.
EXAMPLE 1 Design of specific primers for the CYP2D6 gene
In the present invention, a pair of CYP2D6 gene-specific primers was designed based on the CYP2D6 gene sequence (reference sequence NG_008376.4). The primer pair was repeatedly tested and optimized using human genomic DNA as the target sample and, after optimization of the amplification system and amplification program, achieved specific amplification of the complete CYP2D6 gene sequence.
The obtained CYP2D6 gene-specific primers are shown in Table 1.
TABLE 1 CYP2D6 gene primer information
In SEQ ID NOs: 1-2, the bolded and underlined portion at the 5' end of each CYP2D6 gene primer is the common sequence used to attach the specific tag index in a subsequent step.
EXAMPLE 2 CYP2D6 gene haplotype detection method
This example illustrates a CYP2D6 gene haplotype detection method (two-step PCR) based on the above primer pair. The method first amplifies the full-length CYP2D6 gene sequence from the human genome to obtain the target fragment, then modifies the target fragment by attaching the corresponding specific tag index, and then pools the sample libraries carrying different tags and performs long-read sequencing to obtain the CYP2D6 gene haplotype detection result and hence the typing information.
The method comprises the following specific steps:
(1) Acquisition of genomic DNA:
In this example, the genomic DNA is derived from a peripheral blood sample (existing genomic DNA can of course also be tested directly).
Genomic DNA was extracted from 0.1-0.2 mL of peripheral blood using a nucleic acid extraction method or product conventional in the art (such as the nucleic acid extraction or purification reagent manufactured by Dongguan Bo Aoshi Gene Technology Co., Ltd., product number: S10040). The purity and concentration of the extracted genomic DNA were initially assessed with a NanoDrop 2000, and its integrity was verified by agarose gel electrophoresis. Genomic DNA that passed these checks was used for subsequent detection.
(2) PCR amplification:
PCR amplification was performed using the extracted genomic DNA as the template and the CYP2D6 gene-specific primers carrying the common sequence from Example 1. The reaction system (25 μL) is shown in Table 2.
TABLE 2 PCR amplification system
Component                                                        Amount
Sample DNA to be tested (genomic DNA)                            10 ng
LA Taq enzyme                                                    0.3 μL
10× amplification buffer                                         2.5 μL
dNTP                                                             4 μL
10 μM CYP2D6 gene upstream and downstream specific primers       1 μL each
Nuclease-free water                                              to 25 μL
The amplification reaction conditions were: pre-denaturation at 94 °C for 1 min; 20 cycles of denaturation at 98 °C for 10 s, annealing at 65 °C for 60 s and extension at 72 °C for 6 min; final extension at 72 °C for 10 min. The amplification product was thereby obtained.
The amplification product was purified with magnetic beads as follows: 25 μL of the PCR amplification product was transferred to a centrifuge tube, 12.5 μL of magnetic beads was added, and the mixture was left to stand for 5 min. The tube was then placed on a magnetic rack, the supernatant was discarded, and the beads were washed twice with 75% ethanol and resuspended in 17 μL of nuclease-free water. The supernatant was transferred to a new centrifuge tube to give the purified PCR amplification product.
The 3' end of each specific tag index was ligated to the common sequence (submitted to the trade company, ind. Strapdesk).
The obtained specific tag index sequence information linked to the common sequence is shown in Table 3.
TABLE 3 specific tag index sequence information
The purified PCR amplification product (as a template) was amplified with the above-mentioned specific tag index (as a primer) to which the common sequence was attached. The reaction system (25. Mu.L) is shown in Table 4.
TABLE 4 PCR amplification system
Component                                                                   Amount
Sample DNA to be tested (purified PCR amplification product)                16.2 μL
LA Taq enzyme                                                               0.3 μL
10× amplification buffer                                                    2.5 μL
dNTP                                                                        4 μL
10 μM upstream and downstream specific tag index linked to common sequence  1 μL each
Nuclease-free water                                                         to 25 μL
The amplification reaction conditions were: pre-denaturation at 94 °C for 1 min; 15 cycles of denaturation at 98 °C for 10 s, annealing at 65 °C for 60 s and extension at 72 °C for 6 min; final extension at 72 °C for 10 min. The second amplification product (i.e., the amplicon library) was thereby obtained.
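The component volumes listed in Table 4 can be checked with a few lines of arithmetic. The sketch below sums the listed volumes and reports how much water would be needed to reach 25 μL. The helper name `water_volume` is illustrative only.

```python
def water_volume(total_ul: float, components_ul: dict) -> float:
    """Volume of water (in microliters) needed to reach the total volume."""
    remaining = round(total_ul - sum(components_ul.values()), 2)
    return max(0.0, remaining)


# Volumes listed in Table 4 for the second-round (tagging) PCR.
table4_ul = {
    "purified PCR amplification product": 16.2,
    "LA Taq enzyme": 0.3,
    "10x amplification buffer": 2.5,
    "dNTP": 4.0,
    "upstream tag primer": 1.0,
    "downstream tag primer": 1.0,
}
table4_water_ul = water_volume(25.0, table4_ul)
```

With the volumes as listed, the components already add up to the full 25 μL, so the computed water volume for this reaction is zero.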
The magnetic bead purification steps above were repeated to purify the amplicon library. After two washes with 75% ethanol, the beads were resuspended in 50 μL of nuclease-free water, and the supernatant was transferred to a new centrifuge tube to give the purified amplicon library, which was quantified with Qubit.
(3) Long-read sequencing and result analysis:
Purified amplicon libraries derived from different test samples were mixed in equal proportions for long-read sequencing.
The initial input amount of each sample for sequencing was estimated from the CYP2D6 amplicon length, the instructions of the Nanopore library preparation and sequencing kits (EXP-NBD104 and SQK-LSK110), and the amount of data required for bioinformatic analysis. Long-read sequencing was then performed according to the kit instructions (EXP-NBD104, SQK-LSK110).
Bioinformatic analysis was performed on the long-read sequencing data output by the MinION. The specific analysis steps were as follows:
Guppy software (v6.4.6) was used to extract the DNA sequences from the Fast5 nanopore signal files generated by MinION sequencing, low-quality sequences (q < 8) were filtered out, and the passing reads were written to the final Fastq sequence file. The specific tag index information was obtained from the sequences, and a Python script with the index mismatch tolerance set to 1 was used to split the Fastq sequence file into one Fastq file per specific tag index, i.e. per sample. NanoPlot software (v1.40.2) was then used for quality control of the per-sample Fastq files and to collect quality statistics. Adapters were removed from the per-sample Fastq files with PoreChop software (v0.2.4), and low-quality sequences (q < 9) were filtered out with NanoFilt software (v2.8.0). The filtered Fastq files were aligned to the CYP2D6 reference sequence (NG_008376.4) using the map-ont mode of Minimap2 (v2.17-r941) to obtain SAM alignment files. The SAM files were processed with the view, sort and index commands of Samtools (v1.2), in that order, to obtain sorted BAM files. Bamdst software (v1.0.9) was used for quality control of the sorted BAM files over the specifically amplified region and to collect statistics such as coverage. The mpileup and call commands of Bcftools (v1.12) were used with the -m parameter (multiallelic-caller algorithm) to detect mutations in the sorted BAM files and obtain VCF files. Point mutations (including SNPs and small InDels) in the VCF files were then corrected with the correction model to obtain the corrected VCF files.
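The demultiplexing rule described above (assign a read to a sample when its index matches a known tag with at most one mismatch) can be sketched as follows. The patent does not disclose its Python script, so this is only an illustration under stated assumptions: the index is taken from the start of the read, only substitutions are counted (Hamming distance), ambiguous reads are discarded, and the index sequences and sample names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read: str, index_to_sample: dict, max_mismatch: int = 1):
    """Return the sample whose index matches the read start, or None.

    A read is assigned only if exactly one index is within max_mismatch
    substitutions of the read prefix.
    """
    hits = []
    for index, sample in index_to_sample.items():
        if len(read) < len(index):
            continue  # read too short to contain this index
        if hamming(read[:len(index)], index) <= max_mismatch:
            hits.append(sample)
    return hits[0] if len(hits) == 1 else None


# Hypothetical 8-base indexes, for illustration only.
indexes = {"ACGTACGT": "sample01", "TGCATGCA": "sample02"}
print(assign_sample("ACGTACGTTTGCA", indexes))  # exact match
print(assign_sample("ACGAACGTTTGCA", indexes))  # one mismatch, still assigned
print(assign_sample("ACGAACGATTGCA", indexes))  # two mismatches, unassigned
```

Allowing one mismatch recovers reads whose index carries a single sequencing error, which is consistent with the improvement in demultiplexing yield described in the text. It is only safe when all indexes differ from each other at three or more positions.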
Genotypes were predicted from the corrected VCF file by the haplotype detection method, and the copy number was inferred in combination with the AF, giving the final genotype.
The correction model is built on a Bayesian correction model. Because the mutation rate of the CYP2D6 gene is relatively high, not every possible mutation site can be included in model training. Model construction in this method therefore focuses on the 396 key mutation sites recorded by PharmVar for allele identification, because these sites directly determine allele assignment.
Point mutations (including SNPs and small InDels) detected in Illumina second-generation sequencing data were taken as the standard, and the correction model was built from the mutation frequencies detected at the same base positions in the long-read sequencing data. For sites with many mutations in the Illumina sequencing results, a Bayesian genotype frequency model of the site was constructed. When a given AF is detected at a site, the probability that the site has a given genotype is calculated with the following formula:
$$P(G_i \mid A) = \frac{P(A \mid G_i)\,P(G_i)}{\sum_{j=0}^{2} P(A \mid G_j)\,P(G_j)}$$
In the formula:
Gi represents the genotype at a site; each site has three genotypes, G0, G1 and G2, representing wild type, heterozygous mutation and homozygous mutation, respectively. A represents the frequency AF of the variant allele. P(G0), P(G1) and P(G2) are the prior probabilities of the corresponding genotypes, taken from the East Asian population genotype frequencies for the site in the gnomAD database (v2.1.1). P(A|G0), P(A|G1) and P(A|G2) are each obtained by calculating the sample mean and sample standard deviation of the site and then fitting a normal distribution.
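A minimal sketch of this genotype posterior is given below, using normal likelihoods and fixed priors. The means, standard deviations and priors are made-up illustration values, not values from the patent, where they would come from the training samples and gnomAD respectively.

```python
from statistics import NormalDist


def genotype_posterior(af: float, likelihoods: list, priors: list) -> list:
    """Posterior P(Gi | A) for genotypes G0, G1, G2 given allele frequency A."""
    joint = [dist.pdf(af) * prior for dist, prior in zip(likelihoods, priors)]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("all likelihoods underflowed to zero for this AF")
    return [value / total for value in joint]


# Hypothetical per-site parameters: AF distribution for each genotype,
# fitted as a normal distribution (mean, standard deviation).
site_likelihoods = [
    NormalDist(0.03, 0.02),  # G0: wild type
    NormalDist(0.50, 0.08),  # G1: heterozygous mutation
    NormalDist(0.97, 0.03),  # G2: homozygous mutation
]
site_priors = [0.70, 0.25, 0.05]  # hypothetical population priors

posterior = genotype_posterior(0.46, site_likelihoods, site_priors)
best_genotype = max(range(3), key=posterior.__getitem__)
print(best_genotype)  # index of the most probable genotype (0, 1 or 2)
```

For an observed AF of 0.46 the heterozygous genotype dominates, because the observation is within one standard deviation of the G1 mean and very far from the other two.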
For sites with few variant calls in the Illumina sequencing results, for example sites where the results are all wild type, a site-level false-positive (FP) filtering model was constructed, with the following formula:

$$Z = \frac{A - \mu}{\sigma}$$

where:
A denotes the frequency (AF) of the variant allele, μ denotes the mean AF of the site in wild-type samples, and σ denotes the standard deviation of the AF of the site in wild-type samples.
When a certain AF is detected at a certain site, the Z value of the site can be calculated, and when the Z value is <1.96, the genotype of the site is considered to be the wild type.
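The Z-value check can be sketched in Python as follows. The sketch assumes that μ and σ are the sample mean and sample standard deviation of the AF observed at the site in wild-type training samples; the AF values below are invented for illustration.

```python
import statistics
from typing import Sequence, Tuple


def fit_wild_type(afs: Sequence[float]) -> Tuple[float, float]:
    """Return (mu, sigma) of the AF at one site across wild-type samples."""
    return statistics.mean(afs), statistics.stdev(afs)


def z_value(af: float, mu: float, sigma: float) -> float:
    """Standardised distance of the observed AF from the wild-type mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (af - mu) / sigma


def is_wild_type(af: float, mu: float, sigma: float,
                 threshold: float = 1.96) -> bool:
    """Treat the site as wild type when Z is below the threshold."""
    return z_value(af, mu, sigma) < threshold


# Invented background AFs at one site in wild-type samples.
mu, sigma = fit_wild_type([0.01, 0.02, 0.03, 0.02, 0.02])

print(is_wild_type(0.03, mu, sigma))  # close to the background noise
print(is_wild_type(0.40, mu, sigma))  # far above the background noise
```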
In the model construction stage, because some sites in the training set occur at low frequency in the East Asian population, the FP filtering model was applied first to determine the site genotype, and the Bayesian model was then used for further correction. In actual detection (for example, on a test set or an actual test sample), however, the Bayesian model and the FP filtering model are applied together: when the Z value is ≥1.96, the result corrected by the FP filtering model and the Bayesian model is output; when the Z value is <1.96, the corrected wild-type site genotype is output, which yields a wild-type (negative) result in the subsequent steps.
If, in actual detection, the Z value is <1.96 but the site still permits correction with the Bayesian model, the site genotype output by the variant-calling software must still be corrected with the Bayesian model so that false-positive and false-negative results are filtered out. If the Bayesian model cannot be applied to the site, no Bayesian correction is performed.
To test the effectiveness of the correction model, independent samples were used for verification. The results are shown in Table 5.
Table 5. Performance verification of the Bayesian correction model
The meaning and formula of each index in Table 5 are shown in Table 6.
Table 6. Meaning and formula of each evaluation index
After the sites were corrected with the established Bayesian correction model, and taking the 396 sites recorded by PharmVar as independent samples, the F1 value rose from 0.9686 before correction to 0.9916 after correction, a marked improvement. A corrected F1 value above 0.9900 indicates that the single-base detection results are already very close to those of Illumina sequencing, so that allele calls are not affected by variant-calling errors. The total number of false-negative variants (FN) fell by 82.2% (267/325), and the total number of false-positive variants (FP) fell by 59.2% (129/218).
For comparison, another set of sample data was analyzed with the minimap2 + nanopolish combination, without the Bayesian correction model (for the specific procedure see Liau, Yusmiati et al. Nanopore sequencing of the pharmacogene CYP2D6 allows simultaneous haplotyping and detection of duplications. Pharmacogenomics J. 14, 1033-1047 (2019)). The results are shown in Table 7.
Table 7. Detection results of the minimap2 + nanopolish combination
The results show that the uncorrected minimap2 + nanopolish combination performed far worse than the method of the examples of the present invention.
In addition, at some sites that are highly frequent in CYP2D6 typing in the Chinese population (shown in Table 8), for example site NG_008376.4:5119 (a key site for distinguishing alleles *10 and *39), the variant-calling software produced false-negative variants in some samples before correction. A false negative (FN) at this site directly affects allele detection accuracy (an actual *10 is misjudged as *39). After correction with the Bayesian model, the F1 value at NG_008376.4:5119 reaches 1.0000, showing that the correction model of the above embodiment is highly accurate at this key CYP2D6 typing site, which helps improve the accuracy of subsequent genotyping.
Table 8. Performance verification at site NG_008376.4:5119
Status | Ref | Alt | TP | FN | FP | TN | PPV | Sensitivity | F1
After correction | C | T | 294 | 0 | 0 | 189 | 100% | 100% | 1.0000
Before correction | C | T | 292 | 2 | 0 | 189 | 100% | 99.32% | 0.9966
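The values in Table 8 can be reproduced from the TP, FN and FP counts with the standard definitions of PPV, sensitivity and F1. The formulas in the sketch below are these standard definitions, which are assumed here to match the ones listed in Table 6.

```python
def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity (recall): TP / (TP + FN)."""
    return tp / (tp + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of PPV and sensitivity: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


# Counts from Table 8 (site NG_008376.4:5119).
for label, tp, fn, fp in [("After correction", 294, 0, 0),
                          ("Before correction", 292, 2, 0)]:
    print(label,
          f"PPV={ppv(tp, fp):.2%}",
          f"Sensitivity={sensitivity(tp, fn):.2%}",
          f"F1={f1_score(tp, fp, fn):.4f}")
# After correction PPV=100.00% Sensitivity=100.00% F1=1.0000
# Before correction PPV=100.00% Sensitivity=99.32% F1=0.9966
```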
The haplotype detection method is as follows. The phase command of WhatsHap software (v1.4) is used to phase the corrected VCF file against the sorted BAM file according to the detected point mutations (including SNPs and small InDels), giving a phased VCF file. The haplotag command of WhatsHap software (v1.4) is then used to tag each read as haplotype H1, H2 or None according to the phased VCF file and the sorted BAM file, finally giving a phasing list.
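How such a phasing list might be read and checked can be sketched in Python. The sketch assumes a tab-separated list whose first two columns are the read name and the haplotype tag (H1, H2 or none), with optional header lines starting with `#`; the actual layout of the list used in this example is not specified, and the read names are invented.

```python
from typing import Dict, Iterable, Set


def group_reads_by_tag(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Group read names by haplotype tag; anything not H1/H2 counts as NONE."""
    groups: Dict[str, Set[str]] = {"H1": set(), "H2": set(), "NONE": set()}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed rows
        tag = fields[1].strip().upper()
        groups[tag if tag in ("H1", "H2") else "NONE"].add(fields[0])
    return groups


def can_phase(groups: Dict[str, Set[str]]) -> bool:
    """Phasing is usable only when both H1 and H2 reads are present."""
    return bool(groups["H1"]) and bool(groups["H2"])


# Invented phasing list for illustration.
DEMO = [
    "#readname\thaplotype",
    "read_001\tH1",
    "read_002\tH2",
    "read_003\tnone",
    "read_004\tH1",
]

groups = group_reads_by_tag(DEMO)
print(can_phase(groups), {tag: sorted(names) for tag, names in groups.items()})
```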
When the software result shows that phasing is possible (that is, the phasing list contains both H1 and H2), a Perl (v5.26.2) script splits the phased VCF file into two haploid VCF files (H1 and H2) according to the read names in the phasing list. Stargazer software (v2.0.0) then genotypes each of the two haploid VCF files, and the haplotypes of the two haploids are combined as the final genotype result. When the software result shows that phasing is not possible (that is, the phasing list contains only None, "-"), Stargazer software (v2.0.0) genotypes the corrected VCF file directly to obtain the final genotype result.
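The idea of splitting a phased call set into two haploid call sets can be illustrated with a simplified Python sketch. Note that this is not the Perl script described above, which works from the read names in the phasing list: the sketch instead splits on the phased genotype field (for example `0|1`) of each record, handles biallelic sites only, and uses hypothetical positions apart from site 5119 (C>T) from Table 8.

```python
import re
from typing import List, NamedTuple, Tuple


class Variant(NamedTuple):
    pos: int
    ref: str
    alt: str
    gt: str  # genotype string such as "0|1", "1|0", "1/1" or "0/1"


def split_haplotypes(variants: List[Variant]
                     ) -> Tuple[List[Variant], List[Variant], List[Variant]]:
    """Return (H1 variants, H2 variants, unphased heterozygous variants)."""
    h1: List[Variant] = []
    h2: List[Variant] = []
    unphased: List[Variant] = []
    for v in variants:
        alleles = re.split(r"[|/]", v.gt)
        if len(alleles) != 2:
            raise ValueError(f"unexpected genotype {v.gt!r} at {v.pos}")
        first, second = alleles
        if "|" not in v.gt and first != second:
            unphased.append(v)  # heterozygous but not phased
            continue
        if first == "1":
            h1.append(v)
        if second == "1":
            h2.append(v)
    return h1, h2, unphased


calls = [
    Variant(5119, "C", "T", "0|1"),
    Variant(6000, "G", "A", "1|0"),   # hypothetical site
    Variant(7000, "A", "G", "1/1"),   # homozygous: present on both haplotypes
    Variant(8000, "T", "C", "0/1"),   # hypothetical unphased heterozygote
]
h1, h2, unphased = split_haplotypes(calls)
print([v.pos for v in h1], [v.pos for v in h2], [v.pos for v in unphased])
```

Each haploid list would then be genotyped separately and the two resulting haplotypes combined into the final diplotype, as the text describes.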
Independent samples were used to verify each of the two approaches. The results are shown in Tables 9 and 10.
Table 9. Accuracy with and without haplotype splitting
Method | Number of accurate samples | Accuracy
Haplotype split | 473 | 97.93%
Haplotype not split | 455 | 94.20%
Table 10. Samples with inaccurate final genotypes, with and without haplotype splitting
Sample name | Haplotype not split | Haplotype split
sample1 | *10/*39 | *1/*10
sample2 | *10/*39 | *2/*10
sample3 | *10/*106 | *1/*10
sample4 | *10/*39 | *2/*10
sample5 | -/- | *2/*36
sample6 | -/- | *10/*36
After phasing with WhatsHap software, the accuracy increased from 94.20% to 97.93%. In the comparison without haplotype splitting, the VCF file was phased by Stargazer software itself (through the Beagle software bundled with Stargazer), which indicates that WhatsHap phasing is more accurate than this built-in phasing. WhatsHap software is therefore recommended for phasing in Nanopore sequencing analysis.
Example 3 method verification
To demonstrate the effectiveness of the above method, a new set of validation samples was analyzed according to the above method. The results are shown in Tables 11 and 12.
Table 11. Genotyping accuracy of long-read sequencing
Total number of samples | Number of samples with concordant genotype | Accuracy
31 | 29 | 93.55%
Table 12. Per-sample genotype prediction results: qPCR versus long-read sequencing
The above results show that, with the qPCR results as the standard, bioinformatic analysis of the Nanopore sequencing data by the method of the above example achieved a genotyping accuracy of 93.55%.
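A concordance figure of this kind can be computed as sketched below in Python. The per-sample genotypes are invented for illustration (Table 12 is not reproduced here), and treating `*1/*10` and `*10/*1` as the same genotype is an assumption of the sketch.

```python
from typing import Dict, Tuple


def normalize(genotype: str) -> Tuple[str, ...]:
    """Make a diplotype order-independent, e.g. '*10/*1' == '*1/*10'."""
    return tuple(sorted(genotype.split("/")))


def concordance(calls: Dict[str, str],
                truth: Dict[str, str]) -> Tuple[int, int, float]:
    """Return (concordant samples, compared samples, accuracy)."""
    shared = sorted(set(calls) & set(truth))
    if not shared:
        raise ValueError("no samples in common")
    matches = sum(normalize(calls[s]) == normalize(truth[s]) for s in shared)
    return matches, len(shared), matches / len(shared)


# Invented genotypes: long-read calls versus a qPCR reference.
long_read = {"s1": "*1/*10", "s2": "*10/*2", "s3": "*10/*39"}
qpcr = {"s1": "*10/*1", "s2": "*2/*10", "s3": "*1/*10"}

print(concordance(long_read, qpcr))

# The accuracy in Table 11 follows from its counts: 29 of 31 samples.
print(f"{29 / 31:.2%}")  # 93.55%
```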
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them. Any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.

Claims (8)

1. A gene haplotype detection method, comprising the following steps:
(1) Carrying out PCR amplification on a sample to be detected using specific primers to obtain a target fragment, and then carrying out PCR amplification on the target fragment again using tag primers to obtain amplification products carrying different tags;
(2) Mixing the amplification products carrying different tags from different samples to be detected in equal amounts, constructing a Nanopore sequencing library, carrying out long-read sequencing on the Nanopore sequencing library, aligning the sequencing results to the human reference genome to obtain sorted BAM files, carrying out variant calling to obtain VCF files, and correcting the VCF files using a Bayesian correction model to obtain corrected VCF files;
(3) Phasing the corrected VCF file and the sorted BAM file using the phase command, then executing the haplotag command according to the phasing result to tag the data, and determining the gene haplotype according to the tags;
wherein the specific primers are shown in SEQ ID NO: 1 to 2, and the tag primers are shown in SEQ ID NO: 3 to 206.
2. The method of claim 1, wherein the Bayesian correction model is:

$$P(G_i \mid A) = \frac{P(A \mid G_i)\,P(G_i)}{\sum_{j=0}^{2} P(A \mid G_j)\,P(G_j)}$$

wherein Gi represents the genotype of the target site, and G0, G1 and G2 represent wild type, heterozygous mutation and homozygous mutation, respectively;
A represents the frequency AF of the variant allele;
P(G0), P(G1) and P(G2) are the prior probabilities of the corresponding site genotypes, and P(A|G0), P(A|G1) and P(A|G2) are each obtained by calculating the sample mean and the sample standard deviation of the site and then fitting a normal distribution.
3. The method according to claim 1, wherein, when correcting using the Bayesian correction model, the site result is checked at the same time using the following formula:

$$Z = \frac{A - \mu}{\sigma}$$

wherein A represents the frequency AF of the variant allele;
μ represents the mean of the site in wild-type samples;
σ represents the standard deviation of the site in wild-type samples;
if the Z value is less than 1.96, the genotype of the corresponding site of the sample to be detected is the wild type;
if the Z value is greater than or equal to 1.96, the genotype of the corresponding site of the sample to be detected is the site genotype in the VCF file obtained in step (2) of claim 1.
4. The method for detecting gene haplotype according to claim 1, wherein in step (3), if phasing is possible, the phased VCF file is split into two haploid VCF files using a Perl script, genotype detection is then performed on the two haploid VCF files using Stargazer software, and finally the haplotypes of the two haploids are combined as the final genotype result;
if phasing is not possible, genotype detection is performed on the corrected VCF file using Stargazer software to obtain the final genotype result.
5. The method of claim 1, further comprising performing data processing after long-read long-sequencing, comprising:
Extracting DNA sequence information using Guppy software and filtering out sequences with q < 8; then, using the specific tag index as a marker, splitting the filtered data with a Python script with the index mismatch tolerance set to 1; removing adapters and filtering out sequences with q < 9; performing sequence alignment using Minimap2 alignment software to obtain SAM alignment files; then processing with Samtools software to obtain sorted BAM files; and performing variant calling on the sorted BAM files using the mpileup and call commands of Bcftools software with the multiallelic-caller algorithm to obtain VCF files.
6. The method of claim 5, wherein the Minimap2 alignment software uses map-ont mode.
7. The method according to claim 5, wherein when processing using Samtools software, processing is performed using view, sort, and index commands in sequence.
8. Use of the gene haplotype detection method according to any one of claims 1-7 in CYP2D6 enzyme activity typing.
CN202311620961.8A 2023-11-29 2023-11-29 Gene haplotype detection method based on long-reading long-sequencing and application thereof Active CN117711488B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311620961.8A CN117711488B (en) 2023-11-29 2023-11-29 Gene haplotype detection method based on long-reading long-sequencing and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311620961.8A CN117711488B (en) 2023-11-29 2023-11-29 Gene haplotype detection method based on long-reading long-sequencing and application thereof

Publications (2)

Publication Number Publication Date
CN117711488A CN117711488A (en) 2024-03-15
CN117711488B true CN117711488B (en) 2024-07-02

Family

ID=90157981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311620961.8A Active CN117711488B (en) 2023-11-29 2023-11-29 Gene haplotype detection method based on long-reading long-sequencing and application thereof

Country Status (1)

Country Link
CN (1) CN117711488B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103459614A (en) * 2011-01-05 2013-12-18 香港中文大学 Non-invasive prenatal genotyping of fetal sex chromosomes
CN103745136A (en) * 2013-12-26 2014-04-23 中国农业大学 Efficient haplotype inference and deleted genotype fill method

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9977861B2 (en) * 2012-07-18 2018-05-22 Illumina Cambridge Limited Methods and systems for determining haplotypes and phasing of haplotypes
CN108885648A (en) * 2016-02-09 2018-11-23 托马生物科学公司 Systems and methods for analyzing nucleic acids
TW201816645A (en) * 2016-09-23 2018-05-01 美商德萊福公司 Integrated systems and methods for automated processing and analysis of biological samples, clinical information processing and clinical trial matching
US11725232B2 (en) * 2016-10-31 2023-08-15 The Hong Kong University Of Science And Technology Compositions, methods and kits for detection of genetic variants for alzheimer's disease
CN107766785B (en) * 2017-01-25 2022-04-29 丁贤根 Face recognition method
MX2020012717A (en) * 2018-05-25 2021-07-15 Arca Biopharma Inc Methods and compositions involving bucindolol for the treatment of atrial fibrillation.
CN109063417B (en) * 2018-07-09 2022-03-15 福建国脉生物科技有限公司 Genotype filling method for constructing hidden Markov chain
US10468141B1 (en) * 2018-11-28 2019-11-05 Asia Genomics Pte. Ltd. Ancestry-specific genetic risk scores
GB202004528D0 (en) * 2020-03-27 2020-05-13 Univ Birmingham Methods, compositions and kits for hla typing
CN111518917B (en) * 2020-04-02 2022-06-07 中山大学 Micro haplotype genetic marker combination and method for noninvasive prenatal paternity relationship determination
CN114250279B (en) * 2020-09-22 2024-04-30 上海韦翰斯生物医药科技有限公司 Construction method of haplotype
US20240287048A1 (en) * 2020-10-16 2024-08-29 The Broad Institute, Inc. Substituted acyl sulfonamides for treating cancer
CN113555062B (en) * 2021-07-23 2022-07-12 哈尔滨因极科技有限公司 Data analysis system and analysis method for genome base variation detection
CN113564247B (en) * 2021-09-24 2022-01-28 北京贝瑞和康生物技术有限公司 Primer group and kit for simultaneously detecting multiple mutations of 9 genes related to congenital adrenal cortical hyperplasia
CN114496077B (en) * 2022-04-15 2022-06-21 北京贝瑞和康生物技术有限公司 Methods, devices, and media for detecting single nucleotide variations and indels
CN114649055B (en) * 2022-04-15 2022-10-21 北京贝瑞和康生物技术有限公司 Methods, devices and media for detecting single nucleotide variations and indels
CN117133355A (en) * 2023-08-25 2023-11-28 山东省农业科学院畜牧兽医研究所 An error correction and missing filling method and application of GBTS detection genotype

Also Published As

Publication number Publication date
CN117711488A (en) 2024-03-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant