
CN115035950B - Genotype detection method, sample pollution detection method, device, equipment and medium - Google Patents

Genotype detection method, sample pollution detection method, device, equipment and medium

Info

Publication number
CN115035950B
Authority
CN
China
Prior art keywords
abundance
tested
threshold
detected
snp site
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210749217.7A
Other languages
Chinese (zh)
Other versions
CN115035950A (en
Inventor
刘成林
张周
汉雨生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Burning Rock Dx Co ltd
Original Assignee
Guangzhou Burning Rock Dx Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Burning Rock Dx Co ltd
Priority to CN202210749217.7A
Publication of CN115035950A
Application granted
Publication of CN115035950B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention discloses a genotype detection method, a sample contamination detection method, an apparatus, a device and a medium. The genotype detection method comprises: obtaining the SNP sites to be detected in a sample to be detected and the abundance thresholds to be detected; for each abundance threshold to be detected, calculating the abundance fluctuation level of the sample to be detected; and determining the genotype of each SNP site to be detected based on the mutation abundance of that site and on the abundance threshold to be detected whose abundance fluctuation level meets the inflection point feature. A Gaussian mixture model is constructed from the minor allele frequencies of the SNP sites to be detected that meet the genotype condition, and the Gaussian mixture model is solved by optimization to obtain the contamination status data of the sample to be detected. The embodiments of the present invention solve the problem that existing genotype detection methods rely on paired sequencing data, and improve the accuracy of the genotype detection results.

Description

Genotype detection method, sample pollution detection method, device, equipment and medium
Technical Field
The invention relates to the field of biotechnology, and in particular to a genotype detection method, a sample contamination detection method, an apparatus, a device and a medium.
Background
Genotyping of individual SNP loci and of SNP loci in bulk is an essential step in locus genotyping and copy number variation detection based on high-throughput sequencing. SNP (Single Nucleotide Polymorphism) mainly refers to DNA (deoxyribonucleic acid) sequence polymorphism caused by variation of a single nucleotide at the genome level. The polymorphism represented by a SNP locus involves only single-base variation, which may be caused by a single-base transition or transversion, or by a base insertion or deletion. Published data indicate that SNP sites in certain enzymes are associated with drug metabolism; for example, breast cancer patients carrying the homozygous CYP2D6*10 mutation are likely to have a higher risk of recurrence under adjuvant tamoxifen therapy than wild-type patients. In addition, identifying chromosomal copy number variation from shifts in the abundance of heterozygous SNPs in bulk relies on first identifying the SNP genotypes. In general, SNP genotyping uses a static abundance threshold method (e.g., Allele-specific copy number analysis of tumors, Peter Van Loo et al., PNAS vol. 107, no. 39: 16910-16915), in which a site is called a homozygous mutation when its abundance exceeds a predetermined threshold (e.g., 95%). This method is strongly affected by tumor fraction, chromosomal structural variation, sample quality and sample contamination, and has low stability and accuracy.
On the other hand, in high-throughput sequencing, contamination by heterologous DNA introduced during sample storage, preparation and similar steps is a non-negligible problem. Sample contamination directly introduces a large number of low-abundance mutations from heterologous DNA into mutation detection, and these are difficult to distinguish from the true somatic mutations of the sample to be detected, which leads to misjudged mutation results. Identifying sample contamination is therefore an important part of high-throughput sequencing quality control, and accurately quantifying it helps to identify and extract reliable somatic mutations. Sample contamination is usually identified with the help of a control sample, which in practice raises problems such as difficult sampling and additional cost. Quantifying contamination is harder still: samples may carry extensive copy number variation, may be contaminated by multiple sources, and existing approaches are insensitive when identifying and quantifying low-level contamination. Some studies quantify the contamination level by modeling the abundance of heterozygous SNPs; however, SNP genotype identification based on static thresholds directly leads to unreliable heterozygous SNP datasets.
For example, Fiévet et al., ART-DeCo: easy tool for detection and characterization of cross-contamination of DNA samples in diagnostic next-generation sequencing analysis (European Journal of Human Genetics (2019) 27:792–800), describe a method for detecting contamination in a sequenced sample, which involves screening for non-heterozygous SNP sites and detecting the proportion of the contamination source. Non-heterozygous SNP sites are screened with a static threshold: a site is considered non-heterozygous when its AR (allelic ratio, similar to allele frequency) lies in [0, 0.005] or [0.995, 1]. The contamination proportion is estimated in two ways: a simple estimate using the WCS formula (see the right column of page 794 of the paper) based on sequencing data at the non-heterozygous SNP loci of the sample to be detected, and a refined estimate based on the SNP genotypes of the contaminating sample together with the sequencing data at the non-heterozygous SNP loci of the sample to be detected.
However, because samples differ from one another, using such a static threshold to discriminate the SNP genotypes of all samples is sometimes inaccurate. Moreover, to identify the contamination source, an additional set of contaminating samples must be sequenced besides the sample to be detected, so contamination cannot be identified from the sequencing data of the single sample to be detected alone.
Disclosure of Invention
The invention provides a genotype detection method, a sample contamination detection method, an apparatus, a device and a medium, which solve the problem that conventional genotype detection methods depend on paired sequencing data and provide a reliable data basis for the accuracy of subsequent sample contamination detection.
According to an aspect of the present invention, there is provided a genotype detection method comprising:
acquiring at least 10 SNP loci to be detected in a sample to be detected and acquiring at least 3 abundance thresholds to be detected;
for each abundance threshold to be detected, calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequency corresponding to each SNP locus to be detected;
taking, as a first abundance threshold, the abundance threshold to be detected whose abundance fluctuation level meets the inflection point feature among the at least 3 abundance fluctuation levels;
and determining the genotype corresponding to each SNP locus to be detected based on the first abundance threshold and the mutation abundance corresponding to each SNP locus to be detected.
According to another aspect of the present invention, there is provided a sample contamination detection method comprising:
obtaining at least 10 SNP loci to be detected in a sample to be detected, and determining the genotype corresponding to each SNP locus to be detected;
constructing a Gaussian mixture model based on the minor allele frequencies corresponding to the SNP loci to be detected that meet the genotype condition, wherein the genotype condition comprises that the genotype is wild type and/or the genotype is homozygous;
and performing an optimization solving operation on the Gaussian mixture model to obtain contamination status data corresponding to the sample to be detected, wherein the contamination status data comprises at least one of the contamination proportion, the minor allele frequency variance and the number of contamination sources.
According to another aspect of the present invention, there is provided a genotype detection device comprising:
a to-be-detected SNP locus acquisition module, used for acquiring at least 10 SNP loci to be detected in a sample to be detected and acquiring at least 3 abundance thresholds to be detected;
an abundance fluctuation level determining module, used for calculating, for each abundance threshold to be detected, the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequency corresponding to each SNP locus to be detected;
a first abundance threshold determining module, used for taking, as a first abundance threshold, the abundance threshold to be detected whose abundance fluctuation level meets the inflection point feature among the at least 3 abundance fluctuation levels;
a genotype determining module, used for determining the genotype corresponding to each SNP locus to be detected based on the first abundance threshold and the mutation abundance corresponding to each SNP locus to be detected.
According to another aspect of the present invention, there is provided a sample contamination detection apparatus comprising:
a genotype determining module, used for acquiring at least 10 SNP loci to be detected in a sample to be detected and determining the genotype corresponding to each SNP locus to be detected;
a Gaussian mixture model construction module, used for constructing a Gaussian mixture model based on the minor allele frequencies corresponding to the SNP loci to be detected that meet the genotype condition, wherein the genotype condition comprises that the genotype is wild type and/or the genotype is homozygous;
a contamination status data determining module, used for performing an optimization solving operation on the Gaussian mixture model to obtain contamination status data corresponding to the sample to be detected, wherein the contamination status data comprises at least one of the contamination proportion, the minor allele frequency variance and the number of contamination sources.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the genotype detection method or the sample contamination detection method according to any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the genotype detection method or sample contamination detection method according to any embodiment of the present invention when executed.
According to the technical scheme of the invention, for each acquired abundance threshold to be detected, the abundance fluctuation level of the sample to be detected is determined based on that threshold and the minor allele frequency corresponding to each SNP locus to be detected. The abundance threshold to be detected whose abundance fluctuation level meets the inflection point feature among the at least 3 abundance fluctuation levels is taken as the first abundance threshold, so that the first abundance threshold is determined by the constructed dynamic abundance threshold model. The genotype corresponding to each SNP locus to be detected is then determined from the first abundance threshold and the mutation abundance corresponding to that locus.
This dynamic threshold method effectively avoids inaccurate genotype discrimination when the abundance of wild-type, heterozygous and homozygous SNPs deviates from 0%, 50% and 100% because of high tumor fraction, copy number variation, sample contamination and the like. Meanwhile, the algorithm does not depend on a control sample, which removes the problems of difficult control sampling and increased detection cost.
Further, because heterozygous and non-heterozygous SNPs are distinguished with a dynamic threshold, the invention can achieve higher genotyping accuracy, i.e., fewer false positives and false negatives. Meanwhile, the mixture model method can determine the number of contamination sources in a sample, can identify and quantify contamination from multiple sources as well as low-level contamination, and does so stably for samples in different chromosomal states, including chromosomally stable samples and samples with extensive copy number variation. It also requires no sequencing information from any sample other than the sample to be detected (such as reference samples or contaminating samples).
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a genotype detection method according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a sample contamination detection method according to a second embodiment of the present invention;
FIG. 3 is a flow chart of a sample contamination detection method according to a third embodiment of the present invention;
FIG. 4 is a schematic diagram of a genotype detecting device according to a fourth embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a sample contamination detection apparatus according to a fifth embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to a sixth embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
FIG. 1 is a flowchart of a genotype detection method according to a first embodiment of the present invention. The method can be performed by a genotype detection device, which can be implemented in hardware and/or software and can be configured in a terminal device. As shown in FIG. 1, the method includes:
S110, obtaining at least 10 SNP loci to be detected in a sample to be detected and obtaining at least 3 abundance thresholds to be detected.
The sample to be detected is a single tissue sample, such as a cancer tissue sample. In this example, the SNP loci to be detected are germline SNP loci. The human genome contains a large number of mutation sites, which can be classified by origin into germline mutations and somatic mutations. Germline mutations, also called germ cell mutations, originate in germ cells such as sperm or ova, so all cells in the body, including those in a cancer tissue sample, usually carry them. Somatic mutations, also known as acquired mutations, arise during growth or under the influence of environmental factors, and are usually carried by only a portion of the cells.
Specifically, after quality control is performed on the FASTQ file (raw sequencer output) containing the high-throughput sequencing reads of the sample to be detected, alignment software is used to align the reads to a human reference genome and generate a SAM file, and the SNP sites in the aligned reads are taken as the SNP sites to be detected. By way of example, the alignment software may be BWA-MEM (0.7.10), and the human reference genome may be hg19 or b37. The SAM file is illustratively converted to a BAM file with samtools (0.1.19).
As an optional implementation, obtaining at least 10 SNP loci to be detected in the sample to be detected comprises taking, as the SNP loci to be detected, at least 10 SNP loci in the sample whose population frequency falls within a preset population frequency range, wherein the preset population frequency range comprises 0.4-0.6. The population frequency characterizes how often a SNP locus occurs in population genomes. Specifically, SNP loci that are aligned to the human reference genome and whose population frequency is within 0.4-0.6 are taken as the SNP loci to be detected.
Specifically, an abundance threshold to be detected is a screening threshold applied to the minor allele frequency. The intervals between successive abundance thresholds to be detected may be equal or different. For example, a set of abundance thresholds to be detected th satisfies th ∈ {0%, 1%, 2%, ..., 10%, 30%}. The specific values of the abundance thresholds to be detected are not limited, and a user can set each of them according to actual requirements.
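The population-frequency screen described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_candidate_snps`, the SNP identifiers, and the choice to treat both ends of the 0.4-0.6 range as inclusive are hypothetical and are not specified in the text.

```python
from typing import Dict, List


def select_candidate_snps(population_freq: Dict[str, float],
                          low: float = 0.4,
                          high: float = 0.6) -> List[str]:
    """Keep SNP IDs whose population frequency lies in [low, high].

    Inclusive bounds are an assumption; the text only gives the range 0.4-0.6.
    """
    return [snp for snp, freq in population_freq.items() if low <= freq <= high]


# Hypothetical SNP identifiers and population frequencies.
freqs = {"snp_a": 0.45, "snp_b": 0.12, "snp_c": 0.60, "snp_d": 0.75}
selected = select_candidate_snps(freqs)
print(selected)
```

Restricting the panel to common SNPs in this range presumably makes heterozygous calls frequent enough in any one individual for the later fluctuation statistics to be meaningful.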
S120, aiming at each abundance threshold to be detected, calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequency corresponding to each SNP locus to be detected.
Calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequency corresponding to each SNP locus to be detected comprises the following steps. First, for each chromosome, a segmentation operation is performed on the chromosome based on the copy numbers corresponding to at least two SNP loci to be detected on it, yielding one or more chromosome segments, wherein the difference between the copy numbers of any two SNP loci to be detected within a chromosome segment is smaller than a preset difference threshold. Second, the SNP loci to be detected in each chromosome segment whose minor allele frequency is greater than the abundance threshold to be detected are taken as target SNP loci. Third, the abundance fluctuation level of the sample to be detected is calculated based on the minor allele frequencies corresponding to at least two target SNP loci.
Among these, the types of chromosomes include autosomes and/or sex chromosomes. Of the 23 pairs of chromosomes in humans, 22 pairs are autosomes and 1 pair is sex chromosome. The chromosomes in this example include all chromosomes that may have heterozygous variations.
In particular, the copy number is used to characterize the number of occurrences of a gene or a gene sequence in the genome, and can also be said to be the number of times a gene fragment is repeated in the genome. Normally, the copy number is 2. When copy number variation occurs, the genome rearranges, and the copy number of the gene changes from 2 to n, where n is a natural number that is not equal to 2.
Specifically, the coverage depth of each SNP locus to be detected is calculated with GATK software, and the copy number of the locus is calculated from the coverage depth. The coverage depth refers to the ratio of the total number of bases (bp) obtained by sequencing to the genome size. In this example, the coverage depth of the SNP loci to be detected is higher than 200X.
By way of example, a CBS (circular binary segmentation) algorithm is used to divide a chromosome into one or more chromosome segments based on copy number. The CBS algorithm may be implemented with, among others, the R software package PSCBS. In this embodiment, the difference between the copy numbers of any two SNP loci to be detected within a chromosome segment is smaller than a preset difference threshold, which may be 2. For example, if the copy numbers of 4 SNP sites to be detected on a chromosome are 2, 3, 5 and 6, respectively, then chromosome segment A contains the 2 SNP sites with copy numbers 2 and 3, and chromosome segment B contains the 2 SNP sites with copy numbers 5 and 6.
For each SNP site, the major allele is the allele with the highest count in the given population, the minor allele is the allele with the second-highest count, and the minor allele frequency is the ratio of the number of minor alleles to the number of all alleles. Alleles are the alternative forms of a gene that occupy the same position on a pair of homologous chromosomes and control different forms of the same trait. For example, suppose genome sequencing of 100 people finds three alleles A, C and G at a SNP site on a chromosome, occurring 100, 80 and 20 times, respectively; the minor allele frequency at that SNP site is then 80/200 = 0.4.
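To make the segment constraint concrete, the sketch below groups consecutive SNP loci so that the copy numbers within a group differ by less than the preset difference threshold. It is a deliberately simplified greedy pass and not the CBS algorithm or the PSCBS package named above; the function name `segment_by_copy_number` and its parameters are hypothetical.

```python
from typing import List


def segment_by_copy_number(copy_numbers: List[float],
                           max_diff: float = 2.0) -> List[List[int]]:
    """Greedily group consecutive loci (by index) into segments.

    Within each segment, max(copy number) - min(copy number) < max_diff.
    This is a simplified stand-in for CBS, used only for illustration.
    """
    segments: List[List[int]] = []
    current: List[int] = []
    lo = hi = 0.0
    for idx, cn in enumerate(copy_numbers):
        if current:
            new_lo, new_hi = min(lo, cn), max(hi, cn)
            if new_hi - new_lo < max_diff:
                # Locus fits the running segment; extend it.
                current.append(idx)
                lo, hi = new_lo, new_hi
                continue
            # Locus would violate the constraint; close the segment.
            segments.append(current)
        current = [idx]
        lo = hi = cn
    if current:
        segments.append(current)
    return segments


# Copy numbers from the example in the text: 2, 3, 5, 6 with threshold 2.
segments = segment_by_copy_number([2, 3, 5, 6], max_diff=2)
print(segments)  # [[0, 1], [2, 3]]
```

On the copy numbers from the example, the first two loci form one segment and the last two form another, matching segments A and B in the text.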
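The minor allele frequency definition above amounts to taking the second-largest allele count over the total. A minimal sketch, in which the function name `minor_allele_frequency` is hypothetical:

```python
from typing import Dict


def minor_allele_frequency(allele_counts: Dict[str, int]) -> float:
    """Second-highest allele count divided by the total allele count."""
    counts = sorted(allele_counts.values(), reverse=True)
    total = sum(counts)
    if len(counts) < 2 or total == 0:
        return 0.0  # no minor allele observed
    return counts[1] / total


# Counts from the example in the text: A=100, C=80, G=20 across 100 people.
maf = minor_allele_frequency({"A": 100, "C": 80, "G": 20})
print(maf)  # 0.4
```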
As an alternative implementation, calculating the abundance fluctuation level of the sample to be detected based on the minor allele frequencies corresponding to at least two target SNP loci comprises: sorting the target SNP loci by their positions on the chromosome; based on the sorted order, taking the difference between the minor allele frequency of the current target SNP locus and that of the next target SNP locus as an abundance difference; and averaging the at least one abundance difference to obtain the abundance fluctuation level of the sample to be detected.
As an example, assume that 4 target SNP sites lie on a chromosome in the order target SNP site A, target SNP site B, target SNP site C and target SNP site D, with corresponding copy numbers of 2, 6, 5 and 3, respectively. Chromosome segment A then contains, in order, target SNP sites A and D, and chromosome segment B contains, in order, target SNP sites B and C.
For example, if chromosome segment A contains 100 target SNP sites, 99 abundance differences are calculated, and averaging them gives the abundance fluctuation level corresponding to chromosome segment A.
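The calculation for one chromosome segment can be sketched as follows. Two points are assumptions rather than statements from the text: the differences between neighbouring loci are taken as absolute values, and the function name `abundance_fluctuation_level` and the (position, frequency) tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple


def abundance_fluctuation_level(snps: List[Tuple[int, float]],
                                threshold: float) -> Optional[float]:
    """Mean difference in minor allele frequency between neighbouring loci.

    snps holds (position, minor allele frequency) pairs for one segment.
    Only loci with frequency above `threshold` are kept as target loci.
    Absolute differences are an assumption; the text just says "difference".
    """
    targets = sorted((pos, maf) for pos, maf in snps if maf > threshold)
    if len(targets) < 2:
        return None  # at least two target loci are needed
    diffs = [abs(nxt[1] - cur[1]) for cur, nxt in zip(targets, targets[1:])]
    return sum(diffs) / len(diffs)


# Hypothetical loci; the locus at position 300 falls below the 5% threshold.
segment = [(100, 0.48), (300, 0.02), (200, 0.44), (400, 0.50)]
level = abundance_fluctuation_level(segment, threshold=0.05)
print(level)
```

With n target loci the function averages n - 1 differences, in line with the 100-loci, 99-differences example above.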
S130, taking an abundance threshold to be measured corresponding to the abundance fluctuation level which accords with the inflection point feature in the at least 3 abundance fluctuation levels as a first abundance threshold.
Specifically, an abundance fluctuation level curve is constructed with the abundance threshold to be detected as the abscissa and the abundance fluctuation level as the ordinate. As the abundance threshold to be detected increases, the abundance fluctuation level increases and then tends to be stable; the abundance threshold to be detected at the inflection point of this curve is taken as the first abundance threshold.
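The text does not state how the inflection point is located numerically. One possible reading, sketched below purely as an assumption, is to take the first abundance threshold after which the fluctuation level changes by less than a small tolerance. The function name `first_stable_threshold`, the tolerance `tol` and the sample values are all hypothetical.

```python
from typing import List


def first_stable_threshold(thresholds: List[float],
                           levels: List[float],
                           tol: float = 0.005) -> float:
    """Return the first threshold after which the curve has flattened.

    The curve is treated as flat once two successive fluctuation levels
    differ by less than `tol`. This criterion is an illustrative assumption,
    not the definition of the inflection point feature used in the patent.
    """
    if len(thresholds) != len(levels) or len(levels) < 3:
        raise ValueError("need matching lists with at least 3 points")
    for i in range(1, len(levels)):
        if abs(levels[i] - levels[i - 1]) < tol:
            return thresholds[i - 1]
    return thresholds[-1]  # curve never flattened within the tested range


# Hypothetical curve: large changes at first, then a plateau.
ths = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05]
lvls = [0.300, 0.120, 0.060, 0.052, 0.051, 0.050]
t_best = first_stable_threshold(ths, lvls)
print(t_best)  # 0.03
```

The check compares absolute changes, so it behaves the same whether the curve rises or falls towards its plateau.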
S140, determining genotypes corresponding to the SNP loci to be detected respectively based on the first abundance threshold and mutation abundance corresponding to the SNP loci to be detected respectively.
The mutation abundance, also called the mutant allele frequency, is the ratio of the number of reads carrying the mutant allele to the number of all reads at the SNP locus. Specifically, samtools mpileup is used to count the number of reads A carrying the mutant allele and the number of reads B carrying the wild-type allele at each SNP locus to be detected, and mutation abundance = A/(A+B). For example, alleles at SNP sites are denoted AAA, BBB, ABB and AAB, and the mutation abundances corresponding to each allele representation are 100%, 0%, 25% and 75%, respectively.
In this example, genotypes include heterozygous, wild-type and homozygous. Heterozygous describes an individual whose homologous chromosomes carry different alleles, such as AB. Wild type describes an individual whose homologous chromosomes both carry the wild-type allele, such as AA. Homozygous describes an individual whose homologous chromosomes both carry the mutant allele, such as BB.
As an alternative implementation, determining the genotype corresponding to each SNP locus to be detected based on the first abundance threshold and the mutation abundance corresponding to each SNP locus to be detected comprises calculating a second abundance threshold from the first abundance threshold, wherein the sum of the first abundance threshold and the second abundance threshold is one. Then, for each SNP locus to be detected: if its mutation abundance is less than or equal to the first abundance threshold, its genotype is wild type; if its mutation abundance is greater than or equal to the second abundance threshold, its genotype is homozygous; and if its mutation abundance is greater than the first abundance threshold and less than the second abundance threshold, its genotype is heterozygous.
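The formula mutation abundance = A/(A+B) can be written directly in code. In this minimal sketch the function name `mutation_abundance` and the read counts are hypothetical; in practice the counts would come from a pileup of the aligned reads.

```python
def mutation_abundance(mutant_reads: int, wild_reads: int) -> float:
    """Fraction of reads at a locus that carry the mutant allele: A / (A + B)."""
    total = mutant_reads + wild_reads
    if total == 0:
        raise ValueError("locus has no covering reads")
    return mutant_reads / total


# Hypothetical counts: 150 mutant reads and 50 wild-type reads.
af = mutation_abundance(150, 50)
print(af)  # 0.75
```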
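Specifically, the second abundance threshold = 1 - the first abundance threshold, and the genotype of the SNP site to be detected satisfies the following relationship:
\[
\text{genotype} =
\begin{cases}
\text{wild type}, & AF \le tBest \\
\text{heterozygous}, & tBest < AF < mBest \\
\text{homozygous}, & AF \ge mBest
\end{cases}
\qquad mBest = 1 - tBest
\]
wherein AF represents the mutation abundance, tBest represents the first abundance threshold, and mBest represents the second abundance threshold.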
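The three-way rule above maps onto a short function. This sketch follows the stated relationship between AF, tBest and mBest; the function name `call_genotype` and the example threshold of 0.1 are hypothetical.

```python
def call_genotype(af: float, t_best: float) -> str:
    """Classify a locus from its mutation abundance and the first threshold.

    m_best = 1 - t_best, so t_best is expected to be below 0.5.
    """
    if not 0.0 <= t_best < 0.5:
        raise ValueError("t_best must lie in [0, 0.5)")
    m_best = 1.0 - t_best
    if af <= t_best:
        return "wild-type"
    if af >= m_best:
        return "homozygous"
    return "heterozygous"


# Hypothetical first abundance threshold of 10%.
calls = [call_genotype(af, t_best=0.1) for af in (0.03, 0.48, 0.95)]
print(calls)  # ['wild-type', 'heterozygous', 'homozygous']
```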
The data in this example are based on high-throughput sequencing results from 100 real tissue samples. The tissue samples include samples with different tumor fractions and different levels of chromosomal stability, both uncontaminated and contaminated to different degrees. In parallel, control leukocyte samples from the same individuals as the tissue samples were sequenced by high-throughput sequencing to determine the true SNP genotypes as the performance reference standard. Unlike cancer tissue samples, leukocyte samples have a stable genome and are almost free of contamination from sample storage and preparation. To ensure the reliability of the reference standard, a simple model was used to call genotypes directly: approximately 0% is wild type, approximately 100% is homozygous, and approximately 50% is heterozygous. To allow for data noise, 5% abundance fluctuation is tolerated, i.e., an abundance of 0%-5% is wild type, 95%-100% is homozygous, and 47.5%-52.5% is heterozygous. For the tissue samples, we compared genotype identification based on two static threshold combinations with the dynamic threshold algorithm of the present invention.
Wherein the static threshold combination 1:
Wherein the static threshold combination 2:
Further, with heterozygous SNP sites taken as positive sites and non-heterozygous sites taken as negative sites, the concordance of the three strategies with the leukocyte reference standard was evaluated, as shown in Table 1 below. The dynamic threshold method performs well and stably in both sensitivity and specificity, and achieves the highest accuracy.
Table 1 is a list of the identity of genotype identification provided in accordance with example one of the present invention.
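As a hedged illustration of how such a concordance evaluation could be computed, the sketch below treats heterozygous calls as the positive class and derives sensitivity, specificity and accuracy against a reference. The function name `concordance` and the toy labels are made up for the example and are not data from Table 1.

```python
def concordance(called, reference):
    """Return (sensitivity, specificity, accuracy).

    Heterozygous sites are positives; wild-type and homozygous sites
    are negatives, as in the evaluation described above.
    """
    if len(called) != len(reference):
        raise ValueError("called and reference must have the same length")
    tp = fp = tn = fn = 0
    for c, r in zip(called, reference):
        c_pos = c == "heterozygous"
        r_pos = r == "heterozygous"
        if c_pos and r_pos:
            tp += 1
        elif c_pos:
            fp += 1
        elif r_pos:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    accuracy = (tp + tn) / len(called) if called else float("nan")
    return sensitivity, specificity, accuracy


# Toy labels (hypothetical, for illustration only)
called = ["heterozygous", "heterozygous", "wild type", "homozygous", "wild type"]
reference = ["heterozygous", "wild type", "wild type", "homozygous", "heterozygous"]
print(concordance(called, reference))
```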
According to the technical scheme of this embodiment, the abundance fluctuation level of the sample to be detected is determined based on the minor allele frequency corresponding to each SNP site to be detected and the obtained abundance thresholds to be tested, and the abundance threshold to be tested that corresponds to the abundance fluctuation level conforming to the inflection point feature among the at least 3 abundance fluctuation levels is taken as the first abundance threshold. The first abundance threshold is thus determined by the constructed dynamic abundance threshold model, and the genotype corresponding to each SNP site to be detected is determined from the first abundance threshold and the mutation abundance corresponding to each SNP site to be detected. This solves the problem that conventional genotype detection methods depend on paired sequencing data and improves the accuracy of genotype detection results.
Example two
Fig. 2 is a flowchart of a sample contamination detection method according to a second embodiment of the present invention. This embodiment is applicable to detecting and evaluating the contamination level of a single tissue sample. The method may be performed by a sample contamination detection apparatus, which may be implemented in hardware and/or software and may be configured in a terminal device. As shown in fig. 2, the method includes:
S210, obtaining at least 10 SNP loci to be detected in a sample to be detected, and determining genotypes corresponding to the SNP loci to be detected respectively.
The sample to be detected is a single tissue sample, such as a cancer tissue sample. In this example, the SNP sites to be detected are germline SNP sites. The human genome contains a large number of mutation sites, which can be classified by origin into germline mutations and somatic mutations. Germline mutations, also called germ cell mutations, originate from germ cells such as sperm or ova, so all cells in the body, including those in cancer tissue samples, usually carry them. Somatic mutations, also known as acquired mutations, arise during growth or under the influence of environmental factors and are usually carried by only a portion of the cells.
Specifically, after quality control is performed on the FASTQ file (raw off-machine data) containing the high-throughput sequencing reads of the sample to be detected, alignment software is used to align the reads to a human reference genome and generate a SAM file, and the SNP sites in the reads aligned to the human reference genome are taken as the SNP sites to be detected. By way of example, the alignment software may be BWA-MEM (0.7.10), and the human reference genome may be hg19 or b37. The SAM file is illustratively converted to a BAM file by samtools (0.1.19).
As an optional implementation, obtaining at least 10 SNP sites to be detected in the sample to be detected comprises taking at least 10 SNP sites in the sample whose population frequency falls within a preset population frequency range as the SNP sites to be detected, wherein the preset population frequency range comprises 0.4-0.6. The population frequency characterizes how often a SNP site occurs in population genomes; specifically, SNP sites aligned to the human reference genome whose population frequency is within 0.4-0.6 are taken as the SNP sites to be detected.
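A minimal sketch of this selection step is shown below. The record layout (a dict with a `pop_freq` key), the function name `select_snp_sites`, and the choice of inclusive bounds are assumptions made for illustration; the text only specifies the 0.4-0.6 range and the minimum of 10 sites.

```python
def select_snp_sites(sites, low=0.4, high=0.6, min_sites=10):
    """Keep SNP sites whose population frequency lies in [low, high].

    sites -- iterable of dicts with a hypothetical 'pop_freq' field
    Raises ValueError if fewer than min_sites sites remain.
    """
    selected = [s for s in sites if low <= s["pop_freq"] <= high]
    if len(selected) < min_sites:
        raise ValueError(
            f"only {len(selected)} sites in range, need at least {min_sites}"
        )
    return selected


# Toy input: 12 common sites and 3 rare ones (made-up values)
sites = [{"id": f"snp{i}", "pop_freq": 0.5} for i in range(12)]
sites += [{"id": f"rare{i}", "pop_freq": 0.1} for i in range(3)]
print(len(select_snp_sites(sites)))
```

Restricting to sites with a population frequency near 0.5 tends to maximize the chance that two unrelated individuals differ at a site, which is presumably why this range is useful for contamination detection.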
As an alternative implementation mode, determining genotypes corresponding to the SNP loci to be detected respectively comprises determining the genotypes corresponding to the SNP loci to be detected respectively by adopting a preset detection method, wherein the preset detection method comprises at least one of a sequencing method, a chip method and a mass spectrometry method.
In another embodiment, optionally, determining the genotype corresponding to each SNP site to be detected comprises: obtaining a first abundance threshold; determining the mutation abundance corresponding to each SNP site to be detected; and determining the genotype corresponding to each SNP site to be detected based on the first abundance threshold and the mutation abundance corresponding to each SNP site to be detected.
As an alternative embodiment, the first abundance threshold is preset by the user. For example, the first abundance threshold may be 0.01 or 0.03; its specific value is not limited here and may be set by the user according to actual needs.
Mutation abundance, also called mutant allele frequency, characterizes the ratio of the number of reads of the mutant allele to the total number of reads at the SNP site, where alleles are genes located at the same position on a pair of homologous chromosomes that control different forms of the same trait. Specifically, samtools mpileup is used to count the number of reads A of the mutant allele and the number of reads B of the wild-type allele at each SNP site to be detected, and mutation abundance = A/(A+B). For example, alleles at SNP sites are denoted AAA, BBB, ABB and AAB, and the mutation abundances corresponding to each allele representation are 100%, 0%, 25% and 75%, respectively.
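The ratio A/(A+B) can be written directly as a small helper. This is only a sketch: the function name `mutation_abundance` is hypothetical, the read counts are assumed to have been extracted already (for example from pileup output), and the zero-coverage check is an added safeguard.

```python
def mutation_abundance(mutant_reads, wild_reads):
    """Mutation abundance = A / (A + B) for one SNP site.

    mutant_reads -- number of reads supporting the mutant allele (A)
    wild_reads   -- number of reads supporting the wild-type allele (B)
    """
    total = mutant_reads + wild_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return mutant_reads / total


print(mutation_abundance(30, 70))  # 30 mutant reads out of 100
```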
In this example, genotypes include heterozygous, wild type and homozygous. Heterozygous characterizes individuals, such as AB, that carry different alleles on the homologous chromosomes. Wild type characterizes individuals, such as AA, that carry the wild-type allele on both homologous chromosomes, and homozygous characterizes individuals, such as BB, that carry the mutant allele on both homologous chromosomes.
As an alternative implementation, determining the genotype corresponding to each SNP site to be detected based on the first abundance threshold and the mutation abundance corresponding to each SNP site to be detected comprises: calculating a second abundance threshold based on the first abundance threshold, wherein the sum of the first abundance threshold and the second abundance threshold is one; and, for each SNP site to be detected, if the mutation abundance corresponding to the SNP site is less than or equal to the first abundance threshold, the genotype of the SNP site is wild type; if the mutation abundance is greater than or equal to the second abundance threshold, the genotype is homozygous; and if the mutation abundance is greater than the first abundance threshold and less than the second abundance threshold, the genotype is heterozygous.
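Specifically, the second abundance threshold = 1 - the first abundance threshold, and the genotype of the SNP site to be detected satisfies the following relationship:
\[
\text{genotype} =
\begin{cases}
\text{wild type}, & AF \le tBest \\
\text{heterozygous}, & tBest < AF < mBest \\
\text{homozygous}, & AF \ge mBest
\end{cases}
\qquad mBest = 1 - tBest
\]
where AF represents the mutation abundance, tBest represents the first abundance threshold, and mBest represents the second abundance threshold.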
S220, constructing a Gaussian mixture model based on the minor allele frequency corresponding to the SNP locus to be detected meeting the genotype condition.
In this embodiment, the genotypic conditions include the genotype being wild-type and/or the genotype being homozygous.
Specifically, a SNP site typically contains only two alleles. For a set of SNP sites to be detected, assume that 4 SNP sites to be detected contain the alleles AAA, BBB, ABB and AAB, respectively. Calculating mutation abundance requires the proportion of A or B in the total number of alleles; for example, if A is the mutant allele, the mutation abundances of the 4 SNP sites to be detected are 100%, 0%, 25% and 75%, respectively. For non-heterozygous SNP sites to be detected, only the degree of difference between the numbers of A and B matters, not the proportion of A or B in the total number of alleles, so the minor allele frequency meets this requirement well; the minor allele frequencies of the 4 SNP sites to be detected are 0%, 0%, 25% and 25%, respectively. By way of example, the mutation abundance is folded about 50% as the center, and the folded mutation abundance is equal to the minor allele frequency.
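Folding the mutation abundance about 50% amounts to taking the smaller of AF and 1 - AF. The sketch below shows this with the four abundances from the example above; the function name `minor_allele_frequency` is chosen for illustration.

```python
def minor_allele_frequency(af):
    """Fold a mutation abundance (fraction in [0, 1]) about 0.5."""
    if not 0.0 <= af <= 1.0:
        raise ValueError("af must be a fraction between 0 and 1")
    return min(af, 1.0 - af)


# Mutation abundances 100%, 0%, 25% and 75% from the example
abundances = [1.0, 0.0, 0.25, 0.75]
print([minor_allele_frequency(af) for af in abundances])
# -> [0.0, 0.0, 0.25, 0.25]
```

After folding, wild-type and homozygous sites both sit near 0, so any contamination shows up as a deviation in the same direction regardless of which allele the contaminant contributes.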
As an alternative embodiment, the Gaussian mixture model satisfies the formula:
where maf represents the minor allele frequency, δ² represents the variance of the minor allele frequency, N represents the number of contamination sources, α represents the contamination ratio, pbinom represents the Bernoulli probability distribution, P(c=i) represents the probability distribution corresponding to i contamination sources, and N represents the probability distribution of the Gaussian mixture model.
S230, performing an optimization solving operation on the Gaussian mixture model to obtain contamination status data corresponding to the sample to be detected.
Exemplary methods of optimization solving include, but are not limited to, at least one of maximum likelihood estimation, the expectation-maximization algorithm, and the Markov chain Monte Carlo algorithm. Specifically, an optimization function of the R package stats can be used to perform the optimization solving operation on the Gaussian mixture model with the L-BFGS-B algorithm, thereby obtaining the contamination status data.
In this embodiment, the contamination status data includes at least one of a contamination ratio, a minor allele frequency variance, and a number of contamination sources.
According to the technical scheme of this embodiment, at least 10 SNP sites to be detected in the sample to be detected are obtained and the genotype corresponding to each SNP site to be detected is determined; a Gaussian mixture model is constructed based on the minor allele frequencies corresponding to the non-heterozygous SNP sites to be detected; and an optimization solving operation is performed on the Gaussian mixture model to obtain the contamination status data corresponding to the sample to be detected. This solves the problems that existing sample contamination measurement methods are strongly affected by copy number variation/loss of heterozygosity and have difficulty detecting low-level contamination, and improves the accuracy of sample contamination measurement results.
Example three
Fig. 3 is a flowchart of a sample contamination detection method according to a third embodiment of the present invention, where the technical feature of "obtaining the first abundance threshold" in the foregoing embodiment is further refined. As shown in fig. 3, the method includes:
S310, obtaining at least 10 SNP loci to be detected in a sample to be detected and obtaining at least 3 abundance thresholds to be detected.
Specifically, the abundance threshold to be tested characterizes a screening threshold applied to the minor allele frequency. The differences between successive abundance thresholds to be tested may be the same or different. For example, a set of abundance thresholds to be tested th satisfies th ∈ {0%, 1%, 2%, ..., 10%, 30%}. The specific values of the abundance thresholds to be tested are not limited, and the user can set each one according to actual requirements.
S320, aiming at each abundance threshold to be detected, calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequency corresponding to each SNP locus to be detected.
Calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be tested and the minor allele frequency corresponding to each SNP site to be detected comprises: for each chromosome on which SNP sites to be detected are located, performing a segmentation operation on the chromosome based on the copy numbers corresponding to at least two SNP sites to be detected on the chromosome to obtain one or more chromosome segments, wherein the copy number difference between any two SNP sites to be detected in each chromosome segment is smaller than a preset difference threshold; for each chromosome segment, taking the SNP sites to be detected in the segment whose minor allele frequency is greater than the abundance threshold to be tested as target SNP sites; and calculating the abundance fluctuation level of the sample to be detected based on the minor allele frequencies corresponding to at least two target SNP sites.
Among these, the types of chromosomes include autosomes and/or sex chromosomes. Of the 23 pairs of chromosomes in humans, 22 pairs are autosomes and 1 pair is sex chromosome. The chromosomes in this example include all chromosomes that may have heterozygous variations.
In particular, the copy number is used to characterize the number of occurrences of a gene or a gene sequence in the genome, and can also be said to be the number of times a gene fragment is repeated in the genome. Normally, the copy number is 2. When copy number variation occurs, the genome rearranges, and the copy number of the gene changes from 2 to n, where n is a natural number that is not equal to 2.
Specifically, the coverage depth of each SNP site to be detected is calculated by GATK, and the copy number of the SNP site is calculated based on the coverage depth. The coverage depth refers to the ratio of the total number of bases (bp) obtained by sequencing to the genome size. In this example, the coverage depth of the SNP sites to be detected is higher than 200X.
By way of example, the CBS (circular binary segmentation) algorithm is employed to segment a chromosome into one or more chromosome segments based on copy number. The CBS algorithm may be implemented by the R package PSCBS. In this embodiment, the copy number difference between any two SNP sites to be detected in each chromosome segment is smaller than a preset difference threshold, where the preset difference threshold may be 2. For example, if the copy numbers of 4 SNP sites to be detected on a chromosome are 2, 3, 5 and 6, respectively, chromosome segment a contains the 2 SNP sites to be detected with copy numbers 2 and 3, and chromosome segment b contains the 2 SNP sites to be detected with copy numbers 5 and 6.
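The grouping constraint in the example above can be illustrated with a much simpler procedure than CBS. The sketch below is not circular binary segmentation and is not a substitute for PSCBS; it is a greedy grouping, assumed here purely for illustration, that only enforces the stated rule that any two sites in a group differ in copy number by less than the preset threshold. The tuple layout and the function name `group_by_copy_number` are hypothetical.

```python
def group_by_copy_number(sites, max_diff=2):
    """Greedily group sites so copy numbers within a group differ by < max_diff.

    sites -- list of (name, position, copy_number) tuples
    Returns a list of groups, each ordered by chromosomal position.
    """
    groups = []
    # Walk through the sites from lowest to highest copy number.
    for site in sorted(sites, key=lambda s: s[2]):
        # groups[-1][0] holds the lowest copy number of the current group.
        if groups and site[2] - groups[-1][0][2] < max_diff:
            groups[-1].append(site)
        else:
            groups.append([site])
    # Restore positional order inside each group.
    return [sorted(g, key=lambda s: s[1]) for g in groups]


# Copy numbers 2, 3, 5 and 6 from the example (positions are made up)
sites = [("s1", 100, 2), ("s2", 200, 3), ("s3", 300, 5), ("s4", 400, 6)]
for group in group_by_copy_number(sites):
    print([name for name, _, _ in group])
```

On this input the sketch yields one group with copy numbers 2 and 3 and another with 5 and 6, matching the example; a real segmentation algorithm would additionally account for noise in the copy number estimates.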
As an alternative implementation, calculating the abundance fluctuation level of the sample to be detected based on the minor allele frequencies corresponding to at least two target SNP sites comprises: sorting the target SNP sites based on their positions in the chromosome; based on the sorting result, taking the difference between the minor allele frequency of the current target SNP site and the minor allele frequency of the next target SNP site as an abundance difference; and averaging at least one abundance difference to obtain the abundance fluctuation level of the sample to be detected.
By way of example, assume that 4 target SNP sites on the chromosome are, in order, target SNP site A, target SNP site B, target SNP site C and target SNP site D, with corresponding copy numbers 2, 6, 5 and 3, respectively. Chromosome segment a then contains, in order, target SNP site A and target SNP site D, and chromosome segment b contains, in order, target SNP site B and target SNP site C.
For example, assuming that chromosome segment a contains 100 target SNP sites, 99 abundance differences are calculated, and these are averaged to obtain the abundance fluctuation level corresponding to chromosome segment a.
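The per-segment calculation described above can be sketched as follows. One detail is an assumption: the text speaks only of a "difference" between neighbouring minor allele frequencies, and the sketch uses the absolute difference, since signed differences would largely cancel when averaged. The function name `abundance_fluctuation` and the tuple layout are also hypothetical.

```python
def abundance_fluctuation(sites, threshold):
    """Mean absolute MAF difference between neighbouring target SNP sites.

    sites     -- list of (position, maf) tuples from one chromosome segment
    threshold -- abundance threshold to be tested; only sites with
                 maf > threshold are kept as target SNP sites
    Returns None when fewer than two target sites remain.
    """
    targets = sorted((pos, maf) for pos, maf in sites if maf > threshold)
    if len(targets) < 2:
        return None
    # n target sites give n - 1 neighbouring differences.
    diffs = [abs(nxt[1] - cur[1]) for cur, nxt in zip(targets, targets[1:])]
    return sum(diffs) / len(diffs)


# Made-up segment: the site with maf 0.02 is filtered out at a 5% threshold
segment = [(300, 0.30), (100, 0.10), (200, 0.02), (400, 0.20)]
print(abundance_fluctuation(segment, 0.05))
```

Repeating this for every abundance threshold to be tested gives one fluctuation level per threshold, which is the curve used to pick the first abundance threshold.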
S330, taking an abundance threshold to be measured corresponding to the abundance fluctuation level which accords with the inflection point feature in the at least 3 abundance fluctuation levels as a first abundance threshold.
Specifically, an abundance fluctuation level curve is constructed with the abundance threshold to be tested as the abscissa and the abundance fluctuation level as the ordinate. As the abundance threshold to be tested increases, the abundance fluctuation level first rises and then tends to be stable.
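The text does not state how the inflection point is located on this curve, so the following is only one possible heuristic, offered as a sketch under that assumption: it returns the first threshold after which the curve changes by no more than a small fraction `tol` of its total range. The function name `pick_first_threshold`, the parameter `tol`, and the toy curve are all hypothetical.

```python
def pick_first_threshold(thresholds, levels, tol=0.05):
    """Return the threshold where a rise-then-plateau curve levels off.

    thresholds -- abundance thresholds to be tested, in increasing order
    levels     -- abundance fluctuation level for each threshold
    tol        -- assumed tolerance: a step counts as flat when it is
                  at most tol times the total range of the curve
    """
    if len(thresholds) != len(levels) or len(levels) < 3:
        raise ValueError("need at least 3 thresholds with matching levels")
    total_range = max(levels) - min(levels)
    if total_range == 0:
        return thresholds[0]
    for i in range(len(levels) - 1):
        if abs(levels[i + 1] - levels[i]) <= tol * total_range:
            return thresholds[i]
    return thresholds[-1]


# Toy curve (made-up values): rises until 2%, then stays flat
thresholds = [0.00, 0.01, 0.02, 0.03, 0.04]
levels = [0.010, 0.030, 0.050, 0.051, 0.051]
print(pick_first_threshold(thresholds, levels))
```

Other knee-finding criteria, such as the point of maximum curvature, would be equally compatible with the description.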
S340, determining genotypes corresponding to the SNP loci to be detected respectively based on the first abundance threshold and mutation abundance corresponding to the SNP loci to be detected respectively.
S350, constructing a Gaussian mixture model based on the minor allele frequency corresponding to the SNP locus to be detected meeting the genotype condition.
S360, performing an optimization solving operation on the Gaussian mixture model to obtain contamination status data corresponding to the sample to be detected.
The third embodiment of the present invention also provides results obtained by applying the sample contamination detection method of this embodiment to simulated data derived from 20 groups of human cancer tissue samples.
Specifically, the data of 20 groups of human cancer tissue samples were mixed according to preset contamination ratios to simulate contamination and obtain the samples to be detected. Table 2 below lists the sample contamination specifications provided in example three of the present invention.
Table 3 below shows the contamination ratio data obtained by the sample contamination detection method according to the third embodiment of the present invention.

| Contamination level | Contamination detection rate | Predicted contamination ratio |
| --- | --- | --- |
| 0.5% | 100% | 0.47% |
| 1% | 100% | 0.81% |
| 5% | 100% | 5.2% |
| 10% | 100% | 9.6% |
| 20% | 100% | 18% |
As can be seen from Table 3, the method of this embodiment is applicable to contamination from a single sample source and from multiple sample sources, stably detects contamination at a level of 0.5%, and predicts the contamination level accurately.
Table 4 below shows the contamination detection rates obtained at different contamination levels with different detection methods, provided in example three of the present invention.
Table 5 below shows the contamination ratio data obtained with the existing conPair (v 0.2) method, provided in example three of the present invention.
Tables 4 and 5 show that the existing ART-DeCo (v 1.1) method cannot identify contaminated samples with a low contamination level. The existing conPair (v 0.2) method can detect contaminated samples with low contamination levels, but for contaminated samples containing a large number of copy number variations it yields a contamination ratio significantly higher than the actual contamination ratio.
Therefore, the sample contamination detection method provided by the embodiment of the invention has higher detection sensitivity and evaluates the contamination level more accurately than comparable software.
According to the technical scheme of this embodiment, the abundance fluctuation level of the sample to be detected is determined based on the minor allele frequency corresponding to each SNP site to be detected and the obtained abundance thresholds to be tested, and the abundance threshold to be tested that corresponds to the abundance fluctuation level conforming to the inflection point feature among the at least 3 abundance fluctuation levels is taken as the first abundance threshold. The first abundance threshold is thus determined by the constructed dynamic abundance threshold model, and the genotype corresponding to each SNP site to be detected is determined from the first abundance threshold and the mutation abundance corresponding to each SNP site to be detected. This solves the problem that conventional genotype detection methods depend on paired sequencing data, improves the accuracy of genotype detection results, and in turn further improves the accuracy of sample contamination detection results.
Example four
Fig. 4 is a schematic structural diagram of a genotype detection device according to a fourth embodiment of the present invention. As shown in fig. 4, the device comprises a to-be-detected SNP site acquisition module 410, an abundance fluctuation level determination module 420, a first abundance threshold determination module 430 and a genotype determination module 440.
The to-be-detected SNP site acquisition module 410 is configured to acquire at least 10 SNP sites to be detected in a sample to be detected and acquire at least 3 abundance thresholds to be tested;
the abundance fluctuation level determination module 420 is configured to calculate, for each abundance threshold to be tested, an abundance fluctuation level of the sample to be detected based on the abundance threshold to be tested and the minor allele frequency corresponding to each SNP site to be detected;
the first abundance threshold determination module 430 is configured to take, as a first abundance threshold, the abundance threshold to be tested that corresponds to the abundance fluctuation level conforming to the inflection point feature among the at least 3 abundance fluctuation levels;
the genotype determination module 440 is configured to determine the genotype corresponding to each SNP site to be detected based on the first abundance threshold and the mutation abundance corresponding to each SNP site to be detected.
According to the technical scheme of this embodiment, the abundance fluctuation level of the sample to be detected is determined based on the minor allele frequency corresponding to each SNP site to be detected and the obtained abundance thresholds to be tested, and the abundance threshold to be tested that corresponds to the abundance fluctuation level conforming to the inflection point feature among the at least 3 abundance fluctuation levels is taken as the first abundance threshold. The first abundance threshold is thus determined by the constructed dynamic abundance threshold model, and the genotype corresponding to each SNP site to be detected is determined from the first abundance threshold and the mutation abundance corresponding to each SNP site to be detected. This solves the problem that conventional genotype detection methods depend on paired sequencing data and improves the accuracy of genotype detection results.
Based on the above embodiments, optionally, the abundance fluctuation level determination module 420 includes:
a chromosome segment determining unit, configured to perform, for each chromosome, a segmentation operation on the chromosome based on the copy numbers corresponding to at least two SNP sites to be detected on the chromosome to obtain one or more chromosome segments, wherein the copy number difference between any two SNP sites to be detected in each chromosome segment is smaller than a preset difference threshold;
a target SNP site determining unit, configured to take the SNP sites to be detected in the chromosome segment whose minor allele frequency is greater than the abundance threshold to be tested as target SNP sites;
an abundance fluctuation level unit, configured to calculate the abundance fluctuation level of the sample to be detected based on the minor allele frequencies corresponding to the at least two target SNP sites.
On the basis of the above embodiment, optionally, the abundance fluctuation level unit is specifically configured to:
sort the target SNP sites based on the position of each target SNP site in the chromosome;
based on the sorting result, take the difference between the minor allele frequency of the current target SNP site and the minor allele frequency of the next target SNP site as an abundance difference;
and average at least one abundance difference to obtain the abundance fluctuation level of the sample to be detected.
Based on the above embodiments, the genotype determining module 440 is optionally specifically configured to:
calculate a second abundance threshold based on the first abundance threshold, wherein the sum of the first abundance threshold and the second abundance threshold is one;
for each SNP site to be detected, if the mutation abundance corresponding to the SNP site to be detected is less than or equal to the first abundance threshold, determine that the genotype of the SNP site to be detected is wild type;
if the mutation abundance corresponding to the SNP site to be detected is greater than or equal to the second abundance threshold, determine that the genotype of the SNP site to be detected is homozygous;
and if the mutation abundance corresponding to the SNP site to be detected is greater than the first abundance threshold and less than the second abundance threshold, determine that the genotype of the SNP site to be detected is heterozygous.
Based on the above embodiments, optionally, the to-be-detected SNP site acquisition module 410 is specifically configured to:
take at least 10 SNP sites in the sample to be detected whose population frequency falls within a preset population frequency range as the SNP sites to be detected, wherein the preset population frequency range comprises 0.4-0.6.
The genotype detection device provided by the embodiment of the invention can execute the genotype detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example five
Fig. 5 is a schematic structural diagram of a sample contamination detection apparatus according to a fifth embodiment of the present invention. As shown in fig. 5, the apparatus includes a genotype determination module 510, a Gaussian mixture model construction module 520, and a contamination status data determination module 530.
The genotype determination module 510 is configured to obtain at least 10 SNP sites to be detected in a sample to be detected, and determine the genotype corresponding to each SNP site to be detected;
the Gaussian mixture model construction module 520 is configured to construct a Gaussian mixture model based on the minor allele frequencies corresponding to the SNP sites to be detected that satisfy a genotype condition, wherein the genotype condition comprises the genotype being wild type and/or the genotype being homozygous;
the contamination status data determination module 530 is configured to perform an optimization solving operation on the Gaussian mixture model to obtain contamination status data corresponding to the sample to be detected, where the contamination status data includes at least one of a contamination ratio, a minor allele frequency variance, and a number of contamination sources.
According to the technical scheme of this embodiment, at least 10 SNP sites to be detected in the sample to be detected are obtained and the genotype corresponding to each SNP site to be detected is determined; a Gaussian mixture model is constructed based on the minor allele frequencies corresponding to the non-heterozygous SNP sites to be detected; and an optimization solving operation is performed on the Gaussian mixture model to obtain the contamination status data corresponding to the sample to be detected. This solves the problems that existing sample contamination measurement methods are strongly affected by copy number variation/loss of heterozygosity and have difficulty detecting low-level contamination, and improves the accuracy of sample contamination measurement results.
On the basis of the above embodiment, optionally, the Gaussian mixture model satisfies the formula:
where maf represents the minor allele frequency, δ² represents the variance of the minor allele frequency, N represents the number of contamination sources, α represents the contamination ratio, pbinom represents the Bernoulli probability distribution, P(c=i) represents the probability distribution corresponding to i contamination sources, and N represents the probability distribution of the Gaussian mixture model.
Based on the above embodiment, optionally, the genotype determining module 510 includes:
The first abundance threshold acquisition unit is used for acquiring a first abundance threshold and determining mutation abundance corresponding to each SNP locus to be detected;
The genotype determining unit is used for determining genotypes corresponding to the SNP loci to be detected respectively based on the first abundance threshold and mutation abundance corresponding to the SNP loci to be detected respectively.
On the basis of the above embodiments, optionally, the genotype determining unit is specifically configured to:
calculate a second abundance threshold based on the first abundance threshold, wherein the sum of the first abundance threshold and the second abundance threshold is one;
for each SNP site to be detected, if the mutation abundance corresponding to the SNP site to be detected is less than or equal to the first abundance threshold, determine that the genotype of the SNP site to be detected is wild type;
if the mutation abundance corresponding to the SNP site to be detected is greater than or equal to the second abundance threshold, determine that the genotype of the SNP site to be detected is homozygous;
and if the mutation abundance corresponding to the SNP site to be detected is greater than the first abundance threshold and less than the second abundance threshold, determine that the genotype of the SNP site to be detected is heterozygous.
On the basis of the above embodiment, optionally, the first abundance threshold acquiring unit includes:
a to-be-tested abundance threshold obtaining subunit, configured to obtain at least 3 abundance thresholds to be tested;
an abundance fluctuation level determining subunit, configured to calculate, for each abundance threshold to be tested, the abundance fluctuation level of the sample to be detected based on the abundance threshold to be tested and the minor allele frequency corresponding to each SNP site to be detected;
a first abundance threshold determining subunit, configured to take, as the first abundance threshold, the abundance threshold to be tested that corresponds to the abundance fluctuation level conforming to the inflection point feature among the at least 3 abundance fluctuation levels.
Based on the above embodiments, optionally, the abundance fluctuation level determining subunit is specifically configured to:
for each chromosome on which the SNP loci to be detected are located, performing a segmentation operation on the chromosome based on the copy numbers corresponding to at least two SNP loci to be detected on the chromosome, to obtain one or more chromosome segments, wherein the copy number difference between any two SNP loci to be detected in each chromosome segment is smaller than a preset difference threshold;
for each chromosome segment, taking each SNP locus to be detected in the chromosome segment whose minor allele frequency is greater than the abundance threshold to be tested as a target SNP locus;
and calculating the abundance fluctuation level of the sample to be detected based on the minor allele frequencies respectively corresponding to at least two target SNP loci.
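The segmentation and target selection steps can be sketched as below. The source does not name a segmentation algorithm, so a greedy left-to-right pass is assumed here; the `Snp` record and both function names are hypothetical. The pass keeps the copy number range of each segment below the preset difference threshold, which guarantees the pairwise condition stated above.

```python
from typing import List, NamedTuple, Sequence


class Snp(NamedTuple):
    """Hypothetical record for one SNP locus on a single chromosome."""
    position: int
    copy_number: float
    maf: float  # minor allele frequency


def segment_by_copy_number(snps: Sequence[Snp], max_diff: float) -> List[List[Snp]]:
    """Greedily split one chromosome into segments (assumed strategy).

    Within each segment, max(copy_number) - min(copy_number) < max_diff,
    so any two loci in a segment differ by less than max_diff.
    """
    segments: List[List[Snp]] = []
    current: List[Snp] = []
    low = high = 0.0
    for snp in sorted(snps, key=lambda s: s.position):
        if current:
            new_low = min(low, snp.copy_number)
            new_high = max(high, snp.copy_number)
            if new_high - new_low < max_diff:
                current.append(snp)
                low, high = new_low, new_high
                continue
            segments.append(current)  # close the segment and start a new one
        current = [snp]
        low = high = snp.copy_number
    if current:
        segments.append(current)
    return segments


def select_target_snps(segment: Sequence[Snp], abundance_threshold: float) -> List[Snp]:
    """Keep loci whose minor allele frequency exceeds the candidate threshold."""
    return [snp for snp in segment if snp.maf > abundance_threshold]
```

Restricting the comparison to loci within one copy-number segment presumably keeps copy number changes from being mistaken for abundance fluctuation.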
Based on the above embodiments, optionally, the abundance fluctuation level determining subunit is specifically configured to:
sorting the target SNP loci based on the position of each target SNP locus in the chromosome;
based on the sorting result, taking the difference between the minor allele frequency of the current target SNP locus and the minor allele frequency of the next target SNP locus as an abundance difference;
and averaging at least one abundance difference to obtain the abundance fluctuation level of the sample to be detected.
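A minimal sketch of this calculation follows. Two points are assumptions rather than statements from the source: the difference between neighbouring loci is taken as an absolute value (a signed difference would largely cancel when averaged), and the input is a list of hypothetical `(position, maf)` pairs for the target loci of one segment.

```python
from typing import Sequence, Tuple


def abundance_fluctuation_level(targets: Sequence[Tuple[int, float]]) -> float:
    """Mean difference in minor allele frequency between neighbouring loci.

    targets: (position, maf) pairs for the target SNP loci.
    Assumption: the absolute difference is used; the source only says "difference".
    """
    if len(targets) < 2:
        raise ValueError("at least two target SNP loci are required")

    # Order the loci by their position on the chromosome.
    ordered = sorted(targets, key=lambda item: item[0])

    # Difference between each locus and the next one in positional order.
    diffs = [abs(nxt[1] - cur[1]) for cur, nxt in zip(ordered, ordered[1:])]
    return sum(diffs) / len(diffs)
```

For loci at positions 100, 200 and 300 with minor allele frequencies 0.50, 0.45 and 0.40, both neighbouring differences are 0.05, so the fluctuation level is approximately 0.05.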
Based on the above embodiment, optionally, the genotype determining module 510 includes:
an SNP locus determination unit, configured to take at least 10 SNP loci in the sample to be detected whose population frequency falls within a preset population frequency range as the SNP loci to be detected, wherein the preset population frequency range includes 0.4-0.6.
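The locus selection can be sketched as a simple filter. The function name, the dictionary input keyed by a hypothetical locus identifier, and the inclusive treatment of the range boundaries are assumptions; the 0.4-0.6 range and the minimum of 10 loci come from the text above.

```python
from typing import Dict, List


def select_snps_to_detect(population_freq: Dict[str, float],
                          low: float = 0.4,
                          high: float = 0.6,
                          minimum: int = 10) -> List[str]:
    """Return the loci whose population frequency lies in [low, high].

    Assumption: both range boundaries are treated as inclusive.
    """
    selected = [snp_id for snp_id, freq in population_freq.items()
                if low <= freq <= high]
    if len(selected) < minimum:
        raise ValueError(
            f"only {len(selected)} loci pass the filter; {minimum} are required")
    return selected
```

Loci with a population frequency near 0.5 are the ones most likely to differ between two individuals, which is presumably why this range is preferred.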
The sample pollution detection device provided by the embodiment of the invention can execute the sample pollution detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example six
Fig. 6 is a schematic structural diagram of an electronic device according to a sixth embodiment of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 6, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including an input unit 16, such as a keyboard, mouse, etc., an output unit 17, such as various types of displays, speakers, etc., a storage unit 18, such as a magnetic disk, optical disk, etc., and a communication unit 19, such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the respective methods and processes described above, such as the genotype detection method or the sample contamination detection method.
In some embodiments, the genotype detection method or sample contamination detection method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the genotype detection method or sample contamination detection method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the genotype detection method or the sample contamination detection method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
The computer program for implementing the genotyping method or the sample contamination detection method of the invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
Example seven
The seventh embodiment of the present invention also provides a computer-readable storage medium storing computer instructions for causing a processor to execute a genotype detection method, the method comprising:
acquiring at least 10 SNP loci to be detected in a sample to be detected, and acquiring at least 3 abundance thresholds to be tested;
for each abundance threshold to be tested, calculating the abundance fluctuation level of the sample to be detected based on that abundance threshold and the minor allele frequency corresponding to each SNP locus to be detected;
taking, as a first abundance threshold, the abundance threshold to be tested corresponding to the abundance fluctuation level that meets the inflection point feature among the at least 3 abundance fluctuation levels;
and determining the genotype corresponding to each SNP locus to be detected based on the first abundance threshold and the mutation abundance corresponding to each SNP locus to be detected.
Alternatively, the computer instructions cause a processor to perform a sample contamination detection method, the method comprising:
obtaining at least 10 SNP loci to be detected in a sample to be detected, and determining the genotype corresponding to each SNP locus to be detected;
constructing a Gaussian mixture model based on the minor allele frequencies corresponding to the SNP loci to be detected that meet a genotype condition, wherein the genotype condition comprises that the genotype is wild type and/or the genotype is homozygous;
and performing an optimization solving operation on the Gaussian mixture model to obtain contamination state data corresponding to the sample to be detected, wherein the contamination state data comprises at least one of a contamination proportion, a minor allele frequency variance, and the number of contamination sources.
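The general idea of fitting such a mixture can be sketched as below. The formula of the patented Gaussian mixture model is not reproduced in this text, so everything in the sketch is an assumption used for illustration only: wild-type loci are considered, the number of contamination sources carrying the alternative allele follows a binomial distribution with a hypothetical carrier probability of 0.5, the expected minor allele frequency for i carriers out of n sources is α·i/n, and only the contamination proportion α is fitted (by grid search) while the variance and the number of sources are held fixed.

```python
import math
from typing import Sequence, Tuple


def normal_pdf(x: float, mean: float, variance: float) -> float:
    """Density of a Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def binomial_pmf(i: int, n: int, p: float) -> float:
    """Probability of i successes in n independent trials."""
    return math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)


def log_likelihood(mafs: Sequence[float], alpha: float, variance: float,
                   n_sources: int, carrier_prob: float = 0.5) -> float:
    """Log-likelihood of the assumed mixture (not the patent's formula)."""
    total = 0.0
    for maf in mafs:
        density = sum(
            binomial_pmf(i, n_sources, carrier_prob)
            * normal_pdf(maf, alpha * i / n_sources, variance)
            for i in range(n_sources + 1)
        )
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total


def fit_contamination(mafs: Sequence[float], n_sources: int = 1,
                      variance: float = 1e-4, grid_steps: int = 100,
                      max_alpha: float = 0.5) -> Tuple[float, float]:
    """Grid-search the contamination proportion that maximises the likelihood."""
    best_alpha, best_ll = 0.0, float("-inf")
    for step in range(grid_steps + 1):
        alpha = max_alpha * step / grid_steps
        ll = log_likelihood(mafs, alpha, variance, n_sources)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha, best_ll
```

A fuller treatment would also optimise the variance and compare several values for the number of sources, since the text above lists all three as possible outputs; a general-purpose optimiser could replace the grid search.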
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), a blockchain network, and the Internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (12)

1. A genotype detection method, comprising:
obtaining at least 10 SNP sites to be tested in a sample to be tested, and obtaining at least 3 abundance thresholds to be tested;
for each abundance threshold to be tested, calculating the abundance fluctuation level of the sample to be tested based on the abundance threshold to be tested and the minor allele frequency corresponding to each SNP site to be tested;
taking, as a first abundance threshold, the abundance threshold to be tested corresponding to the abundance fluctuation level that meets the inflection point feature among the at least 3 abundance fluctuation levels;
determining the genotype corresponding to each SNP site to be tested based on the first abundance threshold and the mutation abundance corresponding to each SNP site to be tested;
wherein calculating the abundance fluctuation level of the sample to be tested based on the abundance threshold to be tested and the minor allele frequency corresponding to each SNP site to be tested comprises:
for each chromosome on which the SNP sites to be tested are located, performing a segmentation operation on the chromosome based on the copy numbers corresponding to at least two SNP sites to be tested on the chromosome, to obtain one or more chromosome segments, wherein the copy number difference between any two SNP sites to be tested in each chromosome segment is less than a preset difference threshold;
for each chromosome segment, taking each SNP site to be tested in the chromosome segment whose minor allele frequency is greater than the abundance threshold to be tested as a target SNP site;
calculating the abundance fluctuation level of the sample to be tested based on the minor allele frequencies corresponding to at least two target SNP sites.

2. The method according to claim 1, wherein calculating the abundance fluctuation level of the sample to be tested based on the minor allele frequencies corresponding to at least two target SNP sites comprises:
sorting the target SNP sites based on the position of each target SNP site in the chromosome;
based on the sorting result, taking the difference between the minor allele frequency of the current target SNP site and the minor allele frequency of the next target SNP site as an abundance difference;
averaging at least one abundance difference to obtain the abundance fluctuation level of the sample to be tested.

3. The method according to claim 1, wherein determining the genotype corresponding to each SNP site to be tested based on the first abundance threshold and the mutation abundance corresponding to each SNP site to be tested comprises:
calculating a second abundance threshold based on the first abundance threshold, wherein the sum of the first abundance threshold and the second abundance threshold is one;
for each SNP site to be tested, if the mutation abundance corresponding to the SNP site to be tested is less than or equal to the first abundance threshold, the genotype of the SNP site to be tested is wild type;
if the mutation abundance corresponding to the SNP site to be tested is greater than or equal to the second abundance threshold, the genotype of the SNP site to be tested is homozygous;
if the mutation abundance corresponding to the SNP site to be tested is greater than the first abundance threshold and less than the second abundance threshold, the genotype of the SNP site to be tested is heterozygous.

4. The method according to claim 1, wherein obtaining at least 10 SNP sites to be tested in the sample to be tested comprises:
taking at least 10 SNP sites in the sample to be tested whose population frequency falls within a preset population frequency range as the SNP sites to be tested, wherein the preset population frequency range includes 0.4-0.6.

5. A sample contamination detection method, comprising:
obtaining at least 10 SNP sites to be tested in a sample to be tested, and determining the genotype corresponding to each SNP site to be tested;
constructing a Gaussian mixture model based on the minor allele frequencies corresponding to the SNP sites to be tested that meet a genotype condition, wherein the genotype condition includes the genotype being wild type and/or the genotype being homozygous;
performing an optimization solving operation on the Gaussian mixture model to obtain contamination status data corresponding to the sample to be tested;
wherein the contamination status data includes at least one of a contamination proportion, a minor allele frequency variance, and the number of contamination sources;
the Gaussian mixture model satisfies the formula:
[formulas not reproduced in this text]
wherein maf represents the minor allele frequency, δ² represents the variance of the minor allele frequency, n represents the number of contamination sources, α represents the contamination proportion, pbinom represents the Bernoulli probability distribution, P(C=i) represents the probability distribution corresponding to i contamination sources, and N represents the probability distribution of the Gaussian mixture model;
determining the genotype corresponding to each SNP site to be tested comprises:
obtaining a first abundance threshold, and determining the mutation abundance corresponding to each SNP site to be tested;
determining the genotype corresponding to each SNP site to be tested based on the first abundance threshold and the mutation abundance corresponding to each SNP site to be tested;
obtaining the first abundance threshold comprises:
obtaining at least three abundance thresholds to be tested;
for each abundance threshold to be tested, calculating the abundance fluctuation level of the sample to be tested based on the abundance threshold to be tested and the minor allele frequency corresponding to each SNP site to be tested;
taking, as the first abundance threshold, the abundance threshold to be tested corresponding to the abundance fluctuation level that meets the inflection point feature among the at least 3 abundance fluctuation levels;
calculating the abundance fluctuation level of the sample to be tested based on the abundance threshold to be tested and the minor allele frequency corresponding to each SNP site to be tested comprises:
for each chromosome on which the SNP sites to be tested are located, performing a segmentation operation on the chromosome based on the copy numbers corresponding to at least two SNP sites to be tested on the chromosome, to obtain one or more chromosome segments, wherein the copy number difference between any two SNP sites to be tested in each chromosome segment is less than a preset difference threshold;
for each chromosome segment, taking each SNP site to be tested in the chromosome segment whose minor allele frequency is greater than the abundance threshold to be tested as a target SNP site;
calculating the abundance fluctuation level of the sample to be tested based on the minor allele frequencies corresponding to at least two target SNP sites.

6. The method according to claim 5, wherein determining the genotype corresponding to each SNP site to be tested based on the first abundance threshold and the mutation abundance corresponding to each SNP site to be tested comprises:
calculating a second abundance threshold based on the first abundance threshold, wherein the sum of the first abundance threshold and the second abundance threshold is one;
for each SNP site to be tested, if the mutation abundance corresponding to the SNP site to be tested is less than or equal to the first abundance threshold, the genotype of the SNP site to be tested is wild type;
if the mutation abundance corresponding to the SNP site to be tested is greater than or equal to the second abundance threshold, the genotype of the SNP site to be tested is homozygous;
if the mutation abundance corresponding to the SNP site to be tested is greater than the first abundance threshold and less than the second abundance threshold, the genotype of the SNP site to be tested is heterozygous.

7. The method according to claim 5, wherein calculating the abundance fluctuation level of the sample to be tested based on the minor allele frequencies corresponding to at least two target SNP sites comprises:
sorting the target SNP sites based on the position of each target SNP site in the chromosome;
based on the sorting result, taking the difference between the minor allele frequency of the current target SNP site and the minor allele frequency of the next target SNP site as an abundance difference;
averaging at least one abundance difference to obtain the abundance fluctuation level of the sample to be tested.

8. The method according to claim 5, wherein obtaining at least 10 SNP sites to be tested in the sample to be tested comprises:
taking at least 10 SNP sites in the sample to be tested whose population frequency falls within a preset population frequency range as the SNP sites to be tested, wherein the preset population frequency range includes 0.4-0.6.

9. A genotype detection device, comprising:
a to-be-tested SNP site acquisition module, configured to obtain at least 10 SNP sites to be tested in a sample to be tested and to obtain at least 3 abundance thresholds to be tested;
an abundance fluctuation level determination module, configured to calculate, for each abundance threshold to be tested, the abundance fluctuation level of the sample to be tested based on the abundance threshold to be tested and the minor allele frequency corresponding to each SNP site to be tested;
a first abundance threshold determination module, configured to use, as a first abundance threshold, the abundance threshold to be tested corresponding to the abundance fluctuation level that meets the inflection point feature among the at least 3 abundance fluctuation levels;
a genotype determination module, configured to determine the genotype corresponding to each SNP site to be tested based on the first abundance threshold and the mutation abundance corresponding to each SNP site to be tested;
wherein the abundance fluctuation level determination module comprises:
a chromosome segment determination unit, configured to, for each chromosome on which the SNP sites to be tested are located, perform a segmentation operation on the chromosome based on the copy numbers corresponding to at least two SNP sites to be tested on the chromosome, to obtain one or more chromosome segments, wherein the copy number difference between any two SNP sites to be tested in each chromosome segment is less than a preset difference threshold;
a target SNP site determination unit, configured to, for each chromosome segment, take each SNP site to be tested in the chromosome segment whose minor allele frequency is greater than the abundance threshold to be tested as a target SNP site;
an abundance fluctuation level unit, configured to calculate the abundance fluctuation level of the sample to be tested based on the minor allele frequencies corresponding to at least two target SNP sites.

10. A sample contamination detection device, comprising:
a genotype determination module, configured to obtain at least 10 SNP sites to be tested in a sample to be tested and to determine the genotype corresponding to each SNP site to be tested;
a Gaussian mixture model construction module, configured to construct a Gaussian mixture model based on the minor allele frequencies corresponding to the SNP sites to be tested that meet a genotype condition, wherein the genotype condition includes the genotype being wild type and/or the genotype being homozygous;
a contamination status data determination module, configured to perform an optimization solving operation on the Gaussian mixture model to obtain contamination status data corresponding to the sample to be tested;
wherein the contamination status data includes a contamination proportion, a minor allele frequency variance, and the number of contamination sources;
the Gaussian mixture model satisfies the formula:
[formulas not reproduced in this text]
wherein maf represents the minor allele frequency, δ² represents the variance of the minor allele frequency, n represents the number of contamination sources, α represents the contamination proportion, pbinom represents the Bernoulli probability distribution, P(C=i) represents the probability distribution corresponding to i contamination sources, and N represents the probability distribution of the Gaussian mixture model;
the genotype determination module comprises:
a first abundance threshold acquisition unit, configured to obtain a first abundance threshold and to determine the mutation abundance corresponding to each SNP site to be tested;
a genotype determination unit, configured to determine the genotype corresponding to each SNP site to be tested based on the first abundance threshold and the mutation abundance corresponding to each SNP site to be tested;
the first abundance threshold acquisition unit comprises:
a to-be-tested abundance threshold acquisition subunit, configured to obtain at least three abundance thresholds to be tested;
an abundance fluctuation level determination subunit, configured to calculate, for each abundance threshold to be tested, the abundance fluctuation level of the sample to be tested based on the abundance threshold to be tested and the minor allele frequency corresponding to each SNP site to be tested;
a first abundance threshold determination subunit, configured to use, as the first abundance threshold, the abundance threshold to be tested corresponding to the abundance fluctuation level that meets the inflection point feature among the at least 3 abundance fluctuation levels;
the abundance fluctuation level determination subunit is specifically configured to:
for each chromosome on which the SNP sites to be tested are located, perform a segmentation operation on the chromosome based on the copy numbers corresponding to at least two SNP sites to be tested on the chromosome, to obtain one or more chromosome segments, wherein the copy number difference between any two SNP sites to be tested in each chromosome segment is less than a preset difference threshold;
for each chromosome segment, take each SNP site to be tested in the chromosome segment whose minor allele frequency is greater than the abundance threshold to be tested as a target SNP site;
calculate the abundance fluctuation level of the sample to be tested based on the minor allele frequencies corresponding to at least two target SNP sites.

11. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, enables the at least one processor to implement the genotype detection method according to any one of claims 1 to 4 or the sample contamination detection method according to any one of claims 5 to 8.

12. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions which, when executed by a processor, cause the processor to implement the genotype detection method according to any one of claims 1 to 4 or the sample contamination detection method according to any one of claims 5 to 8.
CN202210749217.7A 2022-06-28 2022-06-28 Genotype detection method, sample pollution detection method, device, equipment and medium Active CN115035950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210749217.7A CN115035950B (en) 2022-06-28 2022-06-28 Genotype detection method, sample pollution detection method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN115035950A CN115035950A (en) 2022-09-09
CN115035950B CN115035950B (en) 2025-08-12

Family

ID=83126621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210749217.7A Active CN115035950B (en) 2022-06-28 2022-06-28 Genotype detection method, sample pollution detection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115035950B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115472222B (en) * 2022-11-02 2023-03-24 杭州链康医学检验实验室有限公司 Single cell transcriptome RNA pollution identification method, medium and equipment
CN115985389B (en) * 2022-12-26 2025-07-18 广州燃石医学检验所有限公司 Method and device for detecting sample cross contamination
CN117059164A (en) * 2023-08-15 2023-11-14 苏州贝康医疗器械有限公司 A method and its application for detecting sample cross-contamination based on SNP sites
CN116955735A (en) * 2023-08-29 2023-10-27 上海福君基因生物科技有限公司 Quality control method, device, equipment and storage medium for high-throughput sequencing data
CN116935966B (en) * 2023-09-13 2024-01-23 北京诺禾致源科技股份有限公司 Method and device for judging pollution of high-throughput sequencing paired data
CN118866116B (en) * 2024-07-08 2025-04-04 广州达安临床检验中心有限公司 A method, device, system and storage medium for analyzing contamination of sequencing samples
CN118932028A (en) * 2024-08-14 2024-11-12 厦门飞朔生物技术有限公司 A method for detecting sample mismatch and sample contamination in whole exome sequencing
CN119517162B (en) * 2025-01-21 2025-10-21 臻和(北京)生物科技有限公司 A method and device for ultra-high sensitivity sample component evaluation and traceability

Citations (2)

Publication number Priority date Publication date Assignee Title
WO2016078067A1 (en) * 2014-11-21 2016-05-26 深圳华大基因研究院 Individual single nucleotide polymorphisms locus genotyping method and apparatus
CN114530198A (en) * 2020-11-23 2022-05-24 福建和瑞基因科技有限公司 Screening method of SNP (single nucleotide polymorphism) sites for detecting sample pollution level and detection method of sample pollution level

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
CN101705288B (en) * 2009-11-17 2011-12-28 西北农林科技大学 Single nucleotide polymorphism of the dairy goat MFG-E8 gene and detection method thereof
SI3078752T1 (en) * 2011-04-12 2018-12-31 Verinata Health, Inc Resolving genome fractions using polymorphism counts
CN104182655B (en) * 2014-09-01 2017-03-08 上海美吉生物医药科技有限公司 Method for determining fetal genotype
CN106202991B (en) * 2016-06-30 2019-03-08 厦门艾德生物医药科技股份有限公司 Method for detecting mutation information in genome multiplex amplification sequencing products
CN107423578B (en) * 2017-03-02 2020-09-22 北京诺禾致源科技股份有限公司 Device for detecting somatic cell mutation
CN109022562A (en) * 2018-08-29 2018-12-18 天津诺禾致源生物信息科技有限公司 Method for screening SNP sites for detecting sample contamination, and method for detecting sample contamination in high-throughput sequencing
CN113136422A (en) * 2020-01-19 2021-07-20 北京圣谷同创科技发展有限公司 Method for detecting high-throughput sequencing sample contamination by grouping SNP sites
CN113628683B (en) * 2021-08-24 2024-04-09 慧算医疗科技(上海)有限公司 High-throughput sequencing mutation detection method, device and apparatus and readable storage medium
CN113674803B (en) * 2021-08-30 2023-08-08 广州燃石医学检验所有限公司 Copy number variation detection method, device, storage medium and application thereof
CN114093428B (en) * 2021-11-08 2023-04-14 南京世和基因生物技术股份有限公司 System and method for detecting low-abundance mutation under ctDNA ultrahigh sequencing depth

Also Published As

Publication number Publication date
CN115035950A (en) 2022-09-09

Similar Documents

Publication Publication Date Title
CN115035950B (en) Genotype detection method, sample pollution detection method, device, equipment and medium
Ha et al. TITAN: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data
Capra et al. A model-based analysis of GC-biased gene conversion in the human and chimpanzee genomes
CN110010197B (en) Detection method, device and storage medium of single nucleotide variation based on blood circulating tumor DNA
CN114530198A (en) Screening method of SNP (single nucleotide polymorphism) sites for detecting sample pollution level and detection method of sample pollution level
EP1535232A2 (en) A system and method for snp genotype clustering
WO2017139801A1 (en) Methods and systems for detection of abnormal karyotypes
CN109461473B (en) Method and device for obtaining fetal free DNA concentration
DeSaix et al. Population assignment from genotype likelihoods for low‐coverage whole‐genome sequencing data
CN110268072A (en) Method and system for determining paralogous genes
CN105874460B (en) Identify method, readable medium and the equipment of at least one base of target sequence
Sauk et al. NIPTmer: rapid k-mer-based software package for detection of fetal aneuploidies
US20220093211A1 (en) Detecting cross-contamination in sequencing data
US20230090925A1 (en) Methylation fragment probabilistic noise model with noisy region filtration
Garner Confounded by sequencing depth in association studies of rare alleles
CN107273715B (en) Detection method and device
CN108256294A (en) Device for detecting somatic mutations
CN117497047B (en) Method, equipment and medium for screening tumor gene markers based on exon sequencing
CN107109324A (en) Method and apparatus for determining fetal nucleic acid content
CN116244602A (en) Sample pollution detection and model training method, device, equipment and medium
WO2019213810A1 (en) Method, apparatus, and system for detecting chromosome aneuploidy
CN118098345B (en) A method, device, equipment and storage medium for detecting chromosome aneuploidy
Kingma et al. Saturated Transposon Analysis in Yeast as a one-step method to quantify the fitness effects of gene disruptions on a genome-wide scale
Niehus et al. PopDel identifies medium-size deletions jointly in tens of thousands of genomes
US20260011408A1 (en) Context-Specific Tumor-Only Mutation Classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant