CN115035950B

CN115035950B - Genotype detection method, sample contamination detection method, apparatus, device and medium

Info

Publication number: CN115035950B
Application number: CN202210749217.7A
Authority: CN
Inventors: 刘成林; 张周; 汉雨生
Original assignee: Guangzhou Burning Rock Dx Co ltd
Current assignee: Guangzhou Burning Rock Dx Co ltd
Priority date: 2022-06-28
Filing date: 2022-06-28
Publication date: 2025-08-12
Anticipated expiration: 2042-06-28
Also published as: CN115035950A

Abstract

The present invention discloses a genotype detection method, a sample contamination detection method, an apparatus, a device and a medium. The genotype detection method comprises: obtaining the SNP site to be detected and the abundance threshold to be detected in the sample to be detected; for each abundance threshold to be detected, calculating the abundance fluctuation level of the sample to be detected; based on the abundance threshold to be detected corresponding to the abundance fluctuation level that meets the inflection point characteristics in the abundance fluctuation level and the mutation abundance corresponding to each SNP site to be detected, determining the genotype corresponding to each SNP site to be detected. Based on the minor allele frequency corresponding to the SNP site to be detected that meets the genotype condition, a Gaussian mixture model is constructed, and the Gaussian mixture model is optimized to obtain the contamination status data corresponding to the sample to be detected. The embodiment of the present invention solves the problem that the existing genotype detection method relies on paired sequencing data, and improves the accuracy of the genotype detection results.

Description

Genotype detection method, sample contamination detection method, apparatus, device and medium

Technical Field

The invention relates to the field of biotechnology, and in particular to a genotype detection method, a sample contamination detection method, an apparatus, a device and a medium.

Background

Genotyping of SNP loci, individually and in bulk, is an essential step in locus genotyping and copy number variation detection based on high-throughput sequencing. SNP (Single Nucleotide Polymorphism) mainly refers to DNA (deoxyribonucleic acid) sequence polymorphism caused by variation of a single nucleotide at the genome level. The polymorphism represented by an SNP locus involves only a single-base variation, which may be caused by a single-base transition or transversion, or by a base insertion or deletion. Published data show that SNP sites in specific enzymes are associated with drug metabolism; for example, breast cancer patients carrying the homozygous CYP2D6*10 mutation are likely to have a higher recurrence risk under adjuvant tamoxifen therapy than wild-type patients. In addition, identifying chromosomal copy number variation from shifts in the abundance of heterozygous SNPs in bulk relies first on identifying the SNP genotypes. In general, SNP genotyping employs a static abundance threshold method (e.g., "Allele-specific copy number analysis of tumors", Peter Van Loo et al., PNAS vol. 107, no. 30: 16910-16915), i.e., a locus is called a homozygous mutation when its abundance exceeds a predetermined threshold (e.g., 95%). This method is strongly affected by tumor fraction, chromosomal structural variation, sample quality and sample contamination, and its stability and accuracy are low.

On the other hand, in high-throughput sequencing, contamination by heterologous DNA introduced during sample storage, preparation and the like is a non-negligible problem. Sample contamination directly introduces a large number of low-abundance mutations from the heterologous DNA into mutation detection. These low-abundance mutations are difficult to distinguish from the true somatic mutations of the sample to be detected, which leads to misjudged mutation results. Identifying sample contamination is therefore an important part of high-throughput sequencing quality control, and accurately quantifying it helps to identify and extract reliable somatic mutations. In general, sample contamination is identified with the help of a control sample, which in practice is difficult to obtain and adds cost. Quantifying sample contamination is harder still: samples may carry extensive copy number variation, may be contaminated by multiple sources, and existing approaches are insensitive when identifying and quantifying low-level contamination. Some studies quantify the contamination level by constructing a model of heterozygous SNP abundance; however, SNP genotype identification based on static thresholds directly leads to unreliable heterozygous SNP datasets.

For example, Fiévet et al., "ART-DeCo: easy tool for detection and characterization of cross-contamination of DNA samples in diagnostic next-generation sequencing analysis" (European Journal of Human Genetics (2019) 27:792–800), describes a method for detecting contamination in a sequenced sample, which involves screening for non-heterozygous SNP sites and estimating the proportion of the contamination source. The static threshold used for screening treats an SNP site as non-heterozygous when its AR (allelic ratio, similar to allele frequency) lies in [0, 0.005] or [0.995, 1]. The contamination proportion is estimated in two ways: a simple estimate using the WCS formula (see the right column of page 794 of the paper) based on the sequencing data of the non-heterozygous SNP sites of the sample to be detected, and a refined estimate based on the SNP genotypes of the contaminating sample together with the sequencing data of the non-heterozygous SNP sites of the sample to be detected.

However, because samples differ from one another, a single static threshold is sometimes not accurate for discriminating the SNP genotypes of all samples. Moreover, to identify the contamination source, an additional set of contaminating samples must be sequenced besides the sample to be detected, so contamination cannot be identified from the sequencing data of a single sample to be detected alone.

Disclosure of Invention

The invention provides a genotype detection method, a sample contamination detection method, an apparatus, a device and a medium, which solve the problem that existing genotype detection methods depend on paired sequencing data and provide a data basis for the accuracy of subsequent sample contamination detection.

According to an aspect of the present invention, there is provided a genotype detection method comprising:

acquiring at least 10 SNP loci to be detected in a sample to be detected and acquiring at least 3 abundance thresholds to be detected;

aiming at each abundance threshold to be detected, calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequency corresponding to each SNP locus to be detected;

taking the abundance threshold to be detected corresponding to the abundance fluctuation level that exhibits the inflection point feature among the at least 3 abundance fluctuation levels as a first abundance threshold;

and determining the genotypes corresponding to the SNP loci to be detected respectively based on the first abundance threshold and the mutation abundance corresponding to the SNP loci to be detected respectively.

According to another aspect of the present invention, there is provided a sample contamination detection method comprising:

obtaining at least 10 SNP loci to be detected in a sample to be detected, and determining genotypes corresponding to the SNP loci to be detected respectively;

constructing a Gaussian mixture model based on the minor allele frequency corresponding to the SNP loci to be detected that meet the genotype condition, wherein the genotype condition comprises that the genotype is wild type and/or the genotype is homozygous;

and performing an optimization solving operation on the Gaussian mixture model to obtain contamination state data corresponding to the sample to be detected, wherein the contamination state data comprises at least one of a contamination proportion, a minor allele frequency variance and the number of contamination sources.

According to another aspect of the present invention, there is provided a genotype detection device comprising:

the SNP locus to be detected acquisition module is used for acquiring at least 10 SNP loci to be detected in a sample to be detected and acquiring at least 3 abundance thresholds to be detected;

the abundance fluctuation level determining module is used for calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequency corresponding to each SNP locus to be detected respectively aiming at each abundance threshold to be detected;

The first abundance threshold determining module is used for taking an abundance threshold to be detected corresponding to the abundance fluctuation level which accords with the inflection point feature in at least 3 abundance fluctuation levels as a first abundance threshold;

the genotype determining module is used for determining genotypes corresponding to the SNP loci to be detected respectively based on the first abundance threshold and mutation abundance corresponding to the SNP loci to be detected respectively.

According to another aspect of the present invention, there is provided a sample contamination detection apparatus comprising:

the genotype determining module is used for acquiring at least 10 SNP loci to be detected in a sample to be detected and determining genotypes corresponding to the SNP loci to be detected respectively;

the Gaussian mixture model construction module is used for constructing a Gaussian mixture model based on the minor allele frequency corresponding to the SNP loci to be detected that meet the genotype condition, wherein the genotype condition comprises that the genotype is wild type and/or the genotype is homozygous;

the contamination state data determining module is used for performing an optimization solving operation on the Gaussian mixture model to obtain contamination state data corresponding to the sample to be detected, wherein the contamination state data comprises at least one of a contamination proportion, a minor allele frequency variance and the number of contamination sources.

According to another aspect of the present invention, there is provided an electronic apparatus including:

at least one processor, and

A memory communicatively coupled to the at least one processor, wherein,

The memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the genotype detection method or the sample contamination detection method according to any of the embodiments of the present invention.

According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the genotype detection method or sample contamination detection method according to any embodiment of the present invention when executed.

According to the technical scheme of the invention, for each obtained abundance threshold to be detected, the abundance fluctuation level of the sample to be detected is determined based on that threshold and the minor allele frequencies corresponding to the SNP loci to be detected. The abundance threshold to be detected whose abundance fluctuation level exhibits the inflection point feature among the at least 3 abundance fluctuation levels is taken as the first abundance threshold, so that the first abundance threshold is determined by a dynamic abundance threshold model. The genotype of each SNP locus to be detected is then determined from the first abundance threshold and the mutation abundance of that locus.

With the dynamic threshold method, inaccurate genotype discrimination is effectively avoided in cases where the abundance of wild-type, heterozygous and homozygous SNPs deviates from 0%, 50% and 100%, respectively, owing to high tumor fraction, copy number variation, sample contamination and the like. Meanwhile, the algorithm does not depend on a control sample, which avoids the difficulty of obtaining control samples and the added detection cost.

Further, because heterozygous and non-heterozygous SNPs are discriminated using a dynamic threshold, the invention can achieve higher genotyping accuracy, i.e., fewer false positives and false negatives. Meanwhile, the mixture model method can determine the number of contamination sources in a sample, can identify and quantify contamination from multiple sources as well as low-level contamination, and can stably identify and quantify the contamination level of samples in different chromosomal states, including chromosomally stable samples and samples with extensive copy number variation. It also requires no sequencing information from samples other than the sample to be detected (such as reference samples or contaminating samples).

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a genotype detection method according to a first embodiment of the present invention;

FIG. 2 is a flow chart of a sample contamination detection method according to a second embodiment of the present invention;

FIG. 3 is a flow chart of a sample contamination detection method according to a third embodiment of the present invention;

FIG. 4 is a schematic diagram of a genotype detecting device according to a fourth embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a sample contamination detection apparatus according to a fifth embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an electronic device according to a sixth embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example one

FIG. 1 is a flowchart of a genotype detection method according to a first embodiment of the present invention. The method can be performed by a genotype detection apparatus, which can be implemented in hardware and/or software and can be configured in a terminal device. As shown in FIG. 1, the method includes:

S110, obtaining at least 10 SNP loci to be detected in a sample to be detected and obtaining at least 3 abundance thresholds to be detected.

The sample to be detected is a single tissue sample, such as a cancer tissue sample. In this example, the SNP loci to be detected are germline SNP loci. The human genome contains a large number of mutation sites, which can be classified by origin into germline mutations and somatic mutations. A germline mutation, also called a germ cell mutation, originates from germ cells such as sperm or ova, so all cells in the body, including those in cancer tissue samples, usually carry it. Somatic mutations, also known as acquired mutations, arise during growth or under the influence of environmental factors, and are usually carried by only a portion of the cells.

Specifically, after quality control is performed on the FASTQ file (raw off-machine sequencing data) containing the high-throughput sequencing reads of the sample to be detected, alignment software is used to align the reads to a human reference genome and generate a SAM file, and SNP sites in the aligned reads are taken as the SNP sites to be detected. By way of example, the alignment software may be BWA-MEM (0.7.10), and the human reference genome may be hg19 or b37, among others. Illustratively, the SAM file is converted to a BAM file with samtools (0.1.19).

As an optional implementation, obtaining at least 10 SNP loci to be detected in the sample to be detected comprises taking at least 10 SNP loci in the sample whose population frequency falls within a preset population frequency range as the SNP loci to be detected, wherein the preset population frequency range comprises 0.4-0.6. The population frequency characterizes how often an SNP locus occurs in population genomes. Specifically, SNP loci that are aligned to the human reference genome and have a population frequency of 0.4-0.6 are taken as the SNP loci to be detected.
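The selection by population frequency can be illustrated with a minimal sketch. It assumes each candidate SNP is a dictionary with a hypothetical `pop_freq` field and treats the 0.4-0.6 range as inclusive; the field name, the function name and the inclusive bounds are illustrative assumptions, not details given in the text.

```python
def select_candidate_snps(snps, low=0.4, high=0.6, min_count=10):
    """Keep SNP loci whose population frequency lies in [low, high].

    `snps` is a list of dicts with a hypothetical 'pop_freq' key.
    Raises ValueError if fewer than `min_count` loci remain, because the
    method calls for at least 10 SNP loci to be detected.
    """
    selected = [s for s in snps if low <= s["pop_freq"] <= high]
    if len(selected) < min_count:
        raise ValueError(
            f"only {len(selected)} loci in range; at least {min_count} required"
        )
    return selected
```

In practice the population frequencies would come from an external annotation source, which the text does not name.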

Specifically, the abundance threshold to be detected is a screening threshold applied to the minor allele frequency. The spacing between successive abundance thresholds to be detected may be uniform or non-uniform. For example, a set of abundance thresholds to be detected th satisfies th ∈ {0%, 1%, 2%, ..., 10%, 30%}. The specific values of the abundance thresholds to be detected are not limited, and a user can set each of them according to actual requirements.

S120, aiming at each abundance threshold to be detected, calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequency corresponding to each SNP locus to be detected.

Specifically, calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequencies of the SNP loci to be detected comprises: performing a segmentation operation on each chromosome based on the copy numbers of at least two SNP loci to be detected on that chromosome to obtain one or more chromosome segments, wherein the copy number difference between any two SNP loci to be detected within a chromosome segment is smaller than a preset difference threshold; taking the SNP loci to be detected in a chromosome segment whose minor allele frequency is greater than the abundance threshold to be detected as target SNP loci; and calculating the abundance fluctuation level of the sample to be detected based on the minor allele frequencies of at least two target SNP loci.

Among these, the types of chromosomes include autosomes and/or sex chromosomes. Of the 23 pairs of chromosomes in humans, 22 pairs are autosomes and 1 pair is sex chromosome. The chromosomes in this example include all chromosomes that may have heterozygous variations.

In particular, the copy number is used to characterize the number of occurrences of a gene or a gene sequence in the genome, and can also be said to be the number of times a gene fragment is repeated in the genome. Normally, the copy number is 2. When copy number variation occurs, the genome rearranges, and the copy number of the gene changes from 2 to n, where n is a natural number that is not equal to 2.

Specifically, the coverage depth of each SNP locus to be detected is calculated with GATK software, and the copy number of the SNP locus to be detected is calculated from the coverage depth. The coverage depth refers to the ratio of the total number of bases (bp) obtained by sequencing to the genome size. In this example, the coverage depth of the SNP loci to be detected is higher than 200X.

By way of example, a CBS (circular binary segmentation) algorithm is employed to divide a chromosome into one or more chromosome segments based on copy number. The CBS algorithm may be implemented with the R package PSCBS, among others. In this embodiment, the copy number difference between any two SNP loci to be detected within a chromosome segment is smaller than a preset difference threshold, which may be 2. For example, if the copy numbers of 4 SNP sites to be detected on a chromosome are 2, 3, 5 and 6, respectively, then chromosome segment A contains the 2 SNP sites to be detected with copy numbers 2 and 3, and chromosome segment B contains the 2 SNP sites to be detected with copy numbers 5 and 6.
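The segment constraint in the example above (copy numbers 2, 3, 5, 6 with a difference threshold of 2) can be sketched as follows. This is a simple greedy grouping of consecutive loci, used here only as a stand-in to illustrate the constraint; it is not the CBS algorithm and not the PSCBS implementation, and the function name is hypothetical.

```python
def segment_by_copy_number(copy_numbers, max_diff=2):
    """Greedy stand-in for CBS: group consecutive loci into segments.

    A new segment starts whenever adding the next locus would make the
    spread (max - min) of copy numbers in the segment reach `max_diff`.
    Returns a list of segments, each a list of locus indices.
    """
    segments = []
    current = []
    lo = hi = None
    for idx, cn in enumerate(copy_numbers):
        # Close the current segment if this locus would violate the limit.
        if current and max(hi, cn) - min(lo, cn) >= max_diff:
            segments.append(current)
            current = []
        if not current:
            lo = hi = cn
        else:
            lo, hi = min(lo, cn), max(hi, cn)
        current.append(idx)
    if current:
        segments.append(current)
    return segments


print(segment_by_copy_number([2, 3, 5, 6]))  # [[0, 1], [2, 3]]
```

The first two loci form one segment and the last two form another, matching segments A and B in the example.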

For each mutated SNP site, the major allele is the allele with the highest count in the given population, the minor allele is the allele with the second highest count in the given population, and the minor allele frequency is the ratio of the number of minor alleles to the number of all alleles. Alleles are the variant forms of a gene that occupy the same position on a pair of homologous chromosomes and control different forms of the same trait. For example, suppose genome sequencing of 100 persons finds three alleles A, C and G at an SNP site on a chromosome, with base A, base C and base G occurring 100, 80 and 20 times, respectively; the minor allele frequency of the SNP site is then 80/200 = 0.4.
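The worked example above can be reproduced with a short sketch. The function name and the dictionary input format are illustrative assumptions.

```python
def minor_allele_frequency(allele_counts):
    """Return the frequency of the second most common allele.

    `allele_counts` maps each allele to its number of observations.
    Returns 0.0 when fewer than two alleles are present or no
    observations exist.
    """
    counts = sorted(allele_counts.values(), reverse=True)
    total = sum(counts)
    if len(counts) < 2 or total == 0:
        return 0.0
    return counts[1] / total


print(minor_allele_frequency({"A": 100, "C": 80, "G": 20}))  # 0.4
```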

As an alternative implementation, calculating the abundance fluctuation level of the sample to be detected based on the minor allele frequencies of at least two target SNP loci comprises: sorting the target SNP loci by their positions on the chromosome; based on the sorted order, taking the difference between the minor allele frequency of the current target SNP locus and that of the next target SNP locus as an abundance difference; and averaging the at least one abundance difference to obtain the abundance fluctuation level of the sample to be detected.

By way of example, assume that 4 target SNP sites on a chromosome are, in order, target SNP site A, target SNP site B, target SNP site C and target SNP site D, with copy numbers 2, 6, 5 and 3, respectively. Chromosome segment A then contains, in order, target SNP site A and target SNP site D, and chromosome segment B contains, in order, target SNP site B and target SNP site C.

For example, if chromosome segment A contains 100 target SNP sites, 99 abundance differences are calculated, and these are averaged to obtain the abundance fluctuation level corresponding to chromosome segment A.
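The per-segment calculation described above can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: the abundance difference is taken as an absolute value (a plain signed difference would largely cancel when averaged), and the function handles a single chromosome segment, since how segment-level values are combined into a sample-level value is not stated here. The function name and tuple format are hypothetical.

```python
def abundance_fluctuation_level(loci, threshold):
    """Mean absolute MAF difference between neighbouring target loci.

    `loci` is a list of (position, maf) tuples for one chromosome segment.
    Loci with maf > threshold are kept as target loci and sorted by
    position; differences between adjacent loci are then averaged.
    Returns None if fewer than two target loci remain.
    """
    targets = sorted((pos, maf) for pos, maf in loci if maf > threshold)
    if len(targets) < 2:
        return None
    diffs = [
        abs(targets[i + 1][1] - targets[i][1])  # assumption: absolute value
        for i in range(len(targets) - 1)
    ]
    return sum(diffs) / len(diffs)
```

With n target loci the list `diffs` has n - 1 entries, consistent with the 100-site, 99-difference example.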

S130, taking an abundance threshold to be measured corresponding to the abundance fluctuation level which accords with the inflection point feature in the at least 3 abundance fluctuation levels as a first abundance threshold.

Specifically, an abundance fluctuation level curve is constructed with the abundance threshold to be detected as the abscissa and the abundance fluctuation level as the ordinate. As the abundance threshold to be detected increases, the abundance fluctuation level first rises and then levels off. The abundance threshold to be detected at which the curve turns from rising to flat, i.e., the point exhibiting the inflection point feature, is taken as the first abundance threshold.
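One way to locate such a turning point on a rise-then-plateau curve is sketched below. The text does not specify how the inflection point feature is evaluated, so the plateau criterion used here (the increase to the next point falls below a tolerance `tol`) and the function name are assumptions made purely for illustration.

```python
def pick_first_threshold(thresholds, levels, tol=0.001):
    """Return the first threshold after which the curve stops rising.

    `thresholds` must be sorted ascending and aligned with `levels`.
    The plateau criterion (an increase smaller than `tol` to the next
    point) is an assumption of this sketch, not a rule from the text.
    Falls back to the largest threshold if no plateau is found.
    """
    if len(thresholds) != len(levels) or len(thresholds) < 3:
        raise ValueError("need at least 3 aligned threshold/level pairs")
    for i in range(len(levels) - 1):
        if levels[i + 1] - levels[i] < tol:
            return thresholds[i]
    return thresholds[-1]
```

Other knee-detection rules, such as the maximum change in slope, would fit the same description; the appropriate choice depends on how noisy the curve is.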

S140, determining genotypes corresponding to the SNP loci to be detected respectively based on the first abundance threshold and mutation abundance corresponding to the SNP loci to be detected respectively.

The mutation abundance, also called the mutant allele frequency, is the ratio of the number of reads carrying the mutant allele to the total number of reads covering the SNP locus. Specifically, samtools mpileup is used to count the number of reads A carrying the mutant allele and the number of reads B carrying the wild-type allele at each SNP locus to be detected, and mutation abundance = A/(A+B). For example, alleles at SNP sites are denoted AAA, BBB, ABB and AAB, and the mutation abundances corresponding to each allele representation are 100%, 0%, 25% and 75%, respectively.
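The formula mutation abundance = A/(A+B) can be written as a small helper. The function name is illustrative, and the read counts are assumed to have been extracted already, for example from pileup output.

```python
def mutation_abundance(mutant_reads, wild_reads):
    """Return A / (A + B) for mutant read count A and wild-type read count B."""
    total = mutant_reads + wild_reads
    if total == 0:
        raise ValueError("no reads cover this locus")
    return mutant_reads / total


print(mutation_abundance(30, 90))  # 0.25
```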

In this example, genotypes include heterozygous, wild type and homozygous. Heterozygous describes individuals, such as AB, that carry different alleles on the homologous chromosomes. Wild type describes individuals, such as AA, that carry the wild-type allele on both homologous chromosomes, and homozygous describes individuals, such as BB, that carry the same mutant allele on both homologous chromosomes.

As an alternative implementation, determining the genotype of each SNP locus to be detected based on the first abundance threshold and the mutation abundance of each locus comprises: calculating a second abundance threshold from the first abundance threshold, wherein the sum of the first abundance threshold and the second abundance threshold is one; and, for each SNP locus to be detected, calling the genotype wild type if its mutation abundance is less than or equal to the first abundance threshold, homozygous if its mutation abundance is greater than or equal to the second abundance threshold, and heterozygous if its mutation abundance is greater than the first abundance threshold and less than the second abundance threshold.

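Specifically, the second abundance threshold = 1 - the first abundance threshold, and the genotype of the SNP site to be detected satisfies the following relationship:

\[
\text{genotype} =
\begin{cases}
\text{wild type}, & AF \le tBest \\
\text{heterozygous}, & tBest < AF < mBest \\
\text{homozygous}, & AF \ge mBest
\end{cases}
\qquad mBest = 1 - tBest
\]

wherein AF represents mutation abundance, tBest represents the first abundance threshold, and mBest represents the second abundance threshold.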
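The three-way rule described above can be sketched as a small function. The names `call_genotype` and `t_best` are illustrative, and the check that the first abundance threshold lies below 0.5 is an added safeguard assumed here so that the heterozygous interval is non-empty.

```python
def call_genotype(af, t_best):
    """Classify one SNP locus from its mutation abundance `af`.

    `t_best` is the first abundance threshold; the second threshold is
    m_best = 1 - t_best. Both `af` and `t_best` are fractions in [0, 1].
    """
    if not 0.0 <= t_best < 0.5:
        raise ValueError("t_best must lie in [0, 0.5)")
    m_best = 1.0 - t_best
    if af <= t_best:
        return "wild-type"
    if af >= m_best:
        return "homozygous"
    return "heterozygous"


print(call_genotype(0.5, 0.03))  # heterozygous
```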

The data in this example are based on high-throughput sequencing results of 100 real tissue samples. The tissue samples include samples with different tumor fractions and different degrees of chromosome stability, both uncontaminated and contaminated to different levels. In parallel, control leukocyte samples from the same individuals as the tissue samples were sequenced by high-throughput sequencing to determine the true SNP genotypes as the performance reference standard. Unlike cancer tissue samples, leukocyte samples have a stable genome and are almost free of contamination from sample storage and preparation. To ensure the reliability of the reference standard, genotypes were interpreted directly with a simple model in which an abundance of approximately 0% is wild type, approximately 100% is homozygous and approximately 50% is heterozygous. To allow for data noise, 5% of abundance fluctuation is tolerated, i.e., an abundance of 0%-5% is wild type, 95%-100% is homozygous, and 47.5%-52.5% is heterozygous. For the tissue samples, genotype identification based on two static threshold combinations was compared with the dynamic threshold algorithm of the present invention.

Wherein the static threshold combination 1:

wherein the static threshold combination 2:

Further, the concordance of the three strategies with the leukocyte reference standard was assessed, with heterozygous SNP sites treated as positive sites and non-heterozygous sites as negative sites, as shown in Table 1 below. The dynamic threshold method performs better and more stably in both sensitivity and specificity, and has the highest accuracy.

Table 1 lists the concordance of genotype identification according to example one of the present invention.
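The concordance assessment described above, with heterozygous calls as positives and non-heterozygous calls as negatives, can be sketched as follows. The function name and the label strings are illustrative assumptions, and the standard definitions of sensitivity, specificity and accuracy are used; no values from Table 1 are reproduced.

```python
def concordance_metrics(called, reference, positive="heterozygous"):
    """Compare genotype calls with reference genotypes, locus by locus.

    Returns sensitivity, specificity and accuracy, with `positive`
    (heterozygous by default) as the positive class. A metric is None
    when its denominator is zero.
    """
    if len(called) != len(reference):
        raise ValueError("called and reference must have the same length")
    tp = fp = tn = fn = 0
    for c, r in zip(called, reference):
        c_pos, r_pos = c == positive, r == positive
        if c_pos and r_pos:
            tp += 1
        elif c_pos:
            fp += 1
        elif r_pos:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "accuracy": (tp + tn) / total if total else None,
    }
```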

According to the technical scheme of this embodiment, the abundance fluctuation level of the sample to be detected is determined for each obtained abundance threshold to be detected, based on that threshold and the minor allele frequencies corresponding to the SNP loci to be detected. The abundance threshold to be detected whose abundance fluctuation level exhibits the inflection point feature among the at least 3 abundance fluctuation levels is taken as the first abundance threshold, so that the first abundance threshold is determined by a dynamic abundance threshold model. The genotype of each SNP locus to be detected is then determined from the first abundance threshold and the mutation abundance of that locus. This solves the problem that conventional genotype detection methods depend on paired sequencing data and improves the accuracy of the genotype detection results.

Example two

FIG. 2 is a flowchart of a sample contamination detection method according to a second embodiment of the present invention. This embodiment is applicable to detecting and evaluating the contamination level of a single tissue sample. The method may be performed by a sample contamination detection apparatus, which may be implemented in hardware and/or software and may be configured in a terminal device. As shown in FIG. 2, the method includes:

S210, obtaining at least 10 SNP loci to be detected in a sample to be detected, and determining genotypes corresponding to the SNP loci to be detected respectively.

As an alternative implementation mode, determining genotypes corresponding to the SNP loci to be detected respectively comprises determining the genotypes corresponding to the SNP loci to be detected respectively by adopting a preset detection method, wherein the preset detection method comprises at least one of a sequencing method, a chip method and a mass spectrometry method.

In another embodiment, optionally, determining the genotypes corresponding to the SNP loci to be detected comprises: obtaining a first abundance threshold; determining the mutation abundance corresponding to each SNP locus to be detected; and determining the genotype corresponding to each SNP locus to be detected based on the first abundance threshold and the mutation abundance corresponding to that locus.

As an alternative embodiment, the first abundance threshold is preset by the user. For example, the first preset abundance threshold may be 0.01 or 0.03, where the specific value of the first preset abundance threshold is not limited, and may be set by the user according to actual needs.

Mutation abundance, also called mutant allele frequency, characterizes the ratio of the number of reads of the mutant allele to the total number of reads at the SNP locus. Alleles are the different forms of a gene that are located at the same position on a pair of homologous chromosomes and control different forms of the same trait. Specifically, samtools mpileup is used to count the number of reads A of the mutant allele and the number of reads B of the wild-type allele at each SNP locus to be detected, and mutation abundance = A/(A+B). For example, alleles at SNP sites are denoted AAA, BBB, ABB and AAB, and the mutation abundances corresponding to each allele representation are 100%, 0%, 25% and 75%, respectively.
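The ratio A/(A+B) can be sketched in a few lines of Python. This is a minimal illustration, assuming the read counts have already been extracted (for example from samtools mpileup output, whose parsing is not shown); the function name, locus names and counts are hypothetical.

```python
def mutation_abundance(mutant_reads: int, wild_reads: int) -> float:
    """Return A / (A + B) for mutant read count A and wild-type read count B."""
    total = mutant_reads + wild_reads
    if total == 0:
        raise ValueError("no reads cover this SNP locus")
    return mutant_reads / total


# Hypothetical read counts per locus: (mutant reads A, wild-type reads B).
read_counts = {"snp_1": (120, 0), "snp_2": (0, 95), "snp_3": (25, 75)}

# Mutation abundance of every locus, keyed by locus name.
abundances = {name: mutation_abundance(a, b) for name, (a, b) in read_counts.items()}
```

A locus with no coverage raises an error here rather than returning 0, since an uncovered locus carries no genotype information.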

S220, constructing a Gaussian mixture model based on the minor allele frequency corresponding to the SNP locus to be detected meeting the genotype condition.

In this embodiment, the genotypic conditions include the genotype being wild-type and/or the genotype being homozygous.

A SNP site typically contains only two alleles. For a set of SNP loci to be detected, assume that 4 SNP loci to be detected contain the alleles AAA, BBB, ABB and AAB, respectively. Calculating mutation abundance requires the proportion of A or B in the total number of alleles; for example, if A is the mutant allele, the mutation abundances of the 4 SNP loci to be detected are 100%, 0%, 25% and 75%, respectively. For non-heterozygous SNP loci to be detected, only the degree of difference between the numbers of A and B matters, not which of A or B makes up the proportion, so the minor allele frequency meets this need well: the minor allele frequencies of the 4 SNP loci to be detected are 0%, 0%, 25% and 25%, respectively. By way of example, the mutation abundance is folded about 50% as the center, and the folded mutation abundance equals the minor allele frequency.
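Folding the mutation abundance about 50% can be written as taking the smaller of the abundance and its complement. The sketch below is illustrative only; the function name is hypothetical and the inputs are the four mutation abundances from the example above.

```python
def minor_allele_frequency(mutation_abundance: float) -> float:
    """Fold a mutation abundance about 0.5 to obtain the minor allele frequency."""
    if not 0.0 <= mutation_abundance <= 1.0:
        raise ValueError("mutation abundance must lie in [0, 1]")
    return min(mutation_abundance, 1.0 - mutation_abundance)


# Mutation abundances of the four example loci: 100%, 0%, 25% and 75%.
mafs = [minor_allele_frequency(x) for x in (1.0, 0.0, 0.25, 0.75)]
```

After folding, 100% and 0% both map to 0, and 75% and 25% both map to 25%, so the result no longer depends on which allele is labeled as the mutant.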

As an alternative embodiment, the gaussian mixture model satisfies the formula:

[Formula not reproduced in the source text.]

Where maf represents the minor allele frequency, δ² represents the variance of the minor allele frequency, n represents the number of contamination sources, α represents the contamination ratio, pbinom represents the Bernoulli probability distribution, P(C=i) represents the probability distribution corresponding to i contamination sources, and N represents the probability distribution of the Gaussian mixture model.

S230, performing an optimization solving operation on the Gaussian mixture model to obtain pollution state data corresponding to the sample to be detected.

Exemplary optimization methods include, but are not limited to, at least one of maximum likelihood estimation, the expectation-maximization algorithm, and the Markov chain Monte Carlo algorithm. Specifically, the optim function of the R package stats can be used with the L-BFGS-B algorithm to perform the optimization solving operation on the Gaussian mixture model, so that the pollution state data are obtained.
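The maximum likelihood idea can be illustrated with a minimal, standard-library Python sketch. The model below is an assumption made for illustration and is not the formula of this embodiment: a single contamination source with ratio alpha is assumed to shift the minor allele frequency of a non-heterozygous locus to about 0, alpha/2 or alpha, weighted by a binomial distribution with a hypothetical population frequency of 0.5, and each component is Gaussian with a fixed standard deviation. A coarse grid search stands in for L-BFGS-B so that no third-party optimizer is needed; all names and data are hypothetical.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of k successes in n Bernoulli trials with success rate p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def neg_log_likelihood(mafs, alpha: float, sd: float, pop_freq: float = 0.5) -> float:
    """Negative log-likelihood of the assumed three-component mixture."""
    nll = 0.0
    for maf in mafs:
        # Components: the contaminant carries 0, 1 or 2 copies of the other allele.
        density = sum(
            binom_pmf(k, 2, pop_freq) * normal_pdf(maf, k * alpha / 2, sd)
            for k in range(3)
        )
        nll -= math.log(max(density, 1e-300))  # floor avoids log(0)
    return nll


def fit_contamination_ratio(mafs, sd: float = 0.005, step: float = 0.001,
                            alpha_max: float = 0.5) -> float:
    """Grid-search the alpha that minimizes the negative log-likelihood."""
    candidates = [i * step for i in range(int(alpha_max / step) + 1)]
    return min(candidates, key=lambda a: neg_log_likelihood(mafs, a, sd))


# Hypothetical minor allele frequencies of non-heterozygous loci.
observed = [0.0, 0.05, 0.1, 0.05, 0.0, 0.1, 0.05, 0.05, 0.0, 0.1]
estimate = fit_contamination_ratio(observed)
```

In practice a bounded quasi-Newton method such as L-BFGS-B would also fit the variance and would not be limited to a fixed grid; the grid search is used here only to keep the example self-contained.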

In this embodiment, the contamination status data includes at least one of a contamination ratio, a minor allele frequency variance, and a number of contamination sources.

According to the technical scheme of this embodiment, at least 10 SNP loci to be detected in a sample to be detected are obtained and the genotype corresponding to each is determined; a Gaussian mixture model is constructed based on the minor allele frequencies corresponding to the non-heterozygous SNP loci to be detected; and an optimization solving operation is performed on the Gaussian mixture model to obtain the pollution state data corresponding to the sample to be detected. This solves the problems that existing sample contamination measurement methods are strongly affected by copy number variation/loss of heterozygosity and have difficulty detecting low-level contamination, and improves the accuracy of sample contamination measurement results.

Example three

Fig. 3 is a flowchart of a sample contamination detection method according to a third embodiment of the present invention, where the technical feature of "obtaining the first abundance threshold" in the foregoing embodiment is further refined. As shown in fig. 3, the method includes:

S310, obtaining at least 10 SNP loci to be detected in a sample to be detected and obtaining at least 3 abundance thresholds to be detected.

S320, aiming at each abundance threshold to be detected, calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequency corresponding to each SNP locus to be detected.

S330, taking an abundance threshold to be measured corresponding to the abundance fluctuation level which accords with the inflection point feature in the at least 3 abundance fluctuation levels as a first abundance threshold.

S340, determining genotypes corresponding to the SNP loci to be detected respectively based on the first abundance threshold and mutation abundance corresponding to the SNP loci to be detected respectively.

S350, constructing a Gaussian mixture model based on the minor allele frequency corresponding to the SNP locus to be detected meeting the genotype condition.

S360, performing an optimization solving operation on the Gaussian mixture model to obtain pollution state data corresponding to the sample to be detected.

The third embodiment of the present invention also provides result data obtained by measuring simulation data from 20 groups of human cancer tissue samples by using the sample contamination detection method described in the present embodiment.

Specifically, the 20 groups of human cancer tissue samples are mixed at preset contamination ratios to contaminate the sample data and obtain the samples to be detected. Table 2 below lists the sample contamination specifications provided in the third embodiment of the present invention.

Table 3 below shows pollution ratio data obtained by the sample pollution detection method according to the third embodiment of the present invention.

Contamination level	Contamination detection rate	Predicted contamination ratio
0.5%	100%	0.47%
1%	100%	0.81%
5%	100%	5.2%
10%	100%	9.6%
20%	100%	18%

As can be seen from Table 3, the embodiment of the present invention is applicable to contamination from a single sample source and from multiple sample sources, detects contamination stably at a contamination level as low as 0.5%, and predicts the contamination ratio closely (0.47% to 18% predicted for actual levels of 0.5% to 20%).

Table 4 below shows the pollution detection rate obtained by measuring different pollution levels by using different detection methods according to the third embodiment of the present invention.

Table 5 below shows pollution ratio data obtained by a conventional conPair (v 0.2) method according to example III of the present invention.

Tables 4 and 5 show that the existing ART-DeCo (v1.1) method cannot identify contaminated samples with a low contamination level. The existing conPair (v0.2) method, although able to detect contaminated samples with a low contamination level, yields a contamination ratio significantly higher than the actual contamination ratio for contaminated samples with extensive copy number variation.

Therefore, the sample pollution detection method provided by the embodiment of the invention has higher detection sensitivity and higher accuracy of pollution level evaluation than similar software.

According to the technical scheme of this embodiment, the abundance fluctuation level of the sample to be detected is calculated from the minor allele frequency corresponding to each SNP locus to be detected and each obtained abundance threshold to be detected. Among the at least 3 abundance fluctuation levels, the abundance threshold to be detected whose abundance fluctuation level conforms to the inflection point characteristic is taken as the first abundance threshold, so that the first abundance threshold is determined through the constructed dynamic abundance threshold model. The genotype corresponding to each SNP locus to be detected is then determined according to the first abundance threshold and the mutation abundance corresponding to that locus. This solves the problem that conventional genotype detection methods depend on paired sequencing data, improves the accuracy of the genotype detection results, and thereby further improves the accuracy of the sample contamination detection results.

Example four

Fig. 4 is a schematic structural diagram of a genotype detecting device according to a fourth embodiment of the present invention. As shown in FIG. 4, the device comprises a SNP locus acquisition module 410 to be detected, an abundance fluctuation level determination module 420, a first abundance threshold determination module 430 and a genotype determination module 440.

The SNP site to be detected acquisition module 410 is configured to acquire at least 10 SNP sites to be detected in a sample to be detected and acquire at least 3 abundance thresholds to be detected;

the abundance fluctuation level determining module 420 is configured to calculate, for each abundance threshold to be detected, an abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequencies corresponding to the SNP loci to be detected, respectively;

A first abundance threshold determining module 430, configured to take, as a first abundance threshold, an abundance threshold to be measured corresponding to an abundance fluctuation level that meets the inflection point feature in at least 3 abundance fluctuation levels;

The genotype determining module 440 is configured to determine genotypes corresponding to the SNP loci to be detected respectively based on the first abundance threshold and mutation abundances corresponding to the SNP loci to be detected respectively.

Based on the above embodiments, optionally, the abundance fluctuation level determination module 420 includes:

The chromosome segment determining unit is used for executing segmentation operation on each chromosome based on the copy numbers corresponding to at least two SNP loci to be detected on the chromosome to obtain one or more chromosome segments, wherein the copy number difference value corresponding to any two SNP loci to be detected in each chromosome segment is smaller than a preset difference value threshold;

A target SNP locus determining unit, which is used for taking the SNP locus to be detected, of which the minor allele frequency is greater than the abundance threshold to be detected, in the chromosome section as a target SNP locus;

the abundance fluctuation level unit is used for calculating the abundance fluctuation level of the sample to be detected based on the minor allele frequencies respectively corresponding to the at least two target SNP loci.
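The segmentation and target-locus selection performed by these units can be sketched as follows. This is one possible greedy approach under stated assumptions, not necessarily the segmentation used by the embodiment: loci are walked in position order and a new segment starts whenever adding a locus would make the copy number spread of the current segment reach the preset difference threshold. The `Snp` record, the function names and the example values are hypothetical.

```python
from typing import List, NamedTuple


class Snp(NamedTuple):
    position: int       # locus position on the chromosome
    copy_number: float  # copy number estimated at the locus
    maf: float          # minor allele frequency at the locus


def segment_chromosome(snps: List[Snp], max_diff: float) -> List[List[Snp]]:
    """Split loci into segments whose pairwise copy number differences are < max_diff."""
    segments: List[List[Snp]] = []
    current: List[Snp] = []
    low = high = 0.0
    for snp in sorted(snps, key=lambda s: s.position):
        if current:
            new_low = min(low, snp.copy_number)
            new_high = max(high, snp.copy_number)
            if new_high - new_low < max_diff:
                current.append(snp)
                low, high = new_low, new_high
                continue
            segments.append(current)  # spread too large: close the segment
        current = [snp]
        low = high = snp.copy_number
    if current:
        segments.append(current)
    return segments


def target_snps(segment: List[Snp], abundance_threshold: float) -> List[Snp]:
    """Keep loci whose minor allele frequency exceeds the threshold being tested."""
    return [s for s in segment if s.maf > abundance_threshold]


# Hypothetical loci on one chromosome.
loci = [Snp(100, 2.0, 0.01), Snp(200, 2.1, 0.30), Snp(300, 3.9, 0.02), Snp(400, 4.0, 0.45)]
segments = segment_chromosome(loci, max_diff=0.5)
targets = [target_snps(seg, abundance_threshold=0.05) for seg in segments]
```

Tracking the minimum and maximum copy number of the open segment guarantees that every pair of loci inside a segment differs by less than the threshold, which is the condition stated for the chromosome segment determining unit.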

On the basis of the above embodiment, optionally, the abundance fluctuation level unit is specifically configured to:

sequencing each target SNP locus based on the locus position of each target SNP locus in the chromosome;

Based on the sequencing result, taking the difference value between the frequency of the minor allele of the current target SNP locus and the frequency of the minor allele of the next target SNP locus as an abundance difference value;

And carrying out average value processing on at least one abundance difference value to obtain the abundance fluctuation level of the sample to be detected.
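The three operations above (sort by position, take the difference between neighboring target loci, average) can be sketched as below. The use of the absolute difference is an assumption made here so that positive and negative differences do not cancel; the text only says "difference value". The function name and data are hypothetical.

```python
from typing import List, Tuple


def abundance_fluctuation_level(targets: List[Tuple[int, float]]) -> float:
    """Mean absolute difference of minor allele frequency between neighboring loci.

    `targets` holds (position, minor allele frequency) pairs of target SNP loci.
    """
    ordered = sorted(targets)  # tuples sort by position first
    diffs = [abs(nxt[1] - cur[1]) for cur, nxt in zip(ordered, ordered[1:])]
    if not diffs:
        raise ValueError("at least two target SNP loci are required")
    return sum(diffs) / len(diffs)


# Hypothetical target loci given out of order; sorted order is 0.30, 0.50, 0.40.
level = abundance_fluctuation_level([(300, 0.40), (100, 0.30), (200, 0.50)])
```

With these values the neighboring differences are 0.20 and 0.10, so the fluctuation level is their mean, 0.15.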

Based on the above embodiments, the genotype determining module 440 is optionally specifically configured to:

calculating a second abundance threshold based on the first abundance threshold, wherein the sum of the first abundance threshold and the second abundance threshold is one;

aiming at each SNP locus to be detected, if the mutation abundance corresponding to the SNP locus to be detected is smaller than or equal to a first abundance threshold, the genotype of the SNP locus to be detected is wild type;

If the mutation abundance corresponding to the SNP locus to be detected is greater than or equal to a second abundance threshold, the genotype of the SNP locus to be detected is homozygous;

And if the mutation abundance corresponding to the SNP locus to be detected is larger than the first abundance threshold and smaller than the second abundance threshold, the genotype of the SNP locus to be detected is heterozygous.
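The three-way decision rule can be sketched directly from the conditions above. The function name and the example threshold of 0.03 are illustrative; the check that the first threshold lies below 0.5 is an added assumption so that the two thresholds do not cross.

```python
def call_genotype(mutation_abundance: float, first_threshold: float) -> str:
    """Classify a locus as wild-type, homozygous or heterozygous."""
    if not 0.0 <= first_threshold < 0.5:
        raise ValueError("first abundance threshold is assumed to lie in [0, 0.5)")
    # The two thresholds sum to one.
    second_threshold = 1.0 - first_threshold
    if mutation_abundance <= first_threshold:
        return "wild-type"
    if mutation_abundance >= second_threshold:
        return "homozygous"
    return "heterozygous"


# Hypothetical mutation abundances classified with a first threshold of 0.03.
calls = [call_genotype(x, 0.03) for x in (0.01, 0.45, 0.98)]
```

Only the loci called wild-type or homozygous would then be passed on to the Gaussian mixture model, since the genotype condition excludes heterozygous loci.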

Based on the above embodiments, optionally, the SNP site obtaining module 410 to be tested is specifically configured to:

And respectively taking at least 10 SNP loci with the crowd frequency meeting a preset crowd frequency range in a sample to be detected as SNP loci to be detected, wherein the preset crowd frequency range comprises 0.4-0.6.
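The population-frequency filter can be sketched as a simple range check. The locus names and frequencies are hypothetical, and treating the 0.4-0.6 bounds as inclusive is an assumption.

```python
from typing import Dict, List


def select_snps(population_freqs: Dict[str, float], low: float = 0.4,
                high: float = 0.6, min_sites: int = 10) -> List[str]:
    """Return loci whose population frequency falls within [low, high]."""
    selected = [name for name, freq in population_freqs.items() if low <= freq <= high]
    if len(selected) < min_sites:
        raise ValueError(f"only {len(selected)} loci pass; at least {min_sites} are needed")
    return selected


# Hypothetical population frequencies; snp_11 and snp_12 fall outside the range.
freqs = {
    "snp_01": 0.41, "snp_02": 0.45, "snp_03": 0.48, "snp_04": 0.50,
    "snp_05": 0.52, "snp_06": 0.55, "snp_07": 0.58, "snp_08": 0.60,
    "snp_09": 0.40, "snp_10": 0.47, "snp_11": 0.10, "snp_12": 0.85,
}
selected = select_snps(freqs)
```

Raising an error when fewer than 10 loci pass mirrors the requirement that at least 10 SNP loci to be detected are obtained.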

The genotype detection device provided by the embodiment of the invention can execute the genotype detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Example five

Fig. 5 is a schematic structural diagram of a sample contamination detection apparatus according to a fifth embodiment of the present invention. As shown in FIG. 5, the apparatus includes a genotype determination module 510, a Gaussian mixture model construction module 520, and a pollution status data determination module 530.

The genotype determining module 510 is configured to obtain at least 10 SNP sites to be detected in a sample to be detected, and determine genotypes corresponding to the SNP sites to be detected respectively;

The Gaussian mixture model construction module 520 is used for constructing a Gaussian mixture model based on the minor allele frequency corresponding to the SNP locus to be detected which meets the genotype condition, wherein the genotype condition comprises that the genotype is wild type and/or the genotype is homozygous;

The pollution state data determining module 530 is configured to perform an optimization solving operation on the Gaussian mixture model to obtain pollution state data corresponding to the sample to be detected, where the pollution state data includes at least one of a pollution proportion, a minor allele frequency variance, and a number of pollution sources.

On the basis of the above embodiment, optionally, the gaussian mixture model satisfies the formula:

[Formula not reproduced in the source text.]

Based on the above embodiment, optionally, the genotype determining module 510 includes:

The first abundance threshold acquisition unit is used for acquiring a first abundance threshold and determining mutation abundance corresponding to each SNP locus to be detected;

The genotype determining unit is used for determining genotypes corresponding to the SNP loci to be detected respectively based on the first abundance threshold and mutation abundance corresponding to the SNP loci to be detected respectively.

On the basis of the above embodiments, optionally, the genotype determining unit is specifically configured to:

On the basis of the above embodiment, optionally, the first abundance threshold acquiring unit includes:

the abundance threshold value obtaining subunit to be detected is used for obtaining at least three abundance threshold values to be detected;

the abundance fluctuation level determining subunit is used for calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequency corresponding to each SNP locus to be detected respectively aiming at each abundance threshold to be detected;

A first abundance threshold determining subunit, configured to use, as a first abundance threshold, an abundance threshold to be measured corresponding to an abundance fluctuation level that meets an inflection point feature in at least 3 abundance fluctuation levels.

Based on the above embodiments, optionally, the abundance fluctuation level determining subunit is specifically configured to:

For each chromosome where each SNP locus to be detected is located, based on the copy numbers corresponding to at least two SNP loci to be detected on the chromosome, performing segmentation operation on the chromosome to obtain one or more chromosome sections, wherein the copy number difference value corresponding to any two SNP loci to be detected in each chromosome section is smaller than a preset difference value threshold;

aiming at each chromosome segment, taking a SNP locus to be detected, of which the minor allele frequency is greater than the abundance threshold to be detected, in the chromosome segment as a target SNP locus;

and calculating the abundance fluctuation level of the sample to be detected based on the minor allele frequencies respectively corresponding to the at least two target SNP loci.

The SNP locus determination unit is used for respectively taking at least 10 SNP loci with the crowd frequency meeting the preset crowd frequency range in the sample to be detected as SNP loci to be detected, wherein the preset crowd frequency range comprises 0.4-0.6.

The sample pollution detection device provided by the embodiment of the invention can execute the sample pollution detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Example six

Fig. 6 is a schematic structural diagram of an electronic device according to a sixth embodiment of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.

As shown in fig. 6, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.

Various components in the electronic device 10 are connected to the I/O interface 15, including an input unit 16, such as a keyboard, mouse, etc., an output unit 17, such as various types of displays, speakers, etc., a storage unit 18, such as a magnetic disk, optical disk, etc., and a communication unit 19, such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The processor 11 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the respective methods and processes described above, such as a genotype detection method or a sample contamination detection method.

In some embodiments, the genotype detection method or sample contamination detection method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the genotype detection method or sample contamination detection method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the genotype detection method or the sample contamination detection method in any other suitable way (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

The computer program for implementing the genotyping method or the sample contamination detection method of the invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

Example seven

The seventh embodiment of the present invention also provides a computer-readable storage medium storing computer instructions for causing a processor to execute a genotype detection method, the method comprising:

Or computer instructions for causing a processor to perform a method of sample contamination detection, the method comprising:

In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), a blockchain network, and the Internet.

The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the high management difficulty and weak service scalability of traditional physical hosts and VPS services.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A genotype detection method, comprising:

Obtain at least 10 SNP sites to be tested in the sample to be tested and obtain at least 3 abundance thresholds to be tested;

For each abundance threshold to be tested, the abundance fluctuation level of the sample to be tested is calculated based on the abundance threshold to be tested and the minor allele frequency corresponding to each SNP site to be tested;

The abundance threshold to be measured corresponding to the abundance fluctuation level that meets the inflection point characteristics among at least three abundance fluctuation levels is used as the first abundance threshold;

Determining the genotype corresponding to each SNP site to be tested based on the first abundance threshold and the mutation abundance corresponding to each SNP site to be tested;

The calculating the abundance fluctuation level of the sample to be tested based on the abundance threshold to be tested and the minor allele frequency corresponding to each SNP site to be tested comprises:

For each chromosome where each SNP site to be tested is located, a segmentation operation is performed on the chromosome based on the copy numbers corresponding to at least two SNP sites to be tested on the chromosome to obtain one or more chromosome segments; wherein the copy number difference corresponding to any two SNP sites to be tested in each chromosome segment is less than a preset difference threshold;

For each chromosome segment, the SNP site to be tested in which the minor allele frequency in the chromosome segment is greater than the abundance threshold to be tested is taken as the target SNP site;

Based on the minor allele frequencies corresponding to at least two target SNP sites, the abundance fluctuation level of the sample to be tested is calculated.

2. The method according to claim 1, wherein the calculating the abundance fluctuation level of the sample to be tested based on the minor allele frequencies corresponding to at least two target SNP sites comprises:

Sort each target SNP site based on its location in the chromosome;

Based on the sorting results, the difference between the minor allele frequency of the current target SNP site and the minor allele frequency of the next target SNP site is used as the abundance difference;

Averaging processing is performed on at least one abundance difference to obtain an abundance fluctuation level of the sample to be detected.

3. The method according to claim 1, wherein determining the genotype corresponding to each SNP site to be tested based on the first abundance threshold and the mutation abundance corresponding to each SNP site to be tested comprises:

Calculating a second abundance threshold based on the first abundance threshold; wherein the sum of the first abundance threshold and the second abundance threshold is one;

For each SNP site to be tested, if the mutation abundance corresponding to the SNP site to be tested is less than or equal to the first abundance threshold, the genotype of the SNP site to be tested is wild type;

If the mutation abundance corresponding to the SNP site to be tested is greater than or equal to the second abundance threshold, the genotype of the SNP site to be tested is homozygous;

If the mutation abundance corresponding to the SNP site to be tested is greater than the first abundance threshold and less than the second abundance threshold, the genotype of the SNP site to be tested is heterozygous.

4. The method according to claim 1, wherein obtaining at least 10 SNP sites to be detected in the sample to be detected comprises:

At least 10 SNP sites in the sample to be tested whose population frequencies meet the preset population frequency range are respectively used as SNP sites to be tested; wherein, the preset population frequency range includes 0.4-0.6.

5. A sample contamination detection method, comprising:

Obtain at least 10 SNP sites to be detected in the sample to be tested, and determine the genotype corresponding to each SNP site to be detected;

Constructing a Gaussian mixture model based on the minor allele frequency corresponding to the SNP site to be tested that meets the genotype conditions; wherein the genotype conditions include the genotype being wild type and/or the genotype being homozygous;

Performing an optimization solution operation on the Gaussian mixture model to obtain contamination status data corresponding to the sample to be tested;

Wherein the contamination status data includes at least one of a contamination ratio, a minor allele frequency variance and a number of contamination sources;

The Gaussian mixture model satisfies the formula:

wherein,

Where maf represents the minor allele frequency, δ̂² represents the variance of the minor allele frequency, n represents the number of contamination sources, α represents the contamination ratio, pbinom represents the Bernoulli probability distribution, P(C=i) represents the probability distribution corresponding to the i-th contamination source, and N represents the probability distribution of the Gaussian mixture model;

Determining the genotype corresponding to each SNP site to be tested includes:

Obtaining a first abundance threshold, and determining the mutation abundance corresponding to each SNP site to be tested;

The obtaining of the first abundance threshold comprises:

Obtaining at least three abundance thresholds to be measured;

6. The method according to claim 5, wherein determining the genotype corresponding to each SNP site to be tested based on the first abundance threshold and the mutation abundance corresponding to each SNP site to be tested comprises:

7. The method according to claim 5, wherein the calculating the abundance fluctuation level of the sample to be tested based on the minor allele frequencies corresponding to at least two target SNP sites comprises:

Sorting the target SNP sites by their positions on the chromosome;

8. The method according to claim 5, wherein obtaining at least 10 SNP sites to be detected in the sample to be detected comprises:

9. A genotype detection device, comprising:

a to-be-tested SNP site acquisition module, configured to obtain at least 10 SNP sites to be tested in a sample to be tested and to obtain at least 3 abundance thresholds to be measured;

an abundance fluctuation level determination module, configured to calculate, for each abundance threshold to be measured, the abundance fluctuation level of the sample to be tested based on the abundance threshold to be measured and the minor allele frequency corresponding to each SNP site to be measured;

a first abundance threshold determination module, configured to use the abundance threshold to be measured corresponding to the abundance fluctuation level that meets the inflection point feature among the at least three abundance fluctuation levels as the first abundance threshold;

a genotype determination module, configured to determine the genotype corresponding to each SNP site to be tested based on the first abundance threshold and the mutation abundance corresponding to each SNP site to be tested;

The abundance fluctuation level determination module comprises:

a chromosome segment determination unit configured to, for each chromosome where each SNP site to be tested is located, perform a segmentation operation on the chromosome based on the copy numbers corresponding to at least two SNP sites to be tested on the chromosome, thereby obtaining one or more chromosome segments; wherein the copy number difference between any two SNP sites to be tested in each chromosome segment is less than a preset difference threshold;

a target SNP site determination unit, configured to, for each chromosome segment, determine a SNP site to be tested in the chromosome segment whose minor allele frequency is greater than the abundance threshold to be tested as a target SNP site;

an abundance fluctuation level unit, configured to calculate the abundance fluctuation level of the sample to be tested based on the minor allele frequencies corresponding to at least two target SNP sites.
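
The chromosome segment determination unit and the target SNP site determination unit of claim 9 can be illustrated with the sketch below. It is one possible reading, not the patented implementation: the greedy left-to-right segmentation, the function names and the dictionary keys ("pos", "copy_number", "maf") are all assumptions. The greedy rule tracks the minimum and maximum copy number in the current segment, which guarantees that any two sites in a segment differ by less than the preset difference threshold.

```python
def segment_by_copy_number(sites, max_diff):
    """Split the sites of one chromosome into copy-number-consistent segments.

    ASSUMPTION: greedy scan in positional order; a new segment starts as
    soon as adding the next site would make (max - min) copy number within
    the segment reach `max_diff`.
    """
    segments, current = [], []
    lo = hi = None
    for site in sorted(sites, key=lambda s: s["pos"]):
        cn = site["copy_number"]
        if current and max(hi, cn) - min(lo, cn) >= max_diff:
            segments.append(current)
            current, lo, hi = [], None, None
        current.append(site)
        lo = cn if lo is None else min(lo, cn)
        hi = cn if hi is None else max(hi, cn)
    if current:
        segments.append(current)
    return segments


def target_snp_sites(segment, abundance_threshold):
    """Sites whose minor allele frequency exceeds the threshold under test."""
    return [s for s in segment if s["maf"] > abundance_threshold]


# Illustrative input only; all values are made up.
example_chromosome = [
    {"pos": 10, "copy_number": 2.0, "maf": 0.05},
    {"pos": 20, "copy_number": 2.1, "maf": 0.45},
    {"pos": 30, "copy_number": 3.0, "maf": 0.48},
    {"pos": 40, "copy_number": 3.05, "maf": 0.02},
]
example_segments = segment_by_copy_number(example_chromosome, max_diff=0.5)
example_targets = [target_snp_sites(seg, 0.1) for seg in example_segments]
```

In this made-up example the copy number jumps from about 2 to about 3 between positions 20 and 30, so the chromosome is split there, and each segment then contributes only the sites whose minor allele frequency exceeds the threshold being evaluated.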

10. A sample contamination detection device, comprising:

a genotype determination module, configured to obtain at least 10 SNP sites to be tested in a sample to be tested and to determine the genotype corresponding to each SNP site to be tested;

a Gaussian mixture model construction module, configured to construct a Gaussian mixture model based on the minor allele frequency corresponding to the SNP site to be tested that meets the genotype conditions; wherein the genotype conditions include the genotype being wild type and/or the genotype being homozygous;

a contamination status data determination module, configured to perform an optimization solution operation on the Gaussian mixture model to obtain contamination status data corresponding to the sample to be tested;

Wherein the contamination status data includes a contamination ratio, a minor allele frequency variance and a number of contamination sources;

The Gaussian mixture model satisfies the formula:

wherein,

The genotype determination module comprises:

a first abundance threshold acquisition unit, configured to obtain a first abundance threshold and to determine the mutation abundance corresponding to each SNP site to be tested;

a genotype determination unit, configured to determine the genotype corresponding to each SNP site to be tested based on the first abundance threshold and the mutation abundance corresponding to each SNP site to be tested;

The first abundance threshold acquisition unit includes:

a to-be-measured abundance threshold acquisition subunit, configured to obtain at least three abundance thresholds to be measured;

an abundance fluctuation level determination subunit, configured to calculate, for each abundance threshold to be measured, the abundance fluctuation level of the sample to be tested based on the abundance threshold to be measured and the minor allele frequency corresponding to each SNP site to be measured;

a first abundance threshold determination subunit, configured to use the abundance threshold to be measured corresponding to the abundance fluctuation level that meets the inflection point feature among the at least three abundance fluctuation levels as the first abundance threshold;

The abundance fluctuation level determination subunit is specifically used to:

11. An electronic device, characterized in that the electronic device comprises:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein,

The memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, causes the at least one processor to implement the genotype detection method described in any one of claims 1 to 4 or the sample contamination detection method described in any one of claims 5 to 8.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions which, when executed, cause a processor to implement the genotype detection method described in any one of claims 1 to 4 or the sample contamination detection method described in any one of claims 5 to 8.