Detailed Description
To help those skilled in the art better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a genotype detection method according to a first embodiment of the present invention. The method can be performed by a genotype detection apparatus, which can be implemented in hardware and/or software and can be configured in a terminal device. As shown in Fig. 1, the method includes:
S110, obtaining at least 10 SNP loci to be detected in a sample to be detected and obtaining at least 3 abundance thresholds to be detected.
The sample to be detected is a single tissue sample, such as a cancer tissue sample. In this example, the SNP loci to be detected are germline SNP loci. The human genome contains a large number of mutation sites, which can be classified by origin into germline mutations and somatic mutations. A germline mutation, also called a germ cell mutation, originates from a germ cell such as a sperm or an ovum, so it is usually carried by all cells in the body, including those of a cancer tissue sample. A somatic mutation, also called an acquired mutation, arises during growth or under the influence of environmental factors and is usually carried by only a portion of the cells.
Specifically, quality control is first performed on the FASTQ file (the raw sequencer output) containing the high-throughput sequencing reads of the sample to be detected. Alignment software is then used to align the reads to a human reference genome and generate a SAM file, and the SNP sites in the aligned reads are taken as the SNP sites to be detected. By way of example, the alignment software may be BWA-MEM (0.7.10), and the human reference genome may be hg19 or b37. Illustratively, the SAM file is converted to a BAM file by samtools (0.1.19).
As an optional implementation, obtaining at least 10 SNP loci to be detected in the sample to be detected comprises taking at least 10 SNP loci in the sample whose population frequency falls within a preset population frequency range as the SNP loci to be detected, wherein the preset population frequency range comprises 0.4-0.6. The population frequency represents how often the SNP locus occurs in population genomes. Specifically, SNP loci that are aligned to the human reference genome and have a population frequency of 0.4-0.6 are taken as the SNP loci to be detected.
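The frequency filter described above can be illustrated with a minimal Python sketch. The function name `select_candidate_snps`, the dictionary layout and the site identifiers are hypothetical and serve only as an illustration; the 0.4-0.6 range and the minimum of 10 sites come from the text.

```python
from typing import Dict, List


def select_candidate_snps(
    population_freq: Dict[str, float],
    low: float = 0.4,
    high: float = 0.6,
    min_sites: int = 10,
) -> List[str]:
    """Return the IDs of SNP sites whose population frequency lies in [low, high]."""
    selected = [site for site, freq in population_freq.items() if low <= freq <= high]
    # The embodiment requires at least 10 SNP loci to be detected.
    if len(selected) < min_sites:
        raise ValueError(
            f"only {len(selected)} sites pass the filter; at least {min_sites} are required"
        )
    return selected
```

Whether the range boundaries are inclusive is not stated in the text; the sketch assumes that they are.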
Specifically, an abundance threshold to be detected is a candidate screening threshold applied to the minor allele frequency. The spacing between successive abundance thresholds to be detected may be uniform or non-uniform. For example, a set of abundance thresholds to be detected th satisfies th ∈ {0%, 1%, 2%, ..., 10%, 30%}. The specific values of the abundance thresholds to be detected are not limited, and a user can set each of them according to actual requirements.
S120, aiming at each abundance threshold to be detected, calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequency corresponding to each SNP locus to be detected.
Calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequencies corresponding to the SNP loci to be detected comprises the following. For each chromosome, a segmentation operation is performed on the chromosome based on the copy numbers corresponding to at least two SNP loci to be detected on it, yielding one or more chromosome segments, wherein the copy number difference between any two SNP loci to be detected within a chromosome segment is smaller than a preset difference threshold. The SNP loci to be detected in the chromosome segment whose minor allele frequency is greater than the abundance threshold to be detected are taken as target SNP loci. The abundance fluctuation level of the sample to be detected is then calculated based on the minor allele frequencies corresponding to at least two target SNP loci.
The types of chromosomes include autosomes and/or sex chromosomes. Of the 23 pairs of human chromosomes, 22 pairs are autosomes and 1 pair is the sex chromosomes. The chromosomes in this example include all chromosomes that may carry heterozygous variations.
Specifically, the copy number characterizes the number of occurrences of a gene or gene sequence in the genome, that is, the number of times a gene fragment is repeated in the genome. Normally, the copy number is 2. When a copy number variation occurs, the genome is rearranged and the copy number of the gene changes from 2 to n, where n is a natural number not equal to 2.
Specifically, the coverage depth of each SNP locus to be detected is calculated by GATK software, and the copy number of the SNP locus is calculated based on the coverage depth. The coverage depth refers to the ratio of the total number of bases (bp) obtained by sequencing to the genome size. In this example, the coverage depth of the SNP loci to be detected is higher than 200X.
By way of example, a CBS (circular binary segmentation) algorithm is used to divide a chromosome into one or more chromosome segments based on copy number. The CBS algorithm may be implemented by, among others, the R package PSCBS. In this embodiment, the copy number difference between any two SNP loci to be detected in a chromosome segment is smaller than the preset difference threshold, which may be 2. For example, if the copy numbers of 4 SNP sites to be detected on a chromosome are 2, 3, 5 and 6, chromosome segment A contains the 2 SNP sites with copy numbers 2 and 3, and chromosome segment B contains the 2 SNP sites with copy numbers 5 and 6.
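The copy-number constraint on a chromosome segment can be illustrated with the following Python sketch. It is not the CBS algorithm: it is a naive greedy grouping, given here as an assumption, that only enforces the stated condition that any two sites in a group differ in copy number by less than the preset difference threshold. The function name `group_by_copy_number` and the site labels are hypothetical.

```python
from typing import Dict, List, Optional


def group_by_copy_number(
    copy_numbers: Dict[str, float], max_diff: float = 2.0
) -> List[List[str]]:
    """Greedy grouping so that, within each group, max - min copy number < max_diff."""
    # Visit sites in order of increasing copy number (not genomic position).
    ordered = sorted(copy_numbers, key=lambda site: copy_numbers[site])
    groups: List[List[str]] = []
    group_min: Optional[float] = None
    for site in ordered:
        cn = copy_numbers[site]
        if group_min is None or cn - group_min >= max_diff:
            # Start a new group whose smallest copy number is cn.
            groups.append([site])
            group_min = cn
        else:
            groups[-1].append(site)
    return groups


# Copy numbers 2, 3, 5 and 6 with a difference threshold of 2.
print(group_by_copy_number({"s1": 2, "s2": 3, "s3": 5, "s4": 6}))
# [['s1', 's2'], ['s3', 's4']]
```

A real CBS implementation additionally tests change points statistically along the chromosome, which this sketch does not attempt.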
For each mutated SNP site, the major allele is the allele with the highest count in a given population, the minor allele is the allele with the second-highest count, and the minor allele frequency is the ratio of the number of minor alleles to the number of all alleles. Alleles are the variants of a gene that occupy the same position on a pair of homologous chromosomes and control different forms of the same trait. For example, suppose genome sequencing of 100 persons finds three alleles A, C and G at a SNP site on a chromosome, with base A, base C and base G occurring 100, 80 and 20 times, respectively. The minor allele is then C, and the minor allele frequency corresponding to the SNP site is 80/200 = 0.4.
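The worked example above can be reproduced with a short Python sketch. The function name `minor_allele_frequency` is hypothetical; the definition (second-highest allele count divided by the total allele count) follows the text.

```python
from typing import Dict


def minor_allele_frequency(allele_counts: Dict[str, int]) -> float:
    """Second-highest allele count divided by the total allele count."""
    counts = sorted(allele_counts.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        raise ValueError("no alleles observed at this site")
    if len(counts) < 2:
        # Only one allele observed: there is no minor allele.
        return 0.0
    return counts[1] / total


# 100 persons (200 alleles): A occurs 100 times, C 80 times, G 20 times.
print(minor_allele_frequency({"A": 100, "C": 80, "G": 20}))  # 0.4
```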
As an alternative implementation, calculating the abundance fluctuation level of the sample to be detected based on the minor allele frequencies corresponding to at least two target SNP loci comprises: sorting the target SNP loci by their positions on the chromosome; based on the sorting result, taking the difference between the minor allele frequency of the current target SNP locus and that of the next target SNP locus as an abundance difference; and averaging the at least one abundance difference to obtain the abundance fluctuation level of the sample to be detected.
In an exemplary embodiment, assume that 4 target SNP sites on a chromosome are, in order, target SNP site A, target SNP site B, target SNP site C and target SNP site D, with corresponding copy numbers 2, 6, 5 and 3. Chromosome segment A then contains, in order, target SNP sites A and D, and chromosome segment B contains, in order, target SNP sites B and C.
For example, if chromosome segment A contains 100 target SNP sites, 99 abundance differences are calculated, and these are averaged to obtain the abundance fluctuation level corresponding to chromosome segment A.
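The per-segment calculation described above can be sketched in Python as follows. The function name `abundance_fluctuation_level` and the input layout are hypothetical. The sketch assumes that the abundance difference is taken as an absolute value, which the text does not state, and it does not address how the levels of several segments are combined into a sample-level value.

```python
from statistics import mean
from typing import List, Tuple


def abundance_fluctuation_level(
    sites: List[Tuple[int, float]], threshold: float
) -> float:
    """sites: (position, minor allele frequency) pairs from one chromosome segment."""
    # Target sites: minor allele frequency greater than the candidate threshold,
    # sorted by position on the chromosome.
    targets = sorted((pos, maf) for pos, maf in sites if maf > threshold)
    if len(targets) < 2:
        raise ValueError("at least two target SNP sites are required")
    # Difference between each site and the next one (absolute value is an assumption).
    diffs = [abs(nxt[1] - cur[1]) for cur, nxt in zip(targets, targets[1:])]
    # n target sites give n - 1 differences, which are averaged.
    return mean(diffs)
```

With 100 target sites in a segment, `diffs` holds 99 values, in line with the example in the text.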
S130, taking an abundance threshold to be measured corresponding to the abundance fluctuation level which accords with the inflection point feature in the at least 3 abundance fluctuation levels as a first abundance threshold.
Specifically, an abundance fluctuation level curve is constructed with the abundance threshold to be detected as the abscissa and the abundance fluctuation level as the ordinate. As the abundance threshold to be detected increases, the abundance fluctuation level first increases and then tends to stabilize; the inflection point is where the curve passes from the rising phase to the stable phase.
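The text does not give a numerical criterion for the inflection point feature. Purely as an assumption, the following Python sketch selects the first threshold after which the curve rises by less than a small tolerance; the function name `pick_inflection_threshold` and the parameter `tolerance` are hypothetical.

```python
from typing import Sequence


def pick_inflection_threshold(
    thresholds: Sequence[float],
    fluctuation_levels: Sequence[float],
    tolerance: float = 0.001,
) -> float:
    """Return the first threshold after which the curve rises by less than `tolerance`."""
    if len(thresholds) != len(fluctuation_levels) or len(thresholds) < 3:
        raise ValueError("need at least 3 matching (threshold, level) points")
    # Order the curve by increasing threshold (abscissa).
    points = sorted(zip(thresholds, fluctuation_levels))
    for (th, level), (_, next_level) in zip(points, points[1:]):
        if next_level - level < tolerance:
            return th
    # The curve never flattened within the tested range: fall back to the largest threshold.
    return points[-1][0]
```

Other knee-detection rules, for example one based on the second difference of the curve, would fit the description equally well.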
S140, determining genotypes corresponding to the SNP loci to be detected respectively based on the first abundance threshold and mutation abundance corresponding to the SNP loci to be detected respectively.
The mutation abundance, also called the mutant allele frequency, is the ratio of the number of reads carrying the mutant allele to the total number of reads covering the SNP locus. Specifically, samtools mpileup is used to count the number of reads A carrying the mutant allele and the number of reads B carrying the wild-type allele at each SNP locus to be detected, and mutation abundance = A/(A+B). For example, if the alleles observed at four SNP sites are denoted AAAA, BBBB, ABBB and AAAB, where A is the mutant allele, the corresponding mutation abundances are 100%, 0%, 25% and 75%, respectively.
In this example, the genotypes include heterozygous, wild-type and homozygous. Heterozygous characterizes an individual that carries different alleles on the homologous chromosomes, such as AB. Wild-type characterizes an individual that carries only the wild-type allele on the homologous chromosomes, such as BB, and homozygous characterizes an individual that carries only the mutant allele on the homologous chromosomes, such as AA.
As an alternative implementation, determining the genotypes corresponding to the SNP loci to be detected based on the first abundance threshold and the mutation abundance corresponding to each SNP locus to be detected comprises calculating a second abundance threshold based on the first abundance threshold, wherein the sum of the first abundance threshold and the second abundance threshold is one. For each SNP locus to be detected: if its mutation abundance is smaller than or equal to the first abundance threshold, its genotype is wild-type; if its mutation abundance is greater than or equal to the second abundance threshold, its genotype is homozygous; and if its mutation abundance is greater than the first abundance threshold and smaller than the second abundance threshold, its genotype is heterozygous.
Specifically, the second abundance threshold = 1 - the first abundance threshold, and the genotype of the SNP site to be detected satisfies the following relationship:
$$
\text{genotype} =
\begin{cases}
\text{wild-type}, & AF \le tBest \\
\text{heterozygous}, & tBest < AF < mBest \\
\text{homozygous}, & AF \ge mBest
\end{cases}
\qquad mBest = 1 - tBest
$$
where AF represents the mutation abundance, tBest represents the first abundance threshold, and mBest represents the second abundance threshold.
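The genotype rule above can be sketched in Python as follows. The function names `mutation_abundance` and `call_genotype` are hypothetical, and the requirement that the first abundance threshold be below 0.5 is an assumption added so that the heterozygous interval is not empty.

```python
def mutation_abundance(mutant_reads: int, wild_reads: int) -> float:
    """Mutation abundance AF = A / (A + B)."""
    total = mutant_reads + wild_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return mutant_reads / total


def call_genotype(af: float, t_best: float) -> str:
    """Classify a site from its mutation abundance and the first abundance threshold."""
    if not 0.0 <= t_best < 0.5:
        # Assumption: tBest < 0.5 so that tBest < mBest.
        raise ValueError("t_best must lie in [0, 0.5)")
    m_best = 1.0 - t_best  # second abundance threshold
    if af <= t_best:
        return "wild-type"
    if af >= m_best:
        return "homozygous"
    return "heterozygous"


print(call_genotype(mutation_abundance(50, 50), t_best=0.03))  # heterozygous
```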
The data in this example are based on high-throughput sequencing results of 100 real tissue samples. The tissue samples include samples with different tumor cell proportions and different chromosome stabilities, both uncontaminated and contaminated to different levels. In parallel, control leukocyte samples from the same individuals as the tissue samples were subjected to high-throughput sequencing to determine the true genotypes of the SNPs as the reference standard for performance. Unlike cancer tissue samples, leukocyte samples have a stable genome and almost no contamination introduced during sample storage and preparation. To ensure the reliability of the reference standard, genotypes were interpreted directly with a simple model in which an abundance of approximately 0% is wild-type, approximately 100% is homozygous and approximately 50% is heterozygous. To allow for data noise, an abundance fluctuation of 5% is tolerated, i.e., an abundance of 0%-5% is wild-type, 95%-100% is homozygous, and 47.5%-52.5% is heterozygous. For the tissue samples, genotype identification based on two static threshold combinations was compared with the dynamic threshold algorithm of the present invention.
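The simple reference-standard model used for the leukocyte controls can be sketched in Python as follows. The function name `reference_genotype` is hypothetical, and returning "undetermined" for abundances outside the three stated intervals is an assumption, since the text does not say how such sites are handled.

```python
def reference_genotype(af: float, tolerance: float = 0.05) -> str:
    """Interpret a leukocyte-control abundance with the simple 0% / 50% / 100% model."""
    if not 0.0 <= af <= 1.0:
        raise ValueError("abundance must lie in [0, 1]")
    if af <= tolerance:                 # 0%-5%
        return "wild-type"
    if af >= 1.0 - tolerance:           # 95%-100%
        return "homozygous"
    if abs(af - 0.5) <= tolerance / 2:  # 47.5%-52.5%
        return "heterozygous"
    return "undetermined"               # assumption: not covered by the text
```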
Wherein the static threshold combination 1:
wherein the static threshold combination 2:
Further, with heterozygous SNP sites set as positive sites and non-heterozygous sites set as negative sites, the concordance of the three strategies with the leukocyte reference standard was assessed, as shown in Table 1 below. The dynamic threshold method performs well and stably in both sensitivity and specificity, and its accuracy is the highest of the three strategies.
Table 1 is a list of the concordance of genotype identification provided according to Example 1 of the present invention.
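The concordance measures named above (sensitivity, specificity and accuracy, with heterozygous sites as positives) can be computed as in the following Python sketch. The function name `heterozygous_concordance` and the genotype labels are hypothetical, and the short input lists in the usage line are illustrative values, not data from the example.

```python
from typing import Dict, Sequence


def heterozygous_concordance(
    called: Sequence[str], reference: Sequence[str]
) -> Dict[str, float]:
    """Compare called genotypes with the reference; heterozygous sites are positives."""
    if len(called) != len(reference):
        raise ValueError("called and reference must have the same length")
    tp = fp = tn = fn = 0
    for c, r in zip(called, reference):
        c_pos = c == "heterozygous"
        r_pos = r == "heterozygous"
        if c_pos and r_pos:
            tp += 1
        elif c_pos and not r_pos:
            fp += 1
        elif not c_pos and r_pos:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "accuracy": (tp + tn) / total if total else nan,
    }


# Illustrative labels only.
print(heterozygous_concordance(
    ["heterozygous", "heterozygous", "wild-type"],
    ["heterozygous", "wild-type", "wild-type"],
))
```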
According to the technical solution of this embodiment, the abundance fluctuation level of the sample to be detected is determined based on the minor allele frequency corresponding to each SNP locus to be detected and the obtained abundance thresholds to be detected; the abundance threshold to be detected whose abundance fluctuation level conforms to the inflection point feature among the at least 3 abundance fluctuation levels is taken as the first abundance threshold, so that the first abundance threshold is determined by the constructed dynamic abundance threshold model; and the genotype corresponding to each SNP locus to be detected is determined from the first abundance threshold and the mutation abundance corresponding to that locus. This solves the problem that conventional genotype detection methods depend on paired sequencing data and improves the accuracy of genotype detection results.
Example 2
Fig. 2 is a flowchart of a sample contamination detection method according to a second embodiment of the present invention. The present embodiment is applicable to detecting and evaluating the contamination level of a single tissue sample. The method may be performed by a sample contamination detection apparatus, which may be implemented in hardware and/or software and may be configured in a terminal device. As shown in Fig. 2, the method includes:
S210, obtaining at least 10 SNP loci to be detected in a sample to be detected, and determining genotypes corresponding to the SNP loci to be detected respectively.
The sample to be detected is a single tissue sample, such as a cancer tissue sample. In this example, the SNP loci to be detected are germline SNP loci. The human genome contains a large number of mutation sites, which can be classified by origin into germline mutations and somatic mutations. A germline mutation, also called a germ cell mutation, originates from a germ cell such as a sperm or an ovum, so it is usually carried by all cells in the body, including those of a cancer tissue sample. A somatic mutation, also called an acquired mutation, arises during growth or under the influence of environmental factors and is usually carried by only a portion of the cells.
Specifically, quality control is first performed on the FASTQ file (the raw sequencer output) containing the high-throughput sequencing reads of the sample to be detected. Alignment software is then used to align the reads to a human reference genome and generate a SAM file, and the SNP sites in the aligned reads are taken as the SNP sites to be detected. By way of example, the alignment software may be BWA-MEM (0.7.10), and the human reference genome may be hg19 or b37. Illustratively, the SAM file is converted to a BAM file by samtools (0.1.19).
As an optional implementation, obtaining at least 10 SNP loci to be detected in the sample to be detected comprises taking at least 10 SNP loci in the sample whose population frequency falls within a preset population frequency range as the SNP loci to be detected, wherein the preset population frequency range comprises 0.4-0.6. The population frequency represents how often the SNP locus occurs in population genomes. Specifically, SNP loci that are aligned to the human reference genome and have a population frequency of 0.4-0.6 are taken as the SNP loci to be detected.
As an alternative implementation mode, determining genotypes corresponding to the SNP loci to be detected respectively comprises determining the genotypes corresponding to the SNP loci to be detected respectively by adopting a preset detection method, wherein the preset detection method comprises at least one of a sequencing method, a chip method and a mass spectrometry method.
In another optional embodiment, determining the genotypes corresponding to the SNP loci to be detected comprises obtaining a first abundance threshold, determining the mutation abundance corresponding to each SNP locus to be detected, and determining the genotype corresponding to each SNP locus to be detected based on the first abundance threshold and the mutation abundance corresponding to that locus.
As an alternative embodiment, the first abundance threshold is preset by the user. For example, the first preset abundance threshold may be 0.01 or 0.03, where the specific value of the first preset abundance threshold is not limited, and may be set by the user according to actual needs.
The mutation abundance, also called the mutant allele frequency, is the ratio of the number of reads carrying the mutant allele to the total number of reads covering the SNP locus, where alleles are the variants of a gene that occupy the same position on a pair of homologous chromosomes and control different forms of the same trait. Specifically, samtools mpileup is used to count the number of reads A carrying the mutant allele and the number of reads B carrying the wild-type allele at each SNP locus to be detected, and mutation abundance = A/(A+B). For example, if the alleles observed at four SNP sites are denoted AAAA, BBBB, ABBB and AAAB, where A is the mutant allele, the corresponding mutation abundances are 100%, 0%, 25% and 75%, respectively.
In this example, the genotypes include heterozygous, wild-type and homozygous. Heterozygous characterizes an individual that carries different alleles on the homologous chromosomes, such as AB. Wild-type characterizes an individual that carries only the wild-type allele on the homologous chromosomes, such as BB, and homozygous characterizes an individual that carries only the mutant allele on the homologous chromosomes, such as AA.
As an alternative implementation, determining the genotypes corresponding to the SNP loci to be detected based on the first abundance threshold and the mutation abundance corresponding to each SNP locus to be detected comprises calculating a second abundance threshold based on the first abundance threshold, wherein the sum of the first abundance threshold and the second abundance threshold is one. For each SNP locus to be detected: if its mutation abundance is smaller than or equal to the first abundance threshold, its genotype is wild-type; if its mutation abundance is greater than or equal to the second abundance threshold, its genotype is homozygous; and if its mutation abundance is greater than the first abundance threshold and smaller than the second abundance threshold, its genotype is heterozygous.
Specifically, the second abundance threshold = 1 - the first abundance threshold, and the genotype of the SNP site to be detected satisfies the following relationship:
$$
\text{genotype} =
\begin{cases}
\text{wild-type}, & AF \le tBest \\
\text{heterozygous}, & tBest < AF < mBest \\
\text{homozygous}, & AF \ge mBest
\end{cases}
\qquad mBest = 1 - tBest
$$
where AF represents the mutation abundance, tBest represents the first abundance threshold, and mBest represents the second abundance threshold.
S220, constructing a Gaussian mixture model based on the minor allele frequency corresponding to the SNP locus to be detected meeting the genotype condition.
In this embodiment, the genotypic conditions include the genotype being wild-type and/or the genotype being homozygous.
Specifically, a SNP site typically contains only two alleles. For a set of SNP loci to be detected, assume that the alleles observed at 4 SNP loci are AAAA, BBBB, ABBB and AAAB, respectively. Calculating the mutation abundance requires the proportion of A or B in the total number of alleles; for example, if A is the mutant allele, the mutation abundances of the 4 SNP loci are 100%, 0%, 25% and 75%, respectively. For non-heterozygous SNP loci to be detected, only the degree of difference between the numbers of A and B matters, not which of A or B makes up the proportion, so the minor allele frequency meets the requirement well; the minor allele frequencies of the 4 SNP loci are 0%, 0%, 25% and 25%, respectively. By way of example, the mutation abundance is mapped with 50% as the center (folded about 50%), and the mapped mutation abundance equals the minor allele frequency.
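For a site with two alleles, the mapping about 50% described above amounts to taking the smaller of the mutation abundance and its complement. A minimal Python sketch, with the hypothetical function name `fold_to_maf`:

```python
def fold_to_maf(mutation_abundance: float) -> float:
    """Fold a mutation abundance about 50% to obtain the minor allele frequency."""
    if not 0.0 <= mutation_abundance <= 1.0:
        raise ValueError("mutation abundance must lie in [0, 1]")
    return min(mutation_abundance, 1.0 - mutation_abundance)


# Mutation abundances of 100%, 0%, 25% and 75%.
print([fold_to_maf(af) for af in (1.0, 0.0, 0.25, 0.75)])  # [0.0, 0.0, 0.25, 0.25]
```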
As an alternative embodiment, the gaussian mixture model satisfies the formula:
where maf represents the minor allele frequency, δ² represents the variance of the minor allele frequency, N represents the number of contamination sources, α represents the contamination ratio, pbinom represents the Bernoulli probability distribution, P(c=i) represents the probability distribution corresponding to i contamination sources, and N(·) represents the probability distribution of the Gaussian mixture model.
S230, performing an optimization solving operation on the Gaussian mixture model to obtain contamination status data corresponding to the sample to be detected.
By way of example, methods for the optimization solution include, but are not limited to, at least one of maximum likelihood estimation, the expectation-maximization algorithm, and the Markov chain Monte Carlo algorithm. Specifically, an optimization function of the R package stats (such as optim, which offers an L-BFGS-B method) can be used to perform the optimization solving operation on the Gaussian mixture model with the L-BFGS-B algorithm, thereby obtaining the contamination status data.
In this embodiment, the contamination status data includes at least one of a contamination ratio, a minor allele frequency variance, and a number of contamination sources.
According to the technical solution of this embodiment, at least 10 SNP loci to be detected in a sample to be detected are obtained and the genotype corresponding to each SNP locus to be detected is determined; a Gaussian mixture model is constructed based on the minor allele frequencies corresponding to the non-heterozygous SNP loci to be detected; and an optimization solving operation is performed on the Gaussian mixture model to obtain contamination status data corresponding to the sample to be detected. This solves the problems that existing sample contamination measurement methods are strongly affected by copy number variation and loss of heterozygosity and have difficulty detecting low-level contamination, and it improves the accuracy of sample contamination measurement results.
Example 3
Fig. 3 is a flowchart of a sample contamination detection method according to a third embodiment of the present invention, in which the technical feature of "obtaining the first abundance threshold" in the foregoing embodiment is further refined. As shown in Fig. 3, the method includes:
S310, obtaining at least 10 SNP loci to be detected in a sample to be detected and obtaining at least 3 abundance thresholds to be detected.
Specifically, an abundance threshold to be detected is a candidate screening threshold applied to the minor allele frequency. The spacing between successive abundance thresholds to be detected may be uniform or non-uniform. For example, a set of abundance thresholds to be detected th satisfies th ∈ {0%, 1%, 2%, ..., 10%, 30%}. The specific values of the abundance thresholds to be detected are not limited, and a user can set each of them according to actual requirements.
S320, aiming at each abundance threshold to be detected, calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequency corresponding to each SNP locus to be detected.
Calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequencies corresponding to the SNP loci to be detected comprises the following. For each chromosome, a segmentation operation is performed on the chromosome based on the copy numbers corresponding to at least two SNP loci to be detected on it, yielding one or more chromosome segments, wherein the copy number difference between any two SNP loci to be detected within a chromosome segment is smaller than a preset difference threshold. The SNP loci to be detected in the chromosome segment whose minor allele frequency is greater than the abundance threshold to be detected are taken as target SNP loci. The abundance fluctuation level of the sample to be detected is then calculated based on the minor allele frequencies corresponding to at least two target SNP loci.
The types of chromosomes include autosomes and/or sex chromosomes. Of the 23 pairs of human chromosomes, 22 pairs are autosomes and 1 pair is the sex chromosomes. The chromosomes in this example include all chromosomes that may carry heterozygous variations.
Specifically, the copy number characterizes the number of occurrences of a gene or gene sequence in the genome, that is, the number of times a gene fragment is repeated in the genome. Normally, the copy number is 2. When a copy number variation occurs, the genome is rearranged and the copy number of the gene changes from 2 to n, where n is a natural number not equal to 2.
Specifically, the coverage depth of each SNP locus to be detected is calculated by GATK software, and the copy number of the SNP locus is calculated based on the coverage depth. The coverage depth refers to the ratio of the total number of bases (bp) obtained by sequencing to the genome size. In this example, the coverage depth of the SNP loci to be detected is higher than 200X.
By way of example, a CBS (circular binary segmentation) algorithm is used to divide a chromosome into one or more chromosome segments based on copy number. The CBS algorithm may be implemented by, among others, the R package PSCBS. In this embodiment, the copy number difference between any two SNP loci to be detected in a chromosome segment is smaller than the preset difference threshold, which may be 2. For example, if the copy numbers of 4 SNP sites to be detected on a chromosome are 2, 3, 5 and 6, chromosome segment A contains the 2 SNP sites with copy numbers 2 and 3, and chromosome segment B contains the 2 SNP sites with copy numbers 5 and 6.
As an alternative implementation, calculating the abundance fluctuation level of the sample to be detected based on the minor allele frequencies corresponding to at least two target SNP loci comprises: sorting the target SNP loci by their positions on the chromosome; based on the sorting result, taking the difference between the minor allele frequency of the current target SNP locus and that of the next target SNP locus as an abundance difference; and averaging the at least one abundance difference to obtain the abundance fluctuation level of the sample to be detected.
In an exemplary embodiment, assume that 4 target SNP sites on a chromosome are, in order, target SNP site A, target SNP site B, target SNP site C and target SNP site D, with corresponding copy numbers 2, 6, 5 and 3. Chromosome segment A then contains, in order, target SNP sites A and D, and chromosome segment B contains, in order, target SNP sites B and C.
For example, if chromosome segment A contains 100 target SNP sites, 99 abundance differences are calculated, and these are averaged to obtain the abundance fluctuation level corresponding to chromosome segment A.
S330, taking an abundance threshold to be measured corresponding to the abundance fluctuation level which accords with the inflection point feature in the at least 3 abundance fluctuation levels as a first abundance threshold.
Specifically, an abundance fluctuation level curve is constructed with the abundance threshold to be detected as the abscissa and the abundance fluctuation level as the ordinate. As the abundance threshold to be detected increases, the abundance fluctuation level first increases and then tends to stabilize; the inflection point is where the curve passes from the rising phase to the stable phase.
S340, determining genotypes corresponding to the SNP loci to be detected respectively based on the first abundance threshold and mutation abundance corresponding to the SNP loci to be detected respectively.
S350, constructing a Gaussian mixture model based on the minor allele frequency corresponding to the SNP locus to be detected meeting the genotype condition.
S360, performing optimization solving operation on the Gaussian mixture model to obtain pollution state data corresponding to the sample to be detected.
The third embodiment of the present invention also provides result data obtained by measuring simulation data from 20 groups of human cancer tissue samples by using the sample contamination detection method described in the present embodiment.
Specifically, the 20 groups of human cancer tissue samples are mixed according to preset pollution ratios, so that the sample data are polluted and the samples to be detected are obtained. Table 2 below is a sample contamination specification list provided in example three of the present invention.
Table 3 below shows pollution ratio data obtained by the sample pollution detection method according to the third embodiment of the present invention.
| Pollution level | Pollution detection rate | Predicted pollution ratio |
| --- | --- | --- |
| 0.5% | 100% | 0.47% |
| 1% | 100% | 0.81% |
| 5% | 100% | 5.2% |
| 10% | 100% | 9.6% |
| 20% | 100% | 18% |
As can be seen from Table 3, the embodiments of the present invention are applicable to pollution from a single sample source and from multiple sample sources, can stably detect pollution at a pollution level as low as 0.5%, and predict the pollution level accurately.
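The agreement between the preset pollution levels and the predicted pollution ratios in Table 3 can be quantified as a relative deviation. The short sketch below only restates the Table 3 values; it assumes that each row is a single summary value per pollution level, and the function names are hypothetical.

```python
from typing import List, Tuple

# (preset pollution level in %, predicted pollution ratio in %) from Table 3
TABLE_3: List[Tuple[float, float]] = [
    (0.5, 0.47),
    (1.0, 0.81),
    (5.0, 5.2),
    (10.0, 9.6),
    (20.0, 18.0),
]


def relative_deviation(preset: float, predicted: float) -> float:
    """Signed deviation of the prediction, as a fraction of the preset level."""
    return (predicted - preset) / preset


def summarize(rows: List[Tuple[float, float]]) -> List[Tuple[float, float, float]]:
    """Attach the relative deviation to every (preset, predicted) row."""
    return [(preset, pred, relative_deviation(preset, pred)) for preset, pred in rows]


if __name__ == "__main__":
    for preset, pred, dev in summarize(TABLE_3):
        print(f"{preset:>5.1f}%  {pred:>5.2f}%  {dev:+.1%}")
```

For the five rows, the relative deviations are approximately -6%, -19%, +4%, -4%, and -10%.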
Table 4 below shows the pollution detection rate obtained by measuring different pollution levels by using different detection methods according to the third embodiment of the present invention.
Table 5 below shows pollution ratio data obtained by a conventional conPair (v 0.2) method according to example III of the present invention.
As can be seen from Tables 4 and 5, the existing ART-DeCo (v 1.1) method cannot identify contaminated samples with a low contamination level. The existing conPair (v 0.2) method can detect contaminated samples with a low contamination level, but for contaminated samples containing a large number of copy number variations it yields a contamination ratio significantly higher than the actual contamination ratio.
Therefore, the sample pollution detection method provided by the embodiment of the invention has higher detection sensitivity and higher accuracy of pollution level evaluation than similar software.
According to the technical solution of this embodiment, the abundance fluctuation level of the sample to be detected is determined based on the minor allele frequency corresponding to each SNP locus to be detected and the obtained abundance thresholds to be detected. The abundance threshold to be detected corresponding to the abundance fluctuation level that conforms to the inflection point feature among the at least 3 abundance fluctuation levels is taken as the first abundance threshold, so that the first abundance threshold is determined through the constructed dynamic abundance threshold model. The genotype corresponding to each SNP locus to be detected is then determined according to the first abundance threshold and the mutation abundance corresponding to that locus. This solves the problem that conventional genotype detection methods depend on paired sequencing data, improves the accuracy of the genotype detection result, and further improves the accuracy of the sample pollution detection result.
Example 4
Fig. 4 is a schematic structural diagram of a genotype detection device according to a fourth embodiment of the present invention. As shown in Fig. 4, the device comprises a to-be-detected SNP locus acquisition module 410, an abundance fluctuation level determination module 420, a first abundance threshold determination module 430, and a genotype determination module 440.
The to-be-detected SNP locus acquisition module 410 is configured to acquire at least 10 SNP loci to be detected in a sample to be detected and acquire at least 3 abundance thresholds to be detected;
the abundance fluctuation level determination module 420 is configured to calculate, for each abundance threshold to be detected, an abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequencies respectively corresponding to the SNP loci to be detected;
the first abundance threshold determination module 430 is configured to take, as a first abundance threshold, the abundance threshold to be detected corresponding to the abundance fluctuation level that conforms to the inflection point feature among the at least 3 abundance fluctuation levels;
the genotype determination module 440 is configured to determine the genotypes respectively corresponding to the SNP loci to be detected based on the first abundance threshold and the mutation abundances respectively corresponding to the SNP loci to be detected.
According to the technical solution of this embodiment, the abundance fluctuation level of the sample to be detected is determined based on the minor allele frequency corresponding to each SNP locus to be detected and the obtained abundance thresholds to be detected. The abundance threshold to be detected corresponding to the abundance fluctuation level that conforms to the inflection point feature among the at least 3 abundance fluctuation levels is taken as the first abundance threshold, so that the first abundance threshold is determined through the constructed dynamic abundance threshold model. The genotype corresponding to each SNP locus to be detected is then determined according to the first abundance threshold and the mutation abundance corresponding to that locus. This solves the problem that conventional genotype detection methods depend on paired sequencing data and improves the accuracy of the genotype detection result.
Based on the above embodiments, optionally, the abundance fluctuation level determination module 420 includes:
a chromosome segment determining unit, configured to perform, for each chromosome on which the SNP loci to be detected are located, a segmentation operation on the chromosome based on the copy numbers corresponding to at least two SNP loci to be detected on the chromosome, to obtain one or more chromosome segments, wherein the copy number difference between any two SNP loci to be detected in each chromosome segment is smaller than a preset difference threshold;
a target SNP locus determining unit, configured to take, for each chromosome segment, the SNP loci to be detected in the chromosome segment whose minor allele frequency is greater than the abundance threshold to be detected as target SNP loci;
an abundance fluctuation level unit, configured to calculate the abundance fluctuation level of the sample to be detected based on the minor allele frequencies respectively corresponding to at least two target SNP loci.
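The segmentation and target locus selection performed by these units can be sketched as follows, using the four loci with copy numbers 2, 6, 5, and 3 from the exemplary embodiment described earlier. The greedy grouping strategy, the preset difference threshold of 2, the positions, and the minor allele frequencies are all assumptions chosen for illustration; the description only requires that any two loci in a segment differ in copy number by less than the preset threshold.

```python
from typing import List, NamedTuple


class Snp(NamedTuple):
    name: str
    position: int
    copy_number: float
    maf: float  # minor allele frequency


def segment_by_copy_number(snps: List[Snp], max_diff: float) -> List[List[Snp]]:
    """Group loci so that any two loci in a segment differ by less than max_diff.

    Loci are visited in positional order and placed into the first existing
    segment that accepts them; otherwise a new segment is opened. This greedy
    rule is an assumption of the sketch.
    """
    segments: List[List[Snp]] = []
    for snp in sorted(snps, key=lambda s: s.position):
        for segment in segments:
            if all(abs(snp.copy_number - other.copy_number) < max_diff
                   for other in segment):
                segment.append(snp)
                break
        else:
            segments.append([snp])
    return segments


def target_snps(segment: List[Snp], abundance_threshold: float) -> List[Snp]:
    """Keep loci whose minor allele frequency exceeds the tested threshold."""
    return [s for s in segment if s.maf > abundance_threshold]


if __name__ == "__main__":
    loci = [
        Snp("A", 100, 2, 0.02),
        Snp("B", 200, 6, 0.20),
        Snp("C", 300, 5, 0.25),
        Snp("D", 400, 3, 0.30),
    ]
    for seg in segment_by_copy_number(loci, max_diff=2):
        print([s.name for s in seg],
              [s.name for s in target_snps(seg, abundance_threshold=0.05)])
```

Under these assumed inputs the grouping reproduces the two segments of the exemplary embodiment: loci A and D in one segment, and loci B and C in the other.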
On the basis of the above embodiment, optionally, the abundance fluctuation level unit is specifically configured to:
sorting the target SNP loci based on the locus position of each target SNP locus in the chromosome;
based on the sorting result, taking the difference between the minor allele frequency of the current target SNP locus and the minor allele frequency of the next target SNP locus as an abundance difference value;
and averaging the at least one abundance difference value to obtain the abundance fluctuation level of the sample to be detected.
Based on the above embodiments, optionally, the genotype determination module 440 is specifically configured to:
calculating a second abundance threshold based on the first abundance threshold, wherein the sum of the first abundance threshold and the second abundance threshold is one;
for each SNP locus to be detected, if the mutation abundance corresponding to the SNP locus to be detected is smaller than or equal to the first abundance threshold, the genotype of the SNP locus to be detected is wild type;
if the mutation abundance corresponding to the SNP locus to be detected is greater than or equal to the second abundance threshold, the genotype of the SNP locus to be detected is homozygous;
and if the mutation abundance corresponding to the SNP locus to be detected is greater than the first abundance threshold and smaller than the second abundance threshold, the genotype of the SNP locus to be detected is heterozygous.
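The three-way genotype decision above can be written compactly as in the following sketch. The function name is hypothetical, and the check that the first abundance threshold lies strictly between 0 and 0.5 is an added assumption so that the wild-type and homozygous ranges cannot overlap.

```python
def classify_genotype(mutation_abundance: float, first_threshold: float) -> str:
    """Classify one SNP locus from its mutation abundance.

    The second abundance threshold is derived so that the two thresholds
    sum to one, as described above.
    """
    if not 0.0 < first_threshold < 0.5:
        # Assumption of this sketch, not stated in the description.
        raise ValueError("first_threshold must lie strictly between 0 and 0.5")
    second_threshold = 1.0 - first_threshold
    if mutation_abundance <= first_threshold:
        return "wild type"
    if mutation_abundance >= second_threshold:
        return "homozygous"
    return "heterozygous"


if __name__ == "__main__":
    for abundance in (0.02, 0.48, 0.97):
        print(abundance, classify_genotype(abundance, first_threshold=0.05))
```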
Based on the above embodiments, optionally, the to-be-detected SNP locus acquisition module 410 is specifically configured to:
take at least 10 SNP loci in the sample to be detected whose population frequency falls within a preset population frequency range as the SNP loci to be detected, wherein the preset population frequency range includes 0.4 to 0.6.
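A minimal sketch of this selection step is given below. It assumes that the population frequencies are available as a mapping from a locus identifier to a frequency, that the range bounds are inclusive, and that fewer than 10 qualifying loci is treated as an error; none of these details is fixed by the description, and the identifiers are hypothetical.

```python
from typing import Dict, List


def select_snps_to_detect(
    population_freq: Dict[str, float],
    low: float = 0.4,
    high: float = 0.6,
    minimum: int = 10,
) -> List[str]:
    """Return the loci whose population frequency lies within [low, high]."""
    selected = [locus for locus, freq in population_freq.items()
                if low <= freq <= high]
    if len(selected) < minimum:
        raise ValueError(
            f"only {len(selected)} loci fall within [{low}, {high}]; "
            f"at least {minimum} are required"
        )
    return selected


if __name__ == "__main__":
    # Hypothetical locus identifiers and frequencies.
    freqs = {f"snp{i}": 0.5 for i in range(10)}
    freqs.update({"snp_low": 0.1, "snp_high": 0.9})
    print(select_snps_to_detect(freqs))
```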
The genotype detection device provided by the embodiment of the invention can execute the genotype detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example 5
Fig. 5 is a schematic structural diagram of a sample contamination detection apparatus according to a fifth embodiment of the present invention. As shown in Fig. 5, the apparatus includes a genotype determination module 510, a Gaussian mixture model construction module 520, and a pollution state data determination module 530.
The genotype determination module 510 is configured to obtain at least 10 SNP loci to be detected in a sample to be detected, and determine the genotypes respectively corresponding to the SNP loci to be detected;
the Gaussian mixture model construction module 520 is configured to construct a Gaussian mixture model based on the minor allele frequencies corresponding to the SNP loci to be detected that meet a genotype condition, wherein the genotype condition includes that the genotype is wild type and/or the genotype is homozygous;
the pollution state data determination module 530 is configured to perform an optimization solving operation on the Gaussian mixture model to obtain pollution state data corresponding to the sample to be detected, wherein the pollution state data includes at least one of a pollution ratio, a minor allele frequency variance, and a number of pollution sources.
According to the technical solution of this embodiment, at least 10 SNP loci to be detected in a sample to be detected are obtained and the genotype corresponding to each SNP locus to be detected is determined; a Gaussian mixture model is constructed based on the minor allele frequencies corresponding to the SNP loci to be detected that are not heterozygous; and an optimization solving operation is performed on the Gaussian mixture model to obtain the pollution state data corresponding to the sample to be detected. This solves the problems that existing sample pollution measurement methods are strongly affected by copy number variation and loss of heterozygosity and have difficulty detecting low-level pollution, and improves the accuracy of the sample pollution measurement result.
On the basis of the above embodiment, optionally, the Gaussian mixture model satisfies the formula:
[Formula not reproduced in this text]
where maf represents the minor allele frequency, δ² represents the variance of the minor allele frequency, n represents the number of contamination sources, α represents the contamination ratio, pbinom represents the Bernoulli probability distribution, P(c=i) represents the probability distribution corresponding to i contamination sources, and N represents the probability distribution of the Gaussian mixture model.
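Because the formula itself is not available in this text, the following sketch is not the model of this embodiment. It is a simplified illustration of how a mixture of Gaussian components weighted by binomial probabilities can be fitted to minor allele frequencies by maximum likelihood. It assumes a single contamination source, a contaminant that carries k = 0, 1, or 2 copies of the non-host allele with binomial probability at population frequency p = 0.5, an expected minor allele frequency of α·k/2, and a grid search as the optimization step. All function names are hypothetical.

```python
import math
from typing import Iterable, List, Tuple


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def log_likelihood(mafs: List[float], alpha: float, sigma: float,
                   p: float = 0.5) -> float:
    """Log-likelihood of the observed mafs under the assumed mixture."""
    total = 0.0
    for maf in mafs:
        density = sum(
            binom_pmf(k, 2, p) * normal_pdf(maf, alpha * k / 2.0, sigma)
            for k in range(3)
        )
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total


def fit_contamination(mafs: List[float], alphas: Iterable[float],
                      sigmas: Iterable[float]) -> Tuple[float, float]:
    """Grid search for the (alpha, sigma) pair with the highest likelihood."""
    sigma_grid = list(sigmas)
    best = max((log_likelihood(mafs, a, s), a, s)
               for a in alphas for s in sigma_grid)
    return best[1], best[2]


if __name__ == "__main__":
    # Hypothetical non-heterozygous loci simulated at a 10% contamination ratio.
    observed = [0.0] * 25 + [0.05] * 50 + [0.10] * 25
    print(fit_contamination(observed,
                            alphas=[0.0, 0.02, 0.05, 0.10, 0.20],
                            sigmas=[0.005, 0.01, 0.02]))
```

A grid search is used only to keep the sketch dependency-free; a numerical optimizer or an expectation-maximization procedure could serve as the optimization solving operation instead, and extending the mixture over several contamination sources would require the P(c=i) term mentioned above.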
Based on the above embodiment, optionally, the genotype determination module 510 includes:
The first abundance threshold acquisition unit is used for acquiring a first abundance threshold and determining mutation abundance corresponding to each SNP locus to be detected;
The genotype determining unit is used for determining genotypes corresponding to the SNP loci to be detected respectively based on the first abundance threshold and mutation abundance corresponding to the SNP loci to be detected respectively.
On the basis of the above embodiments, optionally, the genotype determining unit is specifically configured to:
calculating a second abundance threshold based on the first abundance threshold, wherein the sum of the first abundance threshold and the second abundance threshold is one;
for each SNP locus to be detected, if the mutation abundance corresponding to the SNP locus to be detected is smaller than or equal to the first abundance threshold, the genotype of the SNP locus to be detected is wild type;
if the mutation abundance corresponding to the SNP locus to be detected is greater than or equal to the second abundance threshold, the genotype of the SNP locus to be detected is homozygous;
and if the mutation abundance corresponding to the SNP locus to be detected is greater than the first abundance threshold and smaller than the second abundance threshold, the genotype of the SNP locus to be detected is heterozygous.
On the basis of the above embodiment, optionally, the first abundance threshold acquiring unit includes:
an abundance threshold obtaining subunit, configured to obtain at least 3 abundance thresholds to be detected;
an abundance fluctuation level determining subunit, configured to calculate, for each abundance threshold to be detected, the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequencies respectively corresponding to the SNP loci to be detected;
a first abundance threshold determining subunit, configured to take, as a first abundance threshold, the abundance threshold to be detected corresponding to the abundance fluctuation level that conforms to the inflection point feature among the at least 3 abundance fluctuation levels.
Based on the above embodiments, optionally, the abundance fluctuation level determining subunit is specifically configured to:
For each chromosome where each SNP locus to be detected is located, based on the copy numbers corresponding to at least two SNP loci to be detected on the chromosome, performing segmentation operation on the chromosome to obtain one or more chromosome sections, wherein the copy number difference value corresponding to any two SNP loci to be detected in each chromosome section is smaller than a preset difference value threshold;
aiming at each chromosome segment, taking a SNP locus to be detected, of which the minor allele frequency is greater than the abundance threshold to be detected, in the chromosome segment as a target SNP locus;
and calculating the abundance fluctuation level of the sample to be detected based on the minor allele frequencies respectively corresponding to the at least two target SNP loci.
Based on the above embodiments, optionally, the abundance fluctuation level determining subunit is specifically configured to:
sorting the target SNP loci based on the locus position of each target SNP locus in the chromosome;
based on the sorting result, taking the difference between the minor allele frequency of the current target SNP locus and the minor allele frequency of the next target SNP locus as an abundance difference value;
and averaging the at least one abundance difference value to obtain the abundance fluctuation level of the sample to be detected.
Based on the above embodiment, optionally, the genotype determination module 510 includes:
an SNP locus determination unit, configured to take at least 10 SNP loci in the sample to be detected whose population frequency falls within a preset population frequency range as the SNP loci to be detected, wherein the preset population frequency range includes 0.4 to 0.6.
The sample pollution detection device provided by the embodiment of the invention can execute the sample pollution detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example 6
Fig. 6 is a schematic structural diagram of an electronic device according to a sixth embodiment of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 6, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including an input unit 16, such as a keyboard, mouse, etc., an output unit 17, such as various types of displays, speakers, etc., a storage unit 18, such as a magnetic disk, optical disk, etc., and a communication unit 19, such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the respective methods and processes described above, such as a genotype detection method or a sample contamination detection method.
In some embodiments, the genotype detection method or sample contamination detection method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the genotype detection method or sample contamination detection method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the genotype detection method or the sample contamination detection method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SoCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
The computer program for implementing the genotyping method or the sample contamination detection method of the invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
Example 7
The seventh embodiment of the present invention also provides a computer-readable storage medium storing computer instructions for causing a processor to execute a genotype detection method, the method comprising:
acquiring at least 10 SNP loci to be detected in a sample to be detected and acquiring at least 3 abundance thresholds to be detected;
for each abundance threshold to be detected, calculating the abundance fluctuation level of the sample to be detected based on the abundance threshold to be detected and the minor allele frequencies respectively corresponding to the SNP loci to be detected;
taking, as a first abundance threshold, the abundance threshold to be detected corresponding to the abundance fluctuation level that conforms to the inflection point feature among the at least 3 abundance fluctuation levels;
and determining the genotypes respectively corresponding to the SNP loci to be detected based on the first abundance threshold and the mutation abundances respectively corresponding to the SNP loci to be detected.
Alternatively, the computer instructions are used for causing a processor to execute a sample contamination detection method, the method comprising:
obtaining at least 10 SNP loci to be detected in a sample to be detected, and determining the genotypes respectively corresponding to the SNP loci to be detected;
constructing a Gaussian mixture model based on the minor allele frequencies corresponding to the SNP loci to be detected that meet a genotype condition, wherein the genotype condition includes that the genotype is wild type and/or the genotype is homozygous;
and performing an optimization solving operation on the Gaussian mixture model to obtain pollution state data corresponding to the sample to be detected, wherein the pollution state data includes at least one of a pollution ratio, a minor allele frequency variance, and a number of pollution sources.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), a blockchain network, and the Internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of the flows shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.