CN118748040B - A hybrid DNA probabilistic typing method for MH sequencing data - Google Patents
- Publication number
- CN118748040B (application CN202410881784.7A)
- Authority
- CN
- China
- Prior art keywords
- allele
- sequence
- hypothesis
- locus
- genotype
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Biotechnology (AREA)
- Evolutionary Computation (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biophysics (AREA)
- Pure & Applied Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Chemical & Material Sciences (AREA)
- Bioethics (AREA)
- Algebra (AREA)
- Probability & Statistics with Applications (AREA)
- Operations Research (AREA)
- Analytical Chemistry (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses a mixed DNA probabilistic genotyping method for MH sequencing data. The method comprises the following steps: 1, estimating prior parameters from MH sequencing data of single-source DNA; 2, obtaining characteristic information of the MH sequencing data of the mixed DNA sample to be tested; 3, filtering noise sequences according to the sequencing information; 4, constructing likelihood functions under the prosecution hypothesis and the defense hypothesis from the prior parameters and the characteristic information, and calculating the likelihood ratio of the prosecution hypothesis to the defense hypothesis; and 5, ranking the possible genotype combinations by likelihood function value.
Description
Technical Field
The invention relates to the technical field of forensic DNA analysis, and in particular to a mixed DNA probabilistic genotyping method for MH sequencing data.
Background
Mixed DNA is a DNA sample formed by mixing the DNA of several individuals. It is a common sample type in forensic DNA analysis and the most difficult to analyze. Each individual contributing to the mixed DNA is referred to as a contributor. Genotyping a genetic marker in mixed DNA yields a mixed genotype, i.e., the alleles of several individuals appear together. Mixed DNA analysis has two main purposes: 1) to separate the possible individual genotypes of each contributor from the mixed genotyping result, a process referred to as "deconvolution"; and 2) to evaluate the mixed genotyping result as forensic scientific evidence using biostatistical methods and to quantify how strongly it supports the facts of the case, i.e., "evidence strength assessment".
The microhaplotype (MH) is a novel genetic marker based on DNA sequence polymorphisms that has attracted great interest in the forensic field. It is usually detected and genotyped on a massively parallel sequencing (MPS) platform. MH markers are highly polymorphic and sensitive, have a low mutation rate and short fragment length, and produce no stutter products. These features make MH suitable for forensic analysis of complex biological samples, especially genotyping of mixed DNA, and offer a new solution for forensic DNA analysis of biological samples with multiple contributors.
In recent years, a growing body of research has focused on genotyping forensic mixed DNA samples using MH genetic markers. These studies demonstrate the potential of MH in forensic applications and provide a theoretical basis for its use in casework. However, because of the complexity of the next-generation sequencing workflow, of multiplex targeted amplification of MH markers, and of the resulting sequencing data structure, analytical methods for MH next-generation sequencing data from mixed DNA samples are currently lacking.
Existing MH analysis relies mainly on the analyst's observation and subjective judgment of the sequencing result. The analysis of mixed MH sequencing data is therefore largely qualitative, and its reliability and accuracy are poor.
Disclosure of Invention
The invention provides a mixed DNA probabilistic genotyping method for MH sequencing data, aimed at the problems in the prior art.
The technical scheme adopted by the invention is a mixed DNA probabilistic genotyping method for MH sequencing data, comprising the following steps:
Step 1, constructing a training set from MH sequencing data of single-source DNA, and estimating prior parameters from the training set;
Step 2, acquiring characteristic information of the MH sequencing data of the mixed DNA sample to be tested, wherein the characteristic information comprises MH typing and quantitative information of the mixed DNA, target individual typing information, reference individual typing information and related parameters;
Step 3, filtering noise sequences according to the sequencing information;
Step 4, constructing likelihood functions under the prosecution hypothesis and the defense hypothesis from the prior parameters obtained in step 1 and the characteristic information obtained in step 2, and calculating the likelihood ratio of the prosecution hypothesis to the defense hypothesis from the likelihood functions;
The prosecution hypothesis is that the target individual and n unknown individuals are the contributors to the mixed DNA sample, or that the target individual, the reference individual and n-1 unknown individuals are the contributors to the mixed DNA sample;
The defense hypothesis is that n+1 unknown individuals are the contributors to the mixed DNA sample, or that the reference individual and n unknown individuals are the contributors to the mixed DNA sample;
The target individual is an individual suspected of being one of the contributors to the mixed DNA sample;
The reference individual is an individual that has been determined to be one of the contributors to the mixed DNA sample;
Step 5, calculating the likelihood function value of every genotype combination at each locus, and ranking the genotype combinations at each locus by that value.
Further, the prior parameters in step 1 comprise the locus-specific detection efficiency parameters of the MH genetic markers, the analysis threshold, the allele dropout rate, the allele population frequencies and the noise distribution parameters.
Further, the MH sequencing data of the mixed DNA in step 2 comprise the MH loci, the total sequencing read count of each MH locus, the base information of each MH sequence and the sequencing read count of each MH sequence;
The target individual typing information comprises the MH allele base information of the target individual at each MH locus;
The reference individual typing information comprises the MH allele base information of the reference individual at each MH locus;
The related parameters comprise the prior parameters and the number of contributors.
Further, the noise sequence filtering process in step 3 is as follows:
Calculating the signal ratio r of each base sequence from the MH sequencing data of the mixed DNA sample to be tested:
$$r = \frac{d}{D}$$
wherein d is the sequencing read count of the sequence and D is the total sequencing read count of its locus;
Filtering the MH sequencing data of the mixed DNA sample to be tested according to the analysis threshold, removing every sequence whose signal ratio does not exceed the threshold.
Further, the likelihood function in step 4 is constructed as follows:
S1, obtaining all possible genotype combinations at each locus;
S2, calculating the genotype frequency of each contributor from the allele population frequencies;
S3, obtaining the corresponding weights from the genotype combinations and the MH sequence read counts;
S4, calculating the likelihood function $Lik_l$ at a locus l:
$$Lik_l = \sum_i P(G_i \mid H)\, D_{G_i l}$$
wherein i is the index of the genotype combination, $G_i$ is the i-th genotype combination, $P(G_i \mid H)$ is the corresponding frequency of $G_i$ in the population, H is the prosecution hypothesis or the defense hypothesis, and $D_{G_i l}$ is the weight of the i-th genotype combination $G_i$ at locus l;
S5, calculating the joint likelihood function Lik over all loci of the sample to be tested:
$$Lik = \prod_{l=1}^{L} Lik_l$$
wherein l is the locus index and L is the total number of loci.
Further, the genotype frequency of each contributor in step S2 is calculated as follows:
The genotype frequency of each contributor is calculated from the allele population frequencies under Hardy-Weinberg equilibrium.
Further, the weights in step S3 are obtained from the genotype combination and the MH sequence read counts as follows:
Obtaining the sequencing read counts $O_c$ and the signal ratios $O_r$ of the set O of all sequences at a specified locus l;
Under a specified genotype combination $G_i$, calculating the effective allele number A of each sequence in the set O, wherein the effective allele number of a sequence o is $A_o$:
$$A_o = \sum_{x=1}^{X} B_{ox} M_x$$
wherein $B_{ox}$ is the number of copies of sequence o contained in the genotype of the x-th contributor, $M_x$ is the mixing ratio parameter of the x-th contributor, x is the contributor index, and X is the number of contributors;
Classifying all sequences according to the effective allele number $A_o$: a sequence is an allele sequence if $A_o$ is not 0 and a noise sequence if $A_o$ is 0, and sequences contained in the genotype combination but absent from the set O are dropped-out allele sequences;
Calculating the weights of the allele sequences, the noise sequences and the dropped-out allele sequences respectively, and multiplying them to obtain the weight of the genotype combination.
Further, the likelihood ratio of the prosecution hypothesis to the defense hypothesis in step 4 is calculated from the likelihood functions as follows:
Estimating the parameters of the likelihood functions under the prosecution hypothesis and under the defense hypothesis separately;
Evaluating the likelihood function under the prosecution hypothesis and under the defense hypothesis at the estimated parameter values, and taking the maximum value of the likelihood function under each hypothesis;
Obtaining the likelihood ratio of the prosecution hypothesis to the defense hypothesis as the ratio of the two maxima.
Further, the estimated parameters comprise the normal distribution mean parameter, the normal distribution variability parameter, and the mixing ratio parameter of each contributor.
Further, the weight $D_o$ of the dropped-out allele sequences is:
$$D_o = P(\mathrm{Dropout})^{z}$$
wherein P(Dropout) is the allele dropout rate and z is the number of allele dropouts in the genotype combination;
The weight $D_{no}$ of the noise sequences is:
$$D_{no} = \prod_{k=1}^{K} \mathrm{beta}(r_k;\ \tau_1, \tau_2)$$
wherein K is the number of noise sequences, $r_k$ is the signal ratio of the k-th noise sequence at the locus, $\tau_1$ and $\tau_2$ are the noise distribution parameters, and beta denotes the probability density function of the beta distribution;
The weight $D_a$ of the allele sequences is:
$$D_a = \prod_{n=1}^{N} \mathrm{norm}(y_n;\ A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu)$$
wherein the sequencing read count $y_n$ of the n-th allele at the locus follows a normal distribution,
$$y_n \sim \mathrm{norm}(A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu);$$
$\alpha_l$ is the locus-specific detection efficiency parameter of locus l, $\mu$ is the normal distribution mean parameter, CV is the normal distribution variability parameter, n is the allele index, N is the number of alleles at the locus, $A_n$ is the effective allele number of the n-th allele, and norm denotes the probability density function of the normal distribution.
The beneficial effects of the invention are as follows:
(1) The analysis method provided by the invention analyzes mixed DNA MH sequencing data more objectively and accurately, reduces uncertainty and subjective error in the analysis, and significantly improves the accuracy and scientific rigor of MH analysis of mixed DNA;
(2) The invention improves analysis efficiency and allows the analysis to be automated, so that complex data are processed efficiently without manual interpretation;
(3) The invention extracts more sample information from complex mixed MH sequencing data, mining and interpreting information hidden in the sequencing data;
(4) The analysis method of the invention promotes the practical application of MH genetic markers in forensic DNA analysis and lays a foundation for their future development.
Drawings
FIG. 1 is a schematic flow chart of the method of the invention.
Detailed Description
The invention will be further described with reference to the drawings and specific examples.
As shown in FIG. 1, a mixed DNA probabilistic genotyping method for MH sequencing data comprises the following steps:
Step 1, constructing a training set from MH sequencing data of single-source DNA, and estimating prior parameters from the training set;
The training set consists of MH sequencing data of single-source DNA samples detected under the same conditions, where the conditions include the MH detection reagent, the MPS library preparation workflow, the MPS sequencing procedure, the sequencing data analysis workflow and the like. The training set is used to estimate the prior parameters, which comprise the locus-specific detection efficiency parameters of the MH genetic markers, the analysis threshold, the allele dropout rate, the allele population frequencies and the sequencing noise distribution parameters.
The estimated prior parameter value can be used for analyzing all unknown samples detected under the same condition, and the prior parameter is not required to be estimated again for each analysis on the premise of not significantly changing the detection condition.
Step 2, acquiring characteristic information of the MH sequencing data of the sample to be tested, wherein the characteristic information comprises MH typing and quantitative information of the mixed DNA, target individual typing information, reference individual typing information and related parameters;
The target individual is an individual suspected of being one of the contributors to the mixed DNA sample;
The reference individual is an individual that has been determined to be one of the contributors to the mixed DNA sample.
The MH sequencing data of the mixed DNA comprise the MH loci, the total sequencing read count of each MH locus, the base information of each MH sequence and the sequencing read count of each MH sequence;
The target individual typing information comprises the MH allele base information of the target individual at each MH locus;
The reference individual typing information comprises the MH allele base information of the reference individual at each MH locus;
The related parameters comprise the prior parameters and the number of contributors.
Step 3, filtering a noise sequence according to the sequencing information;
The noise sequence filtering process is as follows:
Calculating the signal ratio r of each base sequence from the MH sequencing data of the mixed DNA sample to be tested:
$$r = \frac{d}{D}$$
wherein d is the sequencing read count of the sequence and D is the total sequencing read count of its locus.
The base sequences in the MH sequencing data of the mixed DNA sample to be tested are then filtered by signal ratio: only base sequences with r greater than the analysis threshold are retained, and all others are discarded.
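The filtering rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `filter_noise`, the dictionary layout and the read counts are all hypothetical.

```python
def filter_noise(reads_by_sequence, analysis_threshold):
    """Keep sequences whose signal ratio r = d / D exceeds the threshold.

    reads_by_sequence maps each base sequence at ONE locus to its read count d.
    Returns {sequence: (d, r)} for the retained sequences.
    """
    total = sum(reads_by_sequence.values())  # D: total reads at the locus
    if total == 0:
        return {}
    kept = {}
    for seq, d in reads_by_sequence.items():
        r = d / total
        if r > analysis_threshold:  # strictly greater, as in the text
            kept[seq] = (d, r)
    return kept


# Hypothetical read counts for a single locus (illustrative values only).
locus_reads = {"ACGT": 600, "ACGA": 380, "TCGA": 15, "ACTT": 5}
kept = filter_noise(locus_reads, analysis_threshold=0.01)
print(sorted(kept))
```

Sequences that survive this filter are not necessarily alleles; the later weighting step still decides, per genotype combination, whether each one is treated as an allele or as noise.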
Step 4, constructing likelihood functions under the prosecution hypothesis and the defense hypothesis from the prior parameters obtained in step 1 and the characteristic information obtained in step 2, and calculating the likelihood ratio of the prosecution hypothesis to the defense hypothesis from the likelihood functions, the likelihood ratio serving as the strength of evidence of the mixed DNA sample to be tested.
The prosecution hypothesis is that the target individual and n unknown individuals are the contributors to the mixed DNA sample, or that the target individual, the reference individual and n-1 unknown individuals are the contributors to the mixed DNA sample;
The defense hypothesis is that n+1 unknown individuals are the contributors to the mixed DNA sample, or that the reference individual and n unknown individuals are the contributors to the mixed DNA sample;
The likelihood function is constructed as follows:
S1, obtaining all possible genotype combinations at each locus;
All possible genotype combinations G at each locus are enumerated separately under each hypothesis, based on the prosecution and defense hypotheses, the MH sequencing data of the mixed DNA sample to be tested, and the typing information of the target individual and the reference individual.
First, all base sequences O that passed filtering at a given locus l are extracted and combined with one allele dropout placeholder Q to form the candidate allele set A = {O, Q}. For each unknown contributor, two candidate alleles are drawn from the set A with replacement to form a genotype. The genotypes of the unknown contributors are then combined by Cartesian product, so that different contributors may carry the same genotype. If the target individual or the reference individual is present under the hypothesis, the genotype combinations constructed for the unknown individuals are combined with the known genotype of the target or reference individual at the locus to give all possible genotype combinations G at the locus.
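The enumeration described above maps naturally onto `itertools`. The sketch below is illustrative only and assumes that a genotype is an unordered pair of alleles (so `("a", "b")` and `("b", "a")` are the same genotype) and that contributors are distinguishable; the function name and the placeholder symbol `"Q"` as a Python string are assumptions.

```python
from itertools import combinations_with_replacement, product

DROPOUT = "Q"  # placeholder for a dropped-out allele


def enumerate_genotype_combinations(observed, n_unknown, known_genotypes=()):
    """List all genotype combinations at one locus.

    observed: sequences that passed the noise filter (the set O).
    n_unknown: number of unknown contributors under the hypothesis.
    known_genotypes: genotypes of the target/reference individuals, if any.
    """
    candidates = sorted(observed) + [DROPOUT]  # candidate allele set A
    # Two alleles drawn with replacement, order ignored -> one genotype.
    genotypes = list(combinations_with_replacement(candidates, 2))
    combos = []
    # Cartesian product over unknown contributors (genotypes may repeat).
    for unknowns in product(genotypes, repeat=n_unknown):
        combos.append(tuple(known_genotypes) + unknowns)
    return combos


# Two observed sequences, two unknown contributors (defense-style hypothesis).
hd = enumerate_genotype_combinations({"a", "b"}, n_unknown=2)
# Same locus, one known target genotype plus one unknown (prosecution-style).
hp = enumerate_genotype_combinations({"a", "b"}, 1, known_genotypes=[("a", "b")])
print(len(hd), len(hp))
```

With k observed sequences each unknown contributor has (k+1)(k+2)/2 candidate genotypes, so the number of combinations grows quickly with the number of unknown contributors.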
S2, calculating genotype frequencies of each contributor according to allele population frequency information;
The genotype frequency of each contributor is calculated from the allele population frequencies under Hardy-Weinberg equilibrium. When the contributor is heterozygous, such as AB, the genotype frequency is $2 p_A p_B$, where $p_A$ and $p_B$ are the population frequencies of alleles A and B, respectively. When the contributor is homozygous, such as AA, the genotype frequency is $p_A^2$. The frequency $P(G_i \mid H)$ of the i-th genotype combination $G_i$ in the population at that locus is then the product of the genotype frequencies of all contributors at that locus.
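As a small worked sketch, the code below uses the standard Hardy-Weinberg proportions ($p^2$ for a homozygote, $2pq$ for a heterozygote) and multiplies them across contributors. The function names and frequencies are hypothetical, and the sketch assumes every allele in the genotype has a listed population frequency; how the dropout placeholder Q is treated is not specified in the text and is not modeled here.

```python
def genotype_frequency(genotype, allele_freq):
    """Hardy-Weinberg genotype frequency for a pair of alleles."""
    a, b = genotype
    if a == b:
        return allele_freq[a] ** 2          # homozygote: p^2
    return 2 * allele_freq[a] * allele_freq[b]  # heterozygote: 2pq


def combination_frequency(combination, allele_freq):
    """P(G_i | H): product of the contributors' genotype frequencies."""
    freq = 1.0
    for genotype in combination:
        freq *= genotype_frequency(genotype, allele_freq)
    return freq


# Hypothetical allele frequencies at one locus.
allele_freq = {"A": 0.5, "B": 0.25}
p_combo = combination_frequency([("A", "A"), ("A", "B")], allele_freq)
print(p_combo)
```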
S3, obtaining corresponding weights according to genotype combinations and MH sequence sequencing readings;
For every possible genotype combination at each locus, the corresponding probability or probability density of the observed MH allele sequence read counts is calculated; this quantity is called the weight.
First, the sequencing read counts $O_c$ and signal ratios $O_r$ of the set O of all sequences at the specified locus l are obtained.
Under a specified genotype combination $G_i$, the effective allele number A of each sequence in the set O is calculated, wherein the effective allele number of a sequence o is $A_o$:
$$A_o = \sum_{x=1}^{X} B_{ox} M_x$$
wherein $B_{ox}$ is the number of copies of sequence o contained in the genotype of the x-th contributor, $M_x$ is the mixing ratio parameter of the x-th contributor, x is the contributor index, and X is the number of contributors. Each sequence o in the set O is compared in turn with the genotype of each contributor in the genotype combination to obtain that contributor's contribution to o, and the contributions of all contributors are summed to give the effective allele number of o.
All sequences in O are classified according to the effective allele number $A_o$: a sequence is an allele sequence if $A_o$ is not 0 and a noise sequence if $A_o$ is 0. Sequences contained in the genotype combination but absent from the set O are dropped-out allele sequences.
The weights of the allele sequences, the noise sequences and the dropped-out allele sequences are calculated separately and multiplied to obtain the weight of the genotype combination.
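The effective allele number and the three-way classification can be sketched as follows. This is an illustration under assumptions: genotypes are tuples of two allele labels, dropped-out alleles are represented by the placeholder string `"Q"`, and the mixing ratios are hypothetical values that sum to 1.

```python
DROPOUT = "Q"  # placeholder for an allele missing from the observed set O


def effective_allele_numbers(observed, combination, mixing):
    """A_o = sum over contributors x of (copies of o in genotype x) * M_x."""
    return {
        seq: sum(genotype.count(seq) * m for genotype, m in zip(combination, mixing))
        for seq in observed
    }


def classify_sequences(observed, combination, mixing):
    """Split observed sequences into alleles and noise; count dropouts."""
    a = effective_allele_numbers(observed, combination, mixing)
    alleles = [s for s in observed if a[s] != 0]
    noise = [s for s in observed if a[s] == 0]
    dropouts = sum(genotype.count(DROPOUT) for genotype in combination)
    return a, alleles, noise, dropouts


observed = ["a", "b", "c"]
combination = (("a", "b"), ("a", "Q"))  # contributor 1: a/b, contributor 2: a/dropout
mixing = (0.75, 0.25)                   # hypothetical mixing ratios M_1, M_2
a, alleles, noise, z = classify_sequences(observed, combination, mixing)
print(a, alleles, noise, z)
```

Here sequence `c` is explained by neither contributor, so under this particular genotype combination it is treated as noise, while the `Q` in the second genotype counts as one dropout.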
The weight $D_o$ of the dropped-out allele sequences is:
$$D_o = P(\mathrm{Dropout})^{z}$$
wherein P(Dropout) is the allele dropout rate and z is the number of allele dropouts in the genotype combination: dropout of one allele of a heterozygote is counted once, and dropout of a homozygote is counted twice.
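A minimal sketch of the dropout weight, assuming that dropped-out alleles are marked in each genotype by the placeholder string `"Q"` (an illustrative encoding, as are the function names). Counting placeholders reproduces the rule that a heterozygote losing one allele contributes 1 and a fully dropped-out homozygote contributes 2.

```python
DROPOUT = "Q"  # hypothetical placeholder symbol for a dropped-out allele


def count_dropouts(combination):
    """z: number of dropout placeholders across all contributor genotypes."""
    return sum(genotype.count(DROPOUT) for genotype in combination)


def dropout_weight(combination, dropout_rate):
    """D_o = P(Dropout) ** z."""
    return dropout_rate ** count_dropouts(combination)


# One heterozygote with a single dropout plus one fully dropped-out homozygote.
combo = (("a", "Q"), ("Q", "Q"))
z = count_dropouts(combo)
weight = dropout_weight(combo, dropout_rate=0.05)
print(z, weight)
```

A combination without any placeholder gets a dropout weight of exactly 1, so it is not penalized by this term.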
The weight $D_{no}$ of the noise sequences is the product of the probability density function values of the individual noise sequences:
$$D_{no} = \prod_{k=1}^{K} \mathrm{beta}(r_k;\ \tau_1, \tau_2)$$
wherein K is the number of noise sequences, $r_k$ is the signal ratio of the k-th noise sequence at the locus, $r_k \sim \mathrm{beta}(\tau_1, \tau_2)$, $\tau_1$ and $\tau_2$ are the noise distribution parameters, and beta denotes the probability density function of the beta distribution.
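The noise weight can be sketched with the standard library alone by evaluating the beta density through `math.lgamma`. The function names are hypothetical; the sketch assumes every signal ratio lies strictly between 0 and 1, which holds for noise sequences that passed the analysis threshold at a locus with more than one sequence.

```python
import math


def beta_pdf(x, a, b):
    """Probability density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def noise_weight(signal_ratios, tau1, tau2):
    """D_no: product of beta densities over the K noise sequences."""
    weight = 1.0
    for r in signal_ratios:
        weight *= beta_pdf(r, tau1, tau2)
    return weight


# Sanity check against a case with a known closed form: Beta(2, 2) at 0.5 is 1.5.
print(beta_pdf(0.5, 2.0, 2.0))
# With no noise sequences the product is empty and the weight is 1.
print(noise_weight([], 0.32, 530.19))
```

With a large second shape parameter the density falls off steeply as the signal ratio grows, so a genotype combination that labels a high-ratio sequence as noise receives a very small weight.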
The weight $D_a$ of the allele sequences is the product of the probability density function values of the individual allele sequences:
$$D_a = \prod_{n=1}^{N} \mathrm{norm}(y_n;\ A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu)$$
wherein the sequencing read count $y_n$ of the n-th allele at the locus follows a normal distribution,
$$y_n \sim \mathrm{norm}(A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu);$$
$\alpha_l$ is the locus-specific detection efficiency parameter of locus l, $\mu$ is the normal distribution mean parameter, CV is the normal distribution variability parameter, n is the allele index, N is the number of alleles at the locus, $A_n$ is the effective allele number of the n-th allele, and norm denotes the probability density function of the normal distribution.
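A sketch of the allele weight using `statistics.NormalDist`. It assumes that the second argument of norm is a standard deviation, i.e. that CV is a coefficient of variation applied to the expected read count; the text does not state this explicitly. The function name and the numbers are illustrative.

```python
from statistics import NormalDist


def allele_weight(read_counts, effective_numbers, alpha_l, mu, cv):
    """D_a: product of normal densities of the observed allele read counts.

    Expected reads for allele n: A_n * alpha_l * mu.
    Standard deviation (assumed): CV * A_n * alpha_l * mu.
    """
    weight = 1.0
    for y, a_n in zip(read_counts, effective_numbers):
        mean = a_n * alpha_l * mu
        sd = cv * mean
        weight *= NormalDist(mean, sd).pdf(y)
    return weight


# Hypothetical values: two alleles with effective numbers 1.0 and 0.75.
w = allele_weight(
    read_counts=[980.0, 760.0],
    effective_numbers=[1.0, 0.75],
    alpha_l=1.0,
    mu=1000.0,
    cv=0.1,
)
print(w)
```

Only sequences classified as alleles (non-zero effective allele number) enter this product, so the standard deviation is always positive.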
The weight $D_{G_i l}$ of the i-th genotype combination $G_i$ at the specified locus l is calculated as:
$$D_{G_i l} = D_a \cdot D_{no} \cdot D_o$$
S4, calculating the likelihood function $Lik_l$ at a locus l:
$$Lik_l = \sum_i P(G_i \mid H)\, D_{G_i l}$$
wherein i is the index of the genotype combination, $G_i$ is the i-th genotype combination, $P(G_i \mid H)$ is the corresponding frequency of $G_i$ in the population, H is the prosecution hypothesis or the defense hypothesis, and $D_{G_i l}$ is the weight of the i-th genotype combination $G_i$ at locus l;
S5, calculating the joint likelihood function Lik over all loci of the sample to be tested under the prosecution hypothesis and under the defense hypothesis:
$$Lik = \prod_{l=1}^{L} Lik_l$$
wherein l is the locus index and L is the total number of loci.
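The two formulas above reduce to a weighted sum per locus and a product across loci. The sketch below is illustrative: the function names are hypothetical, and working in log10 space for the product is an implementation choice (to avoid floating-point underflow over many loci), not something the text prescribes.

```python
import math


def locus_likelihood(frequencies, weights):
    """Lik_l = sum_i P(G_i | H) * D_{G_i l} for one locus."""
    return sum(p * d for p, d in zip(frequencies, weights))


def joint_log10_likelihood(locus_likelihoods):
    """log10 of Lik = prod_l Lik_l, accumulated as a sum of logs."""
    return sum(math.log10(value) for value in locus_likelihoods)


# Hypothetical frequencies and weights for two genotype combinations at one locus.
lik_l = locus_likelihood([0.5, 0.25], [2.0, 4.0])
# Hypothetical per-locus likelihoods for a two-locus panel.
log10_lik = joint_log10_likelihood([10.0, 100.0])
print(lik_l, log10_lik)
```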
The likelihood ratio of the prosecution hypothesis to the defense hypothesis is calculated from the likelihood functions as follows:
The parameters of the likelihood functions are estimated separately under the prosecution hypothesis and under the defense hypothesis. The estimated parameters comprise the normal distribution mean parameter $\mu$, the normal distribution variability parameter CV and the mixing ratio parameter $M_x$ of each contributor, the number of mixing ratio parameters being the assumed number of contributors minus 1. The parameter values are estimated by maximum likelihood estimation, and the estimation is repeated three times for each sample.
The likelihood function under the prosecution hypothesis and under the defense hypothesis is evaluated at the estimated parameter values (three sets of parameter values), and the maximum value of the likelihood function under each hypothesis is taken;
The likelihood ratio of the prosecution hypothesis to the defense hypothesis is obtained as the ratio of the two maxima.
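The final step, taking the best of the repeated estimation runs under each hypothesis and forming the ratio, can be sketched as below. The optimizer itself is not shown; the inputs are assumed to be log10 likelihood values already produced by the three maximum likelihood runs, and all numbers are hypothetical.

```python
def log10_likelihood_ratio(log10_lik_hp_runs, log10_lik_hd_runs):
    """log10 LR from the best run under each hypothesis.

    Each argument holds the maximised log10 likelihood from every repeated
    estimation run (three per hypothesis in the described procedure).
    """
    best_hp = max(log10_lik_hp_runs)  # prosecution hypothesis
    best_hd = max(log10_lik_hd_runs)  # defense hypothesis
    return best_hp - best_hd


# Hypothetical results of three runs per hypothesis.
log10_lr = log10_likelihood_ratio(
    [-120.5, -118.0, -119.25],
    [-150.0, -149.0, -151.5],
)
print(log10_lr)
```

A positive log10 LR means the data are better explained by the prosecution hypothesis; a negative value favors the defense hypothesis.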
Step 5, calculating the likelihood function value of every genotype combination at each locus, and ranking the genotype combinations at each locus by that value. The ranking provides the basis for deconvolution of the mixed DNA to be tested, thereby achieving the analysis and interpretation of forensic mixed DNA.
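A sketch of the per-locus ranking, assuming that the value ranked for each genotype combination is its term $P(G_i \mid H) \cdot D_{G_i l}$ in the locus likelihood. The normalized share reported alongside each score is an added convenience for reading the ranking, not part of the described method; names and numbers are hypothetical.

```python
def rank_genotype_combinations(combinations, frequencies, weights):
    """Rank genotype combinations at one locus by P(G_i | H) * D_{G_i l}.

    Returns a list of (combination, score, share) sorted from most to least
    likely, where share is the score divided by the locus total.
    """
    scored = [(p * d, combo) for combo, p, d in zip(combinations, frequencies, weights)]
    total = sum(score for score, _ in scored)
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    return [
        (combo, score, score / total if total > 0 else 0.0)
        for score, combo in ranked
    ]


# Hypothetical labels, frequencies and weights for three combinations.
ranked = rank_genotype_combinations(
    ["G1", "G2", "G3"],
    frequencies=[0.5, 0.25, 0.25],
    weights=[1.0, 4.0, 0.0],
)
for combo, score, share in ranked:
    print(combo, score, round(share, 3))
```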
The invention is further illustrated by specific data.
Step 1, MH sequencing data of 100 single-source DNA samples, each with a DNA input of 1 ng, are taken as the training set to estimate the values of the prior parameters. The training samples come from 100 different individuals and were genotyped with the same MH detection reagent, MPS library preparation workflow, MPS sequencing procedure and sequencing data analysis workflow.
The a priori parameters obtained are as follows:
The allele dropout rate is 0.05; the noise distribution parameters are $\tau_1 = 0.32$ and $\tau_2 = 530.19$; the locus-specific detection efficiency parameter values $\alpha_l$ are shown in Table 1; the analysis threshold is 0.02; and the MH allele population frequencies are shown in Table 2.
TABLE 1 Gene locus specific assay efficacy parameters
TABLE 2 frequency of MH allele population
Step 2, obtaining characteristic information of the MH sequencing data of the mixed DNA sample to be tested, wherein the characteristic information comprises MH typing and quantitative information of the mixed DNA, target individual typing information, reference individual typing information and related parameters;
The MH typing and quantitative information of the mixed DNA comprises the MH loci, the total sequencing read count of each MH locus, the base information of each MH sequence and the sequencing read count of each MH sequence.
The target individual typing information comprises the MH allele base information of the target individual at each MH locus.
The reference individual typing information comprises the MH allele base information of the reference individual at each MH locus;
The related parameters comprise the prior parameter values given above, and the number of contributors is set to 2.
Step 3, filtering noise sequences according to the sequencing information;
Calculating the signal ratio r of each base sequence from the MH sequencing data of the mixed DNA;
The base sequences in the MH sequencing data of the mixed DNA are filtered by signal ratio r with an analysis threshold of 0.01, and only base sequences with r greater than 0.01 are retained.
Step 4, constructing likelihood functions under the prosecution hypothesis and the defense hypothesis from the prior parameters obtained in step 1 and the characteristic information obtained in step 2, and calculating the likelihood ratio of the prosecution hypothesis to the defense hypothesis from the likelihood functions;
The prosecution hypothesis is that the target individual and n unknown individuals are the contributors to the mixed DNA sample;
The defense hypothesis is that n+1 unknown individuals are the contributors to the mixed DNA sample;
S1, obtaining all possible genotype combinations at each locus;
All possible genotype combinations G at each locus are enumerated separately under each hypothesis, based on the prosecution and defense hypotheses, the MH sequencing data of the mixed DNA sample to be tested, and the typing information of the target individual.
First, all base sequences O that passed filtering at a given locus l are extracted and combined with one allele dropout placeholder Q to form the candidate allele set A = {O, Q}. For each unknown contributor, two candidate alleles are drawn from the set A with replacement to form a genotype. The genotypes of the unknown contributors are then combined by Cartesian product, so that different contributors may carry the same genotype. If the target individual or the reference individual is present under the hypothesis, the genotype combinations constructed for the unknown individuals are combined with the known genotype of the target or reference individual at the locus to give all possible genotype combinations G at the locus.
S2, calculating genotype frequencies of each contributor according to allele population frequency information;
The genotype frequency of each contributor is calculated from the allele population frequencies under Hardy-Weinberg equilibrium. When the contributor is heterozygous, such as AB, the genotype frequency is $2 p_A p_B$, where $p_A$ and $p_B$ are the population frequencies of alleles A and B, respectively. When the contributor is homozygous, such as AA, the genotype frequency is $p_A^2$. The frequency $P(G_i \mid H)$ of the i-th genotype combination $G_i$ in the population at that locus is the product of the genotype frequencies of all contributors at that locus.
S3, obtaining corresponding weights according to genotype combinations and MH sequence sequencing readings;
For every possible genotype combination at each locus, the corresponding probability or probability density of the observed MH allele sequence read counts is calculated; this quantity is called the weight.
First, the sequencing read counts $O_c$ and signal ratios $O_r$ of the set O of all sequences at the specified locus l are obtained.
Under a specified genotype combination $G_i$, the effective allele number A of each sequence in the set O is calculated, wherein the effective allele number of a sequence o is $A_o$:
$$A_o = \sum_{x=1}^{X} B_{ox} M_x$$
wherein $B_{ox}$ is the number of copies of sequence o contained in the genotype of the x-th contributor, $M_x$ is the mixing ratio parameter of the x-th contributor, x is the contributor index, and X is the number of contributors, here X = 2;
All sequences in O are classified according to the effective allele number $A_o$: a sequence is an allele sequence if $A_o$ is not 0 and a noise sequence if $A_o$ is 0. Sequences contained in the genotype combination but absent from the set O are dropped-out allele sequences;
The weights of the allele sequences, the noise sequences and the dropped-out allele sequences are calculated separately and multiplied to obtain the weight of the genotype combination.
The weight $D_o$ of the dropped-out allele sequences is:
$$D_o = P(\mathrm{Dropout})^{z}$$
wherein P(Dropout) is the allele dropout rate, here 0.05, and z is the number of allele dropouts in the genotype combination: dropout of one allele of a heterozygote is counted once, and dropout of a homozygote is counted twice.
The weight $D_{no}$ of the noise sequences is the product of the probability density function values of the individual noise sequences:
$$D_{no} = \prod_{k=1}^{K} \mathrm{beta}(r_k;\ \tau_1, \tau_2)$$
wherein K is the number of noise sequences, $r_k$ is the signal ratio of the k-th noise sequence at the locus, $r_k \sim \mathrm{beta}(\tau_1, \tau_2)$, $\tau_1$ and $\tau_2$ are the noise distribution parameters with $\tau_1 = 0.32$ and $\tau_2 = 530.19$, and beta denotes the probability density function of the beta distribution;
The weight $D_a$ of the allele sequences is the product of the probability density function values of the individual allele sequences:
$$D_a = \prod_{n=1}^{N} \mathrm{norm}(y_n;\ A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu)$$
wherein the sequencing read count $y_n$ of the n-th allele at the locus follows a normal distribution,
$$y_n \sim \mathrm{norm}(A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu);$$
$\alpha_l$ is the locus-specific detection efficiency parameter of locus l, with values as shown in Table 1, $\mu$ is the normal distribution mean parameter, CV is the normal distribution variability parameter, n is the allele index, N is the number of allele sequences at the locus, $A_n$ is the effective allele number of the n-th allele, and norm denotes the probability density function of the normal distribution.
The weight $W_{G_i}$ of the $i$-th genotype combination $G_i$ at the specified locus $l$ is the product of the three weights:
$$W_{G_i} = D_a \cdot D_{no} \cdot D_o$$
S4, calculating the likelihood function $Lik_l$ at a locus $l$:
$$Lik_l = \sum_{i} P(G_i \mid H)\, W_{G_i}$$
where $i$ is the genotype combination index, $G_i$ is the $i$-th genotype combination, $P(G_i \mid H)$ is the frequency of $G_i$ in the population under hypothesis $H$, $H$ is the prosecution hypothesis or the defense hypothesis, and $W_{G_i}$ is the weight of the $i$-th genotype combination $G_i$;
S5, calculating the likelihood function $Lik$ of the sample to be tested, combined over all loci, under the prosecution hypothesis and the defense hypothesis:
$$Lik = \prod_{l=1}^{L} Lik_l$$
where $l$ is the locus index and $L$ is the total number of loci.
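Steps S4 and S5 can be sketched in Python as a sum over genotype combinations followed by a product over loci. The product is accumulated as a sum of base-10 logarithms to avoid numerical underflow across many loci, which is an implementation choice and not something the text prescribes; the function names and input layout are hypothetical.

```python
import math

def locus_likelihood(combinations):
    """Lik_l: sum over genotype combinations of P(G_i | H) * weight(G_i).

    combinations : iterable of (genotype_frequency, weight) pairs
    """
    return sum(freq * weight for freq, weight in combinations)

def combined_log10_likelihood(loci):
    """log10 of Lik: sum over loci of log10(Lik_l)."""
    return sum(math.log10(locus_likelihood(c)) for c in loci)

# Hypothetical values for two loci.
locus_1 = [(0.5, 0.2), (0.25, 0.4)]   # Lik_1 = 0.1 + 0.1 = 0.2
locus_2 = [(1.0, 0.5)]                # Lik_2 = 0.5
log10_lik = combined_log10_likelihood([locus_1, locus_2])
```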
The likelihood ratio of the prosecution hypothesis to the defense hypothesis is calculated from the likelihood functions as follows:
Parameter estimation is carried out separately on the likelihood functions under the prosecution hypothesis and under the defense hypothesis. The estimated parameters comprise the normal distribution mean parameter $\mu$, the normal distribution variability parameter $CV$, and the mixing ratio parameter $M_x$ of each contributor, where the number of mixing ratio parameters is the assumed number of contributors minus 1. The parameter values are estimated by maximum likelihood estimation, and the estimation is repeated three times for each sample.
The value of the likelihood function under the prosecution hypothesis and under the defense hypothesis is calculated from the estimated parameter values (three sets of parameter values), and the maximum value of the likelihood function under each hypothesis is taken;
The likelihood ratio (LR) of the prosecution hypothesis to the defense hypothesis is then obtained, giving $\log_{10} LR = 30.4948$. This result supports the hypothesis that the person of interest is a contributor to the mixed sample and quantifies the strength of the evidence.
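The final ratio can be sketched as the difference between the largest log-likelihood found under each hypothesis across the repeated estimation runs. The run values below are hypothetical placeholders, not results from the embodiment, and the function name is an assumption.

```python
def log10_likelihood_ratio(log10_lik_prosecution_runs, log10_lik_defense_runs):
    """log10 LR from the best of the repeated maximum likelihood runs."""
    best_prosecution = max(log10_lik_prosecution_runs)
    best_defense = max(log10_lik_defense_runs)
    return best_prosecution - best_defense

# Hypothetical log10 likelihoods from three estimation runs per hypothesis.
prosecution_runs = [-120.4, -118.2, -119.0]
defense_runs = [-150.1, -149.7, -151.3]
log10_lr = log10_likelihood_ratio(prosecution_runs, defense_runs)
```

A positive value favors the hypothesis that the person of interest contributed to the mixture, and a negative value favors the alternative.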
Step 5: the likelihood function values of all genotype combinations at each locus are calculated and ranked by magnitude, as shown in Table 3. The contributor genotype combinations with the higher likelihood at each locus are taken as the mixed DNA deconvolution result.
TABLE 3 results of deconvolution of loci
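The ranking used for deconvolution can be sketched as sorting the genotype combinations at one locus by their contribution to the locus likelihood. Normalising the contributions so that they sum to 1 is an added convenience and an assumption, as are the function name and the example labels and values.

```python
def rank_genotype_combinations(combinations):
    """Rank genotype combinations at one locus by frequency * weight.

    combinations : list of (label, genotype_frequency, weight) tuples
    Returns (label, normalised_share) pairs, largest first.
    """
    scored = [(label, freq * weight) for label, freq, weight in combinations]
    total = sum(score for _, score in scored)
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [(label, score / total) for label, score in ranked]

# Hypothetical combinations for a two-person mixture at one locus.
combinations = [
    ("AC/BD", 0.20, 0.10),
    ("AB/CD", 0.10, 0.50),
    ("AA/BC", 0.05, 0.20),
]
ranking = rank_genotype_combinations(combinations)
```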
The method analyzes the MH sequencing data of mixed DNA samples accurately and objectively, reduces uncertainty and subjective error in the analysis, significantly broadens the range of samples that can be analyzed, and improves the accuracy of mixed-DNA MH sequencing data analysis. Because no manual analysis is required, forensic examiners can process complex data more efficiently and in less time. The method also allows more of the information hidden in the sequencing data to be mined and interpreted when analyzing complex mixed MH data.
The invention provides a scientific analysis and interpretation method of MH sequencing data, which not only promotes the practical application of MH genetic markers in the forensic field, but also lays a foundation for the future development of the MH genetic markers.
Claims (7)
1. A method for mixed DNA probabilistic typing of MH sequencing data comprising the steps of:
Step 1, constructing a training set from MH sequencing data of single-source DNA, and estimating prior parameters from the training set;
Step 2, obtaining characteristic information of the MH sequencing data of the mixed DNA sample to be tested, wherein the characteristic information comprises MH typing and quantitative information of the mixed DNA, target individual typing information, reference individual typing information, and related parameters;
The target individual typing information comprises MH allele base information of the target individual at each MH locus;
The reference individual typing information comprises MH allele base information of the reference individual at each MH locus;
The related parameters comprise the prior parameters and the number of contributors;
Step 3, filtering out noise sequences according to the sequencing information;
Step 4, constructing likelihood functions under the prosecution hypothesis and the defense hypothesis, respectively, according to the prior parameters obtained in step 1 and the characteristic information obtained in step 2, and calculating the likelihood ratio of the prosecution hypothesis to the defense hypothesis from the likelihood functions;
The prosecution hypothesis is that the target individual and n unknown individuals are the contributors to the mixed DNA sample, or that the target individual, the reference individual, and n-1 unknown individuals are the contributors to the mixed DNA sample;
The defense hypothesis is that n+1 unknown individuals are the contributors to the mixed DNA sample, or that the reference individual and n unknown individuals are the contributors to the mixed DNA sample;
The target individual is an individual suspected of being one of the contributors to the mixed DNA sample;
The reference individual is an individual who has been determined to be one of the contributors to the mixed DNA sample;
The likelihood function construction method comprises the following steps:
S1, obtaining all possible genotype combinations at each locus;
S2, calculating the genotype frequencies of each contributor according to allele population frequency information;
S3, obtaining the corresponding weights according to the genotype combinations and the MH sequencing reads;
The procedure for obtaining the corresponding weights from the genotype combinations and the MH sequencing reads is as follows:
Obtaining sequencing read information O c and signal ratio O r for a set O of all sequences at a specified locus l;
Under the specified genotype combination $G_i$, the effective allele number $A$ of each sequence in set $O$ is calculated, wherein the effective allele number of sequence $o$ is $A_o$:
$$A_o = \sum_{x=1}^{X} B_{ox} M_x$$
wherein $B_{ox}$ is the number of copies of sequence $o$ of sequence set $O$ contained in the genotype of the $x$-th contributor, $M_x$ is the mixing ratio parameter of the $x$-th contributor, $x$ is the contributor index, and $X$ is the number of contributors;
Classifying all sequences according to the effective allele number $A_o$: a sequence is an allele sequence if $A_o$ is not 0 and a noise sequence if $A_o$ is 0, and sequences contained in the genotype combination but not in the sequence set $O$ are missing allele sequences;
Respectively calculating weights of an allele sequence, a noise sequence and a missing allele sequence, and multiplying the weights to obtain weights corresponding to genotype combinations;
S4, calculating the likelihood function $Lik_l$ at a locus $l$:
$$Lik_l = \sum_{i} P(G_i \mid H)\, W_{G_i}$$
wherein $i$ is the genotype combination index, $G_i$ is the $i$-th genotype combination, $P(G_i \mid H)$ is the frequency of $G_i$ in the population under hypothesis $H$, $H$ is the prosecution hypothesis or the defense hypothesis, and $W_{G_i}$ is the weight of the $i$-th genotype combination $G_i$;
S5, calculating the likelihood function $Lik$ of the sample to be tested, combined over all loci:
$$Lik = \prod_{l=1}^{L} Lik_l$$
wherein $l$ is the locus index and $L$ is the total number of loci;
Step 5, calculating the likelihood function values of all possible genotype combinations at each locus and ranking them by magnitude to obtain a per-locus ranking result.
2. The mixed DNA probabilistic typing method for MH sequencing data according to claim 1, wherein the prior parameters in step 1 comprise a locus-specific detection efficacy parameter of the MH genetic markers, an analysis threshold, an allele dropout rate, allele population frequencies, and noise distribution parameters.
3. The mixed DNA probabilistic typing method for MH sequencing data according to claim 1, wherein the noise sequence filtering in step 3 is as follows:
Calculating the signal ratio r of each base sequence from the MH sequencing data of the mixed DNA sample to be tested;
Filtering the MH sequencing data of the mixed DNA sample to be tested according to the analysis threshold, and discarding the MH sequencing data that fall outside the threshold range.
4. The mixed DNA probabilistic typing method for MH sequencing data according to claim 1, wherein the process of calculating the genotype frequencies of each contributor in S2 is as follows:
The genotype frequencies of each contributor are calculated from the allele population frequency information and the Hardy-Weinberg equilibrium law.
5. The method according to claim 1, wherein the calculation in step 4 of the likelihood ratio of the prosecution hypothesis to the defense hypothesis from the likelihood functions is as follows:
Carrying out parameter estimation separately on the likelihood functions under the prosecution hypothesis and under the defense hypothesis;
Calculating the value of the likelihood function under the prosecution hypothesis and under the defense hypothesis from the estimated parameter values, and taking the maximum value of the likelihood function under each hypothesis;
Obtaining the likelihood ratio of the prosecution hypothesis to the defense hypothesis.
6. The mixed DNA probabilistic typing method for MH sequencing data according to claim 5, wherein the estimated parameters comprise a normal distribution mean parameter, a normal distribution variability parameter, and a mixing ratio parameter for each contributor.
7. The mixed DNA probabilistic typing method for MH sequencing data according to claim 1, wherein the weight $D_o$ of the missing allele sequences is:
$$D_o = \left(P(\mathrm{Dropout})\right)^{z}$$
wherein $P(\mathrm{Dropout})$ is the allele dropout rate, and $z$ is the number of allele dropouts in the genotype combination;
The weight $D_{no}$ of the noise sequences is:
$$D_{no} = \prod_{k=1}^{K} \mathrm{beta}(r_k;\ \tau_1, \tau_2)$$
wherein $K$ is the number of noise sequences, $r_k$ is the signal ratio of the $k$-th noise sequence at the locus, $\tau_1$ and $\tau_2$ are noise distribution parameters, and beta denotes the probability density function of the beta distribution;
The weight $D_a$ of the allele sequences is:
$$D_a = \prod_{n=1}^{N} \mathrm{norm}(y_n;\ A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu)$$
The sequencing read count $y_n$ of the $n$-th allele at the locus follows a normal distribution:
$$y_n \sim \mathrm{norm}(A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu)$$
wherein $\alpha_l$ is the locus-specific detection efficacy parameter of locus $l$, $\mu$ is the normal distribution mean parameter, $CV$ is the normal distribution variability parameter, $n$ is the allele index, $N$ is the number of alleles at the locus, $A_n$ is the effective allele number of the $n$-th allele, and norm denotes the probability density function of the normal distribution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410881784.7A CN118748040B (en) | 2024-07-03 | 2024-07-03 | A hybrid DNA probabilistic typing method for MH sequencing data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118748040A CN118748040A (en) | 2024-10-08 |
CN118748040B true CN118748040B (en) | 2024-12-31 |
Family
ID=92922904
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410881784.7A Active CN118748040B (en) | 2024-07-03 | 2024-07-03 | A hybrid DNA probabilistic typing method for MH sequencing data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118748040B (en) |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030143554A1 (en) * | 2001-03-31 | 2003-07-31 | Berres Mark E. | Method of genotyping by determination of allele copy number |
SG11201911530RA (en) * | 2017-06-20 | 2020-01-30 | Illumina Inc | Methods for accurate computational decomposition of dna mixtures from contributors of unknown genotypes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||