CN118748040B

CN118748040B - A mixed DNA probabilistic typing method for MH sequencing data

Info

Publication number: CN118748040B
Application number: CN202410881784.7A
Authority: CN
Inventors: 张; 王雨婷; 胡渝涵; 朱强; 王玉芳
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2024-07-03
Filing date: 2024-07-03
Publication date: 2024-12-31
Anticipated expiration: 2044-07-03
Also published as: CN118748040A

Abstract

The invention discloses a probabilistic typing method for mixed DNA based on MH sequencing data. The method comprises the following steps: 1, obtaining prior parameters from MH sequencing data of single-source DNA; 2, obtaining characteristic information of the MH sequencing data of the mixed DNA sample to be tested; 3, filtering noise sequences according to the sequencing information; 4, constructing likelihood functions under the prosecution hypothesis and the defense hypothesis from the prior parameters and the characteristic information, and calculating the likelihood ratio of the two hypotheses; and 5, ranking the possible genotype combinations by their likelihood function values.

Description

Mixed DNA probability typing method for MH sequencing data

Technical Field

The invention relates to the technical field of forensic DNA analysis, in particular to a mixed DNA probability typing method for MH sequencing data.

Background

Mixed DNA is a DNA sample formed by mixing the DNA of several individuals. It is a common sample type in forensic DNA analysis and among the most difficult to analyze. Each individual contributing to the mixed DNA is referred to as a contributor. Genotyping results for the genetic markers of mixed DNA appear as mixed genotypes, i.e., the alleles of multiple individuals are observed together. Mixed DNA analysis has two main purposes: 1) to resolve the possible individual genotypes of each contributor from the mixed genotyping result, a process referred to as "deconvolution"; and 2) to evaluate the mixed genotyping result with biostatistical methods as forensic scientific evidence and to quantify how strongly it supports the facts of the case, i.e., "evidence strength assessment".

Microhaplotype (MH) is a novel genetic marker based on DNA sequence polymorphisms that has attracted considerable interest in the forensic field. It is usually detected and genotyped on a massively parallel sequencing (MPS) platform. MH markers are highly polymorphic and sensitive to detect, have a low mutation rate and short fragment length, and produce no stutter products. These features make MH suitable for forensic analysis of complex biological samples, especially genotyping of mixed DNA, and provide a new solution for forensic DNA analysis of biological samples mixed from multiple individuals.

In recent years, a growing body of research has focused on genotyping forensic mixed DNA samples using MH genetic markers. These studies not only demonstrate the potential of MH in forensic applications but also provide a theoretical basis for its use in casework. However, because of the complexity of the next-generation sequencing workflow, of the multiplex targeted amplification of MH markers, and of the resulting sequencing data structures, there is currently a lack of analytical methods for MH next-generation sequencing data from mixed DNA samples.

Existing MH analysis relies mainly on an analyst's observation and subjective judgment of the sequencing results, so the analysis of mixed MH sequencing data is largely qualitative and has poor reliability and accuracy.

Disclosure of Invention

The invention provides a mixed DNA probability typing method for MH sequencing data, aiming at the problems in the prior art.

The technical scheme adopted by the invention is that the mixed DNA probability typing method for MH sequencing data comprises the following steps:

Step 1, constructing a training set from MH sequencing data of single-source DNA, and estimating the prior parameters from the training set;

Step 2, acquiring characteristic information of the MH sequencing data of the mixed DNA sample to be tested, wherein the characteristic information comprises the MH typing and quantitative information of the mixed DNA, the target individual typing information, the reference individual typing information, and the relevant parameters;

Step 3, filtering noise sequences according to the sequencing information;

Step 4, constructing likelihood functions under the prosecution hypothesis and the defense hypothesis from the prior parameters obtained in Step 1 and the characteristic information obtained in Step 2, and calculating the likelihood ratio of the prosecution hypothesis and the defense hypothesis from the likelihood functions;

The prosecution hypothesis is that the target individual and n unknown individuals are contributors to the mixed DNA sample, or that the target individual, the reference individual, and n-1 unknown individuals are contributors to the mixed DNA sample;

The defense hypothesis is that n+1 unknown individuals are contributors to the mixed DNA sample, or that the reference individual and n unknown individuals are contributors to the mixed DNA sample;

The target individual is an individual suspected of being one of the contributors to the mixed DNA sample;

The reference individual is an individual who has been determined to be one of the contributors to the mixed DNA sample;

Step 5, calculating the likelihood function values of all genotype combinations at each locus, and ranking the genotype combinations at each locus by likelihood function value.

Further, the prior parameters in the step 1 comprise locus specific detection efficacy parameters of MH genetic markers, analysis threshold values, allele loss rates, allele population frequencies and noise distribution parameters.

Further, the MH sequencing data of the mixed DNA in Step 2 comprises the MH loci, the total sequencing read count of each MH locus, the base information of each MH sequence, and the sequencing read count of each MH sequence;

The target individual typing information comprises MH allele base information of the individual at each MH locus;

Reference typing information includes MH allele base information for the individual at each MH locus;

the relevant parameters include a priori parameters and the number of contributors.

Further, the noise sequence filtering process in the step 3 is as follows:

Calculating the signal ratio r of each base sequence according to the MH sequencing data of the mixed DNA sample to be tested:

$$ r = \frac{d}{D} $$

where d is the sequencing read count of the base sequence and D is the total sequencing read count of its locus;

Filtering the MH sequencing data of the mixed DNA sample to be tested according to the analysis threshold, retaining only the base sequences whose signal ratio r is greater than the analysis threshold.

Further, the likelihood function construction method in the step 4 is as follows:

S1, obtaining all possible genotype combinations at each locus;

S2, calculating the genotype frequency of each contributor according to the allele population frequency information;

S3, obtaining the corresponding weights according to the genotype combinations and the MH sequence sequencing reads;

S4, calculating the likelihood function Lik_l at one locus l:

$$ \mathrm{Lik}_l = \sum_{i} P(G_i \mid H)\, D_{G_i l} $$

where i is the index of the genotype combination, G_i is the i-th genotype combination, P(G_i|H) is the frequency of G_i in the population, H is the prosecution hypothesis or the defense hypothesis, and D_{G_i l} is the weight of the i-th genotype combination G_i at locus l;

S5, combining all loci to calculate the likelihood function Lik of the sample to be tested:

$$ \mathrm{Lik} = \prod_{l=1}^{L} \mathrm{Lik}_l $$

where l is the locus index and L is the number of all loci.

Further, the process of calculating the genotype frequencies of each contributor in step S2 is as follows:

The genotype frequency of each contributor is calculated from the allele population frequency information according to the Hardy-Weinberg equilibrium law.

Further, the process of obtaining the corresponding weights according to the genotype combination and the MH sequencing reads in the step S3 is as follows:

Obtaining the sequencing read information O_c and the signal ratio O_r for the set O of all sequences at a specified locus l;

Under the specified genotype combination G_i, calculating the allele effective number of each sequence in the set O, where the allele effective number of sequence o is A_o:

$$ A_o = \sum_{x=1}^{X} B_{ox} M_x $$

where B_{ox} is the number of copies of sequence o contained in the genotype of the x-th contributor, M_x is the mixing ratio parameter of the x-th contributor, x is the contributor index, and X is the number of contributors;

Classifying all sequences according to the allele effective number A_o: a sequence is an allele sequence if A_o is not 0 and a noise sequence if A_o is 0, and a sequence that is contained in the genotype combination but not in the set O is a missing allele sequence;

Calculating the weights of the allele sequences, the noise sequences, and the missing allele sequences separately, and multiplying them to obtain the weight of the genotype combination.

Further, the process of calculating the likelihood ratio of the prosecution hypothesis and the defense hypothesis from the likelihood functions in Step 4 is as follows:

Carrying out parameter estimation separately on the likelihood functions under the prosecution hypothesis and the defense hypothesis;

Evaluating the likelihood function under the prosecution hypothesis and under the defense hypothesis at the estimated parameter values, and taking the maximum value of the likelihood function under each hypothesis;

Obtaining the likelihood ratio of the prosecution hypothesis and the defense hypothesis.

Further, the estimated parameters include a normal distribution mean parameter, a normal distribution variability parameter, and a mixing ratio parameter for each contributor.

Further, the weight D_o of the missing allele sequences is:

$$ D_o = \left(P(\mathrm{Dropout})\right)^{z} $$

where P(Dropout) is the allele loss rate and z is the number of allele losses in the genotype combination;

The weight D_no of the noise sequences is:

$$ D_{no} = \prod_{k=1}^{K} \mathrm{beta}(r_k \mid \tau_1, \tau_2) $$

where K is the number of noise sequences, r_k is the signal ratio of the k-th noise sequence at the locus, τ₁ and τ₂ are the noise distribution parameters, and beta denotes the probability density function of the beta distribution;

The weight D_a of the allele sequences is:

$$ D_a = \prod_{n=1}^{N} \mathrm{norm}(y_n \mid A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu) $$

where the sequencing read count y_n of the n-th allele at the locus follows a normal distribution:

$$ y_n \sim \mathrm{norm}(A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu) $$

where α_l is the specific detection efficacy parameter of locus l, μ is the normal distribution mean parameter, CV is the normal distribution variability parameter, n is the allele index, N is the number of alleles at the locus, A_n is the allele effective number of the n-th allele, and norm denotes the probability density function of the normal distribution.

The beneficial effects of the invention are as follows:

(1) The analysis method provided by the invention can analyze the results of the MH sequencing data of the mixed DNA more objectively and accurately, effectively reduce uncertainty and subjective errors in the analysis process, and obviously improve the accuracy and scientificity of MH analysis of the mixed DNA;

(2) The invention improves analysis efficiency: the analysis can be automated, complex data processing becomes more efficient, and manual analysis is not required;

(3) The invention can extract more sample information from complex mixed MH sequencing data and can mine and interpret more of the information hidden in the sequencing data;

(4) The analysis method of the invention promotes the practical application of MH genetic markers in the field of forensic DNA analysis, and lays a foundation for the future development of the MH genetic markers.

Drawings

FIG. 1 is a schematic flow chart of the method of the invention.

Detailed Description

The invention will be further described with reference to the drawings and specific examples.

As shown in FIG. 1, a mixed DNA probabilistic typing method for MH sequencing data includes the following steps:

Step 1, constructing a training set from MH sequencing data of single-source DNA, and estimating the prior parameters from the training set;

The training set data is MH sequencing data of single-source DNA detected under the same conditions, where the conditions include the MH detection reagent, the MPS library preparation workflow, the MPS sequencing program, the sequencing data analysis workflow, and the like. The training set data is used to estimate the prior parameters, which include the locus-specific detection efficacy parameters of the MH genetic markers, the analysis threshold, the allele loss rate, the allele population frequencies, and the sequencing noise distribution parameters.

The estimated prior parameter value can be used for analyzing all unknown samples detected under the same condition, and the prior parameter is not required to be estimated again for each analysis on the premise of not significantly changing the detection condition.

Step 2, acquiring characteristic information of MH sequencing data of a sample to be detected, wherein the characteristic information comprises MH typing and quantitative information of mixed DNA, target individual typing information, reference individual typing information and related parameters;

the reference individual is an individual that has been determined to be one of the contributors to the mixed DNA sample.

The MH sequencing data of the mixed DNA comprises MH locus, MH locus total sequencing reading, MH sequence base information and MH sequence sequencing reading;

Step 3, filtering a noise sequence according to the sequencing information;

The noise sequence filtering process is as follows:

Calculating the signal ratio r of each base sequence according to the MH sequencing data of the mixed DNA sample to be tested:

$$ r = \frac{d}{D} $$

where d is the sequencing read count of the base sequence and D is the total sequencing read count of the locus.

The base sequences in the MH sequencing data of the mixed DNA sample to be tested are then filtered by signal ratio according to the analysis threshold: only base sequences with r greater than the analysis threshold are retained, and all others are filtered out.
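The filtering rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: it assumes the signal ratio is a sequence's read count divided by the total read count at the locus before filtering, and the function name `filter_noise` and the example read counts are hypothetical.

```python
def filter_noise(reads_by_sequence, analysis_threshold):
    """Keep sequences whose signal ratio r = d / D exceeds the threshold.

    reads_by_sequence maps each base sequence at one locus to its read count d.
    D is assumed to be the total read count at the locus before filtering.
    """
    total = sum(reads_by_sequence.values())  # D
    if total == 0:
        return {}
    kept = {}
    for sequence, d in reads_by_sequence.items():
        r = d / total
        if r > analysis_threshold:
            kept[sequence] = (d, r)  # read count and signal ratio
    return kept


# Hypothetical read counts at a single locus.
locus_reads = {"ACGT": 600, "ACGA": 380, "TCGA": 15, "ACTT": 5}
kept = filter_noise(locus_reads, analysis_threshold=0.01)
```

With these made-up counts, the sequence with 5 reads out of 1000 falls below the threshold and is dropped, while the other three are passed on to the likelihood calculation.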

Step 4, constructing likelihood functions under the prosecution hypothesis and the defense hypothesis from the prior parameters obtained in Step 1 and the characteristic information obtained in Step 2, and calculating the likelihood ratio of the prosecution hypothesis and the defense hypothesis from the likelihood functions, wherein the likelihood ratio serves as the evidence strength of the mixed DNA sample to be tested.

The likelihood function construction process is as follows:

S1, obtaining all possible genotype combinations at each locus;

Based on the prosecution and defense hypotheses, the MH sequencing data of the mixed DNA sample to be tested, and the typing information of the target individual and the reference individual, all possible genotype combinations G are enumerated separately for each locus under each hypothesis.

First, all base sequences O retained after filtering at a given locus l are extracted and combined with one allele-loss placeholder Q to form the candidate allele set A = {O, Q}. For each unknown contributor, two candidate alleles are drawn from the set A with replacement to form a genotype. The possible genotypes of all unknown contributors are then combined by Cartesian product. If a target individual or reference individual is present under the hypothesis, the genotype combinations constructed for the unknown individuals are joined with the known genotype of the target or reference individual at the locus to give all possible genotype combinations G at the locus.
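The enumeration described above can be sketched as follows. This is an illustrative sketch under stated assumptions: genotypes are treated as unordered pairs, the placeholder symbol `"Q"` and the function names are hypothetical, and no pruning of implausible combinations is attempted.

```python
from itertools import combinations_with_replacement, product

DROPOUT = "Q"  # hypothetical placeholder symbol for a lost allele


def candidate_genotypes(observed_sequences):
    """Unordered pairs drawn with replacement from the observed sequences plus Q."""
    candidates = list(observed_sequences) + [DROPOUT]
    return list(combinations_with_replacement(candidates, 2))


def genotype_combinations(observed_sequences, n_unknown, known_genotypes=()):
    """Cartesian product over unknown contributors, prefixed by known genotypes."""
    per_unknown = candidate_genotypes(observed_sequences)
    known = tuple(known_genotypes)
    return [known + unknowns for unknowns in product(per_unknown, repeat=n_unknown)]


# One known contributor plus one unknown contributor, two observed sequences.
combos = genotype_combinations(
    ["ACGT", "ACGA"], n_unknown=1, known_genotypes=[("ACGT", "ACGT")]
)
```

The number of combinations grows quickly with the number of unknown contributors and observed sequences, which is one reason noise filtering before enumeration matters.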

S2, calculating the genotype frequency of each contributor according to the allele population frequency information;

The genotype frequency of each contributor is calculated from the allele population frequency information and the Hardy-Weinberg equilibrium law. When the contributor is heterozygous, such as AB, the genotype frequency is 2p_A p_B, where p_A and p_B are the population frequencies of alleles A and B, respectively. When the contributor is homozygous, such as AA, the genotype frequency is p_A². The frequency P(G_i|H) of the i-th genotype combination G_i in the population at that locus is then the product of the genotype frequencies of all contributors at that locus.
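As a hedged illustration of this step, the sketch below uses the standard Hardy-Weinberg proportions (2·p_A·p_B for a heterozygote and p_A² for a homozygote) and multiplies the per-contributor frequencies. The function names and the allele frequencies are hypothetical, and the sketch does not address how the allele-loss placeholder is handled.

```python
def genotype_frequency(genotype, allele_freq):
    """Hardy-Weinberg genotype frequency for an unordered pair of alleles."""
    a, b = genotype
    if a == b:
        return allele_freq[a] ** 2           # homozygote: p_A^2
    return 2 * allele_freq[a] * allele_freq[b]  # heterozygote: 2 p_A p_B


def combination_frequency(genotypes, allele_freq):
    """Product of the genotype frequencies of all contributors at one locus."""
    freq = 1.0
    for genotype in genotypes:
        freq *= genotype_frequency(genotype, allele_freq)
    return freq


# Hypothetical population frequencies for a two-allele locus.
freqs = {"A": 0.3, "B": 0.7}
p_combo = combination_frequency([("A", "B"), ("A", "A")], freqs)
```

For a locus with only these two alleles, the three genotype frequencies (AA, AB, BB) sum to 1, which is a quick sanity check on the frequency table.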

S3, obtaining the corresponding weights according to the genotype combinations and the MH sequence sequencing reads;

For each possible genotype combination at each locus, the probability or probability density of the observed sequencing reads of the MH allele sequences is calculated; this value is called the weight.

First, the sequencing reads O_c and signal ratios O_r are obtained for the set O of all sequences at the designated locus l.

Under the specified genotype combination G_i, the allele effective number of each sequence o in the set O is calculated, where the allele effective number of sequence o is A_o:

$$ A_o = \sum_{x=1}^{X} B_{ox} M_x $$

where B_{ox} is the number of copies of sequence o contained in the genotype of the x-th contributor, M_x is the mixing ratio parameter of the x-th contributor, x is the contributor index, and X is the number of contributors. Each sequence o in the set O is compared in turn with the genotype of each contributor in the genotype combination to obtain that contributor's contribution to o, and the contributions of all contributors are summed to give the allele effective number of sequence o.

All sequences are then classified according to the allele effective number A_o: a sequence is an allele sequence if A_o is not 0 and a noise sequence if A_o is 0, and a sequence that is contained in the genotype combination but not in the set O is a missing allele sequence;
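The effective-number calculation and the three-way classification can be sketched as below. This is a minimal sketch, assuming genotypes are tuples of sequence labels and that every mixing ratio is strictly positive; the function names are hypothetical.

```python
def allele_effective_numbers(observed, combination, mixing_ratios):
    """A_o = sum over contributors of (copies of o in the genotype) * M_x."""
    numbers = {}
    for sequence in observed:
        numbers[sequence] = sum(
            genotype.count(sequence) * m
            for genotype, m in zip(combination, mixing_ratios)
        )
    return numbers


def classify(observed, combination, mixing_ratios):
    """Split sequences into allele, noise, and missing-allele lists."""
    numbers = allele_effective_numbers(observed, combination, mixing_ratios)
    alleles = [s for s in observed if numbers[s] > 0]
    noise = [s for s in observed if numbers[s] == 0]
    observed_set = set(observed)
    # One entry per lost copy, so a lost homozygote appears twice.
    missing = [s for g in combination for s in g if s not in observed_set]
    return alleles, noise, missing


observed = ["a", "b", "c"]
combination = (("a", "b"), ("a", "a"))  # two contributors
numbers = allele_effective_numbers(observed, combination, (0.7, 0.3))
groups = classify(observed, combination, (0.7, 0.3))
```

Here sequence `a` is carried once by the first contributor and twice by the second, so its effective number is 1 × 0.7 + 2 × 0.3 = 1.3, while `c` is carried by nobody and is treated as noise.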

The weight D_o of the missing allele sequences is:

$$ D_o = \left(P(\mathrm{Dropout})\right)^{z} $$

where P(Dropout) is the allele loss rate and z is the number of allele losses in the genotype combination; the loss of one allele of a heterozygote is counted as 1, and the loss of a homozygote is counted as 2.
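A short sketch of the dropout weight follows. It assumes that an allele loss is any genotype entry that does not appear among the observed sequences, so that a lost homozygote contributes two losses as described above; the function names are hypothetical.

```python
def count_dropouts(combination, observed):
    """z: number of genotype entries absent from the observed sequences."""
    observed_set = set(observed)
    return sum(
        1 for genotype in combination for allele in genotype
        if allele not in observed_set
    )


def dropout_weight(combination, observed, dropout_rate):
    """D_o = P(Dropout) ** z."""
    return dropout_rate ** count_dropouts(combination, observed)


# One heterozygote with a single loss and one fully lost homozygote: z = 3.
d_o = dropout_weight((("a", "Q"), ("Q", "Q")), ["a"], dropout_rate=0.05)
```

When no allele is lost, z is 0 and the weight is 1, so the dropout term leaves the combination weight unchanged.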

The weight D_no of the noise sequences is the product of the beta probability density of each noise sequence:

$$ D_{no} = \prod_{k=1}^{K} \mathrm{beta}(r_k \mid \tau_1, \tau_2) $$

where K is the number of noise sequences, r_k is the signal ratio of the k-th noise sequence at the locus, r_k ~ beta(τ₁, τ₂), τ₁ and τ₂ are the noise distribution parameters, and beta denotes the probability density function of the beta distribution;
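The noise weight can be sketched with the Python standard library alone. The beta density is written out through `math.lgamma` to avoid an external dependency; the function names are hypothetical, and the sketch assumes every signal ratio lies strictly between 0 and 1.

```python
import math


def beta_pdf(r, a, b):
    """Probability density of Beta(a, b) at r, for 0 < r < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(r) + (b - 1) * math.log(1 - r))


def noise_weight(signal_ratios, tau1, tau2):
    """D_no: product of the beta densities of all noise-sequence signal ratios."""
    weight = 1.0
    for r in signal_ratios:
        weight *= beta_pdf(r, tau1, tau2)
    return weight


# Two hypothetical noise sequences with small signal ratios.
d_no = noise_weight([0.012, 0.015], tau1=0.32, tau2=530.19)
```

If a genotype combination explains every observed sequence, the list of noise ratios is empty and the weight is 1.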

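The weight D_a of the allele sequences is the product of the probability density of each allele sequence:

$$ D_a = \prod_{n=1}^{N} \mathrm{norm}(y_n \mid A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu) $$

where the sequencing read count y_n of the n-th allele at the locus follows a normal distribution:

$$ y_n \sim \mathrm{norm}(A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu) $$

where α_l is the specific detection efficacy parameter of locus l, μ is the normal distribution mean parameter, CV is the normal distribution variability parameter, n is the allele index, N is the number of allele sequences at the locus, A_n is the allele effective number of the n-th allele, and norm denotes the probability density function of the normal distribution.

The weight D_{G_i l} of the i-th genotype combination G_i at the designated locus l is the product of the three weights:

$$ D_{G_i l} = D_o \cdot D_{no} \cdot D_a $$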
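The allele weight and the overall combination weight can be sketched as follows. This sketch assumes that the second argument of norm is a standard deviation, i.e. CV times the expected read count, which the text does not state explicitly; the function names and the example numbers are hypothetical.

```python
import math


def normal_pdf(y, mean, sd):
    """Probability density of a normal distribution with the given mean and sd."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def allele_weight(reads, effective_numbers, alpha_l, mu, cv):
    """D_a: product over allele sequences of norm(y_n | A_n*alpha_l*mu, CV*A_n*alpha_l*mu)."""
    weight = 1.0
    for y, a in zip(reads, effective_numbers):
        mean = a * alpha_l * mu  # expected read count of this allele
        weight *= normal_pdf(y, mean, cv * mean)
    return weight


def combination_weight(d_dropout, d_noise, d_allele):
    """Weight of one genotype combination: D_o * D_no * D_a."""
    return d_dropout * d_noise * d_allele


# Two allele sequences with effective numbers 1.3 and 0.7 (hypothetical values).
d_a = allele_weight([640, 350], [1.3, 0.7], alpha_l=1.0, mu=500.0, cv=0.2)
weight = combination_weight(1.0, 1.0, d_a)
```

Because the densities can become very small across many loci, a practical implementation would likely work with logarithms of these weights.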

S4, calculating the likelihood function Lik_l at one locus l:

$$ \mathrm{Lik}_l = \sum_{i} P(G_i \mid H)\, D_{G_i l} $$

where i is the index of the genotype combination, G_i is the i-th genotype combination, P(G_i|H) is the frequency of G_i in the population, H is the prosecution hypothesis or the defense hypothesis, and D_{G_i l} is the weight of the i-th genotype combination G_i at locus l;

S5, combining all loci to calculate the likelihood function Lik of the sample to be tested under the prosecution hypothesis and under the defense hypothesis:

$$ \mathrm{Lik} = \prod_{l=1}^{L} \mathrm{Lik}_l $$

where l is the locus index and L is the number of all loci.
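The two likelihood formulas can be sketched directly. The per-locus value is a frequency-weighted sum over genotype combinations, and the multi-locus value is a product over loci; the sketch accumulates the product as a sum of base-10 logarithms, which is an implementation choice made here for numerical stability rather than something stated in the text. Function names and numbers are hypothetical.

```python
import math


def locus_likelihood(frequencies, weights):
    """Lik_l: sum over genotype combinations of P(G_i | H) * D_{G_i l}."""
    return sum(p * d for p, d in zip(frequencies, weights))


def log10_combined_likelihood(locus_likelihoods):
    """log10 of Lik: the product over loci, accumulated as a sum of logs."""
    return sum(math.log10(value) for value in locus_likelihoods)


# Two hypothetical loci, each with two candidate genotype combinations.
lik_1 = locus_likelihood([0.5, 0.5], [0.2, 0.4])
lik_2 = locus_likelihood([0.9, 0.1], [0.01, 0.1])
log10_lik = log10_combined_likelihood([lik_1, lik_2])
```

Working in log space keeps the combined value representable even when dozens of loci each contribute a very small likelihood.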

The likelihood ratio of the prosecution hypothesis and the defense hypothesis is calculated from the likelihood functions as follows:

Parameter estimation is carried out separately on the likelihood functions under the prosecution hypothesis and the defense hypothesis. The estimated parameters comprise the normal distribution mean parameter μ, the normal distribution variability parameter CV, and the mixing ratio parameter M_x of each contributor; the number of mixing ratio parameters estimated is the assumed number of contributors minus 1. The parameter values are estimated by maximum likelihood estimation, and the estimation is repeated three times for each sample.

The likelihood function under the prosecution hypothesis and under the defense hypothesis is evaluated at the estimated parameter values (three sets of parameter values), and the maximum value under each hypothesis is taken;

The likelihood ratio LR is then obtained from the two maxima:

$$ LR = \frac{\max \mathrm{Lik}(H_p)}{\max \mathrm{Lik}(H_d)} $$

where H_p denotes the prosecution hypothesis and H_d denotes the defense hypothesis.
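The final comparison can be sketched as below. The sketch takes, for each of the two competing hypotheses, the log10 likelihoods reached by the repeated estimation runs, keeps the best run for each, and subtracts; the optimizer itself is out of scope here, and the function name and the numbers are hypothetical.

```python
def log10_likelihood_ratio(log10_lik_first_runs, log10_lik_second_runs):
    """log10 LR from the best run under each hypothesis.

    Each argument lists the log10 likelihood reached by one estimation run.
    The ratio is first hypothesis over second hypothesis.
    """
    return max(log10_lik_first_runs) - max(log10_lik_second_runs)


# Three hypothetical runs per hypothesis.
log10_lr = log10_likelihood_ratio(
    [-120.0, -118.5, -119.2],
    [-150.0, -149.0, -151.3],
)
```

Repeating the estimation and keeping the maximum guards against a single run stopping at a poor local optimum of the likelihood surface.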

Step 5, calculating the likelihood function values of all genotype combinations at each locus and ranking the genotype combinations at each locus by likelihood function value. The ranking provides a basis for deconvolution of the mixed DNA to be tested, thereby achieving the aim of analyzing and interpreting forensic mixed DNA.
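A ranking of this kind can be sketched as follows. The sketch assumes that the value ranked for each genotype combination is its term in the per-locus likelihood, i.e. its population frequency times its weight; the function name and the example values are hypothetical.

```python
def rank_genotype_combinations(combinations, frequencies, weights):
    """Sort genotype combinations at one locus by frequency * weight, largest first."""
    scored = [
        (p * d, combo)
        for combo, p, d in zip(combinations, frequencies, weights)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Three hypothetical single-contributor combinations at one locus.
ranking = rank_genotype_combinations(
    [("A", "B"), ("A", "A"), ("B", "B")],
    frequencies=[0.42, 0.09, 0.49],
    weights=[0.8, 0.1, 0.05],
)
best_score, best_combination = ranking[0]
```

The top-ranked entries at each locus are the candidates an analyst would examine first when deconvolving the mixture.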

The invention is further illustrated by specific data.

Step 1, taking MH sequencing data of 100 single source DNA samples with the sample size of 1ng as a training set to estimate the value of the prior parameter. The training set was from 100 different individuals and genotyped under the same MH detection reagent, MPS library preparation procedure, MPS sequencing procedure, sequencing data analysis procedure.

The a priori parameters obtained are as follows:

The allele loss rate (dropout rate) is 0.05; the noise distribution parameters are τ₁ = 0.32 and τ₂ = 530.19; the locus-specific detection efficacy parameter values α_l are shown in Table 1; the analysis threshold is 0.02; and the MH allele population frequencies are shown in Table 2.

TABLE 1 Locus-specific detection efficacy parameters

TABLE 2 MH allele population frequencies

Step 2, obtaining characteristic information of MH sequencing data of a mixed DNA sample to be detected, wherein the characteristic information comprises MH typing and quantitative information of the mixed DNA, target individual typing information, reference individual typing information and related parameters;

MH typing and quantification information for mixed DNA includes MH loci, total sequencing reads for MH loci, MH sequence base information, and MH sequence sequencing reads.

The target individual typing information comprises the information of the MH allele base of the individual at each MH locus.

The relevant parameters comprise the values of the foregoing prior parameters and the number of contributors, which is set to 2.

Step 3, filtering noise sequences according to the sequencing information;

Calculating the signal ratio r of each base sequence according to the MH sequencing data of the mixed DNA;

The base sequences in the MH sequencing data of the mixed DNA are filtered by signal ratio r according to an analysis threshold of 0.01, and only base sequences with r greater than 0.01 are retained.

Step 4, constructing the likelihood functions under the prosecution hypothesis and the defense hypothesis and calculating their likelihood ratio;

The prosecution hypothesis is that the target individual and n unknown individuals are contributors to the mixed DNA sample;

The defense hypothesis is that n+1 unknown individuals are contributors to the mixed DNA sample;

S1, obtaining all possible genotype combinations at each locus;

Based on the prosecution and defense hypotheses, the MH sequencing data of the mixed DNA sample to be tested, and the typing information of the target individual, all possible genotype combinations G are enumerated separately for each locus under each hypothesis.

S2, calculating the genotype frequency of each contributor according to the allele population frequency information;

The genotype frequency of each contributor is calculated from the allele population frequency information and the Hardy-Weinberg equilibrium law. When the contributor is heterozygous, such as AB, the genotype frequency is 2p_A p_B, where p_A and p_B are the population frequencies of alleles A and B, respectively. When the contributor is homozygous, such as AA, the genotype frequency is p_A². The frequency P(G_i|H) of the i-th genotype combination G_i in the population at that locus is the product of the genotype frequencies of all contributors at that locus.

S3, obtaining the corresponding weights according to the genotype combinations and the MH sequence sequencing reads;

Under the specified genotype combination G_i, the allele effective number of each sequence in the set O is calculated, where the allele effective number of sequence o is A_o:

$$ A_o = \sum_{x=1}^{X} B_{ox} M_x $$

where B_{ox} is the number of copies of sequence o contained in the genotype of the x-th contributor, M_x is the mixing ratio parameter of the x-th contributor, x is the contributor index, and X is the number of contributors, here X = 2;

All sequences are classified according to the allele effective number A_o: a sequence is an allele sequence if A_o is not 0 and a noise sequence if A_o is 0, and a sequence that is contained in the genotype combination but not in the set O is a missing allele sequence;

The weight D_o of the missing allele sequences is:

$$ D_o = \left(P(\mathrm{Dropout})\right)^{z} $$

where P(Dropout) is the allele loss rate, here 0.05, and z is the number of allele losses in the genotype combination; the loss of one allele of a heterozygote is counted as 1, and the loss of a homozygote is counted as 2.

The weight D_no of the noise sequences is:

$$ D_{no} = \prod_{k=1}^{K} \mathrm{beta}(r_k \mid \tau_1, \tau_2) $$

where K is the number of noise sequences, r_k is the signal ratio of the k-th noise sequence at the locus, r_k ~ beta(τ₁, τ₂), τ₁ and τ₂ are the noise distribution parameters with τ₁ = 0.32 and τ₂ = 530.19, and beta denotes the probability density function of the beta distribution;

The weight D_a of the allele sequences is:

$$ D_a = \prod_{n=1}^{N} \mathrm{norm}(y_n \mid A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu) $$

where the sequencing read count y_n of the n-th allele at the locus follows a normal distribution:

$$ y_n \sim \mathrm{norm}(A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu) $$

where α_l is the specific detection efficacy parameter of locus l, as shown in Table 1, μ is the normal distribution mean parameter, CV is the normal distribution variability parameter, n is the allele index, N is the number of allele sequences at the locus, A_n is the allele effective number of the n-th allele, and norm denotes the probability density function of the normal distribution.

S4, calculating the likelihood function Lik_l at one locus l:

$$ \mathrm{Lik}_l = \sum_{i} P(G_i \mid H)\, D_{G_i l} $$

where i is the index of the genotype combination, G_i is the i-th genotype combination, P(G_i|H) is the frequency of G_i in the population, H is the prosecution hypothesis or the defense hypothesis, and D_{G_i l} is the weight of the i-th genotype combination G_i at locus l;

S5, combining all loci to calculate the likelihood function Lik of the sample to be tested under the prosecution hypothesis and under the defense hypothesis:

$$ \mathrm{Lik} = \prod_{l=1}^{L} \mathrm{Lik}_l $$

where l is the locus index and L is the number of all loci.

The likelihood ratio of the prosecution hypothesis to the defense hypothesis is obtained, giving log₁₀(LR) = 30.4948. This result supports the hypothesis that the target individual is a contributor to the mixed sample and quantifies the strength of the evidence.

Step 5, calculating the likelihood function values of all genotype combinations at each locus and ranking the genotype combinations at each locus by likelihood function value, as shown in Table 3. The contributor genotype combinations with the larger likelihood at each locus are taken as the result of mixed DNA deconvolution.

TABLE 3 Locus deconvolution results

The method analyzes MH sequencing data of mixed DNA samples accurately and objectively, reduces uncertainty and subjective error in the analysis, markedly enlarges the range of samples that can be analyzed, and improves the accuracy of mixed DNA MH sequencing data analysis. Because no manual analysis is required, forensic analysts can process MH sequencing data more efficiently, complex data processing becomes faster, and analysis time is saved. More of the information hidden in the sequencing data can also be mined and interpreted when analyzing complex mixed MH data.

The invention provides a scientific analysis and interpretation method of MH sequencing data, which not only promotes the practical application of MH genetic markers in the forensic field, but also lays a foundation for the future development of the MH genetic markers.

Claims

1. A method for mixed DNA probabilistic typing of MH sequencing data comprising the steps of:

The target individual typing information comprises MH allele base information of the target individual at each MH locus;

the reference individual typing information comprises MH allele base information of the reference individual at each MH locus;

The related parameters include a priori parameters and the number of contributors

Step 3, filtering a noise sequence according to the sequencing information;

The likelihood function construction method comprises the following steps:

s1, obtaining all possible genotype combinations at each locus;

the procedure for obtaining the corresponding weights from the genotype combinations and MH sequence sequencing reads is as follows:

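Under the specified genotype combination G_i, the allele effective number of each sequence in the set O is calculated, where the allele effective number of sequence o is A_o:

$$ A_o = \sum_{x=1}^{X} B_{ox} M_x $$

where B_{ox} is the number of copies of sequence o contained in the genotype of the x-th contributor, M_x is the mixing ratio parameter of the x-th contributor, x is the contributor index, and X is the number of contributors;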

Respectively calculating weights of an allele sequence, a noise sequence and a missing allele sequence, and multiplying the weights to obtain weights corresponding to genotype combinations;

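S4, calculating the likelihood function Lik_l at one locus l:

$$ \mathrm{Lik}_l = \sum_{i} P(G_i \mid H)\, D_{G_i l} $$

S5, combining all loci to calculate the likelihood function Lik of the sample to be tested:

$$ \mathrm{Lik} = \prod_{l=1}^{L} \mathrm{Lik}_l $$

wherein l is the locus index and L is the number of all loci;

Step 5, calculating the likelihood function values of all possible genotype combinations at each locus, and ranking the genotype combinations at each locus by likelihood function value.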

2. A mixed DNA probability typing method for MH sequencing data according to claim 1, wherein the prior parameters in step 1 include a locus specific detection efficacy parameter of MH genetic markers, an analysis threshold, an allele loss rate, an allele population frequency, a noise distribution parameter.

3. The method for mixed DNA probabilistic typing of MH sequencing data according to claim 1, wherein the noise sequence filtering in step 3 is as follows:

4. A mixed DNA probabilistic typing method for MH sequencing data according to claim 1, wherein the process of calculating the genotype frequencies of each contributor in S2 is as follows:

5. The method according to claim 1, wherein the process in step 4 of calculating the likelihood ratio of the prosecution hypothesis and the defense hypothesis from the likelihood functions is as follows:

6. A mixed DNA probabilistic typing method for MH sequencing data as in claim 5, wherein the estimated parameters include a normal distribution mean parameter, a normal distribution variability parameter and a mixing ratio parameter for each contributor.

7. A mixed DNA probabilistic typing method for MH sequencing data according to claim 1, characterized in that the weight D_o of the missing allele sequences is:

$$ D_o = \left(P(\mathrm{Dropout})\right)^{z} $$

The weight D_no of the noise sequences is:

$$ D_{no} = \prod_{k=1}^{K} \mathrm{beta}(r_k \mid \tau_1, \tau_2) $$

wherein K is the number of noise sequences, r_k is the signal ratio of the k-th noise sequence at the locus, τ₁ and τ₂ are the noise distribution parameters, and beta denotes the probability density function of the beta distribution;

The weight D_a of the allele sequences is:

$$ D_a = \prod_{n=1}^{N} \mathrm{norm}(y_n \mid A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu) $$

wherein the sequencing read count y_n of the n-th allele at the locus follows a normal distribution:

$$ y_n \sim \mathrm{norm}(A_n \alpha_l \mu,\ CV \cdot A_n \alpha_l \mu) $$