CN106778078B - DNA Sequence Similarity Alignment Method Based on Kendall's Correlation Coefficient
- Publication number
- CN106778078B
- Authority
- CN
- China
- Prior art keywords
- dna sequence
- dna
- sequence dna
- word
- kendall
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The present invention discloses a DNA sequence similarity comparison method based on the Kendall correlation coefficient, comprising the following steps: 1) obtaining N DNA sequences to be compared; 2) choosing a word length k, extracting the k-words of each DNA sequence with a sliding window, and combining them into a corresponding vector; 3) using the k-words obtained in step 2), counting the number of times each k-word occurs in a DNA sequence to obtain the frequency of that k-word, denoted $x_i$, with all k-word frequencies of a DNA sequence denoted $X = \{x_i\}$; 4) pairing the k-word vectors of the N DNA sequences to obtain $C_N^2$ combinations, the two k-word frequency vectors of each combination being denoted x, y; 5) calculating the Kendall correlation coefficient of the k-word frequency vectors x, y of each combination; 6) establishing the N×N similarity coefficient matrix of the N DNA sequences, thereby obtaining the similarity and the evolutionary relationship diagram of the DNA sequences. The invention improves the quality of DNA sequence similarity comparison, reduces computational complexity, and shortens computation time.
Description
Technical field
The present invention relates to the fields of computing and bioinformatics processing, and more particularly to a DNA sequence similarity comparison method based on the Kendall correlation coefficient.
Background art
The central task of bioinformatics is to extract conceptual knowledge from the vast body of DNA sequence data. Bioinformaticians must not only devise efficient means of data storage but also develop effective data analysis tools, because only with new and effective analysis tools can DNA sequence information be converted into biological knowledge, the structural and functional information it contains be worked out, and the biological significance it represents be fully understood.
The theoretical basis of DNA sequence comparison is the theory of evolution: if two DNA sequences are sufficiently similar, it is inferred that they may share a common evolutionary ancestor and diverged through genetic variation processes such as residue substitution, deletion of residues or sequence fragments, and sequence recombination. Similarity and homology of DNA sequences are different concepts: the degree of similarity between DNA sequences is a quantifiable parameter, whereas whether DNA sequences are homologous must be verified by evolutionary evidence. In practice, DNA sequence comparison means using a specific mathematical model or algorithm to find the maximum number of matching bases between two or more DNA sequences.
Huang Yujuan, Wang Tianming et al. used the frequency and position information of the k-words in a DNA sequence to construct a probability distribution, and used this distribution to express the distance between two vectors; the smaller the value, the more closely related the species. Vinga and Almeida proposed a DNA sequence comparison method based on word frequency: a sliding window is used to count the occurrences of all words of length k, giving a k-word count vector or frequency vector. A DNA sequence is thereby mapped to a vector in a high-dimensional Euclidean space, so that comparing the similarity of DNA sequences becomes comparing vectors.
Pairwise DNA sequence comparison means comparing two DNA sequences with a specific algorithm in order to find the match of maximum similarity between them. The Kendall correlation coefficient is widely used for correlation analysis and prediction of time series and of hydrological and water-quality series, but it has not been used for matching DNA sequence similarity.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and to provide a DNA sequence similarity comparison method based on the Kendall correlation coefficient, which constructs an N×N similarity coefficient matrix for N DNA sequences, obtains the evolutionary relationship of the N DNA sequences, and at the same time improves both the effectiveness of DNA sequence similarity comparison and the computational efficiency.
The technical solution adopted by the present invention is as follows:
A DNA sequence similarity comparison method based on the Kendall correlation coefficient, comprising the following steps:
1) obtaining N DNA sequences to be compared;
2) choosing a word length k, extracting the k-words of each DNA sequence with a sliding window, and combining them into a corresponding vector;
3) using the k-words obtained in step 2), counting the number of times each k-word occurs in the DNA sequence, i.e. calculating the frequency vector of the k-words in the DNA sequence, denoted $x_i$;
4) pairing the k-word vectors of the N DNA sequences to obtain $C_N^2$ combinations, the vectors of each combination being denoted $X = \{x_i\}$, $Y = \{y_i\}$;
5) for the k-word frequency vectors $x_i$, $y_i$ of each combination, calculating the corresponding Kendall correlation coefficient;
6) establishing the N×N Kendall correlation coefficient matrix of the N DNA sequences, to obtain the similarity information and the evolutionary relationship diagram of the DNA sequences.
Further, in step 2), a word frequency vector of word length k is taken for each DNA sequence.
Further, in step 5), the Kendall correlation coefficient of the k-words of the DNA sequences can be obtained as follows:
a) obtaining the k-words of a DNA sequence A to be compared, where the length of DNA sequence A is n;
b) calculating the frequency with which each k-word occurs: $x_i$ = {number of times the i-th k-word is repeated in DNA sequence A};
c) for the combined vectors X, Y, calculating the Kendall correlation coefficient by a formula in terms of $t_x$, $t_y$ and $T$, where $t_x$ is the number of concordant pairs in $\{x_i\}$, $\{y_i\}$, $t_y$ is the number of discordant pairs in $\{x_i, y_i\}$, and $T$ is the total number of distinct k-words in $\{x_i, y_i\}$;
d) $t_x$ and $t_y$ in step c) are obtained as follows: for two k-words i and j, if the differences $(x_i - x_j)$ and $(y_i - y_j)$ have the same sign, so that their product is positive, the pair is counted in $t_x$ as a concordant pair of $\{x_i, y_i\}$; if the differences have opposite signs, so that their product is negative, the pair is counted in $t_y$ as a discordant pair of $\{x_i, y_i\}$.
The Kendall correlation coefficient obtained is denoted τ and takes values in [-1, 1]. The closer τ is to 1, the stronger the positive correlation between the two DNA sequences; the closer τ is to -1, the stronger the negative correlation between them; a value of τ close to 0 indicates that the two DNA sequences are not correlated.
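The pair counting and the [-1, 1] range can be illustrated with a short sketch. The patent's own formula for τ is not reproduced in this text, so the normalisation below is an assumption: it is the textbook tau-a, (concordant - discordant) divided by the number of index pairs n(n-1)/2, and the patent's denominator (expressed through T) may differ. The function names are hypothetical.

```python
from itertools import combinations


def concordance_counts(x, y):
    """Return (concordant, discordant) pair counts for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:      # differences have the same sign
            concordant += 1
        elif product < 0:    # differences have opposite signs
            discordant += 1
        # product == 0 is a tie and is counted in neither total
    return concordant, discordant


def kendall_tau_a(x, y):
    """Kendall tau-a: an assumed normalisation, not the patent's formula."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two components")
    concordant, discordant = concordance_counts(x, y)
    return (concordant - discordant) / (n * (n - 1) / 2)


print(kendall_tau_a([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(kendall_tau_a([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0
```

k-word count vectors usually contain many ties (repeated counts, especially zeros). Tau-a leaves ties out of the numerator only, whereas SciPy's `scipy.stats.kendalltau` computes the tie-adjusted tau-b by default, so the two can give different values on the same vectors.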
The N×N Kendall correlation coefficient matrix is then constructed. This matrix is symmetric and every value on its diagonal is 1; it gives the pairwise affinity information of the N DNA sequences, from which the evolutionary relationship of the N DNA sequences is constructed.
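A minimal sketch of building such a matrix is given below, assuming the tau-a normalisation (an assumption, since the patent's formula is not reproduced in this text). Only the upper triangle is computed and then mirrored, which is where the symmetry saves work. The input vectors are made-up toy data, and `tau` and `correlation_matrix` are hypothetical names.

```python
from itertools import combinations


def tau(x, y):
    """Kendall tau-a (assumed normalisation) from the signs of pairwise products."""
    index_pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in index_pairs:
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)  # +1 concordant, -1 discordant
    return score / len(index_pairs)


def correlation_matrix(vectors):
    """Symmetric N x N matrix with 1.0 on the diagonal."""
    n = len(vectors)
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):   # upper triangle only
        matrix[i][j] = matrix[j][i] = tau(vectors[i], vectors[j])
    return matrix


# Toy k-word count vectors for three hypothetical sequences.
toy_vectors = [[1, 2, 1, 0, 1], [0, 1, 1, 1, 2], [3, 0, 2, 1, 0]]
for row in correlation_matrix(toy_vectors):
    print(row)
```

For N sequences this evaluates the coefficient N(N-1)/2 times rather than N² times.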
In the DNA sequence similarity comparison method of the present invention based on the Kendall correlation coefficient, the k-word frequency vectors of the DNA sequences to be analysed are obtained with a sliding window, the k-word vectors of the N DNA sequences are paired, and the Kendall correlation coefficient of the k-word frequency vectors of each pair is calculated. Similarity detection can thus be carried out on multiple DNA sequences, and the results effectively reflect the evolutionary relationships among them. The method is concise: only one symmetric matrix needs to be constructed, whose main-diagonal values (upper left to lower right) are 1, which reduces computational complexity and improves computational efficiency. The Kendall coefficient can serve as a feature value for predicting DNA sequence similarity and achieves good accuracy.
Brief description of the drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments;
Fig. 1 is a flow diagram of the DNA sequence similarity comparison method of the present invention based on the Kendall correlation coefficient;
Fig. 2 is the evolutionary relationship diagram of DNA sequences obtained with the DNA sequence similarity comparison method of the present invention based on the Kendall correlation coefficient.
Specific embodiment
As shown in Fig. 1 or Fig. 2, the method of the present invention is further elaborated using the DNA coding sequences of 20 species as the objects of analysis. As shown in Fig. 1, the DNA sequence similarity comparison method of this embodiment based on the Kendall correlation coefficient comprises the following steps:
1) selecting the DNA coding sequences of 20 species as the initial DNA sequences; the names and lengths of the DNA sequences of the 20 species are given in Table 1;
Species name | DNA sequence length |
---|---|
baboon | 16522 |
bluewhale | 16403 |
cat | 17010 |
common_chimpanzee | 16564 |
cow | 16339 |
fin_whale | 16399 |
gibbon | 16473 |
gorilla | 16365 |
grayseal | 16798 |
harborseal | 16827 |
horse | 16661 |
human | 16570 |
mouse | 16296 |
opossum | 17085 |
orangutan | 16390 |
pigmy_chimpanzee | 16555 |
platypus | 17020 |
rat | 16301 |
wallaroo | 16897 |
whiterhinoceros | 16833 |
Table 1: DNA sequence information of the species
2) obtaining the k-words of the initial DNA sequences of step 1) and combining them to obtain the k-word frequency vectors of the initial DNA sequences (see Vinga, S., Almeida, J.S. Alignment-free sequence comparison: a review [J]. Bioinformatics, 2003: 513-523). The method uses a sliding window to find the frequency with which each short DNA word of length k appears in the DNA sequence under test. For the 4 DNA bases {A, T, G, C}, taking k = 2 gives $4^2 = 16$ possible k-words, and k = 3 gives $4^3 = 64$. For example, for the DNA sequence A = ATAACTA, the k-words of the DNA sequence fragment under test are $W_2$ = {AT, TA, AA, TT, AG, GA, AC, CA, CT, ...} and the frequency vector has the values {1, 2, 1, 0, 0, 0, 1, 0, 1, 0, ...}; for the DNA sequence fragment under test B = ACAACTTA, the k-word frequency vector is {0, 1, 1, 1, 0, 0, 2, 1, 1, 0, ...};
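The worked example in step 2) can be reproduced with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the word order is the order of the nine words listed in the example, since the full ordering of the 16 words is not given in the text.

```python
from collections import Counter
from itertools import product


def kword_counts(seq, k):
    """Count each length-k word seen while sliding a window one base at a time."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_vector(seq, k, words):
    """Counts of the given words, in the given order (0 for absent words)."""
    counts = kword_counts(seq, k)
    return [counts.get(word, 0) for word in words]


# All 4^k possible words over the four bases, here for k = 2.
all_words = ["".join(p) for p in product("ATGC", repeat=2)]

# The nine words spelled out in the example, in the same order.
listed_words = ["AT", "TA", "AA", "TT", "AG", "GA", "AC", "CA", "CT"]

vec_a = count_vector("ATAACTA", 2, listed_words)
vec_b = count_vector("ACAACTTA", 2, listed_words)

print(len(all_words))  # 16
print(vec_a)           # [1, 2, 1, 0, 0, 0, 1, 0, 1]
print(vec_b)           # [0, 1, 1, 1, 0, 0, 2, 1, 1]
```

The values are raw occurrence counts, matching the numbers in the example; dividing each count by the number of windows, `len(seq) - k + 1`, would give relative frequencies instead.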
3) for the N DNA sequences, N k-word frequency vectors are obtained and paired to give $C_N^2$ combinations, the two frequency vectors of each combination being denoted X, Y;
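As a quick check on the number of pairings, the unordered pairs can be enumerated with the standard library. The labels below are placeholders, not the species names of Table 1.

```python
from itertools import combinations

# Placeholder labels for N = 20 sequences.
labels = [f"seq{i}" for i in range(1, 21)]

# Every unordered pair (X, Y) is produced exactly once.
pairs = list(combinations(labels, 2))

print(len(pairs))  # 190
print(pairs[0])    # ('seq1', 'seq2')
```

For the 20 sequences of this embodiment, that is 20 × 19 / 2 = 190 coefficient evaluations.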
4) the Kendall correlation coefficient is calculated by a formula in terms of $t_x$, $t_y$ and $T$, where $t_x$ is the number of concordant pairs formed by $\{x_i, y_i\}$ with the other k-word frequencies, $t_y$ is the number of discordant pairs formed by $\{x_i, y_i\}$ with the other k-word frequencies, and $T$ is the total number of distinct k-words in $\{x_i, y_i\}$; for the DNA sequence fragments A and B in step 2), the total number of k-words is T = 7;
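The value T = 7 is consistent with reading T as the number of distinct 2-words that occur in at least one of the two fragments. The sketch below checks that reading; it is an interpretation of the text rather than a definition taken from it, and `distinct_kwords` is a hypothetical helper.

```python
def distinct_kwords(seq, k):
    """Set of distinct length-k words found by a sliding window."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


words_a = distinct_kwords("ATAACTA", 2)   # fragment A from step 2)
words_b = distinct_kwords("ACAACTTA", 2)  # fragment B from step 2)

union = words_a | words_b
print(sorted(union))  # ['AA', 'AC', 'AT', 'CA', 'CT', 'TA', 'TT']
print(len(union))     # 7
```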
5) $t_x$ and $t_y$ in step 4) are obtained as follows: for two k-words i and j, if the differences $(x_i - x_j)$ and $(y_i - y_j)$ have the same sign, so that their product is positive, the pair is counted in $t_x$ as a concordant pair of $\{x_i, y_i\}$; if the differences have opposite signs, so that their product is negative, the pair is counted in $t_y$ as a discordant pair of $\{x_i, y_i\}$;
6) constructing the N×N Kendall correlation coefficient matrix. This matrix is symmetric with diagonal values of 1, so it can usually be reduced to an upper triangular matrix. Because similarity and distance are negatively related, before the evolutionary relationship diagram is constructed the similarity values are negated to convert them into distances, and the evolutionary relationship diagram is constructed from these distances; see Fig. 2.
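The similarity-to-distance conversion can be sketched as below. The negation follows the text; the similarity values are made-up toy numbers, and `closest_pair` is a hypothetical helper added only to show that the most similar pair becomes the nearest pair.

```python
def similarity_to_distance(sim):
    """Negate every similarity so that a larger tau gives a smaller distance."""
    return [[-value for value in row] for row in sim]


def closest_pair(dist, labels):
    """Return (distance, label_i, label_j) for the smallest off-diagonal entry."""
    best = None
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if best is None or dist[i][j] < best[0]:
                best = (dist[i][j], labels[i], labels[j])
    return best


# Toy similarity matrix (illustrative values only, not results from the patent).
sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
labels = ["s1", "s2", "s3"]

dist = similarity_to_distance(sim)
print(closest_pair(dist, labels))  # (-0.8, 's1', 's2')
```

Plain negation yields negative values and a diagonal of -1. Tree-building tools commonly expect non-negative distances with a zero diagonal, so a shifted form such as 1 - τ may be needed in practice; that variant is an assumption here and is not stated in the patent.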
Analysis of results: by calculating the Pearson correlation coefficient against the edit distance, it was found that the correlation coefficient between the DNA sequence similarity calculated with the Kendall coefficient and the edit distance is -0.94. This shows that the method of the present invention calculates DNA sequence similarity with high accuracy and, being fast to compute, can serve as a very effective alternative to the edit distance.
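The text does not say how the edit distance or the Pearson coefficient were computed. The following is a minimal sketch assuming the standard Levenshtein distance (unit cost for insertion, deletion and substitution) and the ordinary sample Pearson coefficient; the function names are hypothetical, and no attempt is made to reproduce the reported -0.94.

```python
from math import sqrt


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def pearson(u, v):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    covariance = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    spread_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    spread_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return covariance / (spread_u * spread_v)


print(edit_distance("ATAACTA", "ACAACTTA"))  # 2
```

To repeat the comparison, one would list the Kendall similarity and the edit distance for every sequence pair in the same order and pass the two lists to `pearson`. A strongly negative result is the expected direction, since higher similarity should go with smaller edit distance.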
The above is only an embodiment of the present invention and does not limit its scope. Any equivalent structure or equivalent process transformation made using the content of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of protection of the present invention.
Claims (4)
1. A DNA sequence similarity comparison method based on the Kendall correlation coefficient, characterized in that it comprises the following steps:
1) obtaining N DNA sequences to be compared;
2) choosing a word length k, extracting the k-words of each DNA sequence with a sliding window, and combining them into a corresponding vector;
3) using the k-words obtained in step 2), counting the number of times each k-word occurs in the DNA sequence, i.e. calculating the frequency vector of the k-words in the DNA sequence, denoted $x_i$;
4) pairing the k-word vectors of the N DNA sequences to obtain $C_N^2$ combinations, the vectors of each combination being denoted $X = \{x_i\}$, $Y = \{y_i\}$;
5) for the k-word frequency vectors $x_i$, $y_i$ of each combination, calculating the corresponding Kendall correlation coefficient;
wherein in step 5) the Kendall correlation coefficient of the k-words of the DNA sequences is obtained as follows:
a) obtaining the k-words of a DNA sequence A to be compared, where the length of DNA sequence A is n;
b) calculating the frequency with which each k-word occurs: $x_i$ = {number of times the i-th k-word is repeated in DNA sequence A};
c) for the combined vectors X, Y, calculating the Kendall correlation coefficient by a formula in terms of $t_x$, $t_y$ and $T$, where $t_x$ is the number of concordant pairs in $\{x_i\}$, $\{y_i\}$, $t_y$ is the number of discordant pairs in $\{x_i, y_i\}$, and $T$ is the total number of distinct k-words in $\{x_i, y_i\}$;
d) $t_x$ and $t_y$ in step c) are obtained as follows: for two k-words i and j, if the differences $(x_i - x_j)$ and $(y_i - y_j)$ have the same sign, the pair is counted in $t_x$ as a concordant pair of $\{x_i, y_i\}$; if the differences have opposite signs, the pair is counted in $t_y$ as a discordant pair of $\{x_i, y_i\}$;
6) establishing the N×N Kendall correlation coefficient matrix of the N DNA sequences, to obtain the similarity information and the evolutionary relationship diagram of the DNA sequences.
2. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: in step 2), a word frequency vector of word length k is taken for each DNA sequence.
3. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: the Kendall correlation coefficient obtained is denoted τ and takes values in [-1, 1]; the closer τ is to 1, the stronger the positive correlation between the two DNA sequences; the closer τ is to -1, the stronger the negative correlation between them; a value of τ close to 0 indicates that the two DNA sequences are not correlated.
4. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: the N×N Kendall correlation coefficient matrix constructed in step 6) is symmetric and every value on its diagonal is 1; it gives the pairwise affinity information of the N DNA sequences, from which the evolutionary relationship of the N DNA sequences is constructed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611186639.9A CN106778078B (en) | 2016-12-20 | 2016-12-20 | DNA Sequence Similarity Alignment Method Based on Kendall's Correlation Coefficient |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611186639.9A CN106778078B (en) | 2016-12-20 | 2016-12-20 | DNA Sequence Similarity Alignment Method Based on Kendall's Correlation Coefficient |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106778078A CN106778078A (en) | 2017-05-31 |
CN106778078B true CN106778078B (en) | 2019-04-09 |
Family
ID=58896076
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611186639.9A Expired - Fee Related CN106778078B (en) | 2016-12-20 | 2016-12-20 | DNA Sequence Similarity Alignment Method Based on Kendall's Correlation Coefficient |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106778078B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108846262A (en) * | 2018-05-31 | 2018-11-20 | 广西大学 | The method that RNA secondary structure distance based on DFT calculates phylogenetic tree construction |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102732609A (en) * | 2011-04-08 | 2012-10-17 | 博奥生物有限公司 | Method for detecting similarity of oligonucleotide and target genome |
WO2014019164A1 (en) * | 2012-08-01 | 2014-02-06 | 深圳华大基因研究院 | Method and device for analyzing microbial community composition |
CN104395900A (en) * | 2013-03-15 | 2015-03-04 | 北京未名博思生物智能科技开发有限公司 | Spatial arithmetic method of sequence alignment |
CN104657628A (en) * | 2015-01-08 | 2015-05-27 | 深圳华大基因科技服务有限公司 | Proton-based transcriptome sequencing data comparison and analysis method and system |
WO2016058089A1 (en) * | 2014-10-17 | 2016-04-21 | The Hospital For Sick Children | Dna methylation markers for overgrowth syndromes |
EP3081257A1 (en) * | 2015-04-17 | 2016-10-19 | Sorin CRM SAS | Active implantable medical device for cardiac stimulation comprising means for detecting a remodelling or reverse remodelling phenomenon of the patient |
CN106203471A (en) * | 2016-06-22 | 2016-12-07 | 南京航空航天大学 | A kind of based on the Spectral Clustering merging Kendall Tau distance metric |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040101846A1 (en) * | 2002-11-22 | 2004-05-27 | Collins Patrick J. | Methods for identifying suitable nucleic acid probe sequences for use in nucleic acid arrays |
Non-Patent Citations (1)
Title |
---|
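Research and Application of Models for k-word-Based DNA Sequence Analysis; Huang Yujuan; China Doctoral Dissertations Full-text Database (Basic Sciences); 2012-09-15 (No. 09); pp. A006-9 |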
Also Published As
Publication number | Publication date |
---|---|
CN106778078A (en) | 2017-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | A Dirichlet-tree multinomial regression model for associating dietary nutrients with gut microorganisms | |
CN107545151B (en) | Drug relocation method based on low-rank matrix filling | |
Ghafouri-Kesbi et al. | Predictive ability of random forests, boosting, support vector machines and genomic best linear unbiased prediction in different scenarios of genomic evaluation | |
CN106997553B (en) | Multi-objective optimization-based commodity combination mode mining method | |
CN102326160A (en) | Method and system for clustering data generated from a database | |
Martin et al. | Marginal zero-inflated regression models for count data | |
CN105205344A (en) | Genetic locus excavation method based on multi-target ant colony optimization algorithm | |
Wang et al. | A new method to infer causal phenotype networks using QTL and phenotypic information | |
CN116401561B (en) | Time-associated clustering method for equipment-level running state sequence | |
Paradis | The distribution of branch lengths in phylogenetic trees | |
CN106778078B (en) | DNA Sequence Similarity Alignment Method Based on Kendall's Correlation Coefficient | |
CN109977030B (en) | Method and device for testing deep random forest program | |
CN115440392A (en) | An important hyperedge recognition method based on the deleted Laplacian matrix | |
Teng et al. | Two-way truncated linear regression models with extremely thresholding penalization | |
Cheng et al. | Use of biclustering for missing value imputation in gene expression data. | |
Boggis et al. | equips: eqtl analysis using informed partitioning of snps–a fully Bayesian approach | |
CN110060735A (en) | A kind of biological sequence clustering method based on the segmentation of k-mer group | |
Alvarado-Serrano et al. | Detecting spatial dynamics of range expansions with geo-referenced genomewide SNP data and the geographic spectrum of shared alleles | |
Lehmann et al. | High trait variability in optimal polygenic prediction strategy within multiple-ancestry cohorts | |
CN116978464A (en) | Data processing method, device, equipment and medium | |
CN109326327A (en) | A Sequence Clustering Method Based on SeqRank Graph Algorithm | |
Ferebee et al. | Exploring the utility of regulatory network-based machine learning for gene expression prediction in maize | |
Gustafsson et al. | Large-scale reverse engineering by the lasso | |
Narayanan et al. | A newtonian framework for community detection in undirected biological networks | |
Pungpapong et al. | Selecting massive variables using an iterated conditional modes/medians algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20190409 |