CN106778078B

CN106778078B - DNA Sequence Similarity Alignment Method Based on Kendall's Correlation Coefficient

Info

Publication number: CN106778078B
Application number: CN201611186639.9A
Authority: CN
Inventors: 林劼; 林丽玉; 江育娥
Original assignee: Fujian Normal University
Current assignee: Fujian Normal University
Priority date: 2016-12-20
Filing date: 2016-12-20
Publication date: 2019-04-09
Anticipated expiration: 2036-12-20
Also published as: CN106778078A

Abstract

The present invention discloses a DNA sequence similarity comparison method based on the Kendall correlation coefficient, comprising the following steps: 1) obtain the N DNA sequences to be compared; 2) choose a length k, obtain the k-words corresponding to each pair of combined DNA sequences with a sliding window, and combine them into the corresponding vector; 3) using the k-words obtained in step 2), count the number of times each k-word occurs in the DNA sequence to obtain the k-word frequency vector, with each entry denoted x_i and all k-word frequencies of a DNA sequence denoted X = {x_i}; 4) combine the k-word vectors of the N DNA sequences in pairs, giving N(N−1)/2 combinations, with the k-word frequency vectors of each combination denoted x and y; 5) for the k-word frequency vectors x and y of each combination, calculate the corresponding Kendall correlation coefficient; 6) build the N×N similarity coefficient matrix of the N DNA sequences to obtain the similarity and the evolutionary relationship graph of the DNA sequences. The invention improves the quality of DNA sequence similarity comparison, reduces computational complexity, and shortens running time.

Description

DNA sequence similarity comparison method based on the Kendall correlation coefficient

Technical field

The present invention relates to the field of computing and bioinformatics processing, and more particularly to a DNA sequence similarity comparison method based on the Kendall correlation coefficient.

Background art

The central task of bioinformatics is to extract conceptual knowledge from the vast ocean of DNA sequence data. Bioinformaticians must not only provide efficient means of data storage but also develop effective data analysis tools, because only with new, effective analysis tools can DNA sequence information be converted into biological knowledge, the structural and functional information it contains be worked out, and the biological significance it represents be fully understood.

The theoretical basis of DNA sequence comparison is the theory of evolution: if two DNA sequences are sufficiently similar, it is inferred that they may share a common evolutionary ancestor and have each evolved through genetic variation processes such as residue substitution, deletion of residues or sequence fragments, and sequence recombination. DNA sequence similarity and DNA sequence homology are different concepts: the degree of similarity between DNA sequences is a quantifiable parameter, whereas whether DNA sequences are homologous must be verified by evolutionary facts. In practice, DNA sequence comparison uses a specific mathematical model or algorithm to find the maximum number of matching bases between two or more DNA sequences.

Huang Yujuan, Wang Tianming et al. used the frequency and position information of the k-words in a DNA sequence to construct a probability distribution and used this distribution to express the distance between two vectors; the smaller the value, the closer the species. Vinga and Almeida proposed a DNA sequence comparison method based on word frequency: a sliding window is used to count the occurrences of all words of length k, giving a k-word count or frequency vector. In this way a DNA sequence is mapped to a vector in a high-dimensional Euclidean space, so that similarity comparison between DNA sequences is converted into comparison between vectors.
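The word-frequency idea described above can be illustrated with a short sketch. It is only an illustration under stated assumptions, not code from the cited authors: the function names `kword_counter` and `euclidean_kword_distance` are hypothetical, and plain Euclidean distance is used as one simple way of comparing two vectors.

```python
from collections import Counter
from math import dist


def kword_counter(seq, k):
    """Count the k-words of seq seen by a sliding window of width k."""
    return Counter(seq[j:j + k] for j in range(len(seq) - k + 1))


def euclidean_kword_distance(s, t, k):
    """Euclidean distance between the k-word count vectors of s and t.

    Words absent from both sequences contribute 0 to the distance,
    so only the words seen in at least one sequence are enumerated.
    """
    cs, ct = kword_counter(s, k), kword_counter(t, k)
    words = sorted(set(cs) | set(ct))
    return dist([cs[w] for w in words], [ct[w] for w in words])


# Two short illustrative fragments (chosen only for this example).
d = euclidean_kword_distance("ATAACTA", "ACAACTTA", 2)
```

Identical sequences give a distance of 0, and the distance grows as the k-word compositions diverge.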

Pairwise DNA sequence comparison compares two DNA sequences with a specific algorithm in order to find the match of maximum similarity between them. The Kendall correlation coefficient is widely used for correlation prediction on time series, hydrological series, water-quality series and the like, but it has not been used for DNA sequence similarity matching.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art and provide a DNA sequence similarity comparison method based on the Kendall correlation coefficient, which builds an N×N similarity coefficient matrix for N DNA sequences and obtains the evolutionary relationship of the N DNA sequences, while improving both the efficiency of DNA sequence similarity comparison and the computational efficiency.

The technical solution adopted by the present invention is as follows:

A DNA sequence similarity comparison method based on the Kendall correlation coefficient, comprising the following steps:

1) obtain the N DNA sequences to be compared;

2) choose a length k, obtain the k-words corresponding to each pair of combined DNA sequences with a sliding window, and combine them into the corresponding vector;

3) using the k-words obtained in step 2), count the number of times each k-word occurs in the DNA sequence, i.e. calculate the frequency vector of the k-words in the DNA sequence, with each entry denoted x_i;

4) combine the k-word vectors of the N DNA sequences in pairs, giving N(N−1)/2 combinations, with the vectors of each combination denoted X = {x_i}, Y = {y_i};

5) for the k-word frequency vectors x_i, y_i of each combination, calculate the corresponding Kendall correlation coefficient;

6) build the N×N Kendall correlation coefficient matrix of the N DNA sequences to obtain the similarity information and the evolutionary relationship graph of the DNA sequences.

Further, in step 2), a word frequency vector of word length k is taken for each DNA sequence.
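The sliding-window counting of steps 2) and 3) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `kword_counts`, the word order produced by `itertools.product`, and the decision to skip windows containing symbols outside {A, T, G, C} are assumptions made for the example.

```python
from itertools import product


def kword_counts(seq, k, alphabet="ATGC"):
    """Count every possible k-word of seq with a sliding window of width k."""
    # Enumerate all len(alphabet)**k possible words so absent words get a 0.
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    for j in range(len(seq) - k + 1):
        w = seq[j:j + k]
        if w in counts:  # assumption: ignore windows with unknown symbols
            counts[w] += 1
    return counts


counts = kword_counts("ATAACTA", 2)
```

Dividing each count by the number of windows, `len(seq) - k + 1`, would give relative frequencies instead of raw counts.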

Further, in step 5), the Kendall correlation coefficient of the k-words of the DNA sequences can be obtained as follows:

a) obtain the k-words of the DNA sequence A to be compared by the following formula, where the length of DNA sequence A is set to n:

b) calculate the frequency with which each k-word occurs: x_i = {number of times the i-th k-word occurs in DNA sequence A};

c) for the combined vectors X and Y, calculate the Kendall correlation coefficient by the formula $\tau = \frac{t_x - t_y}{T(T-1)/2}$, where t_x is the number of concordant pairs in {x_i}, {y_i}, t_y is the number of discordant pairs in {x_i, y_i}, and T is the total number of distinct k-words in {x_i, y_i};

d) t_x and t_y in step c) are obtained as follows: for two indices i < j, if (x_i − x_j) and (y_i − y_j) have the same sign, i.e. (x_i − x_j) × (y_i − y_j) > 0, the pair is called a concordant pair of {x_i, y_i}, and t_x is the number of such pairs; if they have opposite signs, i.e. (x_i − x_j) × (y_i − y_j) < 0, the pair is called a discordant pair of {x_i, y_i}, and t_y is the number of such pairs.
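The pair counting of steps c) and d) can be sketched as below. This is a hedged illustration, not the patent's own code: the function name `kendall_tau` is hypothetical, tied pairs are assumed to count as neither concordant nor discordant, and the normalisation by the total number of pairs T(T−1)/2 is the standard Kendall tau-a form, taken here as an assumption.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Return (tau, t_x, t_y) for two equal-length count vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    T = len(x)
    t_x = t_y = 0
    for i, j in combinations(range(T), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            t_x += 1  # same sign: concordant pair
        elif s < 0:
            t_y += 1  # opposite sign: discordant pair
        # s == 0 is a tie and is counted in neither total (assumption)
    total_pairs = T * (T - 1) // 2
    tau = (t_x - t_y) / total_pairs if total_pairs else 0.0
    return tau, t_x, t_y


tau, t_x, t_y = kendall_tau([1, 2, 3], [1, 3, 2])
```

With many tied counts, as is common for sparse k-word vectors, this form can stay well below 1 even for very similar sequences; tie-corrected variants exist but are outside what the text describes.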

The Kendall correlation coefficient obtained is denoted τ and takes a value in [−1, 1]. The closer τ is to 1, the stronger the correlation between the two DNA sequences; the closer τ is to −1, the stronger the negative correlation between the two DNA sequences; a value of τ close to 0 indicates that the two DNA sequences are not correlated.

An N×N Kendall correlation coefficient matrix is constructed. This matrix is symmetric and the values on its diagonal are 1. From it the pairwise affinity information of the N DNA sequences can be obtained, and the evolutionary relationship of the N DNA sequences is constructed from that information.

The DNA sequence similarity comparison method based on the Kendall correlation coefficient of the present invention obtains the k-word frequency vectors of the DNA sequences to be analysed with a sliding window, combines the k-word vectors of the N DNA sequences in pairs, and calculates the Kendall correlation coefficient of the k-word frequency vectors of each pair of DNA sequences. Similarity detection can therefore be carried out on multiple DNA sequences, and the result effectively reflects the evolutionary relationship between them. The method is concise: only one symmetric matrix needs to be built, in which the values on the diagonal from upper left to lower right are 1, which reduces computational complexity and improves computational efficiency. The Kendall coefficient can serve as a characteristic value describing DNA sequence similarity and achieves good accuracy.
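As an illustration of the matrix construction, the sketch below fills only the upper triangle and mirrors it, with the diagonal fixed at 1 as stated above. The helper names, the toy sequences, the use of all 4^k possible words as vector positions, and the tau-a normalisation are assumptions for the example, not details confirmed by the text.

```python
from itertools import combinations, product


def kword_vector(seq, k):
    """k-word count vector over all 4**k words of the alphabet ATGC."""
    words = ["".join(p) for p in product("ATGC", repeat=k)]
    counts = dict.fromkeys(words, 0)
    for j in range(len(seq) - k + 1):
        w = seq[j:j + k]
        if w in counts:
            counts[w] += 1
    return [counts[w] for w in words]


def kendall_tau(x, y):
    """(concordant - discordant) / total pairs; ties count as neither."""
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)


def similarity_matrix(seqs, k):
    """Symmetric N x N matrix of pairwise Kendall coefficients."""
    vectors = [kword_vector(s, k) for s in seqs]
    n = len(seqs)
    m = [[1.0] * n for _ in range(n)]  # diagonal set to 1 by construction
    for a, b in combinations(range(n), 2):  # N(N-1)/2 pairwise combinations
        m[a][b] = m[b][a] = kendall_tau(vectors[a], vectors[b])
    return m


toy = ["ATAACTA", "ACAACTTA", "GGCATTAC"]  # hypothetical toy sequences
matrix = similarity_matrix(toy, 2)
```

Because the matrix is symmetric, only N(N−1)/2 coefficients are actually computed, which is where the saving in computation comes from.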

Brief description of the drawings

The present invention is described in further detail below with reference to the drawings and specific embodiments;

Fig. 1 is a flow diagram of the DNA sequence similarity comparison method based on the Kendall correlation coefficient of the present invention;

Fig. 2 is the evolutionary relationship graph of DNA sequences obtained with the DNA sequence similarity comparison method based on the Kendall correlation coefficient of the present invention.

Detailed description of the embodiments

As shown in Fig. 1 or Fig. 2, the method of the invention is further elaborated below, taking the DNA coding sequences of 20 species as the objects of analysis. As shown in Fig. 1, the DNA sequence similarity comparison method based on the Kendall correlation coefficient of this embodiment comprises the following steps:

1) select the DNA coding sequences of 20 species as the initial DNA sequences; the names and lengths of the DNA sequences of the 20 species are shown in Table 1;

Species name	DNA sequence length
baboon	16522
bluewhale	16403
cat	17010
common_chimpanzee	16564
cow	16339
fin_whale	16399
gibbon	16473
gorilla	16365
grayseal	16798
harborseal	16827
horse	16661
human	16570
mouse	16296
opossum	17085
orangutan	16390
pigmy_chimpanzee	16555
platypus	17020
rat	16301
wallaroo	16897
whiterhinoceros	16833

Table 1: Species DNA sequence information

2) obtain the k-words of the initial DNA sequences of step 1) and combine them to obtain the k-word frequency vectors of the initial DNA sequences (see Vinga, S., Almeida, J.S. Alignment-free sequence comparison: a review [J]. Bioinformatics, 2003: 513-523). The method uses a sliding window to find how often each short DNA sequence of length k appears in the DNA sequence under test. For the 4 bases {A, T, G, C} of DNA, taking k = 2 gives 4^2 = 16 possible k-words, and taking k = 3 gives 4^3 = 64 possible k-words. For example, for the DNA sequence fragment A = ATAACTA, with the k-words W_2 = {AT, TA, AA, TT, AG, GA, AC, CA, CT, ...}, the frequency vector is {1, 2, 1, 0, 0, 0, 1, 0, 1, 0, ...}; for the DNA sequence fragment B = ACAACTTA, the k-word frequency vector is {0, 1, 1, 1, 0, 0, 2, 1, 1, 0, ...};

3) for the N DNA sequences, N k-word frequency vectors are obtained and combined in pairs, giving N(N−1)/2 combinations, with the frequency vectors of each combination denoted X and Y;

4) calculate the Kendall correlation coefficient by the formula $\tau = \frac{t_x - t_y}{T(T-1)/2}$, where t_x is the number of concordant pairs between {x_i, y_i} and the other k-word frequencies, t_y is the number of discordant pairs between {x_i, y_i} and the other k-word frequencies, and T is the total number of distinct k-words in {x_i, y_i}; for the fragments A and B of step 2), the total number of k-words is T = 7;

5) t_x and t_y in step 4) are obtained as follows: for two indices i < j, if (x_i − x_j) and (y_i − y_j) have the same sign, i.e. (x_i − x_j) × (y_i − y_j) > 0, the pair is called a concordant pair of {x_i, y_i}, and t_x is the number of such pairs; if they have opposite signs, i.e. (x_i − x_j) × (y_i − y_j) < 0, the pair is called a discordant pair of {x_i, y_i}, and t_y is the number of such pairs;

6) build the N×N Kendall correlation coefficient matrix. This matrix is symmetric with diagonal values of 1, so it can usually be reduced to an upper triangular matrix. Since similarity and distance are negatively related, before building the evolutionary relationship graph we convert the similarity values into distances by taking their negatives, and build the evolutionary relationship graph from these distances; see Fig. 2.
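The steps of this embodiment can be traced on the two short fragments A = ATAACTA and B = ACAACTTA. The sketch below is one possible reading, not the patent's own code: it assumes that the vectors are restricted to the T distinct k-words occurring in at least one of the two fragments, that tied pairs count as neither concordant nor discordant, and that τ is normalised by T(T−1)/2. The distance is taken as the negative of the similarity, as described in step 6).

```python
from collections import Counter
from itertools import combinations


def kwords(seq, k):
    """Sliding-window k-word counts; missing words read as 0."""
    return Counter(seq[j:j + k] for j in range(len(seq) - k + 1))


a = kwords("ATAACTA", 2)
b = kwords("ACAACTTA", 2)

# Distinct k-words present in A or B (assumed meaning of T).
words = sorted(set(a) | set(b))
T = len(words)
x = [a[w] for w in words]
y = [b[w] for w in words]

t_x = t_y = 0
for i, j in combinations(range(T), 2):
    s = (x[i] - x[j]) * (y[i] - y[j])
    if s > 0:
        t_x += 1  # concordant pair
    elif s < 0:
        t_y += 1  # discordant pair

tau = (t_x - t_y) / (T * (T - 1) / 2)
distance = -tau  # similarity negated to obtain a distance
```

Under these assumptions the sketch gives T = 7, in agreement with the value stated in step 4), with three concordant and three discordant pairs, so τ = 0 for these two very short fragments.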

Result analysis: by calculating the Pearson correlation coefficient against the edit distance, we found that the correlation coefficient between the DNA sequence similarity calculated with the Kendall coefficient and the edit distance is -0.94. This shows that the method of the present invention computes DNA sequence similarity with high accuracy, is fast to compute, and can be a very effective alternative to the edit distance.
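The kind of check described in this result analysis can be sketched as follows. The code is an illustration only: the function names `edit_distance` and `pearson` are hypothetical, the edit distance is assumed to be the standard Levenshtein distance with unit costs, and no attempt is made to reproduce the reported coefficient.

```python
from math import sqrt


def edit_distance(s, t):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution or match
        prev = curr
    return prev[-1]


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    su = sqrt(sum((p - mu) ** 2 for p in u))
    sv = sqrt(sum((q - mv) ** 2 for q in v))
    return cov / (su * sv)


d = edit_distance("ATAACTA", "ACAACTTA")
```

Given the upper-triangle Kendall similarities and the matching pairwise edit distances as two flat lists of equal length, `pearson(similarities, distances)` yields a coefficient of the kind reported above; a strongly negative value means that high similarity goes with small edit distance.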

The above is only an embodiment of the present invention and does not limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present invention.

Claims

1. A DNA sequence similarity comparison method based on the Kendall correlation coefficient, characterized in that it comprises the following steps:

1) obtain the N DNA sequences to be compared;

2) choose a length k, obtain the k-words corresponding to each pair of combined DNA sequences with a sliding window, and combine them into the corresponding vector;

3) using the k-words obtained in step 2), count the number of times each k-word occurs in the DNA sequence, i.e. calculate the frequency vector of the k-words in the DNA sequence, with each entry denoted x_i;

4) combine the k-word vectors of the N DNA sequences in pairs, giving N(N−1)/2 combinations, with the vectors of each combination denoted X = {x_i}, Y = {y_i};

in step 5), the Kendall correlation coefficient of the k-words of the DNA sequences is obtained as follows:

b) calculate the frequency with which each k-word occurs: x_i = {number of times the i-th k-word occurs in DNA sequence A};

c) for the combined vectors X and Y, calculate the Kendall correlation coefficient by the formula $\tau = \frac{t_x - t_y}{T(T-1)/2}$, where t_x is the number of concordant pairs in {x_i}, {y_i}, t_y is the number of discordant pairs in {x_i, y_i}, and T is the total number of distinct k-words in {x_i, y_i};

d) t_x and t_y in step c) are obtained as follows: for two indices i < j, if (x_i − x_j) and (y_i − y_j) have the same sign, i.e. (x_i − x_j) × (y_i − y_j) > 0, the pair is called a concordant pair of {x_i, y_i}, and t_x is the number of such pairs; if they have opposite signs, i.e. (x_i − x_j) × (y_i − y_j) < 0, the pair is called a discordant pair of {x_i, y_i}, and t_y is the number of such pairs;

6) build the N×N Kendall correlation coefficient matrix of the N DNA sequences to obtain the similarity information and the evolutionary relationship graph of the DNA sequences.

2. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: in step 2), a word frequency vector of word length k is taken for each DNA sequence.

3. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: the Kendall correlation coefficient obtained is denoted τ, and τ takes a value in [−1, 1]; the closer τ is to 1, the stronger the correlation between the two DNA sequences; the closer τ is to −1, the stronger the negative correlation between the two DNA sequences; a value of τ close to 0 indicates that the two DNA sequences are not correlated.

4. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: the N×N Kendall correlation coefficient matrix built in step 6) is a symmetric matrix whose diagonal values are 1; from it the pairwise affinity information of the N DNA sequences can be obtained, and the evolutionary relationship of the N DNA sequences is constructed from that information.