CN106778078A

CN106778078A - DNA sequence similarity comparison method based on the Kendall correlation coefficient

Info

Publication number: CN106778078A
Application number: CN201611186639.9A
Authority: CN
Inventors: 林劼; 林丽玉; 江育娥
Original assignee: Fujian Normal University
Current assignee: Fujian Normal University
Priority date: 2016-12-20
Filing date: 2016-12-20
Publication date: 2017-05-31
Anticipated expiration: 2036-12-20
Also published as: CN106778078B

Abstract

The invention discloses a DNA sequence similarity comparison method based on the Kendall correlation coefficient, comprising the following steps: 1) obtaining N DNA sequences to be compared; 2) selecting a length k, obtaining the corresponding k-words of each pair of combined DNA sequences with a sliding window, and combining them into the corresponding vectors; 3) using the k-words obtained in step 2), counting the number of times each k-word occurs in the DNA sequence, that is, computing the k-word frequency vector of the sequence, whose entries are denoted x_i, so that the frequencies of all k-words in the sequence form X = {x_i}; 4) combining the k-word vectors of the N DNA sequences in pairs, which gives N(N-1)/2 combinations, the two k-word frequency vectors of each combination being denoted X and Y; 5) calculating the Kendall correlation coefficient of the k-word frequency vectors X and Y of each combination; 6) establishing the N×N similarity coefficient matrix of the N DNA sequences to obtain the sequence similarities and an evolutionary relationship diagram. The invention improves the quality of DNA sequence similarity comparison, reduces computational complexity and shortens computation time.

Description

DNA Sequence Similarity Alignment Method Based on Kendall Correlation Coefficient

Technical field

The invention relates to the field of computing and bioinformatics processing, and in particular to a DNA sequence similarity comparison method based on the Kendall correlation coefficient.

Background

The central task of bioinformatics is to extract rational knowledge from the vast body of DNA sequence data. Bioinformaticians must not only provide efficient means of data storage but also develop effective data analysis tools. Only with new and effective analysis tools can DNA sequence information be converted into biological knowledge, the structural and functional information it contains be clarified, and the biological significance it represents be fully understood.

The theoretical basis of DNA sequence comparison is the theory of evolution. If two DNA sequences are sufficiently similar, it is inferred that they may share a common evolutionary ancestor and have each evolved from it through genetic variation processes such as substitution of residues within the sequence, deletion of residues or sequence fragments, and sequence recombination. DNA sequence similarity and DNA sequence homology are different concepts: the degree of similarity between DNA sequences is a quantifiable parameter, whereas whether DNA sequences are homologous must be verified by evolutionary facts. DNA sequence comparison in practice means using a specific mathematical model or algorithm to find the maximum number of matching bases between two or more DNA sequences.

Huang Yujuan, Wang Tianming and others used the frequency and position information of k-words in a DNA sequence to construct a probability distribution. This distribution expresses the distance between two vectors, and the smaller the value, the closer the species. Vinga and Almeida proposed a DNA sequence comparison method based on word frequency: a sliding window counts the occurrences of all words of length k, giving a k-word count or frequency vector. A DNA sequence is thereby mapped to a vector in a high-dimensional Euclidean space, which converts the similarity comparison between DNA sequences into a comparison between vectors.

Pairwise DNA sequence comparison uses a specific algorithm to compare two DNA sequences in order to find the match of maximum similarity between them. The Kendall correlation coefficient is widely used for correlation prediction in time series, hydrological series, water quality series and the like, but it has not previously been used for DNA sequence similarity matching.

Summary of the invention

The purpose of the invention is to overcome the deficiencies of the prior art by providing a DNA sequence similarity comparison method based on the Kendall correlation coefficient, which constructs an N×N similarity coefficient matrix for N DNA sequences and obtains the evolutionary relationship of the N DNA sequences, while improving both the quality of DNA sequence similarity comparison and the computational efficiency.

The technical scheme adopted by the invention is as follows:

A DNA sequence similarity comparison method based on the Kendall correlation coefficient, comprising the following steps:

1) Obtain N DNA sequences to be compared;

2) Select a length k, obtain the corresponding k-words of each pair of combined DNA sequences with a sliding window, and combine them into the corresponding vectors;

3) Using the k-words obtained in step 2), count the number of times each k-word occurs in the DNA sequence, that is, compute the frequency vector of k-word occurrences in the DNA sequence, whose entries are denoted x_i;

4) Combine the k-word vectors of the N DNA sequences in pairs, giving N(N-1)/2 combinations; the two vectors of each combination are denoted X = {x_i} and Y = {y_i};

5) For the k-word frequency vectors x_i, y_i of each combination, calculate the corresponding Kendall correlation coefficient;

6) Establish the N×N correlation coefficient matrix of the N DNA sequences to obtain the similarity information and the evolutionary relationship diagram of the DNA sequences.

Further, in step 2), the frequency vector of words of length k is computed for each DNA sequence.
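The sliding-window counting of steps 2) and 3) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `kword_counts` and `frequency_vector` are hypothetical, and the ordering of the 4^k possible k-words (here the order produced by `itertools.product` over the alphabet "ATGC") is an assumption, since the text does not fix one. The input is the fragment ATAACTA used in the worked example of the detailed description.

```python
from collections import Counter
from itertools import product


def kword_counts(seq: str, k: int) -> Counter:
    """Count every length-k word seen by a window that slides one base at a time."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def frequency_vector(seq: str, k: int, alphabet: str = "ATGC") -> list:
    """Counts for all len(alphabet)**k possible k-words, in a fixed (assumed) order."""
    counts = kword_counts(seq, k)
    # A Counter returns 0 for words that never occur in the sequence.
    return [counts["".join(word)] for word in product(alphabet, repeat=k)]


counts_a = kword_counts("ATAACTA", 2)
print(dict(counts_a))                        # observed 2-words and their counts
print(len(frequency_vector("ATAACTA", 2)))   # 16 possible 2-words
```

A sequence of length n yields n - k + 1 windows, so the entries of the frequency vector always sum to n - k + 1.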

Further, in step 5), the Kendall correlation coefficient of the k-words of the DNA sequences can be obtained through the following steps:

a) Obtain the k-words of the DNA sequence A to be compared by the following formula, where the length of DNA sequence A is n:

$$F_k^A = \left(f(W_{k,1}^A),\ f(W_{k,2}^A),\ \ldots,\ f(W_{k,n}^A)\right)$$

b) Calculate the frequency of occurrence of each k-word: x_i = {the number of times the i-th k-word W_{k,i}^A occurs in DNA sequence A};

c) For the combined vectors X and Y, calculate the Kendall correlation coefficient τ, where t_x is the number of concordant pairs in {x_i}, {y_i}, t_y is the number of discordant pairs in {x_i, y_i}, and T is the total number of distinct k-words in {x_i, y_i};

d) The values t_x and t_y in step c) are obtained as follows: for two k-words i and j, if (x_i - x_j) and (y_i - y_j) have the same sign, the pair is counted in t_x as a concordant pair of {x_i, y_i}; if they have opposite signs, the pair is counted in t_y as a discordant pair of {x_i, y_i}.

The Kendall correlation coefficient τ obtained is a number in the interval [-1, 1]. The closer τ is to 1, the stronger the correlation between the two DNA sequences; the closer τ is to -1, the stronger the negative correlation between them; a value of τ close to 0 indicates that the two DNA sequences are uncorrelated.
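A short sketch of steps c) and d) may help. The formula for τ is not reproduced in this text, so the code below assumes the common tie-ignoring form τ = (t_x - t_y) / (T(T-1)/2), and it assumes that T counts the k-words occurring in at least one of the two sequences; both are assumptions, and the function name `kendall_tau` is hypothetical. The two input vectors are the 2-word counts of the fragments ATAACTA and ACAACTTA from the detailed description, listed in the order AT, TA, AA, TT, AG, GA, AC, CA, CT.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Return (tau, concordant, discordant) for two equal-length count vectors."""
    # Assumption: only k-words present in at least one sequence contribute to T.
    pairs = [(a, b) for a, b in zip(x, y) if a or b]
    t = len(pairs)
    if t < 2:
        raise ValueError("need at least two distinct k-words")
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(pairs, 2):
        sign = (xi - xj) * (yi - yj)
        if sign > 0:        # both differences have the same sign
            concordant += 1
        elif sign < 0:      # the differences have opposite signs
            discordant += 1
        # sign == 0 is a tie and is counted in neither total
    tau = (concordant - discordant) / (t * (t - 1) / 2)
    return tau, concordant, discordant


x = [1, 2, 1, 0, 0, 0, 1, 0, 1]   # ATAACTA
y = [0, 1, 1, 1, 0, 0, 2, 1, 1]   # ACAACTTA
print(kendall_tau(x, y))          # (0.0, 3, 3) with T = 7
```

Under these assumptions the two fragments have three concordant and three discordant pairs among the 21 possible pairs, so τ is 0. Tie-corrected variants of the coefficient would give different values when many k-words share the same count.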

An N×N Kendall correlation coefficient matrix is then constructed. This matrix is symmetric with the value 1 on the diagonal; it provides the pairwise similarity information of the N DNA sequences, from which the evolutionary relationship of the N DNA sequences is constructed.
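The matrix construction can be sketched as below, again as an illustration under stated assumptions rather than the patented code. The helper names are hypothetical, `tau` uses the assumed tie-ignoring form (t_x - t_y) / (T(T-1)/2), and the third sequence "ATTACGA" is a made-up input. The diagonal is set to 1 directly, as the text specifies, instead of being computed: with tied counts, the tie-ignoring coefficient of a vector with itself can fall below 1.

```python
from collections import Counter
from itertools import combinations


def kword_counts(seq, k):
    """Sliding-window counts of every length-k word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tau(cx, cy):
    """Assumed tie-ignoring Kendall coefficient over k-words seen in either sequence."""
    words = sorted(set(cx) | set(cy))
    t = len(words)
    if t < 2:
        return 0.0  # assumption: too few k-words to rank, treat as uncorrelated
    score = 0
    for u, v in combinations(words, 2):
        sign = (cx[u] - cx[v]) * (cy[u] - cy[v])
        score += (sign > 0) - (sign < 0)   # +1 concordant, -1 discordant, 0 tie
    return score / (t * (t - 1) / 2)


def similarity_matrix(seqs, k):
    """Symmetric N x N matrix; only the N(N-1)/2 upper-triangle pairs are computed."""
    counts = [kword_counts(s, k) for s in seqs]
    n = len(seqs)
    m = [[1.0] * n for _ in range(n)]      # diagonal stays 1 by definition
    for i, j in combinations(range(n), 2):
        m[i][j] = m[j][i] = tau(counts[i], counts[j])
    return m


matrix = similarity_matrix(["ATAACTA", "ACAACTTA", "ATTACGA"], k=2)
for row in matrix:
    print(["%.2f" % v for v in row])
```

Computing each pair once and mirroring it is what keeps the work at N(N-1)/2 coefficient evaluations.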

The DNA sequence similarity comparison method of the invention uses a sliding window to obtain the k-word frequency vector of each DNA sequence to be analyzed, combines the k-word vectors of the N DNA sequences in pairs, and computes the Kendall correlation coefficient of the k-word frequency vectors of each pair. This allows similarity detection across multiple DNA sequences, and the results effectively reflect the evolutionary relationships between the sequences. The method is concise: it only requires constructing a symmetric matrix whose diagonal from upper left to lower right has the value 1, which reduces computational complexity and improves efficiency. The Kendall coefficient can serve as a feature value for predicting DNA sequence similarity and achieves good accuracy.

Brief description of the drawings

The invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a schematic flow chart of the DNA sequence similarity comparison method based on the Kendall correlation coefficient according to the invention;

Fig. 2 is the evolutionary relationship diagram of the DNA sequences obtained with the DNA sequence similarity comparison method based on the Kendall correlation coefficient according to the invention.

Detailed description

The method of the invention is described in further detail below, taking the DNA coding sequences of 20 species as the objects of analysis. As shown in Figure 1, the DNA sequence similarity comparison method based on the Kendall correlation coefficient of this embodiment comprises the following steps:

1) Select the DNA coding sequences of 20 species as the initial DNA sequences; the names and sequence lengths of the 20 species are listed in Table 1;

| Species name | DNA sequence length |
|---|---|
| baboon | 16522 |
| bluewhale | 16403 |
| cat | 17010 |
| common_chimpanzee | 16564 |
| cow | 16339 |
| fin_whale | 16399 |
| gibbon | 16473 |
| gorilla | 16365 |
| grayseal | 16798 |
| harborseal | 16827 |
| horse | 16661 |
| human | 16570 |
| mouse | 16296 |
| opossum | 17085 |
| orangutan | 16390 |
| pigmy_chimpanzee | 16555 |
| platypus | 17020 |
| rat | 16301 |
| wallaroo | 16897 |
| whiterhinoceros | 16833 |

Table 1: Species DNA sequence information

2) Obtain the k-words of the initial DNA sequences from step 1) and combine them to obtain the k-word frequency vector of each initial DNA sequence (see Vinga, S., Almeida, J. S. Alignment-free sequence comparison: a review [J]. Bioinformatics, 2003: 513-523). This approach uses a sliding window to count how often each short sequence of length k occurs in the DNA sequence under test. For the 4 DNA bases {A, T, G, C}, taking k = 2 gives 4² = 16 possible k-words, and k = 3 gives 4³ = 64 possible k-words. For example, for the DNA sequence fragment A = ATAACTA, with the k-words W₂ = {AT, TA, AA, TT, AG, GA, AC, CA, CT, …}, the frequency vector is {1, 2, 1, 0, 0, 0, 1, 0, 1, 0, …}; for the DNA sequence fragment B = ACAACTTA, the k-word frequency vector is {0, 1, 1, 1, 0, 0, 2, 1, 1, 0, …};

3) For N DNA sequences, N k-word frequency vectors are obtained; combining them in pairs gives N(N-1)/2 combinations, and the two frequency vectors of each combination are denoted X and Y;

4) Calculate the Kendall correlation coefficient τ, where t_x is the number of concordant pairs between {x_i, y_i} and the other k-word frequencies, t_y is the number of discordant pairs between {x_i, y_i} and the other k-word frequencies, and T is the total number of distinct k-words in {x_i, y_i}. For fragments A and B of step 2), the total number of k-words is T = 7 (AT, TA, AA, TT, AC, CA and CT, the k-words that occur in at least one of the two fragments);

5) The values t_x and t_y in step 4) are obtained as follows: for two k-words i and j, if (x_i - x_j) and (y_i - y_j) have the same sign, the pair is counted in t_x as a concordant pair of {x_i, y_i}; if they have opposite signs, the pair is counted in t_y as a discordant pair of {x_i, y_i};

6) Construct the N×N Kendall correlation coefficient matrix. This matrix is symmetric with diagonal values of 1, so it can usually be written as an upper triangular matrix. Because similarity is negatively correlated with distance, before constructing the evolutionary relationship diagram we negate the similarity values to convert them into distances, and build the diagram from those distances; see Figure 2.
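The similarity-to-distance conversion of step 6) amounts to a sign flip, as the following sketch shows. The text does not name the tree-building algorithm used for Figure 2, so none is implemented here; instead a hypothetical helper, `nearest_neighbours`, simply reports the closest other sequence for each row. The names and the 3×3 similarity values are made up for illustration and are not results from the patent.

```python
def similarity_to_distance(sim):
    """Negate every similarity, so a larger tau becomes a smaller distance."""
    return [[-v for v in row] for row in sim]


def nearest_neighbours(dist, names):
    """For each sequence, name the other sequence at the smallest distance."""
    result = {}
    for i, name in enumerate(names):
        others = [j for j in range(len(names)) if j != i]
        result[name] = names[min(others, key=lambda j: dist[i][j])]
    return result


# Hypothetical similarity matrix for three sequences (illustrative values only).
names = ["s1", "s2", "s3"]
sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
dist = similarity_to_distance(sim)
print(nearest_neighbours(dist, names))   # {'s1': 's2', 's2': 's1', 's3': 's2'}
```

Negated coefficients lie in [-1, 1], so some distances are negative; a tree-building tool that requires non-negative input may need a different conversion, which would be a departure from the text.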

Result analysis: By computing the Pearson correlation coefficient between the DNA sequence similarity obtained with the Kendall coefficient and the edit distance, we find a correlation of -0.94. This indicates that the similarity computed by the method of the invention is highly accurate and can be obtained quickly, making it a very effective alternative to edit distance.
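The evaluation described above can be sketched as follows, assuming that "edit distance" means the Levenshtein distance and that the Pearson coefficient is taken over one value per sequence pair; the text states neither detail explicitly. The function names and the sample numbers are hypothetical and are not data from the patent.

```python
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    std_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    std_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (std_u * std_v)


# Hypothetical per-pair values: Kendall similarity and edit distance.
similarities = [0.9, 0.6, 0.4, 0.1]
edit_distances = [120, 410, 650, 980]
print(pearson(similarities, edit_distances))   # strongly negative for these values
print(edit_distance("ATAACTA", "ACAACTTA"))
```

A strongly negative coefficient is the expected direction, because a higher similarity should correspond to a smaller edit distance.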

The above is only an embodiment of the invention and does not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the invention.

Claims

1. A DNA sequence similarity comparison method based on the Kendall correlation coefficient, characterized in that it comprises the following steps:

1) obtaining N DNA sequences to be compared;

2) selecting a length k, obtaining the corresponding k-words of each pair of combined DNA sequences with a sliding window, and combining them into the corresponding vectors;

3) using the k-words obtained in step 2), counting the number of times each k-word occurs in the DNA sequence, that is, computing the frequency vector of k-word occurrences in the DNA sequence, whose entries are denoted x_i;

4) combining the k-word vectors of the N DNA sequences in pairs, giving N(N-1)/2 combinations, the two vectors of each combination being denoted X = {x_i} and Y = {y_i};

5) for the k-word frequency vectors x_i, y_i of each combination, calculating the corresponding Kendall correlation coefficient;

6) establishing the N×N correlation coefficient matrix of the N DNA sequences to obtain the similarity information and the evolutionary relationship diagram of the DNA sequences.

2. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: in step 2), the frequency vector of words of length k is computed for each DNA sequence.

3. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: in step 5), the Kendall correlation coefficient of the k-words of the DNA sequences is obtained as follows:

a) obtaining the k-words of the DNA sequence A to be compared by the following formula, where the length of DNA sequence A is n:

$$F_k^A = \left(f(W_{k,1}^A),\ f(W_{k,2}^A),\ \ldots,\ f(W_{k,n}^A)\right)$$

b) calculating the frequency of occurrence of each k-word: x_i = {the number of times the i-th k-word W_{k,i}^A occurs in DNA sequence A};

c) for the combined vectors X and Y, calculating the Kendall correlation coefficient τ, where t_x is the number of concordant pairs in {x_i}, {y_i}, t_y is the number of discordant pairs in {x_i, y_i}, and T is the total number of distinct k-words in {x_i, y_i};

d) obtaining t_x and t_y of step c) as follows: for two k-words i and j, if (x_i - x_j) and (y_i - y_j) have the same sign, the pair is counted in t_x as a concordant pair of {x_i, y_i}; if they have opposite signs, the pair is counted in t_y as a discordant pair of {x_i, y_i}.

4. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: the Kendall correlation coefficient τ obtained is a number in the interval [-1, 1]; the closer τ is to 1, the stronger the correlation between the two DNA sequences; the closer τ is to -1, the stronger the negative correlation between the two DNA sequences; and a value of τ close to 0 indicates that the two DNA sequences are uncorrelated.

5. The DNA sequence similarity comparison method based on the Kendall correlation coefficient according to claim 1, characterized in that: in step 6), an N×N Kendall correlation coefficient matrix is constructed; this matrix is symmetric with the value 1 on the diagonal, provides the pairwise similarity information of the N DNA sequences, and is used to construct the evolutionary relationship of the N DNA sequences.