
World Academy of Science, Engineering and Technology Vol:5 2011-10-25
International Science Index Vol:5, No:10, 2011 waset.org/publications/7178

A New Edit Distance Method for Finding Similarity in DNA Sequence

Patsaraporn Somboonsak, Mud-Armeen Munlin

Abstract: The P-Bigram method is a string comparison method based on an internal two-character similarity measure. The edit distance between two strings is the minimal number of elementary editing operations required to transform one string into the other. The elementary editing operations are deletion, insertion, and substitution of two characters. In this paper, we apply the P-Bigram method to solve the similarity problem in DNA sequences. The method provides an efficient algorithm that locates all minimum operations in a string. We have implemented the algorithm and found that our program calculates a smaller distance than the single-character edit distance. We develop the P-Bigram edit distance and show how the edit distance, or the similarity, is computed and implemented using dynamic programming. The performance of the proposed approach is evaluated using number-of-edits and percentage-similarity measures.

Keywords: Edit distance, String Matching, String Similarity

I. INTRODUCTION

The edit distance is a common similarity measure between two strings. It is defined as the minimum number of insertions, deletions, or substitutions of single characters needed to transform one of the strings into the other. This distance is of key importance in several fields, such as bioinformatics and text processing, and so are the associated computational problems. Given two strings S1 = [a1 a2 a3 … an] and S2 = [b1 b2 b3 … bm], the edit distance is the minimal cost of transforming S1 into S2 using the three operations insert, delete, and substitute, where only unit-cost operations are considered. The cost of the elementary editing operations is given by a scoring function, which induces a metric on strings. The similarity of two strings is determined by their minimum edit distance.
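To make the unit-cost definition above concrete, the following minimal Python sketch computes the edit distance between two strings by dynamic programming. The function name `edit_distance` and the two-row memory layout are illustrative choices and are not taken from the paper.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Unit-cost edit distance between s1 and s2 (insert, delete, substitute)."""
    # prev[j] is the distance between the previous prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        # Turning s1[:i] into the empty string takes i deletions.
        curr = [i]
        for j, b in enumerate(s2, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # substitute a with b, or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("TGATCGATC", "CTGATCGATC"))
```

For the pair TGATCGATC and CTGATCGATC the sketch prints 1, because a single insertion of C at the front turns the first string into the second.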
A DNA sequence can be seen as a pair of reverse complementary repeats in a string that are separated by a number of nucleotides. The complementary relation on the nucleotides (A, T, C, G) means that A is complementary to T and C is complementary to G. In this paper, we develop a two-letter edit distance. We compute the edit distance and the cost of the operations with dynamic programming. The new algorithm finds all minimal distances in a string.

P. Somboonsak is with the Faculty of Information Science and Technology, Mahanakorn University of Technology, Nongchock, 10530 Bangkok, Thailand (e-mail: psomboonsak@yahoo.com).
M. Munlin is with the Faculty of Information Science and Technology, Mahanakorn University of Technology, Nongchock, 10530 Thailand (e-mail: mmunlin@gmail.com).

II. RELATED WORK

Early work on finding similarity in strings deals with edit distance. Edit distance models have been studied in two contexts: string matching and sequence similarity. Much work has been done on string matching (see [6-11]). We focus on edit distance techniques, as our main goal here is similarity, and we use the most common measure, the Levenshtein edit distance [4]. The text is typically assumed to be random, i.e., each character is chosen uniformly and independently from the alphabet.

A sequence alignment is a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences [2]. The first application of an edit distance algorithm to protein sequences was studied by Needleman and Wunsch [1]. Their algorithm is commonly used in bioinformatics to align protein or nucleotide sequences. To find the alignment with the highest score, a two-dimensional array F is used. The algorithm is an example of dynamic programming. There is one column for each character in sequence A, and one row for each character in sequence B.
As the algorithm progresses, $F_{i,j}$ is assigned the optimal score for the alignment of the first i = 0, …, n characters in A and the first j = 0, …, m characters in B. The algorithm works in the same way regardless of the length or complexity of the sequences. The simplest and most common scoring function is the Levenshtein distance [4], which assigns a uniform score to every operation.

Determining the edit distance between a pair of strings is a fundamental problem in computer science in general, and in combinatorial pattern matching in particular, with applications ranging from database indexing and word processing to bioinformatics [3][5]. The Levenshtein distance [4] between two strings is the minimum number of editing steps that convert one string into the other. Given two strings A[1…m] and B[1…n], one can calculate their edit distance by dynamic programming. We refer to the matrix D[0..m, 0..n] as the edit distance table of strings A and B. Initially, D[i,0] = i for 0 ≤ i ≤ m and D[0,j] = j for 1 ≤ j ≤ n. The cell D[i,j], where i, j > 0, then stores the edit distance of A[1…i] and B[1…j]. We also say that the cells D[i,j] with j - i = d are on diagonal d of D [15]. These properties of the edit distance table are used repeatedly in the later discussion.

The longest common subsequence (LCS) [14] problem is to find the maximum possible length of a common subsequence of two strings a and b, of lengths |a| and |b|. Sequence similarity analysis can be cast as the LCS problem, which corresponds to an edit distance in which the substitution operation is eliminated and only insertions and deletions remain [14]. Given strings S and T of lengths n and m, respectively, over an alphabet Σ, determine the length of the longest subsequence that is common to both S and T. Here, a subsequence of S = s1 s2 … sn is a string of the form $s_{i_1} s_{i_2} \ldots s_{i_k}$, where each $i_j$ is between 1 and n and $1 \le i_1 < i_2 < \ldots < i_k \le n$.
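The length of a longest common subsequence, as defined above, can be computed with a standard dynamic program. The sketch below is a minimal illustration under that textbook formulation; the name `lcs_length` is hypothetical and does not come from the paper.

```python
def lcs_length(s: str, t: str) -> int:
    """Length of the longest common subsequence of s and t."""
    # dp[i][j] is the LCS length of s[:i] and t[:j]; row 0 and column 0 stay 0.
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Matching characters extend the best subsequence of both prefixes.
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                # Otherwise drop the last character of one of the two prefixes.
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(s)][len(t)]
```

For S = AGCGA and T = CAGATAGAG, the strings used in the example of this section, `lcs_length` returns 4.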
For example, if Σ = {A, T, C, G}, S = AGCGA, and T = CAGATAGAG, then CGA is a subsequence of length 3 of both S and T. However, it is not the longest common subsequence of S and T, since the string AGGA is a common subsequence of length 4 of both S and T. Since these two strings do not have a common subsequence of length greater than 4, the length of the longest common subsequence of S and T is 4 [16].

Bioinformatics has developed several means to characterize the similarity between genetic sequences. One intuitively appealing measure is the edit distance, originally proposed by Levenshtein [4]. To compare two strings, a common technique is the so-called string edit distance, used as a measure of their dissimilarity. The edit distance is equal to the minimum number of editing operations required to transform one sequence into the other. The three basic editing operations are insertion, deletion, and substitution [11], and a string is transformed into another string by applying them character by character. Each operation has an associated cost, which is a function of the characters involved in the operation, and the cost of a transformation is the sum of the costs of the individual operations. If each operation has a cost of 1, the cost of a transformation is its number of operations, and the minimal number of operations is called the edit distance or Levenshtein distance [4]. The P-Bigram edit distance extends the edit distance to two-character units and is likewise used for measuring similarity.

Dynamic programming (DP) is a method for optimizing a multistage process and is particularly applicable to problems requiring a sequence of interrelated decisions. Each decision transforms the current situation into a new situation.
A sequence of decisions, which in turn yields a sequence of situations, is sought that minimizes some measure of value. The value of a sequence of decisions is generally equal to the sum of the values of the individual decisions and situations in the sequence. It is a "solution seeking" concept that replaces a problem of n decision variables with a sequence of smaller, stage-wise problems. Such an approach allows analysts to make decisions stage by stage until the final result is obtained.

We use the following formulation [13]. Let S = [s1, s2, …, sm] be the source sequence, T = [t1, t2, …, tn] the target sequence, and $d_{i,j}$ the distance between the subsequences [s1, s2, …, si] and [t1, t2, …, tj]. Then, for 1 ≤ i ≤ m and 1 ≤ j ≤ n,

\[
d_{0,0} = 0, \qquad d_{i,0} = d_{i-1,0} + c(s_i, \varnothing), \qquad d_{0,j} = d_{0,j-1} + c(\varnothing, t_j),
\]
\[
d_{i,j} = \min \begin{cases}
d_{i-1,j} + c(s_i, \varnothing), \\
d_{i,j-1} + c(\varnothing, t_j), \\
d_{i-1,j-1} + c(s_i, t_j),
\end{cases}
\]

where $c(s_i, \varnothing)$ is the cost of deleting $s_i$, $c(\varnothing, t_j)$ is the cost of inserting $t_j$, and $c(s_i, t_j)$ is the cost of substituting $t_j$ for $s_i$. The edit distance between S and T is simply $d_{m,n}$.

III. MATERIALS AND METHODS

A. P-Bigram Edit Distance

We define the P-Bigram edit distance over the characters that are present in two nucleotide sequences. We therefore estimate the minimum length for which a maximal match is significant, according to the lengths of the two compared sequences. Let S = {s1, s2, s3, …, sn} be a string of n characters and T = {t1, t2, t3, …, tm} a text of m characters, both strings of nucleotide characters, with prefixes of length i and j, over the nucleotide alphabet Σ = {A, C, G, T}.

Definition 1. A P-Bigram edit distance solution for computing the edit distance between a pair of strings S = s1 s2 … sn and T = t1 t2 … tm involves filling in an (n+1) × (m+1) table P, with P[i,j] storing the edit distance between s1 s2 … si and t1 t2 … tj. In addition, let |S| and |T| denote the lengths of strings S and T. We consider the Levenshtein edit distance. The computation follows the base-case rules P[0,0] = 0, P[i,0] = P[i-1,0] + 1 (the cost of deleting $s_i$), and P[0,j] = P[0,j-1] + 1 (the cost of inserting $t_j$), and the following dynamic programming step:

\[
P[i,j] = \min \begin{cases}
P[i-1,j] + 1 & \text{(delete } s_i\text{)}, \\
P[i,j-1] + 1 & \text{(insert } t_j\text{)}, \\
P[i-1,j-1] + \mathrm{cost} & \text{(substitute } s_i \text{ with } t_j\text{)},
\end{cases}
\]

where cost = 1 if $s_i \ne t_j$ and cost = 0 if $s_i = t_j$.

Definition 2. For a string S and value of Є, let P = Є |S|, and for a string T and value of Є, let P = Є |T|. For example, over the two-letter alphabet {s, t}, let S = ATTCCGGTCAAG and T = ATTGGTTCCAAGGA. We first scan each sequence from left to right and group it into two-letter units, so that the pairs of string S are {AT, TC, CG, GT, CA, AG} and the pairs of string T are {AT, TG, GT, TC, CA, AG, GA}. The P-Bigram edit distance is the edit distance between two strings taken two characters at a time.

Given two strings S and T, a standard technique for computing P-Bigram(S, T) is dynamic programming, where we compute a DP matrix of size (|S| + 1) × (|T| + 1) for which DP[i,j] = P-Bigram(S[1..i], T[1..j]) for 1 ≤ i ≤ |S| and 1 ≤ j ≤ |T|. The matrix is computed as follows:

\[
DP[i,j] = \begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0, \\
\max(DP[i-1,j],\; DP[i,j-1]) & \text{if } i,j > 0 \text{ and } S[i] \ne T[j], \\
DP[i-1,j-1] + 1 & \text{if } i,j > 0 \text{ and } S[i] = T[j].
\end{cases}
\]

Therefore, P-Bigram = DP for S and T of lengths |S| and |T| can be computed with a time complexity of O(|S| × |T|) and a space complexity of O(min(|S|, |T|)).

In this section we describe the algorithm for finding similarity with the new P-Bigram edit distance. We first describe the P-Bigram distance, and then the reduced operations that achieve a time- and space-efficient algorithm. The reduction uses ideas similar to those of the Levenshtein distance [4].
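The two ingredients of this section, the left-to-right split into two-letter units and the table P of Definition 1, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `bigrams`, `unit_edit_table`, and `p_bigram_distance` are hypothetical, the units are taken as consecutive non-overlapping pairs (as in the example of Definition 2), and keeping a trailing single character as its own unit is an assumption for odd-length input.

```python
from typing import List


def bigrams(seq: str) -> List[str]:
    """Split seq left to right into consecutive, non-overlapping two-letter units."""
    # Assumption: an odd-length sequence keeps its last character as a short unit.
    return [seq[k:k + 2] for k in range(0, len(seq), 2)]


def unit_edit_table(src: List[str], dst: List[str]) -> List[List[int]]:
    """Fill the (n+1) x (m+1) table P of unit-cost edit distances over units."""
    n, m = len(src), len(dst)
    P = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        P[i][0] = i  # delete the first i units of src
    for j in range(m + 1):
        P[0][j] = j  # insert the first j units of dst
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == dst[j - 1] else 1
            P[i][j] = min(P[i - 1][j] + 1,         # deletion
                          P[i][j - 1] + 1,         # insertion
                          P[i - 1][j - 1] + cost)  # substitution or match
    return P


def p_bigram_distance(s: str, t: str) -> int:
    """Edit distance between s and t measured on their two-letter units."""
    return unit_edit_table(bigrams(s), bigrams(t))[-1][-1]
```

Calling `bigrams("ATTCCGGTCAAG")` yields the six units listed in Definition 2. Applying `unit_edit_table` to the seven column units shown in Fig. 1 (AT, TC, CG, GT, TC, CA, AG) and the seven units of T gives a bottom-right entry of 3, the value reported in that figure.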
ALGORITHM I
P-BIGRAM EDIT DISTANCE

EditDistance(S[1…n], T[1…m])    // S and T are given as sequences of two-letter units
    for i ← 0 to n do P[i,0] ← i
    for j ← 0 to m do P[0,j] ← j
    for i ← 1 to n do
        for j ← 1 to m do
            if S[i] = T[j] then cost ← 0    // units match, no operation needed
            else cost ← 1                   // units differ, substitution needed
            P[i,j] ← min{ P[i-1,j-1] + cost,    // substitution
                          P[i-1,j] + 1,         // deletion
                          P[i,j-1] + 1 }        // insertion
    return P[n,m]

The P-Bigram algorithm computes the edit distance as follows. Suppose we wish to calculate the edit distance between the strings S = s1, s2, …, sn and T = t1, t2, …, tm.

1. We begin by forming an (n + 1) × (m + 1) matrix P initially containing all zeros, that is, P(i,j) = 0 for i = 0, 1, 2, …, n and for j = 0, 1, 2, …, m.
2. Assign the values $P_{0,j} = j$ for j = 0, 1, 2, …, m and $P_{i,0} = i$ for i = 0, 1, 2, …, n.
3. Starting from the second row from the top and going from left to right, fill in the values $P_{i,j}$ according to the following step:
   Substitution = $P_{i-1,j-1}$ + cost
   Deletion = $P_{i-1,j}$ + 1
   Insertion = $P_{i,j-1}$ + 1
   $P_{i,j}$ = min(substitution, deletion, insertion),
   where cost = 0 if $s_i = t_j$ and cost = 1 if $s_i \ne t_j$.
4. After computing a row, move to the row below, until the bottom row is reached.
5. The value $P_{n,m}$ is the edit distance between the strings S and T.

Example. Here is an example of computing the P-Bigram edit distance of two strings taken two characters at a time.

      Si    AT   TC   CG   GT   TC   CA   AG
Tj     0     1    2    3    4    5    6    7
AT     1     0    1    2    3    4    5    6
TG     2     1    1    2    3    4    5    6
GT     3     2    2    2    2    3    4    5
TC     4     3    2    3    3    2    3    4
CA     5     4    3    4    4    3    2    3
AG     6     5    4    5    5    4    3    2
GA     7     6    5    6    6    5    4    3

Fig. 1 P-Bigram edit distance of Si = ATTCCGGTCAAG and Tj = ATTGGTTCCAAGGA.

The values of the above table have been obtained with the following unitary costs (Fig. 1, Fig. 2): Sub(s,t) = 1 if s ≠ t and Sub(s,t) = 0 otherwise, and Insert(t) = Delete(t) = 1, for s, t ∈ P. A score (instead of a cost) is associated with each elementary edit operation. For s, t ∈ P:
- Sub(s,t) denotes the score of substituting the character t for the character s,
- Del(t) denotes the score of deleting the character t,
- Ins(t) denotes the score of inserting the character t.

Fig. 2 The DP matrix for P-Bigram(S,T), where S = ATTCCGGTCAAG and T = ATTGGTTCCAAGGA (axes: S and T)

The main idea behind the P-Bigram edit distance described above is that each entry $P_{i,j}$ corresponds to the minimal number of editing operations required to transform the substring $S_i$ = s1, s2, …, si into the substring $T_j$ = t1, t2, …, tj. Initially, an empty string is transformed into a string of k characters by using exactly k insertions (Step 2). Explanation of Step 3:
- If we can transform $S_i$ into $T_{j-1}$ in $P_{i,j-1}$ operations, then we can transform $S_i$ into $T_j$ in insertion = $P_{i,j-1}$ + 1 operations by simply adding the character $t_j$ to $T_{j-1}$.
- If we can transform $S_{i-1}$ into $T_j$ in $P_{i-1,j}$ operations, then we can transform $S_i$ into $T_j$ in deletion = $P_{i-1,j}$ + 1 operations by simply deleting the character $s_i$ from $S_i$.
- If we can transform $S_{i-1}$ into $T_{j-1}$ in $P_{i-1,j-1}$ operations, then we can transform $S_i$ into $T_j$ in substitution = $P_{i-1,j-1}$ + cost operations by simply substituting the character $s_i$ with $t_j$ if they are different (cost = 1).
- The minimal number of operations required to transform $S_i$ into $T_j$ is the minimum of the three quantities: $P_{i,j}$ = min(substitution, deletion, insertion).

IV. EVALUATION AND RESULTS

In this section, we evaluate a similarity measure, so our evaluation focuses on the edit distance from a set of source nucleotide sequences to a set of target nucleotide sequences. We suppose that the target set is a portion of the nucleotide data that may contain other similar nucleotide sequences. The P-Bigram edit distance task consists of comparing two strings containing virus nucleotide sequences, two characters at a time, in order to decide whether the two strings refer to the same entity. A data set used for testing dynamic programming techniques on virus nucleotide sequences is usually represented as two sets of strings and a subset of their Cartesian product that defines the valid matches. We measured the percentage similarity and the number of edits of the matched characters in the strings, following the definitions in Table I.

TABLE I
NOTATION

sim               Similarity between S and T, related to their commonality
sim_edit(x,y)     Similarity of edit
editDist(x,y)     Minimum number of character operations (insertion, deletion, substitution)

Several metrics to evaluate the effectiveness of character identification techniques have been proposed, combining such criteria. The DNA similarity measure used here relates the number of matches identified to the total number of edits. The results are given as the number of edits (Table II) and the percent similarity (Table III) [12][17].

\[
\mathrm{sim}_{edit}(x,y) = \frac{1}{1 + \mathrm{editDist}(x,y)}
\]

where editDist(x,y) is the minimum number of insertion, deletion, and substitution operations needed to transform one DNA string into the other.

Example. Here is an example of computing the P-Bigram similarity of two strings, two characters at a time (ATTCCGGTCAAG, ATTGGTTCCAAGGA):

\[
\mathrm{sim}_{edit}(s,t) = \frac{1}{1 + 3} = 0.25 = 25.00\%
\]

Our training set is an export of the NCBI (National Center for Biotechnology Information) data. To simplify the evaluation, we set a threshold to decide whether a nucleotide difference counts as a small edit operation. Using our trained similarity measure, we computed the similarity between two strings, two characters at a time, and looked for the minimum edit distance (similarity) that was able to reduce the edit operations. Our evaluation concentrates on matching two-character nucleotide units. We summarize the results using the number of edits and the similarity in Table I, Table II, Table III, Fig. 4, and Fig. 5.

TABLE II
NUMBER EDIT FOR PAIR OF DNA SEQUENCE

Source          Destination        LD Edit   LCS Edit   ND Edit   PB Edit
ATTCCGGTCAAG    ATTGGTTCCAAGGA     6         10         7         3
GAATTCAGTTA     ATTGGTTCCCAAGGA    5         6          8         4
GCATCGGTAATT    ATCTCGGACG         7         6          6         5
GCCCTAGCG       GCGCAATG           4         5          8         3
TGATCGATC       CTGATCGATC         1         9          7         1

(LD = Levenshtein distance, LCS = Longest Common Subsequence, ND = Needleman, PB = P-Bigram)

By using the edit distance we find the possible edits of a sequence. We then compare the number of edits produced by the different algorithms. A better approach is one that yields the minimum number of edits.

TABLE III
SIMILARITY FOR PAIR OF DNA CHARACTERS IN MANY ALGORITHM

Source          Destination        LD Similarity   LCS Similarity   ND Similarity   PB Similarity
ATTCCGGTCAAG    ATTGGTTCCAAGGA     14.29%          9.09%            12.50%          25.00%
GAATTCAGTTA     ATTGGTTCCCAAGGA    16.67%          14.29%           11.11%          20.00%
GCATCGGTAATT    ATCTCGGACG         12.50%          14.29%           14.28%          16.67%
GCCCTAGCG       GCGCAATG           20.00%          16.67%           11.11%          25.00%
TGATCGATC       CTGATCGATC         50.00%          10.00%           12.50%          50.00%

Fig. 4 Number Edit (y-axis: Number Edit, 0 to 12; x-axis: Dataset, 1 to 5; series: Levenshtein, Longest Common Subsequence, Needleman, P-Bigram)

Fig. 4 shows the comparison of the different algorithms with the sim_edit metric. The current metric gives good performance in reducing the number of edits compared with other popular methods.

Fig. 5 Comparison of different algorithms with percentage similarity (y-axis: Percentage Similarity, 0.00% to 60.00%; x-axis: Dataset, 1 to 5; series: Levenshtein, Longest Common Subsequence, Needleman, P-Bigram)

In Fig. 5, the line marked with crosses shows the P-Bigram algorithm, whereas the longest common subsequence algorithm, the Needleman algorithm, and the Levenshtein algorithm are shown by squares, triangles, and diamonds, respectively.
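The similarity measure of this section, sim_edit(x,y) = 1 / (1 + editDist(x,y)), is a direct function of the edit count. The short Python sketch below illustrates it; the helper names `sim_edit` and `as_percent` are hypothetical, and the edit counts in the usage lines are the PB Edit values reported in Table II.

```python
def sim_edit(edit_dist: int) -> float:
    """Similarity derived from an edit count: 1 / (1 + editDist)."""
    if edit_dist < 0:
        raise ValueError("edit distance must be non-negative")
    return 1.0 / (1.0 + edit_dist)


def as_percent(value: float) -> str:
    """Format a similarity in [0, 1] as a percentage with two decimals."""
    return f"{100.0 * value:.2f}%"


if __name__ == "__main__":
    pb_edits = [3, 4, 5, 3, 1]  # PB Edit column of Table II
    print([as_percent(sim_edit(d)) for d in pb_edits])
```

With these five edit counts the sketch prints 25.00%, 20.00%, 16.67%, 25.00%, and 50.00%, which agrees with the PB Similarity column of Table III.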
The P-Bigram edit distance is a cheap distance measure that, in our experiments, returned a distance no larger than the unigram edit distance; typical results are shown in Fig. 4 and Fig. 5. For given strings S and T, this distance can be calculated in O(n^2) time. Because the P-Bigram edit distance is computed on the two-character transformation of the strings, it was no larger than the single-character edit distance on the tested pairs, and the corresponding similarity was no smaller.

V. CONCLUSION

We proposed the P-Bigram edit distance method, which integrates the dynamic programming concept to compare similarity. The edit distance implements a heuristic that gives greater importance, in the combined measure, to the pairs of strings whose similarity is higher than that of other pairs. The proposed method was tested on 5 DNA-matching datasets against representative character-based string measures such as the Levenshtein distance. The results showed that the P-Bigram method can improve performance. The resulting process can then compute similarities between two nucleotide strings, two characters at a time, with quite high accuracy. However, there are several improvements that we will try to address in the future.

Patsaraporn Somboonsak received the Master's degree in Management Information Technology from Walailak University, Thailand, in 2006. She is currently a Ph.D. student at the Faculty of Information Science and Technology, Mahanakorn University of Technology, Thailand. Her research interests are bioinformatics and computation for health and medicine.

Mud-Armeen Munlin holds a Ph.D. degree in Computer Science from The University of Leeds, UK, awarded in 1995. He is an Assistant Professor and has worked as Associate Dean for Academics at the Faculty of Information Science and Technology, Mahanakorn University of Technology, Thailand. He specializes in solid modeling, virtual reality, medical applications, and internet and mobile applications.
ACKNOWLEDGMENT

This work was financially supported by the Faculty of Information Science and Technology, Mahanakorn University of Technology.

REFERENCES

[1] Saul B. Needleman, Christian D. Wunsch, "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins", J. Mol. Biol., Vol. 48, 1970, pp. 443-453.
[2] David Sankoff, "Simultaneous solution of the RNA folding, alignment and protosequence problems", SIAM J. Appl. Math., Vol. 45, 1985, pp. 810-825.
[3] Adrien Coyette, et al., "Trainable Sketch Recognizer of Graphical User Interface Design", International Federation for Information Processing, Vol. 1, 2007, pp. 124-135.
[4] Levenshtein, V.I., "Binary codes capable of correcting deletions, insertions and reversals", Soviet Physics Doklady, Vol. 8, 1966, pp. 705-710.
[5] Gusfield, "Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology", Cambridge University Press, 1997.
[6] Hall, P.A.V., Dowling, G.R., "Approximate string matching", ACM Computing Surveys, Vol. 4, 1980, pp. 381-402.
[7] Christen, P., "A Comparison of Personal Name Matching Techniques and Practical Issues", Technical Report TR-CS-06-02, Joint Computer Science Technical Report Series, Department of Computer Science, 2006.
[8] Cohen, W., Ravikumar, P., Fienberg, S., "A comparison of string distance metrics for name-matching tasks", IJCAI Workshop on Information Integration on the Web, Acapulco, Mexico, 2003, pp. 73-78.
[9] AbdulJaleel, N., Larkey, L.S., "Statistical transliteration for English-Arabic cross language information retrieval", CIKM, 2003, pp. 139-146.
[10] Linden, K., "Multilingual Modeling of Cross-Lingual Spelling Variants", Information Retrieval, Vol. 3, 2006, pp. 295-310.
[11] Ristad, E.S., Yianilos, P.N., "Learning string-edit distance", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.
[12] Carlo Batini, Monica Scannapieco, "Data Quality: Concepts, Methodologies and Techniques", Springer, DCSA, 1998, pp. 117-127.
[13] Heikki Hyyrö, Ayumi Shinohara, "New Bit-Parallel-Distance Algorithm", Nikoletseas, LNCS 3503, 2005, pp. 380-390.
[14] Adrian Horia Dediu, et al., "A fast Longest Common Subsequence Algorithm for Similar Strings", Language and Automata Theory and Applications, International Conference, LATA, 2010, pp. 82-93.
[15] Heikki Hyyrö, "Restricted Transposition Invariant Approximate String Matching Under Edit Distance", SPIRE, LNCS 3772, 2005, pp. 256-266.
[16] M.H. Alsuwaiyel, "Algorithms: design techniques and analysis", World Scientific, Vol. 7, 1999, pp. 203-208.
[17] Dekang Lin, "An Information Theoretic Definition of similarity", Proceedings of the 15th international conference on statistic, 1998, Citeseer.