Let P and T be a pattern string and a text string, respectively. The one-dimensional discretely scaled pattern matching problem asks for all valid positions in T at which some discrete scale of P occurs. Amir et al. first showed that this problem can be solved in O(n) time by adapting Eilam-Tzoreff and Vishkin's algorithm. Recently, Wang et al. showed that when the alphabet of T is finite, the problem can also be answered in O(|P| + U_d) time after a preprocessing that takes O(n log n) time and O(n log n) space, where U_d denotes the number of reported positions. For integer alphabets and unbounded alphabets, Wang's preprocessing can also be implemented in O(n log n) time and O(n log n) space, achieving O(|P| + U_d + log n) time to report all valid positions. In this paper, we propose the best known preprocessings for the one-dimensional discretely scaled pattern matching problem. For constant-sized alphabets, we propose an optimal preprocessing, which requires O(n) time and repor...
The longest common subsequence (LCS) problem has been widely discussed and is regarded as a measure of the relationship among sequences. Let A′ and B′ be subsequences of A and B, respectively. The merged sequence E(A, B) is composed of A′ and B′. In this paper, we consider the merged-LCS problem, denoted as LCS(T, E(A, B)), for measuring the relationship among three sequences T, A and B. We first propose an algorithm for solving the merged-LCS problem, whose time complexity is O(n^3), where n is the sequence length. We further discuss a variant of the merged-LCS problem with a block constraint, that is, the block information of A and B is given in advance. For the blocked merged-LCS problem, we propose an algorithm with time complexity O(n^2 m), where m is the number of blocks. An improved O(n^2 + nm^2) algorithm is proposed for the same blocked merged-LCS problem by using preprocessing. Key words: longest common subsequence, dynamic programming, seque...
Abstract: Essential proteins deeply affect cellular life, but they are hard to identify. Protein-protein interaction is one way to reveal whether a protein is essential or not. Many researchers use feature sets composed of topological properties of the protein-protein interaction network to predict essential proteins. However, the functionality of a protein is also a clue to its essentiality. The goal of this paper is to build SVM models for predicting essential proteins. In our experiments, we download Scere20070107, which contains 4873 proteins and 17166 interactions, from the DIP database. The ratio of essential proteins to nonessential proteins is nearly 1:4, so the dataset is imbalanced. On the imbalanced dataset, the best values of F-measure, MCC, AIC and BIC of our models are 0.5197, 0.4671, 0.2428 and 0.2543, respectively. We build another, balanced dataset with ratio 1:1. For the balanced dataset, the best values of F-measure, MCC, AIC and BIC of our mod...
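As a rough illustration of two of the metrics quoted above, the following sketch computes the F-measure and the Matthews correlation coefficient (MCC) from a binary confusion matrix using their standard textbook definitions. The function names and the confusion-matrix counts in the usage note are hypothetical and are not taken from the paper; the paper's own evaluation code is not shown in the source.

```python
import math


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

With hypothetical counts such as tp=5, fp=5, fn=5, tn=85, the positive class is rare, which is the situation where MCC and F-measure are usually preferred over plain accuracy.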
In this paper, we propose a generalization of quicksort to solve the problem of sorting the first k largest elements in a set of n elements. We analyze the average number of comparisons required for solving the problem and express the result in terms of harmonic numbers [formulas lost in extraction]. Key words: complexity analysis, quicksort, divide-and-conquer, generalization.
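The idea of sorting only the k largest elements with a quicksort-style partition can be sketched as follows. This is a minimal illustration under our own assumptions (random pivot, three-way partition into new lists, descending output), not the authors' algorithm, whose details and comparison counts are not recoverable from the abstract; the name `top_k_sorted` is hypothetical.

```python
import random


def top_k_sorted(items, k):
    """Return the k largest elements of items in descending order.

    Quicksort-style sketch: partition around a pivot, fully handle the
    'larger' side first, and recurse into the 'smaller' side only if
    fewer than k elements have been collected so far.
    """
    if k <= 0 or not items:
        return []
    pivot = random.choice(items)
    larger = [x for x in items if x > pivot]
    equal = [x for x in items if x == pivot]
    smaller = [x for x in items if x < pivot]

    # At most k elements come back, already in descending order.
    result = top_k_sorted(larger, k)
    remaining = k - len(larger)
    if remaining > 0:
        result += equal[:remaining]
        remaining -= len(equal)
    if remaining > 0:
        result += top_k_sorted(smaller, remaining)
    return result
```

When k equals n the sketch degenerates to an ordinary (descending) quicksort; for small k most of the "smaller" partitions are never sorted, which is where the savings come from.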
Multiple sequence alignment (MSA) is a fundamental technique of molecular biology. Biological sequences are aligned with each other vertically in order to show the similarities and differences among them. In this paper, we first propose an efficient group alignment method to perform the alignment between two groups of sequences. Its time complexity is O(mnL_1L_2), where m and n are the numbers of sequences in the two groups, and L_1 and L_2 are the lengths of the sequences in the two groups. Then we propose a clustering method to build the tree topology for merging, which is a top-down heuristic. The clustering method is based on the concept that the two sequences having the longest distance should be split into two clusters. The time complexity of our MSA algorithm is O([expression lost in extraction]), where n is the number of sequences and L is the maximum length of all sequences. In our experiments, both the alignment quality and the required time of our algorithm are better than those of the Clustal W algorithm (using the...
The biochemical functions of proteins are determined by their structures. Thus one of the most important issues in the life sciences is to predict three-dimensional structures from protein sequences, and then to deduce their biochemical functions. To simplify the problem, scientists use the lattice model to approximate real protein structures, but the two cannot be compared directly. We therefore use curve fitting, such as B-splines, to convert both the lattice model and a real structure to curves, so that the difference between them can be examined on an equal footing. Besides, the curve alignment can also be used as another measure of the similarity between two real protein structures. We then propose an algorithm to develop a protein structure prediction methodology based on a protein of known structure, where the two protein sequences are extremely similar. According to the experimental results, our protein structure prediction method performs well when we get two protein se...
Given a graph, how do we represent the paths in the graph with the least information? This induces the path compression problem on graphs, in which we are asked to represent a path by using fewer edges or vertices so that any two different paths remain distinguishable. There are two versions of the compression problem: edge compression and vertex compression. For the edge compression problem, we show that it can be solved in linear time on general graphs. For the vertex compression problem, we prove that it is NP-hard. Besides, we propose a polynomial-time heuristic algorithm to solve it. We also report experimental results that illustrate the efficiency of our heuristic algorithm.
International Journal of Foundations of Computer Science, 2010
Given two sequences S1, S2, and a constrained sequence C, a longest common subsequence of S1, S2 with restriction to C is called a constrained longest common subsequence of S1 and S2 with C. At the same time, an optimal alignment of S1, S2 with restriction to C is called a constrained pairwise sequence alignment of S1 and S2 with C. Previous work has shown that the constrained longest common subsequence problem is a special case of the constrained pairwise sequence alignment problem, and that both of them can be solved in O(rnm) time, where r, n, and m represent the lengths of C, S1, and S2, respectively. In this paper, we extend the definition of constrained pairwise sequence alignment to a more flexible version, called weighted constrained pairwise sequence alignment, in which some constraints might be ignored. We first give an O(rnm)-time algorithm for solving the weighted constrained pairwise sequence alignment problem, then show that our extension can be adopted to solve...
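The O(rnm) bound for the (unweighted) constrained LCS problem mentioned above can be illustrated with the well-known three-dimensional dynamic program. The sketch below is our own illustration of that classical recurrence, not the weighted algorithm proposed in the paper; the function name and the choice to return `None` for infeasible inputs are assumptions.

```python
def constrained_lcs_length(s1, s2, c):
    """Length of a longest common subsequence of s1 and s2 that contains
    c as a subsequence, or None if no such common subsequence exists.

    table[k][i][j] is the best length for s1[:i], s2[:j] and c[:k];
    filling it takes O(r * n * m) time with r = len(c), n = len(s1),
    m = len(s2).
    """
    neg_inf = float("-inf")
    n, m, r = len(s1), len(s2), len(c)
    # With k == 0 the empty prefixes give length 0; with k > 0 an empty
    # prefix of s1 or s2 cannot satisfy the constraint.
    table = [
        [[0 if k == 0 else neg_inf for _ in range(m + 1)] for _ in range(n + 1)]
        for k in range(r + 1)
    ]
    for k in range(r + 1):
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if s1[i - 1] == s2[j - 1]:
                    if k > 0 and s1[i - 1] == c[k - 1]:
                        # The match also consumes one constraint symbol.
                        table[k][i][j] = table[k - 1][i - 1][j - 1] + 1
                    else:
                        table[k][i][j] = table[k][i - 1][j - 1] + 1
                else:
                    table[k][i][j] = max(table[k][i - 1][j], table[k][i][j - 1])
    best = table[r][n][m]
    return None if best == neg_inf else best
```

With an empty constraint the function reduces to the ordinary LCS length, which is a convenient sanity check.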
International Journal of Foundations of Computer Science, 1997
Fault tolerance is one of the advantages of multiprocessor systems. In this paper, we prove that the fault tolerance of an n-star graph is 2n - 5 with restriction to the forbidden faulty set. We also propose an algorithm for examining the connectivity of an n-star graph when there exist at most 2n - 4 faults. The algorithm requires O(n^2 log n) time. Besides, we improve the fault-tolerant routing algorithm proposed by Bagherzadeh et al. by calculating the cycle structure of a permutation and by avoiding routing a message to a node without any nonfaulty neighbor. This calculation needs only constant time. We then propose an efficient fault-tolerant broadcasting algorithm. When there is no fault, our broadcasting algorithm remains optimal. The penalty is O(n) if there exists only one fault, and O(n^2) if there exist at most n - 2 faults.
Abstract: To solve the problem of cysteine state classification, we propose a two-stage prediction method. In the first stage, we invoke an SVM to get the initial prediction. The features involved in the SVM classification include the local PSSM profile, the order of cysteines with the normalized protein length, physicochemical properties and structure probabilities. Then, in the second stage, we propose a tuning method for refining the predicted result obtained by the SVM. We validate it with a dataset derived from PDB, which contains 969 non-homologous proteins and 4136 cysteines. We adopt a 20-fold cross-validation test and achieve 90.7% accuracy and a Matthews correlation coefficient of 0.79. With our tuning method, we improve on the initial prediction by about 20% in protein-based accuracy and 5% in cysteine-based accuracy. The prediction accuracies are better than those of previous works. Index Terms: bioinformatics, SVM, feature selection, protein, cysteine, disulfid...
In the time series classification (TSC) problem, calculating the distance between two time series is the core issue. One of the best-known methods for the distance calculation is dynamic time warping (DTW), which is based on dynamic programming and has \(O(n^2)\) time complexity. It takes a very long time when the data size is large. To overcome this, dynamic time warping with window (DTWW) incorporates a warping window into the DTW calculation. This method reduces the computation time by restricting the number of possible solutions, so the answer of DTWW may not be the optimal solution. In this paper, we propose the minimum-first DTW method (MDTW), which expands the possible solutions in minimum-first order. Our method not only reduces the required computation time, but also obtains the optimal answer.
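The trade-off between plain DTW and windowed DTW described above can be sketched in a few lines. The code below is a minimal illustration assuming an absolute-difference local cost and a Sakoe-Chiba style band of half-width `window`; it is not the MDTW method proposed in the paper, whose details are not given in the abstract, and the function name is hypothetical.

```python
import math


def dtw_distance(a, b, window=None):
    """Dynamic time warping distance between numeric sequences a and b.

    With window=None the full O(len(a) * len(b)) table is filled.
    With an integer window, only cells with |i - j| <= window are
    considered, which is faster but may miss the optimal warping path.
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)
    # The band must be wide enough to reach the final cell (n, m).
    window = max(window, abs(n - m))

    dist = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dist[i][j] = cost + min(
                dist[i - 1][j],      # a[i-1] repeats against b[j-1]
                dist[i][j - 1],      # b[j-1] repeats against a[i-1]
                dist[i - 1][j - 1],  # both sequences advance
            )
    return dist[n][m]
```

For the sequences `[0, 0, 0, 1, 0]` and `[0, 1, 0, 0, 0]`, the unrestricted call returns 0 because the two peaks can be warped onto each other, while `window=1` returns 2 because the band forbids that alignment. This is the loss of optimality that the abstract attributes to DTWW.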
It is widely believed that the function of a protein is determined by its structure. The more similar two protein structures are, the more similar their functions are. The distance RMSD (root mean square deviation) is a popular measure used by most researchers for the distance (or similarity) between two protein structures, where usually one is the predicted structure and the other is the real structure. In this paper, we propose a new algorithm to compare two protein structures, which combines sequence alignment with B-spline curve fitting in space. To test and verify our method, we randomly choose some families in the CATH database and try to identify them. Experimental results show that our method outperforms the distance RMSD method. Furthermore, we apply an SVM (support vector machine) tool to obtain better classifications.
The protein side chain packing problem (PSCPP) is an essential issue for predicting structure in proteomics. PSCPP has been proved to be NP-hard. In this paper, we propose a method for solving PSCPP by transforming it to the graph clique problem, and then applying the ant colony optimization (ACO) algorithm to solve it. We build the coordinate rotamer library based on the pair of dihedral angles of backbones to reduce the required time. To evaluate the goodness of a solution of the ACO algorithm, we use a simple score function with four factors: disulfide bonds, intermolecular hydrogen bonds, charge-charge interactions and van der Waals interactions. The experimental results show that our score function is biologically sensible. We compare our computational results with the results of SCWRL 3.0 and the residue-rotamer-reduction (R3) algorithm. Our method is more accurate than both of them.
A huge amount of genomic information, including protein and DNA sequences, has been generated by the human genome project. One way to decipher these biological sequences is to detect local residue patterns in them. However, detecting unknown patterns from multiple sequences is still very difficult. In this paper, we propose an algorithm, based on the Gibbs sampler method, for identifying local consensus patterns (motifs) in monomolecular sequences. We first design an ACO (ant colony optimization) algorithm to find a good initial solution and a set of better candidate positions for revising the motif. Then the Gibbs sampler method is applied with these better candidate positions as the input. The time required for finding motifs using our algorithm is reduced drastically. It takes only 20% of the time of the Gibbs sampler method while maintaining comparable quality.
The longest common subsequence (LCS) problem with gap constraints (or the gapped LCS), which has applications to genetics and molecular biology, is an interesting and useful variant of the LCS problem. Previous work showed that this problem can be solved in O(nm) time when the gap constraints are fixed to a single integer, where n and m denote the lengths of the two input sequences, respectively. In this paper, we generalize the problem from fixed gaps to variable gap constraints, offering a new flexible approach for sequence analysis. By using an efficient technique for incremental suffix maximum queries, we show that this generalized problem can be solved in O(nm) time, which improves the previous result.
To use computers more efficiently, distributed systems are a good choice. Thus, the mutual exclusion problem in distributed systems is an important issue. One of the ways to solve the mutual exclusion problem is the coterie protocol, which was proposed by Garcia-Molina and Barbara [2]. A coterie under U (U is the set of all the nodes in the distributed system) consists of a set of quorums in which each quorum is a subset of U, and the intersection of any pair of quorums is nonempty. This is called the intersection property. The other property of quorums is minimality: no quorum contains another quorum. With these two properties, a coterie can be used to solve the mutual exclusion problem in a distributed system. Any node that wants to enter the critical section must obtain the permissions of all nodes in a quorum, and must release the permissions when it leaves the critical section. The permission can be given to at most one node in the distribute...
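The two defining properties of a coterie stated above (intersection and minimality) are easy to check mechanically. The following sketch is our own illustration of those definitions; the function name `is_coterie` and the additional requirement that every quorum be a nonempty subset of U are assumptions made for the example.

```python
from itertools import combinations


def is_coterie(quorums, universe):
    """Check whether `quorums` forms a coterie under `universe`.

    Intersection property: every pair of quorums shares a node.
    Minimality: no quorum contains another quorum.
    """
    nodes = frozenset(universe)
    qs = [frozenset(q) for q in quorums]
    # Each quorum must be a nonempty subset of the universe.
    if not qs or any(not q or not q <= nodes for q in qs):
        return False
    for a, b in combinations(qs, 2):
        if not (a & b):
            return False  # intersection property violated
        if a <= b or b <= a:
            return False  # minimality violated
    return True
```

For example, the majority quorums `{1, 2}`, `{2, 3}`, `{1, 3}` over the nodes `{1, 2, 3}` pass both checks, so any two nodes competing for the critical section must ask at least one common node for permission.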