Eloi Araujo

    Ancestral reconstruction is a classic task in comparative genomics. Here, we study the genome median problem, a related computational problem which, given a set of three or more genomes, asks to find a new genome that minimizes the sum of pairwise distances between it and the given genomes. The distance stands for the amount of evolution observed at the genome level, for which we determine the minimum number of rearrangement operations necessary to transform one genome into the other. For almost all rearrangement operations the median problem is NP-hard, with the exception of the breakpoint median that can be constructed efficiently for multichromosomal circular and mixed genomes. In this work, we study the median problem under a restricted rearrangement measure called c4-distance, which is closely related to the breakpoint and the DCJ distance. We identify tight bounds and decomposers of the c4-median and develop algorithms for its construction, one exact ILP-based and three combin...
    In computational biology, mapping a sequence s onto a sequence graph G is a significant challenge. One possible approach to addressing this problem is to identify a walk p in G that spells a sequence which is most similar to s. This problem is known as the Graph Sequence Mapping Problem (GSMP). In this paper, we study an alternative problem formulation, namely the De Bruijn Graph Sequence Mapping Problem (BSMP), which can be stated as follows: given a sequence s and a De Bruijn graph G_k (where k ≥ 2), find a walk p in G_k that spells a sequence which is most similar to s according to a distance metric. We present both exact algorithms and approximate distance heuristics for solving this problem, using edit distance as a criterion for measuring similarity.
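To make the objects in this abstract concrete, here is a minimal Python sketch. It assumes one common De Bruijn graph convention (nodes are the distinct k-mers of the input sequences, with an arc from u to v when the last k-1 symbols of u equal the first k-1 symbols of v); the paper may define the graph differently. The function names `de_bruijn_graph`, `is_walk`, `spell` and `edit_distance` are hypothetical, and the sketch only evaluates a given candidate walk. It is not one of the exact algorithms or heuristics the abstract refers to.

```python
def de_bruijn_graph(sequences, k):
    """Adjacency dict: node = k-mer, arc u -> v when u[1:] == v[:-1] (assumed convention)."""
    kmers = {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
    return {u: sorted(v for v in kmers if u[1:] == v[:-1]) for u in kmers}


def is_walk(graph, walk):
    """True if every node exists and consecutive nodes are joined by an arc."""
    return all(u in graph for u in walk) and all(
        v in graph[u] for u, v in zip(walk, walk[1:])
    )


def spell(walk):
    """Sequence spelled by a walk: the first k-mer plus the last symbol of each later node."""
    return walk[0] + "".join(v[-1] for v in walk[1:])


def edit_distance(a, b):
    """Unit-cost (Levenshtein) edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


graph = de_bruijn_graph(["ACGT"], k=2)
walk = ["AC", "CG", "GT"]
score = edit_distance(spell(walk), "ACCT")  # how far the spelled sequence is from s
```

Solving the mapping problem itself would mean searching over all walks for the one that minimizes this score; the sketch deliberately leaves that search out.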
    In the median problem we are given a set of three or more genomes and want to find a new genome minimizing the sum of pairwise distances between it and the given genomes. For almost all rearrangement operations the median problem is NP-hard. We study the median problem under a restricted rearrangement measure called c4-distance, which is closely related to breakpoint and DCJ distances. We propose two algorithms for its construction, one exact ILP-based and a combinatorial heuristic, and perform experiments on simulated data.
    Boolean networks are discrete-time dynamic systems that have been used as a model for a wide range of applications in different areas, especially in Systems Biology. The analysis of Boolean networks includes the search for attractors, which may represent important biological conditions such as gene expression patterns in models of gene regulatory networks, among others. Attractors can be found by exploring the network paths through solving an instance of the SAT problem, which is known to be NP-complete. In this paper, we propose an approach to find all attractors by first transforming the corresponding instance of the SAT problem to a Hitting Set instance in linear time through a new direct linear reduction. The instance of the Hitting Set problem is then solved by applying a fast parallel algorithm implemented on a GPU. As a proof of principle, we tested the method on Boolean networks with 3 and 4 variables, returning the result in about 3 seconds and 9 hours, respectively. For larger networks the execution time grows substantially due to the algorithm used in the Hitting Set problem solver. Nevertheless, the result achieved for networks with 3 and 4 variables encourages improvements in the method for dealing with large-scale Boolean networks, especially by incorporating some parameter restrictions based on prior information about the structure of the state transition graphs and by optimizing the method by means of dynamic programming and parallelism.
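For readers unfamiliar with attractors, the following Python sketch shows what is being searched for. It is not the SAT-to-Hitting-Set reduction or the GPU solver described in the abstract: it simply enumerates every state of a tiny synchronous Boolean network and follows the trajectory until it repeats. The synchronous update scheme, the function `attractors`, and the three update rules in `example_update` are assumptions made only for illustration.

```python
from itertools import product


def attractors(update, n):
    """Return all attractors (fixed points and cycles) of a synchronous
    Boolean network with n variables, by exhaustive state enumeration.
    `update` maps a state tuple to the next state tuple."""
    found = set()
    for state in product((0, 1), repeat=n):
        seen = {}   # state -> position along the trajectory
        path = []
        while state not in seen:
            seen[state] = len(path)
            path.append(state)
            state = update(state)
        cycle = path[seen[state]:]          # the repeating part of the trajectory
        start = cycle.index(min(cycle))     # rotate so each cycle has one canonical form
        found.add(tuple(cycle[start:] + cycle[:start]))
    return found


def example_update(state):
    """Hypothetical 3-variable network: a' = b, b' = a, c' = a AND b."""
    a, b, c = state
    return (b, a, a and b)


result = attractors(example_update, 3)
# Two fixed points, (0,0,0) and (1,1,1), and one 2-cycle between (0,1,0) and (1,0,0).
```

The enumeration visits all 2^n states, so it is only usable for very small n.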
    An important problem in Computational Biology is to determine genetic markers, substrings of a set of sequences that do not occur in sequences of other sets. Applications of this problem include finding small specific regions for primer design and finding specific organisms or sequences in metagenomes. Genetic markers can be addressed by the Specific Substring Problem (SSP), which consists of finding all minimal substrings in a given set of sequences with at least k differences from all the substrings in another sequence set. Since solving this problem takes quadratic time when the Hamming distance is considered and we have, in general, a large volume of data to be processed, this solution becomes impractical. With this in mind, the main focus of this work is to propose and investigate the use of heuristic and parallel approaches for the SSP, whose effectiveness was verified in experiments with artificial and real data.
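The problem statement can be illustrated with a naive brute-force Python sketch. It is not the quadratic-time algorithm nor any of the heuristic or parallel approaches the abstract mentions. It assumes the following reading of the definition: a substring x of a sequence in A is specific if its Hamming distance to every substring of the same length in B is at least k, and it is minimal if no proper substring of x is specific. It also assumes that a substring longer than every sequence in B counts as specific. All function names are hypothetical.

```python
def hamming(x, y):
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))


def is_specific(x, B, k):
    """True if x is at Hamming distance >= k from every same-length substring of B."""
    m = len(x)
    return all(
        hamming(x, t[j:j + m]) >= k
        for t in B
        for j in range(len(t) - m + 1)
    )


def minimal_specific_substrings(A, B, k):
    """Brute-force search for all minimal specific substrings of A against B (k >= 1)."""
    result = set()
    for s in A:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                x = s[i:j]
                if is_specific(x, B, k):
                    # x is the shortest specific substring starting at i, so its
                    # prefix x[:-1] is not specific; extending a specific substring
                    # keeps it specific, so x is minimal iff x[1:] is not specific.
                    if len(x) == 1 or not is_specific(x[1:], B, k):
                        result.add(x)
                    break
    return result


markers = minimal_specific_substrings(["GGTT"], ["ACAC"], k=2)
```

The triple loop plus the scan over B makes this far slower than the algorithms discussed in the abstract; it is meant only to pin down the definition.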
    One of the most important concepts in biological network analysis is that of network motifs, which are patterns of interconnections that occur in a given network at a frequency higher than expected in a random network. In this work we are interested in searching and inferring network motifs in a class of biological networks that can be represented by vertex-colored graphs. We show the computational complexity for many problems related to colorful topological motifs and present efficient algorithms for special cases. We also present a probabilistic strategy to detect highly frequent motifs in vertex-colored graphs. Experiments on real data sets show that our algorithms are very competitive both in efficiency and in quality of the solutions.
    Sequence alignment supports numerous tasks in bioinformatics, natural language processing, pattern recognition, social sciences, and other fields. While the alignment of two sequences may be performed swiftly in many applications, the simultaneous alignment of multiple sequences has proved to be naturally more intricate. Although most multiple sequence alignment (MSA) formulations are NP-hard, several approaches have been developed, as they can outperform pairwise alignment methods or are necessary for some applications. Taking into account not only similarities but also the lengths of the compared sequences (i.e. normalization) can provide better alignment results than either unnormalized or post-normalized approaches. While some normalized methods have been developed for pairwise sequence alignment, none have been proposed for MSA. This work is a first effort towards the development of normalized methods for MSA. We discuss multiple aspects of normalized multiple sequence alignment (NM...
    Given two sets of sequences A and B, the Substring Specific problem is to find all minimum substrings in A having distance at least k for each subsequence in B. This work addresses three new implementations for the Maaß algorithm when the Hamming distance is considered: a naive cubic-time algorithm and two quadratic-time algorithms. We run tests to compare the running time of these implementations and another recently described algorithm implementation that uses the edit distance. In addition, we conducted preliminary testing on a large Tara Ocean database, looking for efficient and effective strategies for finding unique sequences in a set of sequences comparing with the other
    Scoring matrices are widely used in sequence comparisons. A scoring matrix γ is indexed by symbols of an alphabet. The entry in γ in row a and column b measures the cost of the edit operation of replacing symbol a by symbol b. For a given scoring matrix and sequences s and t, we consider two kinds of induced scoring functions. The first function, known as weighted edit distance, is defined as the sum of costs of the edit operations required to transform s into t. The second, known as normalized edit distance, is defined as the minimum quotient between the sum of costs of edit operations to transform s into t and the number of the corresponding edit operations. In this work we characterize the class of scoring matrices for which the induced weighted edit distance is actually a metric. We do the same for the normalized edit distance.
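The two induced scoring functions can be sketched with dynamic programming in Python. The sketch makes several assumptions not stated in the abstract: the scoring matrix is a dict keyed by symbol pairs, insertions and deletions are scored against a hypothetical space symbol `GAP`, every alignment column (including a match a to a) counts as one edit operation, and the weighted distance is taken as the minimum over alignments. `unit_matrix` is just an example matrix; characterizing which matrices make these functions metrics is the subject of the paper and is not attempted here.

```python
from math import inf

GAP = "-"  # assumed space symbol for insertions and deletions


def unit_matrix(alphabet):
    """Example scoring matrix: cost 0 for a -> a, cost 1 for everything else."""
    symbols = list(alphabet) + [GAP]
    return {(a, b): int(a != b) for a in symbols for b in symbols}


def moves(s, t, i, j, gamma):
    """Predecessor cells of (i, j) with the cost of the operation leading to (i, j)."""
    if i > 0:
        yield i - 1, j, gamma[s[i - 1], GAP]          # delete s[i-1]
    if j > 0:
        yield i, j - 1, gamma[GAP, t[j - 1]]          # insert t[j-1]
    if i > 0 and j > 0:
        yield i - 1, j - 1, gamma[s[i - 1], t[j - 1]]  # replace (or match)


def weighted_edit_distance(s, t, gamma):
    """Minimum total cost over all alignments of s and t."""
    D = [[inf] * (len(t) + 1) for _ in range(len(s) + 1)]
    D[0][0] = 0
    for i in range(len(s) + 1):
        for j in range(len(t) + 1):
            for pi, pj, cost in moves(s, t, i, j, gamma):
                D[i][j] = min(D[i][j], D[pi][pj] + cost)
    return D[len(s)][len(t)]


def normalized_edit_distance(s, t, gamma):
    """Minimum over alignments of (total cost) / (number of operations)."""
    if not s and not t:
        return 0.0
    # D[i][j] maps an operation count to the cheapest cost achieving that count.
    D = [[{} for _ in range(len(t) + 1)] for _ in range(len(s) + 1)]
    D[0][0][0] = 0
    for i in range(len(s) + 1):
        for j in range(len(t) + 1):
            for pi, pj, cost in moves(s, t, i, j, gamma):
                for length, total in D[pi][pj].items():
                    if total + cost < D[i][j].get(length + 1, inf):
                        D[i][j][length + 1] = total + cost
    return min(total / length for length, total in D[len(s)][len(t)].items())


gamma = unit_matrix("ab")
w = weighted_edit_distance("ab", "b", gamma)    # delete 'a', keep 'b'
r = normalized_edit_distance("ab", "b", gamma)  # same alignment: cost 1 over 2 operations
```

Tracking the operation count explicitly costs an extra factor of the alignment length in time and space, but it keeps the definition of the quotient visible in the code.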
    Dissertation (Master's), Instituto de Matemática e Estatística da Universidade de São Paulo, 14/08/98.
    ABSTRACT
    The mechanism of alternative splicing in the transcriptome may increase the proteome diversity in eukaryotes. In proteomics, several studies aim to use protein sequence repositories to annotate MS experiments or to detect differentially expressed proteins. However, the available protein sequence repositories are not designed to fully detect protein isoforms derived from mRNA splice variants. To foster knowledge for the field, here we introduce SpliceProt, a new protein sequence repository of transcriptome experimental data used to investigate putative splice variants in human proteomes. The current version of SpliceProt contains 159 719 non-redundant putative polypeptide sequences. The potential of SpliceProt for detecting new protein isoforms resulting from alternative splicing was assessed using publicly available proteomics data. We detected 173 peptides hypothetically derived from splice variants, of which 54 are not present in UniprotKB/TrEMBL sequence...