Alignment-free methods are increasingly used to calculate evolutionary distances between DNA and protein sequences as a basis of phylogeny reconstruction. Most of these methods, however, use heuristic distance functions that are not based on any explicit model of molecular evolution. Herein, we propose a simple estimator d_N of the evolutionary distance between two DNA sequences that is calculated from the number N of (spaced) word matches between them. We show that this distance function is more accurate than other distance measures that are used by alignment-free methods. In addition, we calculate the variance of the normalized number N of (spaced) word matches. We show that the variance of N is smaller for spaced words than for contiguous words, and that the variance is further reduced if our spaced-words approach is used with multiple patterns of 'match positions' and 'don't care positions'. Our software is available online and as downloadable source code at:...
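As an illustration of the quantity N mentioned in this abstract, the following is a minimal sketch of how spaced-word matches between two DNA sequences could be counted for a single pattern. The function names, the "1"/"0" pattern encoding (1 = match position, 0 = don't care position) and the example sequences are assumptions made for this sketch. It does not reproduce the authors' software, and it does not compute the estimator d_N itself.

```python
from collections import Counter


def spaced_words(seq, pattern):
    """Yield the spaced word at every start position of seq.

    pattern is a string of '1' (match position) and '0' (don't care position);
    only the characters at match positions are kept.
    """
    care = [i for i, c in enumerate(pattern) if c == "1"]
    for start in range(len(seq) - len(pattern) + 1):
        yield "".join(seq[start + i] for i in care)


def count_spaced_word_matches(s1, s2, pattern):
    """Count pairs of positions (i, j) whose spaced words are identical."""
    c1 = Counter(spaced_words(s1, pattern))
    c2 = Counter(spaced_words(s2, pattern))
    return sum(n * c2[word] for word, n in c1.items())


# The two sequences differ at one position. A contiguous pattern finds no
# match, while a pattern with a don't care position in the middle finds one.
print(count_spaced_word_matches("ACGT", "ATGT", "111"))
print(count_spaced_word_matches("ACGT", "ATGT", "101"))
```

With several patterns, the same count could be computed once per pattern and averaged. How the counts are normalized and turned into a distance is left to the paper.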
We present a WWW server for AUGUSTUS, a software for gene prediction in eukaryotic genomic sequences that is based on a generalized hidden Markov model, a probabilistic model of a sequence and its gene structure. The web server allows the user to impose constraints on the predicted gene structure. A constraint can specify the position of a splice site, a translation initiation site or a stop codon. Furthermore, it is possible to specify the position of known exons and intervals that are known to be exonic or intronic sequence. The number of constraints is arbitrary and constraints can be combined in order to pin down larger parts of the predicted gene structure. The result then is the most likely gene structure that complies with all given user constraints, if such a gene structure exists. The specification of constraints is useful when part of the gene structure is known, e.g. by expressed sequence tag or protein sequence alignments, or if the user wants to change the default predi...
Overview: In a world of worthy candidates, there are several compelling reasons to sequence the genome of the red flour beetle Tribolium castaneum. First and foremost, Tribolium is one of the most sophisticated genetic model organisms among all higher eukaryotes. Among arthropods, only Drosophila offers greater power and flexibility of genetic manipulation. Second, the Tribolium genome sequence will provide an informative link when direct comparisons between human and fruit fly sequences are unproductive. Third, as a member of the most primitive order of holometabolous insects, the Coleoptera, it is in a key phylogenetic position to inform us about the genetic innovations that accompanied the evolution of higher forms with more complex development. Fourth, Coleoptera is the largest and most species-diverse of all eukaryotic orders, and Tribolium offers the only genetic model for this profusion of medically and economically important species. Analysis of the Tribolium genome will faci...
AUGUSTUS is a software tool for gene prediction in eukaryotes based on a Generalized Hidden Markov Model, a probabilistic model of a sequence and its gene structure. Like most existing gene finders, the first version of AUGUSTUS returned one transcript per predicted gene and ignored the phenomenon of alternative splicing. Herein, we present a WWW server for an extended version of AUGUSTUS that is able to predict multiple splice variants. To our knowledge, this is the first ab initio gene finder that can predict multiple transcripts. In addition, we offer a motif searching facility, where user-defined regular expressions can be searched against putative proteins encoded by the predicted genes. The AUGUSTUS web interface and the downloadable open-source stand-alone program are freely available from
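The motif search described in this abstract can be pictured with a short sketch using Python's standard `re` module. This is not the AUGUSTUS web interface or its code. The function name `find_motif`, the transcript identifiers and the protein sequences are hypothetical, and the pattern used is the well-known N-glycosylation consensus N-{P}-[ST]-{P} written as a regular expression.

```python
import re


def find_motif(proteins, pattern):
    """Return {name: [1-based start positions]} for proteins matching pattern.

    proteins maps a (hypothetical) transcript name to its amino acid sequence.
    Matches are non-overlapping, as reported by re.finditer.
    """
    regex = re.compile(pattern)
    hits = {}
    for name, seq in proteins.items():
        positions = [m.start() + 1 for m in regex.finditer(seq)]
        if positions:
            hits[name] = positions
    return hits


# Hypothetical predicted proteins, for illustration only.
predicted = {
    "g1.t1": "MKNGSAPLNPTQ",
    "g2.t1": "MAAAK",
}
print(find_motif(predicted, r"N[^P][ST][^P]"))
```

If overlapping occurrences matter, the pattern could be wrapped in a lookahead such as `(?=N[^P][ST][^P])`, at the cost of zero-length match objects.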
Most multi-alignment methods are fully automated, i.e. they are based on a fixed set of mathematical rules. For various reasons, such methods may fail to produce biologically meaningful alignments. Herein, we describe a semi-automatic approach to multiple sequence alignment where biological expert knowledge can be used to influence the alignment procedure. The user can specify parts of the sequences that are biologically related to each other; our software program uses these sites as anchor points and creates a multiple alignment respecting these user-defined constraints. By using known functionally, structurally or evolutionarily related positions of the input sequences as anchor points, our method can produce alignments that reflect the true biological relationships among the input sequences more accurately than fully automated procedures can do. Availability: Our software is available online at Göttingen
Genetic screens are powerful tools to identify the genes required for a given biological process. However, for technical reasons, comprehensive screens have been restricted to very few model organisms. Therefore, although deep sequencing is revealing the genes of ever more insect species, the functional studies predominantly focus on candidate genes previously identified in Drosophila, which is biasing research towards conserved gene functions. RNAi screens in other organisms promise to reduce this bias. Here we present the results of the iBeetle screen, a large-scale, unbiased RNAi screen in the red flour beetle, Tribolium castaneum, which identifies gene functions in embryonic and postembryonic development, physiology and cell biology. The utility of Tribolium as a screening platform is demonstrated by the identification of genes involved in insect epithelial adhesion. This work transcends the restrictions of the candidate gene approach and opens fields of research not accessible in Drosophila.
In September 2005, six Community Grid projects and the Integration Project (DGI) started to design and develop a sustainable Grid infrastructure for the e-Sciences Community in Germany. The following D-Grid projects are funded by BMBF, the federal ministry of education and research: DGI D-Grid Integration project, AstroGrid-D in astronomy, C3-Grid for climate research, HEP-Grid for high energy physics, InGrid for engineering research, MediGrid for medical research, and TextGrid for humanities.
Alignment-based methods for sequence analysis have various limitations if large datasets are to be analysed. Therefore, alignment-free approaches have become popular in recent years. One of the best-known alignment-free methods is the average common substring approach that defines a distance measure on sequences based on the average length of longest common words between them. Herein, we generalize this approach by considering longest common substrings with k mismatches. We present a greedy heuristic to approximate the length of such k-mismatch substrings, and we describe kmacs, an efficient implementation of this idea based on generalized enhanced suffix arrays. To evaluate the performance of our approach, we applied it to phylogeny reconstruction using a large number of DNA and protein sequence sets. In most cases, phylogenetic trees calculated with kmacs were more accurate than trees produced with established alignment-free methods that are based on exact word matches. Especially...
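To make the idea of a k-mismatch common substring concrete, here is a minimal sketch of one possible greedy strategy: for each position in the first sequence, find the longest exact match in the second sequence, then extend it without gaps until more than k mismatches would be needed. Whether this matches the heuristic used in kmacs in every detail is an assumption. The function names are hypothetical, and the quadratic scan below merely stands in for the enhanced suffix arrays named in the abstract.

```python
def longest_exact_match(s1, i, s2):
    """Return (length, j): the longest exact match of s1[i:] starting at some j in s2."""
    best_len, best_j = 0, -1
    for j in range(len(s2)):
        length = 0
        while (i + length < len(s1) and j + length < len(s2)
               and s1[i + length] == s2[j + length]):
            length += 1
        if length > best_len:
            best_len, best_j = length, j
    return best_len, best_j


def extend_with_mismatches(s1, i, s2, j, k):
    """Length of the gap-free match of s1[i:] and s2[j:] with at most k mismatches."""
    length, mismatches = 0, 0
    while i + length < len(s1) and j + length < len(s2):
        if s1[i + length] != s2[j + length]:
            if mismatches == k:
                break  # the next mismatch would exceed the budget
            mismatches += 1
        length += 1
    return length


def greedy_k_mismatch_lengths(s1, s2, k):
    """For every position of s1, approximate the longest k-mismatch common substring."""
    lengths = []
    for i in range(len(s1)):
        exact_len, j = longest_exact_match(s1, i, s2)
        lengths.append(extend_with_mismatches(s1, i, s2, j, k) if exact_len else 0)
    return lengths


# The sequences differ only at the third position.
print(greedy_k_mismatch_lengths("ACGTAC", "ACCTAC", 0)[0])  # exact matches only
print(greedy_k_mismatch_lengths("ACGTAC", "ACCTAC", 1)[0])  # one mismatch allowed
```

The result is an approximation because the extension starts only from the longest exact match, so a better k-mismatch substring anchored elsewhere can be missed.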
In order to improve gene prediction, extrinsic evidence on the gene structure can be collected from various sources of information such as genome-genome comparisons and EST and protein alignments. However, such evidence is often incomplete and usually uncertain. The extrinsic evidence is usually not sufficient to recover the complete gene structure of all genes, and the available evidence is often unreliable. Therefore, extrinsic evidence is most valuable when it is balanced with sequence-intrinsic evidence. We present a fairly general method for integration of external information. Our method is based on the evaluation of hints to potentially protein-coding regions by means of a Generalized Hidden Markov Model (GHMM) that takes both intrinsic and extrinsic information into account. We used this method to extend the ab initio gene prediction program AUGUSTUS to a versatile tool that we call AUGUSTUS+. In this study, we focus on hints derived from matches to an EST or protei...
Proceedings / IEEE Computer Society Bioinformatics Conference. IEEE Computer Society Bioinformatics Conference, 2002
Comparative analysis of syntenic genome sequences can be used to identify functional sites such as exons and regulatory elements. Here, the first step is to align two or several evolutionarily related sequences and, in recent years, a number of computer programs have been developed for alignment of large genomic sequences. Some of these programs are extremely fast, but often time-efficiency is achieved at the expense of sensitivity. One way of combining speed and sensitivity is to use an anchored-alignment approach. In a first step, a fast heuristic identifies a chain of strong sequence similarities that serve as anchor points. In a second step, regions between these anchor points are aligned using a slower but more sensitive method. We present CHAOS, a novel algorithm for rapid identification of chains of local sequence similarities among large genomic sequences. Similarities identified by CHAOS are used as anchor points to improve the running time of the DIALIGN alignment program. Sy...
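The first step of the anchored-alignment approach, selecting a consistent chain of local similarities, can be sketched with a generic dynamic program. This is not the CHAOS algorithm. The tuple layout `(start1, start2, length, score)`, the function name `best_chain` and the example hits are assumptions made for illustration.

```python
def best_chain(hits):
    """Return the highest-scoring chain of non-overlapping, collinear hits.

    Each hit is (start1, start2, length, score): a gap-free local similarity
    starting at start1 in the first sequence and start2 in the second.
    A hit may follow another only if it begins after that hit ends in both
    sequences. Simple O(n^2) dynamic programming over hits sorted by start1.
    """
    hits = sorted(hits)
    if not hits:
        return []
    n = len(hits)
    best = [h[3] for h in hits]  # best chain score ending at each hit
    prev = [-1] * n              # predecessor index for traceback
    for b in range(n):
        s1b, s2b, _, score_b = hits[b]
        for a in range(b):
            s1a, s2a, len_a, _ = hits[a]
            if s1a + len_a <= s1b and s2a + len_a <= s2b:
                if best[a] + score_b > best[b]:
                    best[b] = best[a] + score_b
                    prev[b] = a
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(hits[end])
        end = prev[end]
    return chain[::-1]


# Hypothetical local similarities; the hit at (20, 5) conflicts with the
# hit at (12, 15) and is left out of the best chain.
anchors = best_chain([(0, 0, 10, 5.0), (20, 5, 10, 3.0),
                      (12, 15, 10, 4.0), (30, 30, 10, 6.0)])
print(anchors)
```

In the second step, only the regions between consecutive anchors would be passed to the slower, more sensitive aligner, which is where the running-time gain comes from.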
Comparative sequence analysis is a powerful approach to identify functional elements in genomic sequences. Herein, we describe AGenDA (Alignment-based GENe Detection Algorithm), a novel method for gene prediction that is based on long-range alignment of syntenic regions in eukaryotic genome sequences. Local sequence homologies identified by the DIALIGN program are searched for conserved splice signals to define potential protein-coding exons; these candidate exons are then used to assemble complete gene structures. The performance of our method was tested on a set of 105 human-mouse sequence pairs. These test runs showed that sensitivity and specificity of AGenDA are comparable with the best gene-prediction program that is currently available. However, since our method is based on a completely different type of input information, it can detect genes that are not detectable by standard methods and vice versa. Thus, our approach seems to be a useful addition to existing gene-predicti...
Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology, 1998
In this paper, we discuss a novel scoring scheme for sequence alignments. The score of an alignment is defined as the sum of so-called weights of aligned segment pairs. A simple modification of the weight function used by the original version of the DIALIGN alignment program turns out to have a crucial advantage: it can be applied to both global and local alignment problems without the need to specify a threshold parameter.