Hilmar Lapp
  • Durham, North Carolina, United States

Phenotypes resulting from mutations in genetic model organisms can help reveal candidate genes for evolutionarily important phenotypic changes in related taxa. Although testing candidate gene hypotheses experimentally in non-model organisms is typically difficult, ontology-driven information systems can help generate testable hypotheses about developmental processes in experimentally tractable organisms. Here, we tested candidate gene hypotheses suggested by expert use of the Phenoscape Knowledgebase, specifically looking for genes that are candidates responsible for evolutionarily interesting phenotypes in the ostariophysan fishes that bear resemblance to mutant phenotypes in zebrafish. For this, we searched ZFIN for genetic perturbations that result in either loss of basihyal element or loss of scales phenotypes, because these are the ancestral phenotypes observed in catfishes (Siluriformes). We tested the identified candidate genes by examining their endogenous expression patterns in the channel catfish, Ictalurus punctatus. The experimental results were consistent with the hypotheses that these features evolved via disruption in developmental pathways at, or upstream of, brpf1 and eda/edar for the ancestral losses of basihyal element and scales, respectively. These results demonstrate that ontological annotations of the phenotypic effects of genetic alterations in model organisms, when aggregated within a knowledgebase, can be used effectively to generate testable, and useful, hypotheses about evolutionary changes in morphology.
The reality of larger and larger molecular databases and the need to integrate data scalably have presented a major challenge for the use of phenotypic data. Morphology is currently primarily described in discrete publications, entrenched in non-computer readable text, and requires enormous investments of time and resources to integrate across large numbers of taxa and studies. Here we present a new methodology, using ontology-based reasoning systems working with the Phenoscape Knowledgebase (KB; kb.phenoscape.org), to automatically integrate large amounts of evolutionary character state descriptions into a synthetic character matrix of neomorphic (presence/absence) data. Using the KB, which includes more than 55 studies of sarcopterygian taxa, we generated a synthetic supermatrix of 639 variable characters scored for 1051 taxa, resulting in over 145,000 populated cells. Of these characters, over 76% were made variable through the addition of inferred presence/absence states derived...
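The inference of presence/absence states described above can be illustrated with a minimal sketch. It assumes a toy `PART_OF` hierarchy and the hypothetical helper names `wholes_of`, `parts_of`, and `infer_matrix`; none of these come from the Phenoscape KB, which relies on ontology-based reasoning rather than this simplified propagation. The sketch only shows the two rules of thumb: if a part is present, the structures containing it are present, and if a structure is absent, its parts are absent.

```python
from typing import Dict, List

# Hypothetical toy anatomy (part -> the whole it belongs to); not real ontology content.
PART_OF: Dict[str, str] = {"digit": "manus", "manus": "forelimb"}


def wholes_of(structure: str) -> List[str]:
    """Return every structure that (transitively) contains `structure`."""
    result = []
    while structure in PART_OF:
        structure = PART_OF[structure]
        result.append(structure)
    return result


def parts_of(structure: str) -> List[str]:
    """Return every structure that is (transitively) part of `structure`."""
    result = []
    for part, whole in PART_OF.items():
        if whole == structure:
            result.append(part)
            result.extend(parts_of(part))
    return result


def infer_matrix(asserted: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Add inferred presence/absence states to the asserted ones.

    'present' propagates upward to containing structures;
    'absent' propagates downward to parts.
    Asserted states always override inferred ones.
    """
    matrix = {}
    for taxon, states in asserted.items():
        row: Dict[str, str] = {}
        for structure, state in states.items():
            implied = wholes_of(structure) if state == "present" else parts_of(structure)
            for other in implied:
                # Keep the first inferred state; conflicts are not resolved in this sketch.
                row.setdefault(other, state)
        row.update(states)
        matrix[taxon] = row
    return matrix
```

Calling `infer_matrix({"taxon_A": {"digit": "present"}})` fills in `manus` and `forelimb` as present for `taxon_A`, which is how a character that was never scored directly can become populated in a synthetic matrix.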
The abundance of phenotypic diversity among species can enrich our knowledge of development and genetics beyond the limits of variation that can be observed in model organisms. The Phenoscape Knowledgebase (KB) is designed to enable exploration and discovery of phenotypic variation among species. Because phenotypes in the KB are annotated using standard ontologies, evolutionary phenotypes can be compared with phenotypes from genetic perturbations in model organisms. To illustrate the power of this approach, we review the use of the KB to find taxa showing evolutionary variation similar to that of a query gene. Matches are made between the full set of phenotypes described for a gene and an evolutionary profile, the latter of which is defined as the set of phenotypes that are variable among the daughters of any node on the taxonomic tree. Phenoscape's semantic similarity interface allows the user to assess the statistical significance of each match and flags matches that may only result from differences in annotation coverage between genetic and evolutionary studies. Tools such as this will help meet the challenge of relating the growing volume of genetic knowledge in model organisms to the diversity of phenotypes in nature. The Phenoscape KB is available at http://kb.phenoscape.org. genesis 53:561-571, 2015. © 2015 Wiley Periodicals, Inc.
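The matching of a gene's phenotype set against an evolutionary profile can be sketched as follows. This is a simplification under stated assumptions: phenotypes are plain strings, "variable among the daughters" is approximated as "seen in some but not all daughter taxa", and Jaccard overlap stands in for the ontology-based semantic similarity and significance testing that the KB actually provides. The function names and phenotype labels are hypothetical.

```python
from typing import List, Set


def evolutionary_profile(daughters: List[Set[str]]) -> Set[str]:
    """Phenotypes observed in at least one, but not all, daughters of a node."""
    if not daughters:
        return set()
    seen_anywhere = set().union(*daughters)
    seen_everywhere = set.intersection(*daughters)
    return seen_anywhere - seen_everywhere


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap score in [0, 1]; a crude stand-in for semantic similarity."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical daughters of one taxonomic node and one query gene.
daughters = [
    {"scales absent", "basihyal absent"},
    {"basihyal absent"},
    {"basihyal absent", "fin elongated"},
]
gene_phenotypes = {"scales absent", "teeth absent"}

profile = evolutionary_profile(daughters)
score = jaccard(gene_phenotypes, profile)
```

Here `basihyal absent` is shared by all daughters and therefore drops out of the profile, so only the phenotypes that actually vary below the node contribute to the match score.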
Classification of human tumors according to their primary anatomical site of origin is fundamental for the optimal treatment of patients with cancer. Here we describe the use of large-scale RNA profiling and supervised machine learning algorithms to construct a first-generation molecular classification scheme for carcinomas of the prostate, breast, lung, ovary, colorectum, kidney, liver, pancreas, bladder/ureter, and gastroesophagus, which collectively account for approximately 70% of all cancer-related deaths in the United States. The classification scheme was based on identifying gene subsets whose expression typifies each cancer class, and we quantified the extent to which these genes are characteristic of a specific tumor type by accurately and confidently predicting the anatomical site of tumor origin for 90% of 175 carcinomas, including 9 of 12 metastatic lesions. The predictor gene subsets include those whose expression is typical of specific types of normal epithelial differ...
In this paper we address the problem of reliably fitting parametric and semi-parametric models to spots in high-density spot array images obtained in gene expression experiments. The goal is to measure the amount of label bound to an array element. Many spots can be modelled accurately by a Gaussian shape. To deal with highly overlapping spots, we use robust M-estimators. When the parametric method fails (which can be detected automatically), we use a novel, robust semi-parametric method that can handle spots of different shapes accurately. The introduced techniques are evaluated experimentally.
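The abstract does not say which M-estimator is used, so the following is only a sketch of the general idea, not the authors' spot model. It assumes a Huber-type location estimate computed by iteratively reweighted averaging over a one-dimensional list of intensities; the function name `huber_location` and the tuning constant 1.345 are choices made for this illustration. The point is that samples far from the bulk (for example, pixels contaminated by a neighbouring spot) are down-weighted instead of dominating the fit.

```python
from statistics import median
from typing import Sequence


def huber_location(values: Sequence[float], c: float = 1.345,
                   tol: float = 1e-9, max_iter: int = 100) -> float:
    """Huber M-estimate of location via iteratively reweighted averaging."""
    mu = median(values)
    # Robust scale from the median absolute deviation (MAD).
    mad = median([abs(v - mu) for v in values])
    scale = 1.4826 * mad
    if scale == 0:
        return mu  # more than half of the values coincide with the median
    k = c * scale
    for _ in range(max_iter):
        # Full weight inside the threshold, shrinking weight outside it.
        weights = [1.0 if abs(v - mu) <= k else k / abs(v - mu) for v in values]
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Hypothetical intensities: five consistent samples and one contaminated one.
intensities = [10.0, 10.2, 9.8, 10.1, 9.9, 50.0]
robust_estimate = huber_location(intensities)
plain_mean = sum(intensities) / len(intensities)
```

On this input the plain mean is pulled far toward the contaminated sample, while the robust estimate stays close to the five consistent ones.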
Nearly invariably, phenotypes are reported in the scientific literature in meticulous detail, utilizing the full expressivity of natural language. Often it is particularly these detailed observations (facts) that are of interest, and thus specific to the research questions that motivated observing and reporting them. However, research aiming to synthesize or integrate phenotype data across many studies or even fields is often faced with the need to abstract from detailed observations so as to construct phenotypic concepts that are common across many datasets rather than specific to a few. Yet, observations or facts that would fall under such abstracted concepts are typically not directly asserted by the original authors, usually because they are "obvious" according to common domain knowledge, and thus asserting them would be deemed redundant by anyone with sufficient domain knowledge. For example, a phenotype describing the length of a manual digit for an organism implicit...
The Teleost Taxonomy Ontology.
Despite complete sequencing of the human and mouse genomes, functional annotation of novel gene function still remains a major challenge in mammalian biology. Emerging strategies to help elucidate unknown gene function include the analysis of tissue-specific patterns of mRNA expression. A recent study investigated the steady-state mRNA expression profiling of the vast majority of protein-encoding human and mouse genes across a panel of 79 human and 61 mouse nonredundant tissues. The microarray data from this study constitutes the Genomics Institute of Novartis Foundation (GNF) Human and Mouse Gene Atlases and is publicly available for exploration through the SymAtlas web-application (http://symatlas.gnf.org/). We have recently reported the use of these data and hierarchical clustering algorithms to generate a global overview of the distribution of Rabs, SNAREs, and coat machinery components, as well as their respective adaptors, effectors, and regulators. This systems biology approach led us to propose Rab-centric protein activity hubs as a framework for an integrated coding system, the membrome network, which orchestrates the dynamics of specialized membrane architecture of differentiated cells. Here, we describe the use of the SymAtlas web-application and the Membrome datasets to help explore trafficking GTPase function. The human and mouse membrome datasets are available through the Membrome homepage (http://www.membrome.org/) and correspond to subsets of the SymAtlas content restricted to known membrane trafficking components. Considering the fragmentary nature of the current reductionist approaches in elucidating trafficking component functions, the membrome datasets provide a more focused systems biology perspective that not only complements our current understanding of transport in complex tissues but also provides an integrated perspective of Rab activity in controlling membrane architecture.
The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.
The tissue-specific pattern of mRNA expression can indicate important clues about gene function. High-density oligonucleotide arrays offer the opportunity to examine patterns of gene expression on a genome scale. Toward this end, we have designed custom arrays that interrogate the expression of the vast majority of protein-encoding human and mouse genes and have used them to profile a panel of 79 human and 61 mouse tissues. The resulting data set provides the expression patterns for thousands of predicted genes, as well as known and poorly characterized genes, from mice and humans. We have explored this data set for global trends in gene expression, evaluated commonly used lines of evidence in gene prediction methodologies, and investigated patterns indicative of chromosomal organization of transcription. We describe hundreds of regions of correlated transcription and show that some are subject to both tissue and parental allele-specific expression, suggesting a link between spatial expression and imprinting.
Rab GTPases and SNARE fusion proteins direct cargo trafficking through the exocytic and endocytic pathways of eukaryotic cells. We have used steady state mRNA expression profiling and computational hierarchical clustering methods to generate a global overview of the distribution of Rabs, SNAREs, and coat machinery components, as well as their respective adaptors, effectors, and regulators in 79 human and 61 mouse nonredundant tissues. We now show that this systems biology approach can be used to define building blocks for membrane trafficking based on Rab-centric protein activity hubs. These Rab-regulated hubs provide a framework for an integrated coding system, the membrome network, which regulates the dynamics of the specialized membrane architecture of differentiated cells. The distribution of Rab-regulated hubs illustrates a number of facets that guide the overall organization of subcellular compartments of cells and tissues through the activity of dynamic protein interaction networks. An interactive website for exploring datasets comprising components of the Rab-regulated hubs that define the membrome of different cell and organ systems in both human and mouse is available at http://www.membrome.org/.
The Bioperl project is an international open-source collaboration of biologists, bioinformaticians, and computer scientists that has evolved over the past 7 yr into the most comprehensive library of Perl modules available for managing and manipulating life-science information. Bioperl provides an easy-to-use, stable, and consistent programming interface for bioinformatics application programmers. The Bioperl modules have been successfully and repeatedly used to reduce otherwise complex tasks to only a few lines of code. The Bioperl object model has been proven to be flexible enough to support enterprise-level applications such as EnsEMBL, while maintaining an easy learning curve for novice Perl programmers. Bioperl is capable of executing analyses and processing results from programs such as BLAST, ClustalW, or the EMBOSS suite. Interoperation with modules written in Python and Java is supported through the evolving BioCORBA bridge. Bioperl provides access to data stores such as GenBank and SwissProt via a flexible series of sequence input/output modules, and to the emerging common sequence data storage format of the Open Bioinformatics Database Access project. This study describes the overall architecture of the toolkit, the problem domains that it addresses, and gives specific examples of how the toolkit can be used to solve common life-sciences problems. We conclude with a discussion of how the open-source nature of the project has contributed to the development effort.
In December, 2006, a group of 26 software developers from some of the most widely used life science programming toolkits and phylogenetic software projects converged on Durham, North Carolina, for a Phyloinformatics Hackathon, an intense five-day collaborative software ...
Despite a large and multifaceted effort to understand the vast landscape of phenotypic data, their current form inhibits productive data analysis. The lack of a community-wide, consensus-based, human- and machine-interpretable language for describing phenotypes and their genomic and environmental contexts is perhaps the most pressing scientific bottleneck to integration across many key fields in biology, including genomics, systems biology, development, medicine, evolution, ecology, and systematics. Here we survey the current phenomics landscape, including data resources and handling, and the progress that has been made to accurately capture relevant data descriptions for phenotypes. We present an example of the kind of integration across domains that computable phenotypes would enable, and we call upon the broader biology community, publishers, and relevant funding agencies to support efforts to surmount today's data barriers and facilitate analytical reproducibility.
1. The importance of data archiving, data sharing and public access to data has received considerable attention. Awareness is growing among scientists that collaborative databases can facilitate these activities.
2. We provide a detailed description of the collaborative life history database developed by our Working Group at the National Evolutionary Synthesis Center to address questions about life history patterns and the evolution of mortality and demographic variability in wild primates.
3. Examples from each of the seven primate species included in our database illustrate the range of data incorporated and the challenges, decision-making processes, and criteria applied to standardize data across diverse field studies. In addition to the descriptive and structural metadata associated with our database, we also describe the process metadata (how the database was designed and delivered) and the technical specifications of the database.
4. Our database provides a useful model for other researchers interested in developing similar types of databases for other organisms, while our process metadata may be helpful to other groups of researchers interested in developing databases for other types of collaborative analyses.
Synthetic science promises an unparalleled ability to find new meaning in old data, extant results, or previously unconnected methods and concepts, but pursuing synthesis can be a difficult and risky endeavor. Our experience as biologists, informaticians, and educators at the National Evolutionary Synthesis Center has affirmed that synthesis can yield major insights, but also revealed that technological hurdles, prevailing academic culture, and general confusion about the nature of synthesis can hamper its progress. By presenting our view of what synthesis is, why it will continue to drive progress in evolutionary biology, and how to remove barriers to its progress, we provide a map to a future in which all scientists can engage productively in synthetic research.