👋 Hi, I’m @lhallee
I'm Logan Hallee, a PhD candidate in Bioinformatics Data Science at the University of Delaware (Gleghorn Lab), specializing in curating high-dimensional feature spaces for biological data. My primary focus is protein design and annotation using transformer neural networks. Techniques I developed led to SYNTERACT, the first large language model approach to protein-protein interaction prediction, which ranks in the top 3% of research outputs scored by Altmetric.
At the Wolfram Winter School, I collaborated with Stephen Wolfram and other mentors to create "Tetris For Proteins," a shape-based metric for protein-protein interactions that emulates lock-and-key enzyme-substrate dynamics, generating hypotheses about protein aggregation likelihood.
I created the Annotation Vocabulary, a unique set of integers mapped to popular protein and gene ontologies, enabling state-of-the-art protein annotation and generation models when used with its own token embedding.
My work also supports the view that codon usage bias is a key biological signal for phylogenetic analysis. Our models, published in Scientific Reports (Nature Portfolio), highlight codon usage as a unique phylogenetic predictor. Our lab recently produced cdsBERT, which showcases cost-effective techniques for enhancing the biological relevance of protein language models using a codon vocabulary.
In natural language processing, I invented a Mixture of Experts extension for scalable transformer networks that excel at sentence similarity tasks. We believe future networks with N experts will perform like N independently trained networks, offering significant time and compute savings for vector retrieval and search systems that rely on semantic vector representations.
I also manage lab projects in computer vision, using deep learning to reconstruct anatomically accurate 3D organs from 2D Z-stacks to inform morphometric and pharmacokinetic studies.
Some other stuff I've worked on over the years:
- featureranker, a Python package for feature ranking
- My textbook section about Protein Language Models
- Machine learning to identify cardioprotective molecules in minority groups
- Writing about the relationships of Hsp90 and Gamma secretase in cardiac diseases
Norway, ME ➔ Newark, DE