Minjie Fan

In this report, we use the GEE and GLMM models to fit the skin cancer data. In both cases, we begin with a saturated model and reduce it stepwise using likelihood ratio tests. The results of the GEE and GLMM models are similar but not identical. After eliminating the outlying individual #37, the GEE model selects Year, Gender and Exposure as covariates. The GLMM model selects these three covariates as well. The signs of the estimates are the same for both models. In this case, similar interpretations can be applied to these two models. Note that the terms involving Trt are all removed from these two models. This implies that there is no significant effect of beta carotene on reducing the number of new skin cancers. Model diagnostics are also conducted for the GEE and GLMM models, implying that there is no obvious lack of fit for them. Between these two models, we prefer the GLMM model to the GEE model for two reasons: the GLMM model includes subject-specific random eff...
Symbolic regression has been shown to be quite useful in many domains from discovering scientific laws to industrial empirical modeling. Existing methods focus on numerically fitting the given data. However, in many domains, symbolically derivable properties of the desired expressions are known. We illustrate these "semantic priors" with leading powers (the polynomial behavior as the input approaches 0 and $\infty$). We introduce an expression generating neural network that significantly favors the generation of expressions with desired leading powers, even generalizing to powers not in the training set. We then describe our Neural-Guided Monte Carlo Tree Search (NG-MCTS) algorithm for symbolic regression. We extensively evaluate our method on thousands of symbolic regression tasks and desired expressions to show that it significantly outperforms baseline algorithms and exhibits discovery of novel expressions outside of the training set.
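As a rough illustration of what a "leading power" is, the sketch below estimates it numerically as the log-log slope of an expression near 0 and at large inputs. This is not the method of the paper; the function names, the sample points (1e-8, 1e-7, 1e7, 1e8) and the rounding to integers are assumptions chosen for the example.

```python
import math

def log_log_slope(f, x_a, x_b):
    """Slope of log|f(x)| against log(x) between two positive sample points."""
    return (math.log(abs(f(x_b))) - math.log(abs(f(x_a)))) / (
        math.log(x_b) - math.log(x_a)
    )

def leading_powers(f):
    """Return (power as x -> 0, power as x -> infinity), rounded to integers.

    Assumes f behaves like c * x**p at both ends with an integer p and is
    nonzero at the hypothetical sample points used here.
    """
    at_zero = log_log_slope(f, 1e-8, 1e-7)
    at_infinity = log_log_slope(f, 1e7, 1e8)
    return round(at_zero), round(at_infinity)

def example(x):
    # Behaves like x**2 near 0 and like x for large x.
    return x**2 / (1 + x)

powers = leading_powers(example)
```

For `example`, the estimate is a leading power of 2 at zero and 1 at infinity, which matches the expansion of x^2/(1+x) at both ends.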
Compared with the traditional spherical harmonics, the spherical needlets are a new generation of spherical wavelets that possess several attractive properties. Their double localization in both spatial and frequency domains empowers them to easily and sparsely represent functions with small spatial scale features. This paper is divided into two parts. First, it reviews the spherical harmonics and discusses their limitations in representing functions with small spatial scale features. To overcome the limitations, it introduces the spherical needlets and their attractive properties. In the second part of the paper, a Matlab package for the spherical needlets is presented. The properties of the spherical needlets are demonstrated by several examples using the package.
Drug discovery for Parkinson’s disease (PD) is impeded by the lack of screenable phenotypes in scalable cell models. Here we present a novel unbiased phenotypic profiling platform that combines automation, Cell Painting, and deep learning. We applied this platform to primary fibroblasts from 91 PD patients and carefully matched healthy controls, generating the largest publicly available Cell Painting dataset to date. Using fixed weights from a convolutional deep neural network trained on ImageNet, we generated unbiased deep embeddings from each image, and applied these to train machine learning models to detect morphological disease phenotypes. Interestingly, our models captured individual variation by identifying specific cell lines within the cohort with high fidelity, even across different batches and plate layouts, demonstrating platform robustness and sensitivity. Importantly, our models were able to confidently separate LRRK2 and sporadic PD lines from healthy controls (ROC AU...
Symbolic regression is a type of discrete optimization problem that involves searching expressions that fit given data points. In many cases, other mathematical constraints about the unknown expression not only provide more information beyond just values at some inputs, but also effectively constrain the search space. We identify the asymptotic constraints of leading polynomial powers as the function approaches zero and infinity as useful constraints and create a system to use them for symbolic regression. The first part of the system is a conditional production rule generating neural network which preferentially generates production rules to construct expressions with the desired leading powers, producing novel expressions outside the training domain. The second part, which we call Neural-Guided Monte Carlo Tree Search, uses the network during a search to find an expression that conforms to a set of data points and desired leading powers. Lastly, we provide an extensive experimenta...
Drug discovery for diseases such as Parkinson’s disease (PD) is impeded by the lack of screenable cellular phenotypes. We present a novel, unbiased phenotypic profiling platform that combines automated cell culture, high-content imaging, Cell Painting, and deep learning. We applied this platform to primary fibroblasts from 91 PD patients and matched healthy controls, creating the largest publicly available Cell Painting image dataset to date at 48 terabytes. Using fixed weights from a convolutional deep neural network trained on ImageNet, we generated deep embeddings from each image and trained machine learning models to detect morphological disease phenotypes. Our platform’s robustness and sensitivity allowed the detection of individual-specific variation with high fidelity, across batches and plate layouts. Lastly, our models confidently separated LRRK2 and sporadic PD lines from healthy controls (ROC AUC 0.79 (0.08 standard deviation (SD))) supporting the capacity of this platfo...
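For readers unfamiliar with the ROC AUC figure quoted above, the sketch below computes it as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as half. It is a generic illustration, not the evaluation code of the study; the labels and scores are made-up values.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative.

    labels: iterable of 0/1 class labels (1 = positive, e.g. a disease line).
    scores: iterable of classifier scores, higher meaning "more positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical toy data: two positives, two negatives.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a value such as 0.79 sits between the two.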
Drug resistance threatens the effective prevention and treatment of an ever-increasing range of human infections. This highlights an urgent need for new and improved drugs with novel mechanisms of action to avoid cross-resistance. Current cell-based drug screens are, however, restricted to binary live/dead readouts with no provision for mechanism of action prediction. Machine learning methods are increasingly being used to improve information extraction from imaging data. These methods, however, work poorly with heterogeneous cellular phenotypes and generally require time-consuming human-led training. We have developed a semi-supervised machine learning approach, combining human- and machine-labeled training data from mixed human malaria parasite cultures. Designed for high-throughput and high-resolution screening, our semi-supervised approach is robust to natural parasite morphological heterogeneity and correctly orders parasite developmental stages. Our approach also reproducibly ...
Profiling cellular phenotypes from microscopic imaging can provide meaningful biological information resulting from various factors affecting the cells. One motivating application is drug development: morphological cell features can be captured from images, from which similarities between different drug compounds applied at different doses can be quantified. The general approach is to find a function mapping the images to an embedding space of manageable dimensionality whose geometry captures relevant features of the input images. An important known issue for such methods is separating relevant biological signal from nuisance variation. For example, the embedding vectors tend to be more correlated for cells that were cultured and imaged during the same week than for those from different weeks, despite having identical drug compounds applied in both cases. In this case, the particular batch in which a set of experiments were conducted constitutes the domain of the data; an ideal set ...
We present IDEA (the Induction Dynamics gene Expression Atlas), a dataset constructed by independently inducing hundreds of transcription factors (TFs) and measuring timecourses of the resulting gene expression responses in budding yeast. Each experiment captures a regulatory cascade connecting a single induced regulator to the genes it causally regulates. We discuss the regulatory cascade of a single TF, Aft1, in detail; however, IDEA contains > 200 TF induction experiments with 20 million individual observations and 100,000 signal‐containing dynamic responses. As an application of IDEA, we integrate all timecourses into a whole‐cell transcriptional model, which is used to predict and validate multiple new and underappreciated transcriptional regulators. We also find that the magnitudes of coefficients in this model are predictive of genetic interaction profile similarities. In addition to being a resource for exploring regulatory connectivity between TFs and their target genes, our modeling approach shows that combining rapid perturbations of individual genes with genome‐scale time‐series measurements is an effective strategy for elucidating gene regulatory networks.
Drug resistance threatens the effective prevention and treatment of an ever-increasing range of human infections. This highlights an urgent need for new and improved drugs with novel mechanisms of action to avoid cross-resistance. Current cell-based drug screens are, however, restricted to binary live/dead readouts with no provision for mechanism of action prediction. Machine learning methods are increasingly being used to improve information extraction from imaging data. Such methods, however, work poorly with heterogeneous cellular phenotypes and generally require time-consuming human-led training. We have developed a semi-supervised machine learning approach, combining human- and machine-labelled training data from mixed human malaria parasite cultures. Designed for high-throughput and high-resolution screening, our semi-supervised approach is robust to natural parasite morphological heterogeneity and correctly orders parasite developmental stages. Our approach also reproducibly ...
The etiological underpinnings of many CNS disorders are not well understood. This is likely due to the fact that individual diseases aggregate numerous pathological subtypes, each associated with a complex landscape of genetic risk factors. To overcome these challenges, researchers are integrating novel data types from numerous patients, including imaging studies capturing broadly applicable features from patient-derived materials. These datasets, when combined with machine learning, potentially hold the power to elucidate the subtle patterns that stratify patients by shared pathology. In this study, we interrogated whether high-content imaging of primary skin fibroblasts, using the Cell Painting method, could reveal disease-relevant information among patients. First, we showed that technical features such as batch/plate type, plate, and location within a plate lead to detectable nuisance signals, as revealed by a pre-trained deep neural network and analysis with deep image embeddin...
We present an approach for inferring genome-wide regulatory causality and demonstrate its application on a yeast dataset constructed by independently inducing hundreds of transcription factors and measuring timecourses of the resulting gene expression responses. We discuss the regulatory cascades in detail for a single transcription factor, Aft1; however, we have 201 TF induction timecourses that include >100,000 signal-containing dynamic responses. From a single TF induction timecourse we can often discriminate the direct from the indirect effects of the induced TF. Across our entire dataset, however, we find that the majority of expression changes are indirectly driven by unknown regulators. By integrating all timecourses into a single whole-cell transcriptional model, potential regulators of each gene can be predicted without incorporating prior information. In doing so, the indirect effects of a TF are understood as a series of direct regulatory predictions that capture how r...
We develop in this paper a novel portfolio selection framework with a feature of double robustness in both return distribution modeling and portfolio optimization. While predicting the future return distributions always represents the most compelling challenge in investment, any underlying distribution can always be well approximated by a mixture distribution, provided that the component list of the mixture includes all possible distributions corresponding to the scenario analysis of potential market modes. Adopting a mixture distribution enables us to (1) reduce the problem of distribution prediction to a parameter estimation problem in which the mixture weights are estimated under a Bayesian learning scheme and the corresponding credible regions of the mixture weights are obtained as well and (2) harmonize information from different channels, such as historical data, market implied information and investors' subjective views. We further formulate a robust mean-CVaR portfolio selection problem to deal with the inherent uncertainty in predicting the future return distributions. By employing duality theory, we show that the robust portfolio selection problem via learning with a mixture model can be reformulated as a linear program or a second-order cone program, which can be effectively solved in polynomial time. We present the results of simulation analyses and primary empirical tests to illustrate the significance of the proposed approach and demonstrate its pros and cons.
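To make the CVaR ingredient concrete, the sketch below evaluates the sample CVaR of a fixed portfolio using the standard auxiliary-variable form, CVaR = min over t of t + (1 / ((1 - alpha) N)) * sum(max(L_i - t, 0)). This piecewise-linear form is what allows scenario-based mean-CVaR problems to be written as linear programs. It is not the robust mixture-model formulation of the paper; the function names, weights and scenario returns are hypothetical.

```python
def portfolio_losses(weights, scenario_returns):
    """Loss of the portfolio in each scenario (negative of its return)."""
    return [
        -sum(w * r for w, r in zip(weights, row)) for row in scenario_returns
    ]

def cvar(losses, alpha):
    """Sample CVaR at confidence level alpha via the auxiliary-variable form.

    The objective is convex and piecewise linear in t with kinks at the
    sample losses, so it is enough to evaluate it at those points.
    """
    n = len(losses)

    def objective(t):
        excess = sum(max(loss - t, 0.0) for loss in losses)
        return t + excess / ((1.0 - alpha) * n)

    return min(objective(t) for t in losses)

# Hypothetical example: two assets, four equally likely return scenarios.
weights = [0.5, 0.5]
scenarios = [[0.02, 0.01], [-0.04, 0.00], [0.01, -0.03], [0.03, 0.02]]
losses = portfolio_losses(weights, scenarios)
risk = cvar(losses, alpha=0.5)  # average of the worst half of the losses
```

In an optimization setting, `weights` and `t` would both become decision variables and each `max(...)` term would be replaced by a nonnegative slack variable, which keeps the problem linear.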