To enable personalized cancer treatment, machine learning models have been developed to predict drug response as a function of tumor and drug features. However, most algorithm development efforts have relied on cross-validation within a single study to assess model accuracy. While an essential first step, cross-validation within a biological data set typically provides an overly optimistic estimate of the prediction performance on independent test sets. To provide a more rigorous assessment of model generalizability between different studies, we use machine learning to analyze five publicly available cell line-based data sets: National Cancer Institute 60, Cancer Therapeutics Response Portal (CTRP), Genomics of Drug Sensitivity in Cancer, Cancer Cell Line Encyclopedia and Genentech Cell Line Screening Initiative (gCSI). Based on observed experimental variability across studies, we explore estimates of prediction upper bounds. We report performance results of a variety of machine lear...
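As a minimal illustration of why within-study cross-validation can look better than cross-study evaluation, the sketch below trains a ridge regression on one synthetic "study" and tests it both on a held-out split of the same study and on a second study with a simulated batch effect. All names, sizes, and the batch-effect magnitude are illustrative assumptions, not the paper's models or data.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.arange(10) / 10.0           # shared true signal across studies

def make_study(n, batch_effect):
    """Simulate one screening study: features X and drug response y.
    batch_effect mimics a study-specific shift in assay conditions."""
    X = rng.normal(0.0, 1.0, size=(n, 10))
    y = X @ w_true + batch_effect + rng.normal(0.0, 0.5, size=n)
    return X, y

def ridge_fit(X, y, lam=1.0):
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def r2(y, yhat):
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

X_a, y_a = make_study(500, batch_effect=0.0)   # "study A"
X_b, y_b = make_study(500, batch_effect=1.5)   # "study B": shifted assay

w = ridge_fit(X_a[:400], y_a[:400])            # train on 80% of study A
within = r2(y_a[400:], X_a[400:] @ w)          # held-out split, same study
cross = r2(y_b, X_b @ w)                       # independent study
print(f"within-study R2 = {within:.2f}, cross-study R2 = {cross:.2f}")
```

Even in this toy setting, the cross-study R² is markedly lower than the within-study R², which is the gap the abstract's upper-bound analysis tries to quantify.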
International Journal of Pattern Recognition and Artificial Intelligence, 2015
Wind energy is scheduled on the power grid using 0–6 h ahead forecasts generated from computer simulations or historical data. When the forecasts are inaccurate, control room operators use their expertise, as well as the actual generation from previous days, to estimate the amount of energy to schedule. However, this is a challenge, and it would be useful for the operators to have additional information they can exploit to make better informed decisions. In this paper, we use techniques from time series analysis to determine if there are motifs, or frequently occurring diurnal patterns in wind generation data. We compare two different representations of the data and four different ways of identifying the number of motifs. Using data from wind farms in Tehachapi Pass and mid-Columbia Basin, we describe our findings and discuss how these motifs can be used to guide scheduling decisions.
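One common way to look for frequently occurring diurnal patterns is to discretize each day's profile into a symbolic word (a SAX-style representation) and count repeats. The sketch below uses synthetic daily profiles and standard Gaussian breakpoints; it is an illustrative assumption, not the representations or motif-counting methods actually compared in the paper.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(5)

# Synthetic daily profiles (30 days x 24 h): half follow an
# "evening ramp" shape, the rest are unstructured.
hours = np.arange(24)
ramp = np.clip((hours - 12) / 6.0, 0, 1)
days = [ramp + 0.05 * rng.normal(0, 1, 24) if i % 2 == 0
        else rng.uniform(0, 1, 24) for i in range(30)]

BREAKPOINTS = np.array([-0.6745, 0.0, 0.6745])   # Gaussian quartiles

def sax_word(profile, n_segments=6, alphabet="abcd"):
    """Discretize one day: z-normalize, piecewise means, symbol bins."""
    z = (profile - profile.mean()) / (profile.std() + 1e-9)
    segs = z.reshape(n_segments, -1).mean(axis=1)
    return "".join(alphabet[int(np.searchsorted(BREAKPOINTS, s))] for s in segs)

counts = Counter(sax_word(np.asarray(d)) for d in days)
motif, freq = counts.most_common(1)[0]
print(f"most frequent word: {motif} ({freq} of {len(days)} days)")
```

Days sharing the ramp shape collapse onto the same word, so its count stands out against the unstructured days; a real pipeline would then map frequent words back to their raw profiles.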
The move toward exascale computing for scientific simulations is placing new demands on compression techniques. It is expected that the I/O system will not be able to support the volume of data that will be written out. To enable quantitative analysis and scientific discovery, we need techniques that can compress high-dimensional simulation data with near-perfect reconstruction. In this work, we investigate Compressed Sensing (CS) to reduce the size of the data from a fusion simulation of a tokamak in 3 dimensions (Figure (a)). The computational domain of the simulation is a toroid, composed of 32 poloidal planes (shown in blue). Each plane has nearly 600,000 grid points, arranged irregularly, and distributed across multiple processors of a massively parallel system. Since these data are analyzed to understand the behavior of coherent structures (Figure (b)) over time, it is important that these structures remain unchanged after reconstruction using CS. We conducted several experiments to understand how best to apply CS to our data set. We used several metrics to investigate the effects of preprocessing, including scaling to improve the contrast in the data and thresholding to increase the sparsity. To determine the size of the compressed data that would enable near-perfect reconstruction, we evaluated the quality of reconstruction (shown in Figures (c) and (d) using the R2 metric) as we varied the percentage of compression (m/n) for various levels of sparsity (k/n) in the data. We found that a successful application of CS is bounded by the percentage of sparsity in the data - the data have to be sparse enough for compression using CS, but not so sparse that it is more cost effective to just write out the locations and values of the non-zero data points.
Our experiments also indicated that scaling the data is very helpful and thresholding helps both with compression and the coherent structure analysis performed on the data.
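As a toy illustration of the compression/sparsity trade-off described above, the sketch below compresses a synthetic k-sparse signal with Gaussian measurements (m/n = 25%) and reconstructs it with Orthogonal Matching Pursuit, scoring the result with R². The signal, sensing matrix, and solver are illustrative stand-ins for the simulation data and reconstruction pipeline used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

n, k, m = 256, 8, 64                  # signal length, sparsity, measurements
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.normal(0, 1, k)      # k non-zero entries

A = rng.normal(0, 1, (m, n)) / np.sqrt(m)   # Gaussian sensing matrix
y = A @ x                                    # compressed measurements (m/n = 25%)

def omp(A, y, k):
    """Orthogonal Matching Pursuit: greedily select the column most
    correlated with the residual, then least-squares on the support."""
    residual, idx = y.copy(), []
    coef = np.zeros(0)
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in idx:
            idx.append(j)
        coef, *_ = np.linalg.lstsq(A[:, idx], y, rcond=None)
        residual = y - A[:, idx] @ coef
    xhat = np.zeros(A.shape[1])
    xhat[idx] = coef
    return xhat

xhat = omp(A, y, k)
quality = 1 - np.sum((x - xhat) ** 2) / np.sum((x - x.mean()) ** 2)  # R^2
print(f"R^2 of reconstruction: {quality:.4f}")
```

With m well above k, reconstruction is near-perfect; shrinking m toward k (or densifying x) degrades R², which mirrors the sparsity bounds the abstract reports.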
Informatics for Materials Science and Engineering, 2013
Data mining is the process of uncovering patterns, associations, anomalies, and statistically significant structures and events in data. It borrows and builds on ideas from many disciplines, ranging from statistics to machine learning, mathematical optimization, and signal and image processing. Data mining techniques are becoming an integral part of scientific endeavors in many application domains, including astronomy, bioinformatics, chemistry, materials science, climate, fusion, and combustion. In this chapter, we provide a brief introduction to the data mining process and some of the algorithms used in extracting information from scientific data sets.
2014 13th International Conference on Machine Learning and Applications, 2014
In this paper, we formulate the problem of predicting wind generation as one of streaming data analysis. We want to understand if it is possible to use the weather data in a time window just before the current time to gain insight into how the wind generation might behave in a time interval just after the current time. Specifically, we use a singular value decomposition of the weather data, and show that the number of singular values and the largest singular value can be used to predict the magnitude of the change in the generation in the near future. The analysis uses an incremental algorithm based on a sliding window to reduce computational costs.
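A minimal sketch of the window-based spectral features described above: for each sliding window of a synthetic multivariate "weather" series, compute the singular values and the fraction of energy in the leading mode. For clarity this recomputes a full SVD per window, whereas the paper uses an incremental update; the data and window size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical weather matrix: rows = time steps, columns = variables
# (e.g. speed and direction at a few stations); a random walk for smoothness.
T, d, win = 200, 6, 24
weather = np.cumsum(rng.normal(0, 1, (T, d)), axis=0)

def window_spectrum(data, t, win):
    """Singular values of the centered window ending at time t."""
    W = data[t - win:t]
    W = W - W.mean(axis=0)           # center each variable in the window
    return np.linalg.svd(W, compute_uv=False)

for t in range(win, T, 48):
    s = window_spectrum(weather, t, win)
    energy = s[0] / s.sum()          # fraction of energy in the top mode
    print(f"t={t:3d}  sigma_max={s[0]:6.2f}  top-mode share={energy:.2f}")
```

A forecaster would track sigma_max and the effective rank over time and correlate them with subsequent changes in generation; an incremental SVD replaces the per-window recomputation in a true streaming setting.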
Gene expression profiles have been widely used to characterize patterns of cellular responses to diseases. As more data become available, scalable learning toolkits are essential for processing large datasets with deep learning models that capture complex biological processes. We present an autoencoder to capture nonlinear relationships recovered from gene expression profiles. The autoencoder is a nonlinear dimension reduction technique using an artificial neural network, which learns hidden representations of unlabeled data. We train the autoencoder on a large collection of tumor samples from the National Cancer Institute Genomic Data Commons and obtain a generalized, unsupervised latent representation. We leverage an HPC-focused deep learning toolkit, the Livermore Big Artificial Neural Network (LBANN), to efficiently parallelize the training algorithm, reducing computation times from several hours to a few minutes. With the trained autoencoder, we generate latent representations of a sm...
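The sketch below shows the general idea of such an autoencoder on synthetic stand-in data: a single tanh hidden layer trained by plain gradient descent to reconstruct its input, with the hidden activations serving as the unsupervised latent representation. This is not LBANN or the paper's architecture; all sizes and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in for expression profiles: 200 samples x 50 "genes",
# generated from a 3-dimensional latent structure plus noise.
Z = rng.normal(0, 1, (200, 3))
X = np.tanh(Z @ rng.normal(0, 1, (3, 50))) + 0.1 * rng.normal(0, 1, (200, 50))

d, h, lr = 50, 8, 0.05                # input size, latent size, learning rate
W1 = rng.normal(0, 0.1, (d, h)); b1 = np.zeros(h)
W2 = rng.normal(0, 0.1, (h, d)); b2 = np.zeros(d)

mses = []
for epoch in range(500):
    H = np.tanh(X @ W1 + b1)          # encoder
    Xhat = H @ W2 + b2                # linear decoder
    err = Xhat - X
    mses.append(float(np.mean(err ** 2)))
    # backpropagation of the squared reconstruction error
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

latent = np.tanh(X @ W1 + b1)         # unsupervised latent representation
print(f"reconstruction MSE: {mses[0]:.3f} -> {mses[-1]:.3f}")
```

The latent matrix (here 200 x 8) is the analogue of the representation the paper extracts for downstream analysis; scaling this loop to millions of parameters is what a toolkit like LBANN parallelizes.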
In this paper, we propose a new optimization framework for improving feature selection in medical data classification. We call this framework the Support Feature Machine (SFM). SFM selects the optimal group of features that shows strong separability between two classes, where separability is measured in terms of inter-class and intra-class distances. The objective of the SFM optimization model is to maximize the number of correctly classified data samples in the training set, i.e., those whose intra-class distances are smaller than their inter-class distances. This concept can be incorporated with a modified nearest neighbor rule for unbalanced data. In addition, a variation of SFM that provides feature weights (prioritization) is also presented. The proposed SFM framework and its extensions were tested on five real medical datasets related to the diagnosis of epilepsy, breast cancer, heart disease, diabetes, and liver disorders. The classification performance of SFM is compared with those of support vector machine (SVM) classification and Logical Analysis of Data (LAD), which is also an optimization-based feature selection technique. SFM gives very good classification results, yet uses far fewer features to make the decision than SVM and LAD. This has significant implications for diagnostic practice and suggests that the SFM framework can be used as a quick decision-making tool in real clinical settings.
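The separability criterion described above can be sketched as follows: on a chosen feature subset, a sample counts as correctly classified when its nearest same-class neighbor is closer than its nearest other-class neighbor. The synthetic data and feature subsets below are illustrative assumptions, not the SFM optimization model itself.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two synthetic classes: only the first 2 of 6 features are informative.
X0 = rng.normal(0, 1, (40, 6)); X0[:, :2] -= 2.0
X1 = rng.normal(0, 1, (40, 6)); X1[:, :2] += 2.0
X = np.vstack([X0, X1])
y = np.array([0] * 40 + [1] * 40)

def separability_score(X, y, feats):
    """Count samples whose intra-class nearest-neighbor distance is
    smaller than their inter-class nearest-neighbor distance."""
    Xf = X[:, feats]
    correct = 0
    for i in range(len(Xf)):
        d = np.linalg.norm(Xf - Xf[i], axis=1)
        d[i] = np.inf                # exclude the sample itself
        intra = d[y == y[i]].min()
        inter = d[y != y[i]].min()
        correct += int(intra < inter)
    return correct

print(separability_score(X, y, [0, 1]), separability_score(X, y, [4, 5]))
```

The informative subset scores far higher than the noise subset; SFM turns this count into the objective of an optimization over feature subsets rather than evaluating them by enumeration.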