To enable personalized cancer treatment, machine learning models have been developed to predict drug response as a function of tumor and drug features. However, most algorithm development efforts have relied on cross-validation within a single study to assess model accuracy. While an essential first step, cross-validation within a biological data set typically provides an overly optimistic estimate of the prediction performance on independent test sets. To provide a more rigorous assessment of model generalizability between different studies, we use machine learning to analyze five publicly available cell line-based data sets: National Cancer Institute 60, Cancer Therapeutics Response Portal (CTRP), Genomics of Drug Sensitivity in Cancer, Cancer Cell Line Encyclopedia and Genentech Cell Line Screening Initiative (gCSI). Based on observed experimental variability across studies, we explore estimates of prediction upper bounds. We report performance results of a variety of machine lear...
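As a minimal illustration of why within-study cross-validation can look better than cross-study evaluation, the sketch below trains a ridge regression on one synthetic "study" and tests it both on a held-out split of the same study and on a second study with a simulated batch effect. All names, sizes, and the batch-effect magnitude are illustrative assumptions, not the paper's models or data.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.arange(10) / 10.0           # shared true signal across studies

def make_study(n, batch_effect):
    """Simulate one screening study: features X and drug response y.
    batch_effect mimics a study-specific shift in assay conditions."""
    X = rng.normal(0.0, 1.0, size=(n, 10))
    y = X @ w_true + batch_effect + rng.normal(0.0, 0.5, size=n)
    return X, y

def ridge_fit(X, y, lam=1.0):
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def r2(y, yhat):
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

X_a, y_a = make_study(500, batch_effect=0.0)   # "study A"
X_b, y_b = make_study(500, batch_effect=1.5)   # "study B": shifted assay

w = ridge_fit(X_a[:400], y_a[:400])            # train on 80% of study A
within = r2(y_a[400:], X_a[400:] @ w)          # held-out split, same study
cross = r2(y_b, X_b @ w)                       # independent study
print(f"within-study R2 = {within:.2f}, cross-study R2 = {cross:.2f}")
```

Even in this toy setting, the cross-study R² is markedly lower than the within-study R², which is the gap the abstract's upper-bound analysis tries to quantify.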
International Journal of Pattern Recognition and Artificial Intelligence, 2015
Wind energy is scheduled on the power grid using 0–6 h ahead forecasts generated from computer simulations or historical data. When the forecasts are inaccurate, control room operators use their expertise, as well as the actual generation from previous days, to estimate the amount of energy to schedule. However, this is a challenge, and it would be useful for the operators to have additional information they can exploit to make better informed decisions. In this paper, we use techniques from time series analysis to determine if there are motifs, or frequently occurring diurnal patterns in wind generation data. We compare two different representations of the data and four different ways of identifying the number of motifs. Using data from wind farms in Tehachapi Pass and mid-Columbia Basin, we describe our findings and discuss how these motifs can be used to guide scheduling decisions.
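One common way to look for frequently occurring diurnal patterns is to discretize each day's profile into a symbolic word (a SAX-style representation) and count repeats. The sketch below uses synthetic daily profiles and standard Gaussian breakpoints; it is an illustrative assumption, not the representations or motif-counting methods actually compared in the paper.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(5)

# Synthetic daily profiles (30 days x 24 h): half follow an
# "evening ramp" shape, the rest are unstructured.
hours = np.arange(24)
ramp = np.clip((hours - 12) / 6.0, 0, 1)
days = [ramp + 0.05 * rng.normal(0, 1, 24) if i % 2 == 0
        else rng.uniform(0, 1, 24) for i in range(30)]

BREAKPOINTS = np.array([-0.6745, 0.0, 0.6745])   # Gaussian quartiles

def sax_word(profile, n_segments=6, alphabet="abcd"):
    """Discretize one day: z-normalize, piecewise means, symbol bins."""
    z = (profile - profile.mean()) / (profile.std() + 1e-9)
    segs = z.reshape(n_segments, -1).mean(axis=1)
    return "".join(alphabet[int(np.searchsorted(BREAKPOINTS, s))] for s in segs)

counts = Counter(sax_word(np.asarray(d)) for d in days)
motif, freq = counts.most_common(1)[0]
print(f"most frequent word: {motif} ({freq} of {len(days)} days)")
```

Days sharing the ramp shape collapse onto the same word, so its count stands out against the unstructured days; a real pipeline would then map frequent words back to their raw profiles.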
The move toward exascale computing for scientific simulations is placing new demands on compression techniques. It is expected that the I/O system will not be able to support the volume of data that will be written out. To enable quantitative analysis and scientific discovery, we need techniques that can compress high-dimensional simulation data with near-perfect reconstruction. In this work, we investigate Compressed Sensing (CS) to reduce the size of the data from a fusion simulation of a tokamak in 3 dimensions (Figure (a)). The computational domain of the simulation is a toroid, composed of 32 poloidal planes (shown in blue). Each plane has nearly 600,000 grid points, arranged irregularly, and distributed across multiple processors of a massively parallel system. Since these data are analyzed to understand the behavior of coherent structures (Figure (b)) over time, it is important that these structures remain unchanged after reconstruction using CS. We conducted several experiments to understand how best to apply CS to our data set. We used several metrics to investigate the effects of preprocessing, including scaling to improve the contrast in the data and thresholding to increase the sparsity. To determine the size of the compressed data that would enable near-perfect reconstruction, we evaluated the quality of reconstruction (shown in Figures (c) and (d) using the R2 metric) as we varied the percentage of compression (m/n) for various levels of sparsity (k/n) in the data. We found that a successful application of CS is bounded by the percentage of sparsity in the data - the data have to be sparse enough for compression using CS, but not so sparse that it is more cost effective to just write out the locations and values of the non-zero data points.
Our experiments also indicated that scaling the data is very helpful and thresholding helps both with compression and the coherent structure analysis performed on the data.
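As a toy illustration of the compression/sparsity trade-off described above, the sketch below compresses a synthetic k-sparse signal with Gaussian measurements (m/n = 25%) and reconstructs it with Orthogonal Matching Pursuit, scoring the result with R². The signal, sensing matrix, and solver are illustrative stand-ins for the simulation data and reconstruction pipeline used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

n, k, m = 256, 8, 64                  # signal length, sparsity, measurements
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.normal(0, 1, k)      # k non-zero entries

A = rng.normal(0, 1, (m, n)) / np.sqrt(m)   # Gaussian sensing matrix
y = A @ x                                    # compressed measurements (m/n = 25%)

def omp(A, y, k):
    """Orthogonal Matching Pursuit: greedily select the column most
    correlated with the residual, then least-squares on the support."""
    residual, idx = y.copy(), []
    coef = np.zeros(0)
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in idx:
            idx.append(j)
        coef, *_ = np.linalg.lstsq(A[:, idx], y, rcond=None)
        residual = y - A[:, idx] @ coef
    xhat = np.zeros(A.shape[1])
    xhat[idx] = coef
    return xhat

xhat = omp(A, y, k)
quality = 1 - np.sum((x - xhat) ** 2) / np.sum((x - x.mean()) ** 2)  # R^2
print(f"R^2 of reconstruction: {quality:.4f}")
```

With m well above k, reconstruction is near-perfect; shrinking m toward k (or densifying x) degrades R², which mirrors the sparsity bounds the abstract reports.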
Informatics for Materials Science and Engineering, 2013
Data mining is the process of uncovering patterns, associations, anomalies, and statistically significant structures and events in data. It borrows and builds on ideas from many disciplines, ranging from statistics to machine learning, mathematical optimization, and signal and image processing. Data mining techniques are becoming an integral part of scientific endeavors in many application domains, including astronomy, bioinformatics, chemistry, materials science, climate, fusion, and combustion. In this chapter, we provide a brief introduction to the data mining process and some of the algorithms used in extracting information from scientific data sets.
2014 13th International Conference on Machine Learning and Applications, 2014
In this paper, we formulate the problem of predicting wind generation as one of streaming data analysis. We want to understand if it is possible to use the weather data in a time window just before the current time to gain insight into how the wind generation might behave in a time interval just after the current time. Specifically, we use a singular value decomposition of the weather data, and show that the number of singular values and the largest singular value can be used to predict the magnitude of the change in the generation in the near future. The analysis uses an incremental algorithm based on a sliding window to reduce computational costs.
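A minimal sketch of the window-based spectral features described above: for each sliding window of a synthetic multivariate "weather" series, compute the singular values and the fraction of energy in the leading mode. For clarity this recomputes a full SVD per window, whereas the paper uses an incremental update; the data and window size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical weather matrix: rows = time steps, columns = variables
# (e.g. speed and direction at a few stations); a random walk for smoothness.
T, d, win = 200, 6, 24
weather = np.cumsum(rng.normal(0, 1, (T, d)), axis=0)

def window_spectrum(data, t, win):
    """Singular values of the centered window ending at time t."""
    W = data[t - win:t]
    W = W - W.mean(axis=0)           # center each variable in the window
    return np.linalg.svd(W, compute_uv=False)

for t in range(win, T, 48):
    s = window_spectrum(weather, t, win)
    energy = s[0] / s.sum()          # fraction of energy in the top mode
    print(f"t={t:3d}  sigma_max={s[0]:6.2f}  top-mode share={energy:.2f}")
```

A forecaster would track sigma_max and the effective rank over time and correlate them with subsequent changes in generation; an incremental SVD replaces the per-window recomputation in a true streaming setting.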
Gene expression profiles have been widely used to characterize patterns of cellular responses to diseases. As more data become available, scalable learning toolkits are essential for processing large datasets with deep learning models that capture complex biological processes. We present an autoencoder to capture nonlinear relationships recovered from gene expression profiles. The autoencoder is a nonlinear dimension reduction technique using an artificial neural network, which learns hidden representations of unlabeled data. We train the autoencoder on a large collection of tumor samples from the National Cancer Institute Genomic Data Commons and obtain a generalized, unsupervised latent representation. We leverage an HPC-focused deep learning toolkit, the Livermore Big Artificial Neural Network (LBANN), to efficiently parallelize the training algorithm, reducing computation times from several hours to a few minutes. With the trained autoencoder, we generate latent representations of a sm...
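The sketch below shows the general idea of such an autoencoder on synthetic stand-in data: a single tanh hidden layer trained by plain gradient descent to reconstruct its input, with the hidden activations serving as the unsupervised latent representation. This is not LBANN or the paper's architecture; all sizes and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in for expression profiles: 200 samples x 50 "genes",
# generated from a 3-dimensional latent structure plus noise.
Z = rng.normal(0, 1, (200, 3))
X = np.tanh(Z @ rng.normal(0, 1, (3, 50))) + 0.1 * rng.normal(0, 1, (200, 50))

d, h, lr = 50, 8, 0.05                # input size, latent size, learning rate
W1 = rng.normal(0, 0.1, (d, h)); b1 = np.zeros(h)
W2 = rng.normal(0, 0.1, (h, d)); b2 = np.zeros(d)

mses = []
for epoch in range(500):
    H = np.tanh(X @ W1 + b1)          # encoder
    Xhat = H @ W2 + b2                # linear decoder
    err = Xhat - X
    mses.append(float(np.mean(err ** 2)))
    # backpropagation of the squared reconstruction error
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

latent = np.tanh(X @ W1 + b1)         # unsupervised latent representation
print(f"reconstruction MSE: {mses[0]:.3f} -> {mses[-1]:.3f}")
```

The latent matrix (here 200 x 8) is the analogue of the representation the paper extracts for downstream analysis; scaling this loop to millions of parameters is what a toolkit like LBANN parallelizes.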
In this paper, we propose a new optimization framework for improving feature selection in medical data classification. We call this framework the Support Feature Machine (SFM). SFM selects the optimal group of features that shows strong separability between two classes, where separability is measured in terms of inter-class and intra-class distances. The objective of the SFM optimization model is to maximize the number of correctly classified data samples in the training set, i.e., those whose intra-class distances are smaller than their inter-class distances. This concept can be incorporated with a modified nearest neighbor rule for unbalanced data. In addition, a variation of SFM that provides feature weights (prioritization) is also presented. The proposed SFM framework and its extensions were tested on five real medical datasets related to the diagnosis of epilepsy, breast cancer, heart disease, diabetes, and liver disorders. The classification performance of SFM is compared with those of support vector machine (SVM) classification and Logical Analysis of Data (LAD), which is also an optimization-based feature selection technique. SFM gives very good classification results, yet uses far fewer features to make the decision than SVM and LAD. This has significant implications for diagnostic practice and suggests that the SFM framework can be used as a quick decision-making tool in real clinical settings.
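The separability criterion described above can be sketched as follows: on a chosen feature subset, a sample counts as correctly classified when its nearest same-class neighbor is closer than its nearest other-class neighbor. The synthetic data and feature subsets below are illustrative assumptions, not the SFM optimization model itself.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two synthetic classes: only the first 2 of 6 features are informative.
X0 = rng.normal(0, 1, (40, 6)); X0[:, :2] -= 2.0
X1 = rng.normal(0, 1, (40, 6)); X1[:, :2] += 2.0
X = np.vstack([X0, X1])
y = np.array([0] * 40 + [1] * 40)

def separability_score(X, y, feats):
    """Count samples whose intra-class nearest-neighbor distance is
    smaller than their inter-class nearest-neighbor distance."""
    Xf = X[:, feats]
    correct = 0
    for i in range(len(Xf)):
        d = np.linalg.norm(Xf - Xf[i], axis=1)
        d[i] = np.inf                # exclude the sample itself
        intra = d[y == y[i]].min()
        inter = d[y != y[i]].min()
        correct += int(intra < inter)
    return correct

print(separability_score(X, y, [0, 1]), separability_score(X, y, [4, 5]))
```

The informative subset scores far higher than the noise subset; SFM turns this count into the objective of an optimization over feature subsets rather than evaluating them by enumeration.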