Using Deep Neural Networks To Automate Large Scale Statistical Analysis For Big Data Applications
Abstract
Statistical analysis (SA) is a complex process for deducing population properties from data. It usually takes a well-trained analyst to perform SA successfully, and it becomes extremely challenging to apply SA in big data applications. We propose to use deep neural networks to automate the SA process. In particular, we propose to construct convolutional neural networks (CNNs) to perform automatic model selection and parameter estimation, two of the most important SA tasks. We refer to the resulting CNNs as the neural model selector and the neural parameter estimator, respectively; both can be properly trained using labeled data systematically generated from the candidate models. Simulation studies show that both the selector and the estimator demonstrate excellent performance. The idea and the proposed framework can be further extended to automate the entire SA process, and they have the potential to revolutionize how SA is performed in big data analytics.
Keywords: statistical analysis, deep network, model selection, parameter estimation, convolutional neural network, big data.
1. Introduction
According to the definitions of Gartner Release (2014) and De Mauro et al. (2016), big data refer to information assets characterized by high volume, high velocity, and/or high variety, and the transformation of big data into value requires specific analytical methods. Currently, machine learning methods are used as the main tools for big data analytics.
∗ Correspondence should be sent to Michael Yu Zhu: yuzhu@purdue.edu
The two CNNs can be entirely separate, almost identical, or partially joint, leading to different performance in training as well as in application. We carry out extensive simulation studies and show that the proposed neural model selector and parameter estimator can be properly trained, and that the trained CNNs demonstrate excellent performance on test data and in a real data application. The idea and the proposed framework can be further extended to the entire SA process, with the potential to change how SA is done in conventional data analysis and big data analytics.
2. Related work
There exists an extensive statistical literature on model selection; see Bozdogan (1987) and Burnham and Anderson (2003, 2004). Numerous model selection methods have been proposed. Some of these methods are not applicable to the setting we consider in this paper, while others, though applicable, may run into various difficulties; see the discussion in Section 3 for details. To the best of our knowledge, there is no prior work on recasting the model selection problem as a machine learning classification problem and training CNNs to learn and perform model selection with labeled simulated data.
There also exists a variety of statistical methods for parameter estimation in the literature; see Casella and Berger (2002), Huber (1964), and Norton et al. (2010). Most statistical methods rely on full or partial knowledge of the model and are based on statistical principles. After conducting an intensive literature search, we found only one paper, Xie et al. (2007), in which the authors proposed to use artificial neural networks and simulated data to construct estimators for the parameters of a stochastic differential equation. However, to the best of our knowledge, the idea of using CNNs and simulated data to automate parameter estimation and model selection, and to bring AI to the general SA process, appears to be novel.
3. Proposed approach
As discussed in the Introduction, we first reformulate model selection and parameter estimation as a machine learning problem. Let M = {Mk : 1 ≤ k ≤ K} be a collection of K prespecified models/distributions. Let f(y | θk, Mk) be the density function of model Mk, where θk ∈ Θk is the scalar parameter of the density function. Assume that we have a random sample {yj}1≤j≤N from one of the models, but we do not know the data-generating model or its parameter. The goal of statistical analysis is to identify the model and further estimate its parameter.
To achieve the analysis goal stated above, a statistician would conventionally employ various model selection methods together with some estimation method. Here, we briefly discuss several representative approaches: the Kolmogorov-Smirnov (KS) distance Chakravarti et al. (1967), the Bayesian Information Criterion (BIC) Schwarz (1978), and the Bayes factor Kass and Raftery (1995). The KS distance method calculates, for each model, the distance between the model's cumulative distribution function (CDF) and the empirical CDF based on the sample {yj}. The model that achieves the minimum distance is selected as the true model. The BIC criterion calculates the BIC score for each model as follows:
$$\mathrm{BIC}(M_k) = -2 \log L(\hat{\theta}_k \mid \{y_j\}, M_k) + p \log N,$$
where L(·) is the likelihood function, θ̂k is the maximum likelihood estimate, and p is the number of parameters in model Mk. Note that in the scenario considered here, p = 1. The model that achieves the minimum BIC score is selected as the true model.
The Bayes factor method imposes a prior distribution π(Mk) on the models and further imposes a prior distribution π(θk) on the parameters. Given the sample, the posterior probability of each model, denoted π(Mk | {yj}), can then be calculated. The Bayes factor between any two models Mk1 and Mk2 can be computed as BF(Mk1, Mk2) = π(Mk1 | {yj})/π(Mk2 | {yj}), which can be used to discriminate between the two models. The model that the Bayes factors support most strongly is selected as the true model.
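To make these baselines concrete, below is a minimal sketch of KS- and BIC-based selection over a small candidate set. The candidate families and the use of SciPy's generic fit routine are illustrative assumptions rather than the setup used in the paper; note also that len(params) counts location/scale parameters, whereas the scenario above has p = 1.

```python
import numpy as np
from scipy import stats

# Hypothetical candidate set M: each entry is a scipy.stats family.
CANDIDATES = {
    "normal": stats.norm,
    "exponential": stats.expon,
    "gamma": stats.gamma,
}

def select_by_ks(sample):
    """Select the model whose fitted CDF minimizes the KS distance."""
    best_name, best_d = None, np.inf
    for name, family in CANDIDATES.items():
        params = family.fit(sample)                      # MLE fit of the family
        d = stats.kstest(sample, family(*params).cdf).statistic
        if d < best_d:
            best_name, best_d = name, d
    return best_name

def select_by_bic(sample):
    """Select the model minimizing BIC = -2 log L(theta_hat) + p log N."""
    n = len(sample)
    best_name, best_bic = None, np.inf
    for name, family in CANDIDATES.items():
        params = family.fit(sample)
        loglik = family.logpdf(sample, *params).sum()    # log-likelihood at the MLE
        bic = -2.0 * loglik + len(params) * np.log(n)
        if bic < best_bic:
            best_name, best_bic = name, bic
    return best_name

sample = np.random.default_rng(0).exponential(2.0, size=400)
print(select_by_ks(sample), select_by_bic(sample))
```

As the sketch makes plain, both baselines must fit and score every candidate model for every new sample, which is precisely the computational burden criticized next.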
Our criticism of the conventional statistical approaches discussed above is two-fold. First, for the goal of automating model selection, the model set M usually consists of a large number of candidate models of great variety. Conventional statistical methods such as the KS distance and BIC only work for selection among nested or otherwise well-structured models. Second, for a given sample, all the conventional methods must calculate a score for each of the candidate models and then compare these scores to pick the winner. This can become computationally intensive or even intractable, especially for the Bayes factor approach. Similar criticism applies to using conventional statistical methods for automating parameter estimation, which we omit due to space limitations.
In this section, we instead propose to use CNNs and machine learning to automate model selection and parameter estimation. Our main idea is that the procedures for model selection and parameter estimation can be considered mappings from the sample to a model and a value of the model parameter, that is,

$$G : \{y_j\} \longmapsto \begin{pmatrix} G_1(\{y_j\}) \\ G_2(\{y_j\}) \end{pmatrix} \in \mathcal{M} \times \Theta,$$

where G = (G1, G2) consists of the model selection mapping G1 and the parameter estimation mapping G2, and Θ is the parameter space. Instead of using statistical principles to derive G1 and G2, we propose to use CNNs to approximate them. In the rest of the paper, we refer to G1 as the neural model selector and to G2 as the neural parameter estimator, as discussed in the Introduction.
Under FSA, the underlying assumption is that a set of minimal sufficient statistics can serve as the set of common features (for example, the sample mean and variance for a Gaussian sample). This assumption, however, may not hold in general. Therefore, FSA is expected to work well under one set of candidate models but may fail under another.
The most promising architecture is PSA. The intuition underlying PSA is that the early convolutional layers produce low-level features that are common to both model selection and parameter estimation, so that information in the true model label and parameter values can be shared. Because model selection and parameter estimation are two different tasks, we should not expect them to rely on the same set of high-level features. Our simulation studies reported in later sections support this intuition. In terms of training, PSA is more demanding than the other two architectures. Furthermore, PSA raises another important issue, namely, how many convolutional layers should be shared by G1 and G2. We investigate this issue in the next section; a sketch of the architecture is given below.
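As an illustration, here is a minimal PyTorch sketch of a PSA-style network with a configurable number of shared 1-D convolutional blocks. The paper's experiments are implemented in Caffe; the layer widths, pooling scheme, and the choice of the raw sample as input are hypothetical choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PSANet(nn.Module):
    """Partially shared architecture (PSA): the first `shared` conv blocks
    are common to both tasks; the model selector G1 (a K-way classifier)
    and the parameter estimator G2 (a scalar regressor) then branch off."""

    def __init__(self, num_models: int, shared: int = 2, total: int = 4):
        super().__init__()

        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool1d(2),
            )

        chans = [1] + [32 * 2 ** i for i in range(total)]
        self.shared = nn.Sequential(*[block(chans[i], chans[i + 1]) for i in range(shared)])
        self.sel_branch = nn.Sequential(*[block(chans[i], chans[i + 1]) for i in range(shared, total)])
        self.est_branch = nn.Sequential(*[block(chans[i], chans[i + 1]) for i in range(shared, total)])
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.sel_head = nn.Linear(chans[total], num_models)  # logits over M
        self.est_head = nn.Linear(chans[total], 1)           # estimate of theta

    def forward(self, y):  # y: (batch, 1, N), one row per sample {y_j}
        h = self.shared(y)
        logits = self.sel_head(self.pool(self.sel_branch(h)).flatten(1))
        theta = self.est_head(self.pool(self.est_branch(h)).flatten(1))
        return logits, theta

# Usage: a PSA-2 network for K = 50 candidate models and samples of size N = 100.
net = PSANet(num_models=50, shared=2)
logits, theta = net(torch.randn(8, 1, 100))
```

In this parameterization, shared = 0 corresponds to NSA (two entirely separate branches), while shared = total corresponds to FSA up to the output heads.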
4. Simulation results
In this section, we conduct simulation studies to demonstrate the properties and performance of the proposed model selector and parameter estimator, and we further compare them with several conventional statistical methods. Due to space limitations, we emphasize results on model selection rather than parameter estimation; the latter can be found in the Supplementary Document.
with pooling layers inserted between some consecutive convolutional layers. The same CNN architecture is used for both the neural model selector and the parameter estimator, except for the output layers.
Under each combination of SA architecture (NSA, FSA, PSA), CNN architecture (small, medium, large), number of candidate models (K = 5, 20, 50), and sample size (N = 100, 400, 900), we use the generated labeled data to train, validate, and test the proposed neural selector and parameter estimator. For PSA, we further vary the number of shared layers (l) in training. Each training run is replicated six times to assess the stability of the training procedure and results. All training is performed using the Caffe implementation Jia et al. (2014) on one GTX-1080 GPU. The running time of each training run ranges from five minutes to one hour, depending on the values of K and N. A sketch of one joint training step is given below.
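To make the joint training concrete, the following is a minimal sketch of one training epoch for the two-headed network sketched in Section 3, written in PyTorch rather than Caffe. The loss weight lam and the data-loader interface are hypothetical, and the actual Caffe solver settings used in the paper may differ.

```python
import torch.nn.functional as F

def train_epoch(net, loader, opt, lam=1.0, device="cpu"):
    """One epoch of joint training: cross-entropy for the model selector
    plus a (weighted) Huber loss for the parameter estimator."""
    net.train()
    for y, label, theta in loader:  # y: (batch, 1, N); label: long; theta: float
        y, label, theta = y.to(device), label.to(device), theta.to(device)
        logits, theta_hat = net(y)
        loss = F.cross_entropy(logits, label) \
             + lam * F.huber_loss(theta_hat.squeeze(1), theta)
        opt.zero_grad()
        loss.backward()
        opt.step()
```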
Table 1: Model selection results under all the combinations of SA architecture, CNN architecture, number of candidate models K, and sample size N.
Sharing information between the model selector and parameter estimator can expedite their training. Similar patterns can be found in other scenarios, as shown in Figures S2, S3, and S4 in the Supplementary Document.
How many layers should be shared? Figure 3 shows the impact of the number of shared layers between the model selector and parameter estimator on their performance. We consider the scenario with K = 50 and N = 100 under the medium and large CNN architectures, and vary the SA architecture from NSA to FSA. The left panel of Figure 3 presents boxplots of the accuracy of the model selector under the various SA architectures, whereas the right panel presents boxplots of the Huber loss of the parameter estimator. In terms of model selection accuracy, for the medium CNN architecture, PSA-1 shows a significant improvement over NSA, and PSA-2 further improves upon PSA-1, though the improvement from PSA-1 to PSA-2 is small. PSA-3 performs almost the same as PSA-2, and further increasing the number of shared layers leads to a slight decrease in selection accuracy. In terms of estimation accuracy (i.e., the Huber loss), we observe patterns similar to those for the selection accuracy under NSA, PSA-1, and PSA-2. As the number of shared layers further increases, the estimation accuracy declines fairly fast. These results suggest that the PSA-2 architecture is optimal for both the model selector and the parameter estimator under the medium CNN architecture. For the large CNN architecture, the optimal SA architecture turns out to be PSA-5 instead.

Figure 2: Comparison between the NSA and PSA-l neural model selector and parameter estimator; different colours denote different sample sizes. The upper panel is for the medium CNN architecture and the lower panel for the large CNN architecture.
Figure 3: Information-sharing comparison for the medium and large CNN architectures, with K = 50 and N = 100. The upper panel is for the medium CNN architecture and the lower panel for the large CNN architecture.
In such settings, conventional statistical methods are not applicable, but the neural parameter estimator can still work well.
Table 2: Comparison of model selection methods on a model set with K = 20.
5. Extension to regression models

The proposed framework can be extended to handle more sophisticated models. In this section, we extend the neural selector to a group of commonly used simple regression models.
Let the model set M include the following seven regression models: the simple linear regression model, the Poisson regression model, the logistic regression model, the negative binomial regression model, the lognormal regression model, the loglinear regression model, and the multinomial regression model. Let {(yj, xj)}1≤j≤N be a sample generated from one of the seven models. As before, the neural model selector is a CNN-based classifier that maps the sample to its generating model, and we use labeled data systematically generated from the seven models to train it.
The labeled data are generated as follows. For each regression model, we place an evenly spaced grid over its parameter space. For each vector of parameter values on the grid, 1000 samples of size N are randomly drawn from the model. The generated data are further partitioned into 70% for training, 20% for validation, and 10% for testing. We use the medium CNN architecture, employ Caffe to train the model selector, and test the performance of the trained selector on the test dataset. The results show that the trained model selector achieves 87.86% accuracy when the sample size is 100 and 97.86% accuracy when the sample size is 400. A sketch of the generation scheme is given below.
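The following is a minimal sketch of this generation scheme, assuming two of the seven families (linear and Poisson regression) for brevity; the parameter grids and the covariate distribution are hypothetical choices, since the paper does not fix them here.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_linear(beta, N):
    """Simple linear regression: y = b0 + b1*x + Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, N)
    return np.column_stack([beta[0] + beta[1] * x + rng.normal(0.0, 1.0, N), x])

def draw_poisson(beta, N):
    """Poisson regression with a log link: y ~ Poisson(exp(b0 + b1*x))."""
    x = rng.uniform(-1.0, 1.0, N)
    return np.column_stack([rng.poisson(np.exp(beta[0] + beta[1] * x)), x])

MODELS = [draw_linear, draw_poisson]
GRID = [(b0, b1) for b0 in np.linspace(-2, 2, 5) for b1 in np.linspace(-2, 2, 5)]

def generate(N=100, reps=1000):
    """1000 size-N samples per grid point, labeled by the generating model."""
    X, labels = [], []
    for label, draw in enumerate(MODELS):
        for beta in GRID:
            for _ in range(reps):
                X.append(draw(beta, N))
                labels.append(label)
    return np.stack(X), np.array(labels)

# 70/20/10 train/validation/test split via a random permutation.
X, labels = generate()
idx = rng.permutation(len(X))
n_tr, n_va = int(0.7 * len(X)), int(0.9 * len(X))
train, val, test = idx[:n_tr], idx[n_tr:n_va], idx[n_va:]
```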
References
Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
Hamparsum Bozdogan. Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3):345–370, 1987.
Kenneth P Burnham and David R Anderson. Model selection and multimodel inference: a
practical information-theoretic approach. Springer Science & Business Media, 2003.
Kenneth P Burnham and David R Anderson. Multimodel inference: Understanding AIC and BIC in model selection. Sociological Methods & Research, 33(2):261–304, 2004.
George Casella and Roger L Berger. Statistical inference, volume 2. Duxbury, Pacific Grove, CA, 2002.
I.M. Chakravarti, R.G. Laha, and J. Roy. Handbook of methods of applied statistics, volume 1. Wiley Series in Probability and Mathematical Statistics. Wiley, 1967. URL https://books.google.com.hk/books?id=vtI-AAAAIAAJ.
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies
for accurate object detection and semantic segmentation. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 580–587, 2014.
Christian Hennig. smoothmest: Smoothed M-estimators for 1-dimensional location, 2012.
URL https://CRAN.R-project.org/package=smoothmest. R package version 0.1-2.
Peter J Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Gir-
shick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for
fast feature embedding. In Proceedings of the 22nd ACM international conference on
Multimedia, pages 675–678. ACM, 2014.
Iain M. Johnstone, Zongming Ma, Patrick O. Perry, and Morteza Shahram. RMTstat:
Distributions, Statistics and Tests derived from Random Matrix Theory, 2014. R package
version 0.3.
Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Andrea De Mauro, Marco Greco, and Michele Grimaldi. A formal definition of big data based on its essential features. Library Review, 65(3):122–135, 2016. doi: 10.1108/LR-06-2015-0061. URL http://dx.doi.org/10.1108/LR-06-2015-0061.
Gartner Press Release. Gartner says the Internet of Things will transform the data center. Retrieved from http://www.gartner.com/newsroom/id/2684616, 2014.
Brian D Ripley. Pattern recognition and neural networks. Cambridge university press, 2007.
Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978. doi: 10.1214/aos/1176344136. URL http://dx.doi.org/10.1214/aos/1176344136.
Bruce Swihart and Jim Lindsey. rmutil: Utilities for Nonlinear Regression and Repeated
Measurements Models, 2016. URL https://CRAN.R-project.org/package=rmutil. R
package version 1.1.0.
Thomas W. Yee. VGAM: Vector Generalized Linear and Additive Models, 2017. URL
https://CRAN.R-project.org/package=VGAM. R package version 1.0-3.
Appendix A.
Table 3: Parameter estimation results under all the combinations of SA architecture, CNN architecture, number of candidate models K, and sample size N. The Huber loss, with the standard deviation over six repeated runs in parentheses, is reported; the better result between the NSA and PSA architectures is shown in bold. For PSA, we report the best results based on the layer analysis, namely PSA-3, PSA-2, and PSA-5 for the small, medium, and large CNN architectures, respectively. PSA performs better than NSA in most cases.
Appendix B.
Figure 5: Confusion matrix of the PSA-5 neural model selector with the large CNN architecture on the test dataset with K = 20.