Exploratory Data Analysis
Module III: Deep Dive
Dr. Mark Williamson
DaCCoTA
University of North Dakota
Introduction
Previously:
• Covered a broad overview
• Looked at more detail
• Ran through examples
This time: look at more advanced techniques of exploratory data analysis
• Visualizing more dimensions
• Model selection
• Complex plots
Reviewing the Basics
Exploratory Data Analysis
• View data
• Summary statistics
• Basic graphs
• Basic tests
Topics Covered
• Visualizing more dimensions
• Bubble graph
• 3D scatterplot
• Principal Components Analysis (PCA)
• Variable selection
• Model selection
• Complex plots
• Scatterplot Matrix
• Multiple bar plots
• Scatterplot with factors
Visualizing More Dimensions
• Three numerical variables on one graph
• Bubble graph
• Typical X-Y scatterplot of first two variables
• Third variable is shown by the size of the point (bubble)
• 3D scatterplot
• Scatter plot runs in 3 dimensions
• X, Y, and Z
• Many numerical variables on one graph
• Principal Components Analysis (PCA)
Visualizing More Dimensions
Assessment 1
1) What does this picture tell us about the relationship between height, weight and age?
Answer: As one variable increases (such as age), the other two tend to do so as well.
2) Based on the PCA summary of 6 car variables and the plot, how many components capture the majority of variance?
Answer: 2 (together, they cover almost 90%)
3) Which of the following variables could be colored in a 3D plot: height, weight, age, college major
Answer: College major
Importance of components:
                         PC1     PC2     PC3      PC4      PC5     PC6
Standard deviation       2.0463  1.0715  0.57737  0.39289  0.3533  0.22799
Proportion of Variance   0.6979  0.1913  0.05556  0.02573  0.0208  0.00866
Cumulative Proportion    0.6979  0.8892  0.94481  0.97054  0.9913  1.00000
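The rows of the PCA summary are related by simple arithmetic, which can be checked by hand. The sketch below is a minimal illustration in Python (an assumption; the slides themselves refer to R output). It squares each standard deviation to get a variance, divides by the total to get the proportion of variance, and accumulates those proportions. The helper `components_needed` is a hypothetical name introduced only for this example.

```python
# Standard deviations of the six principal components, copied from the
# "Importance of components" table above.
std_devs = [2.0463, 1.0715, 0.57737, 0.39289, 0.3533, 0.22799]

# The variance of each component is its standard deviation squared.
variances = [sd ** 2 for sd in std_devs]
total_variance = sum(variances)

# Proportion of variance: each component's share of the total.
proportions = [v / total_variance for v in variances]

# Cumulative proportion: running sum of the shares.
cumulative = []
running_total = 0.0
for p in proportions:
    running_total += p
    cumulative.append(running_total)


def components_needed(threshold):
    """Return how many leading components reach `threshold`
    (a fraction between 0 and 1) of the total variance.
    Hypothetical helper, not part of the slides."""
    for count, cum in enumerate(cumulative, start=1):
        if cum >= threshold:
            return count
    return len(cumulative)


for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{i}: proportion={p:.4f}  cumulative={c:.4f}")
print("Components needed for 85% of variance:", components_needed(0.85))
```

The 85% threshold is an arbitrary choice for illustration; the slides only ask which components capture the majority of the variance.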
Variable selection
• Model Selection
• Lots of predictor variables
• Need to trim down to only ones that are important
• Can’t try every combination if there are lots of variables
• Forwards, backwards, and stepwise selection
Full model: Resp = Pred1 + Pred2 + Pred3 + Pred4 + Pred5 + Pred6
M1: Resp = Pred1
M2: Resp = Pred2
M3: Resp = Pred3
M4: Resp = Pred4
M5: Resp = Pred5
M6: Resp = Pred6
M7: Resp = Pred1 + Pred2
M8: Resp = Pred1 + Pred3
M9: Resp = Pred1 + Pred4
M10: Resp = Pred1 + Pred5
M11: Resp = Pred1 + Pred6
M12: Resp = Pred2 + Pred3
….
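To see why trying every combination becomes impractical, the list of candidate models can be generated and counted. This is a rough sketch in Python (an assumption; the slides work in R), using the placeholder names Resp and Pred1 to Pred6 from the slide. The function `count_models` is a hypothetical helper, and the count covers main-effect models only, with no interaction terms.

```python
from itertools import combinations

# Placeholder predictor names taken from the slide.
predictors = [f"Pred{i}" for i in range(1, 7)]

# Build every non-empty subset of predictors, smallest models first.
models = []
for size in range(1, len(predictors) + 1):
    for subset in combinations(predictors, size):
        models.append("Resp = " + " + ".join(subset))

# The first twelve match M1 to M12 on the slide.
for number, formula in enumerate(models[:12], start=1):
    print(f"M{number}: {formula}")
print("Total candidate models with 6 predictors:", len(models))


def count_models(n_predictors):
    """Number of non-empty predictor subsets: 2^n - 1.
    Hypothetical helper for illustration."""
    return 2 ** n_predictors - 1


print("Total candidate models with 20 predictors:", count_models(20))
```

The count doubles with each added predictor, which is the motivation for forwards, backwards, and stepwise selection: they search a small part of this space rather than fitting every model.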
Variable selection
Assessment 2
1) Based on the graph to the right, which model has the best fit for MSRP?
Answer: Model with 6 variables (up to Cylinders)
2) How will the stepwise selection method select a model?
a) Start with the null model and add until best fit
b) Start with the full model and subtract until best fit
c) Start with the null model and add or subtract until best fit
d) Start with the null model and subtract or add until best fit
Answer: c) Start with null and add/subtract
3) Should you use the R-squared to compare models?
Answer: Only for single-variable models; otherwise it is biased, because adding more variables never lowers R-squared
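One common way to account for the bias described in question 3 is the adjusted R-squared, which penalizes each added predictor. The slides do not name this statistic, so treat the sketch below as an illustration under that assumption, written in Python rather than the R used in the course. The input numbers are made up for the example and do not come from the MSRP data.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where n is the number of observations and p the number of predictors."""
    residual_df = n_obs - n_predictors - 1
    if residual_df <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r_squared) * (n_obs - 1) / residual_df


# Made-up illustration: two models with the same raw R-squared of 0.80,
# both fit to 50 observations, but with different numbers of predictors.
small = adjusted_r_squared(0.80, n_obs=50, n_predictors=1)
large = adjusted_r_squared(0.80, n_obs=50, n_predictors=6)

print(f"1 predictor:  adjusted R-squared = {small:.4f}")
print(f"6 predictors: adjusted R-squared = {large:.4f}")
```

With identical raw R-squared, the model with more predictors gets the lower adjusted value, so extra variables have to earn their place.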
Complex Plots
• Scatterplot Matrix
• Compare many numerical
variables at once
• Multiple bar plots
• Compare numerical variable
across multiple categorical
variables
• Scatterplot with factors
• Compare two numerical
variables across categories
Complex Plots
Assessment 3
1) The scatterplot to the right displays the number of pulses per hour against the outside temperature for crickets. What does the graph tell us about the two species?
Answer: It appears that exclamationis crickets tend to have more pulses per hour as the temperature increases
2) To the right is a three-way scatterplot matrix of three response variables. Does there appear to be a relationship/correlation between any of the three? If so, why?
Answer: Not really. There may be some sort of negative relationship between Resp1 and Resp2, but it is hard to tell.
3) Suppose you have a dataset with a response variable (Weight) and three categorical variables (Gender, Ethnicity, Occupation) and want to use a graph to visualize possible differences in Weight across those variables. What R-code would work best?
a) pairs(~Weight + Gender + Ethnicity + Occupation)
b) boxplot(Weight~Gender*Ethnicity*Occupation)
c) plot(Weight, Gender, col=Ethnicity*Occupation)
Answer: b) boxplot
Caveats and Concerns
• Variables should be relevant to research questions
• If you look at enough variables, you’re bound to find correlations by chance
(mining for significance)
• Scatterplot matrices can help identify correlated covariates
• Limitations to visualizing complex data
• Tables are appropriate alternatives
• Exploratory data visualization is not analysis
• Need to follow up visualization with appropriate statistical analyses
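The warning about mining for significance can be made concrete with a small simulation. The sketch below, in Python (an assumption; the slides use R), generates variables that are unrelated by construction and counts how many pairs still show a sizeable sample correlation. The function names, the sample sizes, and the seed are all hypothetical choices for illustration. The cutoff of 0.361 is approximately the two-sided 5% critical value of Pearson's r for 30 observations.

```python
import math
import random
from itertools import combinations


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def count_chance_correlations(n_vars=20, n_obs=30, cutoff=0.361, seed=1):
    """Simulate independent normal variables and count the pairs whose
    absolute correlation exceeds `cutoff`. Returns (pairs, flagged)."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(combinations(range(n_vars), 2))
    flagged = [
        (i, j) for i, j in pairs
        if abs(pearson_r(data[i], data[j])) > cutoff
    ]
    return len(pairs), len(flagged)


n_pairs, n_flagged = count_chance_correlations()
print(f"{n_flagged} of {n_pairs} pairs exceed the cutoff by chance alone")
```

Because every variable is generated independently, any pair that passes the cutoff is a false positive. Twenty variables already give 190 pairs to test, so a few such pairs are expected on average.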
Real World Examples
Siegel, R. L., Miller, K. D., and Jemal, A. (2020). "Cancer statistics, 2020." CA: A Cancer Journal for Clinicians 70(1).
Real World Examples
(2020). "The GTEx Consortium atlas of genetic regulatory effects across human tissues." Science 369(6509): 1318.
Summary and Conclusion
• Lots of methods for more advanced data exploration and visualization
• Helps you understand the data better
• Increasingly useful in the era of large datasets and complex analyses