Data Analytics
for Civil & Environmental Engineers
Lecture 8
Univariate Statistics
1
Univariate Statistics
Empirical distributions
Histograms
Mean, Median, Quartiles, Variance, Skewness,
Kurtosis
Boxplot
Statistical distributions
Discrete distributions
Continuous distributions
• Gaussian distribution and the central limit theorem
• Chi-squared, F, and Student’s t-distributions
2
Univariate Statistics
Statistical testing
Chi-squared test
F-test
Student’s t-test
Extreme value distributions
Generalized extreme value distribution
• Return period
Extreme threshold distributions
• Weibull distribution
3
Empirical distribution: Histogram
A histogram shows the number of data points in each data bin.
Syntax:
[n,xout]=hist(data)
%n: row vector with the number of data in each bin
%xout: bin locations
hist(data)
hist(data, # of bins)
hist(data, vector of data bins)
Updated functions:
hist -> histogram
[n, edges]=histcounts(data)
center=edges(1:end-1)+diff(edges)/2
4
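As a quick check of the histcounts syntax above, a minimal sketch (the data vector here is only illustrative):

data=randn(1000,1);                   %example data
[n, edges]=histcounts(data);          %bin counts and bin edges
center=edges(1:end-1)+diff(edges)/2;  %bin centers
bar(center, n)                        %bar plot equivalent to histogram(data)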
Empirical distribution: Histogram
x=randn(1000,1);
histogram(x)
hist(x, 22) %gives similar results
histogram(x, 50) %50 bins
y=-2:0.1:2;
hist(x,y) %not pretty
histogram(x,y) %much better
5
Empirical distributions
How do we describe a dataset?
Descriptive parameters
Min, max, mean
Median, quartiles
Standard deviation
Variance
Skewness
Kurtosis
6
Mean: Why different definitions?
Arithmetic mean
Geometric mean
Harmonic mean
7
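The formulas behind these three means (the slide's original equations are not reproduced here; these are the standard definitions):

\bar{x}_{\mathrm{arith}} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\bar{x}_{\mathrm{geom}} = \left(\prod_{i=1}^{n} x_i\right)^{1/n}, \qquad
\bar{x}_{\mathrm{harm}} = \frac{n}{\sum_{i=1}^{n} 1/x_i}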
Median: write a median function
function m=mymedian(x)
a=sort(x);              %sort the data in ascending order
b=length(x);            %number of data points
b2=floor(b/2);
if (b/2 > b2)           %odd number of points (same as mod(b,2)~=0)
m=a(b2+1);              %the middle value
else
m=0.5*(a(b2)+a(b2+1));  %average of the two middle values
end
end
8
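A quick usage check of the function above (test values chosen only for illustration):

mymedian(1:15)        %returns 8, the middle value of an odd-length vector
mymedian([1 2 3 4])   %returns 2.5, the average of the two middle values
median(1:15)          %the built-in function gives the same result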
Quantiles
Divide ordered data into (approximately) equal-sized subsets.
4-quantiles: quartiles
100-quantiles: percentiles
1st quartile: 25th percentile
2nd quartile: median: 50th percentile
9
Quantiles
x=1:15, what is the 3rd quartile?
1. Use the median to divide the data into 2 subsets (do not
include the median value).
2. The upper quartile is the median of the upper half.
The 3rd quartile is 12.
Matlab uses linear interpolation:
prctile(x,[25 50 75])
10
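Because Matlab interpolates between data points, its quartiles can differ slightly from the hand calculation above; a minimal sketch:

x=1:15;
prctile(x,[25 50 75])   %linear interpolation; the 3rd quartile comes out
                        %near 11.75, slightly below the value of 12 above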
Dispersion of the data: Central moments
11
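The slide's equations are not reproduced here; the standard central-moment definitions behind the variance, skewness, and kurtosis are:

\mu_k = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^k, \qquad
\mathrm{variance} = \mu_2, \qquad
\mathrm{skewness} = \frac{\mu_3}{\mu_2^{3/2}}, \qquad
\mathrm{kurtosis} = \frac{\mu_4}{\mu_2^{2}}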
Moment statistics
12
Moment statistics
Which statistic is most sensitive to the difference between the mean
and the median?
13
Moment statistics
14
Moment statistics
15
Dealing with NaN
x=[1:120, NaN];
mean(x), var(x)
nanmean(x), nanvar(x)
skewness(x)
kurtosis(x)
How do we remove the NaN values?
x(isnan(x))=[]
x=x(~isnan(x))
NaN==NaN always returns 0 (false); use isnan instead.
16
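Newer Matlab releases also accept an 'omitnan' flag and provide rmmissing (a sketch; nanmean and nanvar above come from the Statistics Toolbox):

mean(x,'omitnan')    %ignore NaN values when averaging
x2=rmmissing(x);     %drop NaN entries (R2016b or later)
var(x2), skewness(x2), kurtosis(x2)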
Organic matter data
org=load('organicmatter_one.txt');
%check out the data
plot(org,'o-'), ylabel('wt %')
%histogram
%the square root of the number of data points is often a good
%first guess for the number of bins
hist(org,8)
17
Statistics:
18
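The statistics summarized on this slide can be reproduced with a short script (a sketch; the values depend on the dataset):

org=load('organicmatter_one.txt');
min(org), max(org)
mean(org), median(org)
std(org), var(org)
skewness(org), kurtosis(org)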
Histogram
org=load('organicmatter_one.txt');
[n,xout]=hist(org,8);
%n: row vector with the number of data in each bin
%xout: bin locations
bar(xout, n, 'r') %red bar
%3d bar
bar3(xout, n, 'b')
19
Sensitivity to outliers
sodium = load('sodiumcontent.txt');
whos sodium
hist(sodium,11)
%add an outlier
sodium2=sodium;
sodium2(121,1) = 0.1;
%alternatively: sodium2=[sodium;0.1];
Which statistic is most sensitive?
20
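A sketch of the comparison (the added outlier shifts the mean, standard deviation, skewness, and kurtosis, while the median barely changes):

[mean(sodium)     mean(sodium2);
 median(sodium)   median(sodium2);
 std(sodium)      std(sodium2);
 skewness(sodium) skewness(sodium2);
 kurtosis(sodium) kurtosis(sodium2)]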
Sensitivity to outliers
21
Boxplot
boxplot(org)
The box shows the lower quartile, median, and upper quartile values.
The whiskers show the most extreme data within 1.5 times the interquartile
range (the 25th-75th percentile range) from the ends of the box.
Red + signs: outliers
load carsmall
boxplot(MPG,Origin)
%MPG is a vector of numbers; Origin is a vector of strings that defines the groups
22
Boxplot
23
Box plot: group assignment {}
sodium = load('sodiumcontent.txt');
sodium2=[sodium;0.1];
data=[sodium; sodium2];
name(1:length(sodium))={'original'};
ed= length(sodium);
name(ed+1:ed+length(sodium2))={'outlier'};
boxplot(data, name)
24
Statistical distribution
25
Discrete distribution: Poisson
26
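The slide's figure is not reproduced; a sketch of the Poisson probability mass function P(X=k) = lambda^k * exp(-lambda)/k! for a few rates (poisspdf is part of the Statistics Toolbox):

k=(0:20)';
lambda=[1 4 10];
p=zeros(length(k),length(lambda));
for i=1:length(lambda)
p(:,i)=poisspdf(k,lambda(i));   %P(X=k) for rate lambda(i)
end
stem(k,p)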
Continuous PDF: Boltzman
Maxwell–Boltzmann distribution
27
https://en.wikipedia.org/wiki/Maxwell%E2%80%93Boltzmann_distribution#/media/File:MaxwellBoltzmann-en.svg
Gaussian (normal) distributions
Syntax
Y=pdf(name, p1, ...)
Y=cdf(name, p1, ...)
name: distribution name
p1, p2, ...: parameters of the distribution
Gaussian
Y=pdf('norm', data vector, mean, std)
Y=cdf('norm', data vector, mean, std)
Or
Y=normpdf(data vector, mean, std)
Y=normcdf(data vector, mean, std)
28
Gaussian distribution
mu=[0, 2, -2, 0]; sig=[0.2,1,0.5,3];
x=linspace(-5,5,100);
for i=1:4
xpdf(:,i)=pdf('norm',x,mu(i),sig(i));
xcdf(:,i)=cdf('norm',x,mu(i),sig(i));
end
subplot(2,1,1), plot(x,xpdf)
subplot(2,1,2), plot(x,xcdf)
29
Gaussian distribution
30
Central limit theorem
The sum of a large number of independent and
identically distributed random variables, each
with finite mean and variance, is approximately
normally distributed.
-the 2nd fundamental theorem of probability
31
Central limit theorem
To simulate flipping a coin, we use a random number generator. If the
random number is less than 0.5, it is considered heads (represented by 1);
if the random number is equal to or greater than 0.5, it is considered
tails (represented by -1). The sum of 1000 random flips in each of 2000
game plays can be calculated by generating 1000 random numbers per trial
and summing the results (1 or -1).

for i=1:2000
x=rand(1000,1)<0.5;
heads=sum(x);
tails=1000-heads;
y(i)=heads-tails;
end
hist(y)
32
Central limit theorem
33
Flipping a rigged coin
What are the probabilities of losing more than
$50 and $100 if the winning odds are 50%?
ymean=mean(y)
ystd=std(y)
cdf('norm',-50,ymean,ystd)
cdf('norm',-100,ymean,ystd)
34
Central limit theorem
Let X1, X2, X3, ... be a set of n independent and identically
distributed (not necessarily normal) random variables having
finite mean μ and variance σ². As the sample size n
increases, the distribution of the sample average approaches
the normal distribution with mean μ and variance σ²/n,
irrespective of the shape of the original distribution.
The PDF of the sum of two or more independent variables is
the convolution of their densities (if these densities exist). The
convolution of a number of density functions tends to the
normal density as the number of density functions increases
without bound, under the conditions stated above.
35
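A small numerical sketch of this statement using uniform random numbers (parameters chosen only for illustration):

n=50;                    %sample size
m=mean(rand(n,5000));    %5000 sample averages of Uniform(0,1) values
histogram(m,40)          %roughly normal, mean 0.5, variance (1/12)/n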
Gaussian distribution
36
Estimate of the errors
37
Propagation of error (normal distribution)
38
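The propagation-of-error formula referenced on these slides (standard result for independent, normally distributed errors in x and y) is, for a function f(x, y):

\sigma_f^2 \approx \left(\frac{\partial f}{\partial x}\right)^2 \sigma_x^2
                 + \left(\frac{\partial f}{\partial y}\right)^2 \sigma_y^2

and the standard error of a mean of n values is \sigma_{\bar{x}} = \sigma/\sqrt{n}.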
Central limit theorem
The log of a product of positive random variables tends to have a normal
distribution, which makes the product itself have a log-normal
distribution.
39
Log-normal distribution
If Y is a random variable with a normal
distribution, then X = exp(Y) has a log-normal
distribution.
40
Log-normal distribution
41
Log-normal distribution
mu=[0, 0, 1, 1]; sig=[1/4,1/2,1,2];
x=linspace(0,3,100);
for i=1:4
xpdf(:,i)=pdf('logn',x,mu(i),sig(i));
xcdf(:,i)=cdf('logn',x,mu(i),sig(i));
end
subplot(2,1,1), plot(x,xpdf)
subplot(2,1,2), plot(x,xcdf)
42
Chi-squared distribution
One of the most widely used distributions in statistical significance
tests.
43
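The definition behind the distribution (a standard result; the slide's own formula is not reproduced): if Z_1, ..., Z_k are independent standard normal variables, then

Q = \sum_{i=1}^{k} Z_i^2 \sim \chi^2_k,

i.e., Q follows a chi-squared distribution with k degrees of freedom.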
Chi-squared distribution
https://en.wikipedia.org/wiki/Chi-squared_distribution
44
F-distribution
If U1 and U2 have chi-squared distributions with d1 and d2 degrees of
freedom, respectively, and U1 and U2 are independent, then
F = (U1/d1) / (U2/d2)
follows an F-distribution with (d1, d2) degrees of freedom.
https://en.wikipedia.org/wiki/F-distribution
The F-distribution arises frequently as the null
distribution in the analysis of variance.
45
Student’s t-distribution
The t-distribution is often used as an alternative to the normal
distribution as a model for data. It is frequently the case that real data
have heavier tails than the normal distribution allows for. The classical
approach was to identify outliers and exclude or downweight them in some
way.
46
https://en.wikipedia.org/wiki/Student%27s_t-distribution
Student’s t-distribution
Arises in the problem of estimating the mean of a normally distributed
population when the sample size is small (so the standard deviation is not
well known).
Suppose X1, ..., Xn are independent random variables that are normally
distributed with expected value μ and variance σ². Let X̄ be the sample
mean and S the sample standard deviation; then
T = (X̄ - μ) / (S/√n)
follows a t-distribution with DOF = n-1 degrees of freedom.
47
Using t-distribution
Scaled distribution with a mean of 0 and
a standard deviation of 1.
If T were a Gaussian distribution, its PDF and CDF would be those of the
standard normal distribution; the t-distribution's PDF and CDF have
heavier tails.
48
t-distribution
A random sampling of screws gives weights 30.02,
29.99, 30.11, 29.97, 30.01, 29.99.
Calculate a 95% confidence interval for the
population's mean weight.
49
t-distribution
The probability that x falls outside the interval bounded by the critical
values ±zα/2 is α.
50
https://www.studypug.com/statistics-help/confidence-intervals-with-t-distribution
t-distribution
This shows the confidence interval on μ at the 1 − α confidence level.
This is a commonly used statistic for estimating the population mean.
Often the level used is 95% (roughly ±2σ for a normal distribution).
51
t-distribution
Find the data value corresponding to a given probability:
α/2 = cdf('t', zα/2, DOF)   (forward)
zα/2 = tinv(α/2, DOF)       (inverse)
For a Gaussian distribution:
zα/2 = norminv(α/2, mean, std/sqrt(n))
52
t-distribution
x=[30.02, 29.99, 30.11, 29.97, 30.01, 29.99];
xmean=mean(x)
xstd=std(x)
n=length(x)
%t-value at 2.5% (5%/2), DOF=n-1
%t-distribution cdf is symmetrical
tvalue=abs(tinv(0.025,n-1))
%low/high bound
low=xmean-tvalue*xstd/sqrt(n)
high=xmean+tvalue*xstd/sqrt(n)
53
Comparison to normal distribution
%if the sample size is large
xmean=mean(x);
xsig=xstd/sqrt(n);
tvalue=abs(norminv(0.025,0,1))
xmean+tvalue*xsig
xmean-tvalue*xsig
%.........or.........
norminv(0.975,xmean,xsig)
norminv(0.025,xmean,xsig)

t-distribution (tvalue=2.57): [29.963, 30.067]
Normal distribution (tvalue=1.96): [29.975, 30.055]
54
Student’s t and Gaussian distributions
As the sample size increases, the t-distribution value for the 95%
confidence interval approaches that of the Gaussian.
55
Statistical testing
The null hypothesis is used to test differences in
treatment and control groups, and the assumption
at the outset of the experiment is that no difference
exists between the two groups for the variable being
compared.
"A statistically significant difference" simply means
there is statistical evidence that there is a difference;
it does not mean the difference is necessarily large,
important or significant in the common meaning of
the word.
Confidence level: 1 − α (e.g., 95% for α = 0.05)
56
Statistical testing
The null hypothesis must be stated in mathematical/statistical terms
that make it possible to calculate the probability of possible samples
assuming the hypothesis is correct.
A test must be chosen that will summarize the information in the
sample that is relevant to the hypothesis. In the example given
above, it might be the numerical difference between the two sample
means, m1 − m2.
The distribution of the test statistic is used to calculate the
probability of sets of possible values (usually an interval or union of
intervals).
Among all the sets of possible values, we must choose one that we
think represents the most extreme evidence against the hypothesis.
That is called the critical region of the test statistic. The probability
of the test statistic falling in the critical region when the null
hypothesis is correct is called the p value ("surprise" value) of the test.
57
Probability
Frequency probability (frequentists) is the interpretation of
probability that defines an event's probability as the limit of its
relative frequency in a large number of trials. The problems
and paradoxes of the classical interpretation motivated the
development of the relative frequency concept of probability.
Bayesian probability is an interpretation of the probability
calculus which holds that the concept of probability can be
defined as the degree to which a person (or community)
believes that a proposition is true. The posterior probability is a
function of the prior probability and the observations.
The two camps agree that Bayesian and frequentist analyses answer
genuinely different questions, but disagree about which class of question
is more important to answer in scientific and engineering contexts.
58
Pearson’s Chi-squared test
The null hypothesis: the relative frequencies of
occurrence of observed events follow a specified
frequency distribution.
Pearson's chi-squared is the original and most
widely used chi-squared test.
The test compares the difference between each
observed and theoretical frequency for each
possible outcome.
59
Pearson’s Chi-squared test
χ² = Σ (observed − expected)² / expected, summed over all possible outcomes
DOF = n-1, where n is the number of possible outcomes (or bins)
60
Pearson’s Chi-squared test
A chi-squared probability of 0.05 or less (alpha
value, α =0.05) is commonly used for rejecting the
null hypothesis.
Critics of α values point out that the criterion used to
decide "statistical significance" is based on the
somewhat arbitrary choice of level (often set at
0.05).
61
Pearson’s Chi-squared test
A random sample of 100 people has been drawn
from a population in which men and women are
equal in frequency. There were 45 men and 55
women in the sample; what is the chi-squared value?
Probability in the chi-squared distribution: cdf('chi2', value, DOF)
Is this Chi-squared value within 95% of the distribution?
cdf('chi2',1,1)
62
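A sketch of the calculation (the expected counts are 50/50 under the null hypothesis):

obs=[45 55];                        %observed counts
expct=[50 50];                      %expected counts under the null hypothesis
chi2=sum((obs-expct).^2./expct)     %gives 1
cdf('chi2',chi2,1)                  %about 0.68, well within the central 95%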
Pearson’s Chi-squared test
What should we use?
x=linspace(0,8,100);
v=cdf('chi2',x,ones(1,100));
plot(x,v)
%cdf('chi2',1,1) = 0.68
%the plot shows the cumulative probability v against the chi-squared value x
63
chi2inv
%chi2inv(probability, DOF)
%returns the chi-squared value that exceeds the given fraction of the
%distribution, e.g., 95% of the values for DOF = 1
chi2inv(0.95,1)
If the chi-squared value is less than chi2inv(0.95,1), the null hypothesis
cannot be rejected.
64
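For reference, a sketch of the comparison for the 45/55 example above:

chi2inv(0.95,1)   %about 3.84; the observed chi-squared value of 1 is smaller,
                  %so the null hypothesis cannot be rejected at the 5% level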
Pearson’s Chi-squared test
65
Pearson’s Chi-squared test
corg = load('organicmatter_one.txt');
% 60 data points, define 8 bins
[n_exp,v] = hist(corg,8);
% generate a synthetic dataset
n_syn = pdf('norm',v,mean(corg),std(corg));
%rescale so the synthetic counts sum to the observed total
n_syn = n_syn.* sum(n_exp)/sum(n_syn);
subplot(1,2,1), bar(v,n_syn,'r')
subplot(1,2,2), bar(v,n_exp,'b')
66
Pearson’s Chi-squared test
%test
chi2 = sum((n_exp - n_syn).^2./n_syn)
%dof (# of bins - # of parameters -1)
%For Gaussian, # of parameters is 2, mean & std
dof=8-2-1;
%0.05 value
chi2inv(0.95,dof)
67
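The decision step is not shown on the slide; a sketch:

if chi2 < chi2inv(0.95,dof)
disp('Cannot reject the null hypothesis: the data are consistent with a Gaussian')
else
disp('Reject the null hypothesis at the 5% level')
end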
F-test
The null hypothesis: the standard deviations of
two normally distributed populations are
equal.
F_crit=finv(0.95, DOF1,DOF2)
68
F-test
load('organicmatter_four.mat');
%compare std
s1 = std(corg1)
s2 = std(corg2)
%DOF
df1 = length(corg1) -1;
df2 = length(corg2) -1;
69
F-test
if s1>s2
Freal=(s1/s2)^2
else
Freal=(s2/s1)^2
end
%find the table value for 5% extreme, inv-f cdf
Ftable = finv(0.95,df1,df2)
70
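The comparison step, sketched:

if Freal > Ftable
disp('Reject the null hypothesis: the variances differ at the 5% level')
else
disp('Cannot reject the null hypothesis of equal variances')
end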
Student’s t-Test
The null hypothesis: the means of two
distributions are equal.
Assumptions
Normal distribution of data (What test to use?)
Equality of variances (what test to use?)
71
Student’s t-Test
Matlab syntax
[h,s,ci] = ttest2(x,y,alpha)
h: 1 rejects the null hypothesis;
0 cannot reject
s: p-value for the difference of the
means between x and y
ci: confidence interval on the difference of the means
alpha: significance level, here 0.05
72
Student’s t-Test
load('organicmatter_two.mat');
[n1,x1] = hist(corg1);
[n2,x2] = hist(corg2);
h1 = bar(x1,n1);
hold on
h2 = bar(x2,n2);
set(h1,'FaceColor','none','EdgeColor','r')
set(h2,'FaceColor','none','EdgeColor','b')
hold off
73
Student’s t-Test
%difference of mean
mean(corg1)-mean(corg2)
%h=1 rejects the null hypothesis
%significance: p-value associated with the t-statistic
%ci: 95% confidence interval on the difference of the means
%(it contains zero if the difference is not statistically significant)
[h, significance, ci] = ttest2(corg1,corg2,0.05)
74
There are more!
(e.g., extreme value distributions)
75
Tutorial
We compare the observed and simulated ozone mixing
ratios (ppbv, parts per billion by volume) in July and August.
Negative values represent no measurements.
Data: Atlanta_O3.txt
1. Apply a statistical test to show whether the two datasets follow
normal distributions.
2. Assuming that both are normally distributed, apply statistical tests to
show whether the two datasets have equivalent variances and whether they
have equivalent means.
76
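One possible outline of the tutorial, sketched below. It assumes Atlanta_O3.txt holds two columns (observed and simulated ozone), which should be checked against the actual file:

d=load('Atlanta_O3.txt');
obs=d(:,1); sim=d(:,2);    %assumed column order
obs=obs(obs>=0);           %negative values mean no measurement
sim=sim(sim>=0);
%1. chi-squared normality test for each dataset: compare binned counts with
%   a Gaussian of the same mean and std, as in the organic matter example
%2. F-test for equal variances: compare Freal with finv(0.95,n1-1,n2-1)
%3. Student's t-test for equal means: [h,p,ci]=ttest2(obs,sim,0.05)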