Statistical Computing II-slide
Statistical Computing II-slide
II
(Stat 2022)
R AND SAS
1. INTRODUCTION TO R
2
Starting R
The beauty of R is that it’s shareware, so it is
free to anyone (no license fee).
To obtain R for windows (or Mac), go to the
Comprehensive R Archive Network (CRAN) at
http://www.r-project.org and you can
immediately install it.
Once you have installed R, there will be an icon
on your desktop. Double click it and R will start
up.
Cont’d
4
versa.
Some names are used by the system, e.g. c, q, t, C, D, F, I, T,
e.g.
>x<-5
Using “=“
e.g.
>x=5
Using the function “assign()”
e.g.
>assign(“x”,5)
Using “->”
e.g.
> 5->X
Variables that contains many values (vectors), e.g. with the concatenate
function:
> y<-c(3,7,9,11)
>y
[1] 3 7 9 11
Calculator
13
>2+3 #addition
[1] 5
>3^2 #square
[1] 9
>2*3 #multiplication
[1] 6
> 4/2 #division
[1] 2
>sqrt(9) #square root
[1] 3
>exp(2) #e squared(2.718282^2)
[1] 7.389056
>pi #R knows about pi
[1] 3.141593
1.4 Objects and simple manipulation
14
To print the sin, cos, tan, asin, acos, atan, log, exp, … of the
values of x:
>sin(x)
>cos(x)
:
>exp(x)
The parallel maximum and minimum functions pmax and pmin
return a vector (of equal to their largest arguiment) that
contains in each element the largest (smallest) element in that
position in any of the input vectors.
>pmax(x,6) # returns the values of x but values that are
#less than 6 will be replaced by 6.
> pmin(x,6) # returns the values of x but values that are
#greater than 6 will be replaced by 6.
Cont’d
22
Character vectors
To set up a character/string vector z
consisting of 4 place names use:
> z <- c(“Canberra”, “Sydney”, ”Newcastle”)
# or
> z <- c(‘Canberra’, ‘Sydney’, ‘Newcastle’)
Character strings are entered using either
matching double(“) or single(‘) quotes, but
are printed using double quotes (or
sometimes without quotes).
Cont’d
23
Logical Vectors
A logical vector is a vector whose elements are TRUE,
FALSE or NA.
Note: TRUE and FALSE are often abbreviated as T and F
respectively, however T and F are just variables which
are set to TRUE and FALSE by default, but are not
reserved words and hence can be overwritten by the
user.
Cont’d
25
o Indexing Vectors
Vectors indices are placed with square
brackets: []
Vectors can be indexed in any of the following
ways:
Vector of positive integers
Vector of negative integers
Vector of named items
Logical vector
Cont’d
28
e.g.
>const=c(3.1416,2.7183,1.4142,1.6180)
>names(const)=c(“pi”,”euler”,”sqrt2”, “golden”)
>const[c(1,3,4)] # printing the 1st , 2nd and the 4th elements of const.
pi sqrt2 golden
3.1416 1.4142 1.618
>const[c(-1,-2)] # printing all elements of const except the 1 st and the
2nd
sqrt2 golden
1.4142 1.618
>const>2
pi euler sqrt2 golden
TRUE TRUE FALSE FALSE
>const[const>2]
pi euler
3.1416 2.7183
Cont’d
29
o Modifying Vectors
To alter the contents of a vector, similar methods can be
used.
e.g. Create a variable x with 5 elements:
10 5 3 6 21
>x=c(10,5,3,6,21)
Now, to modify the first element of x and assign it a value
7 use
>x[1]<-7
>x
[1] 7 5 3 6 21
The following command replaces any NA (missing) values in
the vector w with the value 0:
>w[is.na(w)]<-0
Generating sequences
30
Matrices
A matrix can be regarded as a generalization of a vector.
As with vectors, all the elements of a matrix must be of
the same data type.
A matrix can be generated in two ways.
Method 1: Using the function dim:
Example
>x <- c(1:8)
>dim(x) <- c(2,4)
>x
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 2 4 6 8
Cont’d
35
Method 2: >X<-matrix(vector, nrow=r, ncol=c,
byrow= FALSE,
dimnames=list(rownames,
colnames))
eg. >x <- matrix(c(1:8),2,4,byrow=F,
rix(c(1:8),2,4,byrow=F, dimnames=list(rrrr=c("A","B"),cccc=c("D","E","F
dimnames=list(rrrr=c(“A”,”B”),
cccc=c(“D”,”E”,”F”,”G”)));x
cccc
rrrr D E F G
A 1 3 5 7
B 2 4 6 8
By default the matrix is filled by column. To fill the matrix
by row specify byrow = T as argument in the matrix
function.
Cont’d
36
Example:
>cbind(c(1,2,3),c(4,5,6))
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
>rbind(c(1,2,3),c(4,5,6))
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Cont’d
37
columns)
as.matrix() used to coerce an argument into a matrix
object
%*% matrix multiplication
t() matrix transpose
det() determinant of a square matrix
solve() matrix inverse; also solves a system of linear
equations
eigen() computes eigen-values and eigenvectors
Cont’d
38
Example:
>name= c(”Eden”,”Solomon”,”Zelalem”,”Kidist”)
>age=c(18,22,25,27)
>sex=c(”F”,”M”,”M”,”F”)
>stud=data.frame(name,age,sex)
>stud
name age sex
1 Eden 18 F
2 Solomon 22 M
3 Zelalem 25 M
4 kidist 27 F
To display the column names:
>names(stud) #OR colnames(stud)
To display the row names:
>rownames(stud)
Cont’d
48
You can use the factor function to create your own value
labels.
# variable v1 is coded 1, 2 or 3
# we want to attach value labels 1=red, 2=blue,3=green
>mydata$v1 <- factor(mydata$v1, levels = c(1,2,3), labels =
c("red", "blue", "green"))
# variable y is coded 1, 3 or 5
# we want to attach value labels 1=Low, 3=Medium,
5=High
>mydata$y <- ordered(mydata$y, levels = c(1,3, 5),
labels = c("Low", "Medium", "High"))
Note: factor and ordered are used the same way, with the
same arguments. The former creates factors and the later
creates ordered factors.
Cont’d
55
MERGING DATASETS
To merge two dataframes (datasets) horizontally, use the
merge function. In most cases, you join two dataframes by one
or more common key variables (i.e., an inner join).
# merge two dataframes by ID
>total <- merge(dataframeA,dataframeB,by="ID")
# merge two dataframes by ID and Country
>total <- merge(dataframeA,dataframeB,by=c("ID","Country"))
ADDING ROWS
To join two dataframes (datasets) vertically, use the rbind
function.
The two dataframes must have the same variables, but they do
not have to be in the same order.
>total <- rbind(dataframeA, dataframeB)
Cont’d
58
2. Writing Functions
Writing Functions
60
To define a function in R, use the folowing syntax:
> name <- function(arg1, arg2, arg3,…,argk) expression1
If a function requires morethan one command, braces can
be used to group them.
Example: to write a function to calculate geometric mean
of a set of numbers:
> geomean<-function(x)10^mean(log10(x))
The function name is geomean, expecting a single
argument x
> x=c(10,10,10); > geomean(x)
[1] 10
Writing Functions
61
Conditional execution: if(), if()else and ifelse()
Syntax:
if(condition) {commands1}
if(condition) {commands1} else{commands2}
ifelse (conditions vector, yes vector, no
vector)
The command if() evaluates 'commands1' if the logical
> x <- 4
> if ( x == 5 ){ x <- x+1 } else{x<- x*2}
> x
[1] 8
> if ( x != 5 & x>3 ){ x <- x+1 ; 17+2 }
commands and
var is a variable which may be used in commands.
Loops: for(), while() and repeat()…
67
Note:
The command for() is the R version of 'for each
element in the set do ...'.
The command while() is the R version of 'as long
as the condition is
TRUE do ...'.
The command repeat() is the R version of 'repeat
>rnorm(100)
# simulate(generate) 100 standard normal RVs
> 2*pt(-2.43, df = 13)
# 2-tailed p-value for t distribution
> qf(0.01, 2, 7, lower.tail = FALSE)
# upper 1% point for an F(2, 7) distribution
Sampling Distributions BDP
82
Birth Day Paradox (BDP), conducted on 23 persons to have 50-50
chance that two or more of them have the same birth day from 365
days.
>qqline(x)
Formally, R provides the Shapiro-Wilk test and the Kolmogorov-
data: x
t = 0.7717, df = 9, p-value = 0.46
alternative hypothesis: true mean is not
equal to 10
99 percent confidence interval:
9.807338 10.312662
sample estimates:
mean of x
10.06
One- and two-sample t-tests…..
91
Example: Consider the following sets of data on the latent heat of the
fusion of ice (cal/gm) from Rice.
Method A: 79.98 80.04 80.02 80.04 80.03 80.00 80.02
80.03 80.04 79.97 80.05 80.03 80.02
Method B: 80.02 79.94 79.98 79.97 79.97 80.03 79.95 79.97
Box plots
provide a simple graphical comparison of the two samples.
>A=c(79.98,80.04,80.02,80.04,80.03,80.03,80.04,
79.97,80.05,80.03,80.02,80.00,80.02)
>B=c(80.02,79.94,79.98,79.97,79.97,80.03,79.95,
79.97)
>boxplot(A,B)
One- and two-sample t-tests…..
92
which indicates that the first group tends to give higher results than the
second.
79.94 79.96 79.98 80.00 80.02 80.04
1 2
One- and two-sample t-tests…..
93
To test for the equality of the means of the two samples, we can use an unpaired
t-test by:
> t.test(A, B)
This will give you the following output:
Welch Two Sample t-test
data: A and B
t = 3.2499, df = 12.027, p-value = 0.006939
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.01385526 0.07018320
sample estimates:
mean of x mean of y
80.02077 79.97875
Which indicate a significant difference, assuming normality. By default the R
function does not assume equality of variances in the two samples.
One- and two-sample t-tests…..
94
We can use the F test to test for equality in the variances, provided
that the two samples are from normal populations.
> var.test(A, B)
F test to compare two variances
data: A and B
F = 0.5837, num df = 12, denom df = 7, p-value = 0.3938
alternative hypothesis: true ratio of variances is not equal
to 1
95 percent confidence interval:
0.1251097 2.1052687
sample estimates:
ratio of variances
0.5837405
which shows no evidence of a significant difference, and so we can
use the classical t-test that assumes equality of the variances.
One- and two-sample t-tests…..
95
>t.test(A, B, var.equal=TRUE)
Two Sample t-test
data: A and B
t = 3.4722, df = 19, p-value = 0.002551
alternative hypothesis: true difference in means is
not equal to 0
95 percent confidence interval:
0.01669058 0.06734788
sample estimates:
mean of x mean of y
80.02077 79.97875
One- and two-sample t-tests…..
96
Excercise: the recovery time (in days) is measured for 10
patients taking a new drug and for 10 different patients
taking a placebo. We wish to test the hypothesis that the
mean recovery time for patients taking the drug is less than
fort those taking a placebo (under an assumption of
normality and equal population variances). The data are:
With drug: 15, 10, 13, 7, 9, 8, 21, 9, 14, 8
Placebo: 15, 14, 12, 8, 14, 7, 16, 10, 15, 12
One- and two-sample t-tests…..
Answer 97
> drug <- c(15, 10, 13, 7, 9, 8, 21, 9, 14, 8)
> plac <- c(15, 14, 12, 8, 14, 7, 16, 10, 15, 12)
> t.test(drug, plac, alternative ="less", var.equal = T)
Two Sample t-test
Car 1 2 3 4 5 6
mpg w/ additive: 24.6 18.9 27.3 25.2 22.0 30.9
mpg w/o additive: 23.8 17.7 26.6 25.1 21.6 29.6
One- and two-sample t-tests…..
99
Since this is a paired design, we can test the claim using the paired t–test (under
an assumption of normality for mpg measurements). This is performed by:
> add <-c(24.6, 18.9, 27.3, 25.2, 22.0, 30.9)
>noadd <-c(23.8, 17.7, 26.6, 25.1, 21.6, 29.6)
>t.test(add, noadd, paired=T, alt = "greater")
Paired t-test
4. Graphical Procedures
Introduction
10
2
Graphs are important tools for exploring data in
statistics and other fields/streams.
There are a number of graphs in Statistics.
Pie chart, histogram, scatter, box, steam and leaf, QQ plot…
> points(x,y)
> lines(x,y)
Different plotting commands
> pie() # Used to plot a Pie Chart
> boxplot() # Plots Box-and-whiskers plot
> hist() # Used to plot histograms
> barplot() # Used to plot a bar graph (Chart)
> legend() # Adds a legend to a figure
> stem() # Plots steam and leaf plot
Graphics Parameters
10
text() – writes inside the plot 5
region, could be used to label data
points.
mtext() – writes on the margins, can be used to add multiline
legends
text() and mtext() functions can print mathematical expressions created
with expression().
mfrow or mfcol options Take 2 dimensional vector as an argument
The first value specifies the number of rows
The second specifies the number of columns
plot types: type="l"
line types: lty= (takes values from 1 to 8)
plotting characters: pch=(takes values square (0); circle (or octagon)
(1); triangle (2); cross (3); X (4); diamond (5) and inverted triangle (6))
plotting colors: col=1,2,3,…
Graphics Parameters
>x <- c(1.2,3.4,1.3,-
10
2.1,5.6,2.3,3.2,2.4,2.1,1.8,1.7,2.2)
6
> y <- c(2.4,5.7,2.0,-
3,13,5,6.2,4.8,4.2,3.5,3.7,5.2)
>plot(x,y)
Graphics Parameters
>x <- c(-2,-0.3,1.4,2.4,4.5)
>y <- c(5,-0.5,8,2,11) 10
7
>plot(x,y,type="l",col="blue",xlab="Advertise
Change", ylab="Revenue Change", main="Financial
Analysis")
Demonstration
Bar Graph: Summary data
10
Consider five clinics treating patients
8 who are smokers, ex-smokers and
non-smokers.
Perform the following operation to creat the data.
> clinics <- matrix(c(30, 55, 60, 20, 45, 70, 50, 10, 20,
55, 70, 120, 27, 34,22, 23, 14, 33), nrow = 6)
> dimnames(clinics) <- list(paste(c("clinic"), 1:6),
c("smoker", "ex-smoker", "non-smoker"))
> clinics
smoker exsmoker non-smoker
clinic 1 30 50 27
clinic 2 55 10 34
clinic 3 60 20 22
clinic 4 20 55 23
clinic 5 45 70 14
clinic 6 70 120 33
Bar Chart:
The following code plots a bar chart for the first clinic (clinic 1) with
names on the x-axis (smoker, exsmoker
10 and nonsmoker)
9
> barplot(clinics[1,],names=dimnames(clinics)[[2]],
main="clinic1”)
Bar Chart:
The following code plots a bar chart for the all clinics who
are only smokers. 11
0
> barplot(clinics[,1],names=dimnames(clinics)[[1]],
main="smoker group")
Bar Chart:
> barplot(t(clinics), names=dimnames(clinics)[[1]],
main='matrix is transposed',
11 sub='each row is one
bar', col=1:10) 1
Bar Chart:
>barplot(clinics,names=dimnames(clinics)
[[2]],sub='each column 11of the matrix is one
bar', col=1:6) 2
Bar Chart:
>barplot(clinics,names=dimnames(clinics)
[[2]],sub='each column of
11 the matrix is one bar',
col=1:6) 3
10
5
0
-2 -1 0 1 2
x
Histogram…
#With shading
> hist(x, density=20) 11
8
Histogram…
#With specific number of bins
> hist(x, density=20, breaks=20)
11
9
Histogram…
# Proportion, instead of frequency, also specifying y-axis
12
0
> hist(x, density=20, breaks=-
3:3,ylim=c(0,.5), prob=TRUE)
Histogram …
Overlay normal curve with x-lab and ylim, colored normal curve
> m<-mean(x) 12
1
> std<-sqrt(var(x))
> hist(x, density=20, breaks=20, prob=TRUE,xlab="x-variable",
ylim=c(0, 0.7), main="normal curve over histogram")
> curve(dnorm(x, mean=m,sd=std),col="darkblue",lwd=2,add=TRUE)
Histogram…
Overlay density curve with x-lab and ylim
> hist(x, density=20, breaks=20,12 prob=TRUE,xlab="x-variable",
2
ylim=c(0, 0.8),main="Density curve over histogram")
Density curve over histogram
> lines(density(x), col = "blue")
0.8
0.6
Density
0.4
0.2
0.0
-2 -1 0 1 2
x-variable
Histogram…
hist(x) is an object
12
names(xh) will show all of its components
3
> xh<-hist(x)
> plot(xh, ylim=c(0, 40), col="lightgray",xlab="",
main="Histogram of x")
> text(xh$mids, xh$counts+2, label=c(xh$counts))
Box Plot
Syntax
boxplot(Cont. Variable
12 )
4
boxplot(Cont. Variable , xlab="write",
boxwex=.4, col="darkblue")
boxplot(Cont. Variable ~categorical_variable)
Example:
> attach(Orange)
> boxplot(age)
Box Plot
> boxplot(age, ylab="age of a tree",col=4)
> f <- fivenum(age) 12
5
> text(rep(1.35,5),f,labels=c("min","Q1","median","Q3",
"max"),cex=0.8)
Rep(1.35,5) is
location for
labels
Box Plot…
> Tlab<-as.vector(c("Tree1","Tree2", "Tree3","Tree4","Tree5"))
> Treef<-factor(Tree, label=Tlab)
12
> boxplot(age ~ Treef, 6
xlab=“Age by Tree ")
Box Plot…
If a horizontal boxplot is needed, run following R code
> boxplot(age ~ Treef,horizontal=TRUE
12 xlab=“Age by Tree ")
7
Stem and Leaf plots
For a univariate set of data, it is possible to examine its
distribution in a number of12
8 ways of which steam and leaf
plot is one.
Example:
> x<-c(42, 23, 43, 34, 49, 56, 31, 47, 61, 54, 46, 34, 26)
> Stem(x)
The decimal point is 1 digit(s) to the right of the |
2 | 36
3 | 144
4 | 23679
5 | 46
6 | 1
Notice that there are 5 categories for these 13 numbers, with stems for the 10s
digit and leaves for the 1s digit.
Stem and Leaf plots
The stem and leaf plot can be scaled to have more stems
by changing the scale option:
12
9
> stem(x, scale = 2)
The decimal point is 1 digit(s) to the right of the |
2 | 3
2 | 6
3 | 144
3 |
4 | 23
4 | 679
5 | 4
5 | 6
6 | 1
Quantile plots
qqnorm(x)
13
plots the numeric vector x against the expected
0
qqplot(x, y)
plots the quantiles of x against those of y to
compare their respective distributions.
Quantile plots…
> set.seed(123456)
> y <- rt(200, df = 5) 13
1
> x <-rt(300, df = 5)
> par(mfrow=c(1,2))
> qqnorm(y)
> qqline(y, col = 2)
> qqplot(y, x)
Quantile plots…
13
2
> mydata = c(2.4, 3.7, 2.1, 3, 1.6, 2.5, 2.9)
> myquant=qqnorm(mydata)
> qqline(mydata)
myquant; contains
the theoretical
quantiles and the
original data
If the observations in mydata come from a normal distribution, then the above
plot of mydata versus their population quantiles should give a straight line.
13
3
5. Statistical Models In R
Regression
13
4
Regression Analysis is a statistical tool for the investigation of
relationship between variables.
The investigator wants to ascertain causal effect of one variable
on an other
The investigator also typically assesses the “Statistical
significance” of the estimated relationship (the degree of
confidence that the true relationship is close to the estimated
relationship).
Regression may be simple or multiple, linear or non-linear.
The template for a statistical model is a linear regression model
with independent, homoscedastic errors.
y x
Regression…
13
General form: 5
Response ~expression
y ~ x
y ~ 1 + x
Both imply the same simple linear regression model of y on x. The first
has an implicit intercept term, and the second an explicit one.
y ~ 0 + x
y ~ -1 + x
y ~ x - 1
Simple linear regression of y on x through the origin (that is, without an
intercept term)
The operator ~ is used to define a model formula in R.
Regression…
13
6
Some useful extractors an out put from R code.
coef(object)
Extract the regression coefficient (matrix).
Long form: coefficients(object).
deviance(object)
Residual sum of squares, weighted if appropriate.
formula(object)
Extract the model formula.
plot(object)
Produce four plots, showing residuals, fitted values and some diagnostics.
predict(object, newdata=data.frame)
The data frame supplied must have variables specified with the same labels as
vcov(object)
Returns the variance-covariance matrix of the main parameters of a fitted
model object.
Linear Model…
13
8
confint(model_Name, level=0.95)
Confidence interval formation for the model parameters
fitted(model_Name)
Used to display predicted values depending on the fitted model
influence(model_Name)
used for regression diagnostics
anova(model_Name)
It displays te analysis of variance table
cor(Data.frame)
A quick way to look for relationships between variables in a data frame
pairs(Data.frame)
To visualize these relationships
Linear Model…
13
weights(model) 9
simple/multiple models.
fitted.model=lm(formula, data=, subset=)
Example: Consider the Effect of Vitamin C on Tooth
Growth in Guinea Pigs data, called ToothGrowth in R
dataset package.
First explore the data structure as follows:
> str(ToothGrowth)
'data.frame': 60 obs. of 3 variables:
$ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
$ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
Linear Model Using R
14
> summary(ToothGrowth) 1
> class(model1)
[1] "lm"
> model1$coef # or
> model1$coefficients #or
> coef(model1)
> resid(model1)
-4.95285714 2.34714286 … -2.22714286
-4.17285714
Linear Model Using R
14
7
As mentioned above, the summary function is a generic
function—what it does and what it returns is dependent on the
class of its first argument. Here is a list of what's available from
the summary function for this model.
> names(summary(model1))
[1] "call" "terms" "residuals" "coefficients"
[5] "aliased" "sigma" "df" "r.squared"
[9] "adj.r.squared" "fstatistic" "cov.unscaled"
The following code displays the indexed object
> summary(model1)[[8]]
[1] 0.7295544
The 8th object isfrom the above list
(names(summary(model1))) is the r.squared.
Linear Model Using R
Another function worth mentioning is anova.
This function will calculate an analysis of variance table, which can be used
to evaluate the significance of the terms in single models or to compare two
nested models.
> anova(model1)
Analysis of Variance Table
Response: len
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.35 205.35 12.3170 0.0008936 ***
dose 1 2224.30 2224.30 133.4151 < 2.2e-16 ***
supp:dose 1 88.92 88.92 5.3335 0.0246314 *
Residuals 56 933.63 16.67
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
148
Linear Model Using R
Once we have fit a model in R, we can generate predicted
values using the predict function.
> predict(model1)
1 2 … 59 60
9.152857 9.152857 … 27.172857 27.172857
The predict function can also give confidence and prediction
intervals.
> predict(model1,int="conf")
fit lwr upr
1 9.152857 6.966789 11.33893
2 9.152857 6.966789 11.33893
⁞ ⁞ ⁞ ⁞
59 27.172857 24.680356 29.66536
60 27.172857 24.680356 29.66536 149
Linear Model Using R
A quick way to look for relationships between variables in a
data frame is with the cor function.
> cor(dataset)
However this works when all the variables in the data set are
numeric.
Example: Formaldehyde data set in R (two numeric vars)
> cor(Formaldehyde)
carb optden
carb 1.0000000 0.9995232
optden 0.9995232 1.0000000
If not it will display the following message on R console
Error: could not find function "as.numeric"
It is possible to supply two variables (numeric) to get pair of correlation.
150
Linear Model Using R
To visualize these relationships, we can use pairs.
> pairs(data set name)
Example:
> pairs(Formaldehyde)
151
Updating Models
15
2
We could also use the update function for model 2—
this is especially handy for dealing with large model
formulas.
For example let us drop the supp term in the fitting of
length on dose and supp on ToothGrowth data set in R
(model<-lm(len~supp+dose, data=ToothGrowth))
It can be done as follows from model by update
function
> model2<-update(model, ~. –supp,data=ToothGrowth)
The same as the following model
> model2<-lm(len~dose,data=ToothGrowth)
ANOVA Models
15
3
To demonstrate ANOVA in R, let‘s start with a simple data
insecticides.
> str(InsectSprays)
'data.frame': 72 obs. of 2 variables:
$ count: num 10 7 20 14 14 12 10 23 17 20 ...
$ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1
1 ...
ANOVA Models
15
4
There are two options for specifying an ANOVA: lm and
aov.
Really, aov is just a wrapper for calling up the lm
function.
The main difference between aov and lm is in the format
of the output, although a traditional ANOVA table can be
produced by applying the anova function to an lm model.
Since the measured variable is a count (number of insects),
it is not normally distributed.
To make these data approximate a normal distribution, we
can use a square root transformation (Zar 1999)
ANOVA Models
15
> 5
InsectSprays$sr.count<-sqrt(InsectSprays$count + 3/8)
> mod.1<-aov(sr.count ~ spray, data = InsectSprays)
> mod.1
Call:
aov(formula = sr.count ~ spray, data =
InsectSprays)
Terms:
spray Residuals
Sum of Squares 80.52844 22.80262
Deg. of Freedom 5 66
Residual standard error: 0.5877876
Estimated effects may be unbalanced
ANOVA Models
15
6
To get more detailed output, we need to use the summary
function.
> summary(mod.1)
Df Sum Sq Mean Sq F value Pr(>F)
spray 5 80.528 16.106 46.616 < 2.2e-16 ***
Residuals 66 22.803 0.345
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
VC
mean of ToothGrowth$len
OJ
20
15
10
0.5 1 2
ToothGrowth$dose
Generalized Linear Model
16
2
Generalized linear models (GLMs) are a very flexible class
of statistical models.
In R, GLM models can be specified using the glm function.
There are eight different error distributions available in
glm, including binomial and poisson, each with a default
link function.
Arguments and default argument values can be found in the
help file for glm:
Where
et and et-1 residuals at time t and t-1
α1 is the first order moving average coefficient
Time Series Models
16
8
The autoregressive model includes lagged terms on the
time series itself,
The moving average model includes lagged terms on the
noise or residuals.
Including both types of lagged terms, autoregressive-
moving-average, or ARMA, models can be found.
The order of the ARMA model is included in parentheses
as ARMA(p,q),
Where p is the autoregressive order and q the moving
average order. The simplest ARMA model is first-order
autoregressive and first-order moving average, or ARMA(1,1).
y y
t c e
1 t 1 e
1 t 1 t
Time Series Models
16
9
The following steps should be used in modeling
Identification: (AR, MA, ARMA) using acf and pacf
Estimation: coefficients, by least square or iterative
Diagnostic: randomnes of error, etimated coefficients
significantly different from zero
> kingsdeath
Time Series:
Start = 1
End = 42
Frequency = 1
[1] 60 43 67 50 56 42 50 65 68 43 65 34
47 34 49 41 13 35 53 56 16 43 69 59 48
[26] 59 86 55 68 51 33 49 67 77 81 67 71
81 68 70 77 56
Time Series Models Using R
Plotting Time Series 17
2
> plot.ts(kingsdeath)
Time Series Models Using R
17
The following generates from a normal
3 distribution 300 elements and classifies
into 100 rows and 3 columns and converts to timeseries data.
> z <-ts(matrix(rnorm(300), 100, 3), start=c(1961, 1),
z
frequency=12)
> plot(z)
2
Series 1
1
0
-1
-2
-3
2
Series 2
1
0
-1
-2
2
Series 3
1
0
-1
-2
Time
Time Series Models Using R
17
4
> plot(z, plot.type="single", lty=1:3)
Time Series Models Using R
17
5
Plot time series against lagged versions of themselves.
Helps visualizing ‘auto-dependence’ even when auto-correlations
vanish.
The following command plots lags against the observed.
> lag.plot(x, lags = 1, layout = NULL, set.lags = 1:lags, main
= NULL, asp = 1, diag = TRUE, diag.col = "gray", type = "p",
oma = NULL, ask = NULL, do.lines = (n <= 150), labels =
do.lines, ...)
Example: plot
> lag.plot(z, 1, diag.col = "forest green")
Time Series Models Using R
17
6
Time Series Models Using R
17
Fit an autoregressive time series
7 model to the data, by default
selecting the complexity by AIC.
Syntax
> ar(x, aic = TRUE, order.max = NULL, method=c("yule-
walker", "burg", "ols", "mle", "yw“), na.action, series,
...)
Example: on England kings age at death data
> kingsdeath.ar <-
ar(kingsdeath,aic=TRUE,method="ols",order.max=2)
Call:
ar(x = kingsdeath, aic = TRUE, order.max = 2, method = "ols")
Coefficients:
1
0.4006
Intercept: -0.108 (2.368)
Order selected 1 sigma^2 estimated as 229.9
Time Series Models Using R
17
The following used to predict time 8
series model.
> predict(kingsdeath.ar, n.ahead=10)
$pred
Time Series:
Start = 43
End = 52
Frequency = 1
[1] 55.46385 55.24907 55.16303 55.12856 55.11476 55.10923
55.10701 55.10612
[9] 55.10577 55.10562
$se
Time Series:
Start = 43
End = 52
Frequency = 1
[1] 15.16372 16.33518 16.51544 16.54418 16.54879 16.54953
16.54965 16.54967
[9] 16.54967 16.54967
Time Series Models Using R
17
9
Fit an ARIMA model to a univariate time series.
Syntax
> arima(x, order = c(0, 0, 0), seasonal = list(order = c(0, 0,
0), period = NA), xreg = NULL, include.mean = TRUE,
transform.pars = TRUE, fixed = NULL, init = NULL, method =
c("CSS-ML", "ML", "CSS"), n.cond, optim.control = list(),
kappa = 1e6)
> kingsdeath.arima <- arima(kingsdeath, c(3, 0, 0))
> tsdiag(kingsdeath.arima ) #Diagnostic fitted model
> predict(kingsdeath.arima,5) #forcasting for 5 years
18
0
6. Introduction to SAS
Introduction
18
1
SAS is a comprehensive statistical software system which
Explorer
Window
Editor Window
Results
Window
(not shown)
184
SAS Output Windows…
185
What is SAS?
PROC steps:
Produce reports, summarize data, sort
data,
create formats, generate graphs, etc. etc.
etc.
Global statements:
Remain in effect until changed by
another global statement or SAS
session ends
Examples:
Titles, Footnotes, Options, Macros,
Libname
Comment statements:
Ignored by SAS when code is run
*comment statement;
or
/* comment statement */
Programming in SAS
multiple lines
Code and variable names not case sensitive
Even accepts some misspellings
Technical papers
SAS User Groups
Websites
Books
Classes
http://www.sascommunity.org/wiki/
Learning_SAS_-_Resources_You_Can’t_Live_Without
Popular Websites
www.google.com
Popular Websites
www.ats.ucla.edu/stat/s
as Popular Websites
Explains
both the
stats and
the SAS
code
www.ats.ucla.edu/stat/s
as Popular Websites
www.lexjansen.com
Popular Websites
SAS Help and Documentation
Popular Websites
SAS Help
20
4
SAS Language
20
5
Composed of SAS Statements
All SAS Statements end with a Semi-colon (;) except data lines
comment)
Free-format – can use upper or lower case, i.e. SAS is case inse.
SAS Result
Raw of Report
Data data
set Analysi
s PROC
step
SAS
data
set
SAS Data Libraries
20
9
SAS files are stored in SAS data libraries.
21
1
The SAS data set will have the name provided by you.
name.sas7bdat
data name;
input var1 var2 var3;
datalines;
…
…
;
run;
Note: to tell SAS that a variable is character, add a dollar sign ($) after the
Remark that there are NO semicolons at the end of the lines, but
DATA student;
INPUT id sex $ CGPA Age;
DATALINES;
1 M 3.5 21
2 M 2.3 18
3 F 3.2 19
;
RUN ;
SAS Procedures
21
4
General structre of a PROC step is:
proc name; #the name of the procedure wanted
…;
…;
run;
Remark: the procedure will be executed on the most recently
created SAS data set in the current SAS session. To avoid this, add
DATA=data_name to the PROC statement.
Two Basic steps in SAS programs:
Data Steps
Proc Steps
SAS Procedures
21
5
DATA Steps
PROC Steps
from the input data set into the output data set.
Manipulating and Modifying a Data Set
21
7
Some possible actions are
DATA Student;
INPUT id $ Semester gender $ height SGPA;
CARDS;
RNS/001/05/ 1 M 1.72 2.50
RNS/001/05/ 2 M 1.72 3.10
RNS/001/05/ 3 M 1.72 2.85
RNS/002/05/ 1 F 1.68 3.20
RNS/004/05/ 2 M 1.69 2.42
RNS/004/05/ 1 M 1.69 2.35
RNS/003/05/ 1 F 1.74 2.56
;
PROC PRINT ;
RUN;
Manipulating and Modifying SAS Data Sets
22
DATA student2; 5
SET student;
WHERE gender='F';
PROC PRINT ;
RUN;
DATA student3;
SET student;
DROP height;
PROC PRINT;
RUN;
DATA student4;
SET student;
RENAME id=Identification;
PROC PRINT;
RUN;
Combining SAS Data Sets
Data
Set 1
Data Data Data
Set 1 Set 2 Set 2
SAS Merge allows the programmer to combine data from
multiple datasets.
Each observation from dataset one is combined with a
corresponding observation in dataset two (and dataset three,
etc.) Which observations and which data fields from the source
datasets will be included in the resulting dataset is determined
by the detailed “instructions” provided by the programmer.
Combining SAS Data Sets
227
One-to-one
One-to-many
Many-to-many
Combining SAS Data Sets
228
The SAS statements for all three types of match merge are
merging variable(s).
One-to-One Merging
229
Dataset1 Dataset 2
ID Credit ID Course
DATA Dataset3;
Dataset3
MERGE Dataset1 Dataset2;
ID Credit Course
BY ID; NS/01/06 3 Stat2022
NS/02/06 2 Stat3133
RUN;
NS/03/06 4 Stat3111
One-to-Many Merging
230
Dataset 4 Dataset 5
ID Credit ID Course
ID Credit Course
DATA Dataset6;
NS/01/06 3 stat2022
MERGE Dataset4 Dataset5; NS/01/06 3 stat2013
BY ID; NS/02/06 2 stat3133
Dataset9
DATA Dataset9;
ID Y Z
MERGE Dataset7 Dataset8; NS/01/06 A1 AA1
NS/01/06 A2 AA2
BY ID;
NS/01/06 A2 AA3
RUN; NS/02/06 B1 BB1
NS/02/06 B2 BB2
SAS Expressions
23
2
A SAS operator is a symbol that represents:
prefix operators
infix operators
The word NOT and its equivalent symbols are also prefix operators.
When used to perform arithmetic operations, the plus and minus signs
23
5
Arithmetic operators
Symbol Definition
+ addition
- subtraction
* multiplication
/ division
** exponentiation
SAS Expressions
23
Comparison operators 6
Mnemonic
Symbol Equivalent Definition
= eq equal to
^= ne not equal to
form.
The basic syntax is:
For your knowledg run the following code (without any option) and see
Analysis variables:
are numeric variables that are used to compute statistics that are
reported in the body of the table.
The PROC TABULATE Procedure
24
3
The CLASS statement is used to specify any categorical
The VAR statement is used to list any variables that will be used
for computing the statistics that are to appear in the body of the
table.
Frequency counts can be computed without an analysis
variable.
However, under most conditions, the VAR statement is required.
The PROC TABULATE Procedure
24
4
The TABLE statement is used to define both the arrangement of the
rows and columns of the table, as well as the requests for any summary
statistics.
In a TABLE statement, the comma is a very important symbol, because
statistics should be produced, and pertaining to which analysis variables. Each statistic is
identified by a keyword.
N = the number of observations, the frequency count
MIN = the smallest value
MAX = the largest value
MEAN = the arithmetic mean, or the average value
STD = the standard deviation
VAR = the variance
MEDIAN = the middle (50th percentile) value
SKEWNESS = a measure of the asymmetry of the distribution of values SUM = the sum of the
values
PCTN = the percentage that one frequency is of another frequency.
The PROC TABULATE Procedure Example
24
8
The following SAS code displays four different tables.
RUN;
Run this code from SAS and have a look at the four different tables displayed.
The PROC TABULATE Procedure Surprise
24
DATA Ztable; 9
DO row = 0.0 TO 3.4 BY 0.1;
DO column = 0.00 TO 0.09 BY 0.01;
z = row + column;
prob = PROBNORM(z); OUTPUT;
END;
END;
RUN;
tables.
It Counts! Answers question how many
“Heart”.
RUN;
RUN;
The PROC MEAN Procedure …
25
6
Statistical options that may be requested are: (default statistics are
underlined.)
N - Number of observations RANGE – Range
NMISS - Number of missing observations VAR - Variance
MEAN - Arithmetic average) USS – Uncorr. sum of squares
STD - Standard Deviation CSS - Corr. sum of squares
MIN - Minimum (smallest) STDERR - Standard Error
MAX - Maximum (largest) T - Student’s t value for testing Ho: md = 0
SUM - Sum of observations SUMWGT - Sum of the WEIGHT variable values
PRT - P-value associated with t-test above
Example PROC MEAN Procedure …
25
7
Consider the Framingham data set in ‘‘sashelp’’ library called
RUN;
The PROC SUMMARY Procedure
25
9
CLASS: Identify the variables that will be sub-grouped for the
analysis.
VAR: Identify the variable that analysis will be done and the order
CLASS sex;
VAR diastolic;
RUN;
Test hypotheses
P. predicted values
R. residuals
L95. lower 95% CI bound for individual prediction
U95. upper 95% CI bound for individual prediction
L95M. lower 95% CI bound for mean of dependent variable
U95M. upper 95% CI bound for mean of dependent variable
GPLOT also creates the plot in the separate Graph window in SAS,
Ty p e i t i n S A S a n d g o t h ro u g h t h e o u t p u t
The PROC CHART Procedure
27
3
It produces vertical and horizontal barcharts, block charts,
sashelp library)
Ty p e i t i n S A S a n d g o t h ro u g h t h e o u t p u t
The PROC CHART Procedure
27
6
Ploting a pie chart
Example:
PROC UNIVARIATE DATA=sashelp.heart;
VAR systolic diastolic ;
HISTOGRAM systolic diastolic ;
RUN;
Ploting boxplot
27
Syntax: 8
Example:
PROC BOXPLOT DATA=sashelp.heart;
PLOT systolic *sex /BOXSTYLE= schematic; *data should be sorted by sex;
INSET min mean max stddev/header='Overall statistics'
pos=tm; *bm, lm, rm;
RUN;
The PROC GCHART Procedure
27
9
Better charts can be drawn using GCHART procedure.
GOOD
LUCK!!!