Statistical Computing II-slide

STATISTICAL COMPUTING II
(Stat 2022)

R AND SAS
1. INTRODUCTION TO R

1.1 THE R ENVIRONMENT


R is an integrated suite of software facilities for
data manipulation, calculation and graphical
display.
R is an implementation of S, an object-oriented
language for statistical programming.
The term “environment” is intended to
characterize it as a fully planned and coherent
system, rather than an incremental accretion of
very specific and inflexible tools, as is frequently
the case with other data analysis software.
It is free & easy to learn
1.2 GETTING STARTED WITH R AND THE
ONLINE HELP SYSTEM

 Starting R
The beauty of R is that it is free, open-source
software, so anyone can use it (no license fee).
To obtain R for Windows (or Mac), go to the
Comprehensive R Archive Network (CRAN), which is
linked from http://www.r-project.org, download
the installer and run it.
Once you have installed R, there will be an icon
on your desktop. Double click it and R will start
up.
Cont’d
o The > is a prompt symbol displayed by R, not typed
by you. This is R's way of telling you that it is ready
for you to type a command.
o If a command is not complete at the end of a line, R
gives a different prompt, by default
+
on the second and subsequent lines, and continues to
read input until the command is syntactically
complete.
o To see the list of available datasets, call the data()
function with no arguments:
> data()
Getting online help in R
R has a built-in help facility.
To get more information on any specific named
function, e.g. sqrt(), the command is
> help(sqrt)
or
> ?sqrt
For a feature specified by special characters, the
argument must be enclosed in single or double
quotes (e.g. "[[")
> help("[[")
This is also necessary for a few words with syntactic
meaning, including if, for, while and function.
Getting online help in R: commands to try
help(sqrt)
abs(7)
sqrt(25)
25^0.5
??sqrt                # search the help system for "sqrt"
help.start()          # open the HTML help in a browser
??solve
example("log")
?help
str()                 # structure of an object
fix()                 # edit an object in a spreadsheet-like editor
RRR=read.csv(file.choose(),header=T)
RRR=read.table(file.choose(),header=T)
RRR=read.delim(file.choose(),header=T)
rm()
dim()
head()
tail()
RRR[5:9,]             # rows 5 to 9
RRR[-(5:9),]          # all rows except 5 to 9
names()
attach()
detach()
class()
levels()
as.factor() class() summary()
as.numeric()
mean(age[gender=="male" & age>15])
Cont’d
o Help is also available in HTML format by running
> help.start()
which launches a web browser in which the help pages
can be browsed with hyperlinks.
o The help.search command (alternatively ??) allows searching
for help in various ways.
e.g.
> ??solve # or
> help.search("solve")
Try ?help.search for details and more examples.
o The examples on a help topic can normally be run with
> example(topic) # e.g. example(log)
o For further information about online help, use
> ?help
1.3 COMMANDS AND EXECUTION
Technically, R is an expression language with
a very simple syntax.
Users are expected to type inputs
(commands) into R in the console window.
Commands:
o consist of expressions or assignments
o are separated by a semicolon (;) or by a newline
o can be grouped together using braces ('{' and '}')
Cont’d
Some preliminaries on entering commands
o Expressions and commands in R are case-sensitive,
e.g. X and x do not refer to the same variable.
o Command lines do not need to be terminated by any special
character such as the semicolon required in SAS.
o R ignores anything following the hash character (#) as a comment.
o You can use the arrow keys on the keyboard to scroll back to
previous commands.
o Names of objects in R:
o can be created using letters, digits, the . (dot) and the _ (underscore) symbols,
e.g. Weight, Wt.male
o must start with a letter or a dot, and a leading dot must not be
followed by a digit.
o Some names are used by the system, e.g. c, q, t, C, D, F, I, T,
diff, df, pt. AVOID them!
Some Important Commands
> sort() :- Sorts a vector of values in ascending (default) or
descending order.
> rank() :- Returns the ranks of the values in a vector.
o By default, tied (i.e., equal) values receive the average of
their rank positions.
> rep() :- Repeats the same value several times, e.g., rep(pi,12)
> seq() :- Generates regular sequences of values, e.g.,
seq(from=5,to=30,by=5)
> print() :- Prints an object, optionally in a different format
than simply typing its name,
e.g., print(pi,digits=20)
> rm() :- Removes (i.e., deletes) an object
> ls() :- Lists all existing objects
Cont’d
> sqrt() :- Returns the square root of the values of a vector
> log() :- Returns the natural logarithm of the values of a vector
> log10() :- Returns the base-10 logarithm of the values of a vector
> exp() :- Returns the exponential of the values of a vector
> abs() :- Returns the absolute values of a vector
> sin() :- Returns the sine of the values of a vector (in radians)
> cos() :- Returns the cosine of the values of a vector (in radians)
> tan() :- Returns the tangent of the values of a vector (in radians)
> cor() :- Returns the correlation coefficient between two vectors
> summary() :- Gives several descriptive statistics of an object
Assigning values to variables
Assignment can be made in any of the following ways:
o Using "<-"
e.g.
> x <- 5
o Using "="
e.g.
> x = 5
o Using the function assign()
e.g.
> assign("x", 5)
o Using "->"
e.g.
> 5 -> x
o Variables that contain many values (vectors) can be created with the
concatenate function c():
> y <- c(3,7,9,11)
> y
[1] 3 7 9 11
Calculator

>2+3 #addition
[1] 5
>3^2 #square
[1] 9
>2*3 #multiplication
[1] 6
> 4/2 #division
[1] 2
>sqrt(9) #square root
[1] 3
>exp(2) #e squared(2.718282^2)
[1] 7.389056
>pi #R knows about pi
[1] 3.141593
1.4 Objects and simple manipulation
1.4.1 Objects; Vectors; Generating sequences
Objects
o The entities that R creates and manipulates are known
as objects.
o These may be variables, arrays of numbers, character
strings, functions, or more general structures built
from such components.
o Every object you create is kept in the workspace for the
rest of the session, until you remove it.
o To list the objects you have created in a session, use
either of the following commands:
> objects()
> ls()
Cont’d
o To remove all the objects in the workspace, type:
> rm(list=ls(all=T))
o To remove specific objects, use:
> rm(x, y) # only objects x and y will be removed
o All objects created during an R session can be
stored permanently in a file for use in future R
sessions.
o To quit R, use the close (X) button
of the window or the command:
> q()
Vectors
Vectors are the simplest type of object in R.
They can easily be created with c(), the
combine (concatenate) function.
There are 3 main types of vectors:
o Numeric vectors
o Character vectors
o Logical vectors
Cont’d
o Numeric vector: a single entity consisting of
an ordered collection of numbers.
e.g. To set up a numeric vector x consisting of
the 5 numbers 10, 6, 3, 6, 22, we use any one
of the following commands:
> x <- c(10, 6, 3, 6, 22) # or
> x = c(10, 6, 3, 6, 22) # or
> assign("x", c(10, 6, 3, 6, 22)) # or
> c(10, 6, 3, 6, 22) -> x
Cont’d
The further assignment
> y <- c(x,0,x)
would create a vector y with 11 entries
consisting of two copies of x with a zero in
the middle place.
To print the contents of x:
> x
[1] 10 6 3 6 22
Note: The [1] in front of the result is the index
of the first element shown on that line of output.
Cont’d
Functions that summarize a vector
> length(x) # the number of elements in x
> sum(x) # the sum of the values of x
> mean(x) # the mean of the values of x
> var(x) # the variance of the values of x
> sd(x) # the standard deviation of the values of x
> min(x) # the minimum of the values of x
> max(x) # the maximum of the values of x
> prod(x) # the product of the values of x
> range(x) # the minimum and maximum of x, as a vector of length 2
Cont’d
Functions that return vectors with the same
length
o To print the ordering permutation of x, i.e. the indices that
would sort x (this is not the same as rank(x)):
> order(x)
> sort.list(x)
o To print the values of x in increasing order:
> sort(x)
> x[order(x)]
> x[sort.list(x)]
o To print the reciprocals of the values of x:
> 1/x
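The difference between order(), rank() and sort() is easy to confuse, so a small sketch may help. It reuses the vector x from the earlier slides; the names ord and rnk are only illustrative.

```r
x <- c(10, 6, 3, 6, 22)

ord <- order(x)   # positions of the elements of x, taken in increasing order
rnk <- rank(x)    # rank of each element; tied values get the average rank

print(ord)        # 3 2 4 1 5
print(rnk)        # 4.0 2.5 1.0 2.5 5.0
print(x[ord])     # 3 6 6 10 22, the same as sort(x)
```

Here order(x) answers "where is the smallest value, the next smallest, ...", while rank(x) answers "what position does each value take after sorting".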
Cont’d
o To print the sin, cos, tan, asin, acos, atan, log, exp, … of the
values of x:
> sin(x)
> cos(x)
:
> exp(x)
o The parallel maximum and minimum functions pmax and pmin
return a vector (of length equal to their longest argument) that
contains in each element the largest (smallest) element in that
position in any of the input vectors.
> pmax(x,6) # returns the values of x, but values that are
# less than 6 are replaced by 6
> pmin(x,6) # returns the values of x, but values that are
# greater than 6 are replaced by 6
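As a minimal sketch of the parallel maximum and minimum, the lines below apply pmax and pmin to the vector x used earlier. The second vector y is made up purely for illustration.

```r
x <- c(10, 6, 3, 6, 22)
y <- c(1, 20, 5, 2, 30)        # hypothetical second vector

lower_bounded <- pmax(x, 6)    # 10 6 6 6 22: values below 6 are raised to 6
upper_bounded <- pmin(x, 6)    # 6 6 3 6 6: values above 6 are lowered to 6
elementwise   <- pmax(x, y)    # 10 20 5 6 30: larger of x[i] and y[i]

print(lower_bounded)
print(upper_bounded)
print(elementwise)
```

In contrast, max(x, y) would return the single largest value over both vectors.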
Cont’d
o Character vectors
To set up a character/string vector z
consisting of 3 place names, use:
> z <- c("Canberra", "Sydney", "Newcastle")
# or
> z <- c('Canberra', 'Sydney', 'Newcastle')
Character strings are entered using either
matching double (") or single (') quotes, but
are printed using double quotes (or
sometimes without quotes).
Cont’d
They use C-style escape sequences, with \ as
the escape character: a backslash is entered as \\
(print() shows it as \\, cat() shows it as \), and
inside double quotes " is entered as \".
Common useful escape sequences are:
"\n" for new line
"\t" for tab
"\b" for backspace
e.g.
> cat("Abebe","\n","zelalem","\n")
Cont’d
o Character vectors can be combined using c()
> z <- c("Canberra", "Sydney", "Newcastle")
> c(z, "Mary")
[1] "Canberra" "Sydney" "Newcastle" "Mary"

o Logical Vectors
o A logical vector is a vector whose elements are TRUE,
FALSE or NA.
Note: TRUE and FALSE are often abbreviated as T and F
respectively. However, T and F are just variables that
are set to TRUE and FALSE by default; they are not
reserved words and hence can be overwritten by the
user.
Cont’d
o Logical vectors are generated by conditions.
e.g.
> temp <- x>13
sets temp as a vector of the same length as x
with values FALSE corresponding to elements
of x where the condition is not met and TRUE
where it is met.
The comparison operators are <, <=, >, >=, == for
exact equality and != for inequality.
Cont’d
o In addition, if c1 and c2 are logical expressions, then c1 & c2 is their
intersection ("and"), c1 | c2 is their union ("or"), and !c1 is the
negation of c1.
o The function is.na(x) gives a logical vector of the same size as x with
value TRUE if and only if the corresponding element in x is NA
(where NA means "not available", i.e. a missing value).
e.g.
> z <- c(1:3,NA)
> ind <- is.na(z)
> ind
[1] FALSE FALSE FALSE TRUE
o Note that there is a second kind of "missing" value, produced by
numerical computation: the so-called Not a Number (NaN) value.
e.g.
> 0/0
[1] NaN
Note: is.na(xx) is TRUE both for NA and NaN, whereas is.nan(xx) is TRUE only for NaN.
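A short sketch of how NA and NaN behave in practice, under the assumption that missing values should simply be dropped before averaging. The vector v is made up for illustration.

```r
v <- c(1, NA, 0/0, 4)             # one NA and one NaN

na_flags  <- is.na(v)             # FALSE TRUE TRUE FALSE: TRUE for NA and NaN
nan_flags <- is.nan(v)            # FALSE FALSE TRUE FALSE: TRUE only for NaN

m_all   <- mean(v)                # missing: NA/NaN propagate through arithmetic
m_clean <- mean(v, na.rm = TRUE)  # 2.5: NA and NaN are dropped first

print(na_flags)
print(nan_flags)
print(m_clean)
```

Most summary functions (sum, mean, var, sd, min, max) accept the same na.rm argument.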
Cont’d
o Indexing Vectors
Vector indices are placed within square
brackets: []
Vectors can be indexed in any of the following
ways:
o Vector of positive integers
o Vector of negative integers
o Vector of named items
o Logical vector
Cont’d
e.g.
> const=c(3.1416,2.7183,1.4142,1.6180)
> names(const)=c("pi","euler","sqrt2","golden")
> const[c(1,3,4)] # printing the 1st, 3rd and 4th elements of const
    pi  sqrt2 golden
3.1416 1.4142 1.6180
> const[c(-1,-2)] # printing all elements of const except the 1st and the 2nd
 sqrt2 golden
1.4142 1.6180
> const>2
    pi  euler  sqrt2 golden
  TRUE   TRUE  FALSE  FALSE
> const[const>2]
    pi  euler
3.1416 2.7183
Cont’d

o Modifying Vectors
 To alter the contents of a vector, similar methods can be
used.
e.g. Create a variable x with 5 elements:
10 5 3 6 21
>x=c(10,5,3,6,21)
Now, to modify the first element of x and assign it a value
7 use
>x[1]<-7
>x
[1] 7 5 3 6 21
 The following command replaces any NA (missing) values in
the vector w with the value 0:
>w[is.na(w)]<-0
Generating sequences
R has a number of ways to generate
sequences of numbers.
These include:
o The colon ":"
e.g.
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
Cont’d
Note: The colon operator has high priority within
an expression.
e.g. 2*1:10 is equivalent to 2*(1:10)
> 2*1:10
[1] 2 4 6 8 10 12 14 16 18 20
 The seq() function
e.g.
>seq(1:10)
>seq(from=1, to=10)
>seq(to=10, from=1)
are all equivalent to 1:10
Cont’d
Note: The parameters by=value and length=value
specify a step size and a length for the sequence,
respectively. If neither of these is given, the
default by=1 is assumed.
e.g.
> seq(1,5, by=2)
[1] 1 3 5
> seq(1,10, length=5)
[1] 1.00 3.25 5.50 7.75 10.00
>seq(from=1, by=2.25, length=5)
[1] 1.00 3.25 5.50 7.75 10.00
Cont’d
o The function rep() can be used for replicating an
object in various ways.
o The command
> rep(x, times=5) # or
> rep(x, 5)
will print 5 copies of x end to end,
o while the command
> rep(x, each=5)
will print each element of x five times before
moving on to the next.
o Furthermore, the command
> rep(c(1,4), c(2,3))
will print 1 two times and then 4 three times (1 1 4 4 4).
Matrices, Arrays, Lists and Data Frames

Matrices
 A matrix can be regarded as a generalization of a vector.
As with vectors, all the elements of a matrix must be of
the same data type.
 A matrix can be generated in two ways.
Method 1: Using the function dim:
Example
>x <- c(1:8)
>dim(x) <- c(2,4)
>x
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 2 4 6 8
Cont’d
Method 2: Using the function matrix():
> X <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
dimnames=list(rownames, colnames))
e.g.
> x <- matrix(c(1:8),2,4,byrow=F,
dimnames=list(rrrr=c("A","B"),
cccc=c("D","E","F","G")));x
    cccc
rrrr D E F G
   A 1 3 5 7
   B 2 4 6 8
o By default the matrix is filled by column. To fill the matrix
by row, specify byrow = T as an argument in the matrix
function.
Cont’d
Matrices can also be built by binding vectors together as
columns (cbind) or as rows (rbind).
Example:
>cbind(c(1,2,3),c(4,5,6))
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

>rbind(c(1,2,3),c(4,5,6))
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Cont’d
o Matrix operations (multiplication, transpose, etc.) can easily
be performed in R using a few simple functions like:
Name          Operation
dim()         dimension of the matrix (number of rows and columns)
as.matrix()   used to coerce an argument into a matrix object
%*%           matrix multiplication
t()           matrix transpose
det()         determinant of a square matrix
solve()       matrix inverse; also solves a system of linear equations
eigen()       computes eigenvalues and eigenvectors
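The functions in the table can be tried on a small matrix. The sketch below uses a made-up 2 x 2 matrix A and right-hand side b; the object names are illustrative only.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # filled by column: rows are (2, 1) and (1, 3)
b <- c(3, 5)

d     <- det(A)        # 2*3 - 1*1 = 5
A_inv <- solve(A)      # inverse of A
x     <- solve(A, b)   # solves A %*% x = b without forming the inverse
At    <- t(A)          # transpose (A is symmetric, so At equals A)
ev    <- eigen(A)      # list with components values and vectors

print(A_inv %*% A)     # identity matrix, up to rounding error
print(x)               # 0.8 1.4
```

Note that A * A would multiply element by element; matrix multiplication always needs %*%.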
Cont’d
R has a number of matrix-specific operations;
some examples follow.
Cont’d
# Identify rows, columns or elements using subscripts.
> x[,4] # 4th column of matrix
> x[3,] # 3rd row of matrix
> x[2,3] # entry in row 2 and column 3
> x[2:4,1:3] # rows 2,3,4 of columns 1,2,3
# Other matrix commands are:
> apply(A,1,sum) # apply the sum function to the rows of A
> apply(A,2,sum) # apply the sum function to the columns of A
> sum(diag(A)) # trace of A
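The subscripting and apply() commands above can be checked on a concrete matrix. This is a sketch on a made-up 4 x 4 matrix; rowSums() and colSums() are mentioned in the comments as the usual shortcuts.

```r
A <- matrix(1:16, nrow = 4)   # 4 x 4, filled by column

col4     <- A[, 4]            # 13 14 15 16
row3     <- A[3, ]            # 3 7 11 15
entry23  <- A[2, 3]           # 10
block    <- A[2:4, 1:3]       # 3 x 3 sub-matrix

row_sums <- apply(A, 1, sum)  # 28 32 36 40, same as rowSums(A)
col_sums <- apply(A, 2, sum)  # 10 26 42 58, same as colSums(A)
trace_A  <- sum(diag(A))      # 1 + 6 + 11 + 16 = 34

print(row_sums)
print(col_sums)
print(trace_A)
```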
Cont’d
Arrays: can be considered as multiply subscripted
collections of data entries, for example numeric.
o Arrays are generalizations of vectors and matrices.
That is, vectors in the mathematical sense are
one-dimensional arrays, whereas matrices are
two-dimensional arrays; higher dimensions are also
possible.
o There are two methods of creating arrays in R.
Method 1: Using vectors
o A vector can be used by R as an array only if it has a
dimension vector as its dim attribute.
o A dimension vector is a vector of non-negative integers.
If its length is k then the array is k-dimensional.
Cont’d
o An array can be created by giving a vector a dim attribute,
which has the form
> z <- data_vector
> dim(z) <- dimension_vector
o Example: The following is a 3 x 5 x 100 (3-dimensional) array
with dimension vector c(3,5,100) and a data vector of 1500
elements.
> z=c(1:1500)
> dim(z) <- c(3,5,100)
o Example: If the dimension vector for an array, say A, is
c(3,4,2), then there are 3 x 4 x 2 = 24 entries in A and the data
vector holds them in the order A[1,1,1], A[2,1,1], ..., A[2,4,2],
A[3,4,2].
> A=c(5:28)
> dim(A)=c(3,4,2)
Cont’d
Method 2: Using the function array()
As well as by giving a vector a dim
attribute, arrays can be constructed from vectors
with the array function, which has the form
> Z <- array(data_vector, dim_vector)
For example, if the vector h contains 24 or
fewer numbers, then the command
> Z <- array(h, dim=c(3,4,2))
would use h to set up a 3 by 4 by 2 array in Z (if h is
shorter than 24, its values are recycled). If
the size of h is exactly 24, the result is the same
as
> Z <- h ; dim(Z) <- c(3,4,2)
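To make the storage order concrete, the following sketch builds the 3 x 4 x 2 array from the earlier example with array() and looks up a few entries. The second array Z, built from a data vector of only 6 values, is a made-up illustration of recycling.

```r
A <- array(5:28, dim = c(3, 4, 2))   # 24 values; the first subscript varies fastest

first  <- A[1, 1, 1]   # 5, the first value of the data vector
second <- A[2, 1, 1]   # 6
last   <- A[3, 4, 2]   # 28, the last value
slab   <- A[, , 2]     # the second 3 x 4 layer, a matrix holding 17:28

# A data vector shorter than the array is recycled to fill it
Z <- array(1:6, dim = c(3, 4, 2))

print(dim(A))
print(slab)
```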
Cont’d
Lists: are collections of arbitrary objects.
That is, the elements of a list can be objects
of any type and structure. Consequently, a list
can contain another list, and it can therefore
be used to construct arbitrary data
structures.
A list could consist of a numeric vector, a
logical value, a matrix, a complex vector, a
character array, a function, and so on.
Cont’d
o Lists are created with the list() command:
L <- list(object-1, object-2, …, object-m)
Example:
> L <- list( c(1,5,3), matrix(1:6, nrow=3), c("Hello", "world") )
> L
[[1]]
[1] 1 5 3

[[2]]
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

[[3]]
[1] "Hello" "world"
Cont’d
o The elements of the list are accessed with the [[ ]] operator.
Examples: Consider the previous example
> L[[1]] # first element of L
[1] 1 5 3
> L[[2]][2,1] # element [2,1] of the second element of L
[1] 2
# Note that L[[2]] is a matrix, which can be indexed with []
> L[[c(3,2)]] # recursively: the 2nd element of element 3 of L
[1] "world"
OR
> L[[3]][2]
[1] "world"
Cont’d
Data frames: regarded as an extension of
matrices.
Data frames can have columns of different data
types and are the most convenient data
structure for data analysis in R.
Data frames are lists with the constraint that
all elements are vectors of the same length.
The command data.frame() creates a data
frame:
> dat <- data.frame(object-1, object-2, …, object-m)
Cont’d
o Example:
> name=c("Eden","Solomon","Zelalem","Kidist")
> age=c(18,22,25,27)
> sex=c("F","M","M","F")
> stud=data.frame(name,age,sex)
> stud
name age sex
1 Eden 18 F
2 Solomon 22 M
3 Zelalem 25 M
4 Kidist 27 F
o To display the column names:
> names(stud) # OR colnames(stud)
o To display the row names:
> rownames(stud)
Cont’d
You can use the "names" function to change the
column names:
> names(stud) <- c("name","age","sex")
> stud
name age sex
1 Eden 18 F
2 Solomon 22 M
3 Zelalem 25 M
4 Kidist 27 F
o Similarly, use the "row.names" function to change the row
names:
> row.names(stud) <- c("Wrt","Ato1","Ato2","Wro")
> stud
Note: Duplicate row.names are not allowed!
Cont’d
Creating a spreadsheet to input data from the
keyboard
# enter data using the editor
> mydata <- data.frame(age=numeric(0), gender=character(0), weight=numeric(0))
> mydata <- edit(mydata)
# note that without the assignment in the line above,
# the edits are not saved!
# edit() can also be applied to an existing data frame, e.g. > stud=edit(stud)
Use attach to make the variables accessible by name:
> attach(stud)
Use names to get a list of variable names:
> names(stud)
Cont’d
Selecting Parts of a Dataframe: There are
a variety of ways to identify the elements of a dataframe
> stud[1:2] # columns 1 & 2 of stud OR
> stud[,1:2] # columns 1 & 2 of stud
> stud[3,1:2] # row 3 of columns 1 & 2 of stud
> stud[1:2,] # rows 1 & 2 of all columns of stud
> stud[1:2,3] # rows 1 & 2 of column 3 of stud
> stud[c("name","age")] # columns name and age from stud
> stud$age # variable age in the dataframe stud
READING(importing) DATA
Large data objects will usually be read as values
from external files rather than entered during an R
session at the keyboard.
In R you can import text files with the function
read.table.
Syntax: read.table("file name", arguments)
The function has arguments to specify the
header, the column separator, the number of lines
to skip, the data types of the columns, etc.
The functions read.csv and read.delim read
"comma separated values" files and tab-delimited
files, respectively.
Read data from other sources
o To read data from a plain-text data file (here with the
extension .dat), set the working directory and use read.table:
> setwd("C:\\Users\\DELL\\Desktop\\Data for R")
> teachers <- read.table("teachers.dat",header=T)
o To read data from a comma-delimited file (extension .csv),
use read.csv:
> soils <- read.csv("SoilChemicalAnalysis.csv",header=TRUE)
o To read data from SPSS, Stata and SAS, use the library
foreign:
> library(foreign)
o To read data from an SPSS file (extension .sav), use read.spss()
with to.data.frame=TRUE so that the result is a data frame.
Cont’d
You can use the factor function to create your own value
labels.
# variable v1 is coded 1, 2 or 3
# we want to attach value labels 1=red, 2=blue, 3=green
> mydata$v1 <- factor(mydata$v1, levels = c(1,2,3), labels =
c("red", "blue", "green"))
# variable y is coded 1, 3 or 5
# we want to attach value labels 1=Low, 3=Medium, 5=High
> mydata$y <- ordered(mydata$y, levels = c(1,3,5),
labels = c("Low", "Medium", "High"))
Note: factor and ordered are used the same way, with the
same arguments. The former creates factors and the latter
creates ordered factors.
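The practical difference between factor and ordered can be seen in a small sketch. The vectors v1 and y below are made-up codes using the same coding scheme as above; colour and size are illustrative names.

```r
v1 <- c(1, 3, 2, 1, 3)
colour <- factor(v1, levels = c(1, 2, 3), labels = c("red", "blue", "green"))

y <- c(5, 1, 3, 1)
size <- ordered(y, levels = c(1, 3, 5), labels = c("Low", "Medium", "High"))

print(table(colour))      # counts per label: red 2, blue 1, green 2
print(size < "High")      # FALSE TRUE TRUE TRUE: comparisons need an ordered factor
print(as.integer(size))   # underlying integer codes: 3 1 2 1
```

The same comparison on the unordered factor colour would give NA with a warning, because its levels have no defined order.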
Cont’d
Use the assignment operator <- to create new
variables. A wide array of operators and functions is
available here.
# Three examples for doing the same computations
# 1. using the $ notation
> mydata$sum <- mydata$x1 + mydata$x2
> mydata$mean <- (mydata$x1 + mydata$x2)/2
# 2. using attach()
> attach(mydata)
> mydata$sum <- x1 + x2
> mydata$mean <- (x1 + x2)/2
> detach(mydata)
# 3. using transform()
> mydata <- transform(mydata, sum = x1 + x2, mean = (x1 + x2)/2)
Cont’d
In order to recode data, you will probably use one
or more of R's control structures or the vectorized
ifelse() function.
# create 2 age categories
> mydata$agecat <- ifelse(mydata$age > 70, "old", "young")
# another example: create 3 age categories
> attach(mydata)
> mydata$agecat[age > 75] <- "Elder"
> mydata$agecat[age > 45 & age <= 75] <- "Middle Aged"
> mydata$agecat[age <= 45] <- "Young"
> detach(mydata)
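An alternative to the three separate assignments is the base function cut(), which is not used in these slides. The sketch below assumes the same cut points, 45 and 75, and a made-up age vector.

```r
age <- c(30, 45, 46, 75, 76)   # hypothetical ages

# With the default right = TRUE the intervals are
# (-Inf, 45], (45, 75] and (75, Inf], matching the conditions above
agecat <- cut(age,
              breaks = c(-Inf, 45, 75, Inf),
              labels = c("Young", "Middle Aged", "Elder"))

print(data.frame(age, agecat))
```

cut() returns a factor, so the categories keep the order Young, Middle Aged, Elder in tables and plots.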
Cont’d

MERGING DATASETS
 To merge two dataframes (datasets) horizontally, use the
merge function. In most cases, you join two dataframes by one
or more common key variables (i.e., an inner join).
# merge two dataframes by ID
>total <- merge(dataframeA,dataframeB,by="ID")
# merge two dataframes by ID and Country
>total <- merge(dataframeA,dataframeB,by=c("ID","Country"))
ADDING ROWS
 To join two dataframes (datasets) vertically, use the rbind
function.
 The two dataframes must have the same variables, but they do
not have to be in the same order.
>total <- rbind(dataframeA, dataframeB)
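A minimal sketch of both operations on tiny made-up data frames. The column names height, weight and score are hypothetical, and the all = TRUE argument is shown as one way to keep unmatched rows.

```r
dataframeA <- data.frame(ID = c(1, 2, 3), height = c(160, 172, 181))
dataframeB <- data.frame(ID = c(2, 3, 4), weight = c(65, 80, 72))

inner <- merge(dataframeA, dataframeB, by = "ID")              # IDs 2 and 3 only
outer <- merge(dataframeA, dataframeB, by = "ID", all = TRUE)  # IDs 1 to 4, NA where unmatched

# rbind() matches data frame columns by name, so their order may differ
part1 <- data.frame(ID = 1:2, score = c(10, 20))
part2 <- data.frame(score = c(30, 40), ID = 3:4)
stacked <- rbind(part1, part2)

print(inner)
print(outer)
print(stacked)
```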
Cont’d
o Suppose we have a text file data.txt that contains the following
text:
Author: John Davis
Date: 18-05-2007
Some comments..
Col1, Col2, Col3, Col4
23, 45, A, John
34, 41, B, Jimmy
12, 99, B, Patrick
 The data without the first few lines of text can be imported to an
R data frame using the following R syntax:
>myfile <- "C:\\Temp\\R\\Data.txt"
>mydf <- read.table(myfile, skip=3, sep=",",
header=TRUE)
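The import can be tried without the file on disk. In this sketch a temporary file stands in for C:\Temp\R\Data.txt, and strip.white = TRUE is an added assumption to remove the blanks that follow each comma.

```r
# Write the example file to a temporary location (stands in for the real path)
myfile <- tempfile(fileext = ".txt")
writeLines(c("Author: John Davis",
             "Date: 18-05-2007",
             "Some comments..",
             "Col1, Col2, Col3, Col4",
             "23, 45, A, John",
             "34, 41, B, Jimmy",
             "12, 99, B, Patrick"), myfile)

# skip = 3 drops the three comment lines; strip.white = TRUE removes
# the blanks after each comma in the character columns
mydf <- read.table(myfile, skip = 3, sep = ",", header = TRUE,
                   strip.white = TRUE)

str(mydf)
unlink(myfile)   # remove the temporary file
```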
2. Writing Functions
o To define a function in R, use the following syntax:
> name <- function(arg1, arg2, arg3, …, argk) expression1
o If a function requires more than one command, braces can
be used to group them.
o Example: a function to calculate the geometric mean
of a set of (positive) numbers:
> geomean <- function(x) 10^mean(log10(x))
o The function name is geomean, and it expects a single
argument x.
> x=c(10,10,10); geomean(x)
[1] 10
Writing Functions
o Conditional execution: if(), if() else and ifelse()
Syntax:
if (condition) {commands1}
if (condition) {commands1} else {commands2}
ifelse(conditions vector, yes vector, no vector)
o The command if() evaluates 'commands1' if the logical
expression 'condition' returns TRUE. Here 'commands1' is a
single command or a sequence of commands separated with ';'.
Writing Functions…
o The command if() else evaluates 'commands1'
if the logical expression 'condition' returns TRUE;
otherwise it evaluates 'commands2'.
o The command ifelse() returns a vector of the
same length as 'conditions vector', with elements
selected from either 'yes vector' or 'no vector'
depending on whether the corresponding element of
'conditions vector' is TRUE or FALSE.
Writing Functions…
o Example:
> x <- 4
> if ( x == 5 ){ x <- x+1 } else {x <- x*2}
> x
[1] 8
> if ( x != 5 & x>3 ){ x <- x+1 ; 17+2 } else { x <- x*2 ; 21+5 }
[1] 19
> x
[1] 9
> y <- 1:10
> ifelse ( y<6, y^2, y-1 )
[1] 1 4 9 16 25 5 6 7 8 9
Writing Functions…
> z <- 6:-3
> sqrt(z) # Produces a warning
[1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000 0.000000 NaN NaN NaN
Warning message:
In sqrt(z) : NaNs produced
> sqrt( ifelse(z>=0, z, NA) ) # No warning
[1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000 0.000000 NA NA NA
Writing Functions…
Example: Adding two vectors of different lengths in R will cause R to
recycle the shorter vector. The following instead adds the two vectors by
chopping off the longer vector so that it has the same length as the
shorter.
> x=1:10; y=2:8
> n1 <- length(x)
> n2 <- length(y)
> if(n1 > n2){z <- x[1:n2] + y} else {z <- x + y[1:n1]}
> z
[1] 3 5 7 9 11 13 15
Loops: for(), while() and repeat()
Syntax:
for ( var in set ) {commands}
while ( condition ) {commands}
repeat {commands}
Where
 the object set is a vector,
 commands is a single command or a sequence of

commands and
 var is a variable which may be used in commands.
Loops: for(), while() and repeat()…
o Note:
o The command for() is the R version of 'for each
element in the set do ...'.
o The command while() is the R version of 'as long
as the condition is TRUE do ...'.
o The command repeat is the R version of 'repeat
until I say break'. The command 'break' stops any loop;
control is then transferred to the first statement outside
the loop. The command 'next' halts the processing of the
current iteration and advances the looping index.
Loops: for()
Example:
 >x=c(20,22,27,38,42)
>summ=0
>for(i in 1:length(x)){summ=summ + x[i]}
>summ
[1] 149
 >x <- 0
> for ( i in 1:5 ) { if (i==3) { next } ;
x <- x + i }
> x # i=3 is skipped, so x <- 1+2+4+5
[1] 12
Loops: while() and repeat()…
 > y <- 1; j <- 1
> while ( y < 12 & j < 8 ) { y <- y*2 ; j
<- j + 1}
> y; j
[1] 16
[1] 5
 > z <- 3
> repeat { z<- z^2; if ( z>100 )
{ break }}
> z # the loop stopped after 81^2, so z==81^2
[1] 6561
Writing functions….
 Probably one of the most powerful aspects of the R
language is the ability of a user to write functions.
 When doing so, many computations can be
incorporated into a single function and (unlike with
scripts) intermediate variables used during the
computation are local to the function and are not saved
to the workspace.
 In addition, functions allow for input values used
during a computation that can be changed when the
function is executed.
Writing functions….
o The general format for creating a function is
fname <- function(arg1, arg2, ...) { R code }
where fname is any allowable object name and arg1,
arg2, ... are function arguments.
o As with any R function, the arguments can be assigned default
values.
o When you write a function, it is saved in your workspace
as a function object.
o Use the command return() for returning a value. If
you need to return several values, put them into a list and
return the list.
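Default values and list returns can be combined in one short sketch. The function describe() and its argument digits are hypothetical names chosen for illustration, not part of R.

```r
# describe() is a made-up helper; digits has a default value of 2
describe <- function(x, digits = 2) {
  stats <- list(n    = length(x),
                mean = round(mean(x), digits),
                sd   = round(sd(x), digits))
  return(stats)   # several values are returned together in one list
}

res  <- describe(c(20, 22, 27, 38, 42))      # uses the default digits = 2
res0 <- describe(c(20, 22, 27, 38, 42), 0)   # overrides the default

print(res$mean)    # 29.8
print(res0$mean)   # 30
```

The components of the returned list are accessed by name with $, as with any list.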
Simple examples
o Consider a function that returns the maximum of two
scalars or the statement that they are equal.
> f1 <- function(a, b) {
if(is.numeric(c(a,b))) {
if(a < b) return(b)
else if(a > b) return(a)
else print("The values are equal") }
else print("Character inputs not allowed.") }
Writing functions….
Factorial of a value
> fac1<-function(x) {
+ f <- 1
+ t <- x
+ repeat {
+ if (t<2) break
+ f <- f*t
+ t <- t-1 }
+ return(f)}
> fac1(5)
[1] 120

95% confidence interval of mean
> ci95<-function(x) {
+ t.value<-qt(0.975,length(x)-1)
+ standard.error<-sd(x)/sqrt(length(x))
+ ci<-t.value*standard.error
+ cat("95% Confidence Interval = ", mean(x)-ci,"to", mean(x)+ci,"\n") }
> ci95(c(5,10,13,15,10,9,11,12,16))
95% Confidence Interval = 8.68 to 13.77
Writing functions….
o Consider a function that computes the zero(s) of a
quadratic equation (assuming real roots, i.e. b^2 - 4ac >= 0).
> g=function(a,b,c){if(b**2-4*a*c==0)
cat("x=",-b/(2*a),"\n") else
cat("x=",(-b+sqrt(b**2-4*a*c))/(2*a),(-b-sqrt(b**2-4*a*c))/(2*a),"\n")}
o Binary operators: are functions with two
arguments.
o The arguments of such operators are placed on either side
of the operator name.
o R allows us to write our own binary operators of the
form %anything%.
Writing functions….
o Syntax: "%anything%"=function(…){…}
o The matrix multiplication operator, %*%, and the outer
product operator, %o%, are built-in examples of binary
operators of this form.
o Consider the following binary operator that calculates the
dot product of two vectors.
> "%H%"=function(x,y){
summ=0
for(i in 1:length(x)){summ=summ+x[i]*y[i]}
summ}
> c(1,2,3)%H%c(2,3,4)
[1] 20
3. Probability and Sampling Distributions
R as a set of statistical tables
o R allows for:
o the calculation of probabilities (including cumulative probabilities),
o the evaluation of probability density/mass functions,
o the calculation of percentiles, and
o the generation of pseudo-random variables
following a number of common distributions.
o R can therefore be used in place of a comprehensive set of
statistical tables.
Sampling Distributions….
o The following table gives the R names of various distributions along
with their additional arguments.
distribution   R name     arguments
normal         norm       mean, sd
chi-squared    chisq      df, ncp
F              f          df1, df2, ncp
Student's t    t          df, ncp
exponential    exp        rate
log-normal     lnorm      meanlog, sdlog
logistic       logis      location, scale
Poisson        pois       lambda
multinomial    multinom   size, prob
binomial       binom      size, prob
Sampling Distributions….
o For each distribution, R provides the following four
commands:
o dxxx: density (or probability mass) function of the xxx distribution
o pxxx: cumulative distribution function of the xxx distribution ('p' for
probability)
o qxxx: quantile function of the xxx distribution
o rxxx: random number generator for the xxx
distribution
where 'xxx' is the R name of the distribution.
o For example, for the normal distribution the four commands are
dnorm, pnorm, qnorm and rnorm.
Sampling Distributions….
o Example:
> dbinom(3,size=10,prob=.25)
# P(X=3) for X ~ Bin(n=10, p=.25)
> dpois(0:2, lambda=4)
# P(X=0), P(X=1), P(X=2) for X ~ Poisson(4)
> pbinom(3,size=10,prob=.25)
# P(X <= 3) in the above binomial distribution
> pnorm(12,mean=10,sd=2)
# P(X < 12) for X~N(mu = 10, sigma = 2)
> qnorm(.75,mean=10,sd=2)
# 3rd quartile of N(mu = 10, sigma = 2)
> qchisq(.10,df=8)
# 10th percentile of chi-squared(8)
> qt(.95,df=20) # 95th percentile of t(20)
Sampling Distributions….

>rnorm(100)
# simulate(generate) 100 standard normal RVs
> 2*pt(-2.43, df = 13)
# 2-tailed p-value for t distribution
> qf(0.01, 2, 7, lower.tail = FALSE)
# upper 1% point for an F(2, 7) distribution
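How the d, p, q and r functions relate to each other can be checked numerically. The sketch below reuses the N(10, 2) and Bin(10, 0.25) distributions from the examples; the seed value is arbitrary.

```r
p <- pnorm(12, mean = 10, sd = 2)       # P(X <= 12)
q <- qnorm(p, mean = 10, sd = 2)        # the quantile function inverts p: gives back 12

# For a discrete distribution the p-function is the cumulative sum of the d-function
cum_from_d <- sum(dbinom(0:3, size = 10, prob = 0.25))
cum_from_p <- pbinom(3, size = 10, prob = 0.25)

# lower.tail = FALSE gives the upper-tail probability P(X > 12)
upper <- pnorm(12, mean = 10, sd = 2, lower.tail = FALSE)

set.seed(1)                             # arbitrary seed, only for reproducibility
draws <- rnorm(5, mean = 10, sd = 2)    # five pseudo-random N(10, 2^2) values

print(c(p = p, q = q, upper = upper))
print(c(cum_from_d, cum_from_p))
```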
Sampling Distributions: the Birthday Paradox (BDP)
o Birthday Paradox (BDP): with 365 equally likely birthdays, only 23
persons are needed for a 50-50 chance that two or more of them
share a birthday.
> qbirthday(prob=0.5, classes=365, coincident=2) # number of people needed
> pbirthday(n, classes=365, coincident=2) # probability of a coincidence among n people
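As a quick check of the figure of 23 people, the sketch below calls both functions and compares the result with a direct calculation of 1 minus the probability that all 23 birthdays differ. The object names are illustrative.

```r
n_needed <- qbirthday(prob = 0.5, classes = 365, coincident = 2)  # smallest n with P >= 0.5
p23      <- pbirthday(23, classes = 365, coincident = 2)          # P(at least one shared birthday)

# Direct calculation: 1 - P(all 23 birthdays are different)
p_direct <- 1 - prod((365:343) / 365)

print(n_needed)   # 23
print(p23)        # about 0.507
print(p_direct)   # about 0.507
```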
 Arguments
 classes: How many distinct categories the people could fall into
 prob: The desired probability of coincidence
 n:The number of people
 coincident: The number of people to fall in the same category
Sampling Distributions….
Examining the distribution of a set of data
o Given a (univariate) set of data, we can examine its distribution
in a large number of ways. The simplest is to examine the
numbers.
o Two slightly different summaries are given by
o summary (a generic function used to produce summaries of data objects and of the
results of various model-fitting functions; it invokes particular
methods which depend on the class of the first argument)
> summary(var)
o fivenum (Tukey's five-number summary: minimum, lower hinge,
median, upper hinge, maximum)
> fivenum(var)
Sampling Distributions….
o We can also examine the data by displaying them through
the following graphs:
> stem(var) # stem-and-leaf plot of var
> hist(var) # default histogram of var
> boxplot(var) # box plot of var
> plot(ecdf(var)) # the empirical cumulative distribution function of var
> qqnorm(var) # Q-Q plot for normality of var
> qqline(var) # adds a reference line to the above Q-Q plot
> x <- rt(250, df = 5) # a random sample of size 250 from the t distribution with 5 df
Sampling Distributions….
 We can make a Q-Q plot against the generating distribution by
>qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
>qqline(x)
 Formally, R provides the Shapiro-Wilk test and the Kolmogorov-Smirnov test to examine whether the given data follow a normal distribution or not.
>shapiro.test(var) # Shapiro-Wilk test
>ks.test(var, "pnorm", mean = mean(var), sd = sqrt(var(var))) # Kolmogorov-Smirnov test
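To see how the Shapiro-Wilk test behaves, the following sketch applies it to two simulated samples, one normal and one clearly skewed. The data, seed and variable names are invented for illustration; they are not part of the course material.

```r
set.seed(42)                      # arbitrary seed, for reproducibility only
normal_sample <- rnorm(200, mean = 10, sd = 2)
skewed_sample <- rexp(200, rate = 1)

sw_normal <- shapiro.test(normal_sample)
sw_skewed <- shapiro.test(skewed_sample)

# A small p-value is evidence against normality
sw_normal$p.value
sw_skewed$p.value
```

Note that when the mean and standard deviation passed to ks.test() are estimated from the same data, its p-value is only approximate.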
Simulating the Sampling Distribution of the Mean
 Tests on means are built on the assumption that the sample mean $\bar{X}$ is based on n independent observations from a population with mean $\mu$ and variance $\sigma^2$. From linear combination theory, so long as the n observations are independent, $\bar{X}$ will have a mean of $\mu$ and a variance of $\sigma^2/n$.
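The statement about the mean and variance of the sample mean can be checked by simulation. The sketch below assumes a normal population with illustrative values for the sample size, mean and standard deviation; any other choice would do.

```r
set.seed(1)          # arbitrary seed, for reproducibility only
n     <- 25          # sample size (illustrative)
mu    <- 10          # population mean (illustrative)
sigma <- 2           # population standard deviation (illustrative)

# Draw 5000 samples of size n and record each sample mean
xbar <- replicate(5000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(xbar)           # should be close to mu
var(xbar)            # should be close to sigma^2 / n
sigma^2 / n          # theoretical variance of the sample mean: 0.16
```

A histogram of xbar (hist(xbar)) shows the simulated sampling distribution.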
One- and two-sample t-tests
 The main function that performs these sorts of tests is t.test(). Its syntax is:
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
Arguments:
 x, y: numeric vectors of data values. If y is not given, a one sample test is
performed.
 alternative: a character string specifying the alternative hypothesis, must
be one of `"two.sided"' (default), "greater" or "less". You can specify just
the initial letter.
 mu: a number indicating the true value of the mean (or difference in
means if you are performing a two sample test). Default is 0.
 paired: a logical indicating if you want the paired t-test (default is the
independent samples test if both x and y are given).
One- and two-sample t-tests…..
 var.equal: (for the independent samples test) a logical variable
indicating whether to treat the two variances as being
equal. If `TRUE', then the pooled variance is used to
estimate the variance. If ‘FALSE’ (default), then the
Welch suggestion for degrees of freedom is used.
 conf.level: confidence level (default is 95%) of the interval
estimate for the mean appropriate to the specified
alternative hypothesis.
 Note that from the above, t.test() not only performs the
hypothesis test but also calculates a confidence interval. However,
if the alternative is either a “greater than” or “less than”
hypothesis, a lower (in case of a greater than alternative) or upper
(less than) confidence bound is given.
One- and two-sample t-tests…..
Example: Test the hypothesis that the average content of containers of a certain lubricant is 10 liters if the contents of a random sample of 10 containers are 10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, and 9.8 liters. Use the 0.01 level of significance and assume that the distribution of contents is normal.
>x=c(10.2,9.7,10.1,10.3,10.1,9.8,9.9,10.4,10.3,9.8)
>t.test(x, mu = 10, conf.level = 0.99)
One- and two-sample t-tests…..
 The output of the above command will be:

One Sample t-test

data: x
t = 0.7717, df = 9, p-value = 0.46
alternative hypothesis: true mean is not
equal to 10
99 percent confidence interval:
9.807338 10.312662
sample estimates:
mean of x
10.06
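The printed output can also be used programmatically. As a sketch, the result of t.test() is stored in an object (called res here, an illustrative name) and its components are compared with the 0.01 significance level.

```r
x <- c(10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, 9.8)
res <- t.test(x, mu = 10, conf.level = 0.99)

alpha <- 0.01
res$statistic            # t statistic
res$p.value              # two-sided p-value
res$conf.int             # 99% confidence interval for the mean
res$p.value < alpha      # FALSE here: do not reject H0 at the 1% level
```

Since the p-value exceeds 0.01 and the interval contains 10, the data give no evidence that the mean content differs from 10 liters.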
One- and two-sample t-tests…..
Example: Consider the following sets of data on the latent heat of the
fusion of ice (cal/gm) from Rice.
Method A: 79.98 80.04 80.02 80.04 80.03 80.00 80.02
80.03 80.04 79.97 80.05 80.03 80.02
Method B: 80.02 79.94 79.98 79.97 79.97 80.03 79.95 79.97
 Box plots provide a simple graphical comparison of the two samples.
>A=c(79.98,80.04,80.02,80.04,80.03,80.03,80.04,79.97,80.05,80.03,80.02,80.00,80.02)
>B=c(80.02,79.94,79.98,79.97,79.97,80.03,79.95,79.97)
>boxplot(A,B)
One- and two-sample t-tests…..
The box plots indicate that the first group tends to give higher results than the second.
[Figure: side-by-side box plots of samples 1 (A) and 2 (B), vertical axis from 79.94 to 80.04]
One- and two-sample t-tests…..
 To test for the equality of the means of the two samples, we can use an unpaired
t-test by:
> t.test(A, B)
 This will give you the following output:
Welch Two Sample t-test

data: A and B
t = 3.2499, df = 12.027, p-value = 0.006939
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.01385526 0.07018320
sample estimates:
mean of x mean of y
80.02077 79.97875
This indicates a significant difference, assuming normality. By default the R function does not assume equality of variances in the two samples.
One- and two-sample t-tests…..
 We can use the F test to test for equality in the variances, provided
that the two samples are from normal populations.
> var.test(A, B)
F test to compare two variances
data: A and B
F = 0.5837, num df = 12, denom df = 7, p-value = 0.3938
alternative hypothesis: true ratio of variances is not equal
to 1
95 percent confidence interval:
0.1251097 2.1052687
sample estimates:
ratio of variances
0.5837405
which shows no evidence of a significant difference, and so we can
use the classical t-test that assumes equality of the variances.
One- and two-sample t-tests…..
>t.test(A, B, var.equal=TRUE)
Two Sample t-test
data: A and B
t = 3.4722, df = 19, p-value = 0.002551
alternative hypothesis: true difference in means is
not equal to 0
95 percent confidence interval:
0.01669058 0.06734788
sample estimates:
mean of x mean of y
80.02077 79.97875
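The two-step procedure used in this example (F test first, then the matching t-test) can be sketched in a few lines. The 0.05 cut-off for the F test is an assumption made for illustration; many analysts simply use the Welch test by default.

```r
A <- c(79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.04,
       79.97, 80.05, 80.03, 80.02, 80.00, 80.02)
B <- c(80.02, 79.94, 79.98, 79.97, 79.97, 80.03, 79.95, 79.97)

f_res <- var.test(A, B)

# Use the pooled (classical) test only if equal variances look plausible
equal_var <- f_res$p.value > 0.05
t_res <- t.test(A, B, var.equal = equal_var)

equal_var
t_res$method
t_res$statistic
```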
One- and two-sample t-tests…..
 Exercise: the recovery time (in days) is measured for 10 patients taking a new drug and for 10 different patients taking a placebo. We wish to test the hypothesis that the mean recovery time for patients taking the drug is less than for those taking a placebo (under an assumption of normality and equal population variances). The data are:
With drug: 15, 10, 13, 7, 9, 8, 21, 9, 14, 8
Placebo: 15, 14, 12, 8, 14, 7, 16, 10, 15, 12
One- and two-sample t-tests…..
Answer
> drug <- c(15, 10, 13, 7, 9, 8, 21, 9, 14, 8)
> plac <- c(15, 14, 12, 8, 14, 7, 16, 10, 15, 12)
> t.test(drug, plac, alternative = "less", var.equal = TRUE)
Two Sample t-test

data: drug and plac


t = -0.5331, df = 18, p-value = 0.3002
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 2.027436
sample estimates:
mean of x mean of y
11.4 12.3
One- and two-sample t-tests…..
 Example: an experiment was performed to determine if a new gasoline additive can increase the gas mileage of cars. In the experiment, six cars are selected and driven with and without the additive. The gas mileages (in miles per gallon, mpg) are given below.
Car                  1     2     3     4     5     6
mpg w/ additive:   24.6  18.9  27.3  25.2  22.0  30.9
mpg w/o additive:  23.8  17.7  26.6  25.1  21.6  29.6
One- and two-sample t-tests…..
 Since this is a paired design, we can test the claim using the paired t–test (under
an assumption of normality for mpg measurements). This is performed by:
> add <-c(24.6, 18.9, 27.3, 25.2, 22.0, 30.9)
>noadd <-c(23.8, 17.7, 26.6, 25.1, 21.6, 29.6)
>t.test(add, noadd, paired=T, alt = "greater")
Paired t-test

data: add and noadd


t = 3.9994, df = 5, p-value = 0.005165
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.3721225 Inf
sample estimates:
mean of the differences
0.75
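A paired t-test is the same as a one-sample t-test on the within-pair differences. The sketch below checks this for the mileage data; the object names paired and on_diff are illustrative.

```r
add   <- c(24.6, 18.9, 27.3, 25.2, 22.0, 30.9)
noadd <- c(23.8, 17.7, 26.6, 25.1, 21.6, 29.6)

paired  <- t.test(add, noadd, paired = TRUE, alternative = "greater")
on_diff <- t.test(add - noadd, mu = 0, alternative = "greater")

paired$statistic     # t statistic from the paired test
on_diff$statistic    # identical t statistic from the differences
mean(add - noadd)    # the estimated mean difference
```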
Quiz (10%)
1. Using R, answer the following questions (1.5 pts each).
a) Find P(X = 5) for X ~ Bin(n = 20, p = 0.5)
b) Find P(X < 3) for X ~ Bin(n = 8, p = 0.25)
c) Find P(X < 4) for X ~ N(mu = 10, sigma = 2)
d) Find the 1st quartile of N(mu = 12, sigma = 3)
e) Find the 90th percentile of t(15)
f) Test the hypothesis that the average mark of the students in a certain class is 75, if the marks of a random sample of 6 students are 80, 62, 55, 68, 82 and 70. Use the 0.05 level of significance and assume that the distribution of marks is normal.
4. Graphical Procedures
Introduction
 Graphs are important tools for exploring data in statistics and other fields.
 There are a number of graphs in statistics:
 Pie chart, histogram, scatter plot, box plot, stem-and-leaf plot, Q-Q plot, …
 R is a powerful tool for building graphics, from the simplest to the most complex ones.
 Graphics may be produced with the base or the lattice graphics system.
Plotting Commands
 There are a variety of plotting commands in R.
 plot()
 Automatically produces simple plots for vectors, functions or data frames
 Specifying labels:
 main – provides a title
 xlab – label for the x axis
 ylab – label for the y axis
 Specifying range limits
 ylim=c(ymin, ymax) # 2-element vector gives range for y axis
 xlim=c(xmin, xmax) # 2-element vector gives range for x axis
Plotting Commands…
 To add additional data to an existing plot use
> points(x,y)
> lines(x,y)
Different plotting commands
> pie() # plots a pie chart
> boxplot() # plots a box-and-whisker plot
> hist() # plots a histogram
> barplot() # plots a bar graph (chart)
> legend() # adds a legend to a figure
> stem() # prints a stem-and-leaf plot
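The commands above can be combined as in the following sketch, which draws one series with plot() and then adds a second one with lines() and points(). The data are hypothetical and chosen only to make the plot easy to read.

```r
# Hypothetical data, invented for illustration
x  <- 1:10
y  <- x^2
y2 <- 10 * x

plot(x, y, main = "Two series", xlab = "x", ylab = "value",
     xlim = c(0, 11), ylim = c(0, 110))   # first series as points
lines(x, y2, lty = 2)                      # add a second series as a dashed line
points(x, y2, pch = 2)                     # mark its values with triangles
legend("topleft", legend = c("x^2", "10x"), pch = c(1, 2), lty = c(NA, 2))
```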
Graphics Parameters
 text() – writes inside the plot region; it can be used to label data points.
 mtext() – writes in the margins; it can be used to add multi-line legends
 text() and mtext() can print mathematical expressions created with expression().
 mfrow or mfcol options (set with par()) take a 2-element vector as an argument
 The first value specifies the number of rows
 The second specifies the number of columns
 plot types: type="l" (lines), "p" (points), "b" (both)
 line types: lty= (takes integer values from 0 to 6; 1 is a solid line)
 plotting characters: pch= (takes values square (0); circle (1); triangle (2); plus (3); X (4); diamond (5) and inverted triangle (6), among others)
 plotting colors: col=1,2,3,…
Graphics Parameters
>x <- c(1.2,3.4,1.3,-2.1,5.6,2.3,3.2,2.4,2.1,1.8,1.7,2.2)
> y <- c(2.4,5.7,2.0,-3,13,5,6.2,4.8,4.2,3.5,3.7,5.2)
>plot(x,y)
Graphics Parameters
>x <- c(-2,-0.3,1.4,2.4,4.5)
>y <- c(5,-0.5,8,2,11)
>plot(x,y,type="l",col="blue",xlab="Advertise Change", ylab="Revenue Change", main="Financial Analysis")
Demonstration
Bar Graph: Summary data
Consider six clinics treating patients who are smokers, ex-smokers and non-smokers.
Perform the following operations to create the data.
> clinics <- matrix(c(30, 55, 60, 20, 45, 70, 50, 10, 20, 55, 70, 120, 27, 34, 22, 23, 14, 33), nrow = 6)
> dimnames(clinics) <- list(paste(c("clinic"), 1:6), c("smoker", "ex-smoker", "non-smoker"))
> clinics
         smoker ex-smoker non-smoker
clinic 1     30        50         27
clinic 2     55        10         34
clinic 3     60        20         22
clinic 4     20        55         23
clinic 5     45        70         14
clinic 6     70       120         33
Bar Chart:
 The following code plots a bar chart for the first clinic (clinic 1) with names on the x-axis (smoker, ex-smoker and non-smoker)
> barplot(clinics[1,],names=dimnames(clinics)[[2]], main="clinic 1")
Bar Chart:
 The following code plots a bar chart of the number of smokers in each clinic.
> barplot(clinics[,1],names=dimnames(clinics)[[1]], main="smoker group")
Bar Chart:
> barplot(t(clinics), names=dimnames(clinics)[[1]], main='matrix is transposed', sub='each row is one bar', col=1:10)
Bar Chart:
>barplot(clinics, names=dimnames(clinics)[[2]], sub='each column of the matrix is one bar', col=1:6)
Bar Chart:
>barplot(clinics, names=dimnames(clinics)[[2]], sub='each column of the matrix is one bar', col=1:6)
> legend(locator(1), dimnames(clinics)[[1]], col=1:6, lty=1, lwd=8) # click on the plot to place the legend
Bar Chart:
> par(mfrow=c(2,2))
>barplot(clinics[1,], names=dimnames(clinics)[[2]], main="clinic 1")
>barplot(clinics[,1], names=dimnames(clinics)[[1]], main="smoker group")
>barplot(t(clinics), names=dimnames(clinics)[[1]], main='matrix is transposed', sub='each row is one bar', col=1:10)
>barplot(clinics, names=dimnames(clinics)[[2]], sub='each column of the matrix is one bar', col=1:6)
> legend(locator(1), dimnames(clinics)[[1]], col=1:6, lty=1, lwd=8)
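legend(locator(1), ...) waits for a mouse click, so it cannot be used in a script that runs unattended. As a sketch of a non-interactive alternative, barplot() can draw the legend itself through its legend.text and args.legend arguments; the placement and title below are illustrative choices.

```r
clinics <- matrix(c(30, 55, 60, 20, 45, 70,
                    50, 10, 20, 55, 70, 120,
                    27, 34, 22, 23, 14, 33), nrow = 6)
dimnames(clinics) <- list(paste("clinic", 1:6),
                          c("smoker", "ex-smoker", "non-smoker"))

# One group of bars per smoking status, one bar per clinic within each group
barplot(clinics, beside = TRUE, col = 1:6,
        legend.text = rownames(clinics),
        args.legend = list(x = "topright", cex = 0.7),
        main = "Patients by smoking status and clinic")

colSums(clinics)   # totals per smoking status
```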
Pie Chart:
> par(mfrow=c(1,1))
> clinic.names <- dimnames(clinics)[[1]]
> smoker.names <- dimnames(clinics)[[2]]
> pie(clinics[,1], labels=clinic.names, col=20:25)
Histogram…
> set.seed(121343) # fixes the random numbers to be generated
> x <- rnorm(100)
#Default histogram
> hist(x)
[Figure: "Histogram of x"; x-axis: x, y-axis: Frequency]
Histogram…
#With shading
> hist(x, density=20)
Histogram…
#With approximately 20 bins (a single number given to breaks is only a suggestion)
> hist(x, density=20, breaks=20)
Histogram…
# Proportion, instead of frequency, also specifying y-axis
> hist(x, density=20, breaks=-3:3, ylim=c(0,.5), prob=TRUE)
Histogram …
Overlay normal curve with x-lab and ylim, colored normal curve
> m<-mean(x)
> std<-sqrt(var(x))
> hist(x, density=20, breaks=20, prob=TRUE, xlab="x-variable", ylim=c(0, 0.7), main="normal curve over histogram")
> curve(dnorm(x, mean=m, sd=std), col="darkblue", lwd=2, add=TRUE)
Histogram…
Overlay density curve with x-lab and ylim
> hist(x, density=20, breaks=20, prob=TRUE, xlab="x-variable", ylim=c(0, 0.8), main="Density curve over histogram")
> lines(density(x), col = "blue")
[Figure: "Density curve over histogram"; x-axis: x-variable, y-axis: Density]
Histogram…
hist(x) returns an object
 names(xh) will show all of its components
> xh<-hist(x)
> plot(xh, ylim=c(0, 40), col="lightgray", xlab="", main="Histogram of x")
> text(xh$mids, xh$counts+2, label=c(xh$counts))
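To inspect the histogram object without drawing anything, hist() can be called with plot = FALSE. The sketch below lists the components and checks two simple properties of the counts and densities.

```r
set.seed(121343)                 # same seed as in the slides
x  <- rnorm(100)
xh <- hist(x, plot = FALSE)      # compute the bins without drawing

class(xh)                        # "histogram"
names(xh)                        # breaks, counts, density, mids, ...
sum(xh$counts)                   # equals the number of observations
sum(xh$density * diff(xh$breaks))  # bar areas sum to 1
```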
Box Plot
Syntax
 boxplot(Cont. Variable)
 boxplot(Cont. Variable, xlab="write", boxwex=.4, col="darkblue")
 boxplot(Cont. Variable ~ categorical_variable)
Example:
> attach(Orange)
> boxplot(age)
Box Plot
> boxplot(age, ylab="age of a tree", col=4)
> f <- fivenum(age)
> text(rep(1.35,5), f, labels=c("min","Q1","median","Q3","max"), cex=0.8)
 rep(1.35,5) gives the horizontal location for the five labels
Box Plot…
> Tlab <- paste0("Tree", levels(Tree)) # Tree is an ordered factor with levels 3 < 1 < 5 < 2 < 4, so build the labels from its levels
> Treef <- factor(Tree, labels=Tlab)
> boxplot(age ~ Treef, xlab="Age by Tree")
Box Plot…
 If a horizontal boxplot is needed, run the following R code
> boxplot(age ~ Treef, horizontal=TRUE, xlab="Age by Tree")
Stem and Leaf plots
 For a univariate set of data, it is possible to examine its distribution in a number of ways, of which the stem-and-leaf plot is one.
Example:
> x<-c(42, 23, 43, 34, 49, 56, 31, 47, 61, 54, 46, 34, 26)
> stem(x)
The decimal point is 1 digit(s) to the right of the |

2 | 36
3 | 144
4 | 23679
5 | 46
6 | 1
 Notice that there are 5 categories for these 13 numbers, with stems for the 10s
digit and leaves for the 1s digit.
Stem and Leaf plots
 The stem-and-leaf plot can be scaled to have more stems by changing the scale option:
> stem(x, scale = 2)
The decimal point is 1 digit(s) to the right of the |

2 | 3
2 | 6
3 | 144
3 |
4 | 23
4 | 679
5 | 4
5 | 6
6 | 1
Quantile plots
qqnorm(x)
 plots the numeric vector x against the expected Normal order scores (a normal scores plot)
qqline(x)
 adds a straight line to such a plot
 this line passes through the first and third quartiles of the data and of the reference distribution
qqplot(x, y)
 plots the quantiles of x against those of y to compare their respective distributions.
Quantile plots…
> set.seed(123456)
> y <- rt(200, df = 5)
> x <-rt(300, df = 5)

> par(mfrow=c(1,2))
> qqnorm(y)
> qqline(y, col = 2)
> qqplot(y, x)
Quantile plots…
> mydata = c(2.4, 3.7, 2.1, 3, 1.6, 2.5, 2.9)
> myquant=qqnorm(mydata)
> qqline(mydata)
 myquant contains the theoretical quantiles (myquant$x) and the original data (myquant$y)
 If the observations in mydata come from a normal distribution, then the above plot of mydata versus the theoretical quantiles should give approximately a straight line.
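The contents of myquant can be examined directly. In this sketch qqnorm() is called with plot.it = FALSE so that only the computed values are returned, and the theoretical quantiles are compared with qnorm(ppoints(n)).

```r
mydata  <- c(2.4, 3.7, 2.1, 3, 1.6, 2.5, 2.9)
myquant <- qqnorm(mydata, plot.it = FALSE)   # compute without drawing

myquant$x    # theoretical normal quantiles
myquant$y    # the data, in their original order

# The theoretical quantiles are qnorm(ppoints(n)), matched to the ranks of the data
all.equal(sort(myquant$x), qnorm(ppoints(length(mydata))))
```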
5. Statistical Models In R
Regression
 Regression analysis is a statistical tool for investigating the relationship between variables.
 The investigator often wants to ascertain the causal effect of one variable on another
 The investigator also typically assesses the “statistical significance” of the estimated relationship (the degree of confidence that the true relationship is close to the estimated relationship).
 Regression may be simple or multiple, linear or non-linear.
 The template for a statistical model is a linear regression model with independent, homoscedastic errors:
$y = \beta_0 + \beta_1 x + \varepsilon$
Regression…
 General form:
 response ~ expression
 y ~ x
 y ~ 1 + x
 Both imply the same simple linear regression model of y on x. The first has an implicit intercept term, and the second an explicit one.
 y ~ 0 + x
 y ~ -1 + x
 y ~ x - 1
 Simple linear regression of y on x through the origin (that is, without an intercept term)
 The operator ~ is used to define a model formula in R.
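The effect of the intercept term in a formula can be seen by fitting the different forms to the same data. This sketch uses the built-in cars data set (stopping distance against speed), which is an assumption made for illustration and is not used elsewhere in these slides.

```r
fit_implicit <- lm(dist ~ speed, data = cars)       # implicit intercept
fit_explicit <- lm(dist ~ 1 + speed, data = cars)   # explicit intercept
fit_origin   <- lm(dist ~ 0 + speed, data = cars)   # through the origin

coef(fit_implicit)   # intercept and slope
coef(fit_explicit)   # the same two coefficients
coef(fit_origin)     # slope only
```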
Regression…
 Some useful functions for extracting output from a fitted model object:
 coef(object)
 Extract the regression coefficient (matrix).
 Long form: coefficients(object).
 deviance(object)
 Residual sum of squares, weighted if appropriate.
 formula(object)
 Extract the model formula.
 plot(object)
 Produce four plots, showing residuals, fitted values and some diagnostics.
 predict(object, newdata=data.frame)
 The data frame supplied must have variables specified with the same labels as the original. The value is a vector or matrix of predicted values corresponding to the determining variable values in data.frame.
Regression…
 print(object)
 Print a concise version of the object. Most often used implicitly.
 residuals(object)
 Extract the (matrix of) residuals, weighted as appropriate.
 Short form: resid(object).
 step(object)
 Select a suitable model by adding or dropping terms and preserving hierarchies.
 The model with the smallest value of AIC (Akaike’s An Information Criterion) discovered in the stepwise search is returned.
 summary(object)
 Print a comprehensive summary of the results of the regression analysis.
 vcov(object)
 Returns the variance-covariance matrix of the main parameters of a fitted model object.
Linear Model…
 confint(model_Name, level=0.95)
 Computes confidence intervals for the model parameters
 fitted(model_Name)
 Displays the fitted (predicted) values of the model
 influence(model_Name)
 Used for regression diagnostics
 anova(model_Name)
 Displays the analysis of variance table
 cor(Data.frame)
 A quick way to look for relationships between variables in a data frame
 pairs(Data.frame)
 Used to visualize these relationships
Linear Model…
 A fitted lm object is a list; its components can be accessed with $, for example model$rank:
 weights
 the weights used in the fit (only present when weights were supplied)
 rank
 the numeric rank of the fitted linear model
 call
 the matched call
 terms
 the terms object used
 contrasts
 the contrasts used (only where relevant)
Linear Model Using R
 lm() is the basic function for fitting ordinary simple/multiple linear models.
 fitted.model=lm(formula, data=, subset=)
Example: Consider the data on the effect of vitamin C on tooth growth in guinea pigs, called ToothGrowth in the R datasets package.
 First explore the data structure as follows:
> str(ToothGrowth)
'data.frame': 60 obs. of 3 variables:
$ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
$ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
Linear Model Using R
> summary(ToothGrowth)
      len        supp         dose
 Min.   : 4.20   OJ:30   Min.   :0.500
 1st Qu.:13.07   VC:30   1st Qu.:0.500
 Median :19.25           Median :1.000
 Mean   :18.81           Mean   :1.167
 3rd Qu.:25.27           3rd Qu.:2.000
 Max.   :33.90           Max.   :2.000
 Our (alternative) hypothesis here is that supplement and dose are good predictors of tooth length of guinea pigs. Let's start out by seeing what the data look like.
Linear Model Using R
> plot(ToothGrowth$dose,ToothGrowth$len)
> plot(len~dose,data=ToothGrowth)
Both commands display the same graph.
Linear Model Using R
 This looks like a pretty clear relationship. To fit a linear model, we can use the lm function.
> model1<-lm(len~supp+dose+supp*dose,data=ToothGrowth)
> model1
Call:
lm(formula = len ~ supp + dose + supp * dose, data =
ToothGrowth)
Coefficients:
(Intercept) suppVC dose suppVC:dose
11.550 -8.255 7.811 3.904
 A linear model with interaction is fitted and the results
are displayed as seen here.
Linear Model Using R
 R returns only the call and coefficients by default. You can
get a lot more information using the summary function.
> summary(model1)
Call:
lm(formula = len ~ supp + dose + supp * dose, data = ToothGrowth)
Residuals:
Min 1Q Median 3Q Max
-8.2264 -2.8463 0.0504 2.2893 7.9386
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.550 1.581 7.304 1.09e-09 ***
suppVC -8.255 2.236 -3.691 0.000507 ***
dose 7.811 1.195 6.534 2.03e-08 ***
suppVC:dose 3.904 1.691 2.309 0.024631 *
Linear Model Using R
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.083 on 56 degrees of freedom


Multiple R-squared: 0.7296, Adjusted R-squared: 0.7151
F-statistic: 50.36 on 3 and 56 DF, p-value: 6.521e-16
 Not surprisingly, there is a highly significant relationship here. As mentioned above, the output from the lm function is an object of class lm. These objects are lists that contain at least the following elements (you can find this list in the help file for lm):
 coefficients, residuals, fitted.values, rank, weights, df.residual, call, terms, contrasts
Linear Model Using R
 Let us look at some of the options
> class(model1)
[1] "lm"
> model1$coef # or
> model1$coefficients #or
> coef(model1)

(Intercept) suppVC dose suppVC:dose


11.550000 -8.255000 7.811429 3.904286
 To display the residuals

> resid(model1)
-4.95285714 2.34714286 … -2.22714286
-4.17285714
Linear Model Using R
 As mentioned above, the summary function is a generic
function—what it does and what it returns is dependent on the
class of its first argument. Here is a list of what's available from
the summary function for this model.
> names(summary(model1))
[1] "call" "terms" "residuals" "coefficients"
[5] "aliased" "sigma" "df" "r.squared"
[9] "adj.r.squared" "fstatistic" "cov.unscaled"
 The following code displays the component at a given index
> summary(model1)[[8]]
[1] 0.7295544
 The 8th component of the above list (names(summary(model1))) is r.squared.
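Indexing by position works, but it depends on the order of the components. A sketch of the same extraction by name, which is usually easier to read, is shown below; s is an illustrative name for the summary object.

```r
model1 <- lm(len ~ supp * dose, data = ToothGrowth)
s <- summary(model1)

s$r.squared                          # same value as summary(model1)[[8]]
s$coefficients["dose", "Estimate"]   # a single entry of the coefficient table
s$sigma                              # residual standard error
```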
Linear Model Using R
 Another function worth mentioning is anova.
 This function will calculate an analysis of variance table, which can be used
to evaluate the significance of the terms in single models or to compare two
nested models.
> anova(model1)
Analysis of Variance Table
Response: len
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.35 205.35 12.3170 0.0008936 ***
dose 1 2224.30 2224.30 133.4151 < 2.2e-16 ***
supp:dose 1 88.92 88.92 5.3335 0.0246314 *
Residuals 56 933.63 16.67
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Linear Model Using R
 Once we have fit a model in R, we can generate predicted
values using the predict function.
> predict(model1)
1 2 … 59 60
9.152857 9.152857 … 27.172857 27.172857
 The predict function can also give confidence and prediction
intervals.
> predict(model1,int="conf")
fit lwr upr
1 9.152857 6.966789 11.33893
2 9.152857 6.966789 11.33893
⁞ ⁞ ⁞ ⁞
59 27.172857 24.680356 29.66536
60 27.172857 24.680356 29.66536
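predict() can also be applied to new data, and it distinguishes confidence intervals for the mean response from (wider) prediction intervals for a single new observation. The new observations in this sketch are hypothetical values chosen for illustration.

```r
model1 <- lm(len ~ supp * dose, data = ToothGrowth)

# Hypothetical new observations: VC at dose 0.5 and OJ at dose 1.5
new_obs <- data.frame(supp = factor(c("VC", "OJ"), levels = c("OJ", "VC")),
                      dose = c(0.5, 1.5))

conf_int <- predict(model1, newdata = new_obs, interval = "confidence")
pred_int <- predict(model1, newdata = new_obs, interval = "prediction")

conf_int   # fit, lwr, upr for the mean tooth length
pred_int   # same fit, wider limits for an individual guinea pig
```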
Linear Model Using R
 A quick way to look for relationships between variables in a
data frame is with the cor function.
> cor(dataset)
 However this works when all the variables in the data set are
numeric.
Example: Formaldehyde data set in R (two numeric vars)
> cor(Formaldehyde)
carb optden
carb 1.0000000 0.9995232
optden 0.9995232 1.0000000
 If not, R will display the following message on the console
Error in cor(dataset) : 'x' must be numeric
 It is also possible to supply two numeric variables to get their correlation, e.g. cor(x, y).
Linear Model Using R
To visualize these relationships, we can use pairs.
> pairs(dataset)
Example:
> pairs(Formaldehyde)
Updating Models
 We can also use the update function to modify a fitted model, which is especially handy for dealing with large model formulas.
 For example, let us drop the supp term from the regression of len on supp and dose fitted to the ToothGrowth data set in R
> model<-lm(len~supp+dose, data=ToothGrowth)
 This can be done from model with the update function as follows
> model2<-update(model, ~ . - supp, data=ToothGrowth)
 This is the same as the following model
> model2<-lm(len~dose,data=ToothGrowth)
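The equivalence of the two ways of obtaining model2 can be verified directly, and the two nested models can then be compared with anova(). The name direct is introduced here only for the comparison.

```r
model  <- lm(len ~ supp + dose, data = ToothGrowth)
model2 <- update(model, . ~ . - supp)       # drop supp, keep the response
direct <- lm(len ~ dose, data = ToothGrowth)

formula(model2)                             # len ~ dose
all.equal(coef(model2), coef(direct))       # same fitted coefficients
anova(model2, model)                        # F test for the dropped term
```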
ANOVA Models
 To demonstrate ANOVA in R, let's start with a simple data set from the base packages called InsectSprays.
 The dataset has two variables (count and spray type)
 This dataset shows the effectiveness of six different insecticides.
> str(InsectSprays)
'data.frame': 72 obs. of 2 variables:
$ count: num 10 7 20 14 14 12 10 23 17 20 ...
$ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1
1 ...
ANOVA Models
 There are two options for specifying an ANOVA: lm and
aov.
 Really, aov is just a wrapper for calling up the lm
function.
 The main difference between aov and lm is in the format
of the output, although a traditional ANOVA table can be
produced by applying the anova function to an lm model.
 Since the measured variable is a count (number of insects),
it is not normally distributed.
 To make these data approximate a normal distribution, we
can use a square root transformation (Zar 1999)
ANOVA Models
> InsectSprays$sr.count<-sqrt(InsectSprays$count + 3/8)
> mod.1<-aov(sr.count ~ spray, data = InsectSprays)
> mod.1
Call:
aov(formula = sr.count ~ spray, data =
InsectSprays)
Terms:
spray Residuals
Sum of Squares 80.52844 22.80262
Deg. of Freedom 5 66
Residual standard error: 0.5877876
Estimated effects may be unbalanced
ANOVA Models
 To get more detailed output, we need to use the summary
function.
> summary(mod.1)
Df Sum Sq Mean Sq F value Pr(>F)
spray 5 80.528 16.106 46.616 < 2.2e-16 ***
Residuals 66 22.803 0.345
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 We can specify this same model using the lm function.


> modelAnova<-lm(sr.count ~ spray, data = InsectSprays)
> summary(modelAnova)
ANOVA Models
Call:
lm(formula = sr.count ~ spray, data = InsectSprays)
Residuals:
Min 1Q Median 3Q Max
-1.21011 -0.38480 -0.02005 0.38054 1.26503
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.8115 0.1697 22.463 < 2e-16 ***
sprayB 0.1143 0.2400 0.476 0.635
sprayC -2.3549 0.2400 -9.814 1.60e-14 ***
sprayD -1.5587 0.2400 -6.496 1.26e-08 ***
sprayE -1.8937 0.2400 -7.892 4.14e-11 ***
sprayF 0.2550 0.2400 1.062 0.292
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5878 on 66 degrees of freedom
Multiple R-squared: 0.7793, Adjusted R-squared: 0.7626
F-statistic: 46.62 on 5 and 66 DF, p-value: < 2.2e-16
ANOVA Models
 The above output gives you some insight into how R carries
out ANOVA—it uses linear regression with dummy variables.
 You can get more information on the variable coding with the
model.matrix function,
 It returns the X-matrix for the regression.
 From the output, it is clear that spray=A is the reference
level.

> model.matrix(object) # object is the fitted model
Example:
> model.matrix(modelAnova)
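To make the dummy-variable coding concrete, the sketch below rebuilds the model on a copy of InsectSprays (called sprays here, an illustrative name) and prints one row of the X-matrix for each of the first three sprays. It relies on the data being stored in blocks of 12 observations per spray.

```r
sprays <- InsectSprays                          # work on a copy
sprays$sr.count <- sqrt(sprays$count + 3/8)
fit <- lm(sr.count ~ spray, data = sprays)

X <- model.matrix(fit)
colnames(X)            # intercept plus one dummy for each of sprays B to F
X[c(1, 13, 25), ]      # first observation of sprays A, B and C
```

The row for spray A has zeros in all dummy columns, which is why A acts as the reference level.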
ANOVA Models
 R has many multiple range tests available, including Tukey‘s
HSD test in the base package, and many others in the
multcomp package.
 The Tukey test is applied using the TukeyHSD function.
 Note that this function requires aov output (lm will not work).
 Example:
> modelAnova<-aov(formula = sr.count ~ spray, data =
InsectSprays)
> TukeyHSD(modelAnova)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = sr.count ~ spray, data= InsectSprays)
$spray
ANOVA Models
diff lwr upr p adj
B-A 0.1143102 -0.59000479 0.8186251 0.9968245
C-A -2.3549121 -3.05922701 -1.6505971 0.0000000
D-A -1.5587119 -2.26302685 -0.8543969 0.0000002
E-A -1.8937416 -2.59805660 -1.1894267 0.0000000
F-A 0.2549576 -0.44935734 0.9592726 0.8943236
C-B -2.4692222 -3.17353717 -1.7649073 0.0000000
D-B -1.6730221 -2.37733701 -0.9687071 0.0000000
E-B -2.0080518 -2.71236676 -1.3037369 0.0000000
F-B 0.1406474 -0.56366751 0.8449624 0.9916328
D-C 0.7962002 0.09188521 1.5005151 0.0177353
E-C 0.4611704 -0.24314454 1.1654854 0.3983576
F-C 2.6098697 1.90555471 3.3141846 0.0000000
E-D -0.3350298 -1.03934471 0.3692852 0.7291427
F-D 1.8136695 1.10935455 2.5179845 0.0000000
F-E 2.1486993 1.44438430 2.8530142 0.0000000
ANOVA Models
 Interaction plots can be produced as follows
> interaction.plot(var1,var2,var3)
 Example: ToothGrowth data
>interaction.plot(ToothGrowth$dose,ToothGrowth$supp,ToothGrowth$len)
[Figure: interaction plot; x-axis: ToothGrowth$dose (0.5, 1, 2), y-axis: mean of ToothGrowth$len, one line per ToothGrowth$supp (VC, OJ)]
Generalized Linear Model
 Generalized linear models (GLMs) are a very flexible class of statistical models.
 In R, GLMs can be specified using the glm function.
 There are eight different error distributions (families) available in glm, including binomial and poisson, each with a default link function.
 Arguments and default argument values can be found in the help file for glm:
> glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart, mustart, offset, control = glm.control(...), model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, contrasts = NULL, ...)
Generalized Linear Model
 The glm function is very flexible.
 To demonstrate a very different application, let‘s read in the data on
insect numbers in response to insecticide spraying.
 This data set was analyzed above using an ANOVA, but recall that it
required a transformation of the response.
 In this case, we want to carry out an ANOVA, but the GLM lets us
use an appropriate distribution for count data: the Poisson
distribution.
 Note that the default link function for the Poisson distribution is log.
> modelglm<-glm(count~spray, family="poisson", data=InsectSprays)
> summary(modelglm)
Call:
glm(formula =count ~spray,family ="poisson", data =
InsectSprays)
Generalized Linear Model
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3852 -0.8876 -0.1482 0.6063 2.6922
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.67415 0.07581 35.274 < 2e-16 ***
sprayB 0.05588 0.10574 0.528 0.597
sprayC -1.94018 0.21389 -9.071 < 2e-16 ***
sprayD -1.08152 0.15065 -7.179 7.03e-13 ***
sprayE -1.42139 0.17192 -8.268 < 2e-16 ***
sprayF 0.13926 0.10367 1.343 0.179
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 409.041 on 71 degrees of freedom
Residual deviance: 98.329 on 66 degrees of freedom
AIC: 376.59
Number of Fisher Scoring iterations: 5
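Because the link is log, the coefficients are on the log scale. As a sketch of how to read them, exponentiating the intercept gives the fitted mean count for the reference spray A, and exponentiating the other coefficients gives ratios of means relative to A. The name group_means is illustrative.

```r
modelglm <- glm(count ~ spray, family = poisson, data = InsectSprays)

exp(coef(modelglm))   # intercept: mean for spray A; others: ratios to A

# Compare with the observed mean count in each spray group
group_means <- tapply(InsectSprays$count, InsectSprays$spray, mean)
group_means
group_means["B"] / group_means["A"]   # matches exp(coefficient of sprayB)
```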
Time Series Models
 An AR model expresses a time series as a linear function of its past values. The order of the AR model tells how many lagged past values are included.
 The simplest AR model is the first-order autoregressive, or AR(1), model
$y_t = \alpha_1 y_{t-1} + e_t$
 where $y_t$ is the mean-adjusted series at time t and $y_{t-1}$ is the series at the previous time point (for example, the previous year)
 $\alpha_1$ is the lag-1 autoregressive coefficient
 $e_t$ is the random noise (shock or residual).
Autoregressive Model
 In a multiple regression model, we forecast the variable of interest using a linear combination of predictors.
 In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable.
 The term autoregression indicates that it is a regression of the variable against itself.
Time Series Models
 The moving average (MA) model is a form of ARMA model in which the time series is regarded as a moving average (unevenly weighted) of a random shock series $e_t$
 The first-order moving average, or MA(1), model is given by
$y_t = e_t + c_1 e_{t-1}$
 where
 $e_t$ and $e_{t-1}$ are the residuals at times t and t-1
 $c_1$ is the first-order moving average coefficient
Time Series Models
 The autoregressive model includes lagged terms of the time series itself.
 The moving average model includes lagged terms of the noise or residuals.
 Including both types of lagged terms gives the autoregressive-moving-average, or ARMA, model.
 The order of the ARMA model is given in parentheses as ARMA(p,q),
 where p is the autoregressive order and q the moving average order. The simplest ARMA model is first-order autoregressive and first-order moving average, or ARMA(1,1):
$y_t = \alpha_1 y_{t-1} + e_t + c_1 e_{t-1}$
Time Series Models
 The following steps should be used in modeling:
 Identification: choose the model type (AR, MA, ARMA) and its order using the acf and pacf
 Estimation: estimate the coefficients, by least squares or by an iterative method
 Diagnostics: check that the errors are random and that the estimated coefficients are significantly different from zero
 Note: this is only a brief reminder of your time series knowledge. For more about time series modelling, please refer to your course, TIME SERIES ANALYSIS.
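The three steps can be sketched in R on a simulated series. This is a sketch under stated assumptions: the data are simulated from an AR(1) model with coefficient 0.6, and the seed, sample size and lag of 10 in the Ljung-Box test are arbitrary choices.

```r
set.seed(1)
x <- arima.sim(model = list(ar = 0.6), n = 200)   # simulated AR(1) series (assumed model)

# 1. Identification: sample acf and pacf (plot = FALSE returns the values only)
a <- acf(x, plot = FALSE)
p <- pacf(x, plot = FALSE)

# 2. Estimation: fit an AR(1) model, i.e. ARIMA(1,0,0)
fit <- arima(x, order = c(1, 0, 0))
est <- coef(fit)
se  <- sqrt(diag(fit$var.coef))
z   <- est / se               # |z| well above 2 suggests a coefficient differs from zero

# 3. Diagnostics: Ljung-Box test for randomness of the residuals
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)

print(est)
print(z)
print(lb$p.value)
```

For an AR(p) series the pacf typically cuts off after lag p while the acf decays gradually; for an MA(q) series the roles are reversed. A large p-value in the Ljung-Box test gives no evidence against random residuals.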
Time Series Models Using R
 Let us consider the following data (Age of Death of
Successive Kings of England)
> death<-
c(60,43,67,50,56,42,50,65,68,43,65,34,47,34,49,41,13,35,53,56,
16,43,69,59,48,59,86,55,68,51,33,49,67,77,81,67,71,81,68,70,77
,56)
 Once you have read/stored the time series data into R,
 the next step is to store the data in a time series object in R, so that you
can use R’s many functions for analysing time series data.
 To store the data in a time series object, we use the ts() function in R.
 For example, to store the data in the variable ‘death’ as a time series object
in R
> kingsdeath<-ts(death)
 Note: data can also be read into R using the scan() function, which assumes that your data for successive time points are in a simple text file with one column.
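A short sketch of scan() and ts() together follows. Here the `text =` argument stands in for a data file, and the monthly start date of January 1961 is an assumed example, not part of the kings data.

```r
# scan() normally reads a file; text = "..." is used here so the example is self-contained
ages <- scan(text = "60 43 67 50 56 42", quiet = TRUE)

k <- ts(ages)                                       # default: start = 1, frequency = 1
m <- ts(ages, start = c(1961, 1), frequency = 12)   # monthly series (assumed start date)

print(k)
print(m)
```

With `frequency = 12` the print method labels the observations by month, which is how R knows where each year begins.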
Time Series Models Using R
> kingsdeath
Time Series:
Start = 1
End = 42
Frequency = 1
 [1] 60 43 67 50 56 42 50 65 68 43 65 34 47 34 49 41 13 35 53 56 16 43 69 59 48
[26] 59 86 55 68 51 33 49 67 77 81 67 71 81 68 70 77 56
Time Series Models Using R
 Plotting Time Series
> plot.ts(kingsdeath)
Time Series Models Using R
 The following generates 300 values from a standard normal distribution, arranges them in a matrix with 100 rows and 3 columns, and converts the result to a monthly multivariate time series starting in January 1961.
> z <- ts(matrix(rnorm(300), 100, 3), start=c(1961, 1), frequency=12)
> plot(z)
[Figure: plot of z in three stacked panels (Series 1, Series 2, Series 3) against Time]
Time Series Models Using R
> plot(z, plot.type="single", lty=1:3)
Time Series Models Using R
 lag.plot() plots a time series against lagged versions of itself.
 This helps to visualize ‘auto-dependence’ even when the auto-correlations vanish.
 Syntax (the function signature from the R help page, not a command to type as is):
lag.plot(x, lags = 1, layout = NULL, set.lags = 1:lags, main = NULL, asp = 1, diag = TRUE, diag.col = "gray", type = "p", oma = NULL, ask = NULL, do.lines = (n <= 150), labels = do.lines, ...)
 Example:
> lag.plot(z, 1, diag.col = "forest green")
Time Series Models Using R
 ar() fits an autoregressive time series model to the data, by default selecting the order by AIC.
 Syntax
ar(x, aic = TRUE, order.max = NULL, method = c("yule-walker", "burg", "ols", "mle", "yw"), na.action, series, ...)
 Example: age at death of the kings of England
> kingsdeath.ar <- ar(kingsdeath, aic=TRUE, method="ols", order.max=2)
> kingsdeath.ar
Call:
ar(x = kingsdeath, aic = TRUE, order.max = 2, method = "ols")
Coefficients:
1
0.4006
Intercept: -0.108 (2.368)
Order selected 1 sigma^2 estimated as 229.9
Time Series Models Using R
 predict() produces forecasts from the fitted model; here, 10 steps ahead.
> predict(kingsdeath.ar, n.ahead=10)
$pred
Time Series:
Start = 43
End = 52
Frequency = 1
[1] 55.46385 55.24907 55.16303 55.12856 55.11476 55.10923 55.10701 55.10612
[9] 55.10577 55.10562
$se
Time Series:
Start = 43
End = 52
Frequency = 1
[1] 15.16372 16.33518 16.51544 16.54418 16.54879 16.54953 16.54965 16.54967
[9] 16.54967 16.54967
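The `$pred` and `$se` components can be combined into approximate 95% prediction intervals. The sketch below assumes roughly normal forecast errors and uses the usual multiplier 1.96; for the first forecast above this gives about 55.46 ± 1.96 × 15.16, i.e. roughly 25.7 to 85.2.

```r
death <- c(60,43,67,50,56,42,50,65,68,43,65,34,47,34,49,41,13,35,53,56,
           16,43,69,59,48,59,86,55,68,51,33,49,67,77,81,67,71,81,68,70,77,56)
kingsdeath <- ts(death)

fit <- ar(kingsdeath, aic = TRUE, method = "ols", order.max = 2)
fc  <- predict(fit, n.ahead = 10)

# Approximate 95% prediction limits (assumes approximately normal forecast errors)
lower <- fc$pred - 1.96 * fc$se
upper <- fc$pred + 1.96 * fc$se
print(cbind(forecast = fc$pred, lower = lower, upper = upper))
```

The intervals are wide relative to the forecasts, which reflects how weakly one king's age at death predicts the next.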
Time Series Models Using R
 arima() fits an ARIMA model to a univariate time series.
 Syntax
arima(x, order = c(0, 0, 0), seasonal = list(order = c(0, 0, 0), period = NA), xreg = NULL, include.mean = TRUE, transform.pars = TRUE, fixed = NULL, init = NULL, method = c("CSS-ML", "ML", "CSS"), n.cond, optim.control = list(), kappa = 1e6)
 Example: an ARIMA(3,0,0) model, i.e. AR(3), for the kings data
> kingsdeath.arima <- arima(kingsdeath, order = c(3, 0, 0))
> tsdiag(kingsdeath.arima)                 # diagnostic plots for the fitted model
> predict(kingsdeath.arima, n.ahead = 5)   # forecasts for the next 5 time points
6. Introduction to SAS
Introduction
 SAS is a comprehensive statistical software system which integrates utilities for storing, modifying, analyzing, and graphing data.
 To start SAS:
 via the START menu
 by double clicking on the SAS icon on the desktop
 To exit SAS:
 click on Exit in the File menu
 press Alt + F4
SAS User Interface
 Run button: click on this button to run SAS code
 Tool bar: similar to Windows applications
 Help button: click here for SAS help
 New Window button
 Save button
 Log Window
 Explorer Window
 Editor Window
 Results Window (not shown)
 Output Window (not shown)
SAS Editor Windows
 Access and edit existing SAS programs
 Write new SAS programs
 Submit SAS programs for execution
 Save SAS programs
SAS Log Window
 The Log Window contains a record of all commands submitted to SAS and shows errors in the commands.
 Information about the processing of the SAS program
 Includes any warnings or error messages
 Accumulated in the order the data and procedure steps are submitted
SAS Output Window
 Will automatically pop in front when there is output.
 Does not need to occupy screen space during program editing.
 Shows the reports generated by the SAS procedures
 Accumulates output in the order it is generated
What is SAS?
 Began as a statistical package
 Also allows users to:
 Store data
 Manipulate data
 Create reports in many formats: PDF, Excel, HTML, XML, RTF / Word, etc.
What is SAS?
 Also allows users to:
 Create graphs
 Create maps
 Send e-mails
 Create web applications
 Create iPhone Apps
 Access R
 Schedule regularly run reports
 etc.
 Examples created using SAS version 9.2
 Log: details about jobs you’ve run, including error messages
 Explorer Window: see libraries and SAS datasets
 Enhanced Editor: where you write your SAS code
 Results tab: list of previously run results
 Output Window: basic view of results and output (also known as the “Listing Destination”)
 Run (“The Running Man”): submits your SAS code
 You can submit the program as a whole or run only a highlighted portion of the code
 Chunks of code will be processed in the order you RUN them, not necessarily in the order you wrote them
 DATA steps: read and write data, manipulate data, perform calculations, etc.
 Every DATA step creates a new dataset
 PROC steps: produce reports, summarize data, sort data, create formats, generate graphs, etc.
 Every SAS program can have multiple DATA and PROC steps
 Global statements: remain in effect until changed by another global statement or until the SAS session ends
 Examples: Titles, Footnotes, Options, Macros, Libname
 Comment statements: ignored by SAS when code is run
 Used to write comments to yourself or others, or to comment out code you don’t want to run
*comment statement;
or
/* comment statement */
Programming in SAS
 SAS is generally forgiving
 Code can be written on a single line or on multiple lines
 Code and variable names are not case sensitive
 Even accepts some misspellings
 Semicolons are super important
 The colors of the Enhanced Editor help you spot mistakes
 Always check your log!!!
 There are often many ways to do the same thing
Where to learn more about SAS

Technical papers
SAS User Groups
Websites
Books
Classes

http://www.sascommunity.org/wiki/Learning_SAS_-_Resources_You_Can’t_Live_Without
Popular Websites
www.google.com
www.ats.ucla.edu/stat/sas (explains both the stats and the SAS code)
www.lexjansen.com
SAS Help and Documentation
SAS Language
 Composed of SAS statements
 All SAS statements end with a semicolon (;), except data lines
 Example SAS statement:
input name $ sex $ age height;
 A single-line comment begins with an asterisk (*) and ends with a semicolon (;)
 A block comment begins with /* and ends with */, and may span multiple lines
 Free-format: can use upper or lower case, i.e. SAS is case insensitive.
 Statements usually begin with an identifying keyword.
SAS Language
 The end of a DATA or PROC step is indicated by:
 RUN statement: most steps
 QUIT statement: some steps
 Beginning of another step (DATA or PROC statement)
SAS Data Sets
 SAS Data Set
 Specifically structured file that contains data values.
 File extension: .sas7bdat
 Rows and Columns format – similar to Excel
 Columns – variables in the table corresponding to fields of data
 Rows – single record or observation
 Two types of variables
 Character – contain any value (letters, numbers, symbols, etc.)
 Numeric – floating point numbers
 Located in SAS Data Libraries
SAS Data Libraries
 The default library in SAS is "work"
 User-defined libraries are also possible
 In general, data in SAS can be managed as follows
[Diagram: Raw data → SAS data set → PROC step → Result of analysis / Report]
SAS Data Libraries
 SAS files are stored in SAS data libraries.
 A SAS data library is a collection of SAS files that are recognized as a unit by SAS
 Temporary data sets
 exist only during the current session
 are ‘stored’ in a special library called WORK
 are automatically erased when you exit SAS
 have one-level names
SAS Data Libraries
 Permanent data sets
 are saved on the hard disk in a library of your choice
 are available for use later
 have two-level names
SAS Data Step Structures
 The SAS data set will have the name provided by you.
 The SAS data set will be saved in the temporary WORK library as name.sas7bdat
data name;
input var1 var2 var3;
datalines;
... data lines, one observation per line ...
;
run;
 Note: to tell SAS that a variable is character, add a dollar sign ($) after the name of the variable in the INPUT statement.
SAS Data Step Structures
 INPUT describes the data structure of the new SAS data set
 DATALINES or CARDS indicates that the data follow.
 Note that there are NO semicolons at the end of the data lines, but there is one semicolon after all the data.
 RUN; tells SAS to execute the program
 Example: create a data set called "student" for 3 students with their id, sex, CGPA and age.
SAS Data Step Structures
 The SAS data set is created using the following SAS code
DATA student;
INPUT id sex $ CGPA Age;
DATALINES;
1 M 3.5 21
2 M 2.3 18
3 F 3.2 19
;
RUN ;
SAS Procedures
 The general structure of a PROC step is:
proc name; /* name = the name of the procedure wanted */
…;
…;
run;
 Remark: by default, the procedure is executed on the most recently created SAS data set in the current SAS session. To use a different data set, add DATA=data_name to the PROC statement.
 Two basic steps in SAS programs:
 Data Steps
 Proc Steps
SAS Procedures
 DATA Steps
 begin with DATA statements
 read and modify data
 create a SAS data set
 PROC Steps
 begin with PROC statements
 perform specific analyses or functions
 produce results or reports
 SAS provides many procedures, each designed to perform a specific task.
Manipulating and Modifying a Data Set
 A DATA step can read one or more SAS data sets and use them to build a new SAS data set by
 modifying the initial data set
 adding new variables
 creating a subset
 The main statement for reading an existing SAS data set is:
SET name1 name2 ... ;
 By default the SET statement reads all observations and variables from the input data set into the output data set.
Manipulating and Modifying a Data Set
 Some useful data set options are
 KEEP=: variables to be included
 DROP=: variables to be dropped
 RENAME=: rename variables
 WHERE=: select observations
Manipulating and Modifying a Data Set
 KEEP= variables to be included
 Lists one or more variable names; only the specified variables are kept
 Useful when only a few variables are needed in the SAS data set
 If the KEEP= option is placed on the SET statement,
 SAS keeps the specified variables when it reads the input data set
 If the KEEP= option is placed on the DATA statement,
 SAS keeps the specified variables when it writes to the output data set.
Manipulating and Modifying a Data Set
 DROP= variables to be dropped
 Lists one or more variable names; only the specified variables are dropped
 Useful when only a few variables should be left out of the SAS data set
 If the DROP= option is placed on the SET statement,
 SAS drops the specified variables when it reads the input data set
 If the DROP= option is placed on the DATA statement,
 SAS drops the specified variables when it writes to the output data set
Manipulating and Modifying a Data Set
 RENAME= rename variables
 Important when there is a need to change some of the variable names in a SAS data set
 If the RENAME= option appears in the SET statement, then the new variable names take effect when the program data vector is created.
 All programming statements within the DATA step must refer to the new variable name.
Manipulating and Modifying a Data Set
 If the RENAME= option appears in the DATA statement, then the new variable names take effect when the data are written to the SAS data set
 All programming statements within the DATA step must refer to the old variable name.
Manipulating and Modifying a Data Set
 WHERE= selecting observations
 It allows one to select only those observations from a SAS data set that meet a certain condition
 If the WHERE= option is attached to the SET statement,
 SAS selects the observations that meet the condition as it reads the data
 If the WHERE= option is attached to the DATA statement,
 SAS selects the observations as it writes the data from the program data vector to the output data set.
Manipulating and Modifying SAS Data Sets
Example: Consider the following data (STUDENT Data Set)
Obs  id          Semester  gender  height  SGPA
1    RNS/001/05  1         M       1.72    2.50
2    RNS/001/05  2         M       1.72    3.10
3    RNS/001/05  3         M       1.72    2.85
4    RNS/002/05  1         F       1.68    3.20
5    RNS/004/05  2         M       1.69    2.42
6    RNS/004/05  1         M       1.69    2.35
7    RNS/003/05  1         F       1.74    2.56
Manipulating and Modifying SAS Data Sets
Example: the above data set can be created in SAS as follows
DATA Student;
/* :$10. lets id hold all 10 characters; a plain $ would keep only the first 8 */
INPUT id :$10. Semester gender $ height SGPA;
CARDS;
RNS/001/05 1 M 1.72 2.50
RNS/001/05 2 M 1.72 3.10
RNS/001/05 3 M 1.72 2.85
RNS/002/05 1 F 1.68 3.20
RNS/004/05 2 M 1.69 2.42
RNS/004/05 1 M 1.69 2.35
RNS/003/05 1 F 1.74 2.56
;
PROC PRINT;
RUN;
Manipulating and Modifying SAS Data Sets
DATA student2;
SET student;
WHERE gender='F';
PROC PRINT ;
RUN;

DATA student3;
SET student;
DROP height;
PROC PRINT;
RUN;

DATA student4;
SET student;
RENAME id=Identification;
PROC PRINT;
RUN;
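For comparison with the R part of the course, the same three operations can be sketched in R. The data frame below re-enters the STUDENT data by hand; the object names student2 to student4 simply mirror the SAS data set names.

```r
student <- data.frame(
  id       = c("RNS/001/05", "RNS/001/05", "RNS/001/05", "RNS/002/05",
               "RNS/004/05", "RNS/004/05", "RNS/003/05"),
  Semester = c(1, 2, 3, 1, 2, 1, 1),
  gender   = c("M", "M", "M", "F", "M", "M", "F"),
  height   = c(1.72, 1.72, 1.72, 1.68, 1.69, 1.69, 1.74),
  SGPA     = c(2.50, 3.10, 2.85, 3.20, 2.42, 2.35, 2.56),
  stringsAsFactors = FALSE
)

student2 <- subset(student, gender == "F")        # like WHERE gender='F'
student3 <- subset(student, select = -height)     # like DROP height
student4 <- student                               # like RENAME id=Identification
names(student4)[names(student4) == "id"] <- "Identification"

print(student2)
print(names(student3))
print(names(student4))
```

As in SAS, each operation leaves the original data set untouched and writes its result to a new object.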
Combining SAS Data Sets
 Concatenating: appending two or more data sets, one after the other
[Diagram: Data Set 1 and Data Set 2 combined into a single data set]
 Merging: SAS MERGE allows the programmer to combine data from multiple datasets.
 Each observation from dataset one is combined with a corresponding observation in dataset two (and dataset three, etc.).
 Which observations and which data fields from the source datasets are included in the resulting dataset is determined by the detailed “instructions” provided by the programmer.
Combining SAS Data Sets
 The default action taken by SAS when the code requests a merge between two datasets is to simply combine the observations one by one in the order that they appear in the datasets.
 Merging can be:
 One-to-one
 One-to-many
 Many-to-many
Combining SAS Data Sets
 The SAS statements for all three types of match merge are identical and take the following form:
DATA new-data-set;
MERGE data-set-1 data-set-2 data-set-3 …;
BY by-variable(s); /* the variable(s) that control which observations to match */
RUN;
 Note: The datasets to be merged MUST be sorted by the merging variable(s), for example with PROC SORT.
One-to-One Merging
Dataset1             Dataset2
ID        Credit     ID        Course
NS/01/06  3          NS/01/06  Stat2022
NS/02/06  2          NS/02/06  Stat3133
NS/03/06  4          NS/03/06  Stat3111
DATA Dataset3;
MERGE Dataset1 Dataset2;
BY ID;
RUN;
Dataset3
ID        Credit  Course
NS/01/06  3       Stat2022
NS/02/06  2       Stat3133
NS/03/06  4       Stat3111
One-to-Many Merging
Dataset4
ID        Credit
NS/01/06  3
NS/02/06  2
NS/03/06  4
Dataset5
ID        Course
NS/01/06  Stat2022
NS/01/06  Stat2013
NS/02/06  Stat3133
NS/03/06  Stat3101
NS/03/06  Stat3111
DATA Dataset6;
MERGE Dataset4 Dataset5;
BY ID;
RUN;
Dataset6
ID        Credit  Course
NS/01/06  3       Stat2022
NS/01/06  3       Stat2013
NS/02/06  2       Stat3133
NS/03/06  4       Stat3101
NS/03/06  4       Stat3111
Many-to-Many Merging
Dataset7
ID        Y
NS/01/06  A1
NS/01/06  A2
NS/02/06  B1
NS/02/06  B2
Dataset8
ID        Z
NS/01/06  AA1
NS/01/06  AA2
NS/01/06  AA3
NS/02/06  BB1
NS/02/06  BB2
DATA Dataset9;
MERGE Dataset7 Dataset8;
BY ID;
RUN;
Dataset9
ID        Y   Z
NS/01/06  A1  AA1
NS/01/06  A2  AA2
NS/01/06  A2  AA3
NS/02/06  B1  BB1
NS/02/06  B2  BB2
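Note that Dataset9 has 5 rows: within each ID, the SAS MERGE statement pairs observations in order and reuses the last observation of the shorter group. R's merge() behaves differently in the many-to-many case, as the following sketch shows. The data frame names d7 and d8 are illustrative stand-ins for Dataset7 and Dataset8.

```r
d7 <- data.frame(ID = c("NS/01/06", "NS/01/06", "NS/02/06", "NS/02/06"),
                 Y  = c("A1", "A2", "B1", "B2"))
d8 <- data.frame(ID = c("NS/01/06", "NS/01/06", "NS/01/06", "NS/02/06", "NS/02/06"),
                 Z  = c("AA1", "AA2", "AA3", "BB1", "BB2"))

# merge() forms every pairing of rows that share the same ID
m <- merge(d7, d8, by = "ID")
print(m)
print(nrow(m))        # 2*3 pairings for NS/01/06 plus 2*2 for NS/02/06
```

So the same input gives 10 rows in R against 5 in the SAS result above. Neither is wrong, but a many-to-many merge should always be checked against what the analysis actually needs.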
SAS Expressions
 A SAS operator is a symbol that represents:
 a comparison, arithmetic calculation, or logical operation;
 a SAS function; or grouping parentheses.
 SAS uses two major kinds of operators:
 prefix operators
 infix operators
 A prefix operator is an operator that is applied to the variable, constant, function, or parenthetic expression that immediately follows it.
SAS Expressions
 The plus sign (+) and minus sign (-) can be used as prefix operators.
 The word NOT and its equivalent symbols are also prefix operators.
 The following are examples of prefix operators used with variables, constants, functions, and parenthetic expressions:
 +y
 -25
 -cos(angle1)
 +(x*y)
SAS Expressions
 An infix operator applies to the operands on each side of it, for example, 6<8. Infix operators include the following:
 arithmetic
 comparison
 logical, or Boolean
 minimum
 maximum
 concatenation.
 When used to perform arithmetic operations, the plus and minus signs are infix operators.
SAS Expressions

 Arithmetic operators
Symbol  Definition
+       addition
-       subtraction
*       multiplication
/       division
**      exponentiation
SAS Expressions
 Comparison operators
Symbol     Mnemonic equivalent  Definition
=          eq                   equal to
^=         ne                   not equal to
>          gt                   greater than
<          lt                   less than
>= or =>   ge                   greater than or equal to
<= or =<   le                   less than or equal to
SAS Expressions
 Logical operators
Symbol   Mnemonic equivalent  Definition
&        and                  true if both sides are true
! or |   or                   true if either side is true
^ or ~   not                  true if the quantity following NOT is false
Note: mnemonic operators are operators written with letters (eq, ne, and, or, not, ...) instead of symbols.
SAS Expressions
 You can use either form, symbolic or mnemonic, in SAS.
 Comparison operators are used to compare two quantities.
 Logical operators are frequently used in if-then statements.
 The result of both types of operators is either true (1) if the relationship holds, or false (0) if it does not.
SAS Procedures for Data Analysis
The PROC CONTENTS Procedure
 The CONTENTS procedure shows the contents of a SAS data set and prints the directory of the SAS library.
 It provides access to descriptive information about datasets in printed form.
 The basic syntax is:
PROC CONTENTS DATA=dataset_Name <options>;
RUN;
 To see what it displays, run the following code (without any option):
PROC CONTENTS DATA=sashelp.cars;
RUN;
The PROC TABULATE Procedure
 The simplest possible table in TABULATE has to have three things:
 a PROC TABULATE statement,
 a TABLE statement,
 and a CLASS or VAR statement.
 The syntax for PROC TABULATE is:
PROC TABULATE DATA=datasetname;
CLASS classification variables;
VAR analysis variables;
TABLE page dimension,
row dimension,
column dimension / <options>;
RUN;
The PROC TABULATE Procedure
 Classification variables:
 are used to identify categorical groups on which calculations are performed.
 They are the variables that make up the rows and columns of the table. They may be either character or numeric.
 If numeric, generally only a small number of distinct values are permitted.
 Analysis variables:
 are numeric variables that are used to compute statistics that are reported in the body of the table.
The PROC TABULATE Procedure
 The CLASS statement is used to specify any categorical variables that will be used for grouping purposes in the analysis.
 A CLASS statement is required whenever the table uses classification variables.
 The VAR statement is used to list any variables that will be used for computing the statistics that are to appear in the body of the table.
 Frequency counts can be computed without an analysis variable.
 However, for most other statistics, the VAR statement is required.
The PROC TABULATE Procedure
 The TABLE statement is used to define both the arrangement of the rows and columns of the table, as well as the requests for any summary statistics.
 In a TABLE statement, the comma is a very important symbol, because it separates the dimensions of the table.
 If two commas are specified, then the table has three dimensions, and the order is pages, rows, and columns.
 If only one comma is specified, then the table has two dimensions, and the order is rows, columns.
 With no comma, the table’s only dimension is the column dimension, and the table has only one row.
The PROC TABULATE Procedure
 An asterisk (*) can be used to cross the classification variables; that is, to arrange them in a nested manner, according to the order listed (top, middle, and lower).
 A blank space is used to concatenate two classification variables (which will appear in the table top-to-bottom for row headings, left-to-right for column headings).
 Parentheses ( ) are used to group the elements of an expression, and to associate an adjacent operator with each concatenated element inside the parentheses.
The PROC TABULATE Procedure
 Here are a few simple examples:
TABLE var1, var2;
 This generates a two-dimensional table in which the row dimension is the values of var1, and the column dimension is the values of var2.
TABLE var1, var2 var3;
 This results in a two-dimensional table in which the row dimension is the values of var1, and the column dimension is the side-by-side concatenation of the values of var2 and var3.
TABLE var1, var2*var3;
 This generates a two-dimensional table in which the row dimension is the values of var1, and the column dimension is a hierarchical arrangement of var2 and var3: the values of var2 form the top columns, and under each of them the values of var3 form the lower columns.
The PROC TABULATE Procedure
 Besides specifying the dimensions, the TABLE statement also identifies which summary statistics should be produced, and for which analysis variables. Each statistic is identified by a keyword.
 N = the number of observations, the frequency count
 MIN = the smallest value
 MAX = the largest value
 MEAN = the arithmetic mean, or the average value
 STD = the standard deviation
 VAR = the variance
 MEDIAN = the middle (50th percentile) value
 SKEWNESS = a measure of the asymmetry of the distribution of values
 SUM = the sum of the values
 PCTN = the percentage that one frequency is of another frequency.
The PROC TABULATE Procedure Example
 The following SAS code displays four different tables.
PROC TABULATE DATA = sashelp.cars;
CLASS type Origin DriveTrain;
TABLE type, Origin*DriveTrain;
TABLE type*Origin*DriveTrain;
TABLE type, Origin*DriveTrain*PCTN;
TABLE type*PCTN, Origin*DriveTrain;
RUN;
 Run this code in SAS and have a look at the four different tables displayed.
The PROC TABULATE Procedure Surprise
DATA Ztable;
  DO row = 0.0 TO 3.4 BY 0.1;
    DO column = 0.00 TO 0.09 BY 0.01;
      z = row + column;
      prob = PROBNORM(z);
      OUTPUT;
    END;
  END;
RUN;
PROC TABULATE DATA = Ztable;
  CLASS row column;
  VAR prob;
  TABLE row, column*prob=''*sum=''*f=5.4;
  LABEL row = 'Z' column = 'P(Z<z)';
RUN;
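The same standard normal table can be cross-checked in R, where pnorm() plays the role of PROBNORM. This is a sketch for comparison; the object names are illustrative.

```r
row    <- seq(0, 3.4, by = 0.1)       # first decimal of z
column <- seq(0, 0.09, by = 0.01)     # second decimal of z

# P(Z < row + column) for every row/column combination
ztab <- outer(row, column, function(r, cc) pnorm(r + cc))
dimnames(ztab) <- list(Z = format(row, nsmall = 1), format(column, nsmall = 2))

print(round(ztab, 4))
```

For example, the entry in row 1.9 and column 0.06 is P(Z < 1.96), which rounds to 0.9750.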
The PROC TABULATE Procedure Surprise
Standard Normal Distribution
Z 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0 0.5 0.504 0.508 0.512 0.516 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.591 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.648 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.67 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.695 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.719 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.758 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.791 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.834 0.8365 0.8389
1 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.877 0.879 0.881 0.883
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.898 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.937 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.975 0.9756 0.9761 0.9767
2 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.983 0.9834 0.9838 0.9842 0.9846 0.985 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.989
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.992 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.994 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.996 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.997 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.998 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.999 0.999
3.1 0.999 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
The PROC FREQ Procedure
 PROC FREQ is a procedure that is used to give descriptive statistics about a particular data set
 Produces one-way to n-way frequency and crosstabulation tables.
 It counts! It answers the question "how many?"
 Descriptive use: display data (error checks)
 Statistical use: analyze categorical data
 It is fast and easy
The PROC FREQ Procedure
 Basic syntax for a two-way table:
PROC FREQ DATA=dataset;
BY by-variable(s); /* optional; the data must already be sorted by these variables */
TABLES variable1*variable2;
WEIGHT weight-variable; /* optional; a variable holding the count for each row */
RUN;
 PROC FREQ is an extremely useful statistical tool in the SAS language.
 PROC FREQ outputs frequencies, crosstabulation tables, various measures and tests of association between variables (such as the chi-square test), and stratified analyses.
Example PROC FREQ Procedure
 Consider the Framingham data set called “Heart” in the “sashelp” library.
PROC FREQ DATA=sashelp.Heart;
TABLES BP_Status*Weight_Status/chisq agree measures;
RUN;
 Type this code in SAS and have a look at the output.
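A rough R counterpart of a two-way PROC FREQ table with the CHISQ option is sketched below. The counts are hypothetical (the sashelp.Heart data are not available in R), and the variable names bp and weight are illustrative.

```r
# Hypothetical records: 100 people classified by blood pressure and weight status
d <- data.frame(
  bp     = rep(c("High", "High", "Normal", "Normal"), times = c(30, 10, 20, 40)),
  weight = rep(c("Overweight", "Normal", "Overweight", "Normal"), times = c(30, 10, 20, 40))
)

tab <- table(d$bp, d$weight)                 # two-way frequency table
print(tab)
print(prop.table(tab, margin = 1))           # row proportions

test <- chisq.test(tab, correct = FALSE)     # Pearson chi-square test of independence
print(test)
```

With these counts the expected frequencies under independence are 20, 20, 30 and 30, so the chi-square statistic is 100/20 + 100/20 + 100/30 + 100/30 ≈ 16.67 on 1 degree of freedom.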
The PROC MEANS Procedure
 PROC MEANS produces descriptive statistics
 (means, standard deviation, minimum, maximum, etc.) for numeric variables in a set of data.
 PROC MEANS can be used for
 Describing continuous data where the average has meaning
 Describing the means across groups
 Searching for possible outliers or incorrectly coded values
 Performing a single sample t-test
The PROC MEANS Procedure …
 The syntax of the PROC MEANS statement is:
PROC MEANS DATA=dataset;
VAR var1 var2 var3 … varn;
RUN;
 To compute the statistics separately for each level of one or more factors:
PROC MEANS DATA=dataset;
CLASS factor1 factor2;
VAR var1 var2 var3;
RUN;
The PROC MEANS Procedure …
 Statistics that may be requested are listed below. The default statistics are N, MEAN, STD, MIN and MAX.
N - number of observations
NMISS - number of missing observations
MEAN - arithmetic average
STD - standard deviation
MIN - minimum (smallest)
MAX - maximum (largest)
SUM - sum of observations
RANGE - range
VAR - variance
USS - uncorrected sum of squares
CSS - corrected sum of squares
STDERR - standard error
T - Student’s t value for testing H0: μ = 0
PRT - p-value associated with the t-test above
SUMWGT - sum of the WEIGHT variable values
Example PROC MEANS Procedure …
 Consider the Framingham data set called “Heart” in the “sashelp” library. We want descriptive statistics for diastolic and systolic blood pressures by sex.
PROC MEANS DATA=sashelp.Heart sum t var range nmiss ;
CLASS Sex;
VAR Diastolic Systolic;
RUN;
Type the code and have a look at the output.
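For comparison, group-wise descriptive statistics in R can be sketched with aggregate(). The built-in mtcars data set stands in for sashelp.Heart here, with cyl as the grouping variable and mpg and hp as the analysis variables; these choices are illustrative only.

```r
# n, mean and standard deviation of mpg and hp for each value of cyl
res <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars,
                 FUN = function(v) c(n = length(v), mean = mean(v), sd = sd(v)))
print(res)

# One-sample t-test of H0: mean = 0, the analogue of the T and PRT statistics
tt <- t.test(mtcars$mpg, mu = 0)
print(tt$statistic)
print(tt$p.value)
```

The formula plays the role of the VAR and CLASS statements: variables on the left of `~` are analysed, variables on the right define the groups.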
The PROC SUMMARY Procedure
 Computes descriptive statistics on numeric variables in a SAS data set and outputs the results to a new SAS data set.
 General syntax for PROC SUMMARY
PROC SUMMARY DATA = dataset;
CLASS categorical variables;
VAR continuous variables;
OUTPUT OUT=NEWDATASET n=name1 mean=name2 std=name3;
RUN;
The PROC SUMMARY Procedure
 CLASS: identifies the variables whose levels define the sub-groups for the analysis.
 VAR: identifies the variables to be analysed and the order of the results in the output dataset.
 OUTPUT: writes the results to a dataset. This is also the place where the statistics can be selected and named.
 If no statistics are named, the default statistics N, MEAN, STD, MIN and MAX are produced.
Example PROC SUMMARY Procedure
 Consider the Framingham data set called “Heart” in the “sashelp” library. We want to save summary statistics of diastolic blood pressure by sex as a dataset.
Example
PROC SUMMARY DATA = sashelp.Heart;
CLASS sex;
VAR diastolic;
OUTPUT OUT=diastolic n=count mean=meandiastolic std=stddevdiastolic;
RUN;
/* PROC SUMMARY prints nothing by default, so print the new data set */
PROC PRINT DATA=diastolic;
RUN;
Type the code and have a look at the output.
The PROC CORR procedure
 Computes correlation coefficients between variables.
 One way to test whether two variables are linearly related is by finding the correlation between them and testing the hypotheses.
 General syntax for PROC CORR
PROC CORR DATA=dataset;
VAR var1 var2 var3 … varn;
RUN;
The PROC CORR procedure
 Consider the Framingham data set called “Heart” in the “sashelp” library. We want to compute the correlations among Height, Weight, and diastolic and systolic blood pressures.
PROC CORR DATA=sashelp.Heart;
VAR Height Weight Diastolic Systolic;
RUN;
 Type the code and have a look at the output.
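In R, a correlation matrix and the test of a single correlation can be sketched as follows. The built-in mtcars data set and the variables mpg, hp and wt are stand-ins chosen for illustration, since sashelp.Heart is not available in R.

```r
vars <- mtcars[, c("mpg", "hp", "wt")]

r <- cor(vars)                          # Pearson correlation matrix
print(round(r, 3))

# Test H0: correlation between mpg and wt is zero
ct <- cor.test(mtcars$mpg, mtcars$wt)
print(ct$estimate)
print(ct$p.value)
```

Like PROC CORR, this reports both the coefficient and a p-value, but cor.test() handles one pair of variables at a time.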
The PROC REG procedure
 Examine relationships between variables
 Estimate parameters and their standard errors by OLS
 Calculate predicted values
 Evaluate the fit or lack of fit of a model
 Test hypotheses
 General Syntax for PROC REG
PROC REG DATA=dataset;
MODEL response = predictor1 predictor2 . . . predictork;
RUN;
The PROC REG procedure
 A more complete PROC REG syntax is:
PROC REG <options>;
MODEL response = effects </options>;
PLOT yvariable*xvariable = 'symbol';
BY varlist;
OUTPUT <OUT=SAS data set> <output statistic list>;
RUN;
The PROC REG procedure
 PROC REG statement options:
 DATA=SAS-data-set: names the input data set
 OUTEST=SAS-data-set: creates a data set with the parameter estimates
 SIMPLE: prints simple statistics
 The MODEL statement
 model response=<effects></options>;
 required
 variables must be numeric
 many options
 more than one MODEL statement can be specified
 The PLOT statement
 plot yvariable*xvariable <=symbol> </options>;
 produces scatter plots, with yvariable on the vertical axis and xvariable on the horizontal axis
 Some statistics available for plotting:

 P. predicted values
 R. residuals
 L95. lower 95% CI bound for individual prediction
 U95. upper 95% CI bound for individual prediction
 L95M. lower 95% CI bound for mean of dependent variable
 U95M. upper 95% CI bound for mean of dependent variable

 the output statement


 output <OUT=SAS data set> keywords=names;
 creates SAS data set
 all original variables included
 keyword=names specifies the statistics to include
 Example:

output out=pvals p=pred r=resid;


Example PROC REG procedure
 Consider the Framingham data set called “Heart” in the “sashelp” library.
We want to fit a multiple linear regression model with
Cholesterol level as the response and Weight, diastolic and systolic
blood pressures as the independent variables.

PROC REG DATA=sashelp.Heart simple outest=coeff ;


MODEL cholesterol= weight diastolic systolic/influence r p lackfit;
PLOT residual.*predicted. / nostat nomodel;
OUTPUT out=result r=residual p=predicted cookd=cookdistance h=hat
dffits=DFFITS;
RUN;

Type it in SAS and go through the output


SAS Procedures for Graphing
 Graphical representations make it easy to understand and interpret data at a glance.

 They also make it easier to compare several groups or variables.

 Moreover, they make data easier to recall.

 SAS provides a wide range of graphical features to produce high-quality presentation graphics.


The PROC PLOT Procedure
 The basic syntax for PLOT is:

PROC PLOT DATA=dataset;


TITLE 'your graph title here';
PLOT var1*var2;
RUN;
DATA one;
INPUT name $ sex $ age height weight;
CARDS;
john m 12 590 195
james m 12 573 130
alfred m 14 590 125
William m 15 565 120
jane f 12 598 145
louise f 12 563 170
barbara f 13 553 180
mary f 15 565 120
Alice f 13 545 140
;
RUN;
PROC PLOT DATA=one;
PLOT height*weight;
TITLE "Scatter plot of height vs weight";
RUN;

PROC PLOT DATA=one ;


PLOT height*weight='*';
TITLE "Scatter plot of height vs weight";
RUN;

PROC PLOT DATA=one ;


BY sex; *the data must be sorted by sex;
PLOT age*weight='*';
RUN;
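
A BY statement requires the input data set to be sorted by the BY variable. As entered, data set one lists the m rows before the f rows, so it is not in ascending order of sex. A minimal sketch of sorting it first (the output name one_sorted is an assumption, chosen so that the original data set is left unchanged):

```sas
* Sort a copy of the data by sex so that BY-group processing works;
PROC SORT DATA=one OUT=one_sorted;
  BY sex;
RUN;

* One plot is produced for each value of sex;
PROC PLOT DATA=one_sorted;
  BY sex;
  PLOT age*weight='*';
RUN;
```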

PROC PLOT DATA=one ;


PLOT weight*name='*' ;
PLOT height*name='0';
RUN;
The PROC GPLOT Procedure

 Used for plotting graphs, like PLOT.

 GPLOT creates a better-looking graph .

 GPLOT also creates the plot in the separate Graph window in SAS,

as opposed to PLOT, which creates the plot in the Output window.


PROC GPLOT DATA=one ;
BY sex; *make sure the dataset is sorted by sex;
PLOT age*weight; *in GPLOT the plotting symbol is set with a SYMBOL statement;
RUN;

Type it in SAS and go through the output
The PROC CHART Procedure
 It produces vertical and horizontal bar charts, block charts,

pie charts and star charts.


 The charted variable can be numeric or character.

PROC CHART DATA=datasetname;


BY classvar;
VBAR plotvar / TYPE=mean SUMVAR=var; *with SUMVAR= the TYPE= value should be SUM or MEAN;
RUN;

PROC CHART DATA=datasetname;


BY classvar;
HBAR plotvar / TYPE=mean SUMVAR=var; *with SUMVAR= the TYPE= value should be SUM or MEAN;
RUN;
 Example (vertical and horizontal bar charts, from the cars data in the

sashelp library)

PROC CHART DATA=sashelp.cars;


BY Origin;
VBAR Make/ TYPE=percent;
RUN;

PROC CHART DATA=sashelp.cars;


BY Origin;
HBAR Make;
RUN;
Type it in SAS and go through the output
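
The BY Origin statements above only work when the data are sorted by Origin, and sashelp.cars does not appear to be stored in that order. A sketch of one way to handle this, assuming a work data set named cars_sorted (a hypothetical name) and charting Type instead of Make to keep the number of bars small:

```sas
* Sort a copy of the cars data by Origin (the sashelp library is normally read-only);
PROC SORT DATA=sashelp.cars OUT=cars_sorted;
  BY Origin;
RUN;

* One vertical bar chart of vehicle type per region of origin;
PROC CHART DATA=cars_sorted;
  BY Origin;
  VBAR Type / TYPE=percent;
RUN;
```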
 Plotting a block chart

PROC CHART DATA=datasetname;


BY var;
BLOCK plotvar / SUMVAR=var;
RUN;

 Example: from car data in sashelp library


PROC CHART DATA=sashelp.cars;
BY Origin; *make sure the dataset is sorted by Origin;
BLOCK Type;
RUN;

Type it in SAS and go through the output
 Plotting a pie chart

PROC CHART DATA=datasetname;


PIE response / SUMVAR=count;
RUN ;

 Example: from car data in sashelp library


PROC CHART DATA=sashelp.cars;
PIE Type;
RUN;

Note: It is possible to use a BY statement.


Type it in SAS and go through the output
Plotting a histogram
 Syntax:

PROC UNIVARIATE DATA=dataset;


VAR var1 var2 ;
HISTOGRAM var1 var2;
RUN;
Note: any number of variables can be listed as long as they are
numeric; summary statistics are also displayed.

 Example:
PROC UNIVARIATE DATA=sashelp.heart;
VAR systolic diastolic ;
HISTOGRAM systolic diastolic ;
RUN;
Plotting a boxplot
 Syntax:

PROC BOXPLOT DATA=dataset;


PLOT var1*var2 / BOXSTYLE=style; *var1 should be numeric and var2 character;
INSET min mean max stddev / header='title' pos=position;
RUN;

 Example:
PROC BOXPLOT DATA=sashelp.heart;
PLOT systolic *sex /BOXSTYLE= schematic; *data should be sorted by sex;
INSET min mean max stddev/header='Overall statistics'
pos=tm; *bm, lm, rm;
RUN;
The PROC GCHART Procedure
 Better charts can be drawn using the GCHART procedure.

PROC GCHART DATA=datasetname;


VBAR charactervar ;
VBAR3D charactervar /TYPE=percent;
PIE charactervar /TYPE=percent;
RUN;
Example: using the Cars data in the sashelp library.

PROC GCHART DATA=sashelp.Cars;


VBAR Origin;
VBAR3D Origin/TYPE=percent;
PIE Origin/TYPE=percent;
RUN;

GOOD
LUCK!!!
