BIOSTAT 607 R MODULE: LECTURE 3
Biostatistics 607: Module 1
ASSIGNMENTS AND DUE DATES
The first homework assignment is due on September 13th before
midnight.
The first quiz is due on September 13th before midnight.
TODAY’S MAIN TOPICS
Finish up discussion of functions in R (default arguments).
Comments in R
Vectors
FUNCTIONS
DEFINING YOUR OWN FUNCTION
There are three key components of a function definition in R.
Function name : the name which will be used to call the function
Arguments : values to pass to a function as input.
Return value : value returned by a function as output.
The general form for writing your own R function is
function_name <- function(params) {
  ## function_name is the name of the function
  ## params: name(s) of the input variable(s) within this function

  statement1  ## statements executed when the function is called
  statement2  ## statements convert params into some value to be
  ...         ## returned
  return(return_value)  ## return the variable return_value
}
DEFINING YOUR OWN FUNCTION: EXAMPLES
Example. Let’s write a function that takes a number as an input and
returns the square of that number.
## define a new function named square
square <- function(x) {  ## function name: square, argument: x
  return(x*x)            ## returns x*x
}

square(10)  ## example of using the function square
[1] 100
DEFINING FUNCTIONS: AN EXAMPLE
Let’s write a function called PositiveEven that takes a number
(assumed to be an integer) as input and outputs another number
according to the following rule:
if the input number is positive, return 2 if the input number is even and
return 1 if input number is odd.
if the input number is not positive, return -2 if the input number is
even and return -1 if the input number is odd.
DEFINING PositiveEven
PositiveEven <- function(x) {
  if (x > 0 && x %% 2 == 0) {
    return_value <- 2
  } else if (x > 0 && x %% 2 == 1) {
    return_value <- 1
  } else if (x <= 0 && x %% 2 == 0) {
    return_value <- -2
  } else {
    return_value <- -1
  }
  return(return_value)
}
DEFINING FUNCTIONS: EXAMPLES
Now, let’s look at a few examples of calling our function PositiveEven:
PositiveEven(3)
[1] 1
PositiveEven(-6)
[1] -2
PositiveEven(0)
[1] -2
PositiveEven(4)
[1] 2
DEFINING FUNCTIONS: EXAMPLES
We could make our function PositiveEven a bit more user-friendly by
throwing an error whenever the user does not input an integer.
PositiveEvenSafe <- function(x) {  # Function named PositiveEvenSafe
  if (x %% 1 != 0) {               # x %% 1 will equal 0 if x is an integer
    # The stop function stops the execution
    # of the function and returns an error
    stop("x must be an integer")
  }
  if (x > 0 && x %% 2 == 0) {
    return_value <- 2
  } else if (x > 0 && x %% 2 == 1) {
    return_value <- 1
  } else if (x <= 0 && x %% 2 == 0) {
    return_value <- -2
  } else {
    return_value <- -1
  }
  return(return_value)
}
DEFINING FUNCTIONS: EXAMPLES
PositiveEvenSafe(3)
[1] 1
PositiveEvenSafe(-6)
[1] -2
PositiveEvenSafe(2)
[1] 2
PositiveEvenSafe(7.1)
Error in PositiveEvenSafe(7.1) : x must be an integer
RULES FOR CHOOSING FUNCTION NAMES
The rules for choosing variable names also apply to function names.
Examples:
Valid function names    Invalid function names
i                       2things
my_function             location@
answer42                _user.name
.name                   .3rd
RESERVED WORDS
You also cannot use a reserved word as a function name or a variable
name.
You can use built-in function names (for example, print) for your own
functions, but this is NOT RECOMMENDED.
The following are the reserved words in R:
if else repeat while function for
in next break TRUE FALSE
NULL Inf NaN NA NA_integer_
NA_real_ NA_complex_ NA_character_
You can find the list of reserved words in R by typing
?reserved
DEFAULT ARGUMENT VALUES
We can provide default values for function parameters/arguments
by adding = default_value after the parameter
If an argument is specified in the function call, the specified value is used.
Otherwise, the default value is used.
In the function definition, it is generally better to put parameters without
default values before those with default values.
When calling a function, arguments must be specified for every
parameter that does not have a default value.
Unlike Python, in R you can mix parameters with and without default
values in an arbitrary order (though I don’t recommend it).
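As a small sketch of the last point, the function below places a parameter with a default value before one without. The name scale_shift and its parameters are hypothetical and only serve as an illustration.

```r
# Hypothetical example: 'shift' has a default value, 'x' does not
scale_shift <- function(shift = 10, x) {
  return(x + shift)
}

a <- scale_shift(x = 1)                 # shift takes its default value, 10
b <- scale_shift(5, 1)                  # positional: shift = 5, x = 1
c_val <- scale_shift(x = 1, shift = 5)  # keywords: order does not matter
```

Note that scale_shift(1) would assign 1 to shift and leave x missing, which is one reason this ordering is easy to misuse.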
EXAMPLE: DEFAULT ARGUMENTS
As an example, let’s write a function that adds 3 numbers and, as a
default, sets one of these numbers to zero:
add3 <- function(x, y, z=0) {
  return(x + y + z)
}
The default value for z here is 0.
add3(1, 2)     ## omit z
[1] 3
add3(1, 2, 0)  ## this should give the same as add3(1,2)
[1] 3
add3(1, 2, 3)  ## set z to 3 instead of 0
[1] 6
SPECIFYING ARGUMENTS WITH KEYWORDS
We can specify how arguments are passed to parameters not only by
their order but by names with keyword arguments.
Keyword arguments have to do with how you call the function - not with
the function definition itself.
For example, we could call our function add3 with keywords in the
following way:
add3(2, 2, 1)        # Call function using original positions
[1] 5
add3(x=2, y=2, z=1)  # Call function using keywords
[1] 5
add3(y=2, x=2, z=1)  # With keywords, position does not matter
[1] 5
ANOTHER EXAMPLE OF DEFAULT ARGUMENTS
The function foo below has parameters x, y, z, w.
The default value of z is 0, and the default value of w is TRUE.
foo <- function(x, y, z=0, w=TRUE) {
  if (w) {
    1000*x + 100*y + 10*z  ## this is equivalent to return(...)
  } else {
    1000*x - 100*y + 10*z
  }
}
foo(9,3,5,TRUE)  ## specify all arguments
[1] 9350
foo(9,3,5)       ## omit argument w
[1] 9350
foo(9, 3)        ## omit both z and w
[1] 9300
CALLING foo WITH KEYWORD ARGUMENTS
## foo(9)        ## this would cause an error because y is not supplied
foo(x=9, y=5)    ## specify x and y as keyword arguments
[1] 9500
foo(y=5, x=9)    ## when using keywords, argument order doesn't matter
[1] 9500
foo(9, y=5)      ## specify x as positional, y as keyword argument
[1] 9500
foo(9, z=3, y=5) ## y,z are keyword arguments, x is positional
[1] 9530
QUESTION
Suppose we define the function quiz as
quiz <- function(bool_var1, x=0, bool_var2 = TRUE) {
  y <- 0
  if (bool_var1 && bool_var2) {
    y <- x + 2
  } else {
    if (bool_var1) {
      y <- x - 2
    }
  }
  return(y)
}
What value does the following function call return?
quiz(FALSE, 1.3)
EXERCISE
Write an R function that implements the following mathematical
function in R
\[
L(x, y) =
\begin{cases}
0 & \text{if } x = 0 \text{ and } y = 0 \\
1 & \text{if } x \neq 0 \text{ and } y = 0 \\
|x| & \text{if } y = 1 \\
x^2 & \text{if } y = 2
\end{cases}
\]
The function should have user-provided arguments x and y and should
return NA if y does not equal 0, 1, or 2.
SOLUTION
Lfn <- function(x, y) {
  ## && is the scalar "and", the usual choice inside if()
  if (x == 0 && y == 0) {
    ans <- 0
  } else if (x != 0 && y == 0) {
    ans <- 1
  } else if (y == 1) {
    ans <- abs(x)  ## abs computes absolute value
  } else if (y == 2) {
    ans <- x*x
  } else {
    ans <- NA
  }
  return(ans)
}
EXERCISE
Write an R function called PropGtZero which returns the proportion of
three entered numbers which are greater than 0.
The function should have the following function definition
PropGtZero <- function(x, y, z, gt=TRUE) {

}
If gt=TRUE, then PropGtZero should return the proportion of the
numbers x, y, z which are greater than 0.
If gt=FALSE, then PropGtZero should return the proportion of the
numbers x, y, z which are less than or equal to 0.
If one or more of x, y, z is NA, the function should return NA.
For example, PropGtZero(3,2,-2) should return 2/3.
COMMENTS IN R
The comment symbol in R is the hashmark #.
Comments allow you to write notes in English (or any other human
language) within your R programs.
Comments are basically pieces of text the computer will ignore when
interpreting your code.
You can use comments to help explain what your code is doing.
Writing comments becomes more helpful as your code becomes more
complex.
Writing comments can make code more readable for others.
COMMENTS IN R
In R, the hashmark symbol # marks the beginning of a comment.
Everything on a line following the hashmark symbol is ignored.
An example
# This is an example of a comment

x <- 42

# x <- 64

x
[1] 42
COMMENTS IN R
# More
# examples
# of comments

x <- 42 ## x <- 24

# x <- 64

x
[1] 42
VECTORS 1
VECTORS IN R
The most basic data type in R is the vector.
As we mentioned previously, if we assign the number 42 to the variable x,
R will treat x as a vector.
x <- 42  ## the x value is 42
x        ## print the value of x
[1] 42
x[1]     ## What does this do?
[1] 42
Here, x is considered to be a vector with length 1.
Technically, there are two kinds of vectors in R: atomic vectors and lists.
Vectors that are homogeneous (all elements have the same type) are more
technically referred to as atomic vectors in R.
We will just refer to any atomic vector as a vector.
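The distinction between atomic vectors and lists can be checked directly. The following is a minimal sketch using the base R functions is.atomic, is.list, and is.vector; the variable names are chosen only for illustration.

```r
x <- 42
v <- c(1.5, 2, 3)
lst <- list(1, "a", TRUE)  # a list may hold elements of different types

is.atomic(x)    # TRUE: a single number is an atomic vector
length(x)       # 1
is.atomic(v)    # TRUE
is.atomic(lst)  # FALSE
is.list(lst)    # TRUE
is.vector(lst)  # TRUE: lists are also vectors, just not atomic ones
```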
R ALWAYS STORES DATA AS A “COLLECTION”
Dimension        Homogeneous               Heterogeneous
1 dimension      Atomic vector             List
2 dimensions     Matrix                    Data frame
>2 dimensions    Multi-dimensional array
There is no “0-dimensional data” in R.
Even a single-valued object is considered to be a “vector” with length 1.
Source: http://adv-r.had.co.nz/Data-structures.html
CREATING VECTORS IN R WITH c()
The most straightforward way to create vectors in R is to use the
concatenate function c()
This links together a group of values into a single vector.
You can also create a single vector from multiple vectors using c.
Examples:
x <- c(1,2,3)     # a vector with elements 1, 2, and 3
x
[1] 1 2 3
y <- c(x, 4, 5)   # a vector with elements 1,2,3,4,5
y
[1] 1 2 3 4 5
z <- c(x, y)      # a vector with elements 1,2,3,1,2,3,4,5
z
[1] 1 2 3 1 2 3 4 5
CREATING VECTORS IN R WITH c()
You are not limited to using numbers with c().
For example, you can use c() to create a vector of characters or logicals
char_vec <- c("cat", "dog", "hamster")  # vector of characters
char_vec
[1] "cat" "dog" "hamster"
log_vec <- c(TRUE, FALSE, TRUE, TRUE)   # vector of logicals
log_vec
[1] TRUE FALSE TRUE TRUE
CREATING VECTORS WITH SPECIFIC PATTERNS - COLON
It is often very useful to be able to create vectors with certain patterns.
The colon operator : can be used to create a sequence of numbers.
The code from:end will create a vector of numbers starting at from and
increasing (or decreasing) by 1 until reaching the end.
Examples:
x <- 1:5  # creates the vector (1,2,3,4,5)
x
[1] 1 2 3 4 5
y <- 22:28
y
[1] 22 23 24 25 26 27 28
CREATING VECTORS WITH PATTERNS - COLON
z <- 0:-5  # use : to create a decreasing vector
z
[1] 0 -1 -2 -3 -4 -5
You can even use a number with a decimal point as the starting or
ending number (though this is not done very often).
w <- 2.3:6.8  # keeps increasing by 1 until it reaches the
              # largest value that does not exceed 6.8
w
[1] 2.3 3.3 4.3 5.3 6.3
CREATING VECTORS WITH PATTERNS - COLON
Be careful when using something like a:b-1 when creating a vector.
b <- 6
u <- 1:b - 1  # This does not create the vector 1,2,...,b-1
u
[1] 0 1 2 3 4 5
u <- 1:(b-1)  # use this to create vector 1,2,...,b-1
u
[1] 1 2 3 4 5
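The behavior above comes from operator precedence: the colon operator binds more tightly than binary minus. A short sketch that checks this is given below; seq_len is a base R function not otherwise used in this lecture, shown here only as one possible alternative.

```r
b <- 6
u1 <- 1:b - 1         # parsed as (1:b) - 1
u2 <- (1:b) - 1
u3 <- 1:(b - 1)
u4 <- seq_len(b - 1)  # another way to get 1, 2, ..., b-1

identical(u1, u2)     # TRUE
identical(u3, u4)     # TRUE
```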
CREATING VECTORS WITH PATTERNS - seq()
The function seq is a useful function for creating vectors that have
desired starting and ending values.
seq provides more flexibility than the colon operator :
You can use seq to create a sequence with different increments than 1
seq(1, 11, by=2)     # sequence that increases by 2
[1] 1 3 5 7 9 11
seq(1, 10, by=2)     # stops at 9 since 11 is larger than 10
[1] 1 3 5 7 9
seq(1, 11, by=2.54)  # increment by non-integer amount
[1] 1.00 3.54 6.08 8.62
CREATING VECTORS WITH PATTERNS - seq()
Use the length.out argument in seq to create an equally-spaced
vector with a given length.
seq(1, 11, length.out=11)  # same as 1:11
[1] 1 2 3 4 5 6 7 8 9 10 11
seq(1, 11, length.out=6)   # vector of length 6, with equal increments
[1] 1 3 5 7 9 11
# using length.out is convenient
seq(21.5, 48.2, length.out=5)  # don't have to work out correct increment
[1] 21.500 28.175 34.850 41.525 48.200
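For reference, the increment that length.out implies is (to - from)/(length.out - 1). The sketch below checks this for the last example; the variable names s and step are chosen here for illustration.

```r
s <- seq(21.5, 48.2, length.out = 5)
step <- (48.2 - 21.5) / (5 - 1)   # increment implied by length.out
diff(s)                           # differences between neighboring elements
all.equal(diff(s), rep(step, 4))  # TRUE, up to floating-point error
```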
CREATING VECTORS WITH rep()
The rep() (replicate) function is very useful for creating vectors that
have any kind of repeated pattern.
The basic form of rep is
rep(x, times)
rep produces a vector which repeats the vector x a total of times times.
rep(7, 3)         # just creates the vector 7,7,7
[1] 7 7 7
rep(c(2,4,6), 3)  # repeats c(2, 4, 6) three times
[1] 2 4 6 2 4 6 2 4 6
CREATING VECTORS WITH rep()
Using rep inside of c():
c(10:12, rep(c(2,4,6), 3))
[1] 10 11 12 2 4 6 2 4 6 2 4 6
Using rep with the keyword each will repeat each element of x each
times before moving on to the next element of x.
rep(c(2,4,6), each=4)  # repeat each element 4 times
[1] 2 2 2 2 4 4 4 4 6 6 6 6
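Two further patterns may be useful, sketched here with base R's rep: times can itself be a vector (one count per element of x), and each can be combined with times. The variable names r1 and r2 are illustrative.

```r
# times as a vector: repeat 2 once, 4 twice, and 6 three times
r1 <- rep(c(2, 4, 6), times = c(1, 2, 3))  # 2 4 4 6 6 6

# each is applied first, then the whole result is repeated 'times' times
r2 <- rep(c(2, 4), each = 2, times = 3)    # 2 2 4 4 2 2 4 4 2 2 4 4
```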
EXTRACTING VECTOR ELEMENTS
You can extract the kth element of a vector by using
vector_name[k]
For example:
x <- c(1,3,5,100)
x[2]  # second element of x
[1] 3
x[4]  # fourth element of x
[1] 100
EXTRACTING VECTOR ELEMENTS
You can also extract a subset of elements from a vector, with the
indices to extract stored in the vector vec_index, by using
vector_name[ vec_index ]
For example:
x <- c(1,3,5,100, 1250)
x[ c(1,3) ]  # extract first and third elements of x
[1] 1 5
x[ 3:5 ]     # extract elements 3 through 5 of x
[1] 5 100 1250
QUESTION
Suppose we define the vector x as
x <- 1:10
What will be the value of
x[ seq(1, 10, by=2)][3]
a. 3
b. 9
c. 5
d. 4
UPDATING VECTOR ELEMENTS
You can change the value of the kth element of a vector by using
vector_name[k] <- new_value
x <- c(1,3,5,100)
x[2] <- 6  # you may update a single element
print(x)
[1] 1 6 5 100
You can also update multiple elements of a vector by placing a vector of
indices inside brackets []
x[1:3] <- rep(10,3)  # update first 3 elements of x
print(x)
[1] 10 10 10 100
SUBSETTING A VECTOR WITH A LOGICAL EXPRESSION
We mentioned before how you can take a subset of a vector by specifying
the vector indices.
You can also subset a vector using a logical expression
x <- c(10, 2, 21, 15)
y <- x[x > 8]   # returns all elements of x greater than 8
z <- x[x > 12]  # returns all elements of x greater than 12
y
[1] 10 21 15
z
[1] 21 15
You can think of the expression x[x > 8] as doing the following:
x[c(TRUE, FALSE, TRUE, TRUE)]
[1] 10 21 15
SUBSETTING A VECTOR WITH A LOGICAL EXPRESSION
Subsetting vectors with logical expressions is very useful when you want
to compute statistics from a subset of your data.
For example, suppose we have a vector named agevec which stores a
collection of patient ages:
agevec <- c(38, 51, 43, 72, 61, 55, 27, 64, 47)
You can count how many patients are older than 50:
agevec > 50
[1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
sum(agevec > 50)  ## how many are older than 50?
[1] 5
SUBSETTING A VECTOR WITH A LOGICAL EXPRESSION
You can compute the mean age among the patients older than 50
agevec[agevec > 50]
[1] 51 72 61 55 64
mean( agevec[agevec > 50] )  ## average age among those older than 50?
[1] 60.6
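Because TRUE counts as 1 and FALSE as 0 in arithmetic, the same idea gives proportions as well as counts. A brief sketch using the agevec data from above (the result names are illustrative):

```r
agevec <- c(38, 51, 43, 72, 61, 55, 27, 64, 47)

n_over_50 <- sum(agevec > 50)            # count older than 50: 5
prop_over_50 <- mean(agevec > 50)        # proportion older than 50: 5/9
mean_rest <- mean(agevec[agevec <= 50])  # mean age among the rest: 38.75
```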
THE WHICH FUNCTION
You can find the indices of a vector that satisfy a certain condition using
the which function.
x <- c(10, 2, 21, 15)
which(x > 20)  # shows that x[3] > 20
[1] 3
which(x > 12)  # shows that x[3] > 12 and x[4] > 12
[1] 3 4
The which function really just returns the indices where a logical vector
is TRUE
which( c(FALSE, TRUE, FALSE) )
[1] 2
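As a quick check of how which relates to logical subsetting, the sketch below shows that indexing by which(condition) selects the same elements as indexing by the condition itself (for vectors without missing values).

```r
x <- c(10, 2, 21, 15)
idx <- which(x > 12)          # integer positions: 3 4

identical(x[idx], x[x > 12])  # TRUE: both give 21 15
length(which(x > 100))        # 0: no element satisfies the condition
```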
USEFUL METHODS FOR VECTORS
The length function can tell you how many elements are in your vector:
x <- 9:0
x
[1] 9 8 7 6 5 4 3 2 1 0
length(x)  # length of the vector
[1] 10
typeof(x)  # type of elements
[1] "integer"
sum(x)     # sum of values
[1] 45
MORE USEFUL OPERATIONS ON VECTORS
R has functions which allow you to compute all the well-known summary
statistics from a numeric vector.
x <- 1:5
mean(x)  # average of vector elements
[1] 3
var(x)   # variance (the denominator is length(x)-1)
[1] 2.5
sd(x)    # standard deviation (the denominator is length(x)-1)
[1] 1.581139
MORE USEFUL OPERATIONS ON VECTORS
x <- 1:5
max(x)     # maximum value
[1] 5
min(x)     # minimum value
[1] 1
median(x)  # median
[1] 3
VECTORS WITH DIFFERENT DATA TYPES IN R
As we mentioned before, R vectors are not limited to having numeric
elements.
The main restriction is that vectors must have elements which are all the
same data type.
x <- c(1, 2.5, 42)  ## numeric vector
print(x)
[1] 1.0 2.5 42.0
y <- c("hello","world","biostat607")  ## character vector
print(y)
[1] "hello" "world" "biostat607"
z <- c(TRUE, FALSE, FALSE)  ## logical vector
print(z)
[1] TRUE FALSE FALSE
VECTORS WITH “MIXED” DATA TYPES
You can “create” a vector that has mixed data types, but R will
automatically convert the types of some of the elements so that all
elements have the same type.
x <- c(TRUE, FALSE, FALSE)  ## homogeneous logical vector
print(x)
[1] TRUE FALSE FALSE
x <- c(TRUE, FALSE, 2)  ## contains logical and numeric values
print(x)                ## R translates logical TRUE/FALSE into numeric 1/0
[1] 1 0 2
x <- c(1, 2, "3")  ## numeric + character
print(x)           ## R translates numeric values into characters
[1] "1" "2" "3"
VECTORS WITH “MIXED” DATA TYPES
x <- c(TRUE, 2, "3")  ## logical + numeric + character
print(x)              ## R translates logical and numeric values into characters
[1] "TRUE" "2" "3"
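The examples above suggest an ordering in which c() converts types: logical, then integer, then double, then character. A small sketch that inspects the resulting type with typeof:

```r
typeof(c(TRUE, FALSE))   # "logical"
typeof(c(TRUE, 2L))      # "integer": 2L is an integer constant
typeof(c(TRUE, 2))       # "double"
typeof(c(TRUE, 2, "3"))  # "character"
```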
EXPLICITLY CHANGING THE DATA TYPES
You can convert a vector to another type using as.logical,
as.numeric, or as.character.
x <- as.logical(c(0,1,2,3))  # numeric to logical conversion
print(x)
[1] FALSE TRUE TRUE TRUE
x <- as.numeric(c(TRUE,FALSE, T,F))  # logical to numeric
print(x)
[1] 1 0 1 0
x <- as.character(c(0,1,2,3))  # numeric to string
print(x)
[1] "0" "1" "2" "3"
SOMETIMES CONVERSION DOES NOT WORK
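## When a character string cannot be converted to a number,
## the result is NA (R also prints a warning)
as.numeric(c("123","12.3","123a"))
[1] 123.0 12.3 NA
## Only strings such as "TRUE", "FALSE", "T", "F" convert to logical values;
## other strings give NA (here c() first turns 0 into the string "0")
as.logical(c("TRUE","FALSE", "T","TF",0))
[1] TRUE FALSE TRUE NA NA
## c() first converts everything to character; "12.3" is truncated to 12
as.integer(c(123, 12.3, "123", "123a"))
[1] 123 12 123 NA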
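One way to find which entries failed to convert is to look for the NA values that the conversion introduced. The following is a minimal sketch; the variable names are illustrative, and suppressWarnings is used only to silence the coercion warning.

```r
raw <- c("123", "12.3", "123a")

num <- suppressWarnings(as.numeric(raw))  # NA where conversion fails
failed <- raw[is.na(num)]                 # entries that did not convert
failed
```

This assumes raw contains no missing values to begin with; otherwise those entries would also show up in failed.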
MATHEMATICAL OPERATIONS WITH VECTORS
When doing mathematical operations with two vectors of the same
length, R will perform addition, subtraction, multiplication, division
element-by-element.
x <- c(10, 5, 0)
y <- 1:3
x+y  # element-wise addition
[1] 11 7 3
x*y  # element-wise multiplication
[1] 10 10 0
x^y  # element-wise power
[1] 10 25 0
MATHEMATICAL OPERATIONS WITH VECTORS
Multiplying or dividing a vector by a single number multiplies (or divides)
each element by that number
x <- c(10, 5, 0, -5)

3*x
[1] 30 15 0 -15
x/2
[1] 5.0 2.5 0.0 -2.5
Adding a single number to a vector (or subtracting it from a vector) adds
that number to (or subtracts it from) each element
x <- c(10, 5, 0, -5)

3 + x  # Actually an example of recycling with a one-element vector
[1] 13 8 3 -2
RECYCLING RULES
You can actually add/subtract vectors of different lengths.
When doing this, R recycles the values in the shorter vector.
R will print out a warning message if the length of the longer vector is
not a multiple of the length of the shorter vector.
c(1, 2, 4) + c(6, 0, 9, 10)
[1] 7 2 13 11
What the above code is doing is adding the vector c(1, 2, 4, 1) to
the vector c(6, 0, 9, 10).
RECYCLING RULES
Note that if we add a vector of length 3 to a vector of length 6 we will
get no warning message
c(1, 2, 4) + c(6, 0, 9, 10, 11, 12)
[1] 7 2 13 11 13 16
This adds the vector c(1, 2, 4, 1, 2, 4) to the vector c(6, 0,
9, 10, 11, 12).
I personally do not use recycling rules much when the length of both
vectors is 2 or more.
It’s probably good to be aware of recycling rules if you are getting this
type of warning message.
You may find it helpful to use these recycling rules if you are, for
example, adding one vector with another vector that has a simple,
repeating pattern.
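As a sketch of the repeating-pattern use case, the code below adds 100 to every second element of a vector by recycling c(0, 100). It also shows one way to capture the recycling warning as text with base R's tryCatch; the variable names are illustrative.

```r
x <- 1:6
y <- x + c(0, 100)  # c(0, 100) is recycled to 0 100 0 100 0 100
y                   # 1 102 3 104 5 106

# Lengths 3 and 4: the warning is caught and its message returned as a string
msg <- tryCatch(c(1, 2, 4) + c(6, 0, 9, 10),
                warning = function(w) conditionMessage(w))
msg
```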
LOGICAL OPERATIONS WITH VECTORS
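c(TRUE, TRUE, FALSE) & c(TRUE,FALSE,FALSE)  # element-wise
[1] TRUE FALSE FALSE
c(TRUE, TRUE, FALSE) | c(TRUE,FALSE,FALSE)  # element-wise
[1] TRUE TRUE FALSE
## && and || are meant for length-one logical values.
## R versions before 4.3.0 used only the first element of each vector
## (the output shown below); R 4.3.0 and later signal an error instead.
c(TRUE, TRUE, FALSE) && c(TRUE,FALSE,FALSE)  # only first values
[1] TRUE
c(TRUE, TRUE, FALSE) || c(TRUE,FALSE,FALSE)  # only first values
[1] TRUE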
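When a whole logical vector needs to be reduced to a single TRUE or FALSE (for example inside an if statement), the base R functions any and all are one option. They are not covered elsewhere in this lecture; the sketch below is only illustrative.

```r
x <- c(10, 2, 21, 15)

any(x > 20)  # TRUE: at least one element exceeds 20
all(x > 5)   # FALSE: 2 is not greater than 5

# any() and all() return a single value, so && is safe to use here
if (any(x > 20) && all(x > 0)) {
  status <- "some large, all positive"
} else {
  status <- "other"
}
status
```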
QUESTION
Suppose
x <- rep(c(1, 5, 10), each=3)
What is the value of
sum( x[x > 5] )
a. 45
b. 30
c. 48
d. 33
SET OPERATIONS ON VECTORS
You can also do set operations with vectors.
When working with set operations, you should think of the set associated
with a vector as the collection of unique elements from that vector.
x <- c(1,2,3,3,4,5)
y <- c(1,3,3,5,7,9)
intersect(x,y)  # set intersection, note that repeated 3 is dropped
[1] 1 3 5
union(x,y)      # set union
[1] 1 2 3 4 5 7 9
setdiff(x,y)    # set difference x - y
[1] 2 4
MORE SET OPERATIONS WITH VECTORS
x <- 1:5             # x is c(1,2,3,4,5)
y <- c(1,3,3,5,7,9)
x %in% y             # membership test
[1] TRUE FALSE TRUE FALSE TRUE
match(x, y)          # find indices of first matching values
[1] 1 NA 2 NA 4
setdiff(x, y)        # set difference x-y
[1] 2 4
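The set operations can also be understood in terms of %in% and unique (a base R function that drops repeated values). The sketch below checks these relationships on the earlier example vectors; it illustrates the behavior and makes no claim about how R implements these functions internally.

```r
x <- c(1, 2, 3, 3, 4, 5)
y <- c(1, 3, 3, 5, 7, 9)

identical(intersect(x, y), unique(x[x %in% y]))   # TRUE
identical(setdiff(x, y), unique(x[!(x %in% y)]))  # TRUE
identical(union(x, y), unique(c(x, y)))           # TRUE
```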
NA VALUES
Missing data in R is usually represented by the value NA.
NA stands for “Not Available”
You can create a vector with NA values by just typing in NA for one of the
vector elements.
x <- c(1, 5, NA, 4)  # The third element of this vector is NA
typeof(x)
[1] "double"
You can type in NA for either numeric or character variables.
R will automatically convert everything to the appropriate type.
y <- c("cat", NA, "dog")  # The second element of this vector is NA
typeof(y)
[1] "character"
USING FUNCTIONS WITH NA VALUES
Many of the built-in R functions will return NA if the input numeric vector
contains any NA values.
For example, if we try to compute the standard deviation of the vector x
x <- c(1, 5, NA, 4, 7)  # The third element of this vector is NA
mx <- sd(x)             # mx will have the value NA
mx
[1] NA
You can compute the standard deviation of the non-NA values by
including the argument na.rm = TRUE
sx <- sd(x, na.rm=TRUE)  # sx should be the standard deviation of 1, 5, 4, 7
sx
[1] 2.5
USING FUNCTIONS WITH NA VALUES
In the function sd, the argument na.rm is an example of an argument
with a default value.
You can see this by looking at the function definition for sd
sd <- function(x, na.rm = FALSE) {
  ## function body omitted
}
The default value of na.rm is FALSE.
So, you need to include na.rm = TRUE if you want sd to ignore
missing values.
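Several other base R summary functions, including mean, sum, and max, accept the same na.rm argument. A brief sketch with the vector from the earlier example (the result names are illustrative):

```r
x <- c(1, 5, NA, 4, 7)

mean(x)                          # NA, because na.rm defaults to FALSE
m <- mean(x, na.rm = TRUE)       # 4.25
s <- sum(x, na.rm = TRUE)        # 17
largest <- max(x, na.rm = TRUE)  # 7
```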
THE FUNCTION is.na()
The function is.na() is often very useful when you’re working with
data that has missing values.
When applied to a vector, is.na() will return a vector of logical values
with the same length as the input vector.
The kth element of is.na(x) will be TRUE if the kth element of x is
missing.
Otherwise, the kth element of is.na(x) will be FALSE.
x <- c(10, 3, 5, NA, 1, NA)  # Elements 4 and 6 of x have NA values
is.na(x)
[1] FALSE FALSE FALSE TRUE FALSE TRUE
You can also use is.na() directly on matrices and data frames.
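Combined with the subsetting tools from earlier in this lecture, is.na() makes it easy to count, locate, or drop missing values. A minimal sketch with illustrative variable names:

```r
x <- c(10, 3, 5, NA, 1, NA)

n_missing <- sum(is.na(x))  # number of missing values: 2
where <- which(is.na(x))    # positions of missing values: 4 6
x_obs <- x[!is.na(x)]       # observed values only: 10 3 5 1
mean_obs <- mean(x_obs)     # 4.75, same as mean(x, na.rm = TRUE)
```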