R Slides
2/16
What is R?
R is a dialect of the S language.
3/16
What is S?
· S is a language that was developed by John Chambers and others at Bell Labs.
· Early versions of the language did not contain functions for statistical modeling.
· In 1988 the system was rewritten in C and began to resemble the system that we have today (this
was Version 3 of the language). The book Statistical Models in S by Chambers and Hastie (the
white book) documents the statistical analysis functionality.
· Version 4 of the S language was released in 1998 and is the version we use today. The book
Programming with Data by John Chambers (the green book) documents this version of the
language.
4/16
Historical Notes
· In 1993 Bell Labs gave StatSci (now Insightful Corp.) an exclusive license to develop and sell the
S language.
· In 2004 Insightful purchased the S language from Lucent for $2 million and is the current owner.
· Insightful sells its implementation of the S language under the product name S-PLUS and has
built a number of fancy features (GUIs, mostly) on top of it—hence the “PLUS”.
· The fundamentals of the S language itself have not changed dramatically since 1998.
· In 1998, S won the Association for Computing Machinery’s Software System Award.
5/16
S Philosophy
In “Stages in the Evolution of S”, John Chambers writes:
“[W]e wanted users to be able to begin in an interactive environment, where they did not consciously
think of themselves as programming. Then as their needs became clearer and their sophistication
increased, they should be able to slide gradually into programming, when the language and system
aspects would become more important.”
http://www.stat.bell-labs.com/S/history.html
6/16
Back to R
· 1991: Created in New Zealand by Ross Ihaka and Robert Gentleman. Their experience
developing R is documented in a 1996 JCGS paper.
· 1995: Martin Mächler convinces Ross and Robert to use the GNU General Public License to
make R free software.
· 1997: The R Core Group is formed (containing some people associated with S-PLUS). The core
group controls the source code for R.
7/16
Features of R
· Syntax is very similar to S, making it easy for S-PLUS users to switch over.
· Semantics are superficially similar to S, but in reality are quite different (more on that later).
8/16
Features of R (cont'd)
· Quite lean, as far as software goes; functionality is divided into modular packages
· Graphics capabilities very sophisticated and better than most stat packages.
· Useful for interactive work, but contains a powerful programming language for developing new
tools (user -> programmer)
· Very active and vibrant user community; R-help and R-devel mailing lists and Stack Overflow
9/16
Features of R (cont'd)
It's free! (Both in the sense of beer and in the sense of speech.)
10/16
Free Software
With free software, you are granted
· The freedom to run the program, for any purpose (freedom 0).
· The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to
the source code is a precondition for this.
· The freedom to redistribute copies so you can help your neighbor (freedom 2).
· The freedom to improve the program, and release your improvements to the public, so that the
whole community benefits (freedom 3). Access to the source code is a precondition for this.
http://www.fsf.org
11/16
Drawbacks of R
· Essentially based on 40-year-old technology.
· Little built in support for dynamic or 3-D graphics (but things have improved greatly since the “old
days”).
· Functionality is based on consumer demand and user contributions. If no one feels like
implementing your favorite method, then it’s your job!
· Not ideal for all possible situations (but this is a drawback of all software packages).
12/16
Design of the R System
The R system is divided into 2 conceptual parts:
1. The “base” R system that you download from CRAN
2. Everything else.
· The “base” R system contains, among other things, the base package which is required to run R
and contains the most fundamental functions.
· The other packages contained in the “base” system include utils, stats, datasets, graphics,
grDevices, grid, methods, tools, parallel, compiler, splines, tcltk, stats4.
· There are also “Recommended” packages: boot, class, cluster, codetools, foreign,
KernSmooth, lattice, mgcv, nlme, rpart, survival, MASS, spatial, nnet, Matrix.
13/16
Design of the R System
And there are many other packages available:
· There are about 4000 packages on CRAN that have been developed by users and programmers
around the world.
· There are also many packages associated with the Bioconductor project (http://bioconductor.org).
· People often make packages available on their personal websites; there is no reliable way to
keep track of how many packages are available in this fashion.
14/16
Some R Resources
Available from CRAN (http://cran.r-project.org)
· An Introduction to R
· Writing R Extensions
· R Data Import/Export
15/16
Some Useful Books on S/R
Standard texts
Other resources
16/16
Getting Help
Roger D. Peng, Associate Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health
Asking Questions
· Asking questions via email is different from asking questions in person
· People on the other side do not have the background information you have
2/14
Finding Answers
· Try to find an answer by searching the archives of the forum you plan to post to.
· Try to find an answer by searching the Web.
3/14
Asking Questions
· It’s important to let other people know that you’ve done all of the previous things already
· If the answer is in the documentation, the answer will be “Read the documentation”
- one email round wasted
4/14
Example: Error Messages
> library(datasets)
> data(airquality)
> cor(airquality)
Error in cor(airquality) : missing observations in cov/cor
5/14
Google is your friend
6/14
Asking Questions
· What steps will reproduce the problem?
· What is the expected output?
· What do you see instead?
· What version of the product (e.g. R, packages, etc.)?
· What operating system?
· Additional information
7/14
Subject Headers
· Stupid: "Help! Can't fit linear model!"
· Smart: "R 3.0.2 lm() function produces seg fault with large data frame, Mac OS X 10.9.1"
· Smarter: "R 3.0.2 lm() function on Mac OS X 10.9.1 -- seg fault on large data frame"
8/14
Do
· Describe the goal, not the step
9/14
Don't
· Claim that you’ve found a bug
· Grovel as a substitute for doing your homework
· Post homework questions on mailing lists (we’ve seen them all)
10/14
Case Study: A Recent Post to the R-devel
Mailing List
Subject: large dataset - confused
Message:
I'm trying to load a dataset into R, but
I'm completely lost. This is probably
due mostly to the fact that I'm a
complete R newb, but it's got me stuck
in a research project.
11/14
Response
Yes, you are lost. The R posting guide is at
http://www.r-project.org/posting-guide.html and will point you to the
right list and also the manuals (at e.g.
http://cran.r-project.org/manuals.html, and one of them seems exactly
what you need).
12/14
Analysis: What Went Wrong?
· Question was sent to the wrong mailing list (R-devel instead of R-help)
· Email subject was very vague
· Question was very vague
13/14
Places to Turn
· Class discussion board; your fellow students
· r-help@r-project.org
· Other project-specific mailing lists
(This talk was inspired by Eric Raymond’s “How to ask questions the smart way”)
14/14
Entering Input
At the R prompt we type expressions. The <- symbol is the assignment operator.
> x <- 1
> print(x)
[1] 1
> x
[1] 1
> msg <- "hello"
The # character indicates a comment. Anything to the right of the # (including the # itself) is ignored.
5/27
Evaluation
When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated
expression is returned. The result may be auto-printed.
6/27
Printing
> x <- 1:20
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[16] 16 17 18 19 20
7/27
Objects
R has five basic or “atomic” classes of objects:
· character
· numeric (real numbers)
· integer
· complex
· logical (True/False)
· BUT: The one exception is a list, which is represented as a vector but can contain objects of
different classes (indeed, that’s usually why we use them)
2/27
Numbers
· Numbers in R are generally treated as numeric objects (i.e. double precision real numbers)
· Ex: Entering 1 gives you a numeric object; entering 1L explicitly gives you an integer.
· There is also a special number Inf which represents infinity; e.g. 1 / 0; Inf can be used in
ordinary calculations; e.g. 1 / Inf is 0
· The value NaN represents an undefined value (“not a number”); e.g. 0 / 0; NaN can also be
thought of as a missing value (more on that later)
3/27
Attributes
R objects can have attributes
· names, dimnames
· dimensions (e.g. matrices, arrays)
· class
· length
· other user-defined attributes/metadata
Attributes of an object can be accessed using the attributes() function.
4/27
Creating Vectors
The c() function can be used to create vectors of objects.
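A few examples (the particular values are illustrative), plus the vector() function for creating empty vectors:

```r
x <- c(0.5, 0.6)       ## numeric
y <- c(TRUE, FALSE)    ## logical
z <- c("a", "b", "c")  ## character
i <- 9:29              ## integer
cx <- c(1+0i, 2+4i)    ## complex

## vector() creates an empty vector of a given class and length
v <- vector("numeric", length = 10)   ## ten zeros
```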
8/27
Mixing Objects
What about the following?
When different objects are mixed in a vector, coercion occurs so that every element in the vector is
of the same class.
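A sketch of what implicit coercion looks like in practice:

```r
y1 <- c(1.7, "a")    ## coerced to character
y2 <- c(TRUE, 2)     ## coerced to numeric; TRUE becomes 1
y3 <- c("a", TRUE)   ## coerced to character
```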
9/27
Explicit Coercion
Objects can be explicitly coerced from one class to another using the as.* functions, if available.
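For instance:

```r
x <- 0:6
class(x)         ## integer
as.numeric(x)    ## the same values as doubles
as.logical(x)    ## 0 maps to FALSE, everything else to TRUE
as.character(x)  ## "0" "1" "2" ...
```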
10/27
Explicit Coercion
Nonsensical coercion results in NAs.
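For example:

```r
x <- c("a", "b", "c")
as.numeric(x)   ## NA NA NA (with a warning)
as.logical(x)   ## NA NA NA
```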
11/27
Lists
Lists are a special type of vector that can contain elements of different classes. Lists are a very
important data type in R and you should get to know them well.
> x <- list(1, "a", TRUE, 1 + 4i)
> x
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 1+4i
16/27
Matrices
Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of
length 2 (nrow, ncol)
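For example:

```r
m <- matrix(nrow = 2, ncol = 3)   ## an empty (all-NA) 2 x 3 matrix
dim(m)                            ## 2 3
attributes(m)                     ## the $dim attribute holds c(2, 3)
```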
12/27
Matrices (cont’d)
Matrices are constructed column-wise, so entries can be thought of starting in the “upper left” corner
and running down the columns.
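For example:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
```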
13/27
Matrices (cont’d)
Matrices can also be created directly from vectors by adding a dimension attribute.
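For example:

```r
m <- 1:10
dim(m) <- c(2, 5)   ## m is now a 2 x 5 matrix
```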
14/27
cbind-ing and rbind-ing
Matrices can be created by column-binding or row-binding with cbind() and rbind().
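For example:

```r
x <- 1:3
y <- 10:12
cbind(x, y)   ## a 3 x 2 matrix with columns x and y
rbind(x, y)   ## a 2 x 3 matrix with rows x and y
```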
15/27
Factors
Factors are used to represent categorical data. Factors can be unordered or ordered. One can think
of a factor as an integer vector where each integer has a label.
· Factors are treated specially by modelling functions like lm() and glm()
· Using factors with labels is better than using integers because factors are self-describing; having
a variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.
17/27
Factors
> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x
[1] yes yes no yes no
Levels: no yes
> table(x)
x
no yes
2 3
> unclass(x)
[1] 2 2 1 2 1
attr(,"levels")
[1] "no" "yes"
18/27
Factors
The order of the levels can be set using the levels argument to factor(). This can be important
in linear modelling because the first level is used as the baseline level.
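For example, making “yes” the baseline level:

```r
x <- factor(c("yes", "yes", "no", "yes", "no"),
            levels = c("yes", "no"))   ## "yes" is now the first (baseline) level
unclass(x)   ## "yes" maps to 1, "no" to 2
```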
19/27
Missing Values
Missing values are denoted by NA or NaN for undefined mathematical operations.
· NA values have a class also, so there are integer NA, character NA, etc.
20/27
Missing Values
> x <- c(1, 2, NA, 10, 3)
> is.na(x)
[1] FALSE FALSE TRUE FALSE FALSE
> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE
> x <- c(1, 2, NaN, NA, 4)
> is.na(x)
[1] FALSE FALSE TRUE TRUE FALSE
> is.nan(x)
[1] FALSE FALSE TRUE FALSE FALSE
21/27
Data Frames
Data frames are used to store tabular data
· They are represented as a special type of list where every element of the list has to have the
same length
· Each element of the list can be thought of as a column and the length of each element of the list
is the number of rows
· Unlike matrices, data frames can store different classes of objects in each column (just like lists);
matrices must have every element be the same class
22/27
Data Frames
> x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
> x
foo bar
1 1 TRUE
2 2 TRUE
3 3 FALSE
4 4 FALSE
> nrow(x)
[1] 4
> ncol(x)
[1] 2
23/27
Names
R objects can also have names, which is very useful for writing readable code and self-describing
objects.
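For example:

```r
x <- 1:3
names(x)    ## NULL at first
names(x) <- c("foo", "bar", "norf")
x["bar"]    ## elements can now be looked up by name
```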
24/27
Names
Lists can also have names.
> x <- list(a = 1, b = 2, c = 3)
> x
$a
[1] 1
$b
[1] 2
$c
[1] 3
25/27
Names
And matrices.
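For example, using dimnames (row names first, then column names):

```r
m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))
m["a", "d"]   ## same element as m[1, 2]
```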
26/27
Summary
Data Types
· vectors, lists
· factors
· missing values
· data frames
· names
27/27
Reading Data
There are a few principal functions for reading data into R
· read.table, read.csv, for reading tabular data
· readLines, for reading lines of a text file
· source, for reading in R code files (inverse of dump)
· dget, for reading in R code files (inverse of dput)
· load, for reading in saved workspaces
· unserialize, for reading single R objects in binary form
2/9
Writing Data
There are analogous functions for writing data to files
· write.table
· writeLines
· dump
· dput
· save
· serialize
3/9
Reading Data Files with read.table
The read.table function is one of the most commonly used functions for reading data. It has a few
important arguments:
· file, the name of a file, or a connection
· header, logical indicating if the file has a header line
· sep, a string indicating how the columns are separated
· colClasses, a character vector indicating the class of each column in the dataset
· nrows, the number of rows in the dataset
· comment.char, a character string indicating the comment character
· skip, the number of lines to skip from the beginning
· stringsAsFactors, should character variables be coded as factors?
4/9
read.table
For small to moderately sized datasets, you can usually call read.table without specifying any other
arguments
R will automatically
· figure out how many rows there are (and how much memory needs to be allocated)
· figure out what type of variable is in each column of the table
Telling R all these things directly makes R run faster and more efficiently.
5/9
Reading in Larger Datasets with read.table
With much larger datasets, doing the following things will make your life easier and will prevent R
from choking.
· Read the help page for read.table, which contains many hints
· Make a rough calculation of the memory required to store your dataset. If the dataset is larger
than the amount of RAM on your computer, you can probably stop right here.
6/9
Reading in Larger Datasets with read.table
· Use the colClasses argument. Specifying this option instead of using the default can make
read.table run MUCH faster, often twice as fast. In order to use this option, you have to know the
class of each column in your data frame. If all of the columns are “numeric”, for example, then
you can just set colClasses = "numeric". A quick and dirty way to figure out the classes of
each column is to read in a small portion of the file first and apply class to each column.
· Set nrows. This doesn’t make R run faster but it helps with memory usage. A mild overestimate
is okay. You can use the Unix tool wc to calculate the number of lines in a file.
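The quick-and-dirty colClasses trick mentioned above can be sketched as follows (a temporary file stands in for a real dataset here):

```r
## Create a small example file; in practice this would be your own dataset
tf <- tempfile()
write.table(data.frame(a = 1:5, b = letters[1:5]), tf, row.names = FALSE)

initial <- read.table(tf, header = TRUE, nrows = 100)  ## read a small sample
classes <- sapply(initial, class)                      ## class of each column
tabAll  <- read.table(tf, header = TRUE, colClasses = classes)
```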
7/9
Know Thy System
In general, when using R with larger datasets, it’s useful to know a few things about your system.
· Is the OS 32 or 64 bit?
8/9
Calculating Memory Requirements
I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly,
how much memory is required to store this data frame?
1,500,000 rows × 120 columns × 8 bytes/numeric
= 1,440,000,000 bytes
= 1,440,000,000 / 2^20 bytes/MB
= 1,373.29 MB
= 1.34 GB
9/9
Textual Formats
· dumping and dputing are useful because the resulting textual format is edit-able, and in the case
of corruption, potentially recoverable.
· Unlike writing out a table or csv file, dump and dput preserve the metadata (sacrificing some
readability), so that another user doesn’t have to specify it all over again.
· Textual formats can work much better with version control programs like subversion or git which
can only track changes meaningfully in text files
· Textual formats can be longer-lived; if there is corruption somewhere in the file, it can be easier to
fix the problem
· Textual formats adhere to the “Unix philosophy”
2/9
dput-ting R Objects
Another way to pass data around is by deparsing the R object with dput and reading it back in using
dget.
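For example (using a temporary file):

```r
y <- data.frame(a = 1, b = "a")
tf <- tempfile()
dput(y, file = tf)   ## deparse y into a text file
new.y <- dget(tf)    ## parse it back into an R object
```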
3/9
Dumping R Objects
Multiple objects can be deparsed using the dump function and read back in using source.
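For example (again with a temporary file):

```r
x <- "foo"
y <- data.frame(a = 1, b = "a")
tf <- tempfile()
dump(c("x", "y"), file = tf)   ## deparse both objects into one file
rm(x, y)
source(tf)                     ## recreate x and y from the file
```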
4/9
Interfaces to the Outside World
Data are read in using connection interfaces. Connections can be made to files (most common) or to
other more exotic things.
· file, opens a connection to a file
· gzfile, opens a connection to a file compressed with gzip
· bzfile, opens a connection to a file compressed with bzip2
· url, opens a connection to a webpage
5/9
File Connections
> str(file)
function (description = "", open = "", blocking = TRUE,
encoding = getOption("encoding"))
6/9
Connections
In general, connections are powerful tools that let you navigate files or other external objects. In
practice, we often don’t need to deal with the connection interface directly.
con <- file("foo.txt", "r")
data <- read.csv(con)
close(con)
is the same as
data <- read.csv("foo.txt")
7/9
Reading Lines of a Text File
> con <- gzfile("words.gz")
> x <- readLines(con, 10)
> x
[1] "1080" "10-point" "10th" "11-point"
[5] "12-point" "16-point" "18-point" "1st"
[9] "2" "20-point"
writeLines takes a character vector and writes each element one line at a time to a text file.
8/9
Reading Lines of a Text File
readLines can be useful for reading in lines of webpages
9/9
Subsetting
There are a number of operators that can be used to extract subsets of R objects.
· [ always returns an object of the same class as the original; can be used to select more than one
element (there is one exception)
· [[ is used to extract elements of a list or a data frame; it can only be used to extract a single
element and the class of the returned object will not necessarily be a list or data frame
· $ is used to extract elements of a list or data frame by name; semantics are similar to that of [[.
2/14
Subsetting
> x <- c("a", "b", "c", "c", "d", "a")
> x[1]
[1] "a"
> x[2]
[1] "b"
> x[1:4]
[1] "a" "b" "c" "c"
> x[x > "a"]
[1] "b" "c" "c" "d"
> u <- x > "a"
> u
[1] FALSE TRUE TRUE TRUE TRUE FALSE
> x[u]
[1] "b" "c" "c" "d"
3/14
Subsetting Lists
> x <- list(foo = 1:4, bar = 0.6)
> x[1]
$foo
[1] 1 2 3 4
> x[[1]]
[1] 1 2 3 4
> x$bar
[1] 0.6
> x[["bar"]]
[1] 0.6
> x["bar"]
$bar
[1] 0.6
7/14
Subsetting Lists
> x <- list(foo = 1:4, bar = 0.6, baz = "hello")
> x[c(1, 3)]
$foo
[1] 1 2 3 4
$baz
[1] "hello"
8/14
Subsetting Lists
The [[ operator can be used with computed indices; $ can only be used with literal names.
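For example:

```r
x <- list(foo = 1:4, bar = 0.6)
name <- "foo"
x[[name]]   ## a computed index works with [[
x$name      ## NULL: $ looks for an element literally named "name"
```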
9/14
Subsetting Nested Elements of a List
The [[ operator can take an integer sequence.
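For example:

```r
x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))
x[[c(1, 3)]]   ## third element of the first element: 14
x[[1]][[3]]    ## equivalent
x[[c(2, 1)]]   ## 3.14
```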
10/14
Subsetting a Matrix
Matrices can be subsetted in the usual way with (i, j) type indices.
> x <- matrix(1:6, 2, 3)
> x[1, 2]
[1] 3
> x[2, 1]
[1] 2
Indices can also be missing.
> x[1, ]
[1] 1 3 5
> x[, 2]
[1] 3 4
4/14
Subsetting a Matrix
By default, when a single element of a matrix is retrieved, it is returned as a vector of length 1 rather
than a 1 × 1 matrix. This behavior can be turned off by setting drop = FALSE.
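For example:

```r
x <- matrix(1:6, 2, 3)
x[1, 2]                 ## a vector of length 1
x[1, 2, drop = FALSE]   ## a 1 x 1 matrix
```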
5/14
Subsetting a Matrix
Similarly, subsetting a single column or a single row will give you a vector, not a matrix (by default).
6/14
Partial Matching
Partial matching of names is allowed with [[ and $.
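For example:

```r
x <- list(aardvark = 1:5)
x$a                       ## partial matching finds "aardvark"
x[["a"]]                  ## NULL: [[ requires an exact match by default
x[["a", exact = FALSE]]   ## unless exact matching is turned off
```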
11/14
Removing NA Values
A common task is to remove missing values (NAs).
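For example, using is.na() to build a logical index:

```r
x <- c(1, 2, NA, 4, NA, 5)
bad <- is.na(x)
x[!bad]   ## 1 2 4 5
```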
12/14
Removing NA Values
What if there are multiple things and you want to take the subset with no missing values?
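The complete.cases function handles this; for example:

```r
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, "f")
good <- complete.cases(x, y)   ## TRUE where neither vector is missing
x[good]   ## 1 2 4 5
y[good]   ## "a" "b" "d" "f"
```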
13/14
Removing NA Values
> airquality[1:6, ]
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
> good <- complete.cases(airquality)
> airquality[good, ][1:6, ]
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
7 23 299 8.6 65 5 7
14/14
Introduction to the R Language
Vectorized Operations
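Many operations in R are vectorized, acting element-wise on whole vectors at once; for example:

```r
x <- 1:4
y <- 6:9
x + y    ## element-wise addition: 7 9 11 13
x > 2    ## FALSE FALSE TRUE TRUE
x * y    ## 6 14 24 36
```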
2/3
Vectorized Matrix Operations
> x <- matrix(1:4, 2, 2); y <- matrix(rep(10, 4), 2, 2)
> x * y ## element-wise multiplication
[,1] [,2]
[1,] 10 30
[2,] 20 40
> x / y
[,1] [,2]
[1,] 0.1 0.3
[2,] 0.2 0.4
> x %*% y ## true matrix multiplication
[,1] [,2]
[1,] 40 40
[2,] 60 60
3/3
Introduction to the R Language
Control Structures
Most control structures are not used in interactive sessions, but rather when writing functions or
longer expressions.
2/14
Control Structures: if
if(<condition>) {
## do something
} else {
## do something else
}
if(<condition1>) {
## do something
} else if(<condition2>) {
## do something different
} else {
## do something different
}
3/14
if
This is a valid if/else structure.
if(x > 3) {
y <- 10
} else {
y <- 0
}
So is this one.
y <- if(x > 3) {
        10
} else {
        0
}
4/14
if
Of course, the else clause is not necessary.
if(<condition1>) {
        ## do something
}
if(<condition2>) {
        ## do something else
}
5/14
for
for loops take an iterator variable and assign it successive values from a sequence or vector. for
loops are most commonly used for iterating over the elements of an object (list, vector, etc.)
for(i in 1:10) {
print(i)
}
This loop takes the i variable and in each iteration of the loop gives it values 1, 2, 3, ..., 10, and then
exits.
6/14
for
These three loops have the same behavior.
x <- c("a", "b", "c", "d")
for(i in 1:4) {
print(x[i])
}
for(i in seq_along(x)) {
print(x[i])
}
for(letter in x) {
print(letter)
}
7/14
Nested for loops
for loops can be nested.
x <- matrix(1:6, 2, 3)
for(i in seq_len(nrow(x))) {
for(j in seq_len(ncol(x))) {
print(x[i, j])
}
}
Be careful with nesting though. Nesting beyond 2–3 levels is often very difficult to read/understand.
8/14
while
While loops begin by testing a condition. If it is true, then they execute the loop body. Once the loop
body is executed, the condition is tested again, and so forth.
count <- 0
while(count < 10) {
print(count)
count <- count + 1
}
While loops can potentially result in infinite loops if not written properly. Use with care!
9/14
while
Sometimes there will be more than one condition in the test.
z <- 5
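The rest of this example is missing from the slide; a loop in the same spirit (a random walk between two bounds; the details are an assumption) might look like:

```r
z <- 5
set.seed(1)   ## for reproducibility (not part of the original example)
while(z >= 3 && z <= 10) {   ## both conditions must hold for the loop to continue
        coin <- rbinom(1, 1, 0.5)
        if(coin == 1) {      ## random walk: step up or down
                z <- z + 1
        } else {
                z <- z - 1
        }
}
```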
10/14
repeat
Repeat initiates an infinite loop; these are not commonly used in statistical applications but they do
have their uses. The only way to exit a repeat loop is to call break.
x0 <- 1
tol <- 1e-8
repeat {
        x1 <- computeEstimate()
        if(abs(x1 - x0) < tol) {
                break
        } else {
                x0 <- x1
        }
}
11/14
repeat
The loop in the previous slide is a bit dangerous because there’s no guarantee it will stop. Better to
set a hard limit on the number of iterations (e.g. using a for loop) and then report whether
convergence was achieved or not.
12/14
next, return
next is used to skip an iteration of a loop
for(i in 1:100) {
if(i <= 20) {
## Skip the first 20 iterations
next
}
## Do something here
}
return signals that a function should exit and return a given value
13/14
Control Structures
Summary
· Control structures like if, while, and for allow you to control the flow of an R program
· Infinite loops should generally be avoided, even if they are theoretically correct.
· Control structures mentioned here are primarily useful for writing programs; for command-line
interactive work, the *apply functions are more useful.
14/14
Functions
Roger D. Peng, Associate Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health
Functions
Functions are created using the function() directive and are stored as R objects just like anything
else. In particular, they are R objects of class “function”.
f <- function(<arguments>) {
## Do something interesting
}
Functions in R are “first class objects”, which means that they can be treated much like any other R
object. Importantly,
· The return value of a function is the last expression in the function body to be evaluated.
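A minimal illustration:

```r
f <- function(x) {
        y <- x + 1
        y^2   ## the last evaluated expression is the return value
}
f(2)   ## 9
```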
2/13
Function Arguments
Functions have named arguments which potentially have default values.
· The formal arguments are the arguments included in the function definition
· The formals function returns a list of all the formal arguments of a function
· Not every function call in R makes use of all the formal arguments
· Function arguments can be missing or might have default values
3/13
Argument Matching
R function arguments can be matched positionally or by name. So the following calls to sd are all
equivalent
> mydata <- rnorm(100)
> sd(mydata)
> sd(x = mydata)
> sd(x = mydata, na.rm = FALSE)
> sd(na.rm = FALSE, x = mydata)
> sd(na.rm = FALSE, mydata)
Even though it’s legal, I don’t recommend messing around with the order of the arguments too much,
since it can lead to some confusion.
4/13
Argument Matching
You can mix positional matching with matching by name. When an argument is matched by name, it
is “taken out” of the argument list and the remaining unnamed arguments are matched in the order
that they are listed in the function definition.
> args(lm)
function (formula, data, subset, weights, na.action,
method = "qr", model = TRUE, x = FALSE,
y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
5/13
Argument Matching
· Most of the time, named arguments are useful on the command line when you have a long
argument list and you want to use the defaults for everything except for an argument near the
end of the list
· Named arguments also help if you can remember the name of the argument and not its position
on the argument list (plotting is a good example).
6/13
Argument Matching
Function arguments can also be partially matched, which is useful for interactive work. The order of
operations when given an argument is
1. Check for an exact match
2. Check for a partial match
3. Check for a positional match
7/13
Defining a Function
f <- function(a, b = 1, c = 2, d = NULL) {
}
In addition to not specifying a default value, you can also set an argument value to NULL.
8/13
Lazy Evaluation
Arguments to functions are evaluated lazily, so they are evaluated only as needed.
f <- function(a, b) {
a^2
}
f(2)
## [1] 4
This function never actually uses the argument b, so calling f(2) will not produce an error because
the 2 gets positionally matched to a.
9/13
Lazy Evaluation
f <- function(a, b) {
print(a)
print(b)
}
f(45)
## [1] 45
## Error in print(b) : argument "b" is missing, with no default
Notice that “45” got printed first before the error was triggered. This is because b did not have to be
evaluated until after print(a). Once the function tried to evaluate print(b) it had to throw an
error.
10/13
The “...” Argument
The ... argument indicates a variable number of arguments that are usually passed on to other
functions.
· ... is often used when extending another function and you don’t want to copy the entire argument
list of the original function
· Generic functions use ... so that extra arguments can be passed to methods (more on this later).
> mean
function (x, ...)
UseMethod("mean")
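A sketch of the first bullet (myplot is an illustrative name, not a real function): one default is changed and everything else is passed straight through to plot().

```r
myplot <- function(x, y, type = "l", ...) {
        plot(x, y, type = type, ...)   ## pass any extra arguments on to plot()
}
```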
11/13
The “...” Argument
The ... argument is also necessary when the number of arguments passed to the function cannot be
known in advance.
> args(paste)
function (..., sep = " ", collapse = NULL)
> args(cat)
function (..., file = "", sep = " ", fill = FALSE,
labels = NULL, append = FALSE)
12/13
Arguments Coming After the “...” Argument
One catch with ... is that any arguments that appear after ... on the argument list must be named
explicitly and cannot be partially matched.
> args(paste)
function (..., sep = " ", collapse = NULL)
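For example, the partially matched name falls into ... instead:

```r
paste("a", "b", sep = ":")   ## "a:b"
paste("a", "b", se = ":")    ## "a b :" -- se is not matched to sep;
                             ## it is treated as another string to paste
```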
13/13
Introduction to the R Language
Scoping Rules
Consider what happens if we define a function called lm in the workspace:
lm <- function(x) { x * x }
How does R know what value to assign to the symbol lm? Why doesn’t it give it the value of lm that
is in the stats package?
2/24
A Diversion on Binding Values to Symbol
When R tries to bind a value to a symbol, it searches through a series of environments to find the
appropriate value. When you are working on the command line and need to retrieve the value of an
R object, the order is roughly
1. Search the global environment for a symbol name matching the one requested.
2. Search the namespaces of each of the packages on the search list
> search()
[1] ".GlobalEnv" "package:stats" "package:graphics"
[4] "package:grDevices" "package:utils" "package:datasets"
[7] "package:methods" "Autoloads" "package:base"
3/24
Binding Values to Symbol
· The global environment or the user’s workspace is always the first element of the search list and
the base package is always the last.
4/24
Scoping Rules
The scoping rules for R are the main feature that makes it different from the original S language.
· The scoping rules determine how a value is associated with a free variable in a function
· Lexical scoping turns out to be particularly useful for simplifying statistical computations
5/24
Lexical Scoping
Consider the following function.
f <- function(x, y) {
x^2 + y / z
}
This function has 2 formal arguments x and y. In the body of the function there is another symbol z.
In this case z is called a free variable. The scoping rules of a language determine how values are
assigned to free variables. Free variables are not formal arguments and are not local variables
(assigned inside the function body).
6/24
Lexical Scoping
Lexical scoping in R means that
the values of free variables are searched for in the environment in which the function was defined.
What is an environment?
· An environment is a collection of (symbol, value) pairs, i.e. x is a symbol and 3.14 might be its
value.
· Every environment has a parent environment; it is possible for an environment to have multiple
“children”
7/24
Lexical Scoping
Searching for the value for a free variable:
· If the value of a symbol is not found in the environment in which a function was defined, then the
search is continued in the parent environment.
· The search continues down the sequence of parent environments until we hit the top-level
environment; this is usually the global environment (workspace) or the namespace of a package.
· After the top-level environment, the search continues down the search list until we hit the empty
environment. If a value for a given symbol cannot be found once the empty environment is
arrived at, then an error is thrown.
8/24
Lexical Scoping
Why does all this matter?
· Typically, a function is defined in the global environment, so that the values of free variables are
just found in the user’s workspace
· This behavior is logical for most people and is usually the “right thing” to do
9/24
Lexical Scoping
make.power <- function(n) {
pow <- function(x) {
x^n
}
pow
}
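Calling make.power() returns the inner function, with its defining environment attached; this is how the cube and square objects on the next slide are created:

```r
make.power <- function(n) {   ## repeated here so the example is self-contained
        pow <- function(x) {
                x^n   ## n is a free variable, found in make.power's environment
        }
        pow
}
cube <- make.power(3)
square <- make.power(2)
cube(3)     ## 27
square(3)   ## 9
```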
10/24
Exploring a Function Closure
What’s in a function’s environment?
> ls(environment(cube))
[1] "n" "pow"
> get("n", environment(cube))
[1] 3
> ls(environment(square))
[1] "n" "pow"
> get("n", environment(square))
[1] 2
11/24
Lexical vs. Dynamic Scoping
y <- 10
f <- function(x) {
y <- 2
y^2 + g(x)
}
g <- function(x) {
x*y
}
f(3)
[1] 34
12/24
Lexical vs. Dynamic Scoping
· With lexical scoping the value of y in the function g is looked up in the environment in which the
function was defined, in this case the global environment, so the value of y is 10.
· With dynamic scoping, the value of y is looked up in the environment from which the function was
called (sometimes referred to as the calling environment).
13/24
Lexical vs. Dynamic Scoping
When a function is defined in the global environment and is subsequently called from the global
environment, then the defining environment and the calling environment are the same. This can
sometimes give the appearance of dynamic scoping.
14/24
Other Languages
Other languages that support lexical scoping
· Scheme
· Perl
· Python
15/24
Consequences of Lexical Scoping
· In R, all objects must be stored in memory
· All functions must carry a pointer to their respective defining environments, which could be
anywhere
· In S-PLUS, free variables are always looked up in the global workspace, so everything can be
stored on the disk because the “defining environment” of all functions is the same.
16/24
Application: Optimization
Why is any of this information useful?
· Optimization routines in R like optim, nlm, and optimize require you to pass a function whose
argument is a vector of parameters (e.g. a log-likelihood)
· However, an objective function might depend on a host of other things besides its parameters (like
data)
· When writing software which does optimization, it may be desirable to allow the user to hold
certain parameters fixed
17/24
Maximizing a Normal Likelihood
Write a “constructor” function
make.NegLogLik <- function(data, fixed = c(FALSE, FALSE)) {
        params <- fixed
        function(p) {
                params[!fixed] <- p
                mu <- params[1]
                sigma <- params[2]
                a <- -0.5*length(data)*log(2*pi*sigma^2)
                b <- -0.5*sum((data-mu)^2) / (sigma^2)
                -(a + b)
        }
}
Note: Optimization functions in R minimize functions, so you need to use the negative log-likelihood.
18/24
Maximizing a Normal Likelihood
> set.seed(1); normals <- rnorm(100, 1, 2)
> nLL <- make.NegLogLik(normals)
> nLL
function(p) {
params[!fixed] <- p
mu <- params[1]
sigma <- params[2]
a <- -0.5*length(data)*log(2*pi*sigma^2)
b <- -0.5*sum((data-mu)^2) / (sigma^2)
-(a + b)
}
<environment: 0x165b1a4>
> ls(environment(nLL))
[1] "data" "fixed" "params"
19/24
Estimating Parameters
> optim(c(mu = 0, sigma = 1), nLL)$par
mu sigma
1.218239 1.787343
Fixing σ = 2
Fixing μ = 1
20/24
Plotting the Likelihood
nLL <- make.NegLogLik(normals, c(1, FALSE))
x <- seq(1.7, 1.9, len = 100)
y <- sapply(x, nLL)
plot(x, exp(-(y - min(y))), type = "l")
21/24
Plotting the Likelihood
22/24
Plotting the Likelihood
23/24
Lexical Scoping Summary
· Objective functions can be “built” which contain all of the necessary data for evaluating the
function
· No need to carry around long argument lists — useful for interactive and exploratory work.
· Code can be simplified and cleaned up
· Reference: Robert Gentleman and Ross Ihaka (2000). “Lexical Scope and Statistical Computing,”
JCGS, 9, 491–508.
24/24
Coding Standards for R
Roger D. Peng, Associate Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health
Coding Standards for R
1. Always use text files / text editor
2/6
Coding Standards for R
1. Always use text files / text editor
2. Indent your code
3/6
Coding Standards for R
1. Always use text files / text editor
2. Indent your code
3. Limit the width of your code (80 columns?)
4/6
Indenting
· Indenting improves readability
· Fixing line length (80 columns) prevents lots of nesting and very long functions
5/6
Coding Standards for R
1. Always use text files / text editor
2. Indent your code
3. Limit the width of your code (80 columns?)
4. Limit the length of individual functions
6/6
Dates and Times in R
Roger D. Peng, Associate Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health
Dates and Times in R
R has developed a special representation of dates and times
2/10
Dates in R
Dates are represented by the Date class and can be coerced from a character string using the
as.Date() function.
x <- as.Date("1970-01-01")
x
## [1] "1970-01-01"
unclass(x)
## [1] 0
unclass(as.Date("1970-01-02"))
## [1] 1
3/10
Times in R
Times are represented using the POSIXct or the POSIXlt class
· POSIXct is just a very large integer under the hood; it is a useful class when you want to store
times in something like a data frame
· POSIXlt is a list underneath and it stores a bunch of other useful information like the day of the
week, day of the year, month, day of the month
There are a number of generic functions that work on dates and times
4/10
Times in R
Times can be coerced from a character string using the as.POSIXlt or as.POSIXct function.
x <- Sys.time()
x
## [1] "2013-01-24 22:04:14 EST"
p <- as.POSIXlt(x)
names(unclass(p))
## [1] "sec" "min" "hour" "mday" "mon"
## [6] "year" "wday" "yday" "isdst"
p$sec
## [1] 14.34
5/10
Times in R
You can also use the POSIXct format.
x <- Sys.time()
x ## Already in ‘POSIXct’ format
## [1] "2013-01-24 22:04:14 EST"
unclass(x)
## [1] 1359083054
x$sec
## Error: $ operator is invalid for atomic vectors
p <- as.POSIXlt(x)
p$sec
## [1] 14.37
6/10
Times in R
Finally, there is the strptime function in case your dates are written in a different format.
I can never remember the formatting strings. Check ?strptime for details.
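A sketch of a typical strptime call, assuming an English locale for the month names:

```r
datestring <- c("January 10, 2012 10:40", "December 9, 2011 9:10")
x <- strptime(datestring, "%B %d, %Y %H:%M")  ## %B = full month name
class(x)
## [1] "POSIXlt" "POSIXt"
```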
7/10
Operations on Dates and Times
You can use mathematical operations on dates and times. Well, really just + and -. You can do
comparisons too (e.g., ==, <=)
x <- as.Date("2012-01-01")
y <- strptime("9 Jan 2011 11:34:21", "%d %b %Y %H:%M:%S")
x-y
## Warning: Incompatible methods ("-.Date",
## "-.POSIXt") for "-"
## Error: non-numeric argument to binary operator
x <- as.POSIXlt(x)
x-y
## Time difference of 356.3 days
8/10
Operations on Dates and Times
R even keeps track of leap years, leap seconds, daylight saving time, and time zones.
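For example, a difference computed across time zones comes out right, because the zone offset is part of the representation (the times here are illustrative):

```r
x <- as.POSIXct("2012-10-25 01:00:00")              ## local time zone
y <- as.POSIXct("2012-10-25 06:00:00", tz = "GMT")  ## GMT
y - x  ## a difftime that accounts for the zone offset
```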
9/10
Summary
· Dates and times have special classes in R that allow for numerical and statistical calculations
· Dates use the Date class
· Character strings can be coerced to Date/time classes using the strptime function or the
as.Date, as.POSIXlt, and as.POSIXct functions
10/10
Introduction to the R Language
Loop Functions
2/12
lapply
lapply takes three arguments: (1) a list X; (2) a function (or the name of a function) FUN; (3) other
arguments via its ... argument. If X is not a list, it will be coerced to a list using as.list.
lapply
3/12
lapply
lapply always returns a list, regardless of the class of the input.
x <- list(a = 1:5, b = rnorm(10))
lapply(x, mean)
## $a
## [1] 3
##
## $b
## [1] 0.4671
4/12
lapply
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
lapply(x, mean)
## $a
## [1] 2.5
##
## $b
## [1] 0.5261
##
## $c
## [1] 1.421
##
## $d
## [1] 4.927
5/12
lapply
> x <- 1:4
> lapply(x, runif)
[[1]]
[1] 0.2675082
[[2]]
[1] 0.2186453 0.5167968
[[3]]
[1] 0.2689506 0.1811683 0.5185761
[[4]]
[1] 0.5627829 0.1291569 0.2563676 0.7179353
6/12
lapply
> x <- 1:4
> lapply(x, runif, min = 0, max = 10)
[[1]]
[1] 3.302142
[[2]]
[1] 6.848960 7.195282
[[3]]
[1] 3.5031416 0.8465707 9.7421014
[[4]]
[1] 1.195114 3.594027 2.930794 2.766946
7/12
lapply
lapply and friends make heavy use of anonymous functions.
> x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
> x
$a
     [,1] [,2]
[1,]    1    3
[2,]    2    4
$b
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
8/12
lapply
An anonymous function for extracting the first column of each matrix.
> lapply(x, function(elt) elt[, 1])
$a
[1] 1 2
$b
[1] 1 2 3
9/12
sapply
sapply will try to simplify the result of lapply if possible.
· If the result is a list where every element is length 1, then a vector is returned
· If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
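Both rules can be sketched in two lines:

```r
## Every element has length 1, so a named vector is returned
sapply(list(a = 1:4, b = 5:8), mean)      ## a = 2.5, b = 6.5

## Every element is a vector of length 2, so a 2 x 3 matrix is returned
sapply(1:3, function(i) c(i, i^2))
```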
10/12
sapply
> x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
> lapply(x, mean)
$a
[1] 2.5
$b
[1] 0.06082667
$c
[1] 1.467083
$d
[1] 5.074749
11/12
sapply
> sapply(x, mean)
a b c d
2.50000000 0.06082667 1.46708277 5.07474950
> mean(x)
[1] NA
Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA
12/12
Introduction to the R Language
Loop Functions - apply
· apply is used to evaluate a function over the margins of an array
· It can be used with general arrays, e.g. taking the average of an array of matrices
· It is not really faster than writing a loop, but it works in one line!
2/7
apply
> str(apply)
function (X, MARGIN, FUN, ...)
· X is an array
· MARGIN is an integer vector indicating which margins should be “retained”
· FUN is a function to be applied
· ... is for other arguments to be passed to FUN
3/7
apply
> x <- matrix(rnorm(200), 20, 10)
> apply(x, 2, mean)
[1] 0.04868268 0.35743615 -0.09104379
[4] -0.05381370 -0.16552070 -0.18192493
[7] 0.10285727 0.36519270 0.14898850
[10] 0.26767260
4/7
col/row sums and means
For sums and means of matrix dimensions, we have some shortcuts:
· rowSums = apply(x, 1, sum)
· rowMeans = apply(x, 1, mean)
· colSums = apply(x, 2, sum)
· colMeans = apply(x, 2, mean)
The shortcut functions are much faster, but you won’t notice unless you’re using a large matrix.
5/7
Other Ways to Apply
Quantiles of the rows of a matrix.
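The code for this example is not shown; a sketch passing extra arguments (here probs) through apply's ... argument:

```r
x <- matrix(rnorm(200), 20, 10)
## 25th and 75th percentiles of each row: a 2 x 20 matrix
apply(x, 1, quantile, probs = c(0.25, 0.75))
```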
6/7
apply
Average matrix in an array
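The code is not shown; a sketch averaging over the third dimension of an array of 2 x 2 matrices:

```r
a <- array(rnorm(2 * 2 * 10), c(2, 2, 10))
apply(a, c(1, 2), mean)   ## keep margins 1 and 2; average over the rest
rowMeans(a, dims = 2)     ## equivalent shortcut, and faster
```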
7/7
Introduction to the R Language
Loop Functions - mapply
> str(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE,
USE.NAMES = TRUE)
2/6
mapply
The following is tedious to type
list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))
Instead we can do
> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
3/6
Vectorizing a Function
> noise <- function(n, mean, sd) {
+ rnorm(n, mean, sd)
+ }
> noise(5, 1, 2)
[1] 2.4831198 2.4790100 0.4855190 -1.2117759
[5] -0.2743532
4/6
Instant Vectorization
> mapply(noise, 1:5, 1:5, 2)
[[1]]
[1] 1.037658
[[2]]
[1] 0.7113482 2.7555797
[[3]]
[1] 2.769527 1.643568 4.597882
[[4]]
[1] 4.476741 5.658653 3.962813 1.204284
[[5]]
[1] 4.797123 6.314616 4.969892 6.530432 6.723254
5/6
Instant Vectorization
Which is the same as
> list(noise(1, 1, 2), noise(2, 2, 2), noise(3, 3, 2), noise(4, 4, 2), noise(5, 5, 2))
6/6
Introduction to the R Language
Loop Functions - tapply
> str(tapply)
function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
· X is a vector
· INDEX is a factor or a list of factors (or else they are coerced to factors)
2/14
tapply
Take group means.
> x <- c(rnorm(10), runif(10), rnorm(10, 1))
> f <- gl(3, 10)
> tapply(x, f, mean)
3/14
tapply
Take group means without simplification.
> tapply(x, f, mean, simplify = FALSE)
$`2`
[1] 0.5163468
$`3`
[1] 1.246368
4/14
tapply
Find group ranges.
> tapply(x, f, range)
$`2`
[1] 0.09479023 0.79107293
$`3`
[1] 0.4717443 2.5887025
5/14
split
split takes a vector or other object and splits it into groups determined by a factor or list of
factors.
> str(split)
function (x, f, drop = FALSE, ...)
6/14
split
> x <- c(rnorm(10), runif(10), rnorm(10, 1))
> f <- gl(3, 10)
> split(x, f)
$`1`
[1] -0.8493038 -0.5699717 -0.8385255 -0.8842019
[5] 0.2849881 0.9383361 -1.0973089 2.6949703
[9] 1.5976789 -0.1321970
$`2`
[1] 0.09479023 0.79107293 0.45857419 0.74849293
[5] 0.34936491 0.35842084 0.78541705 0.57732081
[9] 0.46817559 0.53183823
$`3`
[1] 0.6795651 0.9293171 1.0318103 0.4717443
[5] 2.5887025 1.5975774 1.3246333 1.4372701
7/14
split
A common idiom is split followed by an lapply.
> lapply(split(x, f), mean)
$`2`
[1] 0.5163468
$`3`
[1] 1.246368
8/14
Splitting a Data Frame
> library(datasets)
> head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
9/14
Splitting a Data Frame
> s <- split(airquality, airquality$Month)
> lapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
$`5`
Ozone Solar.R Wind
NA NA 11.62258
$`6`
Ozone Solar.R Wind
NA 190.16667 10.26667
$`7`
Ozone Solar.R Wind
NA 216.483871 8.941935
10/14
Splitting a Data Frame
> sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
5 6 7 8 9
Ozone NA NA NA NA NA
Solar.R NA 190.16667 216.483871 NA 167.4333
Wind 11.62258 10.26667 8.941935 8.793548 10.1800
11/14
Splitting on More than One Level
> x <- rnorm(10)
> f1 <- gl(2, 5)
> f2 <- gl(5, 2)
> f1
[1] 1 1 1 1 1 2 2 2 2 2
Levels: 1 2
> f2
[1] 1 1 2 2 3 3 4 4 5 5
Levels: 1 2 3 4 5
> interaction(f1, f2)
[1] 1.1 1.1 1.2 1.2 1.3 2.3 2.4 2.4 2.5 2.5
10 Levels: 1.1 2.1 1.2 2.2 1.3 2.3 1.4 ... 2.5
12/14
Splitting on More than One Level
Interactions can create empty levels.
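The code is not shown; a sketch using the f1 and f2 from the previous slide (the data values are random):

```r
x <- rnorm(10)
f1 <- gl(2, 5)
f2 <- gl(5, 2)
## Combinations that never occur, such as 2.1 and 1.5, are empty groups
str(split(x, list(f1, f2)))
```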
13/14
split
Empty levels can be dropped.
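A sketch of the same split with drop = TRUE:

```r
x <- rnorm(10)
f1 <- gl(2, 5)
f2 <- gl(5, 2)
str(split(x, list(f1, f2), drop = TRUE))  ## only non-empty groups remain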
14/14
Debugging
Roger D. Peng, Associate Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health
Something’s Wrong!
Indications that something’s not right
· message: a generic notification/diagnostic produced by the message function; execution of the
function continues
· warning: an indication that something is wrong but not necessarily fatal; execution of the
function continues; generated by the warning function
· error: An indication that a fatal problem has occurred; execution stops; produced by the stop
function
· condition: A generic concept for indicating that something unexpected can occur; programmers
can create their own conditions
2/15
Something’s Wrong!
Warning
log(-1)
## Warning: NaNs produced
## [1] NaN
3/15
Something’s Wrong!
printmessage <- function(x) {
if(x > 0)
print("x is greater than zero")
else
print("x is less than or equal to zero")
invisible(x)
}
4/15
Something’s Wrong!
printmessage <- function(x) {
if (x > 0)
print("x is greater than zero") else print("x is less than or equal to zero")
invisible(x)
}
printmessage(1)
## [1] "x is greater than zero"
printmessage(NA)
## Error: missing value where TRUE/FALSE needed
5/15
Something’s Wrong!
printmessage2 <- function(x) {
if(is.na(x))
print("x is a missing value!")
else if(x > 0)
print("x is greater than zero")
else
print("x is less than or equal to zero")
invisible(x)
}
6/15
Something’s Wrong!
printmessage2 <- function(x) {
if (is.na(x))
print("x is a missing value!") else if (x > 0)
print("x is greater than zero") else print("x is less than or equal to zero")
invisible(x)
}
x <- log(-1)
## Warning: NaNs produced
printmessage2(x)
## [1] "x is a missing value!"
7/15
Something’s Wrong!
How do you know that something is wrong with your function?
· What was your input? How did you call the function?
· What were you expecting? What did you get?
· How does what you got differ from what you were expecting?
· Were your expectations correct in the first place?
8/15
Debugging Tools in R
The primary tools for debugging functions in R are
· traceback: prints out the function call stack after an error occurs; does nothing if there’s no error
· debug: flags a function for “debug” mode which allows you to step through execution of a function
one line at a time
· browser: suspends the execution of a function wherever it is called and puts the function in
debug mode
· trace: allows you to insert debugging code into a function at specific places
· recover: allows you to modify the error behavior so that you can browse the function call stack
These are interactive tools specifically designed to allow you to pick through a function. There’s also
the more blunt technique of inserting print/cat statements in the function.
9/15
traceback
> mean(x)
Error in mean(x) : object 'x' not found
> traceback()
1: mean(x)
>
10/15
traceback
> lm(y ~ x)
Error in eval(expr, envir, enclos) : object 'y' not found
> traceback()
7: eval(expr, envir, enclos)
6: eval(predvars, data, env)
5: model.frame.default(formula = y ~ x, drop.unused.levels = TRUE)
4: model.frame(formula = y ~ x, drop.unused.levels = TRUE)
3: eval(expr, envir, enclos)
2: eval(mf, parent.frame())
1: lm(y ~ x)
11/15
debug
> debug(lm)
> lm(y ~ x)
debugging in: lm(y ~ x)
debug: {
ret.x <- x
ret.y <- y
cl <- match.call()
...
if (!qr)
z$qr <- NULL
z
}
Browse[2]>
12/15
debug
Browse[2]> n
debug: ret.x <- x
Browse[2]> n
debug: ret.y <- y
Browse[2]> n
debug: cl <- match.call()
Browse[2]> n
debug: mf <- match.call(expand.dots = FALSE)
Browse[2]> n
debug: m <- match(c("formula", "data", "subset", "weights", "na.action",
"offset"), names(mf), 0L)
13/15
recover
> options(error = recover)
> read.csv("nosuchfile")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'nosuchfile': No such file or directory
1: read.csv("nosuchfile")
2: read.table(file = file, header = header, sep = sep, quote = quote, dec =
3: file(file, "rt")
Selection:
14/15
Debugging
Summary
· When analyzing a function with a problem, first make sure you can reproduce the problem; then
clearly state your expectations and how the output differs from them
· Interactive debugging tools traceback, debug, browser, trace, and recover can be used to
find problematic code in functions
· Debugging tools are not a substitute for thinking!
15/15
Simulation
Roger D. Peng, Associate Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health
Generating Random Numbers
Functions for probability distributions in R
· rnorm: generate random Normal variates with a given mean and standard deviation
· dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of
points)
· pnorm: evaluate the cumulative distribution function for a Normal distribution
2/15
Generating Random Numbers
Probability distribution functions usually have four functions associated with them. The functions are
prefixed with a
· d for density
· r for random number generation
· p for cumulative distribution
· q for quantile function
3/15
Generating Random Numbers
Working with the Normal distribution requires using these four functions
If Φ is the cumulative distribution function for a standard Normal distribution, then pnorm(q) = Φ(q)
and qnorm(p) = Φ⁻¹(p).
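For reference, the four functions have these signatures in base R:

```r
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
```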
4/15
Generating Random Numbers
> x <- rnorm(10)
> x
[1] 1.38380206 0.48772671 0.53403109 0.66721944
[5] 0.01585029 0.37945986 1.31096736 0.55330472
[9] 1.22090852 0.45236742
> x <- rnorm(10, 20, 2)
> x
[1] 23.38812 20.16846 21.87999 20.73813 19.59020
[6] 18.73439 18.31721 22.51748 20.36966 21.04371
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.32 19.73 20.55 20.67 21.67 23.39
5/15
Generating Random Numbers
Setting the random number seed with set.seed ensures reproducibility
> set.seed(1)
> rnorm(5)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808
[5] 0.3295078
> rnorm(5)
[1] -0.8204684 0.4874291 0.7383247 0.5757814
[5] -0.3053884
> set.seed(1)
> rnorm(5)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808
[5] 0.3295078
6/15
Generating Random Numbers
Generating Poisson data
> rpois(10, 1)
[1] 3 1 0 1 0 0 1 0 1 1
> rpois(10, 2)
[1] 6 2 2 1 3 2 2 1 1 2
> rpois(10, 20)
[1] 20 11 21 20 20 21 17 15 24 20
7/15
Generating Random Numbers From a Linear
Model
Suppose we want to simulate from the following linear model
y = β0 + β1 x + ε
where ε ∼ N(0, 2²); the code below takes β0 = 0.5 and β1 = 2.
> set.seed(20)
> x <- rnorm(100)
> e <- rnorm(100, 0, 2)
> y <- 0.5 + 2 * x + e
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.4080 -1.5400 0.6789 0.6893 2.9300 6.5050
> plot(x, y)
8/15
Generating Random Numbers From a Linear
Model
9/15
Generating Random Numbers From a Linear
Model
What if x is binary?
> set.seed(10)
> x <- rbinom(100, 1, 0.5)
> e <- rnorm(100, 0, 2)
> y <- 0.5 + 2 * x + e
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.4940 -0.1409 1.5770 1.4320 2.8400 6.9410
> plot(x, y)
10/15
Generating Random Numbers From a Linear
Model
11/15
Generating Random Numbers From a
Generalized Linear Model
Suppose we want to simulate from a Poisson model where
Y ~ Poisson(μ)
log μ = β0 + β1 x
and β0 = 0.5 and β1 = 0.3. We need to use the rpois function for this
> set.seed(1)
> x <- rnorm(100)
> log.mu <- 0.5 + 0.3 * x
> y <- rpois(100, exp(log.mu))
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 1.00 1.55 2.00 6.00
> plot(x, y)
12/15
Generating Random Numbers From a
Generalized Linear Model
13/15
Random Sampling
The sample function draws randomly from a specified set of (scalar) objects allowing you to sample
from arbitrary distributions.
> set.seed(1)
> sample(1:10, 4)
[1] 3 4 5 7
> sample(1:10, 4)
[1] 3 9 8 5
> sample(letters, 5)
[1] "q" "b" "e" "x" "p"
> sample(1:10) ## permutation
[1] 4 7 10 6 9 2 8 3 1 5
> sample(1:10)
[1] 2 3 4 1 9 5 10 8 6 7
> sample(1:10, replace = TRUE) ## Sample w/replacement
[1] 2 9 7 8 2 8 5 9 7 8
14/15
Simulation
Summary
· Drawing samples from specific probability distributions can be done with r* functions
· Standard distributions are built in: Normal, Poisson, Binomial, Exponential, Gamma, etc.
· The sample function can be used to draw random samples from arbitrary vectors
· Setting the random number generator seed via set.seed is critical for reproducibility
15/15
Profiling R Code
Roger D. Peng, Associate Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health
Why is My Code So Slow?
· Profiling is a systematic way to examine how much time is spent in different parts of a program
· Often code runs fine once, but what if you have to put it in a loop for 1,000 iterations? Is it still fast
enough?
2/17
On Optimizing Your Code
· Getting the biggest impact on speeding up code depends on knowing where the code spends most
of its time
We should forget about small efficiencies, say about 97% of the time: premature
optimization is the root of all evil
--Donald Knuth
3/17
General Principles of Optimization
· Design first, then optimize
· Measure (collect data), don’t guess
· If you’re going to be a scientist, you need to apply the same principles here!
4/17
Using system.time()
· Takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the
amount of time taken to evaluate the expression
5/17
Using system.time()
· Usually, the user time and elapsed time are relatively close for straight computing tasks
· Elapsed time may be greater than user time if the CPU spends a lot of time waiting around
· Elapsed time may be smaller than the user time if your machine has multiple cores/processors
(and is capable of using them)
6/17
Using system.time()
## Elapsed time > user time
system.time(readLines("http://www.jhsph.edu"))
user system elapsed
0.004 0.002 0.431
7/17
Timing Longer Expressions
system.time({
n <- 1000
r <- numeric(n)
for (i in 1:n) {
x <- rnorm(n)
r[i] <- mean(x)
}
})
8/17
Beyond system.time()
· Using system.time() allows you to test certain functions or code blocks to see if they are taking
excessive amounts of time
· Assumes you already know where the problem is and can call system.time() on it
9/17
The R Profiler
· The Rprof() function starts the profiler in R
- R must be compiled with profiler support (but this is usually the case)
· The summaryRprof() function summarizes the output from Rprof() (otherwise it’s not
readable)
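A typical session looks like this (the output file name is arbitrary, and y and x are assumed to exist):

```r
Rprof("lm.out")         ## start profiling; samples go to lm.out
fit <- lm(y ~ x)        ## ... the code being profiled ...
Rprof(NULL)             ## stop profiling
summaryRprof("lm.out")  ## tabulate time spent by function
```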
10/17
The R Profiler
· Rprof() keeps track of the function call stack at regularly sampled intervals and tabulates how
much time is spent in each function
· NOTE: If your code runs very quickly, the profiler is not useful; but then you probably don’t need
it anyway
11/17
R Profiler Raw Output
## lm(y ~ x)
sample.interval=10000
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"list" "eval" "eval" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"na.omit" "model.frame.default" "model.frame" "eval" "eval" "lm"
"lm.fit" "lm"
"lm.fit" "lm"
"lm.fit" "lm"
12/17
Using summaryRprof()
· The summaryRprof() function tabulates the R profiler output and calculates how much time is
spent in each function
· "by.total" divides the time spent in each function by the total run time
· "by.self" does the same but first subtracts out time spent in functions above in the call stack
13/17
By Total
$by.total
total.time total.pct self.time self.pct
"lm" 7.41 100.00 0.30 4.05
"lm.fit" 3.50 47.23 2.99 40.35
"model.frame.default" 2.24 30.23 0.12 1.62
"eval" 2.24 30.23 0.00 0.00
"model.frame" 2.24 30.23 0.00 0.00
"na.omit" 1.54 20.78 0.24 3.24
"na.omit.data.frame" 1.30 17.54 0.49 6.61
"lapply" 1.04 14.04 0.00 0.00
"[.data.frame" 1.03 13.90 0.79 10.66
"[" 1.03 13.90 0.00 0.00
"as.list.data.frame" 0.82 11.07 0.82 11.07
"as.list" 0.82 11.07 0.00 0.00
14/17
By Self
$by.self
self.time self.pct total.time total.pct
"lm.fit" 2.99 40.35 3.50 47.23
"as.list.data.frame" 0.82 11.07 0.82 11.07
"[.data.frame" 0.79 10.66 1.03 13.90
"structure" 0.73 9.85 0.73 9.85
"na.omit.data.frame" 0.49 6.61 1.30 17.54
"list" 0.46 6.21 0.46 6.21
"lm" 0.30 4.05 7.41 100.00
"model.matrix.default" 0.27 3.64 0.79 10.66
"na.omit" 0.24 3.24 1.54 20.78
"as.character" 0.18 2.43 0.18 2.43
"model.frame.default" 0.12 1.62 2.24 30.23
"anyDuplicated.default" 0.02 0.27 0.02 0.27
15/17
summaryRprof() Output
$sample.interval
[1] 0.02
$sampling.time
[1] 7.41
16/17
Summary
· Rprof() runs the profiler for performance analysis of R code
· summaryRprof() summarizes the output of Rprof() and gives percent of time spent in each
function (with two types of normalization)
· Good to break your code into functions so that the profiler can give useful information about
where time is being spent
17/17