R Programming – An Approach to Data Analytics
Dr. G. Sudhamathy
FOREWORD
It is my immense happiness to pen this foreword for a book that is quite impressive for anyone interested in R programming. It is equally a joy to have a book written by experts, Dr. G. Sudhamathy and Dr. C. Jothi Venkateswaran. When a book can teach you and guide you as you work hands-on with the tool, you are in the right direction in your learning path.
One can be definitively sure that this book will be of great help and guidance for learners carrying out their work on analytics using R, whether in research, in practice, or just to learn the tool.
Hopefully you can take the instructions provided in this book to get started in R programming for your next data analysis project, and do some exciting data visualization and data mining on your own.
Best wishes for this book to be a bestseller in academia, research and practice.
Dr. S. Justus
Associate Professor & Chair - Software Engineering Research Group
VIT University, Chennai
PREFACE
Huge volumes of data are being generated daily by many sources such as commercial enterprises, scientific domains and the general public. According to recent research, data production in 2020 will be 44 times greater than it was in 2010.
Data being a vital resource for business organizations and other domains like
education, health, manufacturing etc., its management and analysis is becoming
increasingly important. This data, due to its volume, variety and velocity, often
referred to as Big Data, also includes highly unstructured data in the form of
textual documents, web pages, graphical information and social media comments.
Since Big Data is characterised by massive sample sizes, high dimensionality and
intrinsic heterogeneity, traditional approaches to data management, visualisation
and analytics are no longer satisfactorily applicable. There is therefore an urgent
need for newer tools, better frameworks and workable methodologies for such
data to be appropriately categorised, logically segmented, efficiently analysed and
securely managed. This requirement has resulted in the emerging discipline of Data Science, which is now gaining much attention among researchers and practitioners in the field of Data Analytics.
This book introduces the R programming language and makes it easy to approach for anyone. The chapters are designed so that the first 4 chapters target beginners and the next 3 chapters target advanced learners. The book also provides the reader with a list of all packages and functions used in this book, along with the page numbers where they are used. Every concept discussed in the various sections of this book has a proper example with a set of code and its results (as text or as graphs).
The book is organized into 7 chapters, and the concepts discussed in each chapter are detailed below.
Chapter 1 introduces the basics of R: installing and initiating R, getting help, variables and operators, packages, environments, functions, flow control and loops.
Chapter 2 discusses the basic data types in R: the primitive data types such as vectors, matrices and arrays, lists and factors. It also deals with the complex data types such as data frames, strings, dates and times. The chapter covers not only data creation but also basic operations on data of the different types.
Chapter 3 deals with data preparation: where to fetch datasets from, and how to import and export data from various sources of different types such as CSV files, XML files, etc. It also discusses ways of accessing various databases. Data cleaning and transformation techniques such as data reshaping and grouping functions are also outlined in this chapter.
Chapter 4 is about using the graphical features in R for exploratory data analysis.
It gives examples of pie charts, scatter plots, line plots, histograms, box plots and
bar plots using the various graphical packages such as base, lattice and ggplot2.
Chapter 5 deals with statistical analysis concepts using R, such as the basic statistical measures like mean, median, mode, standard deviation, variance and ranges. It discusses the normal and binomial distributions of data and how they can be viewed and analyzed using R. Then, the chapter explores complex statistical techniques such as correlation analysis, regression analysis, ANOVA and hypothesis testing, which can be implemented using R.
Chapter 6 details data mining using R.
Chapter 7 explores various essential case studies such as text analytics, credit risk analysis, social network analysis and a few exploratory data analyses. The main purpose of this chapter is to apply the basic and advanced concepts presented in the previous chapters of this book.
The author would like to express her special regards and thanks to Dr. G. P. Jeyanthi, Research and Consultancy Director, Dr. A. Parvathi, Dean, Faculty of Science and Dr. V. Radha, Head, Department of Computer Science, Avinashilingam University, Coimbatore, for their constant encouragement and support in turning this work into a useful product.
The author wishes to thank all the faculty members of the Department of
Computer Science, Avinashilingam University, Coimbatore, for their continuous
support and suggestions for this book.
We are grateful to the students and teacher community who kept us on our
toes with their constant bombardment of queries which prompted us to learn more,
simplify our learning and findings and place them neatly in a book.
Our special regards to the experts Mr. Sajeev Madhavan, Director of Architecture, Oracle, USA and Dr. S. Justus, Associate Professor, VIT, Chennai, who gave their expert opinions in shaping this book into a more appealing format.
Most importantly we would like to thank our family members without whose
support this book would not have been a reality.
Last, but not least, this work is dedicated to God, the Almighty, whose grace has been showered upon us in making our dream come true.
G. Sudhamathy
C. Jothi Venkateswaran
Chapter 1 Basics of R 1
Chapter 2 Data Types in R 27
Chapter 3 Data Preparation 83
Chapter 4 Graphics using R 117
Chapter 5 Statistical Analysis Using R 141
Chapter 6 Data Mining Using R 177
Chapter 7 Case Studies 233
Glossary 299
Packages Used 309
Functions Used 313
References 359
Books 359
Websites 359
Index 361
CHAPTER 1
BASICS OF R
OBJECTIVES
1.1. Introducing R
R is a programming language, and R also refers to the software that is used to run R programs. Ross Ihaka and Robert Gentleman from the University of Auckland created the R language in the 1990s. The R language is based on the S language, which was developed by John Chambers at Bell Laboratories in the 1970s. The R software is a GNU project: free and open-source software. R (language and software) is developed by the R Core Team. R has evolved over the past three to four decades, with its history originating in the 1970s.
One can write a new package in R if the existing packages are not sufficient for one's use. R is a high-level scripting language that need not be compiled; it is an interpreted language. R is an imperative language, yet it also supports object-oriented programming.
The R language allows the user to program loops to successively analyze several data sets. It is also possible to combine different statistical functions in a single program to perform more complex analyses. R users benefit from a large number of programs written for R and available on the internet. At first, R can look very complex to a beginner or non-specialist, but this is not actually true, as a prominent feature of R is its flexibility. R displays the results of an analysis immediately, and these results are stored in “objects” so that further analysis can be done on them. The user can also extract just the part of a result which is of interest.
Looking at the features of R, some users may think, “I can't write programs using R”. But this is not the case, for two reasons. First, R is an interpreted language and not a compiled one, which means that all commands typed on the keyboard are directly executed without the need to build a complete program as in C, C++ or Java. Second, R's syntax is very simple and intuitive. In R, a function is always written with parentheses, e.g. ls(). If only the name of the function is typed, R displays the content of the function. In this book the functions are written with their names followed by parentheses to distinguish them from other objects. While R is running, variables, data, functions, results, etc., are stored in the active memory of the computer in the form of named objects. The user can act on these objects with operators and functions.
1.2. Installing R
R is available in several forms: as source code, essentially for Unix and Linux machines, or as pre-compiled binaries for Windows, Linux and Macintosh. The files needed to install R, either from the source or from the pre-compiled binaries, are distributed from the internet site of the Comprehensive R Archive Network (CRAN), where the instructions for installation are also available.
1.3. Initiating R
Open the R GUI, find the command prompt, type the command below and hit Enter to run it.
> sum(1:5)
[1] 15
The result above shows that the command gives the result 15. That is, the command has taken as input the integers from 1 to 5 and performed the sum operation on them. In the command above, sum() is a function that takes the argument 1:5, which denotes a vector consisting of the sequence of integers from 1 to 5. Like any other command prompt, R also allows the use of the up arrow key to recall previous commands.
1.3.2. Help in R
There are many ways to get help in R. If a function name or a dataset name is known, then we can type ? followed by the name. If the name is not known, then we can type ?? followed by a search term.
The same help can be obtained by the functions help() and help.search(). In these functions the arguments have to be enclosed in quotes.
> help(“mean”)
> help(“+”)
> help(“if ”)
> help.search(“plotting”)
> help.search(“regression model”)
Variable names consist of letters, numbers, dots and underscores, but a variable name must start with a letter. Variable names should not be reserved words. To create global variables (variables available everywhere) we use the symbol “<<-”.
> X <<- exp(exp(1))
Assignment can also be done using the assign() function. For global assignment the same assign() function can be used, but with the extra argument globalenv(). To see the value of a variable, simply type the variable name at the command prompt. The same can be done using the print() function.
> assign(“F”, 3 * 8)
> assign(“G”, 6 * 9, globalenv())
> F
[1] 24
> print(G)
[1] 54
If assignment and printing of a value have to be done in one line, we can do it in two ways: first, by separating the two statements with a semicolon; second, by wrapping the assignment in parentheses, as below.
> L <- sum(4:8); L
[1] 30
> (M <- sum(5:9))
[1] 35
The “+” plus operator performs addition. It can be used to add two numbers or two vectors. A vector represents an ordered set of values; vectors are mainly used to analyse statistical data. The “:” colon operator creates a sequence, which is a series of numbers within the given limits.
The “c()” function concatenates the values given within the brackets “(” and “)”. Variable names in R are case sensitive. Open the R GUI, find the command prompt, type the command below and hit Enter to run it.
> 7:12 + 12:17
[1] 19 21 23 25 27 29
> c(3, 1, 8, 6, 7) + c(9, 2, 5, 7, 1)
[1] 12 3 13 13 8
The vectors and the c() function in R help us avoid loops, since the statistical functions in R can take vectors as input and produce results. The sum() function accepts values either as a vector or as separate arguments, but mean() uses only its first argument, and median() throws an error when the values are passed as separate arguments instead of a vector.
> sum(7:10)
[1] 34
> mean(7:10)
[1] 8.5
> median(7:10)
[1] 8.5
> sum(7,8,9,10)
[1] 34
> mean(7,8,9,10)
[1] 7
> median(7,8,9,10)
Error in median(7, 8, 9, 10) : unused arguments (9, 10)
Similar to the “+” plus operator, all other operators in R take vectors as inputs and produce results. The subtraction and multiplication operations work as below.
> c(5, 6, 1, 9) - 2
[1] 3 4 -1 7
> c(5, 6, 1, 9) - c(4, 2, 0, 7)
[1] 1 4 1 2
> -1:4 * -2:3
[1] 2 0 0 2 6 12
> -1:4 * 3
[1] -3 0 3 6 9 12
The exponentiation operator is represented using the symbol “^” or “**”. That the two are the same can be checked using the function identical().
> identical(2^3, 2**3)
[1] TRUE
The other mathematical functions are the trigonometric functions like sin(), cos(), tan(), asin(), acos(), atan() and the logarithmic and exponential functions like log(), exp(), log1p(), expm1(). All these mathematical functions can operate on vectors as well as individual elements. A few more examples of the mathematical functions are shown below.
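For instance, a brief sketch applying a few of them:
> log(c(1, exp(1), exp(2)))
[1] 0 1 2
> sin(pi/2)
[1] 1
> log1p(0)
[1] 0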
The operator “==” is used for comparing two values; for checking inequality of values the operator “!=” is used. These are called the relational operators, and they also take vectors as input and operate on them. The other relational operators are “<”, “>”, “<=” and “>=”.
> c(2, 4 - 2, 1 + 1) == 2
[1] TRUE TRUE TRUE
The equality operator “==” can also be used to compare strings, but string comparison is case sensitive. Similarly, the operators “<” and “>” can also be used on strings. The examples below show the results.
> c(“Week”, “WEEK”, “week”, “weak”) == “week”
[1] FALSE FALSE TRUE FALSE
1.4. Packages in R
R packages are distributed through an online repository called CRAN (the Comprehensive R Archive Network). A package is a collection of R functions and datasets. Currently, the CRAN package repository features 10756 available packages. The list of all available packages in the CRAN repository can be viewed at “https://cran.r-project.org/web/packages/available_packages_by_name.html”. To find the list of functions available in a package (say the package “stats”) we can use the command ls(“package:stats”) or the command library(help = stats) at the command prompt.
A library is a folder on the machine that stores the files for a package. If a package is already installed on a machine we can load it using the library() function. The name of the package to be loaded is passed to the library() function as an argument without enclosing it in quotes. If the package name has to be passed programmatically to the library() function, then we need to set the argument character.only = TRUE. If a package is not installed and the library() function is used to load it, an error is thrown. Alternatively, if the require() function is used to load a package, it returns TRUE if the package is installed and FALSE if it is not.
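A minimal sketch of these loading styles (assuming the cluster package is installed, as it is on the search path shown below):
> library(cluster)
> pkg <- "cluster"
> library(pkg, character.only = TRUE)
> ok <- require(cluster)
> ok
[1] TRUE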
We can list all the packages that are already loaded using the search() function. This list shows the global environment first, followed by the recently loaded packages. The last two are special environments, namely “Autoloads” and the “base” package.
> search()
[1] “.GlobalEnv” “package:cluster” “tools:rstudio” “package:stats”
[5] “package:graphics” “package:grDevices” “package:utils” “package:datasets”
[9] “package:methods” “Autoloads” “package:base”
Apart from CRAN, there are a handful of other package repositories that need special attention. To access additional repositories, type setRepositories() and select the repositories required. The repositories R-Forge and rforge.net contain the development versions of packages that appear on the CRAN repository. The function available.packages() lists the thousands of packages in each of the selected repositories. (Note: the View() function can be used to avoid fetching thousands of packages at one go.)
> setRepositories()
--- Please select repositories for use in this session ---
1: + CRAN
2: BioC software
3: BioC annotation
4: BioC experiment
5: BioC extra
6: CRAN (extras)
7: Omegahat
8: R-Forge
9: rforge.net
10: + CRANextra
Enter one or more numbers separated by spaces, or an empty line to cancel
1:
There are many online repositories like GitHub, Bitbucket, and Google Code from which many R packages can be retrieved. A package can be installed using the install.packages() function, mentioning the name of the package as the argument; it is necessary to have an internet connection and write permission to the hard drive to install any package. To update the installed packages to their latest versions, we use the function update.packages() with the argument ask = FALSE, which disables prompting before updating each package. To delete a package already installed, we use the function remove.packages(), passing the name of the package to be removed as the argument.
> install.packages(“chron”)
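The corresponding update and removal calls take the forms below (a sketch; remove.packages() here would delete the chron package just installed):
> update.packages(ask = FALSE)
> remove.packages("chron")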
1.5.1. Environments
To create a new environment we use the function new.env(). We can then assign variables into the newly created environment using double square brackets or the dollar operator, as below.
> newenvironment <- new.env()
> newenvironment[[“variable1”]] <- c(4, 7, 9)
> newenvironment$variable2 <- TRUE
> assign(“variable3”, “Value for variable3”, newenvironment)
The functions ls() and ls.str() take an environment argument and list its contents. We can test whether a variable exists in an environment using the exists() function.
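For instance, a brief sketch using the environment created above:
> ls(newenvironment)
[1] "variable1" "variable2" "variable3"
> exists("variable2", newenvironment)
[1] TRUE
> exists("nosuchvariable", newenvironment)
[1] FALSE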
An environment can be converted into a list using the function as.list() and a
list can be converted into an environment using the function as.environment() or
the function list2env().
> newlist <- as.list(newenvironment)
> newlist
$variable3
[1] “Value for variable3”
$variable1
[1] 4 7 9
$variable2
[1] TRUE
> as.environment(newlist)
<environment: 0x124730a8>
> list2env(newlist)
<environment: 0x12edf3e8>
> anotherenv <- as.environment(newlist)
> anotherenv[[“variable3”]]
[1] “Value for variable3”
All environments are nested, so every environment has a parent environment. The empty environment sits at the top of the hierarchy without any parent. The exists() and get() functions also look for variables in the parent environments; to change this behaviour we need to pass the argument inherits = FALSE.
> subenv <- new.env(parent = newenvironment)
> exists(“variable1”, subenv)
[1] TRUE
> exists(“variable1”, subenv, inherits = FALSE)
[1] FALSE
The word frame is used interchangeably with the word environment. The function to refer to the parent environment is parent.frame(). Variables assigned from the command prompt are stored in the global environment. The functions and variables from R's base package are stored in the base environment.
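A small sketch of these ideas (f is an illustrative function defined here; called from the prompt, its parent frame is the global environment):
> f <- function() parent.frame()
> f()
<environment: R_GlobalEnv>
> globalenv()
<environment: R_GlobalEnv>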
1.5.2. Functions
A function together with its environment is called a closure. When we load a package, the functions in that package are stored in the environment on the search path where the package is loaded. A function in R is a verb and not a noun, as it does things with its data. Functions are also another data type, and hence we can assign them, manipulate them and pass them as arguments to other functions. Typing a function name at the command prompt lists the code associated with the function. Below is the code listed for the function readLines().
> readLines
function (con = stdin(), n = -1L, ok = TRUE, warn = TRUE, encoding = “unknown”,
skipNul = FALSE)
{
if (is.character(con)) {
con <- file(con, “r”)
on.exit(close(con))
}
.Internal(readLines(con, n, ok, warn, encoding, skipNul))
}
When we call a function by passing values to it, those values are called arguments. The lines of code of the function can be seen between the curly braces as the body of the function. In R, there is no explicit return statement needed to return values: the last value calculated in a function is returned by default.
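For instance, the sketch below (add is a function defined here purely for illustration) shows the last evaluated value being returned:
> add <- function(a, b)
+ {
+ a + b
+ }
> add(2, 3)
[1] 5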
The functions formals(), args() and formalArgs() can fetch the arguments
defined for a function. The body of the function can be retrieved using the body()
and deparse() functions.
> cube <- function(x)
+ {
+ cu <- x^3
+ }
> formals(cube)
$x
> args(cube)
function (x)
NULL
> formalArgs(cube)
[1] “x”
> body(cube)
{
cu <- x^3
}
> deparse(cube)
[1] “function (x) “ “{“ “ cu <- x^3” “}”
R will search for a variable in the current environment, and if it cannot find it there, it will check in the parent environment. This search proceeds upwards until the variable is searched for in the global environment. Variables defined in the global environment are called global variables, and they can be accessed from anywhere. The replicate() function can be used to run a function several times, as below. Here the user-defined function random() returns 1 if the value returned by rnorm() is positive; otherwise it returns the value of the argument passed to random(). This function random() is called 20 times using the replicate() function.
> random <- function(x)
+{
+ if(rnorm(1) > 0)
+ {r <- 1}
+ else
+ {r <- x}
+}
> replicate(20, random(5))
[1] 5 5 1 1 5 1 5 5 5 5 5 5 5 5 5 1 1 5 1 5
1.6. Flow Control
In some situations it may be required to execute some code only if a condition is satisfied. The if statement takes a logical value and executes the next statement only if the value is TRUE.
> if(TRUE) message(“TRUE Statement”)
TRUE Statement
> if(FALSE) message(“FALSE Statement”)
In the if and else construct, the code that follows the if statement is executed if the condition is TRUE, and the code that follows the else statement is executed if the condition is FALSE. It is important to note that the else keyword must occur on the same line as the closing curly brace of the if block; otherwise an error is thrown.
a <- 8
if(a < 7)
{
b <- a * 5
c <- b * 3
message(“b is “, b)
message(“c is “, c)
} else
{
message(“a is greater than 7”)
}
a is greater than 7
The if and else statements can be used repeatedly to code multiple conditions and their respective actions. In this case it is important to note that the if and else keywords are separate words, not one word as in ifelse. The ifelse() function has a different use, which will be covered shortly.
a <- -8
if(a < 0)
{
message(“a is negative”)
} else if(a == 0)
{
message(“a is zero”)
} else if(a > 0)
{
message(“a is positive”)
}
a is negative
The ifelse() function takes three arguments: the first is a logical condition, the second is the value returned where the first vector is TRUE, and the third is the value returned where the first vector is FALSE.
> a <- 3
> b <- 5
> ifelse(a < b, “a is less than b”, “a is greater than b”)
[1] “a is less than b”
If there are many else statements, the code looks confusing, and in such cases the switch() function is useful. The first argument of switch() is an expression that returns a string value or an integer. This is followed by several named arguments that provide the results when the name matches the value of the first argument. Here also we can execute multiple statements enclosed in curly braces. If there is no match, switch() returns NULL, so it is safer to mention a default value in case nothing matches.
> switch(“color”,”color” = “red”, “shape” = “circle”, “radius” = 10)
[1] “red”
> switch(“position”,”color” = “red”, “shape” = “circle”, “radius” = 10)
[1] NULL
> switch(“position”,”color” = “red”, “shape” = “circle”, “radius” = 10,”default”)
[1] “default”
> switch(2,”red”,”green”,”blue”)
[1] “green”
1.7. Loops
There are three kinds of loops in R namely, repeat, while and for.
The repeat loop is the simplest loop in R: it executes the same code until it is forced to stop. repeat is similar to the do-while statement in other languages. A break statement can be given when it is required to break out of the loop. It is also possible to skip the rest of the statements in an iteration and begin the next iteration, using the next statement.
a <- 1
repeat {
message(“Inside the loop”)
if(a == 3)
{
a=a+1
next
}
message(“The value of a is “, a)
a=a+1
if(a > 5)
{
message(“Exiting the loop”)
break
}
}
Inside the loop
The value of a is 1
Inside the loop
The value of a is 2
Inside the loop
Inside the loop
The value of a is 4
Inside the loop
The value of a is 5
Exiting the loop
The while loop is the reverse of the repeat loop: the repeat loop executes the code and then checks the condition, whereas the while loop checks the condition first and then executes the code. So it is possible that the code may not be executed even once, when the condition fails at entry on the first iteration. The same example above can be written using the while statement.
a <- 1
while (a <= 5)
{
message(“Inside the loop”)
if(a == 3)
{
a=a+1
next
}
message(“The value of a is “, a)
a=a+1
}
Inside the loop
The value of a is 1
Inside the loop
The value of a is 2
Inside the loop
Inside the loop
The value of a is 4
Inside the loop
The value of a is 5
The for loop is used when we know how many times the code needs to be repeated. The for loop accepts an iterating variable and a vector, and it repeats the loop giving the iterating variable each element of the vector in turn. Here also, if there are multiple statements to execute, we can use curly braces. The iterating vector can be an integer, numeric, character or logical vector, or even a list.
for(i in 1:5)
{
j <- i * i
message(“The square value of “, i, “ is “, j)
}
The square value of 1 is 1
The square value of 2 is 4
The square value of 3 is 9
The square value of 4 is 16
The square value of 5 is 25
for(i in c(TRUE, FALSE, NA))
{
message(“This Statement is “, i)
}
This Statement is TRUE
This Statement is FALSE
This Statement is NA
a <- c(1,2,3)
b <- c(“a”,”b”,”c”,”d”)
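The loop below (a minimal sketch completing this example) iterates over a list built from these two vectors, since the iterating variable can also take list elements:
for(i in list(a, b))
{
message("The length of this element is ", length(i))
}
The length of this element is 3
The length of this element is 4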
HIGHLIGHTS
R is a free open source language that has cross platform compatibility.
R’s syntax is very simple and intuitive.
R’s installation software can be downloaded from the CRAN Website.
Help in R can be obtained using, e.g., ?mean or help(“mean”).
Variables can be assigned using the symbol “<-” (or “<<-” for globals) or the assign() function.
The basic functions are c(), sum(), mean(), median(), exp(), sqrt() etc.
The basic operators are “+”, “-”, “*”, “/”, “:”, “^”, “**”, “%%”, “%/%”,
“==”, “!=”, “<”, “>”, “<=”, “>=” etc.
Currently, the CRAN package repository features 10756 available packages.
A Package can be newly installed using the function install.packages()
and it can be invoked using the function library().
When a variable is assigned in the command prompt, it goes by default
into the global environment.
To create a new environment we use the function new.env().
Typing the function name in the command prompt lists the code
associated with the function.
The if and the else statements are separated and they are not one word as
ifelse.
The ifelse() function takes three arguments.
If there are many else statements, the switch() function is required.
CHAPTER 2
DATA TYPES IN R
OBJECTIVES
R has many types of R-objects. The frequently used ones are Vectors, Arrays, Matrices, Lists, Data Frames, Strings and Factors.
The simplest of these objects is the vector, and there are six data types of these atomic vectors, also termed the six classes of vectors. The other R-objects are built upon the atomic vectors. Hence, the basic data types in R are Numeric, Integer, Complex, Logical, Character and Raw.
2.1. Basic Data Types
2.1.1. Numeric
Decimal values are called numeric in R. It is the default computational data type. If
we assign a decimal value to a variable x as follows, x will be of numeric type.
> x = 10.5
> x
[1] 10.5
> class(x) # print the class name of x
[1] “numeric”
2.1.2. Integer
We can force a numeric value into an integer with the same as.integer() function
as below.
> as.integer(3.14)
[1] 3
The integer values of the logical values TRUE and FALSE are 1 and 0
respectively.
> as.integer(TRUE)
[1] 1
> as.integer(FALSE)
[1] 0
2.1.3. Complex
A complex value in R is defined via the pure imaginary value i, as below.
> z = 3 + 4i
> z
[1] 3 + 4i
> class(z)
[1] “complex”
Finding the square root of -1 gives an error. But if it is first converted into a complex number and then the square root is applied, it produces the expected result as another complex number.
> sqrt(-1)
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
> sqrt(as.complex(-1))
[1] 0+1i
2.1.4. Logical
When two variables are compared, the logical values are created. The logical
operators are “&” (and), “|” (or), and “!” (negation).
> a = 4; b = 7
> p=a>b
> p
[1] FALSE
> class(p)
[1] “logical”
> a = TRUE; b = FALSE
> a&b
[1] FALSE
> a| b
[1] TRUE
> !a
[1] FALSE
2.1.5. Character
The character object is used to represent string values in R. Objects can be converted into character values using the as.character() function. The paste() function can be used to concatenate two character values.
> s = as.character(“7.48”)
> s
[1] “7.48”
> class(s)
[1] “character”
> fname = “Adam”
> lname = “Smith”
> paste(fname, lname)
[1] “Adam Smith”
However, a formatted string can be created using the sprintf() function, whose syntax is similar to that of the C language.
> sprintf(“%s has %d rupees”, “Sundar”,1000)
[1] “Sundar has 1000 rupees”
The substr() function can be used to extract a substring from a given string.
The sub() function is used to replace the first occurrence of a string with another
string as below.
> substr(“Twinkle Twinkle Little Star”, start = 9, stop = 15)
[1] “Twinkle”
> sub(“Twinkle”, “Wrinkle”, “Twinkle Twinkle Little Star”)
[1] “Wrinkle Twinkle Little Star”
2.2. Vectors
A sequence of data elements of the same basic type is called a vector. The elements of a vector are called components or members. The vector() function creates a vector of a specified type and length; the result is filled with zeroes, FALSE values or empty strings, depending on the type.
> vector(“numeric”, 3)
[1] 0 0 0
> vector(“logical”, 5)
[1] FALSE FALSE FALSE FALSE FALSE
> vector(“character”, 2)
[1] “” “”
The commands below produce the same results as the commands above.
> numeric(3)
[1] 0 0 0
> logical(5)
[1] FALSE FALSE FALSE FALSE FALSE
> character(2)
[1] “” “”
The seq() function allows sequences to be generated. The function seq.int() also creates a sequence from one number to another, but provides more options for stepping through the sequence.
> seq(1:5)
[1] 1 2 3 4 5
> seq.int(5, 12)
[1] 5 6 7 8 9 10 11 12
> seq.int(10, 5, -1.5)
[1] 10.0 8.5 7.0 5.5
The function seq_len() creates a sequence from 1 to the input value. The
function seq_along() creates a sequence from 1 to the length of the input.
> seq_len(7)
[1] 1 2 3 4 5 6 7
> p <- c(3, 4, 5, 6)
> seq_along(p)
[1] 1 2 3 4
The function length() can be used to find the length of a vector, that is, the number of elements in the vector. Using this function it is also possible to assign a new length to a vector; if the vector is extended, NA(s) will be added at the end.
> length(1:7)
[1] 7
> length(c(“aa”, “ccc”, “eeee”))
[1] 3
> nchar(c(“aa”, “ccc”, “eeee”))
[1] 2 3 4
> s <- c(1,2,3,4,5)
> length(s) <- 3
> s
[1] 1 2 3
> length(s) <- 8
> s
[1] 1 2 3 NA NA NA NA NA
Each element of a vector can be given a name during the vector's creation itself. If there are spaces or special characters in a name, it needs to be enclosed in quotes. The names() function can be used to name the vector elements after creation.
> c(a = 1, b = 2, c = 3)
a b c
1 2 3
> s <- 1:3
> s
[1] 1 2 3
> names(s) <- c(“a”, “b”, “c”)
> s
a b c
1 2 3
Elements of a vector can be accessed using indexes specified in square brackets. Index numbering starts from 1, not 0. Specifying a negative number as an index returns all the elements except the one specified. The name of a vector element can also be specified as an index to fetch it.
> x <- c(1:5)
> x
[1] 1 2 3 4 5
> x[c(2,3)]
[1] 2 3
> x[c(-1,-4)]
[1] 2 3 5
> s <- 1:3
> s
[1] 1 2 3
> names(s) <- c(“a”, “b”, “c”)
> s[“b”]
b
2
If an out-of-range index is specified when accessing a vector element, the result is NA. Non-integer indices are truncated towards zero. Passing no index to a vector returns all the elements of the vector.
> x
[1] 1 2 3 4 5
> x[7]
[1] NA
The which() function returns the indices of the vector elements that satisfy the condition specified within the function. The functions which.min() and which.max() return the positions of the minimum and maximum elements of the vector.
> x
[1] 1 2 3 4 5
> which.min(x)
[1] 1
> which.max(x)
[1] 5
> which(x>3)
[1] 4 5
Vectors can be combined using the c() function. When the two vectors below are combined, the numeric values are coerced into character values, since all the members of a vector must be of the same basic type.
> f = c(7, 5, 9)
> g = c(“aaa”, “bbb”, “ccc”)
> c(f, g)
[1] “7” “5” “9” “aaa” “bbb” “ccc”
> x = c(5, 8, 9)
> y = c(2, 6, 9)
> 4*y
[1] 8 24 36
> x + y
[1] 7 14 18
> x-y
[1] 3 2 0
> x*y
[1] 10 48 81
> x/y
[1] 2.500000 1.333333 1.000000
> v = c(1, 2, 3, 4, 5, 6)
> x + v
[1] 6 10 12 9 13 15
When two vectors of unequal length are added, the shorter vector is recycled; here x is repeated to match the length of v.
The rep() function creates a vector of repeated elements. It has variants rep.int() and rep_len(), whose usage is as given below.
> rep(1:3, 4)
[1] 1 2 3 1 2 3 1 2 3 1 2 3
> rep(1:3, each = 4)
[1] 1 1 1 1 2 2 2 2 3 3 3 3
> rep(1:3, times = 1:3)
[1] 1 2 2 3 3 3
> rep(1:3, length.out = 9)
[1] 1 2 3 1 2 3 1 2 3
> rep.int(1:3, 4)
[1] 1 2 3 1 2 3 1 2 3 1 2 3
> rep_len(1:3, 9)
[1] 1 2 3 1 2 3 1 2 3
2.3. Arrays and Matrices
Matrices are created using the matrix() function, passing the nrow or ncol argument instead of the dim argument used for arrays. A matrix can also be created using the array() function where the dimension of the array is two.
> m <- matrix(1:12, nrow = 3, dimnames = list(c(“a”, “b”, “c”), c(“d”, “e”, “f ”, “g”)))
> m
d e f g
a 1 4 7 10
b 2 5 8 11
c 3 6 9 12
> m1 <- array(1:12, dim = c(3,4),
dimnames = list(c(“a”, “b”, “c”), c(“d”, “e”, “f ”, “g”)))
> m1
d e f g
a 1 4 7 10
b 2 5 8 11
c 3 6 9 12
The argument byrow = TRUE in the matrix() function assigns the elements
row wise. If this argument is not specified, by default the elements are filled
column wise.
> m <- matrix(1:12, nrow = 3, byrow = TRUE,
dimnames = list(c(“a”, “b”, “c”), c(“d”, “e”, “f ”, “g”)))
The dim() function returns the dimensions of an array or a matrix. The functions nrow() and ncol() return the number of rows and columns of a matrix respectively. Consider a three-dimensional array x created as below.
> x <- array(1:24, dim = c(4, 3, 2),
dimnames = list(c(“a”, “b”, “c”, “d”), c(“e”, “f”, “g”), c(“h”, “i”)))
> dim(x)
[1] 4 3 2
> dim(m)
[1] 3 4
> nrow(m)
[1] 3
> ncol(m)
[1] 4
The length() function also works for matrices and arrays. It is also possible to assign new dimensions to a matrix or an array using the dim() function.
> length(x)
[1] 24
> length(m)
[1] 12
> dim(m) <- c(6,2)
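A quick check of the reshaped matrix (note that reassigning dim() drops the dimension names):
> dim(m)
[1] 6 2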
The functions rownames(), colnames() and dimnames() can be used to fetch the
row names, column names and dimension names of matrices and arrays respectively.
> rownames(m1)
[1] “a” “b” “c”
> colnames(m1)
[1] “d” “e” “f ” “g”
> dimnames(x)
[[1]]
[1] “a” “b” “c” “d”
[[2]]
[1] “e” “f ” “g”
[[3]]
[1] “h” “i”
It is possible to extract the element at the nth row and mth column of a matrix M using the expression M[n, m]. The entire nth row can be extracted using M[n, ], and similarly the mth column can be extracted using M[, m]. It is also possible to extract more than one row or column. Consider the matrix M created as below.
> M <- matrix(1:9, nrow = 3, byrow = TRUE)
> M[2,3]
[1] 6
[1] 6
> M[2,]
[1] 4 5 6
> M[,3]
[1] 3 6 9
> M[,c(1,3)]
[,1] [,2]
[1,] 1 3
[2,] 4 6
[3,] 7 9
> M[c(1,3),]
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 7 8 9
The columns of two matrices can be combined using the cbind() function and
similarly the rows of two matrices can be combined using the rbind() function.
> M1 = matrix(c(2,4,6,8,10,12), nrow=3, ncol=2)
> M1
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
> M2 = matrix(c(3,6,9), nrow=3, ncol = 1)
> M2
[,1]
[1,] 3
[2,] 6
[3,] 9
> cbind(M1, M2)
[,1] [,2] [,3]
[1,] 2 8 3
[2,] 4 10 6
[3,] 6 12 9
> M3 = matrix(c(4,8), nrow=1, ncol=2)
> M3
[,1] [,2]
[1,] 4 8
> rbind(M1, M3)
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
[4,] 4 8
A matrix can be deconstructed using the c() function, which combines all its column vectors into one vector.
> c(M1)
[1] 2 4 6 8 10 12
> M1
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
> M2 = matrix(c(3,6,9,11,1,5), nrow=3, ncol = 2)
> M2
[,1] [,2]
[1,] 3 11
[2,] 6 1
[3,] 9 5
> M1 + M2
[,1] [,2]
[1,] 5 19
[2,] 10 11
[3,] 15 17
> M1 * M2
[,1] [,2]
[1,] 6 88
[2,] 24 10
[3,] 54 60
> M2 = matrix(c(3,6,9,11), nrow=2, ncol = 2)
> M2
[,1] [,2]
[1,] 3 9
[2,] 6 11
> M1 %*% M2
[,1] [,2]
[1,] 54 106
[2,] 72 146
[3,] 90 186
The power operator “^” also works element-wise on matrices, so M2^-1 gives the element-wise reciprocals. To find the matrix inverse, the function solve() is used.
> M2
[,1] [,2]
[1,] 3 9
[2,] 6 11
> M2^-1
[,1] [,2]
[1,] 0.3333333 0.11111111
[2,] 0.1666667 0.09090909
> solve(M2)
[,1] [,2]
[1,] -0.5238095 0.4285714
[2,] 0.2857143 -0.1428571
2.4. Lists
Lists allow us to combine different data types in a single variable. Lists can be created using the list() function, which is similar to the c() function: the contents of the list are simply listed within list() as arguments separated by commas. A list element can be a vector, a matrix or even a function. It is possible to name the elements of the list during creation, or later using the names() function.
> L <- list(c(9,1, 4, 7, 0), matrix(c(1,2,3,4,5,6), nrow = 3))
> L
[[1]]
[1] 9 1 4 7 0
[[2]]
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> L <- list(Num = c(9,1, 4, 7, 0), Mat = matrix(c(1,2,3,4,5,6), nrow = 3))
> L
$Num
[1] 9 1 4 7 0
$Mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
Lists can be nested; that is, a list can be an element of another list. But vectors, arrays and matrices are not recursive (nested): they are atomic. The functions is.recursive() and is.atomic() show whether a variable type is recursive or atomic respectively.
> is.atomic(list())
[1] FALSE
> is.recursive(list())
[1] TRUE
> is.atomic(L)
[1] FALSE
> is.recursive(L)
[1] TRUE
> is.atomic(matrix())
[1] TRUE
> is.recursive(matrix())
[1] FALSE
The length() function works on lists as it does on vectors and matrices, but the dim(), nrow() and ncol() functions return NULL.
> length(L)
[1] 2
> dim(L)
NULL
> nrow(L)
NULL
> ncol(L)
NULL
Arithmetic operations on a list are possible only if its elements are of the same data type, and generally this is not recommended. As in vectors, the elements of a list can be accessed by indexing with square brackets. The index can be a positive number, a negative number, an element name or a logical vector.
> L1 <- list(l1 = c(8, 9, 1), l2 = matrix(c(1,2,3,4), nrow = 2),
l3 = list( l31 = c(“a”, “b”), l32 = c(TRUE, FALSE) ))
> L1
$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4
$l3
$l3$l31
[1] “a” “b”
$l3$l32
[1] TRUE FALSE
> L1[1:2]
$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4
> L1[-3]
$l1
[1] 8 9 1
$l2
[,1] [,2]
[1,] 1 3
[2,] 2 4
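Indexing by element name or by a logical vector works likewise; a brief sketch:
> L1["l1"]
$l1
[1] 8 9 1
> L1[c(TRUE, FALSE, FALSE)]
$l1
[1] 8 9 1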
Suppose a list t contains copies of the vectors a, b and d. A list slice is retrieved using single square brackets []. Below, t[2] contains a slice holding a copy of b. A slice with multiple members can also be retrieved.
> t[2]
[[1]]
[1] “abc” “def ” “ghi” “jkl” “mno”
> t[c(2,4)]
[[1]]
[1] “abc” “def ” “ghi” “jkl” “mno”
[[2]]
[1] 5
To reference a list member directly, double square brackets [[]] are used. Thus t[[2]] retrieves the second member of the list t. This is a copy of b, not a slice. It is also possible to modify the retrieved contents directly, but the contents of b itself are unaffected.
> t[[2]]
[1] “abc” “def ” “ghi” “jkl” “mno”
> t[[2]][1] = “qqq”
> t[[2]]
[1] “qqq” “def ” “ghi” “jkl” “mno”
> b
[1] “abc” “def ” “ghi” “jkl” “mno”
We can assign names to list members and reference them by name instead of numeric index. A list of two members named “first” and “second” is given as an example below. The list slice containing the member “first” can be retrieved using single square brackets [] as shown below.
> l = list(first=c(1,2,3), second=c(“a”,”b”, “c”))
> l
$first
[1] 1 2 3
$second
[1] “a” “b” “c”
> l[“first”]
$first
[1] 1 2 3
The named list member can also be directly referenced with the $ operator or
double square brackets [[]] as below.
> l$first
[1] 1 2 3
> l[[“first”]]
[1] 1 2 3
A vector can be converted into a list using the as.list() function.
> v <- c(7, 3, 9, 2, 6)
> as.list(v)
[[1]]
[1] 7
[[2]]
[1] 3
[[3]]
[1] 9
[[4]]
[1] 2
[[5]]
[1] 6
A list of strings can be converted into a character vector using the as.character() function.
> L1 <- list(“aaa”, “bbb”, “ccc”)
> as.character(L1)
[1] “aaa” “bbb” “ccc”
> L1 <- list(l1 = c(78, 90, 21), l2 = c(11,22,33,44,55))
> L1
$l1
[1] 78 90 21
$l2
[1] 11 22 33 44 55
The unlist() function flattens a list into a named vector.
> unlist(L1)
l11 l12 l13 l21 l22 l23 l24 l25
 78  90  21  11  22  33  44  55
The c() function can also be used to combine lists as we do for vectors.
> L1 <- list(l1 = c(78, 90, 21), l2 = c(11,22,33,44,55))
> L2 <- list(“aaa”, “bbb”, “ccc”)
> c(L1, L2)
$l1
[1] 78 90 21
$l2
[1] 11 22 33 44 55
[[3]]
[1] “aaa”
[[4]]
[1] “bbb”
[[5]]
[1] “ccc”
2.5. Data Frames
A data frame is used for storing data tables; it stores spreadsheet-like data. It is a list of vectors of equal length (not necessarily of the same basic data type). Consider a data frame df1 consisting of three vectors a, b and d.
> a = c(1, 2, 3)
> b = c(“a”, “b”, “c”)
> d = c(TRUE, FALSE, TRUE)
> df1 = data.frame(a, b, d)
> df1
a b d
1 1 a TRUE
2 2 b FALSE
3 3 c TRUE
By default the row names are automatically numbered from 1 to the number
of rows in the data frame. It is also possible to provide row names manually using
the row.names argument as below.
> df1 = data.frame(a, b, d, row.names = c(“one”, “two”, “three”))
> df1
a b d
one 1 a TRUE
two 2 b FALSE
three 3 c TRUE
> nrow(df1)
[1] 3
> ncol(df1)
[1] 3
> dim(df1)
[1] 3 3
> length(df1)
[1] 3
> colnames(df1)
[1] “a” “b” “d”
The argument check.names can be set to FALSE so that the data frame does not force its column names to be syntactically valid.
> df3 <- data.frame(“BaD col” = c(1:5), “!@#$%^&*” = c(“aaa”))
> df3
BaD.col X........
1 1 aaa
2 2 aaa
3 3 aaa
4 4 aaa
5 5 aaa
There are many built-in data frames available in R (for example, mtcars). When this data frame is invoked at the R prompt, it produces the result below.
> mtcars
mpg cyl disp hp drat wt ...
Mazda RX4 21.0 6 160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 ...
Datsun 710 22.8 4 108 93 3.85 2.32 ...
............
The top line contains the header, or column names. Each subsequent line denotes a record, or row, of the table, and begins with the name of the row. Each data member of a row is called a cell. To retrieve a cell value, we enter the row and column numbers of the cell in square brackets [], separated by a comma. The cell value in the second row, third column is retrieved below. Row and column names can also be used inside the square brackets instead of the numbers.
> mtcars[2, 3]
[1] 160
> mtcars[“Mazda RX4 Wag”, “disp”]
[1] 160
The nrow() function gives the number of rows in a data frame, and ncol() gives the number of columns. To get a preview of the first few records of a data frame along with the header, the head() function can be used.
> nrow(mtcars)
[1] 32
> ncol(mtcars)
[1] 11
> head(mtcars)
mpg cyl disp hp drat wt ...
Mazda RX4 21.0 6 160 110 3.90 2.62 ...
......
To retrieve a column from a data frame we use double square brackets [[]] with the column name or number inside. The same can be achieved using the $ symbol, or with single brackets [] by giving a comma in place of the row index and the column name / number as the second index.
> mtcars[[“hp”]]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....
> mtcars[[4]]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....
> mtcars$hp
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....
> mtcars[,”hp”]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....
> mtcars[,4]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 ....
Similarly, if we use the column name or the column number inside a single
square bracket [], we get the below result.
> mtcars[4]
hp
Mazda RX4 110
Mazda RX4 Wag 110
Datsun 710 93
....
Multiple columns can be retrieved the same way by passing a vector of column names / numbers inside the single square brackets.
> mtcars[c(1,4)]
mpg hp
Mazda RX4 21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710 22.8 93
....
To retrieve a row from a data frame we use single square brackets [], mentioning the row name / number as the first index and a comma in place of the column index.
> mtcars[6,]
mpg cyl disp hp drat wt....
Valiant 18.1 6 225 105 2.76 3.46....
> mtcars[c(6,18),]
mpg cyl disp hp drat wt....
Valiant 18.1 6 225 105 2.76 3.46....
Fiat 128 32.4 4 78.7 66 4.08 2.20....
> mtcars[“Valiant”,]
mpg cyl disp hp drat wt....
Valiant 18.1 6 225 105 2.76 3.46....
Consider a data frame D created as below.
> x <- c(“a”, “b”, “c”, “d”, “e”, “f”)
> y <- c(3, 4, 7, 8, 12, 15)
> z <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
> D <- data.frame(x, y, z)
> D
x y z
1 a 3 TRUE
2 b 4 TRUE
3 c 7 FALSE
4 d 8 TRUE
5 e 12 FALSE
6 f 15 TRUE
As we have for matrices the transpose of a data frame can be obtained using
the t() function as below.
> t(D)
[,1] [,2] [,3] [,4] [,5] [,6]
x “a” “b” “c” “d” “e” “f ”
y “ 3” “ 4” “ 7” “ 8” “12” “15”
z “ TRUE” “ TRUE” “FALSE” “ TRUE” “FALSE” “ TRUE”
The functions rbind() and cbind() can also be applied to data frames as for matrices. The only condition for rbind() is that the column names must match; cbind() does not check, even if column names are duplicated.
> x1 <- c(“aaa”, “bbb”, “ccc”, “ddd”, “eee”, “fff ”)
> y1 <- c(9, 12, 17, 18, 23, 32)
> z1 <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
> E <- data.frame(x1, y1, z1)
> E
x1 y1 z1
1 aaa 9 TRUE
2 bbb 12 FALSE
3 ccc 17 TRUE
4 ddd 18 FALSE
5 eee 23 TRUE
6 fff 32 FALSE
> cbind(D, E)
x y z x1 y1 z1
1 a 3 TRUE aaa 9 TRUE
2 b 4 TRUE bbb 12 FALSE
3 c 7 FALSE ccc 17 TRUE
4 d 8 TRUE ddd 18 FALSE
5 e 12 FALSE eee 23 TRUE
6 f 15 TRUE fff 32 FALSE
> F <- data.frame(x, y = y1, z = z1)
> F
x y z
1 a 9 TRUE
2 b 12 FALSE
3 c 17 TRUE
4 d 18 FALSE
5 e 23 TRUE
6 f 32 FALSE
> rbind(D, F)
x y z
1 a 3 TRUE
2 b 4 TRUE
3 c 7 FALSE
4 d 8 TRUE
5 e 12 FALSE
6 f 15 TRUE
7 a 9 TRUE
8 b 12 FALSE
9 c 17 TRUE
The merge() function can be applied to merge two data frames, provided they have common column names. By default, merge() merges based on all the common columns; otherwise one of the common column names has to be specified.
> merge(D, F, by = “x”, all = TRUE)
x y.x z.x y.y z.y
1 a 3 TRUE 9 TRUE
2 b 4 TRUE 12 FALSE
3 c 7 FALSE 17 TRUE
4 d 8 TRUE 18 FALSE
5 e 12 FALSE 23 TRUE
6 f 15 TRUE 32 FALSE
The functions rowSums() and rowMeans() compute the row-wise sums and means of a numeric data frame (here applied to rows of a data frame G with three numeric columns).
> rowSums(G[1:3, ])
1 2 3
45 48 51
> rowMeans(G[2:4, ])
2 3 4
16 17 18
2.6. Factors
Factors are used to store categorical data, like gender (“male” or “female”); they behave sometimes like character vectors and sometimes like integer vectors, based on the context. Consider a data frame that stores the weights of a few males and females. The column that stores the gender is a factor, as it stores categorical data. The choices “female” and “male” are called the levels of the factor, which can be viewed using the levels() and nlevels() functions.
> weight <- data.frame(wt_kg = c(60,82,45, 49,52,75,68),
gender = c(“female”,”male”, “female”, “female”, “female”, “male”, “male”))
> weight
wt_kg gender
1 60 female
2 82 male
3 45 female
4 49 female
5 52 female
6 75 male
7 68 male
> weight$gender
[1] female male female female female male male
Levels: female male
> levels(weight$gender)
[1] “female” “male”
> nlevels(weight$gender)
[1] 2
At the atomic level, a factor is created using the factor() function, which takes a character vector as its argument.
> gender <- factor(c(“female”, “male”, “female”, “female”, “female”, “male”, “male”))
> gender
[1] female male female female female male male
Levels: female male
The levels argument of the factor() function can be used to specify the levels of the factor. It is also possible to change the levels once the factor is created, using the levels() function or the relevel() function; relevel() just specifies which level comes first.
> gender <- factor(c(“female”, “male”, “female”, “female”, “female”,
“male”, “male”), levels = c(“male”, “female”))
> gender
[1] female male female female female male male
Levels: male female
> levels(gender) <- c(“F”, “M”)
> gender
[1] M F M M M F F
Levels: F M
> relevel(gender, “M”)
[1] M F M M M F F
Levels: M F
It is possible to drop an unused level from a factor using the function droplevels(), as in the example below. [Note: the function is.na() is used to remove the missing value.]
> diet <- data.frame(eat = c(“fruit”, “fruit”, “vegetable”, “fruit”),
type = c(“apple”, “mango”, NA, “papaya”))
> diet
eat type
1 fruit apple
2 fruit mango
3 vegetable <NA>
4 fruit papaya
> diet <- subset(diet, !is.na(type))
> diet
eat type
1 fruit apple
2 fruit mango
4 fruit papaya
> diet$eat
[1] fruit fruit fruit
Levels: fruit vegetable
> levels(diet)
NULL
> levels(diet$eat)
[1] “fruit” “vegetable”
> unique(diet$eat)
[1] fruit
Levels: fruit vegetable
> diet$eat <- droplevels(diet$eat)
> levels(diet$eat)
[1] “fruit”
In some cases the levels need to be ordered, as in rating a product or a course. The ratings can be “Outstanding”, “Excellent”, “Very Good”, “Good”, “Bad”. When a factor is created with these levels, they are not necessarily ordered. To order the levels of a factor, we can either use the function ordered() or the argument ordered = TRUE in the factor() function. Such ordering can be useful when analysing survey data.
> ch <- c(“Outstanding”, “Excellent”, “Very Good”, “Good”, “Bad”)
> val <- sample(ch, 100, replace = TRUE)
> rating <- factor(val, ch)
> rating
[1] Outstanding Bad Outstanding Good Very Good Very Good
[7] Excellent Outstanding Bad Excellent Very Good Bad
...
Levels: Outstanding Excellent Very Good Good Bad
> is.factor(rating)
[1] TRUE
> is.ordered(rating)
[1] FALSE
> rating_ord <- ordered(val, ch)
> is.factor(rating_ord)
[1] TRUE
> is.ordered(rating_ord)
[1] TRUE
> rating_ord
[1] Outstanding Bad Outstanding Good Very Good Very Good
...
Levels: Outstanding < Excellent < Very Good < Good < Bad
Numeric values can be summarized into factors using the cut() function, and the result can be viewed using the table() function, which lists the count of values in each category. For example, consider a variable age holding numeric age values. These ages can be grouped using cut() with an interval of 10, and the result is a factor age_group.
> age <- c(18,20, 31, 32, 33, 35, 41, 38, 45, 48, 51, 27, 29, 42, 39)
> age_group <- cut(age, seq.int(15, 55, 10))
> age
[1] 18 20 31 32 33 35 41 38 45 48 51 27 29 42 39
> age_group
[1] (15,25] (15,25] (25,35] (25,35] (25,35] (25,35] (35,45] (35,45] (35,45] (45,55]
[11] (45,55] (25,35] (25,35] (35,45] (35,45]
Levels: (15,25] (25,35] (35,45] (45,55]
> table(age_group)
age_group
(15,25] (25,35] (35,45] (45,55]
2 6 5 2
The function gl() can be used to generate a factor: its first argument tells how many levels the factor contains, and the second tells how many times each level is repeated as a value. The function can also take the argument labels, which lists the names of the factor levels, and it can be made to list alternating values of the labels, as below.
> gl(5,3)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
Levels: 1 2 3 4 5
> gl(5,3, labels = c(“one”, “two”, “three”, “four”, “five”))
[1] one one one two two two three three three four four four five
[14] five five
Levels: one two three four five
> gl(5,1,15)
[1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
Levels: 1 2 3 4 5
The factors thus generated can be combined using the function interaction() to
get a resultant combined factor.
> fac1 <- gl(5,3, labels = c(“one”, “two”, “three”, “four”, “five”))
> fac2 <- gl(5, 1, 15, labels = c(“a”, “b”, “c”, “d”, “e”))
> interaction(fac1, fac2)
[1] one.a one.b one.c two.d two.e two.a three.b three.c three.d four.e
[11] four.a four.b five.c five.d five.e
25 Levels: one.a two.a three.a four.a five.a one.b two.b three.b four.b ... five.e
2.7. Strings
Strings are stored in character vectors, and most string manipulation functions act on character vectors. Character vectors can be created using the c() function by enclosing each string in double or single quotes (generally we follow double quotes). The paste() function concatenates two strings with a space in between; if the space is not wanted, we use the function paste0(). To have a specified separator between the concatenated strings, we use the sep argument of paste(). The result can be collapsed into one string using the collapse argument.
> c(“String 1”, ‘String 2’)
[1] “String 1” “String 2”
> paste(c(“Pine”, “Red”), “Apple”)
[1] “Pine Apple” “Red Apple”
The cat() function is similar to the paste() function, with a small difference in how it prints, as shown below.
> cat(c(“Red”, “Pine”), “Apple”)
Red Pine Apple
The noquote() function forces the string outputs not to be displayed with
quotes.
> a <- c(“I”, “am”, “a”, “data”, “scientist”)
> a
[1] “I” “am” “a” “data” “scientist”
> noquote(a)
[1] I am a data scientist
The formatC() function is used to format numbers and display them as strings. It has arguments digits, width, format, flag, etc., which can be used as below. A slight variation of formatC() is the function format(), whose usage is also shown below.
> h <- c(4.567, 8.981, 27.772)
> h
[1] 4.567 8.981 27.772
> formatC(h)
[1] “4.567” “8.981” “27.77”
> formatC(h, digits = 3)
[1] “4.57” “8.98” “27.8”
> formatC(h, digits = 3, width = 5)
[1] “ 4.57” “ 8.98” “ 27.8”
> formatC(h, digits = 3, format = “e”)
[1] “4.567e+00” “8.981e+00” “2.777e+01”
> formatC(h, digits = 3, flag = “+”)
[1] “+4.57” “+8.98” “+27.8”
> format(h)
[1] “ 4.567” “ 8.981” “27.772”
> format(h, digits = 3)
[1] “ 4.57” “ 8.98” “27.77”
> format(h, digits = 3, trim = TRUE)
[1] “4.57” “8.98” “27.77”
The sprintf() function is also used for formatting strings and embedding values within them. The format specifier %s stands for a string, while %d and %f stand for an integer and a floating-point number respectively. The usage of this function can be understood from the sketch below.
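A sketch of its usage (the name and values here are purely illustrative):
> sprintf("%s bought %d apples for %f rupees", "Kumar", 5, 42.5)
[1] "Kumar bought 5 apples for 42.500000 rupees"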
To print a tab between pieces of text, we can use the cat() function with the special character “\t” included in the text, as below. Similarly, to insert a new line we use “\n”. In the cat() function the argument fill = TRUE means that after printing the text, the cursor is placed on the next line. If a backslash has to appear within the text, it is preceded by another backslash. If the text is enclosed in double quotes and contains a double quote, that quote is also preceded by a backslash; likewise for a single quote within single-quoted text. However, a single quote inside double-quoted text, or a double quote inside single-quoted text, is not a problem (no backslash needed).
> cat(“Black\tBerry”, fill = TRUE)
Black Berry
> cat(“Black\nBerry”, fill = TRUE)
Black
Berry
> cat(“Black\\Berry”, fill = TRUE)
Black\Berry
> cat(“Black\”Berry”, fill = TRUE)
Black”Berry
> cat(‘Black\’Berry’, fill = TRUE)
Black’Berry
> cat(‘Black”Berry’, fill = TRUE)
Black”Berry
> cat(“Black’Berry”, fill = TRUE)
Black’Berry
The functions toupper() and tolower() are used to convert a string into upper
case or lower case respectively. The substring() or the substr() function is used to cut
a part of the string from the given text. Its arguments are the text, starting
position and ending position. Both these functions produce the same result.
> toupper(“The cat is on the Wall”)
[1] “THE CAT IS ON THE WALL”
> tolower(“The cat is on the Wall”)
[1] “the cat is on the wall”
The function strsplit() does the splitting of a text into many strings based on
the splitting character mentioned as argument. In the below example the splitting
is done when a space is encountered. It is important to note that this function
returns a list and not a character vector as a result.
> strsplit("I like Banana, Orange and Pineapple", " ")
[[1]]
[1] "I" "like" "Banana," "Orange" "and" "Pineapple"
In this same example, if the text has to be split when a comma or a space is
encountered, the pattern is given as ",? ". This means that the comma is optional
and the space is mandatory for splitting the given text.
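A short sketch of this pattern on the same text:
> strsplit("I like Banana, Orange and Pineapple", ",? ")
[[1]]
[1] "I" "like" "Banana" "Orange" "and" "Pineapple"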
R's default working directory can be obtained using the function getwd()
and this default directory can be changed using the function setwd(). The
directory path mentioned in the setwd() function should use forward slashes
instead of backslashes, as in the example below.
> getwd()
[1] “C:/Users/admin/Documents”
> setwd(“C:/Program Files/R”)
> getwd()
[1] “C:/Program Files/R”
It is also possible to construct file paths using the file.path() function,
which automatically inserts the forward slash between the directory names. The
function R.home() lists the home directory where R is installed.
> file.path(“C:”, “Program Files”, “R”, “R-3.3.0”)
[1] “C:/Program Files/R/R-3.3.0”
> R.home()
[1] “C:/PROGRA~1/R/R-33~1.0”
Paths can also be specified in relative terms: "." denotes the current directory,
".." denotes the parent directory and "~" denotes the home directory. The function
path.expand() expands the "~" shorthand into an absolute path; as the output
below shows, "." and ".." are returned unchanged.
> path.expand(“.”)
[1] “.”
> path.expand(“..”)
[1] “..”
> path.expand(“~”)
[1] “C:/Users/admin/Documents”
The function basename() returns only the file name leaving its directory if
specified. On the other hand the function dirname() returns only the directory
name leaving the file name.
> filename <- “C:/Program Files/R/R-3.3.0/bin/R.exe”
> basename(filename)
[1] “R.exe”
> dirname(filename)
[1] “C:/Program Files/R/R-3.3.0/bin”
R has three date and time base classes and they are POSIXct, POSIXlt and Date.
POSIX is a set of standards that defines how dates and times should be specified;
"ct" stands for "calendar time" and "lt" for "local time". POSIXlt stores dates as a
list of seconds, minutes, hours, day of month etc. For storing and calculating with
dates, we can use POSIXct and for extracting parts of dates, we can use POSIXlt.
The function Sys.time() is used to return the current date and time. This
returned value is by default in the POSIXct form. But, this can be converted to
POSIXlt form using the function as.POSIXlt(). When printed both forms of date
and time are displayed in the same manner, but their internal storage mechanism
differs. We can also access individual components of a POSIXlt date using the
dollar symbol or the double brackets as shown below.
> Sys.time()
[1] “2017-05-11 14:31:29 IST”
> t <- Sys.time()
> t1 <- Sys.time()
> t2 <- as.POSIXlt(t1)
> t1
[1] “2017-05-11 14:39:39 IST”
> t2
[1] “2017-05-11 14:39:39 IST”
> class(t1)
[1] “POSIXct” “POSIXt”
> class(t2)
[1] “POSIXlt” “POSIXt”
> t2$sec
[1] 39.20794
> t2[[“min”]]
[1] 39
> t2$hour
[1] 14
> t2$mday
[1] 11
> t2$wday
[1] 4
The Date class stores dates as the number of days since the start of 1970. This
class is useful when time is insignificant. The as.Date() function can be used to
convert a date in other class formats to the Date class format.
> t3 <- as.Date(t2)
> t3
[1] “2017-05-11”
There are also other add-on packages available in R to handle date and time and
they are date, dates, chron, yearmon, yearqtr, timeDate, ti and jul.
2.8.2. Date Conversions
In CSV files the dates will be normally stored as strings and they have to be converted
into date and time using any of the packages. For this we need to parse the strings
using the function strptime() and this returns the date of the format POSIXlt.
The date format is specified as a string and passed as argument to the strptime()
function. If the given string does not match the format given in the format string,
then it returns NA.
> date1 <- strptime(“22:15:45 22/08/2015”, “%H:%M:%S %d/%m/%Y”)
> date1
[1] “2015-08-22 22:15:45 IST”
In the format string “%H” denotes hour in 24 hour system, “%M” denotes
minutes, “%S” denotes second, “%m” denotes the number of the month, “%d”
denotes the day of the month as number, “%Y” denotes four digit year.
To convert a date into a string the function strftime() is used. This function also
takes a date formatting string as argument like strptime(). In the format string
“%I” denotes hour in 12 hours system, “%p” denotes AM/PM, “%A” denotes the
string of day of the week, “%B” denotes the string of name of the month.
> strftime(Sys.Date(),”It’s %I:%M%p on %A %d %B, %Y.”)
[1] “It’s 12:00AM on Thursday 11 May, 2017.”
It is possible to specify the time zone when parsing a date string using the strptime()
or strftime() functions. If this is not specified, the default time zone is taken. The
function Sys.timezone() returns the system's default time zone, and
Sys.getlocale("LC_TIME") returns the operating system's time locale.
> Sys.timezone()
[1] “Asia/Calcutta”
> Sys.getlocale(“LC_TIME”)
[1] “English_India.1252”
A few of the time zones are UTC (Universal Time), IST (Indian Standard Time),
EST (Eastern Standard Time), PST (Pacific Standard Time), GMT (Greenwich
Mean Time), etc. It is also possible to give a manual offset from UTC as "UTC+n"
or "UTC-n" to denote the west and east parts of UTC respectively. Even though this
throws a warning message, it gives the result correctly.
> strftime(Sys.time(), tz = “UTC”)
[1] “2017-05-12 04:59:04”
The time zone change does not happen in the strftime() function if the date is a
POSIXlt date. Hence, it is required to convert to the POSIXct format first and then
apply the function.
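A minimal sketch of this conversion, reusing the POSIXlt value t2 from above (the exact output depends on the current time; here it follows the IST timestamps shown earlier):
> strftime(as.POSIXct(t2), tz = "UTC")
[1] "2017-05-11 09:09:39"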
If we add a number to the POSIXct or POSIXlt classes, it will shift to that many
seconds. If we add a number to the Date class, it will shift to that many days.
> ct <- as.POSIXct(Sys.time())
> lt <- as.POSIXlt(Sys.time())
> dt <- as.Date(Sys.time())
> ct
[1] “2017-05-12 11:41:54 IST”
> ct + 2500
[1] “2017-05-12 12:23:34 IST”
> lt
[1] “2017-05-12 11:42:15 IST”
> lt + 2500
[1] “2017-05-12 12:23:55 IST”
> dt
[1] “2017-05-12”
> dt + 2
[1] “2017-05-14”
Adding two dates throws an error, but subtracting two dates gives the number
of days in between them. To get the same result, alternatively, the difftime()
function can be used, where it is possible to specify the argument units = "secs"
(or "mins", "hours", "days", "weeks").
> dt1 <- as.Date(“10/10/1973”, “%d/%m/%Y”)
> dt1
[1] “1973-10-10”
> dt2 <- as.Date(“25/09/2000”, “%d/%m/%Y”)
> dt2
[1] “2000-09-25”
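For instance, with dt1 and dt2 as above:
> dt2 - dt1
Time difference of 9847 days
> difftime(dt2, dt1, units = "weeks")
Time difference of 1406.714 weeks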
The seq() function can be used to generate a sequence of dates. The argument
"by" can take many options based on the class of the dates specified. We can also
apply the mean() and summary() functions on the generated sequence of dates.
> seq(dt1, dt2, by = “1 year”)
[1] “1973-10-10” “1974-10-10” “1975-10-10” “1976-10-10” “1977-10-10”
“1978-10-10”
[7] “1979-10-10” “1980-10-10” “1981-10-10” “1982-10-10” “1983-10-10”
“1984-10-10”
[13] “1985-10-10” “1986-10-10” “1987-10-10” “1988-10-10” “1989-10-10”
“1990-10-10”
[19] “1991-10-10” “1992-10-10” “1993-10-10” “1994-10-10” “1995-10-10”
“1996-10-10”
[25] “1997-10-10” “1998-10-10” “1999-10-10”
The lubridate package makes the process of date and time manipulation
easier. The ymd() function in this package converts any date to the format of year,
month and day separated by hyphens. (Note: this function requires the date to be
specified in the order of year, month and day, but it can use any separator, as
below.)
> install.packages(“lubridate”)
> library(lubridate)
> ymd(“2000/09/25”, “2000-9-25”, “2000*9.25”)
[1] “2000-09-25” “2000-09-25” “2000-09-25”
If the given date is in other formats that is not in the order of year, month and
day, then we have other functions such as ydm(), mdy(), myd(), dmy() and dym().
These functions can also be accompanied with time by making use of the
functions ymd_h(), ymd_hm() and ymd_hms() [similar functions available for
ydm(), mdy(), myd(), dmy() and dym()]. All the parsing functions in the lubridate
package returns POSIXct dates and the default time zone is UTC. A function named
stamp() in the lubridate package allows formatting of the dates in a human
readable format.
> dt_format <- stamp(“I purchased on Sunday, the 10th of October 2013 at
6:00:00 PM”)
Multiple formats matched: "I purchased on %A, the %dth of %B %Y at %H:%M:%S %Op"(0), "I purchased on %A, the %dth of October %Y at %Om:%H:%M %Op"...
...
Using: “I purchased groceries on %A, the %dth of %Om %Y at %H:%M:%S %Op”
The lubridate package has three variable types, namely the “Durations”, “Periods”
and “Intervals”. The lubridate package has the functions, dyears(), dweeks(), ddays(),
dhours(), dminutes(), dseconds() etc that specify the duration of year, week, day, hour,
minute and second in terms of seconds. The duration of 1 minute is 60 seconds,
the duration of 1 hour is 3600 seconds (60 minutes * 60 seconds), the duration of
1 day is 86,400 seconds (24 hours * 60 minutes * 60 seconds), the duration of 1
year is 31,536,000 seconds (365 days * 24 hours * 60 minutes * 60 seconds) and
so on. The function today() returns the current day's date.
> y <- dyears(1:5)
> y
[1] “31536000s (~52.14 weeks)” “63072000s (~2 years)” “94608000s (~3 years)”
[4] “126144000s (~4 years)” “157680000s (~5 years)”
> w <- dweeks(1:4)
> w
[1] “604800s (~1 weeks)” “1209600s (~2 weeks)” “1814400s (~3 weeks)”
[4] “2419200s (~4 weeks)”
> d <- ddays(1:10)
> d
[1] “86400s (~1 days)” “172800s (~2 days)” “259200s (~3 days)”
[4] “345600s (~4 days)” “432000s (~5 days)” “518400s (~6 days)”
[7] “604800s (~1 weeks)” “691200s (~1.14 weeks)” “777600s (~1.29 weeks)”
[10] “864000s (~1.43 weeks)”
> today() + y
[1] “2018-05-12” “2019-05-12” “2020-05-11” “2021-05-11” “2022-05-11”
“Periods” specify time spans according to the clock time. The lubridate
package has the functions, years(), weeks(), days(), hours(), minutes(), seconds()
etc that specify the period of year, week, day, hour, minute and second in terms of
clock time. The exact length of these periods can be realized only if they are added
to an instance of date or time.
> y <- years(1:7)
> y
[1] “1y 0m 0d 0H 0M 0S” “2y 0m 0d 0H 0M 0S” “3y 0m 0d 0H 0M 0S”
“4y 0m 0d 0H 0M 0S”
[5] “5y 0m 0d 0H 0M 0S” “6y 0m 0d 0H 0M 0S” “7y 0m 0d 0H 0M 0S”
> today()+y
[1] “2018-05-12” “2019-05-12” “2020-05-12” “2021-05-12” “2022-05-12”
“2023-05-12”
[7] “2024-05-12”
“Intervals” are defined by the instance of date or time at the beginning and
end. They are mostly used for specifying “Periods” and “Durations” and conversion
between “Periods” and “Durations”.
> yr <- dyears(5)
> yr
[1] “157680000s (~5 years)”
> as.period(yr)
[1] “5y 0m 0d 0H 0M 0S”
> sdt <- ymd(“2017-05-12”)
> int <- new_interval(sdt, sdt+yr)
> int
[1] 2017-05-12 UTC--2022-05-11 UTC
The operator “%--%” is used for defining intervals and the operator “%within%”
is used for checking if a given date is within the given interval.
> intv <- ymd(“1973-10-10”) %--% ymd(“2000-09-25”)
> intv
[1] 1973-10-10 UTC--2000-09-25 UTC
> ymd(“1979-12-12”) %within% intv
[1] TRUE
The function with_tz() can be used to change the time zone of a date (correctly
handles POSIXlt dates) and the function force_tz() is used for updating incorrect
time zones.
> with_tz(Sys.time(), tz = “America/Los_Angeles”)
[1] “2017-05-12 06:44:14 PDT”
> with_tz(Sys.time(), tz = “Asia/Kolkata”)
[1] “2017-05-12 19:14:29 IST”
The functions floor_date() and ceiling_date() can be used to find the lower and
upper limit of a given date as below.
> floor_date(today(), “year”)
[1] “2017-01-01”
> ceiling_date(today(), “year”)
[1] “2018-01-01”
> floor_date(today(), “month”)
[1] “2017-05-01”
> ceiling_date(today(), "month")
[1] "2017-06-01"
HIGHLIGHTS
The basic data types in R are Numeric, Integer, Complex, Logical and
Character.
CHAPTER 3
DATA PREPARATION
OBJECTIVES
3.1.Datasets
R has many datasets built in, and it can read data from a variety of other data
sources in a variety of formats. One of the packages in R is datasets, which is filled
with example datasets. Many other packages also contain datasets. We can see all
the datasets available in the loaded packages using the data() function.
To access a particular dataset use the data() function with its argument as the
dataset name enclosed within double quotes and the second optional argument
being the package name in which the dataset is present (This second argument is
required only if the particular package is not loaded). The invoked dataset can be
listed just like a data frame using the head() function.
> data(“kidney”, package = “survival”)
> head(kidney)
id time status age sex disease frail
1 1 8 1 28 1 Other 2.3
2 1 16 1 28 1 Other 2.3
3 2 23 1 48 2 GN 1.9
….
Text documents have several formats. Common formats are CSV (Comma
Separated Values), XML (eXtensible Markup Language), JSON (JavaScript Object
Notation) and YAML. An example of unstructured text data is a book.
A Comma Separated Values (CSV) file stores spreadsheet-like data with
comma-delimited values. The read.table() function reads these files and stores the
result in a data frame. If the data has a header, it is required to pass the argument
header = TRUE to the read.table() function. The argument fill = TRUE makes
the read.table() function substitute NA values for the missing fields. The
system.file() function is used to locate files that are inside a package. In the below
example "extdata" is the folder name, the package name is "learningr" and the
file name is "RedDeerEndocranialVolume.dlm". The str() function takes the data
frame name as the argument and lists the structure of the dataset stored in the
data frame.
> install.packages(“learningr”)
> library(learningr)
> deer_file <- system.file(“extdata”,”RedDeerEndocranialVolume.dlm”,
package = “learningr”)
> deer_data <- read.table(deer_file, header=TRUE, fill=TRUE)
> str(deer_data)
‘data.frame’: 33 obs. of 8 variables:
$ SkullID : Factor w/ 33 levels “A4”,”B11”,”B12”,..: 14 2 17 16 15 13 10 11
19 3 ...
$ VolCT : int 389 389 352 388 375 325 346 302 379 410 ...
$ VolBead : int 375 370 345 370 355 320 335 295 360 400 ...
$ VolLWH : int 1484 1722 1495 1683 1458 1363 1250 1011 1621 1740 ...
$ VolFinarelli: int 337 377 328 377 328 291 289 250 347 387 ...
$ VolCT2 : int NA NA NA NA NA NA 346 303 375 413 ...
The column names and row names are listed by default, and if the row names
are not given in the dataset, the rows are simply numbered 1, 2, 3 and so on. The
arguments specify how the file will be read. The argument sep determines the
character to use as the separator between fields. The nrows argument specifies the
number of lines of data to read. The argument skip specifies the number of lines
to skip at the start of the file. For the function read.csv() the default separator is
the comma and it assumes the data has a header row. The function read.csv2()
uses the semicolon as the separator and the comma for decimal places. The
read.delim() function imports tab-delimited files with full stops for decimal
places, and read.delim2() imports tab-delimited files with commas for decimal
places.
> read.csv(deer_file, header=FALSE, skip = 3, nrow = 2)
V1
1 DIC90 352 345 1495 328
2 DIC83 388 370 1683 377
> head(deer_data)
SkullID VolCT VolBead VolLWH VolFinarelli VolCT2 VolBead2 VolLWH2
1 DIC44 389 375 1484 337 NA NA NA
2 B11 389 370 1722 377 NA NA NA
3 DIC90 352 345 1495 328 NA NA NA
….
The colbycol and sqldf packages contain functions that allow reading part of
a CSV file into R. These are useful when we don't need all the columns or all the
rows. For low-level control we can use the scan() function to import a CSV file. For
data exported from other languages we may need to pass the na.strings argument
to the read.table() function to replace the missing values. If the data is exported
from SQL, we use na.strings = "NULL"; if the data is exported from SAS or Stata,
we use na.strings = "."; and if the data is exported from Excel we use na.strings
= c("", "#N/A", "#DIV/0!", "#NUM!").
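For instance, for a hypothetical file exported from Excel:
> excel_data <- read.csv("F:/exported.csv", na.strings = c("", "#N/A", "#DIV/0!", "#NUM!"))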
Writing data from R into a file is easier than reading files into R. For this we
use the functions write.table() and write.csv(). These functions take a data frame
and a file path as arguments. They also have arguments to specify whether row
names should be included in the output file and the character encoding of the
output file.
> write.csv(deer_data,”F:/deer.csv”, row.names = FALSE, fileEncoding = “utf8”)
If the file structure is weak, it is easier to read the file as lines of text using the
function readLines() and then parse the contents. The readLines() function accepts
a path to the file as the argument. Similarly, the writeLines() function takes a text
line or a character vector and the file name as arguments and writes the text to
the file.
> tempest <- readLines(“F:/Tempest.txt”)
> tempest
[1] "The writing of Prefaces to Plays was probably invented by some very"
[2] "ambitious Poet, who never thought he had done enough: Perhaps by some"
[3] "Ape of the French Eloquence, which uses to make a business of a Letter of"
....
> writeLines(“This book is about a story by Shakespeare”, “F:/story.csv”)
XML files are used for storing nested data. A few examples are RSS (Really Simple
Syndication) feeds, SOAP (Simple Object Access Protocol) and XHTML web
pages. To read XML files, the XML package has to be installed. When an XML
file is imported, the result can be stored using internal nodes or R-level nodes. If the
result is stored using internal nodes, it allows querying the node tree using the XPath
language (used for interrogating XML documents). The XML file can be imported
using the function xmlParse(). This function can take the argument
useInternalNodes = FALSE to use R-level nodes instead of the internal nodes while
importing the XML files, but this is the default behaviour of the xmlTreeParse()
function.
> install.packages(“XML”)
> library(XML)
The functions for importing HTML pages are htmlParse() and htmlTreeParse()
and they behave same as the xmlParse() and xmlTreeParse() functions.
The two packages dealing with JSON data are RJSONIO and rjson; of these,
RJSONIO is the more capable. The function used to import JSON data is fromJSON()
and the function used to export it is toJSON(). The yaml package has two
functions for importing YAML data, yaml.load() and yaml.load_file(). The
function as.yaml() performs the task of converting R objects to YAML strings.
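A minimal round-trip sketch with RJSONIO (assuming the package is installed):
> library(RJSONIO)
> json_str <- toJSON(list(name = "R", year = 1993))
> fromJSON(json_str)
$name
[1] "R"
$year
[1] 1993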
Many softwares store their data in binary formats which are smaller in size
than the text files. They hence provide performance gains at the expense of
human readability.
Excel is the world's most powerful data analysis tool and its document formats
are XLS and XLSX. Spreadsheets can be imported with the functions read.xlsx()
and read.xlsx2(). The colClasses argument determines what class each column
should have in the resulting data frame, and this argument is optional in the above
functions. To write to an Excel file from R we use the function write.xlsx2(), which
takes the data frame and the file name as arguments. There is another package,
xlsReadWrite, that does the same job as the xlsx package, but it works only in
32-bit R installations and only on Windows.
> install.packages(“xlsx”)
> library(xlsx)
> logfile <- read.xlsx2("F:/Log2015.xls", sheetIndex = 1, startRow = 2, endRow = 72,
colIndex = 1:5, colClasses = c(“character”, “numeric”, “character”,
“character”, “integer”))
The files from a statistical package are imported using the foreign package. The
read.ssd() function is used to read SAS datasets and the read.dta() function is
used to read Stata DTA files. The read.spss() function is used to import the SPSS
data files. Similarly, these files can be written with the write.foreign() function.
The MATLAB binary data files can be read and written using the readMat() and
writeMat() functions in the R.matlab package. The files in picture formats can be
read via the jpeg, png, tiff, rtiff and readbitmap packages.
R has ways to import data from web sources using an Application Programming
Interface (API). For example, the World Bank makes its data available using the
WDI package and the Polish government data can be accessed using the
SmarterPoland package. The twitteR package provides access to Twitter users
and their tweets.
The read.table() function can accept a URL rather than a local file. Accessing a
large file from the internet can be slow, and if the file is required frequently, it is
better to download the file using the download.file() function to create a local
copy and then import that.
> cancer_url <- “http://repository.seasr.org/Datasets/UCI/csv/breast-cancer.csv”
> cancer_data <- read.csv(cancer_url)
> str(cancer_data)
‘data.frame’: 287 obs. of 10 variables:
$ age : Factor w/ 7 levels “20-29”,”30-39”,..: 7 3 4 4 3 3 4 4 3 3 ...
$ menopause : Factor w/ 4 levels “ge40”,”lt40”,..: 4 3 1 1 3 3 3 1 3 3 ...
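A sketch of the download-then-import approach mentioned above (the local path is hypothetical):
> download.file(cancer_url, "F:/breast-cancer.csv")
> cancer_data <- read.csv("F:/breast-cancer.csv")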
3.3.Accessing Databases
R can connect to many database management systems (DBMS) like SQLite,
MySQL, MariaDB, PostgreSQL and Oracle using the DBI package. We need to
install and load the DBI package and the backend package RSQLite. Define a
database driver of type SQLite using the function dbDriver() and set up a
connection to the database using the function dbConnect(). To retrieve data from
the databases we write a query as a string containing SQL commands and send it
to the database with the function dbGetQuery().
> install.packages(“DBI”)
> install.packages(“RSQLite”)
> library(DBI)
> library(RSQLite)
> driver <- dbDriver(“SQLite”)
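The connection and query steps can be sketched as below (the database file and table names are hypothetical):
> conn <- dbConnect(driver, dbname = "company.sqlite")
> query <- "SELECT * FROM employees"
> result <- dbGetQuery(conn, query)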
The function dbDisconnect() is used for disconnecting and unloading the driver
and the function dbUnloadDriver() is used to unload the defined database driver.
> dbDisconnect(conn)
> dbUnloadDriver(driver)
For a MySQL database we need to load the RMySQL package and set the
driver type to be "MySQL". The PostgreSQL, Oracle and JDBC databases need
the RPostgreSQL, ROracle and RJDBC packages respectively. To connect to
SQL Server or Access databases, the RODBC package needs to be loaded. In this
package, the function odbcConnect() is used to connect to the database, the
function sqlQuery() is used to run a query and the function odbcClose() is used to
close and clean up the database connections. There are no mature
methods yet to access NoSQL (Not only SQL) databases (lightweight databases that
are more scalable than traditional SQL relational databases). To access the MongoDB
database the packages RMongo and rmongodb are used. The database Cassandra
can be accessed using the package RCassandra.
In some datasets or data frames logical values are represented as “Y” and “N” instead
of TRUE and FALSE. In such cases it is possible to replace the string with correct
logical value as in the example below.
> a <- c(1,2,3)
> b <- c(“A”, “B”, “C”)
> d <- c(“Y”, “N”, “Y”)
> df1 <- data.frame(a, b, d)
> df1
a b d
1 1 A Y
2 2 B N
3 3 C Y
convt <- function(x)
{
y <- rep.int(NA, length(x))
y[x == “Y”] <- TRUE
y[x == “N”] <- FALSE
y
}
> df1$d <- convt(df1$d)
> df1
a b d
1 1 A TRUE
2 2 B FALSE
3 3 C TRUE
The functions grep() and grepl() are used to find a pattern in a given text,
and the functions sub() and gsub() are used to replace a pattern with
another in a given text. The above four functions belong to the base package, but
the package stringr consists of many such string manipulation functions.
The function str_detect() in the stringr package does the same job of
detecting the presence of a given pattern in the given text. We can also use
the function fixed() to indicate that the pattern we are searching for is a fixed
string rather than a regular expression.
> grep(“my”, “This is my pen”)
[1] 1
> grepl(“my”, “This is my pen”)
[1] TRUE
> sub(“my”, “your”,”This is my pen”)
[1] “This is your pen”
> gsub(“my”, “your”,”This is my pen”)
[1] “This is your pen”
> library(stringr)
> str_detect("This is my pen", "my")
[1] TRUE
> str_detect(“This is my pen”, fixed(“my”))
[1] TRUE
The function str_split() is used to split a given text based on the pattern
specified, as below. Like strsplit(), this function returns a list. But the function
str_split_fixed() can be used to split the given text into a fixed number of
strings based on the specified pattern; this function returns a matrix.
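A short sketch of both functions on a sample sentence (assuming the stringr package is loaded):
> str_split("I like Banana, Orange and Pineapple", ",? ")
[[1]]
[1] "I" "like" "Banana" "Orange" "and" "Pineapple"
> str_split_fixed("I like Banana, Orange and Pineapple", ",? ", 3)
     [,1] [,2]   [,3]
[1,] "I"  "like" "Banana, Orange and Pineapple"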
In the example below, the various ways of storing the gender values are
transformed into one form, ignoring case differences. This is done using the
str_replace() function together with fixed(), which can be told to ignore case.
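A minimal sketch (the gender codes here are assumed sample data):
> gender <- c("MALE", "male", "Male")
> str_replace(gender, fixed("male", ignore_case = TRUE), "male")
[1] "male" "male" "male"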
To add a column to a data frame, we can use the command below.
> name <- c(“Jhon”, “Peter”, “Mark”)
> start_date <- c(“1980-10-10”, “1999-12-12”, “1990-04-05”)
> end_date <- c(“1989-03-08”, “2004-09-20”, “2000-09-25”)
> service <- data.frame(name, start_date, end_date)
> service
name start_date end_date
1 Jhon 1980-10-10 1989-03-08
2 Peter 1999-12-12 2004-09-20
3 Mark 1990-04-05 2000-09-25
> service$period <- as.Date(service$end_date) - as.Date(service$start_date)
> service
name start_date end_date period
1 Jhon 1980-10-10 1989-03-08 3071 days
2 Peter 1999-12-12 2004-09-20 1744 days
3 Mark 1990-04-05 2000-09-25 3826 days
Another way of doing the same is to use the function within(). The difference
shows up when there are multiple columns to be added to a data frame: this can
be done in a single command using the within() function, which is not possible
using the with() function.
> service <- within(service,
{
period <- as.Date(end_date) - as.Date(start_date)
highperiod <- period > 2000
})
> service
name start_date end_date period highperiod
1 Jhon 1980-10-10 1989-03-08 3071 days TRUE
2 Peter 1999-12-12 2004-09-20 1744 days FALSE
3 Mark 1990-04-05 2000-09-25 3826 days TRUE
The mutate() function in the plyr package does the same job as the
function within(), but the syntax is slightly different.
> library(plyr)
> service <- mutate(service,
      period = as.Date(end_date) - as.Date(start_date),
      highperiod = period > 2000)
> service
A data frame can be transformed by choosing a few of the columns and ignoring
the remaining, but considering all the rows, as in the example below.
> crime.data <- read.csv(“F:/Crimes.csv”)
> colnames(crime.data)
 [1] "CASE."                "DATE..OF.OCCURRENCE"  "BLOCK"
 [4] "IUCR"                 "PRIMARY.DESCRIPTION"  "SECONDARY.DESCRIPTION"
 [7] "LOCATION.DESCRIPTION" "ARREST"               "DOMESTIC"
[10] "BEAT"                 "WARD"                 "FBI.CD"
[13] "X.COORDINATE"         "Y.COORDINATE"         "LATITUDE"
[16] "LONGITUDE"            "LOCATION"
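A sketch of such a selection (the columns chosen here are arbitrary):
> crime.data1 <- crime.data[, c("CASE.", "BLOCK", "ARREST", "WARD")]
> colnames(crime.data1)
[1] "CASE."  "BLOCK"  "ARREST" "WARD"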
Alternatively, the data frame can be transformed by selecting only the required
rows and retaining all columns of a data frame as in the example below.
> nrow(crime.data)
[1] 65535
> crime.data2 <- crime.data[1:10,]
> nrow(crime.data2)
[1] 10
The function sort() sorts the given vector of numbers or strings. It sorts
from smallest to largest by default, but this can be altered using the argument
decreasing = TRUE.
> x <- c(5, 10, 3, 15, 6, 8)
> sort(x)
[1] 3 5 6 8 10 15
> sort(x, decreasing = TRUE)
[1] 15 10 8 6 5 3
> y <- c(“X”, “AB”, “Deer”, “For”, “Moon”)
> sort(y)
[1] “AB” “Deer” “For” “Moon” “X”
> sort(y, decreasing = TRUE)
[1] “X” “Moon” “For” “Deer” “AB”
The function order() is the counterpart of the sort() function: it returns the
indices of the vector elements in sorted order, as below. Thus x[order(x)] is the
same as sort(x), which can be seen by the use of the identical() function.
> order(x)
[1] 3 1 5 6 2 4
> x[order(x)]
[1] 3 5 6 8 10 15
> identical(sort(x), x[order(x)])
[1] TRUE
The order() function is more useful than the sort() function as it can be used to
manipulate the data frames easily.
> name <- c(“Jhon”, “Peter”, “Mark”)
> start_date <- c(“1980-10-10”, “1999-12-12”, “1990-04-05”)
> end_date <- c(“1989-03-08”, “2004-09-20”, “2000-09-25”)
> service <- data.frame(name, start_date, end_date)
> service
name start_date end_date
1 Jhon 1980-10-10 1989-03-08
2 Peter 1999-12-12 2004-09-20
3 Mark 1990-04-05 2000-09-25
> startdt <- order(service$start_date)
> service.ordered <- service[startdt, ]
> service.ordered
name start_date end_date
1 Jhon 1980-10-10 1989-03-08
3 Mark 1990-04-05 2000-09-25
2 Peter 1999-12-12 2004-09-20
The arrange() function of the plyr package does the same function as
above.
> library(plyr)
> arrange(service, start_date)
The rank() function lists the rank of the elements in a vector or a data
frame. By specifying the argument ties.method = "first", a rank is not
shared among elements with the same value; ties are broken by order of
appearance.
> x <- c(9, 5, 4, 6, 4, 5)
> rank(x)
[1] 6.0 3.5 1.5 5.0 1.5 3.5
> rank(x, ties.method = “first”)
[1] 6 3 1 5 2 4
The SQL statements can be executed from R and the results can be obtained
as in any other database. The package sqldf needs to be installed to manipulate
the data frames or datasets using SQL.
> install.packages("sqldf")
> library(sqldf)
> query <- “SELECT * FROM iris WHERE Species = ‘setosa’”
> sqldf(query)
Data Reshaping in R is about changing the way data is organized into rows and
columns. Most of the time data processing in R is done by taking the input data as a
data frame. It is easy to extract data from the rows and columns of a data frame. But
there are situations when we need the data frame in a different format than what we
received. R has few functions to split, merge and change the columns to rows and vice-
versa in a data frame.
The cbind() function can be used to join multiple vectors to create a data
frame, and the rbind() function can be used to append the rows of one data frame
to another, as in the sketch below.
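A minimal sketch of both (the address values are assumed sample data):
> city <- c("Lowry", "Charlotte")
> state <- c("CO", "FL")
> zip <- c(80230, 33949)
> addresses <- data.frame(cbind(city, state, zip))
> new.address <- data.frame(city = "Austin", state = "TX", zip = 78701)
> rbind(addresses, new.address)
       city state   zip
1     Lowry    CO 80230
2 Charlotte    FL 33949
3    Austin    TX 78701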
The merge() function can be used to merge two data frames. The
merging requires the data frames to have same column names on which the
merging is done. In the example below, we consider the data sets about
Diabetes in Pima Indian Women available in the library named “MASS”. The
two datasets are merged based on the values of blood pressure (“bp”) and body
mass index (“bmi”). On choosing these two columns for merging, the records
where values of these two variables match in both data sets are combined
together to form a single data frame.
> library(MASS)
> head(Pima.te)
npreg glu bp skin bmi ped age type
1 6 148 72 35 33.6 0.627 50 Yes
2 1 85 66 29 26.6 0.351 31 No
3 1 89 66 23 28.1 0.167 21 No
...
> head(Pima.tr)
npreg glu bp skin bmi ped age type
1 5 86 68 28 30.2 0.364 24 No
2 7 195 70 33 25.1 0.163 55 Yes
3 5 77 82 41 35.8 0.156 35 No
...
> nrow(Pima.te)
[1] 332
> nrow(Pima.tr)
[1] 200
> merged.Pima <- merge(x = Pima.te, y = Pima.tr,
+ by.x = c(“bp”, “bmi”),
+ by.y = c(“bp”, “bmi”)
+)
> head(merged.Pima)
  bp  bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y skin.y
1 60 33.8       1   117     23 0.466    27     No       2   125     20
2 64 29.7       2    75     24 0.370    33     No       2   100     23
3 64 31.2       5   189     33 0.583    29    Yes       3   158     13
...
ped.y age.y type.y
1 0.088 31 No
2 0.368 21 No
3 0.295 24 No
...
> nrow(merged.Pima)
[1] 17
Now we melt the ships data (available in the MASS package) using the melt()
function in the package reshape2, converting all columns other than type and
year into multiple rows.
> library(reshape2)
> molten.ships <- melt(ships, id = c(“type”,”year”))
> head(molten.ships)
type year variable value
1 A 60 period 60
2 A 60 period 75
3 A 65 period 60
4 A 65 period 75
5 A 70 period 60
6 A 70 period 75
> nrow(molten.ships)
[1] 120
> nrow(ships)
[1] 40
We can cast the molten data into a new form where the aggregate of each type
of ship for each year is created. In reshape2 this is done using the dcast() function
(the equivalent of the cast() function from the older reshape package).
> recasted.ship <- dcast(molten.ships, type + year ~ variable, sum)
> head(recasted.ship)
type year period service incidents
1 A 60 135 190 0
2 A 65 135 2190 7
3 A 70 135 4865 24
4 A 75 135 2244 11
5 B 60 135 62058 68
6 B 65 135 48979 111
R has many apply functions such as apply(), lapply(), sapply(), vapply(), mapply(),
rapply(), tapply(), aggregate() and by(). Function lapply() is a list apply which acts
on a list or vector and returns a list. Function sapply() is a simple lapply() that
defaults to returning a vector or matrix when possible. Function vapply() is a verified
apply that allows the return object type to be pre-specified. Function mapply() is a
multivariate apply that vectorises a function over several arguments. Function
rapply() is a recursive apply for nested lists, i.e. lists within lists. Function tapply()
is a tagged apply where the tags identify the subsets. Function apply() is generic:
it applies a function to a matrix's rows or columns or, more generally, to the
dimensions of an array.
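A minimal apply() sketch on a small matrix (sample data assumed):
> m <- matrix(1:6, nrow = 2)
> apply(m, 1, sum)   # one sum per row
[1]  9 12
> apply(m, 2, sum)   # one sum per column
[1]  3  7 11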
If we want to apply a function to each element of a list in turn and get a list
back, we use the lapply() function as below.
> x <- list(a = 1, b = 1:3, c = 10:100)
> x
$a
[1] 1
$b
107 Data Preparation
[1] 1 2 3
108 R Programming — An Approach for Data Analytics
$c
[1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
[18] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
[35] 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
[52] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
[69] 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
[86] 95 96 97 98 99 100
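For instance, applying length() to each element of x returns a list of lengths:
> lapply(x, length)
$a
[1] 1
$b
[1] 3
$c
[1] 91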
When we want to use the function sapply(), but need to squeeze some more speed
out of the code, we use the function vapply() as below. For the function vapply(),
we give R the information on what the function will return, which can save some
time coercing returned values to fit in a single atomic vector. In the example below,
we tell R that everything returned by length() should be an integer of length 1.
> x <- list(a = 1, b = 1:3, c = 10:100)
> vapply(x, FUN = length, FUN.VALUE = 0L)
 a  b  c
 1  3 91
When we have several data structures (e.g. vectors, lists) and we want to
apply a function to the 1st elements of each, then the 2nd elements of each, and
so on, coercing the result to a vector or array, we use the function mapply() as below.
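A minimal mapply() sketch (sample vectors assumed), adding the corresponding elements of two vectors:
> mapply(function(a, b) a + b, 1:4, 4:1)
[1] 5 5 5 5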
When we want to apply a function to subsets of a vector and the subsets are
defined by some other vector, usually a factor, we use the function tapply() as below.
> x <- 1:20
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
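A sketch with an assumed grouping factor splitting x into two subsets:
> groups <- rep(c("A", "B"), each = 10)
> tapply(x, groups, sum)
  A   B
 55 155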
The by() function can be thought of as a "wrapper" for the function tapply().
The by() function comes in when we want to compute a task that tapply()
cannot handle.
> cta <- tapply(iris$Sepal.Width , iris$Species , summary )
> cba <- by(iris$Sepal.Width , iris$Species , summary )
> cta
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
$versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
$virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
> cba
iris$Species: setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
iris$Species: versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
iris$Species: virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
If we print these two objects, cta and cba, we have the same results. The only
differences are in how they are shown with the different class attributes. The power
of the function by() arises when we can’t use the function tapply() as in the following
code.
> tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) :
arguments must have same length
R says that the arguments must have the same length. Suppose we want to
calculate the summary of all variables in iris along the factor Species: tapply()
just can't do that, because it does not know how to handle a data frame. The by()
function lets the summary() function work even when the first argument is a
whole data frame.
> bywork <- by(iris, iris$Species, summary )
> bywork
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
Median :5.000 Median :3.400 Median :1.500 Median :0.200
Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
Species
setosa :50
versicolor: 0
virginica : 0
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000
1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200
Median :5.900 Median :2.800 Median :4.35 Median :1.300
Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
Species
setosa : 0
versicolor:50
virginica : 0
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
Median :6.500 Median :3.000 Median :5.550 Median :2.000
Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
Species
setosa : 0
versicolor: 0
virginica :50
The result is an object of class by that, along Species, computes the summary
of each variable.
The aggregate() function can be seen as a different way of using the tapply()
function, if we use it in such a way.
> att <- tapply(iris$Sepal.Length , iris$Species , mean)
> agt <- aggregate(iris$Sepal.Length , list(iris$Species), mean)
> att
setosa versicolor virginica
5.006 5.936 6.588
> agt
Group.1 x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
The two immediate differences are that the second argument of the
aggregate() function must be a list, while for tapply() it can (but need not) be
a list, and that the output of the aggregate() function is a data frame, while that
of tapply() is an array. The power of the aggregate() function is that it can
handle subsets of the data easily with the subset argument and that it can
handle formulas as well. These elements make the aggregate() function easier to
work with than the tapply() function in some situations.
> ag <- aggregate(len ~ ., data = ToothGrowth, mean)
> ag
  supp dose   len
1   OJ  0.5 13.23
2   VC  0.5  7.98
3   OJ  1.0 22.70
4 VC 1.0 16.77
5 OJ 2.0 26.06
6 VC 2.0 26.14
HIGHLIGHTS
One of the packages in R is datasets which is filled with example datasets.
We can see all the datasets available in the loaded packages using the
data() function.
The read.table() function reads the CSV files and stores the result in a
data frame.
The system.file() function is used to locate files that are inside a package.
Writing data from R into a file is done using the functions write.table()
and write.csv().
If the file is unstructured, it is read using the function readLines().
The writeLines() function takes a text line and the file name as argument
and writes the text to the file.
The XML file can be imported using the function xmlParse() function.
The function used to import the JSON file is fromJSON() and the function
used to export the JSON file is toJSON().
Spreadsheets can be imported with the functions read.xlsx() and read.xlsx2().
To write to an excel file from R we use the function write.xlsx2().
The read.ssd() function is used to read SAS datasets.
The read.spss() function is used to import the SPSS data files.
The MATLAB binary data files can be read and written using the readMat()
and writeMat() functions in the R.matlab package.
CHAPTER 4
GRAPHICS USING R
OBJECTIVES
4.3.Pie Charts
In R the pie chart is created using the pie() function which takes positive numbers
as vector input. The additional parameters are used to control labels, colour, title
etc. The basic syntax for creating a pie-chart is as given below and the explanation
of the parameters are also listed.
pie(x, labels, radius, main, col, clockwise)
x – numeric vector
labels – description of the slices
radius – values between [-1 to +1]
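The x and labels vectors used in the examples that follow are assumed sample data, for instance:
> x <- c(10, 20, 35, 15)
> labels <- c("Rose", "Lily", "Jasmine", "Lotus")
> pie(x, labels = labels, main = "Flowers")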
A 3D Pie Chart can be drawn using the package plotrix which uses the function
pie3D().
> install.packages(“plotrix”)
> library(plotrix)
> pie3D(x, labels = labels, explode = 0.1, main = “Flowers”)
4.4.Scatter Plots
Scatter plots are used for exploring the relationship between the two continuous
variables. Let us consider the dataset “cars” that lists the “Speed and Stopping
Distances of Cars”. The basic scatter plot in the base graphics system can be
obtained by using the plot() function as in Fig. 4.3. The below example compares if
the speed of a car has effect on its stopping distance using the plot.
> colnames(cars)
[1] “speed” “dist”
> plot(cars$speed, cars$dist)
This plot can be made more appealing and readable by adding colour and
changing the plotting character. For this we use the arguments col and pch (can
take the values between 1 and 25) in the plot() function as below. Thus the plot in
Fig. 4.4 shows that there is a strong positive correlation between the speed of a
car and its stopping distance.
> plot(cars$speed, cars$dist, col = “red”, pch = 15)
The layout() function is used to control the layout of multiple plots arranged
in a matrix. Thus in the example below multiple related plots are placed in a single
figure as in Fig. 4.5.
> data(mtcars)
> layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))
> plot(mtcars$wt, mtcars$mpg, col = “blue”, pch = 17)
> plot(mtcars$wt, mtcars$disp, col = “red”, pch = 15)
> plot(mtcars$mpg, mtcars$disp, col = “dark green”, pch = 10)
> plot(mtcars$mpg, mtcars$hp, col = “violet”, pch = 7)
When we have more than two variables and we want to find the correlation
between one variable versus the remaining ones we use scatter plot matrix. We use
pairs() function to create matrices of scatter plots as in Fig. 4.6. The basic syntax
for creating scatter plot matrices in R is as below.
pairs(formula, data)
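A minimal sketch on the mtcars dataset (the columns are chosen arbitrarily):
> pairs(~mpg + disp + hp + wt, data = mtcars)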
The lattice graphics system's equivalent of the plot() function is xyplot().
This function uses a formula to specify the x and y variables (yvar ~ xvar) and a
data frame argument. To use this function, it is required to load the lattice
package.
> library(lattice)
> xyplot(mtcars$mpg ~ mtcars$disp, mtcars, col = “purple”, pch = 7)
Axis scales can be specified in the xyplot() using the scales argument and this
argument must be a list. This list consists of the name = value pairs. If we mention
log = TRUE, the log scales for the x and y axis are set as in Fig. 4.8. The scales list
can take other arguments also like the x and y that sets the x and y axes
respectively.
> xyplot(mtcars$mpg ~ mtcars$disp, mtcars, scales = list(log = TRUE),
col = “red”, pch = 11)
Figure 4.8 Scatter Plot Matrix with Axis Scales Using xyplot()
The data in the graph can be split based on one of the columns in the dataset
namely mtcars$carb. This can be done by appending the pipe symbol (|) along
with the column name used for splitting. The argument relation = “same” means
that each panel shares the same axes. If the argument alternating = TRUE, axis
ticks for each panel are drawn on alternating sides of the plot as in Fig. 4.9.
> xyplot(mtcars$mpg ~ mtcars$disp | mtcars$carb, mtcars,
scales = list(log = TRUE, relation = “same”, alternating = FALSE),
layout = c(3, 2), col = “blue”, pch = 14)
The lattice plots can be stored in variables and hence can be further
updated using the function update() as below.
> graph1 <- xyplot(mtcars$mpg ~ mtcars$disp | mtcars$carb, mtcars,
scales = list(log = TRUE, relation = “same”, alternating = FALSE),
layout = c(3, 2), col = “blue”, pch = 14)
> graph2 <- update(graph1, col = “yellow”, pch = 6)
In the ggplot2 graphics, each plot is drawn with a call to the ggplot() function
as in Fig. 4.10. This function takes a data frame as its first argument. The mapping
of data frame columns to the x and y axes is done using the aes() function, which is
used within the ggplot() function. The other layers of the graph are then
added using geom functions (such as geom_point()) appended with a "+" symbol
to the ggplot() function.
> library(ggplot2)
> ggplot(mtcars, aes(mpg, disp)) +
geom_point(color = “purple”, shape = 16, cex = 2.5)
The ggplots can also be split into several panels like the lattice plots as in Fig. 4.11.
This is done using the function facet_wrap() which takes a formula of the column
used for splitting. The function theme() is used to specify the orientation of the
axis readings. The functions facet_wrap() and theme() are appended to the
ggplot() function using the “+” symbol. The ggplots can be stored in a variable like
the lattice plots and as usual wrapping the expression in parentheses makes it to
auto print.
> (graph1 <- ggplot(mtcars, aes(mpg, disp)) +
geom_point(color = “dark green”, shape = 15, cex = 3))
> (graph2 <- graph1 + facet_wrap(~mtcars$cyl, ncol = 3) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)))
4.5.Line Plots
A line chart / line plot is a graph that connects a series of points by drawing line
segments between them. Line charts are usually used in identifying the trends in
data. The plot() function in R is used to create the line graph in base graphics as in
Fig. 4.12. This function takes a vector of numbers as input together with few more
parameters listed below.
plot(v, type, col, xlab, ylab)
v – numeric vector
type - takes value “p” (only points), or “l” (only lines) or “o” (both points and lines)
xlab – label of x-axis
ylab – label of y-axis
main – title of the chart
col – colour palette
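The vectors plotted below are not defined in this section; assume sample data such as:
> male <- c(2500, 2800, 3000, 3500, 3200, 4000)
> female <- c(2000, 2200, 2500, 2800, 2600, 3100)
> child <- c(800, 900, 850, 1000, 950, 1100)
> wages <- c("Male", "Female", "Child")
> color <- c("red", "blue", "green")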
> plot(male, type = “o”, col = “red”, xlab = “Month”, ylab = “Wages”,
main = “Monthly Wages”, ylim = c(0, 5000))
> lines(female, type = “o”, col = “blue”)
> lines(child, type = “o”, col = “green”)
> legend(“topleft”, wages, cex = 0.8, fill = color)
Line plots in the lattice graphics uses the xyplot() function as in Fig. 4.13.
In this multiple lines can be creating using the “+” symbol in the formula where
the x and the y axes are mentioned. The argument type = “l” is used to mention
that it is a continuous line.
> xyplot(economics$pop + economics$unemploy ~ economics$date, economics, type = “l”)
In the ggplot2 graphics, the same syntax as for scatter plots is used, except that
the geom_point() function is replaced with the geom_line() function as in Fig. 4.14.
But there need to be multiple geom_line() functions for multiple lines to be drawn
in the graph.
> ggplot(economics, aes(economics$date)) + geom_line(aes(y = economics$pop)) +
geom_line(aes(y = economics$unemploy))
The plot in Fig. 4.15 can also be drawn without using multiple geom_line()
functions. This is possible using the function geom_ribbon() as mentioned
below. This function plots not only the two lines, but also the contents in between
them.
> ggplot(economics, aes(economics$date, ymin = economics$unemploy,
ymax = economics$pop)) + geom_ribbon(color = “blue”, fill = “white”)
4.6.Histograms
A histogram represents the frequencies of a variable's values, split into ranges.
This is similar to a bar chart, but a histogram groups the values into continuous ranges.
In R histograms in the base graphics are drawn using the function hist() as in the
Fig. 4.16, that takes a vector of numbers as input together with few more parameters
listed below.
hist(v, main, xlab, xlim, ylim, breaks, col, border)
v – numeric vector
main – title of the chart
col – colour palette
border – border colour
xlab – label of x-axis
xlim – range of x-axis
ylim – range of y-axis
breaks – width of each bar
The lattice histogram is drawn using the function histogram() as in Fig. 4.17
and it behaves in the same way as the base ones. But it allows easy splitting of
data into panels and saving plots as variables. The breaks argument behaves the same
way as with hist(). The lattice histograms support counts, probability densities, and
percentage y-axes via the type argument, which takes the string “count”, “density”,
or “percent”.
4.7.Box Plots
The box plot represents the minimum, maximum, median, first quartile and
third quartile of the data, showing the data distribution through its quartiles.
In R base graphics the box plot is created using the boxplot() function as in
Fig. 4.19. The parameters are used to give the data as a data frame, a vector or a
formula, a logical value to draw a notch, a logical value to make the box width
proportional to the sample size, the title of the chart and labels for the boxes. The
basic syntax for creating a box plot is as given below, and the explanation of the
parameters is also listed.
boxplot(x, data, notch, varwidth, names, main)
x – vector or a formula
data – data frame
notch – logical value (TRUE – draw a notch)
varwidth – logical value (TRUE – box width proportionate to sample size)
names – labels printed under the boxes
main – title of the chart
This type of plot is often clearer if we reorder the box plots from smallest to
largest, in some sense. The reorder() function changes the order of a factor’s
levels, based upon some numeric score.
> boxplot(mpg ~ reorder(gear, mpg, median), data = mtcars,
    xlab = "Number of Gears", ylab = "Miles Per Gallon",
    main = "Car Mileage", varwidth = TRUE,
    col = c("red", "blue", "green"), names = c("Low", "Medium", "High"))
In the lattice graphics the box plot is drawn using the function bwplot() as in
Fig. 4.20.
> bwplot(mpg ~ reorder(gear, mpg, median), data = mtcars,
    xlab = "Number of Gears", ylab = "Miles Per Gallon",
    main = "Car Mileage", varwidth = TRUE,
    col = c("red", "blue", "green"), names = c("Low", "Medium", "High"))
In the ggplot2 graphics the box plot is drawn by adding the function
geom_boxplot() to the function ggplot() as in Fig. 4.21.
> ggplot(mtcars, aes(reorder(gear, mpg, median), mpg)) + geom_boxplot()
4.8.Bar Plots
Bar charts are the natural way of displaying numeric variables split by a
categorical variable. In R base graphics the bar chart is created using the
barplot() function as in Fig. 4.22, which takes a matrix or a vector of numeric
values. The additional parameters are used to give labels to the X-axis, Y-axis, give
title of the chart, labels for the bars and colours. The basic syntax for creating a
bar-chart is as given below and the explanation of the parameters are also listed.
barplot(H, xlab, ylab, main, names.arg, col)
H – numeric vector or matrix
x-lab – label of x-axis
y-lab – label of y-axis
main – title of the chart
names.arg – vector of labels under each bar
col – colour palette
> x <- matrix(c(1000, 900, 1500, 4400, 800, 2100, 1700, 2900, 3800),
nrow = 3, ncol = 3)
> years <- c(“2011”, “2012”, “2013”)
> city <- c(“Chennai”, “Mumbai”, “Kolkata”)
By default the bars are vertical, but if we want horizontal bars, they can be
generated with horiz = TRUE parameter as in Fig. 4.23. We can also do some
fiddling with the plot parameters, via the par() function. The las parameter
controls whether labels are horizontal, vertical, parallel, or perpendicular to the
axes. Plots are usually more readable if you set las = 1, for horizontal. The mar
parameter is a numeric vector of length 4, giving the width of the plot margins at
the bottom/left/top/right of the plot.
> x <- matrix(c(1000, 900, 1500, 4400, 800, 2100, 1700, 2900, 3800), nrow = 3, ncol =
3)
> years <- c(“2011”, “2012”, “2013”)
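A sketch completing the horizontal version under these settings (the margin values are assumed):
> par(las = 1, mar = c(4, 6, 3, 2))
> barplot(x, names.arg = years, horiz = TRUE, col = c("red", "blue", "green"))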
Extending this to multiple variables just requires a tweak to the formula, and
passing stack = TRUE to make a stacked plot as in Fig. 4.25.
> barchart(mtcars$mpg ~ mtcars$disp + mtcars$qsec + mtcars$hp, mtcars,
    stack = TRUE)
In the ggplot2 graphics the bar chart is drawn by adding the function
geom_bar() to the function ggplot() as in Fig. 4.26. Like base, ggplot2 defaults to
vertical bars; adding the function coord_flip() swaps this. We must pass the
argument stat = "identity" to the function geom_bar().
> ggplot(mtcars, aes(mtcars$mpg, mtcars$disp)) + geom_bar(stat = “identity”) +
coord_flip()
HIGHLIGHTS
Exploratory Data Analysis (EDA) shows how to use visualisation and
transformation to explore data in a systematic way.
The main graphical packages are base, lattice and ggplot2.
In R the pie chart is created using the pie() function.
A 3D Pie Chart can be drawn using the package plotrix which uses the
function pie3D().
The basic scatter plot in the base graphics system can be obtained by
using the plot() function.
We use the arguments col and pch (values between 1 and 25) in the plot()
function to specify colour and plot pattern.
The layout() function is used to control the layout of multiple plots in
the matrix.
CHAPTER 5
STATISTICAL ANALYSIS USING R
OBJECTIVES
The basic statistical measures minimum, maximum, mean and median can be
obtained in R using the functions min(), max(), mean() and median() respectively.
Let us use the dataset named mtcars that is available in R by default
to understand these statistical measures.
> data(mtcars)
> colnames(mtcars)
[1] “mpg” “cyl” “disp” “hp” “drat” “wt” “qsec” “vs” “am” “gear” “carb”
> min(mtcars$cyl)
[1] 4
> max(mtcars$cyl)
[1] 8
> mean(mtcars$cyl)
[1] 6.1875
> median(mtcars$cyl)
[1] 6
All the above results can also be obtained with the single function summary(),
which can be applied to all the fields of the dataset in one shot. The range() function
gives the minimum and maximum values of a numeric field at one go.
> summary(mtcars$cyl)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.000 4.000 6.000 6.188 8.000 8.000
> range(mtcars$cyl)
[1] 4 8
5.1.1. Mean
Mean is calculated by taking the sum of the values and dividing with the number
of values in a data series. The function mean() is used to calculate this in R. The
basic syntax for calculating mean in R is given below along with its parameters.
mean(x, trim = 0, na.rm = FALSE, ...)
x - numeric vector
trim - to drop some observations from both end of the sorted vector
na.rm - to remove the missing values from the input vector
> x <- c(45, 56, 78, 12, 3, -91, -45, 15, 1, 24)
> mean(x)
[1] 9.8
When trim parameter is supplied, the values in the vector get sorted and then
the required numbers of observations are dropped from calculating the mean.
When trim = 0.2, 2 values from each end will be dropped from the calculations to
find mean. In this case the sorted vector is (-91, -45, 1, 3, 12, 15, 24, 45, 56, 78)
and the values removed from the vector for calculating mean are (−91, −45) from
left and (56, 78) from right.
> mean(x, trim = 0.2)
[1] 16.66667
If there are missing values, then the mean() function returns NA. To drop the
missing values from the calculation use na.rm = TRUE, which means remove the
NA values.
> x <- c(45, 56, 78, 12, 3, -91, NA, -45, 15, 1, 24, NA)
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 9.8
5.1.2. Median
The middle most value in a data series is called the median. The median() function
is used in R to calculate this value. The basic syntax for calculating median in R is
given below along with its parameters.
median(x, na.rm = FALSE)
x - numeric vector
na.rm - to remove the missing values from the input vector
> x <- c(45, 56, 78, 12, 3, -91, -45, 15, 1, 24)
> median(x)
[1] 13.5
Since the vector has an even number of values, the median is the average of the two middle values of the sorted vector, (12 + 15)/2 = 13.5.
5.1.3. Mode
The mode is the value that has highest number of occurrences in a set of data.
Unlike mean and median, mode can have both numeric and character data. R does
not have a standard in-built function to calculate mode. So we create a user function
to calculate mode of a data set in R. This function takes the vector as input and
gives the mode value as output.
Mode <- function(x)
{
y <- unique(x)
y[which.max(tabulate(match(x, y)))]
}
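As a quick check of this user-defined function, consider a small hypothetical vector (the data below is illustrative, not from the text); note that the function also works on character data.
> x <- c(2, 1, 2, 3, 1, 2, 4)
> Mode(x)
[1] 2
> Mode(c("apple", "banana", "apple"))
[1] "apple"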
The function unique() returns a vector, data frame or array like x but with
duplicate elements/rows removed. The function match() returns a vector of
the positions of (first) matches of its first argument in its second. The function
tabulate() takes the integer-valued vector bin and counts the number of times
each integer occurs in it. The function which.max() determines the location, i.e.,
index of the (first) maximum of a numeric (or logical) vector.
The functions to calculate the standard deviation, variance and the mean absolute
deviation are sd(), var() and mad() respectively.
> sd(mtcars$cyl)
[1] 1.785922
> var(mtcars$cyl)
[1] 3.189516
> mad(mtcars$cyl)
[1] 2.9652
The functions cor() and cov() are used to find the correlation and covariance between two numeric fields respectively. In the example below, the value shows that there is a negative correlation between the two numeric fields.
> cor(mtcars$mpg, mtcars$cyl)
[1] -0.852162
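The covariance between the same two fields can be obtained in the same way; consistent with the negative correlation above (cov = cor × sd(mpg) × sd(cyl)), it is also negative.
> cov(mtcars$mpg, mtcars$cyl)
[1] -9.172379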
There are other statistics functions such as pmin() and pmax() [parallel equivalents of min() and max() respectively, which compare two or more vectors element-wise], cummin() [cumulative minimum value], cummax() [cumulative maximum value], cumsum() [cumulative sum] and cumprod() [cumulative product]. Called on a single vector, pmin() and pmax() simply return that vector unchanged, as the output below shows; a two-vector example is sketched after the output.
> nrow(mtcars)
[1] 32
> mtcars$cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> pmin(mtcars$cyl)
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> pmax(mtcars$cyl)
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> cummin(mtcars$cyl)
[1] 6 6 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
> cummax(mtcars$cyl)
[1] 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
> cumsum(mtcars$cyl)
[1] 6 12 16 22 30 36 44 48 52 58 64 72 80 88 96 104 112 116 120 124
[21] 128 136 144 152 160 164 168 172 180 186 194 198
> cumprod(mtcars$cyl)
[1] 6.000000e+00 3.600000e+01 1.440000e+02 8.640000e+02 6.912000e+03
4.147200e+04
[7] 3.317760e+05 1.327104e+06 5.308416e+06 3.185050e+07 1.911030e+08
1.528824e+09
[13] 1.223059e+10 9.784472e+10 7.827578e+11 6.262062e+12 5.009650e+13
2.003860e+14
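Since pmin() and pmax() are parallel functions, a more instructive call compares two columns element-wise; a minimal sketch (output omitted):
# element-wise minimum of the cyl and gear columns, one value per car
> pmin(mtcars$cyl, mtcars$gear)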
5.2. Summary Statistics
The summary() function can be applied on the entire dataset to get all the statistical values of all the numeric fields at once.
> summary(mtcars)
mpg cyl disp hp drat
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930
wt qsec vs am gear
Min. :1.513 Min. :14.50 Min. :0.0000 Min. :0.0000 Min. :3.000
1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000
Median :3.325 Median :17.71 Median :0.0000 Median :0.0000 Median :4.000
Mean :3.217 Mean :17.85 Mean :0.4375 Mean :0.4062 Mean :3.688
3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000
Max. :5.424 Max. :22.90 Max. :1.0000 Max. :1.0000 Max. :5.000
carb
Min. :1.000
1st Qu.:2.000
Median :2.000
Mean :2.812
3rd Qu.:4.000
Max. :8.000
5.3. Normal Distribution
In a random collection of data from independent sources, it is generally observed that the distribution of data is normal. This means that, on plotting a graph with the values of the variable on the horizontal axis and the counts of the values on the vertical axis, we get a bell shaped curve. The centre of the curve represents the mean of the data set, and half of the values lie to the left of the mean while the other half lie to the right. This is referred to as the normal distribution in statistics. R has four in-built functions to generate the normal distribution. They are described below.
dnorm(x, mean, sd)
pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)
x - vector of numbers
p - vector of probabilities
n - sample size
mean - mean (default value is 0)
sd - standard deviation (default value is 1)
5.3.1. dnorm()
For a given mean and standard deviation, this function gives the height of the
probability distribution. Below is an example in which the result of the dnorm()
function is plotted in a graph in Fig. 5.1.
> x <- seq(-5,5, by = 0.05)
> y <- dnorm(x, mean = 1.5, sd = 0.5)
> plot(x, y)
5.3.2. pnorm()
The pnorm() function gives the cumulative distribution function, that is, the probability of a normally distributed random number being less than the value of a given number, as in Fig. 5.2.
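A minimal sketch mirroring the dnorm() example above (the expected plot is the S-shaped cumulative curve; the figure itself is not reproduced here):
> x <- seq(-5, 5, by = 0.05)
> y <- pnorm(x, mean = 1.5, sd = 0.5)
> plot(x, y)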
5.3.3. qnorm()
The qnorm() function takes the probability value as input and returns a
cumulative value that matches the probability value. Below is an example in
which the result of the qnorm() function is plotted in a graph as in Fig. 5.3.
> x <- seq(0, 1, by = 0.02)
> y <- qnorm(x, mean = 2, sd = 1)
> plot(x, y)
5.3.4. rnorm()
The rnorm() function generates a vector of random numbers which are normally distributed, given the sample size, mean and standard deviation.
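A minimal sketch of its usage, reusing the parameters of the earlier examples; hist() shows the bell shape of the generated sample.
> x <- rnorm(500, mean = 2, sd = 0.5)
> hist(x, main = "Normal Distribution")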
5.4. Binomial Distribution
The binomial distribution model finds the probability of success of an event which has only two possible outcomes in a series of experiments. For example, tossing a coin always gives either a head or a tail. Using the binomial distribution, the probability of finding exactly 3 heads when tossing a coin 10 times can be estimated. R has four in-built functions to generate the binomial distribution. They are described below.
dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)
x - vector of numbers
p - vector of probabilities
n - sample size
size – number of trials
prob – probability of success of each trial
5.4.1. dbinom()
This function gives the probability density distribution at each point. Below is an
example in which the result of the dbinom() function is plotted in a graph as in Fig. 5.5.
> x <- seq(0, 25, by = 1)
> y <- dbinom(x,25,0.5)
> plot(x, y)
5.4.2. pbinom()
The pbinom() function gives the cumulative probability of an event, that is, the probability of observing up to a given number of successes. It returns a single value representing a probability.
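As a sketch (the exact example from the original is not preserved here), the probability of getting 26 or fewer heads from 51 tosses of a fair coin can be computed as:
> x <- pbinom(26, 51, 0.5)   # cumulative probability of up to 26 successes
> x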
5.4.3. qbinom()
The function qbinom() takes the probability value as input and returns a number
whose cumulative value matches the probability value. The below example finds how
many heads will have a probability of 0.5 will come out when a coin is tossed 50 times.
> x <- qbinom(0.5, 50, 1/2)
> x
[1] 25
5.4.4. rbinom()
The function rbinom() returns the required number of random values of the given
probability from a given sample. The below code is to find 5 random values from
a sample of 50 with a probability of 0.5.
> x <- rbinom(5,50,0.5)
> x
[1] 24 21 22 29 32
5.5. Correlation Analysis
To evaluate the relation between two or more variables, the correlation test is used.
Correlation coefficient in R can be computed using the functions cor() or
cor.test(). The basic syntax for the correlation functions in R are as below.
cor(x, y, method)
cor.test(x, y, method)
Consider the data set “mtcars” available in the R environment. Let us first find
the correlation between the horse power (“hp”) and the mileage per gallon
(“mpg”) of the cars and then between the horse power (“hp”) and the cylinder
displacement ("disp") of the cars. From the test we find that the horse power ("hp") and the mileage per gallon ("mpg") of the cars have a negative correlation (-0.7761684), and the horse power ("hp") and the cylinder displacement ("disp") have a positive correlation (0.7909486).
> cor(mtcars$hp, mtcars$mpg, method = "pearson")
[1] -0.7761684
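The second correlation mentioned above can be obtained the same way:
> cor(mtcars$hp, mtcars$disp, method = "pearson")
[1] 0.7909486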
The correlation results can also be viewed graphically as in Fig. 5.6. The corrplot()
function can be used to analyze the correlation between the various columns of a
dataset, say mtcars. After this, the correlation between individual columns can be
compared by plotting it in separate graphs as in Fig. 5.7 and Fig. 5.8.
> library(corrplot)
> M <- cor(mtcars)
> corrplot(M, method = "number")
It can be noted that the graph with negative correlation (Fig. 5.7) has the dots
from top left corner to bottom right corner and the graph with positive
correlation (Fig. 5.8) has the dots from the bottom left corner to the top right
corner.
5.6. Regression Analysis
The function lm() creates the relationship model between the predictor and the response variable. The basic syntax for the lm() function in linear regression is as given below.
lm(formula, data)
> x <- c(1510, 1740, 1380, 1860, 1280, 1360, 1790, 1630, 1520, 1310)
> y <- c(6300, 8100, 5600, 9100, 4700, 5700, 7600, 7200, 6200, 4800)
> model <- lm(y~x)
> model
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-3845.509 6.746
> summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3845.5087 804.9013 -4.778 0.00139 **
x 6.7461 0.5191 12.997 1.16e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The basic syntax for the function predict() in linear regression is as given
below.
predict(object, newdata)
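As a sketch of its usage with the model fitted above (a hypothetical new x value; from the printed coefficients the prediction is -3845.509 + 6.746 × 1700 ≈ 7622.7):
> newdata <- data.frame(x = 1700)
> predict(model, newdata)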
The general mathematical equation for multiple regression is y = a + b1x1 + b2x2 + ... + bnxn, where y is the response variable, a, b1, b2...bn are the coefficients and x1, x2, ...xn are the predictor variables.
In R, the lm() function is used to create the regression model. The model
determines the value of the coefficients using the input data. Next we can predict
the value of the response variable for a given set of predictor variables using
these coefficients. The relationship model is built between the predictor variables
and the response variables. The basic syntax for lm() function in multiple
regression is as given below.
lm(y ~ x1+x2+x3..., data)
Consider the data set “mtcars” available in the R environment. This dataset
presents the data of different car models in terms of mileage per gallon (“mpg”),
cylinder displacement (“disp”), horse power (“hp”), weight of the car (“wt”) and
some more parameters. This model establishes the relationship between “mpg” as
a response variable with “disp”, “hp” and “wt” as predictor variables. We create a
subset of these variables from the mtcars data set for this purpose.
> model2 <- lm(mpg ~ disp + hp + wt, data = mtcars[, c("mpg", "disp", "hp", "wt")])
> model2
Call:
lm(formula = mpg ~ disp + hp + wt, data = mtcars[, c("mpg", "disp",
"hp", "wt")])
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
> a <- coef(model2)[1]
> a
(Intercept)
37.10551
> b1 <- coef(model2)[2]
> b2 <- coef(model2)[3]
> b3 <- coef(model2)[4]
> b1
disp
-0.0009370091
> b2
hp
-0.03115655
> b3
wt
-3.800891
We create the mathematical equation below, from the above intercept and
coefficient values.
Y = a+b1*x1+b2*x2+b3*x3
Y = (37.10551)+(-0.0009370091)*x1+(-0.03115655)*x2+(-3.800891)*x3
We can use the regression equation created above to predict the mileage
when a new set of values for displacement, horse power and weight is provided.
For a car with disp = 160, hp = 110 and wt = 2.620 the predicted mileage is
given by:
Y = (37.10551)+(-0.0009370091)*160+(-0.03115655)*110+(-3.800891)*2.620
= 23.57003
The above value can also be calculated using the function predict() for the
given new value.
> newdata <- data.frame(disp = 160, hp = 110, wt = 2.620)
> mileage <- predict(model2, newdata)
> mileage
1
23.57003
The general mathematical equation for logistic regression is y = 1/(1 + e^-(a + b1x1 + b2x2 + ... + bnxn)), where y is the response variable, a and b1, b2...bn are the coefficients, and x1, x2, ...xn are the predictor variables.
The function used to create the logistic regression model is the glm() function. The basic syntax for the glm() function in logistic regression is as given below.
glm(formula, data, family)
The in-built data set “mtcars” describes different models of a car with their
various engine specifications. In “mtcars” data set, the column am describes the
transmission mode with a binary value. A logistic regression model is built between
the columns “am” and 3 other columns - hp, wt and cyl.
> model3 <- glm(am ~ cyl + hp + wt, data = mtcars[, c("am", "cyl", "hp", "wt")],
family = binomial)
> model3
Coefficients:
(Intercept) cyl hp wt
19.70288 0.48760 0.03259 -9.14947
> summary(model3)
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = mtcars[,
c("am", "cyl", "hp", "wt")])
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
The p-value in the summary is greater than 0.05 for the variables "cyl" (0.6491) and "hp" (0.0840), so these variables are considered insignificant in contributing to the value of the variable "am". Only the weight "wt" (0.0276) impacts the "am" value in this regression model.
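As a sketch, the fitted model can be used with predict() to estimate the probability of manual transmission for a hypothetical car (type = "response" returns the probability rather than the log-odds):
> newcar <- data.frame(cyl = 6, hp = 110, wt = 2.62)   # illustrative values
> predict(model3, newcar, type = "response")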
The general mathematical equation for Poisson regression is log(y) = a + b1x1 + b2x2 + ... + bnxn, where y is the response variable, a and b1, b2...bn are the coefficients, and x1, x2, ...xn are the predictor variables. The function used to create the Poisson regression model is again the glm() function. The basic syntax for the glm() function in Poisson regression is as given below.
glm(formula, data, family)
The data set “warpbreaks” describes the effect of wool type and tension on the
number of warp breaks per loom. Let’s consider “breaks” as the response variable
which is a count of number of breaks. The wool “type” and “tension” are taken as
predictor variables. The model so built shows the below results.
> model4 <- glm(formula = breaks ~ wool + tension, data = warpbreaks,
family = poisson)
> summary(model4)
Call:
glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6871 -1.6503 -0.4269 1.1902 4.2616
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.69196 0.04541 81.302 < 2e-16 ***
woolB -0.20599 0.05157 -3.994 6.49e-05 ***
tensionM -0.32132 0.06027 -5.332 9.73e-08 ***
tensionH -0.51849 0.06396 -8.107 5.21e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the summary we look for a p-value in the last column less than 0.05 to consider the predictor variable as having an impact on the response variable. As seen, the wool type B and the tension types M and H have an impact on the count of breaks.
p-value of woolB = 6.49e-05 = 0.0000649 < 0.05
p-value of tensionM = 9.73e-08 = 0.0000000973 < 0.05
p-value of tensionH = 5.21e-16 = 0.000000000000000521 < 0.05
When modelling real world data for regression analysis, we observe that it is rarely the case that the equation of the model is a linear equation giving a linear graph. The equation of a model of real world data often involves mathematical functions of higher degree, and in such a scenario the plot of the model gives a curve rather than a line. Both linear and non-linear regression aim to adjust the values of the model's parameters to find the line or curve that comes nearest to the data. On finding these values, we will be able to estimate the response variable with good accuracy.
Consider the non-linear model y = b1*x^2 + b2. Let us assume the initial coefficients to be 1 and 3 and fit these values into the nls() function.
> x <- c(1.6, 2.1, 2, 2.23, 3.71, 3.25, 3.4, 3.86, 1.19, 2.21)
> y <- c(5.19, 7.43, 6.94, 8.11, 18.75, 14.88, 16.06, 19.12, 3.21, 7.58)
> plot(x, y)
> model <- nls(y ~ b1*x^2+b2, start = list(b1 = 1,b2 = 3))
> new <- data.frame(x = seq(min(x), max(x), len = 100))
> lines(new$x, predict(model, newdata = new))
> res1 <- sum(resid(model)^2)
> res1
[1] 1.081935
> res2 <- confint(model)
> res2
2.5% 97.5%
b1 1.137708 1.253135
b2 1.497364 2.496484
We can conclude that the value of b1 is close to 1 (its interval being 1.137708 to 1.253135), while the value of b2 is close to 2 (1.497364 to 2.496484) and not to the initial guess of 3.
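The one-way ANOVA output below is for the built-in PlantGrowth dataset; a sketch of the call assumed to have produced it is:
> data(PlantGrowth)
> fit <- aov(PlantGrowth$weight ~ PlantGrowth$group)
> anova(fit)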
Response: PlantGrowth$weight
Df Sum Sq Mean Sq F value Pr(>F)
PlantGrowth$group 2 3.7663 1.8832 4.8461 0.01591 *
Residuals 27 10.4921 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result shows that the F-value is 4.8461 and the p-value is 0.01591, which is less than 0.05 (5% level of significance). Hence the null hypothesis is rejected; that is, the treatment group has an effect on the plant growth (plant weight).
For two-way ANOVA, consider the below example of revenues collected for 5
years in each month. We want to see if the revenue depends on the Year and / or
Month or if they are independent of these two factors.
> revenue = c(15,18,22,23,24, 22,25,15,15,14, 18,22,15,19,21,
+ 23,15,14,17,18, 23,15,26,18,14, 12,15,11,10,8, 26,12,23,15,18,
+ 19,17,15,20,10, 15,14,18,19,20, 14,18,10,12,23, 14,22,19,17,11,
+ 21,23,11,18,14)
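The factors and the fitted model used by anova(fit) below are not shown above; a sketch of the assumed setup (12 months × 5 years, with hypothetical year labels) is:
> months <- factor(rep(month.abb, each = 5))   # 12 months, 5 observations each
> years <- factor(rep(1:5, times = 12))        # 5 years per month (labels hypothetical)
> fit <- aov(revenue ~ months + years)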
> anova(fit)
Analysis of Variance Table
Response: revenue
Df Sum Sq Mean Sq F value Pr(>F)
months 11 308.45 28.041 1.4998 0.1660
years 4 44.17 11.042 0.5906 0.6712
Residuals 44 822.63 18.696
The significance of the difference between months is F = 1.4998. This value is lower than the tabulated critical value, and indeed the p-value (0.1660) > 0.05. So we cannot reject the null hypothesis: the means of revenue across the months are not shown to differ, hence we conclude that the variable "months" has no demonstrable effect on revenue.
The significance of the difference between years is F = 0.5906. This value is lower than the tabulated critical value, and indeed the p-value (0.6712) > 0.05. So we again fail to reject the null hypothesis: the means of revenue across the years are not shown to differ, hence the variable "years" has no demonstrable effect on revenue.
Analysis of variance (ANOVA) is a collection of statistical models and procedures used to observe differences between the means of three or more variables in a population, based on the sample presented. ANCOVA (analysis of covariance) is a type of ANOVA model: a general linear model with a continuous outcome variable and two or more predictor variables, of which at least one is continuous and at least one is categorical.
Consider the R built-in data set "mtcars". In this dataset the field "am" represents the type of transmission and takes the values 0 or 1. The miles per gallon value "mpg" of a car can depend on the transmission type besides the value of horse power "hp". We study the effect of the value of "am" on the regression between "mpg" and "hp", using the aov() function followed by the anova() function to compare the multiple regressions.
Consider the fields “mpg”, “hp” and “am” from the data set “mtcars”. The
variable “mpg” is the response variable, and the variable “hp” is chosen as the
predictor variable and “am” as the categorical variable. We create a regression model
taking “hp” as the predictor variable and “mpg” as the response variable taking
into account the interaction between “am” and “hp”.
The model with interaction between categorical variable and predictor variable
is given as below.
> res1 <- aov(mtcars$mpg ~ mtcars$hp * mtcars$am, data = mtcars)
> summary(res1)
Df Sum Sq Mean Sq F value Pr(>F)
As the p-value for both variables is less than 0.05, the result shows that both horse power "hp" and transmission type "am" have a significant effect on miles per gallon "mpg". But the interaction between these two variables is not significant, as its p-value is more than 0.05.
Now we can compare the two models to conclude if the interaction of the variables is truly insignificant. For this we use the anova() function.
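The second model, res2, is not shown above; a sketch of the assumed call, the same regression without the interaction term, is:
> res2 <- aov(mtcars$mpg ~ mtcars$hp + mtcars$am, data = mtcars)
> summary(res2)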
> finres <- anova(res1,res2)
> finres
As the p-value (0.9806) is greater than 0.05, we conclude that the interaction between horse power "hp" and transmission type "am" is not significant. So the mileage per gallon "mpg" will depend in a similar manner on the horse power of the car in both automatic and manual transmission modes.
The function chisq.test() is used for performing the Chi-Square test on the given data. The R syntax for the Chi-Square test is as below.
chisq.test(data)
We will take the Cars93 data in the "MASS" library, which represents the sales of different models of car in the year 1993. The factor variables in this dataset can be considered as categorical variables. In the below model the variables "AirBags" and "Type" are considered. Here we aim to find out if there is any significant correlation between the types of car sold and the type of air bags they have. If a correlation is observed, we can estimate which types of cars sell better with which types of air bags.
> library(MASS)
> cardata = table(Cars93$AirBags, Cars93$Type)
> chi <- chisq.test(cardata)
> chi
Pearson's Chi-squared test
data: cardata
X-squared = 33.001, df = 10, p-value = 0.0002722
The result shows a p-value (0.0002722) of less than 0.05, which indicates a strong correlation between the "AirBags" and "Type" of the cars sold.
Hypothesis testing is of two types: one-tailed tests and two-tailed tests. A one-tailed test checks whether a parameter deviates from the hypothesised value in one particular direction only, while a two-tailed test checks for a deviation in either direction. In both cases the test decides whether the null hypothesis is rejected at a given significance level.
The p-value is 0.5743 > 0.05. Hence, at the .05 significance level, we do not reject the null hypothesis that the proportion of voters in the population is above 60% this year.
Suppose 12% of the apples taken from an orchard last year were rotten. This year, 30 out of 214 apples turn out to be rotten. Is it possible to reject, at the .05 level of significance, the null hypothesis that the proportion of rotten apples in the harvest stays below 12% this year?
> prop.test(30, 214, p=.12, alt="greater", correct=FALSE)
The p-value is 0.1817 > 0.05. Hence, at the .05 significance level, we do not reject the null hypothesis that the proportion of rotten apples in the harvest stays below 12% this year.
Suppose 12 heads turn up out of 20 trials in a coin toss. At the .05 significance level, is it possible to reject the null hypothesis that the coin toss is fair?
> prop.test(12, 20, p=0.5, correct=FALSE)
The p-value is 0.3711 > 0.05. Hence, at the .05 significance level, we do not reject the null hypothesis that the coin toss is fair.
HIGHLIGHTS
Most important functions to get the statistical measures are available in
the packages base and stats.
The basic statistical measures are obtained by the functions min(),
max(), mean() and median().
The basic statistical measures can also be obtained by one function
summary().
R does not have a standard in-built function to calculate mode and
hence we create a user function to calculate mode.
OBJECTIVES
6.2. Clustering Using R
Clustering is a process of grouping a set of objects into clusters in such a way that objects in the same cluster are more similar to each other than to those in other clusters. The various clustering techniques are the Partitioning Method, Hierarchical Method, Density-Based Method, Grid-Based Method, Model-Based Method and Constraint-Based Method. R provides various packages and functions that implement many clustering techniques such as K-Means, Hierarchical, Fuzzy C-Means, PAM, SOM, CLARA, CLUES, etc., bundled in different library packages. We can see the below table listing the clustering techniques available in R along with their corresponding packages and functions.
R has the default dataset called iris, which will be used in the example below. The cluster number is set to 3, as the number of distinct species in the iris dataset is 3 ("setosa", "versicolor", "virginica"). For the purpose of initial manipulation let us copy the iris dataset into another dataset called iris1. Then we remove the column "Species" from the dataset iris1 and apply k-means clustering on it.
The result of the clustering is then compared with the "Species" column of the dataset iris to see if similar objects are grouped together. The result shows that the species "setosa" can be clustered separately, while the other two species have some overlapping objects and hence tend to be clustered together. The functions kmeans(), table(), plot() and points() are used below for getting and plotting the results. It can be noted that the plots can be drawn with any two dimensions of the data at a time (e.g. Sepal.Length vs. Sepal.Width). Also, the results of k-means clustering can vary from run to run due to the random selection of initial cluster centres.
> iris1 <- iris
> iris1$Species <- NULL
> km <- kmeans(iris1, 3)
> km
K-means clustering with 3 clusters of sizes 50, 38, 62
Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.006000 3.428000 1.462000 0.246000
2 6.850000 3.073684 5.742105 2.071053
3 5.901613 2.748387 4.393548 1.433871
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[41] 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3
[81] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3
[121] 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2 2 3
Within cluster sum of squares by cluster:
[1] 15.15100 23.87947 39.82097
(between_SS / total_SS = 88.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
> table(iris$Species, km$cluster)
1 2 3
setosa 50 0 0
versicolor 0 2 48
virginica 0 36 14
> plot(iris1[c("Sepal.Length", "Sepal.Width")], col = km$cluster)
> points(km$centers[, c("Sepal.Length", "Sepal.Width")],
col = 1:3, pch = 8, cex = 2)
R has the pam() function in the cluster package and the pamk() function in the fpc package to do k-medoids clustering. The k-means and k-medoids clustering produce almost the same result; the only difference is that in k-means the cluster is represented by the cluster centre, while in k-medoids the cluster is represented by the object closest to the cluster centre. In the presence of outliers, k-medoids is more robust than k-means clustering. Partitioning Around Medoids (PAM) is the classic algorithm applied for k-medoids clustering. The PAM algorithm is not efficient in handling large datasets; CLARA is an enhanced version of PAM which performs better on large datasets. For the functions pam() and clara() in the package cluster we need to specify the number of clusters, but for the function pamk() in the package fpc we need not specify it. In that case the number of clusters is estimated using the silhouette width.
> library(fpc)
> pmk <- pamk(iris1)
> pmk$nc
[1] 2
> table(pmk$pamobject$clustering, iris$Species)
In the left side chart of Fig. 6.3, we can see that there are two clusters, one for the species "setosa" and the other for the mixture of the species "versicolor" and "virginica". The right side chart of Fig. 6.3 shows the silhouette width, which decides the number of clusters (2 clusters in this case). The silhouette width is shown to be between 0.81 and 0.62 (nearing 1), which means that the observations are well clustered. If the silhouette width is around 0, the observations lie between two clusters, and if it is less than 0, the observations are placed in the wrong clusters.
Now, let us use the pam() function from the cluster package to cluster the iris
data and plot the results.
> library(cluster)
> pm <- pam(iris1, 3)
> table(pm$clustering, iris$Species)
setosa versicolor virginica
1 50 0 0
2 0 48 14
3 0 2 36
> layout(matrix(c(1, 2), 1, 2))
> plot(pm)
> layout(matrix(1))
In the left chart of Fig. 6.4, we can see three clusters: cluster 1 with only the "setosa" species, cluster 2 with mostly "versicolor" species and a few "virginica" species, and cluster 3 with mostly "virginica" species and a few "versicolor" species. In both the graphs of Fig. 6.3 and Fig. 6.4, the lines between the clusters show the distance between the clusters. From the above we can say that the choice of the clustering function used in R depends on the target problem and the domain knowledge available.
Hierarchical clustering in R can be done using the function hclust() in the stats package. As a hierarchical clustering plot will be very crowded if the data is large, we create a sample of the iris data and do the clustering and plotting using this sample. The dendrogram is plotted as in Fig. 6.5; the function rect.hclust() draws rectangles around the clusters, and the cutree() function cuts the tree into the chosen number of groups.
> i <- sample(1:dim(iris)[1], 50)
> iris3 <- iris[i,]
> iris3$Species <- NULL
> hc <- hclust(dist(iris3), method = "ave")
> plot(hc, hang = -1, labels = iris$Species[i])
> rect.hclust(hc, k=3)
> grp <- cutree(hc, k=3)
The resultant graph Fig. 6.5, shows that the first cluster has just the species
“setosa”, the second cluster has the species “virginica” and the third cluster has a
mix of both the species “versicolor” and “virginica”.
The function dbscan() in the package fpc is used for density-based clustering. In density-based clustering, objects are grouped into one cluster if they are connected to one another by a densely populated area, while objects in sparse regions are treated as noise. There are two main parameters in the function dbscan(): eps and MinPts. The parameter MinPts stands for the minimum number of points in a cluster and the parameter eps defines the reachability distance. Standard values are given below for the parameters eps and MinPts.
> library(fpc)
> iris4 <- iris
> iris4$Species <- NULL
> db <- dbscan(iris4, eps = 0.42, MinPts = 5)
> table(db$cluster, iris$Species)
3 0 3 33
> plot(db, iris4)
In the result above we can see that there are three clusters, cluster 1, cluster 2
and cluster 3. The cluster 0 corresponds to the outliers in the data.
In the graph of Fig. 6.6, the black circles represent the outliers. The clustering results can also be plotted in a scatter plot like in k-means clustering. This can be done using the function plot() as in Fig. 6.7, or the function plotcluster() of the fpc package as in Fig. 6.8. The black circles in Fig. 6.7 and the black zeros "0" in Fig. 6.8 show the outliers.
> plot(db, iris4[c(1,2)])
> plotcluster(iris4, db$cluster)
The clustering model can be used to label new data based on the similarity between the new data and the clusters. We take a sample of 10 records from the iris dataset, add some outliers to it and try to label the new dataset. The noise is generated using the function runif().
> set.seed(435)
> i <- sample(1:dim(iris)[1], 10)
Thus from the above results and the plot in Fig. 6.9, we can see that out of
the 10 new data, 7 (3 + 2 + 2) are assigned to the correct clusters and there is
one outlier data in it.
6.3. Classification
Classification is a data mining technique that assigns items in a sample to target labelled classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify the age group of people as children, youth or old. The various classification techniques are Decision Tree based Methods, Rule-based Methods, Memory based Reasoning, Neural Networks, Naïve Bayes and Bayesian Belief Networks, and Support Vector Machines. R provides various packages and functions that implement many classification techniques such as SVM, kNN, Decision Trees, Naive Bayes, etc., bundled in different library packages. We can see the below table listing the classification techniques available in R along with their corresponding packages and functions.
A decision tree is built by repeatedly breaking the dataset into smaller and smaller subsets. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, while a leaf node represents a classification or decision. The topmost decision node in a tree is called the root node and corresponds to the best predictor. Decision trees can handle both categorical and numerical data.
The function ctree() in the package party can be used to build the decision tree for the given data. We consider the iris dataset available in R for our analysis. This dataset has four attributes, namely Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, using which we can predict the Species of a flower. After applying the function ctree() and getting the decision tree model, we can do the prediction using the function predict() for given new data, so that we can categorize which Species the flowers belong to.
Before applying the decision tree function, the iris dataset is first split into training and test subsets. For training we choose 80% of the data randomly and the remaining 20% is used for testing. The seed for sampling is set to a fixed number below so that the split of the data is reproducible. After creating the decision tree using the training data, prediction is done on the test data. The results of the built tree can be viewed as a text result as well as a decision tree plot as in Fig. 6.10.
> library(party)
> set.seed(1234)
> i <- sample(2, nrow(iris), replace=TRUE, prob=c(0.8, 0.2))
> train <- iris[i==1,]
> test <- iris[i==2,]
> form <- Species ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width
> dt <- ctree(form, data=train)
> table(predict(dt), train$Species)
3) Petal.Width > 1.7
  7)* weights = 35
The above decision tree result can also be drawn as a simple decision tree as in
Fig. 6.11.
> plot(dt, type = "simple")
In the first decision tree (Fig. 6.10) the number of training data under each
species is listed as bar graph, but in the second decision tree (Fig. 6.11) the same
is listed using variable y. For example, node 2 is labelled as “n = 42, y(1, 0, 0)”,
which means that it contains 42 training instances and all of them belong to the
species “setosa”.
Now, the predicted model will be tested with the test data to see if the instances
are correctly classified.
> pred <- predict(dt, newdata = test)
> table(pred, test$Species)
pred setosa versicolor virginica
setosa 8 0 0
versicolor 0 7 1
virginica 0 0 11
The function rpart() in the package rpart can be used to build the decision tree for the given data. We consider the bodyfat dataset available in the package TH.data of R for our analysis. After applying the function rpart() and getting the decision tree model, we can do the prediction using the function predict() for given new data, so that we can predict the body fat (DEXfat) value it corresponds to.
Before applying the decision tree function, the dataset is first split into
training and test subsets. For training we choose 70% of the data randomly and the
remaining 30% is used for testing. After creating the decision tree using the
training data, prediction is done on the test data. The decision tree is shown in the
Fig. 6.12 and the details of the split are listed below.
> library(TH.data)
> library(rpart)
> data("bodyfat", package = "TH.data")
> set.seed(1234)
> i <- sample(2, nrow(bodyfat), replace=TRUE, prob=c(0.7, 0.3))
> train <- bodyfat[i==1,]
> test <- bodyfat[i==2,]
> form <- DEXfat ~ age + waistcirc + hipcirc +
elbowbreadth + kneebreadth
> dt2 <- rpart(form, data = train, control = rpart.control(minsplit = 10))
> plot(dt2)
> text(dt2, use.n=T, all = T, cex = 1)
The predicted values are compared with the observed values and the graph in Fig.
6.13 shows that the modelling is good as most points lie close to the diagonal line.
Similarly, we can also apply the same rpart() function for the iris dataset as
before splitting the data into 80% training and 20% test data. The obtained
model (Fig. 6.14) can then be used for predicting the species of the test data. The
prediction shows that out of the 8 “setosa” species, all are correctly classified, out
of the 8 “versicolor” species, 7 are correctly classified and 1 is incorrectly
classified as “virginica” and out of the 11 “virginica” species, 10 are correctly
classified and 1 is incorrectly classified as “versicolor”.
> set.seed(1234)
> i <- sample(2, nrow(iris), replace=TRUE, prob=c(0.8, 0.2))
> train <- iris[i==1,]
> test <- iris[i==2,]
> form <- Species ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width
> dt <- rpart(form, data = train, control = rpart.control(minsplit = 10))
> table(predict(dt, type = "class"), train$Species)
setosa versicolor virginica
setosa 42 0 0
versicolor 0 42 4
virginica 0 1 34
> plot(dt)
> text(dt, use.n=TRUE, all=TRUE)
virginica 0 3 33
> varImpPlot(rf)
Finally, the built random forest is tested on the test data, and the result is checked with the functions table() and margin(); a sketch of this step follows. The margin of a data point is the proportion of votes for the correct class minus the maximum proportion of votes for the other classes. Generally, a positive margin means correct classification (Fig. 6.17).
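A sketch of this test step, assuming the random forest rf and the train/test split used in the earlier iris examples (the fragmentary table below appears to be part of its output):
> pred <- predict(rf, newdata = test)
> table(pred, test$Species)
> plot(margin(rf, test$Species))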
versicolor 0 7 0
virginica 0 0 12
For an association rule A => B, support = P(A ∪ B) (the probability that a case contains both A and B), confidence = P(A ∪ B) / P(A), and lift = confidence / P(B), where P(A) is the percentage (or probability) of cases containing A and P(B) is the percentage (or probability) of cases containing B.
[15] {Class=3rd,Sex=Male} => {Age=Adult} 0.210 0.906 0.953
[16] {Sex=Male,Survived=Yes} => {Age=Adult} 0.154 0.921 0.969
[17] {Class=Crew,Survived=No} => {Sex=Male} 0.304 0.996 1.266
[18] {Class=Crew,Survived=No} => {Age=Adult} 0.306 1.000 1.052
[19] {Class=Crew,Sex=Male} => {Age=Adult} 0.392 1.000 1.052
[20] {Class=Crew,Age=Adult} => {Sex=Male} 0.392 0.974 1.238
[21] {Sex=Male,Survived=No} => {Age=Adult} 0.604 0.974 1.025
[22] {Age=Adult,Survived=No} => {Sex=Male} 0.604 0.924 1.175
[23] {Class=3rd,Sex=Male,Survived=No} => {Age=Adult} 0.176 0.917 0.965
[24] {Class=3rd,Age=Adult,Survived=No} => {Sex=Male} 0.176 0.813 1.034
[25] {Class=3rd,Sex=Male,Age=Adult} => {Survived=No} 0.176 0.838 1.237
[26] {Class=Crew,Sex=Male,Survived=No} => {Age=Adult} 0.304 1.000 1.052
[27] {Class=Crew,Age=Adult,Survived=No} => {Sex=Male} 0.304 0.996 1.266
[7] {Class=Crew,Sex=Female,Age=Adult} => {Survived=Yes} 0.009 0.870 2.692
[8] {Class=2nd,Sex=Female,Age=Adult} => {Survived=Yes} 0.036 0.860 2.663
[9] {Class=2nd,Sex=Male,Age=Adult} => {Survived=No} 0.070 0.917 1.354
[10] {Class=2nd,Sex=Male} => {Survived=No} 0.070 0.860 1.271
[11] {Class=3rd,Sex=Male,Age=Adult} => {Survived=No} 0.176 0.838 1.237
[12] {Class=3rd,Sex=Male} => {Survived=No} 0.192 0.827 1.222
Some rules generated above provide little or no extra information when other rules are present in the result. For example, rule 2 provides no extra knowledge in addition to rule 1, since rule 1 tells us that all 2nd-class children survived. A rule is considered redundant when it is a super rule of another rule and has the same or a lower lift. Other redundant rules in the above result are rules 4, 7 and 8, compared with rules 3, 6 and 5 respectively. We prune redundant rules with the code below.
> subset.matrix <- is.subset(rules.sorted, rules.sorted)
> subset.matrix[lower.tri(subset.matrix, diag=T)] <- FALSE
> redundant <- colSums(subset.matrix, na.rm=T) >= 1
> which(redundant)
{Class=2nd,Sex=Female,Age=Child,Survived=Yes}
2
{Class=1st,Sex=Female,Age=Adult,Survived=Yes}
4
{Class=Crew,Sex=Female,Age=Adult,Survived=Yes}
7
{Class=2nd,Sex=Female,Age=Adult,Survived=Yes}
8
> rules.pruned <- rules.sorted[!redundant]
> inspect(rules.pruned)
The above rules show that only 2nd-class children survived. But this cannot be the case, since we had set higher support and confidence thresholds previously. To investigate this, we run the code below to find rules whose rhs is "Survived=Yes" and whose lhs contains only "Class=1st", "Class=2nd", "Class=3rd", "Age=Child" and "Age=Adult" and no other items (default="none"). We use lower thresholds for both support and confidence than before to find all rules for children of different classes.
> inspect(rules.sorted)
lhs rhs support confidence lift
[1] {Class=2nd,Age=Child} => {Survived=Yes} 0.010904134 1.0000000 3.0956399
[2] {Class=1st,Age=Child} => {Survived=Yes} 0.002726034 1.0000000 3.0956399
[3] {Class=1st,Age=Adult} => {Survived=Yes} 0.089504771 0.6175549 1.9117275
[4] {Class=2nd,Age=Adult} => {Survived=Yes} 0.042707860 0.3601533 1.1149048
[5] {Class=3rd,Age=Child} => {Survived=Yes} 0.012267151 0.3417722 1.0580035
[6] {Class=3rd,Age=Adult} => {Survived=Yes} 0.068605179 0.2408293 0.7455209
In the above result, the first two rules show that children in the 1st class had the same survival rate as children in the 2nd class, and that all of them survived. The rule for 1st-class children didn't appear before simply because its support was below the threshold specified. Rule 5 presents the sad fact that children of the 3rd class had a low survival rate of 34%, which is comparable with that of 2nd-class adults and much lower than that of 1st-class adults.
Now, let us see some ways of visualizing association rules such as Scatter Plot,
Grouped Matrix, Graph and Parallel Coordinates Plot as in Fig. 6.18, Fig. 6.19, Fig.
6.20 and Fig. 6.21 respectively.
> library(arulesViz)
> plot(rules.all)
> plot(rules.all, method="grouped")
Univariate outlier detection can also be used to find outliers in multivariate data. First, we create a data frame with two independent variables and detect their outliers separately. Then we take as multivariate outliers those data points which are outliers for both variables. In the code below outliers are marked with "+" in red, and the result is displayed as a scatter plot as in Fig. 6.23.
> y <- rnorm(100)
> df <- data.frame(x, y)
> rm(x, y)
> head(df)
x y
1 -3.31539150 0.7619774
2 -0.04765067 -0.6404403
3 0.69720806 0.7645655
4 0.35979073 0.3131930
5 0.18644193 0.1709528
6 0.27493834 -0.8441813
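The per-variable detection step is not shown above; a sketch of the assumed code (x was created earlier with rnorm(100); boxplot.stats() returns the values outside the whiskers as $out):
> a <- which(df$x %in% boxplot.stats(df$x)$out)   # row indices of outliers in x
> b <- which(df$y %in% boxplot.stats(df$y)$out)   # row indices of outliers in y
> outlier.list1 <- intersect(a, b)                # outliers in both variables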
Similarly, we can take as multivariate outliers those data points which are outliers in either of the variables (x or y). This is shown in the scatter plot in Fig. 6.24.
> outlier.list2 <- union(a, b)
> outlier.list2
[1] 1 33 64 74 24 25 49
> plot(df)
> points(df[outlier.list2,], col="blue", pch="x", cex=2)
We show outliers with a biplot of the first two principal components in the
Fig. 6.26. In the below code, prcomp() performs a principal component analysis,
and biplot() plots the data with its first two principal components. In the below
graph, Fig. 6.26, x-axis and y-axis are respectively the first and second principal
components, the arrows show the original columns (variables), and the five outliers
are labelled with their row numbers.
> n <- nrow(iris1)
> labels <- 1:n
Outliers can also be displayed with a pairs plot using the pairs() function, as in Fig. 6.27, in which the outliers are marked with a "+" symbol displayed in red. The package Rlof provides the function lof(), a parallel implementation of the LOF algorithm.
One way to detect outliers is clustering. By grouping data into clusters, those
data not assigned to any clusters are taken as outliers. For example, with density-
based clustering such as DBSCAN, objects are grouped into one cluster if they
are connected to one another by densely populated area. Therefore, objects not
assigned to any clusters are isolated from other objects and are taken as outliers.
The function dbscan() in the package fpc is used for the density based
clustering. There are two main parameters in the function dbscan(). They are the
eps and MinPts. The parameter MinPts stands for the minimum points in a cluster
and the parameter eps defines the reachability distance. Standard values are given
for these parameters.
> library(fpc)
> iris1 <- iris
> iris1$Species <- NULL
> db <- dbscan(iris1, eps = 0.42, MinPts = 5)
> table(db$cluster, iris$Species)
In the result above we can see that there are three clusters, cluster 1, cluster 2
and cluster 3. The cluster 0 corresponds to the outliers in the data (marked by
black circles in the plot of Fig. 6.28).
The clustering results can also be plotted in the scatter plot as in Fig. 6.29.
This can be done using the function plot() in the fpc package. Here also the black
circles denote the outliers.
> plot(db, iris1[c(1,2)])
We can also detect outliers with the k-means algorithm. With k-means, the
data are partitioned into k groups by assigning them to the closest cluster centres.
After that, we can calculate the distance (or dissimilarity) between each object
and its cluster centre, and pick those with largest distances as outliers. An
example of outlier detection with k-means from the iris data is given below. In the
graph of Fig. 6.30, the cluster centres are labelled with “*” and outliers with “+”.
> iris2 <- iris[,1:4]
> km <- kmeans(iris2, centers=3)
> centers <- km$centers[km$cluster, ]
> dist <- sqrt(rowSums((iris2 - centers)^2))
> outliers <- order(dist, decreasing=T)[1:5]
> print(outliers)
[1] 99 58 94 61 119
> print(iris2[outliers,])
A large number of dimensions in a dataset results in lower predictive power of the model; this scenario is often termed the curse of dimensionality. Principal Component Analysis (PCA) is a popular dimensionality reduction technique, used in many applications that deal with high-dimensional data.
Let us use the function apply() on the crimtab dataset column wise to calculate the variance and see how each variable varies. The function apply() returns a vector or array or list of values obtained by applying a function to the margins of an array or matrix; the second argument (2 below) selects columns.
> apply(crimtab,2,var)
142.24 144.78 147.32 149.86 152.4
0.02380952 0.02380952 0.17421603 0.88792102 2.56445993
We can see that the column "165.1" contains the maximum variance (270.58536585). Let us now apply PCA using the function prcomp() on the dataset crimtab.
> PCA <- prcomp(crimtab)
> PCA
Standard deviations (1, .., p=22):
[1] 30.07962021 14.61901911 5.45438277 4.65250574 3.21408168 2.77322835
2.30250353
[8] 1.92678188 1.40986049 1.24320894 1.02967875 0.72502776 0.50683548
0.47841947
[15] 0.29167315 0.26636232 0.22462458 0.12793888 0.12483426 0.06548509
0.00000000
[22] 0.00000000
...
...
From the above code, the resultant components of the PCA object are the standard deviations and the rotation. From the standard deviations we can observe that the first principal component explains most of the variation, followed by the other components. The proportion of each variable along each principal component is given by the rotation matrix. Let us plot all the principal components and see how the variance is accounted for by each component in Fig. 6.31.
> par(mar = rep(2, 4))
> plot(PCA)
Clearly the first principal component accounts for the maximum information. The results of PCA can be represented as a biplot graph, which shows the proportions of each variable along the first two principal components as in Fig. 6.32. The first two lines of the below code change the direction of the biplot; if we do not include them, the plot will be a mirror image of the graph below.
> PCA$rotation=-PCA$rotation
> PCA$x=-PCA$x
> biplot(PCA, scale = 0.2)
Fig. 6.32 is known as a biplot. In it, we can see the two principal components (PC1 and PC2) of the crimtab dataset plotted in the graph. The arrows represent the loading vectors, which specify how the feature space varies along the principal component vectors. From the plot, we can see that the first principal component vector, PC1, places more or less equal weight on three features: 165.1, 167.64 and 170.18, which means that these three features are more correlated with one another. The second principal component, PC2, places more weight on 160.02 and 162.56 than on the other features.
> library(caret)
> library(corrplot)
> library(plyr)
> dat <- read.csv("Sample.csv")
> set.seed(227)
# Remove variables having high percentage of missing values
> dat1 <- dat[, colMeans(is.na(dat)) <= .5]
> dim(dat1)
[1] 19622 93
> dim(dat)
[1] 19622 160
> nzv <- nearZeroVar(dat1)
# Remove Zero and Near Zero-Variance Predictors
> dat2 <- dat1[, -nzv]
> dim(dat2)
[1] 19622 59
> numericData <- dat2[sapply(dat2, is.numeric)]
> descrCor <- cor(numericData)   # correlation matrix of the numeric predictors
> summary(descrCor[upper.tri(descrCor)])
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.992008 -0.101969 0.001729 0.001405 0.084718 0.980924
> corrplot(descrCor, order = "FPC", method = "color", type = "lower",
tl.cex = 0.7, tl.col = rgb(0, 0, 0))
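The data frame dat3 used by the random forest below is not constructed above; a sketch of an assumed intermediate step, removing highly correlated predictors with caret's findCorrelation() (the 0.75 cutoff and the column binding are illustrative):
> highCorr <- findCorrelation(descrCor, cutoff = 0.75)   # indices of highly correlated columns
> dat3 <- cbind(classe = dat2$classe, numericData[, -highCorr])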
The random forest algorithm provides two measures of variable importance. The first one, Gini gain, is produced by the variable, averaged over all trees. The second one, permutation importance, is the mean decrease in classification accuracy after permuting the variable, averaged over all trees. We sort the permutation importance scores in descending order and select the top k variables. The below code performs feature selection with a random forest.
> library(randomForest)
> rf <- randomForest(classe ~ ., data = dat3, importance = TRUE, ntree = 100)
# Finding importance of variables
> imp = importance(rf, type=1)
> imp <- data.frame(predictors=rownames(imp),imp)
# Sorting the variables in descending order of their MeanDecreaseAccuracy
> imp.sort <- arrange(imp,desc(MeanDecreaseAccuracy))
> imp.sort$predictors <-
factor(imp.sort$predictors,levels=imp.sort$predictors)
> imp.20<- imp.sort[1:20,]
# Printing top 20 variables with high MeanDecreaseAccuracy
> print(imp.20)
predictors MeanDecreaseAccuracy
1 X 36.878224
2 raw_timestamp_part_1 19.939217
3 cvtd_timestamp 19.936367
4 pitch_belt 14.474235
5 roll_dumbbell 12.502391
6 gyros_belt_z 12.429689
7 num_window 11.491461
8 total_accel_dumbbell 11.193014
9 gyros_arm_y 10.509349
10 magnet_forearm_z 10.353922
11 gyros_dumbbell_x 10.245442
12 magnet_belt_z 10.078787
13 pitch_forearm 10.069103
14 roll_arm 10.049374
15 yaw_arm 9.959173
16 gyros_dumbbell_y 9.770771
17 gyros_belt_x 9.602383
18 magnet_forearm_y 9.407758
19 user_name 9.304626
20 gyros_forearm_x 8.954952
> varImpPlot(rf, type=1)
# Retaining only top 20 variables with high MeanDecreaseAccuracy in dat4
> dat4 = cbind(classe = dat3$classe, dat3[,c(imp.20$predictors)])
HIGHLIGHTS
The R packages that are related to data mining are stats, cluster, fpc,
sna, e1071, cba, biclust, clues, kohonen, rpart, party, randomForest, ada,
caret, arules, eclat, arulesViz, DMwR, dprep, Rlof, plyr, corrplot, RWeka,
gausspred, optimsimplex, CCMtools, FactoMineR and nnet.
The R packages and functions for clustering are stats - kmeans(), cluster - pam(), fpc - pamk(), cluster - agnes(), stats - hclust(), cluster - daisy(), fpc - dbscan(), sna - kcores(), e1071 - cmeans(), cba - rockCluster(), biclust - biclust(), clues - clues(), kohonen - som(), cba - proximus() and cluster - clara().
The functions kmeans(), table(), plot() and points() are used for getting and plotting the results of k-means clustering.
R has the pam() and pamk() functions of the cluster package to do the
k-medoids clustering.
The silhouette width decides the number of clusters in k-medoids
clustering.
Hierarchical clustering in R can be done using the function hclust() in the stats package.
The function dbscan() in the package fpc is used for the density based
clustering.
The R packages and functions for classification are e1071 - svm(), RWeka - IBk(), party - ctree(), rpart - rpart(), party - cforest(), randomForest - randomForest(), e1071 - naiveBayes(), ada - ada() and caret - train().
The function ctree() in the package party and the function rpart() in the
package rpart can be used to build the decision tree for the given data.
We can do the prediction using the function predict() for the given new
data.
The function randomForest() in the package randomForest can be used
to classify the given data.
The importance of variables in a dataset can be obtained with the functions
importance() and varImpPlot().
The built random forest is tested on test data, and the result is checked
with the functions table() and margin().
CASE STUDIES
OBJECTIVES
7.1. Text Mining
Text mining is the process of deriving knowledge and insights from textual data. High-quality information is typically derived by devising patterns and trends through pattern learning. Text mining involves preprocessing and formatting the input text, finding patterns within the preprocessed data, and finally evaluating the results. The tasks of text mining are categorization of text, clustering of text, concept/entity extraction, sentiment analysis, production of granular taxonomies, entity relation modelling and document summarization.
This case study on text mining starts with the twitter feeds from the dataset "GameReview.csv". The extracted text is transformed to build a term-document matrix, from which frequent words and associations are found. Important words in a document can then be presented as a word cloud. The packages used for text mining are "tm" and "wordcloud".
> library(tm)
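The corpus construction and preprocessing steps are not preserved here; a sketch under common tm conventions (the text column name is hypothetical) is:
> reviews <- read.csv("GameReview.csv", stringsAsFactors = FALSE)
> corpus <- Corpus(VectorSource(reviews$text))      # "text" column assumed
> corpus <- tm_map(corpus, content_transformer(tolower))
> corpus <- tm_map(corpus, removePunctuation)
> corpus <- tm_map(corpus, removeWords, stopwords("english"))
> corpusCopy <- corpus                              # kept for stem completion
> corpus <- tm_map(corpus, stemDocument)
> tdm <- TermDocumentMatrix(corpus)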
From the listings of the corpus before stemming, after stemming and after stem completion we can see the words being changed. For example, in line 4 the word was "addictive" before stemming; it became "addict" after stemming and then "addicted" after stem completion.
As we can see from the result, the term-document matrix is composed of 2117 terms and 1000 documents. It is very sparse, with 99% of the entries being zero. We then have a look at the terms starting with "g" in the tweets numbered 201 to 210.
> idx = grep(glob2rx("g*"), dimnames(tdm)$Terms)
> inspect(tdm[idx,201:210])
<<TermDocumentMatrix (terms: 81, documents: 10)>>
Many data mining tasks, for example clustering, classification and association rule mining, can be done based on the above matrix. When there are too many terms, the size of a term-document matrix can be reduced by selecting terms that appear in a minimum number of documents.
We will now have a look at the popular words and the association between
words from the 1000 tweets.
> findFreqTerms(tdm, lowfreq=80)
[1] "addicted" "love" "game" "good" "great" "play" "fun"
[8] "awesome" "get" "time" "like" "just" "update" "app"
[15] "can" "cant"
To show the top frequent words visually, we next make a bar plot of them. From the term-document matrix, we can derive the frequency of terms with rowSums(). Then we select the terms that appear in eighty or more documents and show them in a bar plot using the package ggplot2. In the code below, geom_bar(stat="identity") specifies a bar plot and coord_flip() swaps the x-axis and y-axis. The bar plot clearly shows that the three most frequent words are "game", "play" and "great".
> termFrequency <- rowSums(as.matrix(tdm))
> termFrequency <- subset(termFrequency, termFrequency>=80)
> library(ggplot2)
> df <- data.frame(term=names(termFrequency), freq=termFrequency)
> ggplot(df, aes(x=term, y=freq)) + geom_bar(stat="identity") +
+ xlab("Terms") + ylab("Count") + coord_flip()
Alternatively, the above plot can also be drawn with barplot() as below, where
the argument las sets the direction of x-axis labels to be vertical.
> barplot(termFrequency, las=2)
It is also possible to find the words highly associated with another word using the function findAssocs(). Below is the code to find the terms associated with the words "game" and "play" with correlation no less than 0.20 and 0.25 respectively. The words are ordered by their correlation with the term "game" (or "play").
> findAssocs(tdm, "game", 0.20)
$game
play player dont say new thatd ever
0.29 0.22 0.21 0.21 0.21 0.21 0.20
> findAssocs(tdm, “play”, 0.25)
$play
game potential course dont year will meter
0.29 0.27 0.27 0.26 0.26 0.25 0.25
Next, we present the important words in a word cloud. Words with frequency below twenty are not plotted, as specified by min.freq=20. By setting random.order=F, frequent words are plotted first, which makes them appear in the centre of the cloud. The colours are taken from the "BuGn" palette of RColorBrewer, with the lightest shades dropped; alternatively, gray levels based on frequency (grayLevels below) could be passed as the colors argument.
> library(wordcloud)
> m <- as.matrix(tdm)
> wordFreq <- sort(rowSums(m), decreasing=TRUE)
> pal <- brewer.pal(9, "BuGn")
> pal <- pal[-(1:4)]
> set.seed(375)
> grayLevels <- gray((wordFreq + 10) / (max(wordFreq) + 10))
> wordcloud(words=names(wordFreq), freq=wordFreq, min.freq=20,
+     random.order=F, colors=pal)
The above word cloud clearly shows again that "game", "play" and "great" are the top three words, which validates that the tweets are game reviews. Some other important words are "love", "good" and "fun", which shows that the reviews on the game are very positive.
Next, we cluster the terms with hierarchical clustering. Sparse terms are first removed with removeSparseTerms(), so that the plot of clustering is not crowded with words. Then the distances between terms are calculated with dist() after scaling. After that, the terms are clustered with hclust() and the dendrogram is cut into 4 clusters. The agglomeration method is set to "ward.D", which at each step merges the two clusters whose merger yields the smallest increase in within-cluster variance.
> tdm2 <- removeSparseTerms(tdm, sparse=0.90)
> m2 <- as.matrix(tdm2)
> distMatrix <- dist(scale(m2))
> fit <- hclust(distMatrix, method="ward.D")
> plot(fit)
> rect.hclust(fit, k=4)
> (groups <- cutree(fit, k=4))
addicted     love     game     good    great     play      fun  awesome      get     time
       1        1        2        3        4        1        4        1        1        1
In the above dendrogram, we can see the topics in the tweets grouped into 4 different clusters. The most frequent word "game" is in the second cluster, the next frequent word "good" is in the third cluster, the words "great" and "fun" fall under the fourth cluster, and the remaining words with low frequency fall under the first cluster.
Next we cluster the tweets using the k-means clustering algorithm. The k-means algorithm takes the values in the matrix as numeric, so we transpose the term-document matrix into a document-term one. The tweets are then clustered with kmeans() with the number of clusters set to eight. After that, we check the popular words in every cluster and also the cluster centres. A fixed random seed is set with set.seed() before running kmeans(), so that the results are reproducible.
> m3 <- t(m2)
> set.seed(123)
> k <- 8
> kmeansResult <- kmeans(m3, k)
> round(kmeansResult$centers, digits=3)
addicted love game good great play fun awesome get time
1 0.168 0.107 1.267 0.137 1.313 0.069 0.176 0.130 0.145 0.061
2 0.227 0.394 3.652 0.409 0.348 0.803 0.273 0.258 0.409 0.258
3 0.038 0.017 1.174 0.547 0.000 0.127 0.068 0.127 0.081 0.106
4 0.212 0.061 1.091 0.333 0.030 0.606 0.212 0.030 2.242 0.788
5 0.183 0.113 1.507 0.113 0.183 0.887 1.085 0.197 0.197 0.254
6 1.205 0.154 0.397 0.077 0.090 0.154 0.423 0.179 0.026 0.064
7 0.000 0.118 0.000 0.125 0.104 0.146 0.212 0.090 0.066 0.097
8 0.237 1.381 1.216 0.031 0.031 0.216 0.062 0.175 0.134 0.093
To make it easy to find what the clusters are about, we then check the top three
words in every cluster.
> for (i in 1:k) {
+     cat(paste("cluster ", i, ": ", sep=""))
+     s <- sort(kmeansResult$centers[i,], decreasing=T)
+     cat(names(s)[1:3], "\n")
+ }
From the above top words and centres of clusters, we can see that the clusters are of different topics. In every cluster except cluster 7, the word "game" is a part of it, and each of these clusters talks about the game from a different angle.
The next case study builds a credit scoring model. The dataset has many attributes that define the credibility of the customers seeking several types of loan. The values of these attributes can have outliers that do not fit into the regular range of the data, so the outliers must be removed before the dataset is used for further modelling. The outlier detection for quantitative features is done using the function levels(). For numeric features the box plot technique is used for outlier detection, and this is implemented using the daisy() function of the cluster package; before this, the numeric data has to be normalized into the domain [0, 1]. The agglomerative hierarchical clustering algorithm is used for outlier ranking, done with the outliers.ranking() function of the DMwR package. After ranking the outlier data, the observations that are out of range are disregarded and the remaining outliers are replaced with null values.
Inconsistencies in the data, such as an unbalanced dataset, have to be resolved before building the classification model. Many real-world datasets have this problem and need to be rectified for better results. Before this step, the sample dataset is split into training and test datasets in the ratio 4:1 (i.e. the training dataset gets 80% of the data and the test dataset gets the remaining 20%). The balancing step is then executed on the training dataset using the SMOTE() function of the DMwR package.
Next, using the training dataset, the correlation between the various attributes is checked to see whether any redundant information is represented by two attributes. This is implemented using the plotcorr() function of the ellipse package. The unique features are then ranked, and based on a threshold the most highly ranked features are chosen for model building. For ranking the features, the randomForest() function of the randomForest package is used. The threshold for selecting the number of important features is chosen using the rfcv() function of the same package.
The resultant dataset with the reduced number of features is now ready for use by the classification algorithms. Classification is one of the data analysis methods that predict class labels. It can be done in several ways, and one of the most appropriate for the chosen problem is decision trees. Classification is done in two steps: (i) the class labels of the training dataset are used to build the decision tree model, and (ii) this model is applied on the test dataset to predict its class labels. For the first step the function rpart() of the rpart package is used; the predict() function executes the second step. The resultant prediction is then evaluated against the original class labels of the test dataset to find the accuracy of the model.
Dataset Attribute Types
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17 A18 A19 A20 Def
Q N Q Q N Q Q N Q Q N Q N Q Q N Q N B B B
Q: Quantitative N: Numeric B: Binary
A3: Credit History
(…, 3: Delay in paying off in the past, 4: Credits existing in other banks)
A4: Loan Purpose
(0: new car purchase, 1: used car purchase, 2: furniture or equipment purchase, 3: radio or television
purchase, 4: domestic appliances purchase, 5: repairs, 6: education, 7: vacation, 8: retraining, 9: Business,
10: others)
A5: Credit Amount (in DM)
A6: Bonds / Savings
(1: < 100 DM, 2: >= 100 and < 500 DM, 3: >= 500 and < 1000 DM, 4: >= 1000 DM, 5: no savings /
bonds)
A7: Present Employment Since
(1: unemployed, 2: < 1 year, 3: >= 1 and < 4 years, 4: >= 4 and < 7 years, 5: >= 7 years)
A8: Instalment rate in percentage of disposable income
A9: Personal Status and Sex
(1: Divorced Male, 2: Divorced/Married Female, 3: Male Single, 4: Married Male, 5: Female Single)
A10: Other Debtors / Guarantors
(1: None, 2: Co-applicant, 3: Guarantor)
A11: Present Residence Since (in Years)
A12: Property
(1: Real Estate, 2: Life Insurance, 3: Car or others, 4: No property)
A13: Age in years
A14: Other instalment plans
(1: Bank, 2: Stores, 3: None)
A15: Housing
(1: Rented, 2: Owned, 3: For Free)
Dataset Selection
The German Credit Scoring dataset in numeric format, which is used for the implementation of this model, has the attributes described above.
After selecting and understanding the dataset, it is loaded into the R software using the below code, with the name creditdata.
> creditdata <- read.csv("UCI German Credit Data Numeric.csv",
+     header = TRUE, sep = ",")
> nrow(creditdata)
[1] 1000
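Before pre-processing, the structure of the loaded data can be quickly inspected (a small sketch; the column names follow the attribute table above):

> str(creditdata)        # 21 columns: A1 to A20 and the class label Def
> table(creditdata$Def)  # distribution of the class label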
Data Pre-Processing
1) Outlier Detection: To identify the outliers of the numeric attributes, the
values of the numeric attributes are normalized into the domain range of [0,
1] and they are plotted as box plot to view the outlier values as in Fig. 7.5.
The code and the result for this step are given as below.
> normalization <- function(data, x)
+ { for (j in x)
+   { data[!(is.na(data[,j])), j] =
+       (data[!(is.na(data[,j])), j] - min(data[!(is.na(data[,j])), j])) /
+       (max(data[!(is.na(data[,j])), j]) - min(data[!(is.na(data[,j])), j])) }
+   return(data) }
> c <- c(2,5,8,11,13,16,18)
> normdata <- normalization(creditdata,c)
> boxplot(normdata[,c])
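2) Outlier Ranking: The outliers are ranked using the agglomerative hierarchical clustering based outliers.ranking() function of the DMwR package. The distance matrix passed to it below is not shown in the extracted text; presumably it was computed with the daisy() function of the cluster package, as described earlier, along these lines (a sketch; the exact call is an assumption):

> library(cluster)
> distance <- daisy(normdata[, c], stand = TRUE)  # c holds the numeric columns (an assumption)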
> outlierdata = outliers.ranking(distance, test.data=NULL, method="sizeDiff",
+     clus = list(dist="euclidean", alg = "hclust", meth="average"),
+     power = 1, verb = F)
3) Outliers Removal: The observations which are out of range (based on the
rankings) are removed using the below code. After outlier removal the dataset
creditdata is renamed as creditdata_noout.
> boxplot(outlierdata$prob.outliers[outlierdata$rank.outliers])
> n=quantile(outlierdata$rank.outliers)
> n1=n[1]
> n4=n[4]
> filler=(outlierdata$rank.outliers > n4*1.3)
> creditdata_noout=creditdata[!filler,]
> nrow(creditdata_noout)
[1] 975
4) Imputations Removal: The method used for null value removal is multiple imputation, in which the k-nearest-neighbours algorithm is used for both numeric and quantitative attributes. The numeric features are normalized before calculating the distance between objects. The following code is used for imputation removal; after this step the dataset creditdata_noout is renamed creditdata_noout_noimp.
> require(DMwR)
> creditdata_noout_noimp = knnImputation(creditdata_noout, k = 5, scale = T,
+     meth = "weighAvg", distData = NULL)
> nrow(creditdata_noout_noimp)
[1] 975
There were no null values for the attributes in the dataset we have chosen
and hence the number of records remains unchanged after the above step.
5) Splitting Training and Test Datasets: Before proceeding to the further steps, the dataset has to be split into training and test datasets so that the model can be built using the training dataset. The code for splitting the dataset is listed below.
> library(DMwR)
> split <- sample(nrow(creditdata_noout_noimp),
+     round(nrow(creditdata_noout_noimp)*0.8))
> trainingdata = creditdata_noout_noimp[split, ]
> testdata = creditdata_noout_noimp[-split, ]
6) Balancing Training Dataset: The SMOTE() function handles unbalanced
classification problems and it generates the new smoted dataset that addresses
the unbalanced class problem. It artificially generates observations of minority
classes using the nearest neighbours of this class of elements to balance the
training dataset. The following code is used for balancing the training dataset.
> creditdata_noout_noimp_train = trainingdata
> creditdata_noout_noimp_train$default <-
+     factor(ifelse(creditdata_noout_noimp_train$Def == 1, "def", "nondef"))
> creditdata_noout_noimp_train_smot <-
+     SMOTE(default ~ ., creditdata_noout_noimp_train, k=5, perc.over = 500)
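To verify the effect of balancing, the class distributions before and after can be compared (a quick sketch using the default column created above):

> table(creditdata_noout_noimp_train$default)
> table(creditdata_noout_noimp_train_smot$default)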
The data distribution before and after balancing is shown in Fig. 7.6 and Fig. 7.7. The plots are based on multidimensional scaling, which uses proximities between objects to produce a spatial representation of them; proximities represent the similarity or dissimilarity between data objects. The code used to plot these objects is shown below.
> library(cluster)
> dist1 = daisy(creditdata_noout_noimp_train[,-21], stand=TRUE, metric=c("gower"),
+     type = list(interval=c(2,5,8,11,13,16,18),
+         nominal=c(1,3,4,6,7,9,10,12,14,15,17), binary=c(19,20)))
> dist2 = daisy(creditdata_noout_noimp_train_smot[,-21],
+     stand=TRUE, metric=c("gower"),
+     type = list(interval=c(2,5,8,11,13,16,18),
+         nominal=c(1,3,4,6,7,9,10,12,14,15,17), binary=c(19,20)))
> loc1 = cmdscale(dist1, k=2)
> loc2 = cmdscale(dist2, k=2)
> x1 = loc1[,1]
> y1 = loc1[,2]
> x2 = loc2[,1]
> y2 = loc2[,2]
> plot(x1, y1, type="n")
> text(x1, y1, labels=creditdata_noout_noimp_train[,22],
+     col=as.numeric(creditdata_noout_noimp_train[,22])+4)
> plot(x2, y2, type="n")
> text(x2, y2, labels=creditdata_noout_noimp_train_smot[,22],
+     col=as.numeric(creditdata_noout_noimp_train_smot[,22])+4)
Feature Selection
1) Correlation Analysis: Datasets may contain irrelevant or redundant features which make the model more complicated; removing such redundant features speeds up model building. The function plotcorr() plots a correlation matrix using ellipse-shaped glyphs for each entry, showing the correlation between the features in an easy way, and the plot is coloured for clarity. The following code displays the correlation, checked independently for each data type: numeric and nominal. From the results in Fig. 7.8 and Fig. 7.9, it is observed that there is no significant correlation between any of the features, either numeric or quantitative. Hence, in this step none of the features is removed.
> library(ellipse)
> c = c(2,5,8,11,13,16,18)
> plotcorr(cor(creditdata_noout_noimp_train[,c]), col=cl<-c(7,6,3))
> c = c(1,3,4,6,7,9,10,12,14,15,17)
> plotcorr(cor(creditdata_noout_noimp_train[,c]), col=cl<-c("green","red","blue"))
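The random forest fit that produces the importance values below is not shown in the extracted text. A minimal sketch, assuming the balanced training data and the class label default created earlier (the object name randf matches the varImpPlot() call below, and the seed value is an assumption):

> library(randomForest)
> set.seed(123)  # assumed seed, for reproducibility
> randf <- randomForest(default ~ ., data = creditdata_noout_noimp_train_smot[, -21],
+     importance = TRUE)
> importance(randf, type = 1)  # type = 1 gives the mean decrease in accuracy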
2) Ranking Features: The function importance() displays the feature importance using the "mean decrease accuracy" measure, as in the table below. The measures can be plotted using the function varImpPlot(), as shown in Fig. 7.10.
Feature   Mean Decrease Accuracy
A5        6.238347
A6        4.554283
A7        3.316346
A8        0.596220
A9        1.634721
A10       1.383725
A11       0.541585
A12       2.344433
A13       2.621854
A14       4.629331
A15       0.825801
A16       1.225997
A17       0.635881
A18       0.037408
A19       1.117891
A20       1.388876
> varImpPlot(randf)
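The cross-validation code referred to below is not present in the extracted text; a hedged sketch using rfcv() of the randomForest package (the predictor/label split is an assumption):

> train <- creditdata_noout_noimp_train_smot
> cvresult <- rfcv(trainx = train[, setdiff(names(train), c("Def", "default"))],
+     trainy = train$default, cv.fold = 5)
> with(cvresult, plot(n.var, error.cv, type = "b",
+     xlab = "Number of Features", ylab = "Cross-Validation Error"))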
The result of this code is shown in Fig. 7.11, and it shows that the best number of features is 15. Hence we select the features A1, A2, A3, A5, A6, A7, A9, A10, A12, A13, A14, A16, A19, A20 and Def to build the model.
Building Model
Classification is one of the data analysis forms that predict categorical labels. We use the decision tree model to predict the probability of default. The following code uses the function rpart() to fit a model on the training dataset.
> library(rpart)
> c = c(4, 8, 11, 15, 17, 18, 22)
> trdata = data.frame(creditdata_noout_noimp_train[,-c])
> tree = rpart(trdata$Def ~ ., data=trdata, method="class")
> printcp(tree)
The result of this code is displayed below and in the accompanying table.
Classification tree:
rpart(formula = trdata$Def ~ ., data = trdata, method = "class")
Variables actually used in tree construction:
[1] A1 A12 A13 A2 A3 A5 A6 A9
Root node error: 232/780 = 0.29744
n= 780
The commands to plot the classification tree are given below; the resulting tree is shown in Fig. 7.12.
> plot(tree, uniform=TRUE, main="Classification Tree")
> text(tree, use.n=TRUE, all=TRUE, cex=0.7)
Prediction
The model is tested on the test dataset using the predict() function. The code and the resulting confusion matrix are displayed below.
> predicttest = data.frame(testdata)
> pred = predict(tree, predicttest)
> c = c(21)
> table(predict(tree, testdata, type="class", na.action=na.pass), testdata[, c])

          def nondef
  def      30      5
  nondef    6    154
Evaluation
Common metrics calculated from the confusion matrix are Precision, Accuracy,
True Positive Rate (TP Rate) and False Positive Rate (FP Rate). The calculations
for the same are listed below.
Precision = True Defaults / (True Defaults + False Defaults)
Accuracy = (True Defaults + True Non-defaults) / Total Test Set
TP Rate = True Defaults / Total Defaults
FP Rate = False Defaults / Total Non-defaults
From our resultant data we get the values of the above metrics as derived below.
True Defaults = 30
False Defaults = 6
Total Defaults = 35
True Non-defaults = 154
False Non-defaults = 5
Total Non-defaults = 160
Total Test Set = 195
Precision = 30 / (30 + 6) = 0.833
Accuracy = (30 + 154) / 195 = 0.943
TP Rate = 30 / 35 = 0.857
FP Rate = 6 / 160 = 0.037
These results show that the proposed model is performing with high accuracy
and precision and hence can be applied for credit scoring.
The next case study performs exploratory analysis of the US crime dataset. Its objectives are to aid the deployment of police resources, to perform temporal analysis of the crime data and to capture the trend of the crimes happening.
chron - Creates chronological objects which represent dates and times of day.
ggplot2 - A system for creating graphs, based on the data you provide.
read.csv() - Reads a file in table format and creates a data frame from it.
which() - Gives the TRUE indices of a logical object, allowing for array indices
cut() - Divides the range of x into intervals and codes the values in x according
to which interval they fall into. The leftmost interval corresponds to level one, the
next leftmost to level two, and so on.
labels() - Finds a suitable set of labels from an object for use in printing or
plotting
length() - Get or set the length of vectors (including lists) and factors, and of
any other R object for which a method has been defined.
ifelse() - Returns a value with the same shape as test, filled with elements
selected from either yes or no depending on whether the element of test is
TRUE or FALSE.
aggregate() - Splits the data into subsets, computes summary statistics for each,
and returns the result in a convenient form.
qplot() - The basic plotting function of ggplot2, used to quickly create many
different kinds of plots.
The dataset was analyzed to get details such as the file size, the number of records, and the fields specified and their meaning.
The US crime dataset is loaded into the R tool and the data field organization and its important dimensions are understood. Each record represents one crime incident, and the various crime types are manually analyzed. The dataset is loaded using the read.csv() function. After this, the required packages for this project are installed and loaded using the below commands.
> install.packages("chron")
> library(chron)
> install.packages("Rcpp", dependencies = TRUE)
> library(Rcpp)
> install.packages("ggplot2")
> library(ggplot2)
The pre-processing deals with removing duplicate records, records with missing values and records with incorrect values, formatting the Timestamp field (splitting the date and time parts of the data), binning the time into intervals (4 intervals of 6 hours each) and grouping similar crimes into one crime type. The functions used for these preprocessing steps are subset(), as.POSIXlt(), weekdays(), months(), chron(), cut(), table(), length(), as.character() and ifelse().
Finally, we find and visualize which crime types lead to arrests, the frequency of the different crime types, the hours of the day, days of the week and months of the year in which more crimes happen, and the occurrence of the various crime types during the various hours of the day, days of the week and months of the year. All this exploration and visualization is done using qplot() and ggplot() from the package ggplot2, together with factor() and aggregate().
#Loading data
> crime.data <- read.csv("crime.data.csv")
#Date conversion
> crime.data$date <- as.POSIXlt(crime.data$DATE..OF.OCCURRENCE,
+     format = "%m/%d/%Y %I:%M:%S %p")
> crime.data <- subset(crime.data, !is.na(crime.data$date))
> crime.data$time <- times(format(crime.data$date, "%H:%M:%S"))
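# The time-binning step described above (4 intervals of 6 hours each) is not
# shown in the extracted code; a hedged sketch using cut() on the chron times
# created above (the interval labels are assumptions):
> time.tag <- chron(times = c("00:00:00", "06:00:00", "12:00:00", "18:00:00", "23:59:00"))
> crime.data$time.tag <- cut(crime.data$time, breaks = time.tag,
+     labels = c("00-06", "06-12", "12-18", "18-24"), include.lowest = TRUE)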
#Formatting date
> crime.data$date <- as.POSIXlt(strptime(crime.data$date, format = "%Y-%m-%d"))
> crime.data$day <- weekdays(crime.data$date, abbreviate = TRUE)
> crime.data$month <- months(crime.data$date, abbreviate = TRUE)
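# crime.data.arrest used below is not created in the extracted code; presumably
# it is the subset of incidents that led to an arrest, e.g. (the column name is
# an assumption):
# > crime.data.arrest <- subset(crime.data, ARREST == "Y")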
> table(crime.data.arrest$crime)
The next case study analyzes baseball data on several aspects, such as the variation of salary over a period of time with respect to League, Team etc. The players' career records are analyzed based on their hits and runs, and their batting averages are calculated. The analysis results are then presented in the form of graphs and histograms using the functions available in R. The objective is to identify the trend of baseball players' salaries over the years, to understand the correlation between players' salaries and their performances, to analyze whether the age, country, height and weight of the players have an impact on their performance, and to capture the details of the top-performing baseball players.
k. Filter the records of the players in the years in which they have not had a chance to bat (AB > 0)
c. Exploratory analysis of the baseball team data
a. Visualize the trend of how salaries change over time
b. Find one player's salary, team and other details
c. Find the relation of the player's salary with his height, weight and birth country
d. Find how each player was batting year-wise
e. Visualize the correlation of salary with the player's performance
f. Visualize each player's career record (e.g. total hits and runs) based on their highest rank
g. Visualize the correlation between the player's hits and runs
h. Visualize the batting average of the players in a …
ggplot2 - A system for creating graphs, based on the data you provide.
read.csv() - Reads a file in table format and creates a data frame from it.
merge() - Merges two data frames by common column or row names, or performs
other versions of database join operations.
max() - Returns the (parallel) maxima of the input values.
min() - Returns the (parallel) minima of the input values.
ggplot() - Initializes a ggplot object. It can be used to declare the input data
frame for a graphic and to specify the set of plot aesthetics intended to be
common throughout all subsequent layers unless specifically overridden.
The datasets were analyzed to get details such as the file size, the number of records, and the fields specified and their meaning. The datasets are loaded into the R tool, and the field organization and its important dimensions are manually analyzed. The datasets are loaded using the read.csv() function. After this, the required packages for this project are installed and loaded using the below commands.
> install.packages("data.table")
> library(data.table)
> install.packages("Rcpp", dependencies = TRUE)
> library(Rcpp)
> install.packages("ggplot2")
> library(ggplot2)
The pre-processing involves merging the datasets (to study player-wise batting performance), filtering the records of the players in the years in which they have not had a chance to bat (AB > 0), and merging the three datasets to study player-wise batting performance.
#Loading data
> salaries <- read.csv("salaries.csv", header=TRUE)
> master <- read.csv("master.csv", header=TRUE)
> batting <- read.csv("batting.csv", header=TRUE)
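# The creation of salaries.filtered used below is missing from the extracted
# code; presumably the salaries data frame was first converted for data.table
# syntax and then filtered, e.g. (the filter condition is an assumption):
# > library(data.table)
# > salaries <- as.data.table(salaries)
# > salaries.filtered <- salaries[yearID > 1990]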
> salaries.filtered.sorted = salaries.filtered[order(salary), ]
> summarized.lg = salaries[, list(Average=mean(salary), Maximum=max(salary),
+     minimum=min(salary)), by="lgID"]
> summarized.year.lg = salaries[, list(Average=mean(salary), Maximum=max(salary),
+     minimum=min(salary)), by=c("yearID","lgID")]
> summarized.year.play = salaries[, list(Average=mean(salary), Maximum=max(salary),
+     minimum=min(salary)), by=c("yearID","playerID")]
> summarized.year.team = salaries[, list(Average=mean(salary), Maximum=max(salary),
+     minimum=min(salary)), by=c("yearID","teamID")]
> batting = as.data.table(batting)
> salaries.reduced.round <- round(as.numeric(y, 1))
> summarized.year.lg$average.reduce.round <- paste(summarized.year.lg$Average/10000)
> averages.reduce.round <- round(as.numeric(summarized.year.lg$average.reduce.round, 1))
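The plotting commands for the figures below are not present in the extracted text; a minimal sketch of the year-wise, league-wise average salary trend (Fig. 7.24), assuming the summarized.year.lg table built above:

> ggplot(summarized.year.lg, aes(x = yearID, y = Average, colour = lgID)) +
+     geom_line() + xlab("Year") + ylab("Average Salary")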
Figure 7.22 Ggplot of Trend of Salaries of Year > 1990 for American League
Figure 7.24 Ggplot of Year wise and League wise Average Salary
Figure 7.25 Ggplot of Year wise and Team wise Average Salary
Figure 7.26 Ggplot of Correlation between the Players Hits and Runs
The next case study applies social network analysis to the twitter text data used in the Text Mining section. The terms in this data can be considered as people and the tweets as groups (as in LinkedIn groups); the term-document matrix is then a representation of the group memberships of people.
In this case study, we first build a network of terms based on their co-occurrence in the same tweets, and then build a network of tweets based on the terms shared by them. After this, we also build a two-mode network composed of both terms and tweets.
As a first step we build the term-document matrix as in the Text Mining section using the below code.
> library(tm)
#Reading the input file
> dat <- read.csv("GameReview.csv", stringsAsFactors = FALSE)
#Converting it to a corpus
> corp <- Corpus(VectorSource(dat$text))
#Preprocessing - removing stop words, punctuation, whitespace etc.
> corp <- tm_map(corp, content_transformer(tolower))
> corp <- tm_map(corp, removePunctuation)
> corp <- tm_map(corp, stripWhitespace)
> corp <- tm_map(corp, removeWords, stopwords("english"))
#Converting into a term-document matrix
> tdm <- TermDocumentMatrix(corp)
#Removing sparse terms
> tdm2 <- removeSparseTerms(tdm, sparse=0.96)
#Converting into a matrix
> termDocMatrix <- as.matrix(tdm2)
> termDocMatrix <- termDocMatrix[, 1:150]
Next, we inspect part of the matrix, build a graph showing the relationships between the frequent terms, and make the graph more readable by setting colours, font sizes and the transparency of vertices and edges.
> termDocMatrix[5:10,1:20]
Docs
Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
good 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0
great 0 1 0 0 1 0 0 0 1 0 2 0 0 0 1 0 0 0 0 0
money 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
play 0 1 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0
fun 0 0 2 0 0 0 0 0 0 0 0 1 0 0 0 2 0 0 0 0
much 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> termDocMatrix[termDocMatrix>=1] <- 1
> termMatrix <- termDocMatrix %*% t(termDocMatrix)
> termMatrix[5:10,5:10]
Terms
Terms good great money play fun much
good 29 6 3 8 4 1
great 6 25 1 5 5 1
money 3 1 9 3 1 2
play 8 5 3 25 10 2
fun 4 5 1 10 35 4
much 1 1 2 2 4 8
In the above code, %*% is the operator for the product of two matrices, and the function t() transposes a matrix. The result, termMatrix, is a term-term adjacency matrix in which the rows and columns represent terms and every entry is the number of co-occurrences of two terms. Next we can build a graph with the function graph.adjacency() from the package igraph.
> library(igraph)
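The construction of the graph g used in the remainder of this section is missing from the extracted text; a sketch consistent with the later code, building a weighted, undirected network of terms from termMatrix (the seed value is an assumption):

> g <- graph.adjacency(termMatrix, weighted = TRUE, mode = "undirected")
> g <- simplify(g)           # remove loops and multiple edges
> V(g)$label <- V(g)$name
> V(g)$degree <- degree(g)
> set.seed(3952)             # assumed seed
> layout1 <- layout.fruchterman.reingold(g)
> plot(g, layout = layout1)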
Next, we set the label size of vertices based on their degrees, to make
important terms stand out. Similarly, we also set the width and transparency of
edges based on their weights. This is useful in applications where graphs are crowded
with many vertices and edges. The vertices and edges in the below code are
accessed with V() and E(). The function rgb(red, green, blue, alpha) defines the
colors. With the same layout, we plot the graph again as in Fig. 7.32.
> V(g)$label.cex <- 2.2 * V(g)$degree / max(V(g)$degree)+ .2
> V(g)$label.color <- rgb(0, 0, .2, .8)
Figure 7.32 Network of Terms with Label Size of Vertices Based on their
Degrees
Next, we try to detect communities in the graph, called cohesive blocks, and then plot the network of terms based on the cohesive blocks as in Fig. 7.33.
> blocks <- cohesive.blocks(g)
> blocks
Cohesive block structure:
B-1        c 15, n 31
'- B-2     c 16, n 30   oooooooooo .ooooooooo oooooooooo o
   '- B-3  c 17, n 28   ooo.oooooo .ooooooooo oooooooooo .
> plot(blocks, g, vertex.size=.3, vertex.label.cex=1.5, edge.color=rgb(.4,.4,0,.3))
Next we plot the network of terms based on maximal cliques as in Fig. 7.34.
> cl <- maximal.cliques(g)
> length(cl)
[1] 286
> colbar <- rainbow(length(cl) + 1)
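The commands that colour and plot the maximal cliques (Fig. 7.34) are not in the extracted text; presumably they mirror the largest-cliques code below:

> for (i in 1:length(cl)) {
+     V(g)[cl[[i]]]$color <- colbar[i+1]
+ }
> plot(g, mark.groups=cl, vertex.size=.3, vertex.label.cex=1.5,
+     edge.color=rgb(.4,.4,0,.3))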
Next we plot the network of terms based on the largest cliques as in Fig. 7.35.
> cl <- largest.cliques(g)
> length(cl)
[1] 41
> colbar <- rainbow(length(cl) + 1)
> for (i in 1:length(cl)) {
+     V(g)[cl[[i]]]$color <- colbar[i+1]
+ }
> plot(g, mark.groups=cl, vertex.size=.3, vertex.label.cex=1.5,
+     edge.color=rgb(.4,.4,0,.3))
We now turn to a network of tweets, linked by the terms they share. Because the two most frequent terms, "game" and "play", appear in a large number of tweets, most tweets are connected with others and the graph of tweets is very crowded. To simplify the graph and find relationships between tweets beyond these two keywords, we remove the two words before building the graph.
> idx <- which(dimnames(termDocMatrix)$Terms %in% c("game", "play"))
> M <- termDocMatrix[-idx, ]
> tweetMatrix <- t(M) %*% M
> g <- graph.adjacency(tweetMatrix, weighted=T, mode = "undirected")
> V(g)$degree <- degree(g)
> g <- simplify(g)
> V(g)$label <- V(g)$name
> V(g)$label.cex <- 1
> V(g)$label.color <- rgb(.4, 0, 0, .7)
> V(g)$size <- 2
> V(g)$frame.color <- NA
Next, we have a look at the distribution of degree of vertices and the result is
shown in the below bar graph as in Fig. 7.36. We can see that there are around 20
isolated vertices (with a degree of zero). Note that most of them are caused by the
removal of the two keywords, “game” and “play”.
> barplot(table(V(g)$degree))
With the code below, the vertex colours are set based on degree, and labels
of isolated vertices are set to tweet IDs and the first 10 characters of every tweet.
The labels of other vertices are set to tweet IDs only, so that the graph will not
be overcrowded with labels. The colour and width of edges are set based on their
weights. The produced graph is shown in Fig. 7.37.
> idx <- V(g)$degree == 0
> V(g)$label.color[idx] <- rgb(0, 0, .3, .7)
> V(g)$label[idx] <- paste(V(g)$name[idx], substr(dat$text[idx], 1, 10), sep=": ")
> egam <- (log(E(g)$weight)+.2) / max(log(E(g)$weight)+.2)
> E(g)$color <- rgb(.5, .5, 0, egam)
> E(g)$width <- egam
> set.seed(3152)
> layout2 <- layout.fruchterman.reingold(g)
> plot(g, layout=layout2)
The vertices in the crescent are isolated from all the others; next, they are removed from the graph with the function delete.vertices() and the graph is re-plotted as in Fig. 7.38.
> g2 <- delete.vertices(g, V(g)[degree(g)==0])
> plot(g2, layout=layout.fruchterman.reingold)
Similarly, it is also possible to remove edges with low weights to simplify the graph, using the function delete.edges(). After removing edges, some vertices become isolated and they are also removed. The produced graph is as in Fig. 7.39.
> g3 <- delete.edges(g, E(g)[E(g)$weight <= 1])
> g3 <- delete.vertices(g3, V(g3)[degree(g3) == 0])
> plot(g3, layout=layout.fruchterman.reingold)
In Fig. 7.39, there are some groups (or cliques) of tweets. A few of them are listed below. The group of tweets (25, 35, 112) is about the word "awesome", the group (31, 47, 122) is about the word "good" and the group (57, 58, 67, 75, 103, 146) is about the word "addictive".
> dat$text[c(25,35,112)]
[1] " Awesome! A lot of fun!!"
[2] " Awesome Mysterious Game!! Fun game to play @ night before bed to wind
down!!"
[3] " Miss Awesome fun"
> dat$text[c(31,47,122)]
[1] " Error in patching Every time I try to log it it says error in patching but overall
good game."
[2] " Good For spending time while waiting for an appointment"
[3] " Good It is a good game to play while wasting time"
> dat$text[c(57,58,67,75,103,146)]
[1] " Addictive fun Perfect fun"
[2] " Wonderful Is a great game and addictive. Brilliant"
[3] " Addictive Great looking, fun game"
[4] " ADDICTIVE!!!! This is a fun and easy to play and lose!!"
[5] " Very fun Addictive game, similar to a Tomogotchi. You will want to check in on your
village and clan. Building, building, building and re-arranging you village. Some battles
too. Ver well constructed."
[6] " JD Very addictive fun gaming"
Fig. 7.40 shows that most tweets are around two centres, "game" and "play". Next, let's have a look at which tweets are about "game". In the code below, the function nei() returns all vertices which are neighbours of the vertex "game".
> V(g)[nei("game")]
+ 89/181 vertices, named:
 [1]   2   3   4   5   6   8   9  11  13  15  17  18  20  21  27  28  29
[18]  30  31  34  35  37  38  39  40  42  44  51  53  54  55  58  59  60
[35]  61  63  64  66  67  71  72  73  76  80  81  82  83  85  87  90  91
[52]  92  93  94  95  97  98  99 101 102 103 105 107 108 109 110 111 115
[69] 116 117 118 119 120 122 125 127 128 129 131 134 136 138 140 141 143
[86] 144 145 148 149
We can also have a further look at which tweets contain both terms, "game" and "play".
> (rdmVertices <- V(g)[nei("game") & nei("play")])
+ 20/181 vertices, named:
 [1]   2   6  34  35  37  42  44  59  61  66  73  82  92 107 122 131 134
[18] 143 144 149
> dat$text[as.numeric(rdmVertices$label)]
[1] " Great game I love this game. Unlike other games they constantly give you money
to play. They are always given you a bone. Keep up the good work."
[2] " Meh Used to be good until World Cup upgrade.\nNow it lags all the time, making
it difficult to play.\nMaybe if you spent more time getting the game to actually work and
less time trying to squeeze advertising into every nook of game play, we could have a
winner."
...
Next, we remove "game" and "play" to show the relationships between tweets through other words. Isolated vertices are also deleted from the graph.
> idx <- which(V(g)$name %in% c("game", "play"))
> g2 <- delete.vertices(g, V(g)[idx])  # vertex indexing is one-based in current igraph
> g2 <- delete.vertices(g2, V(g2)[degree(g2)==0])
> set.seed(209)
> plot(g2, layout=layout.fruchterman.reingold)
From Fig. 7.41, we can clearly see groups of tweets and their keywords, such as
“addictive”, “good” and “fun”.
HIGHLIGHTS
Text mining involves preprocessing the input text, deriving patterns within
the preprocessed data, and finally evaluating the output.
A word cloud is used to present important words in documents.
A corpus is a collection of text documents.
The most accurate and widely used credit scoring measure is the Probability
of Default (PD).
The function importance() displays the features importance using the
“mean decrease accuracy” measure.
Common metrics calculated from the confusion matrix are Precision,
Accuracy, True Positive Rate (TP Rate) and False Positive Rate (FP
Rate).
The US crime dataset is used for the EDA of crimes in the US.
The pre-processing done in this case study comprises removing duplicate
records, records with missing values and records with incorrect values,
formatting the Timestamp field, binning the time into intervals (4 intervals
of 6 hours each) and grouping similar crimes.
The objective of the EDA on the baseball data is to identify the trend of
baseball players' salaries over the years, to understand the correlation
between players' salaries and their performances, and to analyze whether the
age, country, height and weight of the players have an impact on their
performance.
Social Network Analysis (SNA) is the process of investigating social
structures through the use of networks and graph theory.
The package used for Text Mining is tm and the package for Social Network
Analysis is igraph.
The function nei() returns all vertices which are neighbours of the given
vertex.
GLOSSARY
Base Environment The functions and the variables from the R’s base
package are stored in the base environment.
Basic Data Types The basic data types in R are Numeric, Integer,
Complex, Logical and Character.
Local Outlier Factor The local outlier factor (LOF) is an algorithm for
finding anomalous data points by measuring the
local deviation of a given data point with respect
to its neighbours.
arules - Package for Mining Association Rules and Frequent Itemsets. Pages: 113, 131, 151
arulesViz - Package for Visualizing Association Rules and Frequent Itemsets. Pages: 113, 134, 151
DMwR - Package that provides Functions and Data for "Data Mining with R". Pages: 113, 139, 151, 160, 163
plyr - Package with Tools for Splitting, Applying and Combining Data. Pages: 58, 60, 113, 147, 151
randomForest - Package for Random Forests for Classification and Regression. Pages: 130, 149, 151, 160, 165
aggregate() (stats) - Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form. Pages: 63, 70, 171, 172, 174
colMeans() (base) - Forms row and column sums and means for numeric arrays. Pages: 35, 36, 147
colnames() (base) - Retrieves or sets the row or column names of a matrix-like object. Pages: 24, 31, 59, 73, 88
colSums() (base) - Forms row and column sums and means for numeric arrays. Pages: 35, 36, 133
delete.vertices() (igraph) - Deletes vertices from a graph. Pages: 195, 196, 198
read.csv() (utils) - Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file. Pages: 52, 54, 59, 162, 170, 171, 172, 180, 182
rpart() (rpart) - Recursive Partitioning and Regression Trees; fits an rpart model. Pages: 122, 125, 127, 130, 151, 161, 167
… representations and objects of classes "POSIXlt" and "POSIXct" representing calendar dates and times.
system.file() (base) - Finds the full file names of files in packages etc. Pages: 51, 53, 55, 70
WEBSITES
1. https://www.tutorialspoint.com/r/
2. http://www.r-tutor.com/
3. https://cran.r-project.org/manuals.html
4. https://www.r-bloggers.com/
5. http://www.rdatamining.com/examples/