
Contents

1 Introduction
1.1 Why R?
1.2 Installation of Software
1.2.1 Installation of R
1.2.2 Installation of RStudio Desktop
1.3 R packages and libraries
1.4 Getting help in R
1.5 Things to keep in mind
1.6 The "working directory" and listing files
1.7 Saving and loading R workspaces

2 Data Structures
2.1 Introduction
2.2 Vectors
2.2.1 Atomic Vectors
2.2.2 Lists
2.2.3 Names
2.3 Factors
2.4 Matrices & Arrays
2.5 Data Frames
2.5.1 Creation
2.5.2 Testing & Coercion
2.5.3 Combining data frames

3 Data Input into R
3.1 Introduction

4 Data Management
4.1 Arithmetic operators
4.2 Mathematical functions
4.3 Statistical functions
4.3.1 Standardizing Data
4.3.2 Probability functions
4.3.3 Character functions
4.3.4 Other useful functions
4.4 Applying functions to data objects

5 Subsetting
5.1 Atomic vectors
5.2 Lists
5.3 Matrices and Arrays
5.4 Data Frames
5.5 Subsetting Operators
5.5.1 Simplifying vs. preserving subsetting
5.5.2 $
5.5.3 Missing/out of bounds indices
5.6 Subsetting and assignment

6 Plots in R
6.1 Introduction
6.1.1 Terminology related to ggplot2
6.2 Dataset
6.3 Histogram
6.4 Bar plot
6.5 Scatter plot
6.6 Line plot
6.7 Box plot
6.7.1 Interpretation of Box plot
6.8 Manipulating Strings in R
6.8.1 paste() function
6.8.2 sprintf()
6.9 Regular expressions in R
6.9.1 Examples

7 Probability Distributions
7.1 Normal Distribution
7.2 Binomial Distribution

8 Hypothesis Tests
8.1 T Test (Student's T-Test)
8.1.1 What is a T Test?
8.1.2 The T Score
8.1.3 T Values and P Values
8.1.4 Calculating the T Test
8.1.5 What is a Paired T Test (Paired Samples T Test)?
8.2 Analysis of variance (ANOVA)
8.2.1 ANOVA test hypotheses
8.2.2 Assumptions of ANOVA test
8.2.3 Steps in ANOVA
8.2.4 One-way ANOVA test in R

9 Correlation
9.1 Introduction
9.2 Pearson's correlation coefficient (r)
9.3 Spearman's rank correlation coefficient

10 Classification
10.1 Introduction
10.1.1 Classification vs Prediction
10.1.2 Steps in classification
10.2 Linear regression
10.3 Interpolation and Extrapolation
10.4 Logistic regression
10.4.1 Examples of binary classification problems
10.4.2 Why not linear regression?
10.4.3 The Logistic Equation
10.5 Classification by Decision Tree Induction
10.5.1 R code for Decision tree induction based classifier
10.6 Random forests
10.6.1 R code for Random forests based classifier
10.7 Support Vector Machines
10.7.1 Advantages and Disadvantages

11 Clustering
11.1 Introduction
11.1.1 Data structures used by Clustering algorithms
11.2 Dissimilarity / Distance between objects
11.2.1 Clustering techniques
11.3 Partitioning methods
11.3.1 Advantages and Disadvantages
11.4 Hierarchical Methods
11.4.1 Agglomerative Hierarchical Clustering
11.4.2 Divisive Hierarchical Clustering
11.5 Clustering in R
11.6 Avoiding non-existent clusters



Chapter 1

Introduction

1.1 Why R?
• Vast capabilities, wide range of statistical and graphical techniques

• Very popular in academia, growing popularity in business:


http://r4stats.com/articles/popularity/

• Available freely under the GNU public license

• Excellent community support: mailing list, blogs, tutorials

• Easy to extend by writing new functions

• Very little programming language knowledge necessary

1.2 Installation of Software


1.2.1 Installation of R
• Go to https://cran.r-project.org

• Choose the link relevant to your Operating System


• Click on the link base


• Click on the link Download R3.6.1 for Windows


1.2.2 Installation of RStudio Desktop


• Go to https://rstudio.com/products/rstudio/download

• Click on the link DOWNLOAD in the column RStudio Desktop


• Click on the link in the download column and the row corresponding to your OS

1.3 R packages and libraries


There are thousands of R packages that extend R’s capabilities. To get started, check out
http://cran.r-project.org/web/views/.

• To view available packages:


library()

• To see what packages are loaded:


search()

• To load a package:
library("packageName")

• Install new package:


install.packages("packageName")

1.4 Getting help in R


• Start html help, search / browse using web browser

– at the R console: help.start()


– or use the help menu from your GUI

• Look up the documentation for a function


help(topicName)
?topicName

• Look up documentation for a package


help(package="packageName")

• Search documentation from R (not always the best way... google often works better)
help.search("topicName")

1.5 Things to keep in mind


• Case sensitive

• Comments can be put almost anywhere, starting with a hash mark (’#’); everything to the
end of the line is a comment

• The command prompt “>” indicates that R is ready to receive commands

• If a command is not complete at the end of a line, R will give a different prompt, ‘+’ by
default


• Parentheses must always match (first thing to check if you get an error)

• R does not care about spaces between commands or arguments

• Names should start with a letter and should not contain spaces

• Can use “.” in object names (e.g., “my.data”)

1.6 The “working directory” and listing files


R knows the directory it was started in, and refers to this as the “working directory”

> getwd() # get the current working directory


> setwd("dataSets") # set wd to the dataSets folder. Use "/" as the path separator on all
                    # platforms (on Windows "\\" also works; a single "\" is an escape character)
> setwd("..") # set wd to the enclosing folder ("up")

It can be convenient to list files in a directory without leaving R

> list.files("dataSets") # list files in the dataSets folder


> list.files("dataSets", pattern = ".csv") # restrict to .csv files

1.7 Saving and loading R workspaces

> ls() # list objects in our workspace


> save.image(file="myWorkspace.RData") # save workspace
> load("myWorkspace.RData") # load myWorkspace.RData
> rm(list=ls()) # remove all objects from our workspace
> rm(x) # delete the variable x from workspace



Chapter 2

Data Structures

2.1 Introduction
R’s base data structures can be organised by their dimensionality (1d, 2d, or nd) and whether
they’re homogeneous (all contents must be of the same type) or heterogeneous (the contents can
be of different types). This gives rise to the five data types most often used in data analysis.
str() is short for "structure"; it gives a compact, human-readable description of any R data
structure.

        Homogeneous       Heterogeneous
1d      Atomic Vector     List
2d      Matrix            Data Frame
nd      Array

2.2 Vectors
The basic data structure in R is the vector. Vectors come in two flavours: atomic vectors and lists.
They have three common properties:

• Type, typeof(), what it is.

• Length, length(), how many elements it contains.

• Attributes, attributes(), additional arbitrary metadata.

NOTE: Use is.atomic(x) or is.list(x) to test if an object is actually a vector.

2.2.1 Atomic Vectors


There are four common types of atomic vectors: logical, integer, double (often called numeric),
and character. There are two rare types : complex and raw. Atomic vectors are usually created
with c(), short for combine:

var1 <- c(1, 2.5, 4.5)


# With the L suffix, you get an integer rather than a double
var2 <- c(1L, 6L, 10L)
# Use TRUE and FALSE (or T and F) to create logical vectors
var3 <- c(TRUE, FALSE, T, F)
var4 <- c("these are", "some strings")

Missing values are specified with NA, which is a logical vector of length 1

Types and Tests


Given a vector, you can determine its type with typeof(), or check if it’s a specific type with an
“is” function: is.character(), is.double(), is.integer(), is.logical(), or, more generally, is.atomic().


var1 <- c(1L, 6L, 10L)


typeof(var1)
is.integer(var1)
is.atomic(var1)
var2 <- c(1, 2.5, 4.5)
typeof(var2)
is.double(var2)
is.atomic(var2)

All elements of an atomic vector must be the same type, so when you attempt to combine
different types they will be coerced to the most flexible type. Types from least to most
flexible are: logical, integer, double, and character.

2.2.2 Lists
Lists are different from atomic vectors because their elements can be of any type, including lists.
You construct lists by using list() instead of c():

x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))


str(x)

Lists are sometimes called recursive vectors, because a list can contain other lists. This makes
them fundamentally different from atomic vectors. The typeof() a list is list. You can test for a
list with is.list() and coerce to a list with as.list(). You can turn a list into an atomic vector with
unlist().

2.2.3 Names
You can name a vector in three ways:

• When creating it: x <- c(a = 1, b = 2, c = 3)

• By modifying an existing vector in place: x <- 1:3; names(x)<- c("a", "b", "c")

• By creating a modified copy of a vector: x <- setNames(1:3, c("a", "b", "c"))

2.3 Factors
A factor is a vector that can contain only predefined values, and is used to store categorical data.
Factors are built on top of integer vectors using two attributes: the class(), “factor”, which makes
them behave differently from regular integer vectors, and the levels(), which defines the set of
allowed values.

x <- factor(c("a", "b", "b", "a"))


x
class(x)
levels(x)
# You can't use values that are not in the levels

Unfortunately, most data loading functions in R automatically convert character vectors to


factors. Use the argument stringsAsFactors = FALSE to suppress this behaviour, and then
manually convert character vectors to factors using your knowledge of the data.
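
For instance, a minimal sketch of this workflow (the file name and column used here are
hypothetical, purely for illustration):

# "survey.csv" is a hypothetical file with a character column named gender
survey <- read.csv("survey.csv", stringsAsFactors = FALSE)
survey$gender <- factor(survey$gender, levels = c("F", "M"))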

2.4 Matrices & Arrays


Adding a dim() attribute to an atomic vector allows it to behave like a multi-dimensional array. A
special case of the array is the matrix, which has two dimensions. Matrices and arrays are created
with matrix() and array(), or by using the assignment form of dim():

# Two scalar arguments to specify rows and columns


var1 <- matrix(1:6, ncol = 3, nrow = 2)


# One vector argument to describe all dimensions


var2 <- array(1:12, c(2, 3, 2))
# You can also modify an object in place by setting dim()
var3 <- 1:6
dim(var3) <- c(3, 2)
var3

length() and names() have high-dimensional generalisations:

• length() generalises to nrow() and ncol() for matrices, and dim() for arrays.

• names() generalises to rownames() and colnames() for matrices, and dimnames(), a list of
character vectors, for arrays.

c() generalises to cbind() and rbind() for matrices, and to abind() (from the abind package) for arrays.

2.5 Data Frames


A data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares
properties of both the matrix and the list. This means that a data frame has names(), colnames(),
and rownames(), although names() and colnames() are the same thing. The length() of a data
frame is the length of the underlying list and so is the same as ncol(); nrow() gives the number
of rows. you can subset a data frame like a 1d structure (where it behaves like a list), or a 2d
structure (where it behaves like a matrix).

2.5.1 Creation
You create a data frame using data.frame(), which takes named vectors as input:

df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)


str(df)

Beware data.frame()'s default behaviour, which turns strings into factors. Use
stringsAsFactors = FALSE to suppress this behaviour.

2.5.2 Testing & Coercion


To check if an object is a data frame, use class() or test explicitly with is.data.frame():

typeof(df)
class(df)
is.data.frame(df)
# You can coerce an object to a data frame with as.data.frame():

• A vector will create a one-column data frame.

• A list will create one column for each element; it’s an error if they’re not all the same length.

• A matrix will create a data frame with the same number of columns and rows.

2.5.3 Combining data frames


You can combine data frames using cbind() and rbind():

cbind(df, data.frame(z = 3:1))


rbind(df, data.frame(x = 10, y = "z"))

When combining column-wise, the number of rows must match, but row names are ignored.
When combining row-wise, both the number and names of columns must match. cbind() will
create a matrix unless one of the arguments is already a data frame. Instead use data.frame()
directly:


bad <- data.frame(cbind(a = 1:2, b = c("a", "b")))


str(bad)
good <- data.frame(a = 1:2, b = c("a", "b"), stringsAsFactors = FALSE)
str(good)



Chapter 3

Data Input into R

3.1 Introduction

gene_data <- read.table("filename",
    comment.char = "/",       # default comment char is "#"
    blank.lines.skip = TRUE,  # alternatively use the argument skip = <number of lines to skip>
    header = TRUE,
    sep = "\t",               # default data field separator is the white space, " "
    nrows = 20,
    na.strings = 0.00         # the value that is to be treated as NA (Not Available)
)

# Basically, the read.csv() function is a derivative of read.table() with the
# following default parameters:

runner_data <- read.table("runners.csv", header = TRUE, sep = ",", …)

# Read tab separated values


read.delim(file.choose())

# Read comma (",") separated values


read.csv(file.choose())

# Read semicolon (";") separated values


read.csv2(file.choose())

# To read data from excel file, package required xlsx


stocks_table <- read.xlsx("filename.xlsx", # you can read .xls file
sheetIndex = 1,
rowIndex = c(1:28),
colIndex = c(1:5,7))

Source: https://www.r-bloggers.com/importing-data-to-r/

Year Runner Time


1 2007 Usain_Bolt 10.03
2 2008 Usain_Bolt 9.72
3 2009 Usain_Bolt 9.58
4 2010 Usain_Bolt 9.82
5 2011 Usain_Bolt 9.76
6 2012 Usain_Bolt 9.63
7 2004 Asafa_Powell 10.02
8 2005 Asafa_Powell 9.87
...

# To convert data from long to wide format use the cross tabulation function xtabs()


runners_wide <- xtabs(formula = Time ~ Runner + Year, data = runners_long)

Year
Runner 2004 2005 2007 2008 2009 2010 2011 2012
Asafa_Powell 10.02 9.87 0.00 0.00 0.00 0.00 0.00 0.00
Usain_Bolt 0.00 0.00 10.03 9.72 9.58 9.82 9.76 9.63

If you just want to transpose your data, that is, switch columns and rows, you can use
the t() function:
transposed_data <- t(my_data)



Chapter 4

Data Management

4.1 Arithmetic operators

Operator     Description
+            Addition
-            Subtraction
*            Multiplication
/            Division
^ or **      Exponentiation
x %% y       Modulus (x mod y); 5 %% 2 is 1
x %/% y      Integer division; 5 %/% 2 is 2

4.2 Mathematical functions


Function                      Description
abs(x)                        Absolute value. abs(-4) returns 4.
sqrt(x)                       Square root. sqrt(25) returns 5; this is the same as 25^0.5.
ceiling(x)                    Smallest integer not less than x. ceiling(3.475) returns 4.
floor(x)                      Largest integer not greater than x. floor(3.475) returns 3.
trunc(x)                      Integer formed by truncating values in x toward 0. trunc(5.99) returns 5.
round(x, digits=n)            Round x to the specified number of decimal places.
                              round(3.475, digits=2) returns 3.48.
signif(x, digits=n)           Round x to the specified number of significant digits.
                              signif(3.475, digits=2) returns 3.5.
cos(x), sin(x), tan(x)        Cosine, sine, and tangent. cos(2) returns -0.416.
acos(x), asin(x), atan(x)     Arc-cosine, arc-sine, and arc-tangent. acos(-0.416) returns 2.
cosh(x), sinh(x), tanh(x)     Hyperbolic cosine, sine, and tangent. sinh(2) returns 3.627.
acosh(x), asinh(x), atanh(x)  Hyperbolic arc-cosine, arc-sine, and arc-tangent. asinh(3.627) returns 2.


log(x, base=n)                Logarithm of x to the base n.
log(x)                        Natural logarithm. log(10) returns 2.3026.
log10(x)                      Common logarithm. log10(10) returns 1.
exp(x)                        Exponential function. exp(2.3026) returns 10.

4.3 Statistical functions


z<- mean(x, trim = 0.05, na.rm=TRUE)
provides the trimmed mean, dropping the highest and lowest 5 percent of scores and any missing
values.

Function                Description
mean(x)                 Mean. mean(c(1,2,3,4)) returns 2.5.
median(x)               Median. median(c(1,2,3,4)) returns 2.5.
sd(x)                   Standard deviation. sd(c(1,2,3,4)) returns 1.29.
var(x)                  Variance. var(c(1,2,3,4)) returns 1.67.
mad(x)                  Median absolute deviation. mad(c(1,2,3,4)) returns 1.48.
quantile(x, probs)      Quantiles, where x is the numeric vector for which quantiles are desired
                        and probs is a numeric vector of probabilities in [0, 1]. The 30th and 84th
                        percentiles of x are calculated using y <- quantile(x, c(.3, .84)).
range(x)                Range. x <- c(1, 2, 3, 4); range(x) returns c(1, 4); diff(range(x)) returns 3.
sum(x)                  Sum. sum(c(1,2,3,4)) returns 10.
diff(x, lag=n)          Lagged differences, with lag indicating which lag to use. The default lag
                        is 1. x <- c(1, 5, 23, 29); diff(x) returns c(4, 18, 6).
min(x)                  Minimum. min(c(1,2,3,4)) returns 1.
scale(x, center=TRUE,   Column center (center = TRUE) or standardize (center = TRUE,
  scale=TRUE)           scale = TRUE) the data object x.

4.3.1 Standardizing Data


By default, the scale() function standardizes the specified columns of a matrix or data frame to a
mean of 0 and a standard deviation of 1:
newdata <- scale(mydata)
To standardize each column to an arbitrary mean and standard deviation, you can use code similar
to the following:
newdata <- scale(mydata) * SD + M
where M is the desired mean and SD is the desired standard deviation. Using the scale() function
on non-numeric columns will produce an error. To standardize a specific column rather than an
entire matrix or data frame, you can use code such as
newdata <- transform(mydata, myvar = scale(myvar) * 10 + 50)
This code standardizes the variable myvar to a mean of 50 and standard deviation of 10.
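
For example, a small sketch using the built-in mtcars dataset (the columns mpg and wt come from
that dataset):

data(mtcars)
zscores <- scale(mtcars[, c("mpg", "wt")])   # each column centered to mean 0, sd 1
head(zscores, 3)
# rescale mpg to an arbitrary mean of 50 and standard deviation of 10
newdata <- transform(mtcars, mpg_std = as.numeric(scale(mpg)) * 10 + 50)
head(newdata$mpg_std, 3)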


4.3.2 Probability functions


In R, probability functions take the form

[dpqr]distribution_abbreviation()

where the first letter refers to the aspect of the distribution returned:

d = density
p = distribution function (cumulative probability)
q = quantile function
r = random generation (random deviates)
The common probability functions are listed in the table below:

Distribution               Abbreviation      Distribution               Abbreviation

Beta                       beta              Logistic                   logis
Binomial                   binom             Multinomial                multinom
Cauchy                     cauchy            Negative binomial          nbinom
Chi-squared (noncentral)   chisq             Normal                     norm
Exponential                exp               Poisson                    pois
F                          f                 Wilcoxon Signed Rank       signrank
Gamma                      gamma             T                          t
Geometric                  geom              Uniform                    unif
Hypergeometric             hyper             Weibull                    weibull
Lognormal                  lnorm             Wilcoxon Rank Sum          wilcox
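
For instance, combining a prefix with an abbreviation from the table (uniform and Poisson shown
here; the values in the comments are standard results):

dunif(0.5, min = 0, max = 1)   # density of Uniform(0, 1) at 0.5: 1
punif(0.25)                    # P(U <= 0.25) for Uniform(0, 1): 0.25
qpois(0.5, lambda = 4)         # median of a Poisson(4) variable: 4
rpois(3, lambda = 4)           # three random Poisson(4) deviates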


4.3.3 Character functions


Function                      Description
nchar(x)                      Counts the number of characters of x.
                              x <- c("ab", "cde", "fghij")
                              length(x) returns 3; nchar(x[3]) returns 5.
substr(x, start, stop)        Extract or replace substrings in a character vector.
                              x <- "abcdef"
                              substr(x, 2, 4) returns "bcd".
                              substr(x, 2, 4) <- "22222" (x is now "a222ef").
grep(pattern, x,              Search for pattern in x. If fixed=FALSE, then pattern is a regular
  ignore.case=FALSE,          expression. If fixed=TRUE, then pattern is a text string.
  fixed=FALSE)                Returns the matching indices.
                              grep("A", c("b", "A", "c"), fixed=TRUE) returns 2.


Function                      Description
sub(pattern, replacement, x,  Find pattern in x and substitute the replacement text.
  ignore.case=FALSE,          If fixed=FALSE, then pattern is a regular expression.
  fixed=FALSE)                If fixed=TRUE, then pattern is a text string.
strsplit(x, split,            Split the elements of character vector x at split.
  fixed=FALSE)                If fixed=FALSE, then split is a regular expression.
                              If fixed=TRUE, then split is a text string.
paste(..., sep="")            Concatenate strings after using the sep string to separate them.
                              paste("x", 1:3, sep="") returns c("x1", "x2", "x3").
                              paste("x", 1:3, sep="M") returns c("xM1", "xM2", "xM3").
toupper(x)                    Uppercase. toupper("abc") returns "ABC".
tolower(x)                    Lowercase. tolower("ABC") returns "abc".


4.3.4 Other useful functions


Function                      Description
length(x)                     Length of object x.
                              x <- c(2, 5, 6, 9); length(x) returns 4.
seq(from, to, by)             Generate a sequence.
                              indices <- seq(1, 10, 2); indices is c(1, 3, 5, 7, 9).
rep(x, n)                     Repeat x n times.
                              y <- rep(1:3, 2); y is c(1, 2, 3, 1, 2, 3).
cut(x, n)                     Divide continuous variable x into a factor with n levels. To create
                              an ordered factor, include the option ordered_result = TRUE.
pretty(x, n)                  Create pretty breakpoints. Divides a continuous variable x into n
                              intervals by selecting n+1 equally spaced rounded values.
                              Often used in plotting.
cat(..., file="myfile",       Concatenates the objects in ... and outputs them to the screen or to
  append=FALSE)               a file (if one is declared).

4.4 Applying functions to data objects


R provides a function, apply() , that allows you to apply an arbitrary function to any dimension
of a matrix, array, or data frame. The format for the apply function is
apply(x, MARGIN, FUN, ...)
where x is the data object, MARGIN is the dimension index, FUN is a function you specify, and
... are any parameters you want to pass to FUN. In a matrix or data frame MARGIN=1 indicates
rows and MARGIN=2 indicates columns.
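
For example, a brief sketch (the matrix here is made up for illustration):

m <- matrix(1:6, nrow = 2)            # a 2 x 3 matrix
apply(m, 1, sum)                      # MARGIN = 1: sum of each row -> 9 12
apply(m, 2, mean)                     # MARGIN = 2: mean of each column -> 1.5 3.5 5.5
apply(m, 2, quantile, probs = 0.5)    # extra arguments (probs) are passed on to FUN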



Chapter 5

Subsetting

5.1 Atomic vectors


There are five things that you can use to subset a vector:

x <- c(2.1, 4.2, 3.3, 5.4)

• Positive integers return elements at the specified positions:

x[c(3, 1)]
x[order(x)]
# Duplicated indices yield duplicated values
x[c(1, 1)]
# Real numbers are silently truncated to integers
x[c(2.1, 2.9)]

• Negative integers omit elements at the specified positions

x[-c(3, 1)]

• Logical vectors select elements where the corresponding logical value is TRUE.

x[c(TRUE, TRUE, FALSE, FALSE)]


x[x > 3]

• Nothing returns the original vector.

x[ ]

• Zero returns a zero-length vector.

x[0]

If the vector is named, you can also use:

• Character vectors to return elements with matching names.


(y <- setNames(x, letters[1:4]))


y[c("d", "c", "a")]
# Like integer indices, you can repeat indices
y[c("a", "a", "a")]
# When subsetting with [ names are always matched exactly
z <- c(abc = 1, def = 2)
z[c("a", "d")]

5.2 Lists
Using [ will always return a list; [[ and $ let you pull out the components of the list.
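
A minimal illustration (the list below is made up for this example):

l <- list(nums = 1:3, name = "abc")
l[1]       # a list containing the first component
l[[1]]     # the contents of the first component: 1 2 3
l$name     # same as l[["name"]]: "abc"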

5.3 Matrices and Arrays


You can subset higher-dimensional structures in three ways:

• With multiple vectors.

• With a single vector.

• With a matrix.

The most common way of subsetting matrices (2d) and arrays (>2d) is a simple generalisation
of 1d subsetting: you supply a 1d index for each dimension, separated by a comma.

a <- matrix(1:9, nrow = 3)


colnames(a) <- c("A", "B", "C")
a[1:2, ]
a[c(T, F, T), c("B", "A")]
a[0, -2]

5.4 Data Frames


Data frames possess the characteristics of both lists and matrices: if you subset with a single
vector, they behave like lists; if you subset with two vectors, they behave like matrices.

df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])


df[df$x == 2, ]
df[c(1, 3), ]
# There are two ways to select columns from a data frame
# Like a list:
df[c("x", "z")]
# Like a matrix
df[, c("x", "z")]
# There's an important difference if you select a single
# column: matrix subsetting simplifies by default, list
# subsetting does not.
str(df["x"])
str(df[, "x"])

5.5 Subsetting Operators


There are two other subsetting operators: [[ and $. [[ is similar to [, except it can only return a
single value and it allows you to pull pieces out of a list. $ is a useful shorthand for [[ combined


with character subsetting.


When [ is applied to a list it always returns a list: it never gives you the contents of the list. To
get the contents, you need [[.

a <- list(a = 1, b = 2)
a[[1]]

5.5.1 Simplifying vs. preserving subsetting


Simplifying subsets returns the simplest possible data structure that can represent the
output. Preserving subsetting keeps the structure of the output the same as the input.
Preserving is the same for all data types: you get the same type of output as input.
Unfortunately, how you switch between simplifying and preserving differs for different data
types, as summarised in the table below.

Simplifying Preserving
Vector x[[1]] x[1]
List x[[1]] x[1]
Factor x[1:4, drop = T] x[1:4]
Array x[1, ] or x[, 1] x[1, , drop = F] or x[, 1, drop = F]
Data frame x[, 1] or x[[1]] x[, 1, drop = F] or x[1]

5.5.2 $
$ is a shorthand operator, where x$y is equivalent to x[["y", exact = FALSE]].

x <- list(abc = 1)
x$a
x[["a"]]

5.5.3 Missing/out of bounds indices


[ and [[ differ slightly in their behaviour when the index is out of bounds (OOB), for example,
when you try to extract the fifth element of a length four vector, or subset a vector with NA or
NULL.

x <- 1:4
str(x[5])
str(x[NA_real_])
str(x[NULL])

The following table summarises the results of subsetting atomic vectors and lists with [ and [[
and different types of OOB value.


Operator Index Atomic List

[ OOB NA list(NULL)
[ NA_real_ NA list(NULL)
[ NULL x[0] list(NULL)
[[ OOB Error Error
[[ NA_real_ Error NULL
[[ NULL Error Error

5.6 Subsetting and assignment


All subsetting operators can be combined with assignment to modify selected values of the input
vector.

x <- 1:5
x[c(1, 2)] <- 2:3
x
# The length of the LHS needs to match the RHS
x[-1] <- 4:1
x
# Note that there's no checking for duplicate indices
x[c(1, 1)] <- 2:3
x
# You can't combine integer indices with NA
x[c(1, NA)] <- c(1, 2)
# Error: NAs are not allowed in subscripted assignments
# But you can combine logical indices with NA
# (where they're treated as false).
x[c(T, F, NA)] <- 1
x
# This is mostly useful when conditionally modifying vectors
df <- data.frame(a = c(1, 10, NA))
df$a[df$a < 5] <- 0
df$a



Chapter 6

Plots in R

6.1 Introduction
Functions in the ggplot2 package are used to produce plots in R. As usual, the required package
can be installed and loaded using the following code:

if (!require(ggplot2)){
install.packages('ggplot2')
library(ggplot2)
}

6.1.1 Terminology related to ggplot2


• Data is what we want to visualize. It consists of variables, which are stored as columns in
a data frame.

• Geoms are the geometric objects that are drawn to represent the data, such as bars, lines,
and points.

• Aesthetic attributes, or aesthetics, are visual properties of geoms, such as x and y


position, line color, point shapes, etc. There are mappings from data values to aesthetics.
Some commonly used aesthetics include horizontal and vertical position, color, size, and
shape. Some aesthetics can only work with categorical variables, such as the shape of a
point: triangle, circle, square, etc. Some aesthetics work with categorical or continuous
variables, such as x (horizontal) position.

• Scales: They control the mapping from the values in the data space to values in the aesthetic
space. A continuous y scale maps larger numerical values to vertically higher positions in
space.

• Guides: To interpret the graph, viewers refer to the guides. They show the viewer how to
map the visual properties back to the data space. The most commonly used guides are the
tick marks and labels on an axis. A legend is another type of scale. A legend might show
people what it means for a point to be a circle or a triangle, or what it means for a line to
be blue or red.

• Stats: Sometimes your data must be transformed or summarized before it is mapped to an


aesthetic. This is true, for example, with a histogram, where the samples are grouped into
bins and counted. The counts for each bin are then used to specify the height of a bar. Some
geoms, like geom_histogram(), automatically do this for you, but sometimes you'll want to
do this yourself, using the various stat_*() functions.

• Themes: Some aspects of a graph’s appearance fall outside the scope of the grammar of
graphics. These include the color of the background and grid lines in the graphing area, the
fonts used in the axis labels, and the text in the graph title. These are controlled with the
theme() function.


6.2 Dataset

set.seed(1234)
# Randomly generate weights of 200 females and 200 males
wdata = data.frame(
gender = factor(rep(c("F", "M"), each = 200)),
weight = c(rnorm(n = 200, mean = 55, sd = 5), rnorm(n = 200, mean = 58, sd = 5)))

head(wdata, 3)
gender weight
1 F 48.96467
2 F 56.38715
3 F 60.42221

tail(wdata, 3)
gender weight
398 M 60.69417
399 M 58.07322
400 M 53.41755

# create an empty plot


a <- ggplot(wdata, aes(x = weight))

mu <- aggregate(x = wdata$weight, by = list(wdata$gender), FUN = mean)


colnames(mu) <- c('gender', 'grp.mean')
mu

gender grp.mean
1 F 54.71120
2 M 58.36625

6.3 Histogram
A histogram represents the distribution of a continuous variable by dividing its range into bins
and counting the number of observations in each bin. The function geom_histogram() is used to
create a histogram plot. You can also add a vertical line for the mean using the function
geom_vline(). Key arguments to customize the plot: alpha, color, fill, linetype, size

# Build histogram from empty plot a


# Position adjustment: dodge, identity, stack ( default )
# Add mean lines and color by gender
a + geom_histogram(aes(color = gender), fill = "white", position = "dodge", alpha = 0.6) +
    geom_vline(data = mu, aes(xintercept = grp.mean, color = gender), linetype = "dashed")


By default y axis corresponds to the count of weight values. If you want to change the plot in
order to have the density on y axis use aes(y = ..density..)

a + geom_histogram(aes(y = ..density.., color = gender), fill = "white",
                   position = "dodge", alpha = 0.6) +
    geom_vline(data = mu, aes(xintercept = grp.mean, color = gender), linetype = "dashed")

6.4 Bar plot


The function geom_bar() can be used to visualize one discrete variable. In this case, the count of
each level is plotted. Key arguments to customize the plot: alpha, color, fill, linetype and size.

data(mtcars)
head(mtcars, 3)

mpg cyl disp hp drat wt qsec vs am gear carb


Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1

ggplot(mtcars, aes(gear)) +
geom_bar(fill = "steelblue")

# The plot shows the number of vehicles with 3, 4 and 5 gears.

6.5 Scatter plot


The function geom_point() can be used to visualize the relationship between two continuous
variables. Key arguments to customize the plot: alpha, color, fill, shape and size.


# Convert cyl column to factor variable


mtcars$cyl <- as.factor(mtcars$cyl)
# Empty plot with weight along x-axis and miles per gallon along y-axis
b <- ggplot(mtcars, aes(x = wt, y = mpg))
# Point shape changes with the number of cylinders (cyl) and point size varies with
# the acceleration (qsec) of the vehicle
b + geom_point(aes(shape = cyl, size = qsec))

6.6 Line plot


A line chart or line graph displays the evolution of one or several numeric variables. Data points
are connected by straight line segments. It is similar to a scatter plot except that the measurement
points are ordered (typically by their x-axis value) and joined with straight line segments. A line
chart is often used to visualize a trend in data over intervals of time – a time series – thus the line
is often drawn chronologically. The function geom_line() can be used to generate line plot. Key
arguments to customize the plot: alpha, color, linetype, size.

# create data
xValue <- 1:100
yValue <- cumsum(rnorm(100))
data <- data.frame(xValue,yValue)
head(data, 3)
xValue yValue
1 1 -1.226815
2 2 -1.190662
3 3 -1.612055

# Plot
ggplot(data, aes(x=xValue, y=yValue)) +
geom_line()


6.7 Box plot


To visualize the relationship between a discrete (grouping) variable and a continuous variable,
we use a box plot. The function geom_boxplot() is used to create a box plot. A simplified format is:

geom_boxplot(outlier.colour = "black", outlier.shape = 16, outlier.size = 2, notch = FALSE)

• outlier.colour, outlier.shape, outlier.size: The color, the shape and the size for outlying points
• notch: logical value. If TRUE, makes a notched box plot. The notch displays a confidence
interval around the median, which is normally based on median ± 1.58 * IQR / sqrt(n).
Notches are used to compare groups; if the notches of two boxes do not overlap, this is
strong evidence that the medians differ.

data("ToothGrowth")
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
head(ToothGrowth, 3)
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5

e <- ggplot(ToothGrowth, aes(x = dose, y = len))


e + geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4)
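
A notched version of the same plot, using the notch argument described above:

e + geom_boxplot(notch = TRUE)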

6.7.1 Interpretation of Box plot


6.8 Manipulating Strings in R


6.8.1 paste() function
The paste() function takes one or more R objects, converts them to “character”, and then it
concatenates (pastes) them to form one or several character strings. Its usage has the following
form:

paste(..., sep = " ", collapse = NULL)

The argument ... means that it takes any number of objects. The argument sep is a character
string that is used as a separator. The argument collapse is an optional string to indicate if we
want all the terms to be collapsed into a single string. Here is a simple example with paste():

PI = paste("The life of", pi)


PI
## [1] "The life of 3.14159265358979"

If we give paste() objects of different length, then it will apply a recycling rule. For example, if
we paste a single character “x” with the sequence 1:5, and separator sep = “.” this is what we get:

# paste with objects of different lengths


paste("x", 1:5, sep = ".")
## [1] "x.1" "x.2" "x.3" "x.4" "x.5"

To see the effect of the collapse argument, let’s compare the difference with collapsing and
without it:

# paste with collapse combines all substrings and makes one string
paste(c("x", "y", "z"), 1:3, sep = " ", collapse = "")
## [1] "x 1y 2z 3"

# paste without collapse keeps the substrings as they are

paste(c("x", "y", "z"), 1:3, sep = " ")
## [1] "x 1" "y 2" "x 3"

In addition to paste(), there's also the function paste0(..., collapse), which is equivalent to
paste(..., sep = "", collapse)

# collapsing with paste0()


paste0("let's", "collapse", "all", "these", "words")
## [1] "let'scollapseallthesewords"

6.8.2 sprintf()
The function sprintf() returns a formatted string combining text and variable values. The nice
feature about sprintf() is that it provides us a very flexible way of formatting vector elements as
character strings. Its usage has the following form: sprintf(fmt, ...).
The argument fmt is a character vector of format strings. The allowed conversion specifications
start with the symbol % followed by numbers and letters. For demonstration purposes, here are
several ways in which the number pi can be formatted:

# '%f' indicates 'fixed point' decimal notation


sprintf("%f", pi)
## [1] "3.141593"

# decimal notation with 3 decimal digits


sprintf("%.3f", pi)
## [1] "3.142"

# 1 integer and 0 decimal digits


sprintf("%1.0f", pi)
## [1] "3"

# field width of 5 characters with 1 decimal digit


sprintf("%5.1f", pi)
## [1] " 3.1"

sprintf("%05.1f", pi)
## [1] "003.1"

# print with sign (positive)


sprintf("%+f", pi)
## [1] "+3.141593"

# prefix a space
sprintf("% f", pi)
## [1] " 3.141593"

# left adjustment
sprintf("%-10f", pi) # left justified
## [1] "3.141593 "

# exponential decimal notation 'e'


sprintf("%e", pi)
## [1] "3.141593e+00"

# exponential decimal notation 'E'


sprintf("%E", pi)
## [1] "3.141593E+00"

# number of significant digits (6 by default)


sprintf("%g", pi)
## [1] "3.14159"

6.9 Regular expressions in R


Regular expressions in R can be divided into 5 categories:

1. Metacharacters: They comprise a set of special operators that regex does not capture (i.e.
they are not matched literally). These characters include: ". | ( ) [ ] $ * + ?".

2. Sequences: They contain special characters used to describe a pattern in a given string.
Following are the commonly used sequences in R:
Sequence Description

\d matches a digit character


\D matches a non-digit character
\s matches a space character
\S matches a non-space character
\w matches a word character
\W matches a non-word character
\b matches a word boundary
\B matches a non-word boundary

3. Quantifiers


Quantifier   Description
.            Matches everything except a newline.
?            The item to its left is matched zero or one time.
*            The item to its left is matched zero or more times.
+            The item to its left is matched one or more times.
{n}          The item to its left is matched exactly n times. The item must have
             consecutive repetitions in place.
{n,}         The item to its left is matched n or more times.
{n,m}        The item to its left is matched at least n times but not more than m times.
.*           Known as a greedy match: for a particular pattern to be matched, it will try
             to match the pattern as many times as its repetitions are available.
             For example:
             regmatches("101000000000100", gregexpr(pattern = "1.*1",
                        text = "101000000000100"))
             # Output : "1010000000001"
.?           Known as a non-greedy match: for a particular pattern to be matched, it will
             stop at the first match. For example:
             regmatches("101000000000100", gregexpr(pattern = "1.?1",
                        text = "101000000000100"))
             # Output : "101"

4. Character Classes: They refer to a set of characters enclosed in a square bracket [ ].
These classes match only the characters enclosed in the bracket. These classes can also be
used in conjunction with quantifiers. The use of the caret (^) symbol in character classes is
interesting: it negates the expression and searches for everything except the specified
pattern. Following are the types of character classes used in regex:

Class             Description
[aeiou]           matches lower case vowels
[AEIOU]           matches upper case vowels
[0123456789]      matches any digit
[0-9]             same as the previous class
[a-z]             matches any lower case letter
[A-Z]             matches any upper case letter
[a-zA-Z0-9]       matches any of the above classes
[^aeiou]          matches everything except lower case vowels
[^0-9]            matches everything except digits

5. POSIX character classes: In R, these classes can be identified as enclosed within a double
square bracket ([[ ]]). They work like character classes. A caret (^) ahead of an expression
negates the expression value.

Class             Description
[[:lower:]]       matches lower case letters
[[:upper:]]       matches upper case letters
[[:alpha:]]       matches letters
[[:digit:]]       matches digits
[[:space:]]       matches space characters, e.g. tab, newline, vertical tab, space, etc.
[[:blank:]]       matches blank characters such as space and tab
[[:cntrl:]]       matches control characters. Control characters are non-printable characters
                  such as \t (tab), \n (new line), \e (escape), \f (form feed), etc.
[[:punct:]]       matches punctuation characters
[[:xdigit:]]      matches hexadecimal digits (0-9, a-f, A-F)
[[:alnum:]]       matches alphanumeric characters, e.g. AB12, ID101, etc.
[[:print:]]       matches printable characters ([[:alpha:]], [[:punct:]] and space)
[[:graph:]]       matches graphical characters. Graphical characters comprise [[:alpha:]]
                  and [[:punct:]]

6.9.1 Examples
1. Extract digits from a string of characters

#extract digits - all 3 works


mystring <- "My roll number is 1006781"
gsub(pattern = "[^0-9]", replacement = "", x = mystring)
regmatches(mystring, regexpr(pattern = "[0-9]+", text = mystring))
regmatches(mystring, regexpr(pattern = "[[:digit:]]+", text = mystring))
# Output : "1006781"

2. Remove spaces from a line of strings

# remove space - all 3 work

gsub(pattern = "[[:space:]]", replacement = "", x = "I am going to college tomorrow")
gsub(pattern = "[[:blank:]]", replacement = "", x = "I am going to college tomorrow")
gsub(pattern = "\\s", replacement = "", x = "I am going to college tomorrow")
# Output : "Iamgoingtocollegetomorrow"

3. Return if a value is present in a vector

#match values
vec <- c("A1","A2","A3","A4","A5","A6","A7")
grep(pattern = "A1|A4", x = vec, value = TRUE)
# Output : "A1" "A4"

4. Extract strings which are available in key value pairs

vec <- c("(monday :: 0.1231313213)","tomorrow","(tuesday :: 0.1434343412)")


grep(pattern = "\\([a-z]+ :: (0\\.[0-9]+)\\)", x = vec, value = TRUE)
regmatches(vec, regexpr(pattern = "\\((.*) :: (0\\.[0-9]+)\\)", text = vec))


# Output : "(monday :: 0.1231313213)" "(tuesday :: 0.1434343412)"

Explanation: A double backslash is used to escape the metacharacter "(". "[a-z]+" matches
letters one or more times. "(0\\.[0-9]+)" matches the decimal value, where the metacharacter
"." is escaped using a double backslash. The numbers are matched using "[0-9]+".

5. In a key value pair, extract the values

mystring = c("G1:E001", "G2:E002", "G3:E003")


gsub(pattern = ".*:", replacement = "", x = mystring)
# Output : "E001" "E002" "E003"

Explanation: In the regex above, ".*:" matches everything it can (except a newline) until it
reaches the colon (:); then the gsub() function replaces the match with an empty string. Hence,
we get the desired output.

6. Remove punctuation from a line of text

mystring <- "a1~!@#$%^&*bcd(){}_+:efg\"<>?,./;'[]-="


gsub(pattern = "[[:punct:]]+", replacement = "", x = mystring)
# Output : "a1bcdefg"

7. Remove digits from a string which contains alphanumeric characters. The desired output
from the string given below is "day of 2nd ID5 Conference". We can't do it with the simple
"[[:digit:]]+" regex, as it will match all the digits in the given string. Instead, in such a
case, we detect the digit boundaries to get the desired result.

mystring <- "day of 2nd ID5 Conference 19 12 2005"


gsub(pattern = "\\b\\d+\\b", replacement = "", x = mystring)
# Output : "day of 2nd ID5 Conference "

8. Find the location(index) of digits in a string

mystring <- "there were 2 players each in 8 teams"


unlist(gregexpr(pattern = '\\d', text = mystring))
# Output : 12 30

9. Extract information available inside parentheses (brackets) in a string

mystring <- "What are we doing tomorrow ? (laugh) Play soccer (groans) (cries)"
gsub(pattern = "[\\(\\)]", replacement = "", x = regmatches(mystring,
gregexpr("\\(.*?\\)", mystring))[[1]])
# Output : "laugh" "groans" "cries"

Explanation: In this solution, we’ve used the lazy matching technique. First, using
regmatches, we’ve extracted the parentheses with words such as (cries) (groans) (laugh).
Then, we’ve simply removed the brackets using the gsub() function.

10. Extract only the first digit in a range

vec <- c("75 to 79", "80 to 84", "85 to 89")


gsub(pattern = " .*\\d+", replacement = "", x = vec)
# Output : "75" "80" "85"

11. Extract email addresses from a given string

mystring <- c("My email address is abc@boeing.com","my email address is


def@jobs.com","aescher koeif","paul renne")


unlist(regmatches(x = mystring, gregexpr(pattern = "[[:alnum:]]+\\@[[:alpha:]]+\\.com",
                                         text = mystring)))
# Output : "abc@boeing.com" "def@jobs.com"



Chapter 7

Probability Distributions

7.1 Normal Distribution


In probability theory, a normal (or Gaussian or Gauss or Laplace-Gauss) distribution is a type
of continuous probability distribution for a real-valued random variable. The general form of its
probability density function is

    f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)    (7.1)

where µ is the population mean and σ is the population standard deviation.
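
A quick sketch connecting this to the d/p/q/r prefixes of Section 4.3.2 (the values in the
comments are standard results for the standard normal distribution):

dnorm(0)                      # density at x = 0: 0.3989423
pnorm(1.96)                   # P(X <= 1.96): 0.9750021
qnorm(0.975)                  # 97.5th percentile: 1.959964
rnorm(5, mean = 55, sd = 5)   # five random deviates from a normal with mean 55, sd 5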

7.2 Binomial Distribution


The binomial distribution describes experiments involving trials with only 2 possible outcomes, such as the tossing of a coin.
Example1:-

Toss a fair coin three times ... what is the chance of getting two Heads?

• Outcome: any result of three coin tosses (8 different possibilities)

• Event: “Two Heads” out of three coin tosses (3 outcomes have this)

• P(Two Heads) = P(HHT) + P(HTH) + P(THH) = 1/8 + 1/8 + 1/8 = 3/8


Example2:-

Toss a fair coin nine times ... what is the chance of getting five Heads?

• Outcome: any result of nine coin tosses

• Total number of outcomes is 2^9 = 512 different possibilities

• Event: “Five Heads” out of nine coin tosses

• The formula to calculate the number of desirable outcomes is as follows

      \binom{n}{k} = \frac{n!}{k!\,(n-k)!} = 126    (7.2)

  where n = total number of choices = 9, k = number of desirable choices = 5

• The formula to calculate the probability of a desirable outcome is as follows

      p^k (1-p)^{n-k}    (7.3)

  where n = total number of choices (9 in this example), k = number of desirable choices
  (5 in this example) and p is the probability of a desirable choice (0.5 in this example)

• P(5 Heads in 9 tosses) = No. of desirable outcomes * P(a desirable outcome)
  = 126 * (0.5^5)(0.5^{9-5}) = 0.246

The General Binomial Probability Formula is

    P(k \text{ out of } n) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}    (7.4)

• The trials are independent,

• There are only two possible outcomes at each trial,

• The probability of “success” p at each trial is constant.
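
As a quick check of Example 2 in R (choose() gives the binomial coefficient, and dbinom() is the
binomial density function from Section 4.3.2):

choose(9, 5) * 0.5^5 * (1 - 0.5)^(9 - 5)   # 0.2460938
dbinom(5, size = 9, prob = 0.5)            # same result: 0.2460938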

Example3:-

Suppose there are twelve multiple choice questions in an English class quiz. Each question has
five possible answers, and only one of them is correct. Find the probability of having four or less
correct answers if a student attempts to answer every question at random.


• Each question is an independent trial with two outcomes: the guessed answer is either
  correct or incorrect

• Probability of a correct answer when guessing at random: p = 1/5 = 0.2 (one correct choice
  out of five)

• Number of trials: n = 12

• Event: four or fewer correct answers, i.e. k = 0, 1, 2, 3 or 4

• Applying the General Binomial Probability Formula (7.4) for each value of k and adding the
  results gives

      P(X \le 4) = \sum_{k=0}^{4} \frac{12!}{k!\,(12-k)!}\,(0.2)^k\,(0.8)^{12-k} \approx 0.927    (7.5)
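
The same result can be obtained in R, either by summing dbinom() over k = 0, ..., 4 or with the
cumulative distribution function pbinom():

sum(dbinom(0:4, size = 12, prob = 0.2))   # approximately 0.9274
pbinom(4, size = 12, prob = 0.2)          # approximately 0.9274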



Chapter 8

Hypothesis Tests

8.1 T Test (Student’s T-Test)


The t-test tells you how significant the differences between group means are; in other words, it lets you
know whether those differences (measured in means/averages) could have happened by chance.
A very simple example: let’s say you have a cold and you try a naturopathic remedy.
Your cold lasts a couple of days. The next time you have a cold, you buy an over-the-counter
pharmaceutical and the cold lasts a week. You survey your friends and they all tell you that their
colds were of a shorter duration (an average of 3 days) when they took the naturopathic remedy.
What you really want to know is: are these results repeatable? A t-test can tell you, by comparing
the means of the two groups and giving you the probability of those results happening by
chance.
Another example: Student’s T-tests can be used in real life to compare means. For example,
a drug company may want to test a new cancer drug to find out if it improves life expectancy. In
an experiment, there’s always a control group (a group who are given a placebo, or “sugar pill”).
The control group may show an average life expectancy of +5 years, while the group taking the
new drug might have a life expectancy of +6 years. It would seem that the drug might work. But
it could be due to a fluke. To test this, researchers would use a Student’s t-test to find out if the
results are repeatable for an entire population.

8.1.1 What is a T Test?

A t-test is an inferential statistical test that compares the means of two groups and tells you whether the difference between them is statistically significant or could simply be due to chance.

8.1.2 The T Score
The t score is a ratio between the difference between two groups and the difference within the
groups. The larger the t score, the more difference there is between groups. The smaller the t
score, the more similarity there is between groups. A t score of 3 means that the groups are three
times as different from each other as they are within each other. When you run a t test, the bigger
the t-value, the more likely it is that the results are repeatable.

• A large t-score tells you that the groups are different.

• A small t-score tells you that the groups are similar.

8.1.3 T Values and P Values


How big is “big enough”? Every t-value has a p-value to go with it. A p-value is the probability
that the results from your sample data occurred by chance. P-values range from 0 to 1 (0% to 100%); a small p-value (commonly below 0.05) indicates that the observed difference is unlikely to be due to chance alone.

8.1.4 Calculating the T Test


There are three main types of t-test:

• A One sample t-test is used in testing the null hypothesis that the population mean is equal
to a specified value µ0 , one uses the statistic
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \qquad (8.1)$$
where x̄ is the sample mean, s is the sample standard deviation and n is the sample size.


• An Independent Samples t-test compares the means for two groups.


Assumption: This test is used only when it can be assumed that the two distributions have
the same variance.
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \qquad (8.2)$$

$$s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}} \qquad (8.3)$$

where $\bar{x}_i$ is the ith sample mean, $s_p$ is the pooled standard deviation of the two samples, $n_i$ is the ith sample size, $n_i - 1$ is the number of degrees of freedom for each group, and the total sample size minus two (that is, $n_1 + n_2 - 2$) is the total number of degrees of freedom.
• Dependent t-test for paired samples. This test is used when the samples are dependent; that
is, when there is only one sample that has been tested twice (repeated measures) or when
there are two samples that have been matched or “paired”. This is an example of a paired
difference test. The t statistic is calculated as

$$t = \frac{\bar{x}_d - \mu_0}{s_d/\sqrt{n}} \qquad (8.4)$$
where x¯d and sd are the average and standard deviation of the differences between all pairs.

The pairs are, for example, one person’s pre-test and post-test scores, or pairs of persons matched into meaningful groups (for instance, drawn from the same family or age group). The constant µ0 is zero if we want to test whether the average of the differences is significantly different from zero. The number of degrees of freedom used is n − 1, where n represents the number of pairs.


8.1.5 What is a Paired T Test (Paired Samples T Test)?


A paired t-test (also called a correlated pairs t-test, a paired samples t-test or dependent
samples t-test) is where you run a t-test on dependent samples. Dependent samples are
essentially connected – they are tests on the same person or thing. For example:
• Knee MRI costs at two different hospitals,
• Two tests on the same person before and after training,
• Two blood pressure measurements on the same person using different equipment.
The null hypothesis for the independent samples t-test is µ1 = µ2 ; in other words, it
assumes the means are equal. With the paired t-test, the null hypothesis is that the mean pairwise
difference between the two tests is zero (H0 : µd = 0).
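
All three variants are available in R through the built-in t.test() function; the sketch below uses made-up before/after measurements purely for illustration:

# Illustrative (hypothetical) measurements on the same eight subjects
before <- c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3)
after  <- c(12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9)

# One sample t-test: is the mean of 'before' equal to mu0 = 15? (Eq. 8.1)
t.test(before, mu = 15)

# Independent samples t-test assuming equal variances (Eqs. 8.2 and 8.3)
t.test(before, after, var.equal = TRUE)

# Paired / dependent samples t-test: same subjects measured twice (Eq. 8.4)
t.test(before, after, paired = TRUE)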

8.2 Analysis of variance (ANOVA)


The one-way analysis of variance (ANOVA), also known as one-factor ANOVA, is an extension of
the independent two-samples t-test for comparing means in a situation where there are more than two
groups. In one-way ANOVA, the data are organized into several groups based on a single grouping
variable (also called a factor variable).




8.2.1 ANOVA test hypotheses:


• Null hypothesis: the means of the different groups are the same

• Alternative hypothesis: At least one sample mean is not equal to the others.

8.2.2 Assumptions of ANOVA test


• The observations are obtained independently and randomly from the population defined by
the factor levels

• The data of each factor level are normally distributed.

• These normal populations have a common variance. (Levene’s test can be used to check
this.)

8.2.3 Steps in ANOVA


Assume that we have 3 groups (A, B, C) to compare:
1. Compute the common variance, which is called the variance within samples (MSW) or residual variance, as follows:

$$MS_W = \frac{SS_W}{N - K}$$

$$SS_W = \sum_{i=1}^{K} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$$

where $x_{ij}$ is the jth observation in group i, $\bar{x}_i$ is the mean of group i, N is the total number of observations and K is the number of groups.

2. Compute the variance between sample means as follows:

• Compute the mean of each group


• Compute the variance between sample means M SB as follows:

$$MS_B = \frac{SS_B}{K - 1}$$

$$SS_B = \sum_{i=1}^{K} n\,(\bar{x}_i - \mu)^2$$

$$\bar{x}_i = \frac{1}{n}\sum_{j=1}^{n} x_{ij}, \qquad \mu = \frac{\sum_{i=1}^{K} \bar{x}_i}{K}$$

where $SS_B$ is the sum of squared deviations between the group means ($\bar{x}_i$) and the grand mean ($\mu$), K is the number of groups and n is the number of observations in a group.
3. Produce the F-statistic as the ratio $F = \frac{MS_B}{MS_W}$.

8.2.4 One-way ANOVA test in R
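
The original notes leave this section blank; the following is a minimal sketch using base R's aov() function on the built-in PlantGrowth data set (the choice of data set is illustrative, not from the notes):

# One-way ANOVA: does mean plant weight differ across the ctrl, trt1 and trt2 groups?
data(PlantGrowth)

res.aov <- aov(weight ~ group, data = PlantGrowth)

# ANOVA table with Sum Sq, Mean Sq, the F value (MSB/MSW) and its p-value
summary(res.aov)

# If the overall F-test is significant, pairwise group comparisons can follow
TukeyHSD(res.aov)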




Figure 8.1: Between Variance and Within Variance

Table 8.1: ANOVA summary


Source     Sum of squares      Degrees of freedom    Variance estimate (Mean Square)    F Ratio

Between    SSB                 K − 1                 MSB = SSB / (K − 1)                MSB / MSW

Within     SSW                 N − K                 MSW = SSW / (N − K)

Total      SST = SSB + SSW     N − 1



Chapter 9

Correlation

9.1 Introduction
Correlation is a measure of the strength of the relationship or association between two quantitative,
continuous variables, for example, age and blood pressure. When we analyse the correlation
between two variables, we should follow these steps:

Step 1: Look at the scatter diagram for any pattern.

• For a generally upward shape we say that the correlation is positive. As the
independent variable increases, the dependent variable generally increases.

• For a generally downward shape we say that the correlation is negative. As the
independent variable increases, the dependent variable generally decreases.

• For randomly scattered points with no upward or downward trend, we say there is no
correlation.


Step 2: Look at the spread of points to make a judgement about the strength of the correlation.

• For positive relationships we would classify the following scatter diagrams as:

• We classify the strengths for negative relationships in the same way:

Step 3: Look at the pattern of points to see if the relationship is linear.

9.2 Pearson’s correlation coefficient ( r )


It is also called product-moment correlation coefficient. It is a measure of the strength of association
between the paired observations (xi , yi ), i = 1, 2, ..., n, where both x and y can be measured on a
continuous scale. This is reasonable when you are dealing with measurable characteristics such as
height or weight, etc.

$$r_{xy} = \frac{s_{xy}}{s_x\, s_y} \qquad (9.1)$$

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \qquad (9.2)$$

where:

• sxy is the sample covariance

• sx is the sample standard deviation of X; and analogously for sy




• n is sample size

• xi , yi are the individual sample points indexed with i

• x̄ is the sample mean; and analogously for ȳ

The values (r and r2 ) are important because they tell us how close to linear a set of data is.
There is no point in fitting a linear relationship between two variables if they are clearly not
linearly related.

• All values of r lie between −1 and +1.

• If r = +1, the data is perfectly positively correlated.

• If 0 < r < 1, the data is positively correlated.

• If r = 0, the data shows no correlation.

• If −1 < r < 0, the data is negatively correlated.

• If r = −1, the data is perfectly negatively correlated. This means the data lie exactly in a
straight line with negative gradient.

• r2 is called the coefficient of determination.

The following table is a guide for describing the strength of linear association using the
coefficient of determination:

Value                  Strength of association

r² = 0                 no correlation
0 < r² < 0.25          very weak correlation
0.25 ≤ r² < 0.5        weak correlation
0.5 ≤ r² < 0.75        moderate correlation
0.75 ≤ r² < 0.9        strong correlation
0.9 ≤ r² < 1           very strong correlation
r² = 1                 perfect correlation

• Correlations may or may not indicate causal relations. Reversely, causal relations from some
variable to another variable may or may not result in a correlation between the two variables.

• Correlations are very sensitive to outliers; a single unusual observation may have a huge
impact on a correlation. Such outliers are easily detected by a quick inspection of a scatterplot.

9.3 Spearman’s rank correlation coefficient


There are cases where it is not possible or may not be worthwhile to measure certain variables.
For example, suppose a manufacturer of tea produced a number of different blends; you could
taste each blend and place the blends in order of preference. You do not, however, have a
numerical scale for measuring your preference. Similarly, it may be quicker to arrange a group of
individuals in order of height than to measure each one. Under these circumstances, Spearman’s
rank correlation coefficient is used.

• Spearman’s rank correlation coefficient is denoted by rs . In principle, rs is simply a
special case of the Pearson product-moment coefficient in which the data are converted to
rankings before calculating the coefficient. It is also used when one or both variables are
measured on an ordinal scale.




• To rank two sets of data X and Y , you give the rank 1 to the highest of the xi values, 2 to
the next highest, 3 to the next highest, and so on. You do the same for the yi values.

• Spearman’s rank correlation coefficient is sometimes used as an approximation for the
product-moment correlation coefficient as it is easier to calculate.

• It makes no difference if you rank the smallest as 1, the next smallest as 2, etc., provided
you do the same to both X and Y.
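
Both coefficients can be computed in R with the base cor() and cor.test() functions; a minimal sketch on the built-in mtcars data (the variables chosen are only an illustration):

# Pearson's r between car weight and fuel efficiency (built-in mtcars data)
cor(mtcars$wt, mtcars$mpg, method = "pearson")     # about -0.87, a strong negative correlation

# Spearman's rank correlation: values are converted to ranks before correlating
cor(mtcars$wt, mtcars$mpg, method = "spearman")

# cor.test() also returns a confidence interval and a p-value for H0: no correlation
cor.test(mtcars$wt, mtcars$mpg, method = "pearson")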



Chapter 10

Classification

10.1 Introduction
10.1.1 Classification vs Prediction
• Classification results in categorical (discrete, unordered) labels as the outcome, whereas a
prediction model predicts a continuous-valued function, or ordered value, rather than a categorical
label.

• To categorize bank loan applications as either safe or risky is an example of classification;
to predict the expenditure in rupees of potential customers on computer equipment, given
their income and occupation, is an example of prediction.

• Regression analysis is a statistical methodology that is most often used for numeric
prediction. Decision tree classifiers, Bayesian classifiers, Bayesian belief networks,
rule-based classifiers, backpropagation neural networks, support vector machines and
random forests are examples of classifiers.

10.1.2 Steps in classification


• Learning step (or training phase Fig. 10.1 ), where a classification algorithm builds the
classifier by analyzing or “learning from” a training set made up of database tuples and their
associated class labels. This step is also known as supervised learning (i.e., the learning of
the classifier is “supervised” in that it is told to which class each training tuple belongs). It
contrasts with unsupervised learning (or clustering), in which the class label of each training
tuple is not known, and the number or set of classes to be learned may not be known in
advance.

• In the second step (Fig. 10.2), the model is used for classification. First, the performance
of the classifier is estimated using a test set, which comprises tuples and their class labels
that are independent of the training tuples. If the performance is not satisfactory, the model
is built afresh starting from the first step. Upon satisfactory performance, the model is used for
classification.

10.2 Linear regression


Linear regression is a formal method of finding a line which best fits a set of data.
Consider the data as shown in the following scatter diagram. We can see there is a moderate
positive linear correlation between the variables, so it is reasonable to use a line of best fit to
model the data.


Figure 10.1: Learning: Training data are analyzed by a classification algorithm. Here, the class
label attribute is loan decision, and the learned model or classifier is represented in the form of
classification rules.

Figure 10.2: Classification: Test data are used to estimate the performance of the classifier
(classification rules). If the accuracy is considered acceptable, the classifier can be applied to
the classification of new data tuples




• One way to do this is to draw a straight line through the data points which includes the
mean point (x̄, ȳ)

• After plotting the mean on the scatter diagram, we draw in the line of best fit, by eye, that
has about as many points above the line as are below it.

• The problem with drawing a line of best fit by eye is that the answer will vary from one
person to another and the equation of the line may not be very accurate.

• Having found our line of best fit, we can then use this linear model to estimate a value of y
for any given value of x.

Least Squares regression line

1. The method of ’least squares’ is the most common way to determine the gradient (slope)
and the y-intercept of the best fit line.

2. We find the vertical distances d1 , d2 , d3 , . . . dn to the line of best fit. Where n is the number
of data points.

3. We then add the squares of these distances, giving SSE = d21 + d22 + d23 + . . . + d2n

4. The least squares regression line is the one which makes this sum of the squares of the error
SSE as small as possible.

5. The equation of the best fit line y = m ∗ x + c is given by the following formulas

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \qquad (10.1)$$

$$c = \frac{\sum y - m\sum x}{n} \qquad (10.2)$$
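
In R, the least squares line is fitted with the built-in lm() function; a minimal sketch on the built-in mtcars data, with a hand computation of formulas (10.1) and (10.2) for comparison (the data set and variables are illustrative):

# Fit the least squares line mpg = m * wt + c on the built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)

# Intercept (c) and slope (m) that minimise the sum of squared errors SSE
coef(fit)

# Use the fitted line to estimate y for a given x (a car with wt = 3.0, i.e. 3000 lb)
predict(fit, newdata = data.frame(wt = 3.0))

# Verify against the hand formulas (10.1) and (10.2)
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)
m_hat <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
c_hat <- (sum(y) - m_hat * sum(x)) / n
c(slope = m_hat, intercept = c_hat)   # matches coef(fit)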




10.3 Interpolation and Extrapolation


Interpolation is where we find a value inside our set of data points.

Extrapolation is where we find a value outside our set of data points.
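
A small sketch of the difference, using an lm() fit on illustrative, made-up data:

# Fit a line to x values observed between 1 and 10
x <- 1:10
y <- 2 * x + rnorm(10, sd = 0.5)   # illustrative data with a little noise
fit <- lm(y ~ x)

# Interpolation: estimate y at x = 5.5, which lies inside the observed range
predict(fit, newdata = data.frame(x = 5.5))

# Extrapolation: estimate y at x = 20, outside the observed range (use with caution)
predict(fit, newdata = data.frame(x = 20))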

10.4 Logistic regression


In linear regression the response variable (Y) is always a continuous variable. If the Y
variable is categorical, you cannot use a linear regression model.
So what would you do when Y is a categorical variable with 2 classes?
Logistic regression can be used to model and solve such problems, also called binary
classification problems.
A key point to note here is that Y can have 2 classes only and not more than that. If Y has
more than 2 classes, it becomes a multi-class classification problem and you can no longer use
logistic regression for it.

10.4.1 Examples of binary classification problems


Spam Detection: Predicting if an email is Spam or not

Credit Card Fraud: Predicting if a given credit card transaction is fraud or not

Health: Predicting if a given mass of tissue is benign or malignant

Marketing: Predicting if a given user will buy an insurance product or not

Banking: Predicting if a customer will default on a loan.

10.4.2 Why not linear regression?


When the response variable has only 2 possible values, it is desirable to have a model that predicts
the value either as 0 or 1 or as a probability score that ranges between 0 and 1.
Linear regression does not have this capability, because if you use linear regression to model a
binary response variable, the resulting model may not restrict the predicted Y values within 0 and
1. This is where logistic regression comes into play. In logistic regression, you get a probability
score that reflects the probability of the occurrence of the event.
An event in this case is each row of the training dataset. It could be something like classifying
whether a given email is spam, whether a mass of cells is malignant, or whether a user will buy a product, and so on.




10.4.3 The Logistic Equation


Logistic regression achieves this by modelling the log odds of the event, $\ln\!\left(\frac{P}{1-P}\right)$, where P is the probability of the event, so P always lies between 0 and 1.

$$z_i = \ln\!\left(\frac{P_i}{1 - P_i}\right) = \alpha + \beta_1 x_1 + \dots + \beta_n x_n \qquad (10.3)$$

Taking the exponent on both sides of the equation gives:

$$P_i = E(y = 1 \mid x_i) = \frac{e^{z_i}}{1 + e^{z_i}} = \frac{e^{\alpha + \beta_i x_i}}{1 + e^{\alpha + \beta_i x_i}} \qquad (10.4)$$
You can implement this equation using the glm() function by setting the family argument to
binomial.

# Template code

# Step 1: Build Logit Model on Training Dataset


# Y is response variable and X1, X2 are explanatory variables
logitMod <- glm(Y ~ X1 + X2, family = "binomial", data = trainingData)

# Step 2: Predict Y on Test Dataset


predictedY <- predict(logitMod, testData, type = "response")

10.5 Classification by Decision Tree Induction


Decision tree induction is the learning of decision trees from class-labeled training tuples. A
decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a
test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal
node) holds a class label. The topmost node in a tree is the root node.

A typical decision tree classifies a customer, that is, it predicts whether a customer is likely to purchase a computer. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. Some decision tree algorithms produce only binary trees (where each internal node branches to exactly two other nodes), whereas others can produce non-binary trees.

How are decision trees used for classification? Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple. The decision tree induction algorithm is illustrated in Algorithm 10.1.

10.5.1 R code for Decision tree induction based classifier

if (!require(rpart)){
install.packages('rpart')
library(rpart)
}




if (!require(rpart.plot)){
install.packages('rpart.plot')
library(rpart.plot)
}

df <- iris
names(df)

# Draw a sample with replacement from vector c(1,2) to split the data roughly 80/20 into train/test
train_test_mask <- sample(x = 1:2, size = nrow(df), replace = TRUE, prob = c(0.80, 0.20))

trainset <- df[train_test_mask == 1, ]

testset <- df[train_test_mask == 2, ]

# Build decision tree model for predicting the species of iris flower
dtree <- rpart(Species ~ ., data = trainset, method = "class")

rpart.plot(dtree)
pr <- predict(dtree, testset, type = 'class')
cm <- table(predictions = pr, actual = testset$Species)
cm
accuracy <- sum(diag(cm)) / sum(cm)
print(paste('Accuracy of classifier', accuracy))

10.6 Random forests


Random forests or random decision forests are an ensemble learning method for classification,
regression and other tasks that operate by constructing a multitude of decision trees at training
time and outputting the class that is the mode of the classes (classification) or mean prediction
(regression) of the individual trees.
The reason that the random forest model works so well is: A large number of relatively
uncorrelated models (trees) operating as a committee will outperform any of the individual
constituent models.
In bootstrap sampling, roughly 63% of the observations from the training data are selected (with replacement) for building each tree. Thus n bootstrap samples are used for building n decision trees in a random forest. When a new tuple is given for classification, all the trees in the random forest generate their outputs independently. The final outcome of the random forest is the majority outcome of the n decision trees. Refer to Fig. 10.3 for details.

10.6.1 R code for Random forests based classifier

if (!require(randomForest)){
install.packages('randomForest')
library(randomForest)
}

df <- iris
names(df)

# Draw a sample with replacement from vector c(1,2) to split the data roughly 80/20 into train/test
train_test_mask <- sample(x = 1:2, size = nrow(df), replace = TRUE, prob = c(0.80, 0.20))

trainset <- df[train_test_mask == 1, ]




Algorithm 10.1 Generate_decision_tree. Generate a decision tree from the training tuples of
data partition D.
Input:

• Data partition, D, which is a set of training tuples and their associated class labels;

• attribute_list, the set of candidate attributes;

• Attribute_selection_method, a procedure to determine the splitting criterion that “best” partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split point or splitting subset.

Output: A decision tree.

Method:
create a node N ;
if (tuples in D are all of the same class, C) then
return N as a leaf node labeled with the class C;
end if
if (attribute list is empty) then
return N as a leaf node labeled with the majority class in D; ▷ majority voting
end if
apply Attribute_selection_method(D, attribute_list) to find the “best” splitting criterion;
label node N with splitting criterion;
if (splitting_attribute is discrete-valued) and (multiway splits allowed) then
attribute_list ← attribute_list − splitting_attribute;
end if
for (each outcome j of splitting criterion) do
let Dj be the set of data tuples in D satisfying outcome j;
if (Dj is empty) then
attach a leaf labeled with the majority class in D to node N ;
else
attach the node returned by Generate_decision_tree(Dj , attribute_list) to node N ;
end if
end for
return N ;

Figure 10.3: Random forest algorithm




Figure 10.4: The 2-D training data are linearly separable. There are an infinite number of (possible)
separating hyperplanes or “decision boundaries”. Which one is best?

testset <- df[train_test_mask == 2, ]

# Build random forest model for predicting the species of iris flower
rf.model <- randomForest(formula = Species ~ ., data = trainset)
rf.model

# importance of each predictor variable


importance(rf.model)

# Use random forest model to predict class label (Species) of the iris flowers in testset
pr <- predict(rf.model, testset, type = 'class')

cm <- table(predictions = pr, actual = testset$Species)


cm
accuracy <- sum(diag(cm)) / sum(cm)
print(paste('Accuracy of classifier', accuracy))

10.7 Support Vector Machines


A support vector machine (or SVM) is an algorithm that works as follows. It uses a nonlinear
mapping to transform the original training data into a higher dimension. Within this new
dimension, it searches for the linear optimal separating hyperplane (that is, a “decision
boundary” separating the tuples of one class from another). With an appropriate nonlinear
mapping to a sufficiently high dimension, data from two classes can always be separated by a
hyperplane. The SVM finds this hyperplane using support vectors (“essential” training tuples)
and margins (defined by the support vectors).

Margin: the shortest distance from a hyperplane to the plane parallel to it that passes
through the closest training tuple of either class.

The hyperplane with the larger margin is more accurate at classifying future data tuples than
the hyperplane with the smaller margin. This is why (during the learning or training phase),
the SVM searches for the hyperplane with the largest margin, that is, the maximum marginal
hyperplane (MMH). Refer Figures 10.4 and 10.5 for details.




Figure 10.5: Two possible separating hyperplanes and their associated margins. Which one is
better? The one with the larger margin (b) is called MMH.

10.7.1 Advantages and Disadvantages


• They are highly accurate, owing to their ability to model complex nonlinear decision
boundaries.

• They are much less prone to overfitting than other methods.

• The support vectors found also provide a compact description of the learned model.

• SVMs can be used for prediction as well as classification.

• The training time of even the fastest SVMs can be extremely slow.
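
10.7.2 R code for Support Vector Machine based classifier

The notes do not include SVM code; the following is a minimal sketch using the svm() function from the e1071 package (the package choice is an assumption), built on the same kind of iris train/test split as the decision tree and random forest examples:

if (!require(e1071)){
install.packages('e1071')
library(e1071)
}

df <- iris

# Draw a sample with replacement from vector c(1,2) to split the data roughly 80/20 into train/test
train_test_mask <- sample(x = 1:2, size = nrow(df), replace = TRUE, prob = c(0.80, 0.20))

trainset <- df[train_test_mask == 1, ]
testset <- df[train_test_mask == 2, ]

# Build SVM model (radial kernel) for predicting the species of iris flower
svm.model <- svm(Species ~ ., data = trainset, kernel = "radial")

# Predict class labels for the held-out test set
pr <- predict(svm.model, testset)

cm <- table(predictions = pr, actual = testset$Species)
cm
accuracy <- sum(diag(cm)) / sum(cm)
print(paste('Accuracy of classifier', accuracy))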




Chapter 11

Clustering

11.1 Introduction
Clustering is the process of grouping the data into classes or clusters, so that objects within a
cluster have high similarity in comparison to one another but are very dissimilar to objects in
other clusters. In machine learning, clustering is an example of unsupervised learning. Unlike
classification, clustering and unsupervised learning do not rely on predefined classes and class-
labeled training examples. For this reason, clustering is a form of learning by observation, rather
than learning by examples.

11.1.1 Data structures used by clustering algorithms


Clustering algorithms typically operate on either of the following two data structures.

• Data matrix (or object-by-variable structure): This represents m objects, such as persons,
with n variables (also called measurements or attributes), such as age, height, weight, gender,
and so on. The structure is in the form of a relational table, or m-by-n matrix (m objects ×
n variables):
 
$$A_{m,n} = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{pmatrix}$$

• Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities


that are available for all pairs of m objects. It is often represented by an m-by-m table:
 
$$D_{m,m} = \begin{pmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,m} \\ d_{2,1} & d_{2,2} & \cdots & d_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ d_{m,1} & d_{m,2} & \cdots & d_{m,m} \end{pmatrix}$$
where d(i,j) is the measured difference or dissimilarity between objects i and j. In general,
d(i,j) is a nonnegative number that is close to 0 when objects i and j are highly similar or
“near” each other, and becomes larger the more they differ. Since d(i,j) = d(j,i) , and d(i,i) = 0
the dis-similarity matrix is symmetric.


11.2 Dissimilarity / Distance between objects

The common attribute types and their distance measures are:

• Interval-scaled: continuous measurements on a roughly linear scale (e.g. weight and height).

$$d(i,j) = \sqrt{(a_{i,1} - a_{j,1})^2 + \dots + (a_{i,n} - a_{j,n})^2}$$

• Binary: only two states, 0 or 1, where 0 means that the variable is absent and 1 means that it is present (e.g. smoker).

$$d(i,j) = \frac{r + s}{q + r + s + t}$$

where q is the number of variables that equal 1 for both objects i and j, r is the number of variables that equal 1 for object i but 0 for object j, s is the number of variables that equal 0 for object i but 1 for object j, and t is the number of variables that equal 0 for both objects i and j.

• Categorical: a generalization of the binary variable in that it can take on more than two states (e.g. blood group).

$$d(i,j) = \frac{p - m}{p}$$

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables.

• Ordinal: resembles a categorical variable, except that the M states of the ordinal value are ordered in a meaningful sequence (e.g. designation of an employee). Replace the rank $r_{if}$ of the ith object in the f th variable by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

where $M_f$ is the number of states of the f th variable, and then use the distance measure for interval-scaled variables.

• Ratio-scaled: a positive measurement on a nonlinear scale (e.g. growth of a bacteria population or the decay of a radioactive element). Apply a logarithmic (or other appropriate) transformation $y_{if} = \log(x_{if})$ to the value $x_{if}$ of object i on variable f; the $y_{if}$ values can then be treated as interval-scaled values.
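
For interval-scaled variables these distances are available through base R's dist() function; for data frames that mix numeric, ordinal and categorical columns, daisy() from the cluster package computes a dissimilarity matrix. A minimal sketch (the choice of example data is illustrative):

# Euclidean distances between the first five observations of the built-in USArrests data
x <- scale(USArrests[1:5, ])      # standardize so that no variable dominates
dist(x, method = "euclidean")     # interval-scaled dissimilarity matrix

# Manhattan (city-block) distance is also supported
dist(x, method = "manhattan")

# Mixed attribute types: daisy() from the cluster package uses the Gower dissimilarity
library(cluster)
daisy(iris[1:5, ], metric = "gower")   # iris mixes numeric columns and a categorical Species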

11.2.1 Clustering techniques


Clustering techniques, are organized into the following categories:
• partitioning methods,
• hierarchical methods,
• density-based methods,
• grid-based methods,
• model-based methods, and
• constraint-based clustering.
Note: Our focus is on Partitioning and Hierarchical methods in this course

11.3 Partitioning methods


The most well-known and commonly used partitioning method is k-means method. The clusters
are formed to optimize an objective partitioning criterion. Typically, the square-error criterion is
used, which is defined as


$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2 \qquad (11.1)$$

where k is the number of clusters, $C_i$ is the i-th cluster and $m_i$ is the mean of the i-th cluster.




Algorithm 11.1 The k-means algorithm for partitioning, where each cluster’s center is represented
by the mean value of the objects in the cluster.
Input:

• k : the number of clusters,

• D : a data set containing n objects.

Output: A set of k clusters.

Method:
arbitrarily choose k objects from D as the initial cluster centers;
repeat
(re)assign each object to the cluster to which the object is the most similar, based on the
mean value of the objects in the cluster;
update the cluster means, i.e., calculate the mean value of the objects for each cluster;
until no change;

11.3.1 Advantages and Dis-advantages


• The method is relatively scalable and efficient in processing large data sets because the
computational complexity of the algorithm is O(nkt), where n is the total number of objects,
k is the number of clusters, and t is the number of iterations. Normally, k ≪ n and t ≪ n.
The method often terminates at a local optimum.

• The k-means method, however, can be applied only when the mean of a cluster is defined.
This may not be the case in some applications, such as when data with categorical attributes
are involved.

• The necessity for users to specify k, the number of clusters, in advance can be seen as a
disadvantage.

• The k-means method is not suitable for discovering clusters with nonconvex shapes or clusters
of very different size.

• k-means is sensitive to noise and outlier data points because a small number of such data
can substantially influence the mean value.

11.4 Hierarchical Methods


11.4.1 Agglomerative Hierarchical Clustering
This bottom-up strategy starts by placing each object in its own cluster and then merges these
atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or
until certain termination conditions are satisfied. Most hierarchical clustering methods belong to
this category. They differ only in their definition of intercluster similarity. Four widely used
measures for distance between clusters are as follows, where |p − p′ | is the distance between two
objects or points, p and p′ ; mi is the mean for cluster, Ci ; and ni is the number of objects in Ci .

Minimum distance (single linkage): $d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$

Maximum distance (complete linkage): $d_{max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$

Mean distance (centroid linkage): $d_{mean}(C_i, C_j) = |m_i - m_j|$

Average distance (average linkage): $d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$




Figure 11.1: Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}

Figure 11.2: Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}




11.4.2 Divisive Hierarchical Clustering


This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with
all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each
object forms a cluster on its own or until it satisfies certain termination conditions, such as a
desired number of clusters is obtained or the diameter of each cluster is within a certain threshold.

11.5 Clustering in R

install.packages("cluster") # package required for clustering algorithms


library("cluster")
data("USArrests") # Load built-in dataset
df <- scale(USArrests) # Standardize the variables

# Call the k-means algorithm
set.seed(123) # seed the random number generator so cluster assignments are reproducible
km.res <- kmeans(x = df, centers = 4, iter.max = 10, nstart = 25)

# x: input data as numeric matrix, numeric data frame or a numeric vector
# centers: the number of clusters (k) or a set of initial (distinct) cluster centers
# iter.max: the maximum number of iterations allowed; the default value is 10
# nstart: the number of random starting partitions when centers is a number. If nstart
#   = 25, the k-means algorithm is run 25 times and the best result is reported. The
#   default value of nstart in R is one.

# Print the results


print(km.res)

# Cluster label for each of the observations


km.res$cluster

# Cluster sizes
km.res$size

# Cluster means
km.res$centers

########################### Hierarchical Clustering


library("cluster")
data("USArrests") # Load built-in dataset

# Agglomerative Nesting (Hierarchical Clustering)


res.agnes <- agnes(x = USArrests, # data matrix
stand = TRUE, # Standardize the data
metric = "euclidean", # metric for distance matrix
method = "complete" # Linkage method
)

# DIvisive ANAlysis Clustering
res.diana <- diana(x = USArrests,       # data matrix
                   stand = TRUE,        # standardize the data
                   metric = "euclidean" # metric for distance matrix
                   )

# Cut dendrogram into 4 clusters
clusters <- cutree(as.hclust(res.diana), k = 4) # you can also analyse the result res.agnes
head(clusters)
# Cluster sizes
table(clusters)
# Get the names (states) for the members of cluster 1
rownames(USArrests)[clusters == 1]




Figure 11.3: Elbow method of finding optimal number of clusters

11.6 Avoiding non-existent clusters


A big issue, in cluster analysis, is that clustering methods will return clusters even if the data
does not contain any clusters. In other words, if you blindly apply a clustering method on a data
set, it will divide the data into clusters because that is what it supposed to do. Before applying
any clustering method on your data, it’s important to evaluate whether the data sets contains
meaningful clusters (i.e.: non-random structures) or not. If yes, then how many clusters are there.
This process is defined as the assessing of clustering tendency or the feasibility of the clustering
analysis.
This is especially a problem in the k-means clustering because it requires the users to specify
the number of clusters to be generated. One simple solution is to compute k-means clustering
using different values of clusters k. Next, the E (within sum of square error) calculated using
Equation 11.1 is drawn according to the number of clusters. The location of a bend (knee) in
the plot is generally considered as an indicator of the appropriate number of clusters. The R
function fviz_nbclust() [in the factoextra package] provides a convenient solution to estimate the
optimal number of clusters.

install.packages("factoextra")
library(factoextra)
fviz_nbclust(df, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)

Figure 11.3 shows that within sum of square error decreases as k increases, but a bend (or
“elbow”) can be seen at k = 4. This bend indicates that additional clusters beyond the 4 have
little value.

