1 Introduction 4
1.1 Why R? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Installation of Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Installation of R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Installation of RStudio Desktop . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 R packages and libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Getting help in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Things to keep in mind . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 The “working directory” and listing files . . . . . . . . . . . . . . . . . . . . . . . . 9
1.7 Saving and loading R workspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Data Structures 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Atomic Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.3 Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Matrices & Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Data Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.1 Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.2 Testing & Coercion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.3 Combining data frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 Data Management 16
4.1 Arithmetic operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2 Mathematical functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.3 Statistical functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3.1 Standardizing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3.2 Probability functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3.3 Character functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.4 Other useful functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4 Applying functions to data objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5 Subsetting 22
5.1 Atomic vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.2 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.3 Matrices and Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.4 Data Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.5 Subsetting Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.5.1 Simplifying vs. preserving subsetting . . . . . . . . . . . . . . . . . . . . . . 24
5.5.2 $ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.5.3 Missing/out of bounds indices . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.6 Subsetting and assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6 Plots in R 26
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.1.1 Terminology related to ggplot2 . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.3 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.4 Bar plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.5 Scatter plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.6 Line plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.7 Box plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.7.1 Interpretation of Box plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.8 Manipulating Strings in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.8.1 paste() function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.8.2 sprintf() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.9 Regular expressions in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.9.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7 Probability Distributions 37
7.1 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.2 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
8 Hypothesis Tests 40
8.1 T Test (Student’s T-Test) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8.1.1 What is a T Test? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8.1.2 The T Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8.1.3 T Values and P Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8.1.4 Calculating the T Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8.1.5 What is a Paired T Test (Paired Samples T Test)? . . . . . . . . . . . . . . 41
8.2 Analysis of variance (ANOVA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.2.1 ANOVA test hypotheses: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.2.2 Assumptions of ANOVA test . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.2.3 Steps in ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.2.4 one-way ANOVA test in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
9 Correlation 44
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
9.2 Pearson’s correlation coefficient ( r ) . . . . . . . . . . . . . . . . . . . . . . . . . . 45
9.3 Spearman’s rank correlation coefficient . . . . . . . . . . . . . . . . . . . . . . . . . 46
10 Classification 48
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
10.1.1 Classification vs Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
10.1.2 Steps in classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
10.2 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
10.3 Interpolation and Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
10.4 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
10.4.1 Examples of binary classification problems . . . . . . . . . . . . . . . . . . . 51
10.4.2 Why not linear regression? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
10.4.3 The Logistic Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
10.5 Classification by Decision Tree Induction . . . . . . . . . . . . . . . . . . . . . . . . 52
10.5.1 R code for Decision tree induction based classifier . . . . . . . . . . . . . . . 52
10.6 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
10.6.1 R code for Random forests based classifier . . . . . . . . . . . . . . . . . . . 53
10.7 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
10.7.1 Advantages and Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . 56
11 Clustering 57
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
11.1.1 Data structures used by Clustering algorithms . . . . . . . . . . . . . . . . 57
11.2 Dissimilarity / Distance between objects . . . . . . . . . . . . . . . . . . . . . . . . 58
11.2.1 Clustering techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
11.3 Partitioning methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Introduction
1.1 Why R?
• Vast capabilities, wide range of statistical and graphical techniques
• Click on the link in the download column and the row corresponding to your OS
• To load a package:
library("packageName")
• Search documentation from R (not always the best way... Google often works better)
help.search("topicName")
• Comments can be put almost anywhere, starting with a hash mark (’#’); everything to the
end of the line is a comment
• If a command is not complete at the end of a line, R will give a different prompt, ‘+’ by
default
• Parentheses must always match (first thing to check if you get an error)
• Names should start with a letter and should not contain spaces
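A few of these points in action (a small sketch; it assumes the package being loaded is already installed):
# Everything from the hash mark to the end of the line is a comment
total_score <- sum(c(10, 20, 30))   # names start with a letter and contain no spaces
help.search("standard deviation")   # search the documentation for a topic
library("ggplot2")                  # load an installed package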
Data Structures
2.1 Introduction
R’s base data structures can be organised by their dimensionality (1d, 2d, or nd) and whether
they’re homogeneous (all contents must be of the same type) or heterogeneous (the contents can
be of different types). This gives rise to the five data structures most often used in data analysis.
str() is short for “structure” and gives a compact, human-readable description of any R data
structure.
     Homogeneous      Heterogeneous
1d   Atomic vector    List
2d   Matrix           Data frame
nd   Array
2.2 Vectors
The basic data structure in R is the vector. Vectors come in two flavours: atomic vectors and lists.
They have three common properties: their type (typeof()), their length (length()), and their
attributes (attributes()).
Missing values are specified with NA, which is a logical vector of length 1.
All elements of an atomic vector must be the same type, so when you attempt to combine
different types they will be coerced to the most flexible type. Types from least to most
flexible are: logical, integer, double, and character.
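For example, the coercion rules can be seen directly (a small illustrative sketch):
c(TRUE, 1L)                 # coerced to integer: 1 1
c(1L, 2.5)                  # coerced to double: 1.0 2.5
c(1, "a")                   # coerced to character: "1" "a"
typeof(c(FALSE, 1L, 2.5))   # "double"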
2.2.2 Lists
Lists are different from atomic vectors because their elements can be of any type, including lists.
You construct lists by using list() instead of c():
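For instance (an illustrative sketch):
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), list(2, 4))
str(x)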
Lists are sometimes called recursive vectors, because a list can contain other lists. This makes
them fundamentally different from atomic vectors. The typeof() a list is list. You can test for a
list with is.list() and coerce to a list with as.list(). You can turn a list into an atomic vector with
unlist().
2.2.3 Names
You can name a vector in three ways:
• When creating it: x <- c(a = 1, b = 2, c = 3)
• By modifying an existing vector in place: x <- 1:3; names(x) <- c("a", "b", "c")
• By creating a modified copy of a vector: x <- setNames(1:3, c("a", "b", "c"))
2.3 Factors
A factor is a vector that can contain only predefined values, and is used to store categorical data.
Factors are built on top of integer vectors using two attributes: the class(), “factor”, which makes
them behave differently from regular integer vectors, and the levels(), which defines the set of
allowed values.
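A small illustrative sketch:
x <- factor(c("a", "b", "b", "a"))
class(x)    # "factor"
levels(x)   # "a" "b"
typeof(x)   # "integer", because factors are built on integer vectors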
2.4 Matrices & Arrays
• length() generalises to nrow() and ncol() for matrices, and dim() for arrays.
• names() generalises to rownames() and colnames() for matrices, and dimnames(), a list of
character vectors, for arrays.
• c() generalises to cbind() and rbind() for matrices, and to abind() (provided by the abind
package) for arrays.
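A small sketch of these generalisations:
a <- matrix(1:6, nrow = 2, ncol = 3)   # 2 x 3 matrix
dim(a)                                 # 2 3
rownames(a) <- c("r1", "r2")
colnames(a) <- c("c1", "c2", "c3")
cbind(a, c(7, 8))                      # c() generalises to cbind()/rbind()
b <- array(1:12, dim = c(2, 3, 2))     # 3-dimensional array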
2.5.1 Creation
You create a data frame using data.frame(), which takes named vectors as input:
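The original snippet is not shown; a minimal sketch with illustrative column names:
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)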
Beware data.frame()’s default behaviour, which turns strings into factors. Use stringsAsFactors
= FALSE to suppress this behaviour.
typeof(df)
class(df)
is.data.frame(df)
# You can coerce an object to a data frame with as.data.frame():
• A list will create one column for each element; it’s an error if they’re not all the same length.
• A matrix will create a data frame with the same number of columns and rows.
When combining column-wise, the number of rows must match, but row names are ignored.
When combining row-wise, both the number and names of columns must match. cbind() will
create a matrix unless one of the arguments is already a data frame. Instead use data.frame()
directly:
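A small sketch of combining data frames:
df1 <- data.frame(x = 1:2, y = c("a", "b"))
df2 <- data.frame(z = c(TRUE, FALSE))
data.frame(df1, df2)    # column-wise: the number of rows must match
cbind(df1, df2)         # also a data frame here, because df1 already is one
rbind(df1, df1)         # row-wise: the number and names of columns must match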
3.1 Introduction
Reference: https://www.r-bloggers.com/importing-data-to-r/
# To convert data from long to wide format use the cross tabulation function xtabs()
Year
Runner 2004 2005 2007 2008 2009 2010 2011 2012
Asafa_Powell 10.02 9.87 0.00 0.00 0.00 0.00 0.00 0.00
Usain_Bolt 0.00 0.00 10.03 9.72 9.58 9.82 9.76 9.63
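A sketch of the call that produces such a table, assuming a long-format data frame sprint with
(hypothetical) columns Runner, Year and Time, one row per runner per year:
wide <- xtabs(Time ~ Runner + Year, data = sprint)
wide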
If you just want to transpose your data, that is, switch columns and rows, you can use
the t() function:
transposed_data <- t(my_data)
Data Management
Operator    Description
+           Addition
-           Subtraction
*           Multiplication
/           Division
^ or **     Exponentiation
x %% y      Modulus (x mod y); 5 %% 2 is 1
x %/% y     Integer division; 5 %/% 2 is 2
Function               Description
paste(..., sep="")     Concatenates strings after using the sep string to separate them.
                       paste("x", 1:3, sep="") returns c("x1", "x2", "x3").
                       paste("x", 1:3, sep="M") returns c("xM1", "xM2", "xM3").
toupper(x)             Uppercase: toupper("abc") returns "ABC".
tolower(x)             Lowercase: tolower("ABC") returns "abc".
Subsetting
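The examples that follow assume a small numeric vector, for instance:
x <- c(2.1, 4.2, 3.3, 5.4)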
x[c(3, 1)]
x[order(x)]
# Duplicated indices yield duplicated values
x[c(1, 1)]
# Real numbers are silently truncated to integers
x[c(2.1, 2.9)]
x[-c(3, 1)]
• Logical vectors select elements where the corresponding logical value is TRUE.
# Nothing returns the original vector
x[]
# Zero returns a zero-length vector
x[0]
5.2 Lists
Using [ will always return a list; [[ and $ let you pull out the components of the list.
5.3 Matrices and Arrays
You can subset higher-dimensional structures with multiple vectors, with a single vector, or with
a matrix of indices. The most common way of subsetting matrices (2d) and arrays (>2d) is a
simple generalisation of 1d subsetting: you supply a 1d index for each dimension, separated by a
comma.
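A small sketch:
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
a[1:2, ]                                # rows 1-2, all columns
a[c(TRUE, FALSE, TRUE), c("B", "A")]    # logical row index, named columns
a[0, -2]                                # zero rows, all columns except the second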
a <- list(a = 1, b = 2)
a[[1]]
             Simplifying            Preserving
Vector       x[[1]]                 x[1]
List         x[[1]]                 x[1]
Factor       x[1:4, drop = T]       x[1:4]
Array        x[1, ] or x[, 1]       x[1, , drop = F] or x[, 1, drop = F]
Data frame   x[, 1] or x[[1]]       x[, 1, drop = F] or x[1]
5.5.2 $
$ is a shorthand operator, where x$y is equivalent to x[["y", exact = FALSE]].
x <- list(abc = 1)
x$a        # partial matching: returns 1
x[["a"]]   # no partial matching: returns NULL
x <- 1:4
str(x[5])
str(x[NA_real_])
str(x[NULL])
The following table summarises the results of subsetting atomic vectors and lists with [ and [[
and different types of OOB value.
Operator   Index      Atomic   List
[          OOB        NA       list(NULL)
[          NA_real_   NA       list(NULL)
[          NULL       x[0]     list(NULL)
[[         OOB        Error    Error
[[         NA_real_   Error    NULL
[[         NULL       Error    Error
x <- 1:5
x[c(1, 2)] <- 2:3
x
# The length of the LHS needs to match the RHS
x[-1] <- 4:1
x
# Note that there's no checking for duplicate indices
x[c(1, 1)] <- 2:3
x
# You can't combine integer indices with NA
x[c(1, NA)] <- c(1, 2)
# Error: NAs are not allowed in subscripted assignments
# But you can combine logical indices with NA
# (where they're treated as false).
x[c(T, F, NA)] <- 1
x
# This is mostly useful when conditionally modifying vectors
df <- data.frame(a = c(1, 10, NA))
df$a[df$a < 5] <- 0
df$a
Plots in R
6.1 Introduction
Functions in the ggplot2 package are used to produce plots in R. As usual, the required package
can be installed and loaded using the following code:
if (!require(ggplot2)){
install.packages('ggplot2')
library(ggplot2)
}
• Geoms are the geometric objects that are drawn to represent the data, such as bars, lines,
and points.
• Scales: They control the mapping from the values in the data space to values in the aesthetic
space. A continuous y scale maps larger numerical values to vertically higher positions in
space.
• Guides: To interpret the graph, viewers refer to the guides. They show the viewer how to
map the visual properties back to the data space. The most commonly used guides are the
tick marks and labels on an axis. A legend is another type of guide. A legend might show
people what it means for a point to be a circle or a triangle, or what it means for a line to
be blue or red.
• Themes: Some aspects of a graph’s appearance fall outside the scope of the grammar of
graphics. These include the color of the background and grid lines in the graphing area, the
fonts used in the axis labels, and the text in the graph title. These are controlled with the
theme() function.
6.2 Dataset
set.seed(1234)
# Randomly generate weights of 200 females and 200 males
wdata = data.frame(
gender = factor(rep(c("F", "M"), each = 200)),
weight = c(rnorm(n = 200, mean = 55, sd = 5), rnorm(n = 200, mean = 58, sd = 5)))
head(wdata, 3)
gender weight
1 F 48.96467
2 F 56.38715
3 F 60.42221
tail(wdata, 3)
gender weight
398 M 60.69417
399 M 58.07322
400 M 53.41755
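The group means below can be obtained, for example, as follows (the original code is not shown in these notes):
mu <- aggregate(weight ~ gender, data = wdata, FUN = mean)
names(mu)[2] <- "grp.mean"
mu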
gender grp.mean
1 F 54.71120
2 M 58.36625
6.3 Histogram
A histogram represents the distribution of a continuous variable by dividing it into bins and
counting the number of observations in each bin. The function geom_histogram() is used to create
a histogram plot. You can also add a vertical line for the mean using the function geom_vline().
Key arguments to customize the plot: alpha, color, fill, linetype, size.
By default the y axis corresponds to the count of weight values. If you want the density on the
y axis instead, use aes(y = ..density..).
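A minimal sketch based on the functions described above (bin width and colours are arbitrary choices):
ggplot(wdata, aes(x = weight)) +
  geom_histogram(binwidth = 1, fill = "lightgray", color = "black") +
  geom_vline(xintercept = mean(wdata$weight), linetype = "dashed", color = "red")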
data(mtcars)
head(mtcars, 3)
ggplot(mtcars, aes(gear)) +
geom_bar(fill = "steelblue")
# create data
xValue <- 1:100
yValue <- cumsum(rnorm(100))
data <- data.frame(xValue,yValue)
head(data, 3)
xValue yValue
1 1 -1.226815
2 2 -1.190662
3 3 -1.612055
# Plot
ggplot(data, aes(x=xValue, y=yValue)) +
geom_line()
• outlier.colour, outlier.shape, outlier.size: the colour, shape and size of outlying points
• notch: logical value. If TRUE, makes a notched box plot. The notch displays a confidence
interval around the median, normally based on median ± 1.58 × IQR/√n. Notches are used
to compare groups; if the notches of two boxes do not overlap, this is strong evidence that
the medians differ.
data("ToothGrowth")
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
head(ToothGrowth, 3)
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
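A minimal sketch of a notched box plot of tooth length by dose, using the arguments described above:
ggplot(ToothGrowth, aes(x = dose, y = len)) +
  geom_boxplot(notch = TRUE, outlier.colour = "red", outlier.shape = 16, outlier.size = 2)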
The argument ... means that paste() takes any number of objects. The argument sep is a character
string that is used as a separator. The argument collapse is an optional string indicating whether we
want all the terms to be collapsed into a single string. Here is a simple example with paste():
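# a simple call with the default separator (a space)
paste("The life of", pi)
## [1] "The life of 3.14159265358979"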
If we give paste() objects of different length, it will apply a recycling rule. For example, if
we paste a single character "x" with the sequence 1:5, using separator sep = ".", this is what we get:
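# reconstructing the example described in the text above
paste("x", 1:5, sep = ".")
## [1] "x.1" "x.2" "x.3" "x.4" "x.5"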
To see the effect of the collapse argument, let’s compare the difference with collapsing and
without it:
# paste with collapse combines all substrings into one string
paste(c("x", "y", "z"), 1:3, sep = " ", collapse = "")
## [1] "x 1y 2z 3"
In addition to paste(), there is also the function paste0(..., collapse), which is equivalent to
paste(..., sep = "", collapse).
6.8.2 sprintf()
The function sprintf() returns a formatted string combining text and variable values. The nice
feature about sprintf() is that it provides a very flexible way of formatting vector elements as
character strings. Its usage has the following form: sprintf(fmt, ...).
The argument fmt is a character vector of format strings. The allowed conversion specifications
start with the symbol % followed by numbers and letters. For demonstration purposes, here are
several ways in which the number pi can be formatted:
sprintf("%1.0f", pi)
## [1] "3"
sprintf("%05.1f", pi)
## [1] "003.1"
# prefix a space
sprintf("% f", pi)
## [1] " 3.141593"
# left adjustment
sprintf("%-10f", pi) # left justified
## [1] "3.141593 "
1. Metacharacters: They comprise a set of special operators that regex does not treat as literal
characters. These characters include: “. | ( ) [ ] $ * + ?”.
2. Sequences: They contain special characters used to describe a pattern in a given string.
The following are the commonly used sequences in R:
Sequence   Description
\\d        matches a digit character
\\D        matches a non-digit character
\\s        matches a space character
\\S        matches a non-space character
\\w        matches a word character
\\W        matches a non-word character
\\b        matches a word boundary
3. Quantifiers
Quantifier   Description
{n}          The item to its left is matched exactly n times. The item must have consecutive
             repetition at that place.
{n,}         The item to its left is matched n or more times.
{n,m}        The item to its left is matched at least n times but not more than m times.
Quantifiers such as * and + are greedy: for a particular pattern to be matched, they will try
to match it as many times as its repetitions are available.
5. POSIX character classes: In R, these classes can be identified as enclosed within a double
square bracket ([[ ]]). They work like character classes, and a caret (^) ahead of an expression
inside the brackets negates it (for example, [^[:digit:]] matches any non-digit character).
Class          Description
[[:graph:]]    matches graphical characters, which comprise [[:alnum:]] and [[:punct:]]
6.9.1 Examples
1. Extract digits from a string of characters
#match values
vec <- c("A1","A2","A3","A4","A5","A6","A7")
grep(pattern = "A1|A4", x = vec, value = TRUE)
# Output : "A1" "A4"
Explanation: Double backslash is used to escape the metacharacter “(”. “[a-z]+” matches
letters one or more times. “\\.[0-9]+” matches the decimal value, where the metacharacter
“.” is escaped using a double backslash. The numbers are matched using “[0-9]+”.
Explanation: In the regex above, “.*:” matches everything it can (except newlines) until it
reaches a colon (:); the gsub() function then replaces the match with a blank. Hence, we get
the desired output.
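A sketch with a hypothetical input string (not from the notes):
x <- "Name: Usain Bolt"
gsub(pattern = ".*:", replacement = "", x = x)
## [1] " Usain Bolt"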
7. Remove digits from a string which contains alphanumeric characters. The desired output
from the string given below is “day of 2nd ID5 Conference”. We can’t do it with the simple
“[[:digit:]]+” regex, as it will match all the digits available in the given string. Instead, in
such a case, we’ll detect the digit boundary to get the desired result.
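A sketch with a hypothetical input string; the word boundary \\b leaves digits attached to letters (2nd, ID5) untouched and removes the stand-alone number:
mystring <- "25 day of 2nd ID5 Conference"   # hypothetical input
gsub(pattern = "\\b[[:digit:]]+\\b ?", replacement = "", x = mystring)
## [1] "day of 2nd ID5 Conference"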
mystring <- "What are we doing tomorrow ? (laugh) Play soccer (groans) (cries)"
gsub(pattern = "[\\(\\)]", replacement = "", x = regmatches(mystring,
gregexpr("\\(.*?\\)", mystring))[[1]])
# Output : "laugh" "groans" "cries"
Explanation: In this solution, we’ve used the lazy matching technique. First, using
regmatches, we’ve extracted the parentheses with words such as (cries) (groans) (laugh).
Then, we’ve simply removed the brackets using the gsub() function.
Probability Distributions
Toss a fair coin three times ... what is the chance of getting two Heads?
• Event: “Two Heads” out of three coin tosses (3 outcomes have this)
Example2:-
Toss a fair coin nine times ... what is the chance of getting five Heads?
P(\text{a desirable outcome}) = p^k (1 - p)^{n-k} \qquad (7.3)
• P(5 Heads in 9 tosses) = No. of desirable outcomes × P(a desirable outcome)
= 126 × (0.5^5)(0.5^{9−5}) = 0.246
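The same probability can be computed in R:
choose(9, 5) * 0.5^5 * (1 - 0.5)^(9 - 5)
dbinom(5, size = 9, prob = 0.5)
## [1] 0.2460938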
Example3:-
Suppose there are twelve multiple choice questions in an English class quiz. Each question has
five possible answers, and only one of them is correct. Find the probability of having four or less
correct answers if a student attempts to answer every question at random.
P(X \le 4) = \sum_{k=0}^{4} \binom{12}{k} p^k (1 - p)^{12-k}, \quad \text{with } p = 1/5 \qquad (7.6)
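In R, the cumulative probability of four or fewer correct answers (n = 12, p = 1/5) is:
pbinom(4, size = 12, prob = 0.2)
## [1] 0.9274445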
Hypothesis Tests
• A one-sample t-test is used to test the null hypothesis that the population mean is equal to
a specified value µ0; one uses the statistic
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \qquad (8.1)
where x̄ is the sample mean, s is the sample standard deviation and n is the sample size.
• Independent two-sample t-test (similar variances): the statistic is
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \qquad (8.2)
with pooled standard deviation
s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} \qquad (8.3)
where x̄i is the i-th sample mean, sp is the pooled standard deviation of the two samples, ni
is the i-th sample size, ni − 1 is the number of degrees of freedom for each group, and the total
sample size minus two (that is, n1 + n2 − 2) is the total number of degrees of freedom.
• Dependent t-test for paired samples. This test is used when the samples are dependent; that
is, when there is only one sample that has been tested twice (repeated measures) or when
there are two samples that have been matched or “paired”. This is an example of a paired
difference test. The t statistic is calculated as
t = \frac{\bar{x}_d - \mu_0}{s_d / \sqrt{n}} \qquad (8.4)
where x¯d and sd are the average and standard deviation of the differences between all pairs.
The pairs are e.g. either one person’s pre-test and post-test scores or between-pairs of persons
matched into meaningful groups (for instance drawn from the same family or age group: see
table). The constant µ0 is zero if we want to test whether the average of the difference is
significantly different. The degree of freedom used is n − 1, where n represents the number
of pairs.
A paired t test (also called a correlated pairs t-test, a paired samples t test or dependent
samples t test) is where you run a t test on dependent samples. Dependent samples are
essentially connected — they are tests on the same person or thing. For example:
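A sketch of the three t-tests in R, on made-up measurements (the values are hypothetical):
before <- c(200, 190, 210, 205, 198)
after  <- c(195, 186, 202, 199, 194)
t.test(before, mu = 200)                  # one-sample t-test against mu0 = 200
t.test(before, after, paired = TRUE)      # paired (dependent samples) t-test
t.test(before, after, var.equal = TRUE)   # independent two-sample t-test with pooled variance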
• Alternative hypothesis: At least one sample mean is not equal to the others.
• These normal populations have a common variance. (Levene’s test can be used to check
this.)
1. Compute the mean square within groups:
MSW = \frac{SSW}{N - K}, \qquad SSW = \sum_{i=1}^{K} \sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2, \qquad \bar{x}_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij}
2. Compute the mean square between groups:
MSB = \frac{SSB}{K - 1}, \qquad SSB = \sum_{i=1}^{K} n\,(\bar{x}_i - \mu)^2, \qquad \mu = \frac{1}{K} \sum_{i=1}^{K} \bar{x}_i
where SSW is the sum of squared deviations of the observations from their own group means,
SSB is the sum of squared deviations between the group means (x̄i) and the grand mean (µ),
K is the number of groups, n is the number of observations in a group, and N = nK is the
total number of observations.
3. Produce the F-statistic as the ratio F = MSB / MSW.
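The R code for the one-way ANOVA (Section 8.2.4) is not reproduced here; a minimal sketch using the built-in PlantGrowth data (our choice of example data) is:
data(PlantGrowth)
res.aov <- aov(weight ~ group, data = PlantGrowth)
summary(res.aov)   # reports the F value and Pr(>F) for the group effect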
Correlation
9.1 Introduction
Correlation is a measure of the strength of the relationship or association between two quantitative,
continuous variables, for example, age and blood pressure. When we analyse the correlation
between two variables, we should follow these steps:
• For a generally upward shape we say that the correlation is positive. As the
independent variable increases, the dependent variable generally increases.
• For a generally downward shape we say that the correlation is negative. As the
independent variable increases, the dependent variable generally decreases.
• For randomly scattered points with no upward or downward trend, we say there is no
correlation.
Step 2: Look at the spread of points to make a judgement about the strength of the correlation.
• For positive relationships, the spread of the points determines whether we would describe
the correlation as weak, moderate, or strong (the corresponding scatter diagrams are not
reproduced here).
r_{xy} = \frac{s_{xy}}{s_x s_y} \qquad (9.1)
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (9.2)
where:
• n is the sample size
• sxy is the sample covariance between X and Y
• sx and sy are the sample standard deviations of X and Y
The values (r and r2 ) are important because they tell us how close to linear a set of data is.
There is no point in fitting a linear relationship between two variables if they are clearly not
linearly related.
• If r = −1, the data is perfectly negatively correlated. This means the data lie exactly in a
straight line with negative gradient.
The following table is a guide for describing the strength of linear association using the
coefficient of determination:
r² = 0              no correlation
0 < r² < 0.25       very weak correlation
0.25 ≤ r² < 0.5     weak correlation
0.5 ≤ r² < 0.75     moderate correlation
0.75 ≤ r² < 0.9     strong correlation
0.9 ≤ r² < 1        very strong correlation
r² = 1              perfect correlation
• Correlations may or may not indicate causal relations. Reversely, causal relations from some
variable to another variable may or may not result in a correlation between the two variables.
• Correlations are very sensitive to outliers; a single unusual observation may have a huge
impact on a correlation. Such outliers are easily detected by a quick inspection of a scatterplot.
• To rank two sets of data X and Y , you give the rank 1 to the highest of the xi values, 2 to
the next highest, 3 to the next highest, and so on. You do the same for the yi values.
• It makes no difference if you rank the smallest as 1, the next smallest as 2, etc., provided
you do the same to both X and Y.
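A sketch of computing these coefficients in R with the built-in mtcars data:
cor(mtcars$wt, mtcars$mpg)                       # Pearson's r
cor(mtcars$wt, mtcars$mpg)^2                     # coefficient of determination r^2
cor(mtcars$wt, mtcars$mpg, method = "spearman")  # Spearman's rank correlation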
Classification
10.1 Introduction
10.1.1 Classification vs Prediction
• Classification results in categorical (discrete, unordered) labels as the outcome, whereas a
prediction model predicts a continuous-valued function, or ordered value, as opposed to a
categorical label.
• Regression analysis is a statistical methodology that is most often used for numeric
prediction. Decision tree classifiers, Bayesian classifiers, Bayesian belief networks,
rule-based classifiers, backpropagation neural networks, Support Vector Machines and
random forests are examples of classifiers.
• In the second step (Fig. 10.2), the model is used for classification. First, the performance
of the classifier is estimated using a test set, which comprises tuples and their class labels
that are independent of the training tuples. If the performance is not satisfactory, the model
is built afresh starting from the first step. Upon satisfactory performance, the model is used
for classification.
Figure 10.1: Learning: Training data are analyzed by a classification algorithm. Here, the class
label attribute is loan decision, and the learned model or classifier is represented in the form of
classification rules.
Figure 10.2: Classification: Test data are used to estimate the performance of the classifier
(classification rules). If the accuracy is considered acceptable, the classifier can be applied to
the classification of new data tuples
• One way to do this is to draw a straight line through the data points which includes the
mean point (x̄, ȳ)
• After plotting the mean on the scatter diagram, we draw in the line of best fit, by eye, that
has about as many points above the line as are below it.
• The problem with drawing a line of best fit by eye is that the answer will vary from one
person to another and the equation of the line may not be very accurate.
• Having found our line of best fit, we can then use this linear model to estimate a value of y
for any given value of x.
1. The method of ’least squares’ is the most common way to determine the gradient (slope)
and the y-intercept of the best fit line.
2. We find the vertical distances d1, d2, d3, ..., dn to the line of best fit, where n is the number
of data points.
3. We then add the squares of these distances, giving SSE = d_1^2 + d_2^2 + d_3^2 + \dots + d_n^2.
4. The least squares regression line is the one which makes this sum of the squares of the error
SSE as small as possible.
5. The equation of the best-fit line y = mx + c is given by the following formulas:
m = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} \qquad (10.1)
c = \frac{\sum y - m \sum x}{n} \qquad (10.2)
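A sketch of least-squares fitting in R, using the built-in cars data as an example:
fit <- lm(dist ~ speed, data = cars)
coef(fit)                                        # intercept (c) and slope (m)
summary(fit)$r.squared                           # coefficient of determination
predict(fit, newdata = data.frame(speed = 15))   # estimate y for a given x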
Credit Card Fraud: Predicting if a given credit card transaction is fraud or not
P_i = E(y = 1 \mid x_i) = \frac{e^{z}}{1 + e^{z}} = \frac{e^{\alpha + \beta_i x_i}}{1 + e^{\alpha + \beta_i x_i}} \qquad (10.4)
You can implement this equation using the glm() function by setting the family argument to
binomial.
# Template code
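A sketch only; mydata, outcome, predictor1, predictor2 and newdata are placeholder names, not objects defined in these notes:
model <- glm(outcome ~ predictor1 + predictor2,
             data = mydata, family = binomial)
summary(model)
predicted_prob <- predict(model, newdata = newdata, type = "response")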
A typical decision tree is shown in the above figure. It classifies a customer, that is, it predicts
whether a customer is likely to purchase a computer. Internal nodes are denoted by rectangles, and
leaf nodes are denoted by ovals. Some decision tree algorithms produce only binary trees (where
each internal node branches to exactly two other nodes), whereas others can produce nonbinary
trees. “How are decision trees used for classification?” Given a tuple, X, for which the associated
class label is unknown, the attribute values of the tuple are tested against the decision tree. A
path is traced from the root to a leaf node, which holds the class prediction for that tuple. The
decision tree induction algorithm is illustrated in Algorithm 10.1.
if (!require(rpart)){
install.packages('rpart')
library(rpart)
}
if (!require(rpart.plot)){
install.packages('rpart.plot')
library(rpart.plot)
}
df <- iris
names(df)
attach(df)
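The model below uses trainset and testset, whose creation is not shown in these notes; one simple way to produce them (a sketch with an arbitrary seed and a 70/30 split, also assumed by the random forest and SVM examples later):
set.seed(42)
idx <- sample(nrow(df), 0.7 * nrow(df))
trainset <- df[idx, ]
testset  <- df[-idx, ]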
# Build decision tree model for predicting the species of iris flower
dtree <- rpart(Species~., data = trainset, method = "class")
rpart.plot(dtree)
pr <- predict(dtree, testset, type = 'class')
cm <- table(predictions = pr, actual = testset$Species)
cm
accuracy <- sum(diag(cm)) / sum(cm)
print(paste('Accuracy of classifier', accuracy))
if (!require(randomForest)){
install.packages('randomForest')
library(randomForest)
}
df <- iris
names(df)
Algorithm 10.1 Generate_decision_tree. Generate a decision tree from the training tuples of
data partition D.
Input:
• Data partition, D, which is a set of training tuples and their associated class labels;
Method:
create a node N ;
if (tuples in D are all of the same class, C) then
return N as a leaf node labeled with the class C;
end if
if (attribute list is empty) then
return N as a leaf node labeled with the majority class in D ▷
majority voting
end if
apply Attribute_selection_method(D, attribute_list) to find the “best” splitting criterion;
label node N with splitting criterion;
if (splitting_attribute is discrete-valued) and (multiway splits allowed) then
attribute_list ←− attribute_list − splitting_attribute;
end if
for (each outcome j of splitting criterion) do
let Dj be the set of data tuples in D satisfying outcome j;
if (Dj is empty) then
attach a leaf labeled with the majority class in D to node N ;
else
attach the node returned by Generate_decision_tree(Dj , attribute_list) to node N ;
end if
end for
return N ;
Figure 10.4: The 2-D training data are linearly separable. There are an infinite number of (possible)
separating hyperplanes or “decision boundaries”. Which one is best?
attach(df)
# Build random forest model for predicting the species of iris flower
rf.model <- randomForest(formula = Species~., data = trainset)
rf.model
# Use random forest model to predict class label (Species) of iris flower data in testset
pr <- predict(rf.model, testset, type = 'class')
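The same evaluation used for the decision tree can be reused to assess these predictions:
cm <- table(predictions = pr, actual = testset$Species)
cm
accuracy <- sum(diag(cm)) / sum(cm)
print(paste('Accuracy of classifier', accuracy))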
Margin: the shortest distance from a hyperplane to the plane that is parallel to it and passes
through the closest training tuple of either class.
The hyperplane with the larger margin is more accurate at classifying future data tuples than
the hyperplane with the smaller margin. This is why (during the learning or training phase),
the SVM searches for the hyperplane with the largest margin, that is, the maximum marginal
hyperplane (MMH). Refer Figures 10.4 and 10.5 for details.
Figure 10.5: Two possible separating hyperplanes and their associated margins. Which one is
better? The one with the larger margin (b) is called MMH.
• The support vectors found also provide a compact description of the learned model.
• The training time of even the fastest SVMs can be extremely slow.
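These notes do not include R code for SVMs; a sketch using the e1071 package (one common SVM implementation, our choice) and the same trainset/testset split as before:
if (!require(e1071)){
  install.packages('e1071')
  library(e1071)
}
svm.model <- svm(Species~., data = trainset, kernel = "linear")
pr <- predict(svm.model, testset)
table(predictions = pr, actual = testset$Species)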
Clustering
11.1 Introduction
Clustering is the process of grouping the data into classes or clusters, so that objects within a
cluster have high similarity in comparison to one another but are very dissimilar to objects in
other clusters. In machine learning, clustering is an example of unsupervised learning. Unlike
classification, clustering and unsupervised learning do not rely on predefined classes and class-
labeled training examples. For this reason, clustering is a form of learning by observation, rather
than learning by examples.
• Data matrix (or object-by-variable structure): This represents m objects, such as persons,
with n variables (also called measurements or attributes), such as age, height, weight, gender,
and so on. The structure is in the form of a relational table, or m-by-n matrix (m objects ×
n variables):
A_{m,n} = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{pmatrix}
• Interval-scaled variables: continuous measurements of a roughly linear scale, e.g. weight and
height. Distance: d(i,j) = \sqrt{(a_{i,1} - a_{j,1})^2 + \cdots + (a_{i,n} - a_{j,n})^2}.
• Binary variables: have only two states, 0 or 1, where 0 means that the variable is absent and
1 means that it is present, e.g. smoker. Distance: d(i,j) = \frac{r + s}{q + r + s + t}, where
q is the number of variables that equal 1 for both objects i and j, r is the number of variables
that equal 1 for object i but 0 for object j, s is the number of variables that equal 0 for object
i but 1 for object j, and t is the number of variables that equal 0 for both objects i and j.
• Ordinal variables: resemble a categorical variable, except that the M states of the ordinal
value are ordered in a meaningful sequence, e.g. designation of an employee. Replace the rank
r_{if} of the i-th object in the f-th variable by z_{if} = \frac{r_{if} - 1}{M_f - 1}, where M_f
is the number of states of the f-th variable, then use the distance measure for interval-scaled
variables.
E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2 \qquad (11.1)
where k is the number of clusters, Ci is the i-th cluster and mi is the mean of the i-th cluster.
Algorithm 11.1 The k-means algorithm for partitioning, where each cluster’s center is represented
by the mean value of the objects in the cluster.
Input:
• k: the number of clusters;
• D: a data set containing n objects.
Method:
arbitrarily choose k objects from D as the initial cluster centers;
repeat
(re)assign each object to the cluster to which the object is the most similar, based on the
mean value of the objects in the cluster;
update the cluster means, i.e., calculate the mean value of the objects for each cluster;
until no change;
• The k-means method, however, can be applied only when the mean of a cluster is defined.
This may not be the case in some applications, such as when data with categorical attributes
are involved.
• The necessity for users to specify k, the number of clusters, in advance can be seen as a
disadvantage.
• The k-means method is not suitable for discovering clusters with nonconvex shapes or clusters
of very different size.
• k-means is sensitive to noise and outlier data points because a small number of such data
can substantially influence the mean value.
Figure 11.1: Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}
Figure 11.2: Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}
11.5 Clustering in R
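The code below assumes a scaled numeric data set df and a fitted k-means result km.res; a minimal setup sketch (USArrests is our assumed example data):
df <- scale(USArrests)
set.seed(123)
km.res <- kmeans(df, centers = 4, nstart = 25)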
# Cluster sizes
km.res$size
# Cluster means
km.res$centers
install.packages("factoextra")
library(factoextra)
fviz_nbclust(df, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)
Figure 11.3 shows that the within-groups sum of squares decreases as k increases, but a bend (or
“elbow”) can be seen at k = 4. This bend indicates that additional clusters beyond 4 have little
value.