R Programming

Vishal Jain
Samatrix Consulting Pvt Ltd
This is a confidential publication. All rights reserved. This document may not, in whole
or in part, be copied, reproduced, translated, photocopied, or reduced to any medium
without prior and express written consent from Samatrix Consulting Pvt Ltd.
Table of Contents
1. GETTING STARTED WITH R
INTRODUCTION TO R
R as a programming language
R as a computing environment
THE NEED FOR R
INSTALLING R
RSTUDIO
RSTUDIO’S USER INTERFACE
The console
The editor
The environment pane
The history pane
The file pane
The plots pane
The package pane
The help pane
The viewer pane
2. R WORKSPACE
R’S WORKING DIRECTORY
CREATE R PROJECT IN RSTUDIO
ABSOLUTE AND RELATIVE PATH
MANAGING THE PROJECT FILES
INSPECTING AN ENVIRONMENT
INSPECTING EXISTING SYMBOLS
VIEW THE STRUCTURE OF OBJECT
REMOVING SYMBOLS
MODIFYING GLOBAL OPTIONS
Modifying the number of digits to print
Modifying the warning level
MANAGING THE LIBRARY OF PACKAGES
Getting to know a package
Installing package from CRAN
Update package from CRAN
INSTALL PACKAGE FROM ONLINE REPOSITORIES
PACKAGE FUNCTIONS
Masking and name conflicts
3. BASIC OBJECTS
VECTOR
Numeric Vector
Logical vector
Character Vector
Subsetting Vectors
Named Vector
EXERCISE
EXTRACTING ELEMENT
CLASS OF THE VECTOR
Converting Vectors
ARITHMETIC OPERATORS
MATRIX
Naming Rows and Columns
Subsetting a Matrix
Matrix Operators
ARRAYS
Subsetting an array
LIST
Subset a list
Named Lists
Setting Values
Other List Operations
DATA FRAME
Create a Data Frame
Naming rows and columns
Subset of a Data frame
Subset of a data frame as a list
Subset a data frame as matrix
Filtering Data
Setting Values as a list
Factors
Useful functions for Data Frame
Loading and Writing data on the disk
FUNCTIONS
Creating a function
Calling a function
Dynamic Typing
Generalizing a function
Default value for function argument
4. BASIC EXPRESSIONS
ASSIGNMENT EXPRESSIONS
Using backticks
CONDITIONAL EXPRESSIONS
Using if as a statement
Using if as an expression
Using if with vector
Vectorized if: ifelse
USING SWITCH FUNCTION
LOOP EXPRESSIONS
For loop
Managing the flow of a for loop
Creating nested for loop
While Loop
5. WORKING WITH BASIC OBJECTS
OBJECT FUNCTIONS
Testing object types
Accessing Object Classes and Types
Getting data dimensions
Reshaping Data Structures
Iterating over one dimension
USING LOGICAL FUNCTION
Logical operators
Logical functions
Which elements are TRUE
Dealing with missing values
Logical Coercion
MATH FUNCTIONS
Number rounding functions
TRIGONOMETRIC FUNCTIONS
HYPERBOLIC FUNCTION
EXTREME FUNCTIONS
FINDING ROOTS
DERIVATIVES
INTEGRATION
USING STATISTICAL FUNCTION
Sampling from a vector
PROBABILITY DISTRIBUTIONS
SUMMARY STATISTICS
COVARIANCE AND CORRELATION MATRIX
6. WORKING WITH STRINGS
STRINGS AND CHARACTER VECTORS
Printing Strings
TRANSFORMING TEXT
Changing case
Counting characters
Trimming leading and trailing whitespace
Substring
Splitting Texts
Formatting Text
Parsing text as date/time
Formatting date/time to strings
USING REGULAR EXPRESSIONS
Finding a string pattern
Using group to extract data
7. WORKING WITH DATA
READING AND WRITING DATA
Reading and writing data to text format file
Importing data via RStudio
Importing data using built-in Functions
Importing data using the readr package
Reading and writing Excel worksheets
Reading and writing native data files
Loading built-in datasets
VISUALIZING THE DATA
Creating scatter plots
Customize Chart Elements
Customize point style
Customizing the point colors
Creating line plots
Line Type and Width
Multi-period line plot
Line plot with points
Multi-Series Chart with a Legend
Bar charts
Pie Charts
Histogram and density plots
Boxplot
8. ANALYSING DATA
LINEAR MODEL
DECISION TREE

1. Getting Started with R

Data analysis requires proper tools. Extracting patterns directly from a
large set of numbers arranged in rows and columns is almost impossible.
To work with data, we need tools such as R to boost productivity.

Introduction to R

The R programming language is used for statistical computing, data
exploration, analysis, and visualization. R is free and open-source, and it
has a strong and rapidly growing community. More than 17,000 packages enable
R to deal with problems in a wide range of fields.

The R programming language originated in 1993. The adoption of R started in
the data-related research industry and has been growing rapidly for the last
decade. Today, R has become a lingua franca of data science.

R is not just a programming language. It is a comprehensive computing
environment that is supported by a strong and active community and has a
rapidly growing ecosystem.

R as a programming language

The R programming language has been evolving for more than 20 years. The
goal is to make the language both easy to use and flexible, so that complex
statistical computing, data exploration, and visualization operations can be
performed.

Ease of use and flexibility are conflicting goals. A tool can let you finish
a variety of statistical analysis tasks by clicking a few buttons, but it will
not be flexible when you need customization or automation, or when your work
needs to be reproducible. On the other hand, a programming language can be
flexible enough to transform data and make complicated graphs, but it may not
be easy to learn. R is known for striking a good balance between the two.

R as a computing environment

R is lightweight and ready to use. It is smaller and easier to deploy than
other statistical software such as MATLAB and SAS.

The need for R

The R programming language has gained importance in the data science
community for the following reasons:

Free of charge: R is available free of charge. You do not need to buy a
license to install R or to use it commercially.

Open-source: R is open source. Thousands of developers around the globe
work constantly to add new packages, review the source code, and fix bugs.
Because the source code is available, you can dig into it yourself to fix a
bug or improve the functionality of a package.

Popular: R is a popular programming language for statistical analysis, data
mining, analysis, and visualization.

Flexible: R supports dynamic scripting. It allows programming styles in
multiple paradigms, including functional programming and object-oriented
programming. It also supports flexible metaprogramming. Its flexibility
enables you to perform highly customized and comprehensive data
transformation and visualization.
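
The three styles mentioned above can be illustrated with a minimal sketch. The names `describe`, `circle`, and `shape` are hypothetical and chosen only for this illustration; the example uses base R and shows just one way to write each style.

```r
# Functional style: pass an anonymous function to another function.
squares <- sapply(1:5, function(x) x^2)
print(squares)

# Object-oriented style (S3): a generic function with a method for a
# hypothetical class named "circle".
describe <- function(obj, ...) UseMethod("describe")
describe.circle <- function(obj, ...) {
  paste("circle with radius", obj$radius)
}
shape <- structure(list(radius = 2), class = "circle")
print(describe(shape))

# Metaprogramming: capture an expression without evaluating it,
# then evaluate it later with values supplied in a list.
expr <- quote(a + b)
result <- eval(expr, list(a = 1, b = 2))
print(result)
```

Later chapters cover functions and objects in detail; the point here is only that all three styles are available in the same language.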

Reproducible: When using software based on a graphical user interface, you
only need to choose from menus and click buttons. However, it is hard to
accurately and automatically reproduce what you have done without writing
scripts. In R, the analysis itself is a script, so every step is recorded and
can be run again.
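
As a rough illustration of reproducibility, the sketch below records each step of a small analysis as code. The data are simulated, and the variable names and the seed value 42 are arbitrary choices made for this example.

```r
# Fix the random seed so that the simulated sample is the same on every run.
set.seed(42)
heights <- rnorm(100, mean = 170, sd = 10)  # simulated data, illustration only

# Every step of the analysis is written down as code.
summary_stats <- c(mean = mean(heights), sd = sd(heights))
print(round(summary_stats, 2))

# Re-running the same steps with the same seed reproduces the sample exactly.
set.seed(42)
heights_again <- rnorm(100, mean = 170, sd = 10)
print(identical(heights, heights_again))
```

Anyone who runs this script gets the same numbers each time it is run, which is difficult to guarantee when the steps are a sequence of menu clicks.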

Rich Online Resources: R is known for its huge and rapidly increasing number
of online resources. More than 17,000 packages are available at CRAN (short
for Comprehensive R Archive Network), a worldwide network of mirror servers
from which you can get identical, up-to-date R distributions and packages.

Strong community: The R community consists not only of R developers but
also of R users, who form the majority and come from a wide range of
backgrounds such as statistics, econometrics, finance, bioinformatics,
mechanical engineering, physics, and medicine. A great number of R developers
actively contribute to open-source projects or packages written in R. The
goal of the community is to make data analysis, exploration, and
visualization easier and more interesting.

Installing R

You can install R from the official website (https://www.r-project.org/).
Follow the link to download R (https://cran.r-project.org/mirrors.html) and
choose a nearby mirror (for India, https://mirror.niser.ac.in/cran/). Then
choose the download for your operating system, select base as the
subdirectory, and click on “Download R 3.2.3 for Windows”. The latest version
at the time of writing is 3.2.3. It may be different when you install R.

If you are a Windows user, you can download an installer for the latest
version and then run it to install R. Even though the installation process is
easy, many users face issues during the installation.

When you choose the components to install, the Windows installer displays
four components in a drop-down. Keep the default options, as shown below.

The next step is to select additional tasks. Keep the default options.

The installer now copies the files to your hard drive.

R has now been installed on your system. You can use R either from the
command prompt or in the R GUI.
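
To check that the installation works, you can type a few commands at the R prompt. The lines below are a minimal sketch; the version used in the comparison is the one mentioned in this text, and your installed version may be newer.

```r
# Print the version string of the running R installation.
print(R.version.string)

# Check that the running version is at least the one mentioned in this text.
print(getRversion() >= "3.2.3")

# A first calculation at the prompt.
x <- c(1, 2, 3, 4)
print(mean(x))
```

If these commands run without an error, R is ready to use.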

Even though you can start using R directly, we recommend RStudio for editing
and debugging R scripts. R is the back end and RStudio is the front end.

Windows users may also install Rtools from
http://cran.rstudio.com/bin/windows/Rtools/. With Rtools, you can write C++
code, compile it, and call it from R. You can also use C/C++ code from other
sources.

RStudio

RStudio is an integrated development environment (IDE) for R programming. It
is open source and available for free on multiple platforms such as Windows,
Mac, and Linux.

RStudio is known for its powerful features that boost productivity in data
analysis and visualization. RStudio supports various advanced features such as
syntax highlighting, autocompletion, multi-tabbed views, file management,
graphics viewport, package management, integrated help viewer, code
formatting, version control, interactive debugging, and many more.

RStudio can be downloaded from
https://www.rstudio.com/products/rstudio/download. The preview version
with new features can be downloaded from
https://www.rstudio.com/products/rstudio/download/preview. Note that
RStudio does not include R, so make sure that R is installed before you work
in RStudio.

Once you complete the installation of RStudio and start it, you see the
following user interface.

RStudio’s User Interface

The screenshot of the user interface of RStudio for the Windows operating
system is given below. The main window consists of several parts, each known
as a pane. Each pane performs a different function and is designed to help
data analysts work with data.

The console

The R console is embedded in RStudio. It works like a command prompt or
terminal. RStudio submits the commands that you type at the console to the
R engine, which executes them. RStudio then presents the results back to
you.

You can use the console to execute a command, define a variable, or evaluate
an expression interactively to compute a statistical measure, transform data,
or produce charts.

The editor

While working with data, we not only type commands at the console but also
write scripts in the editor. A script is a set of commands that represents a
logic flow. The editor is useful for editing R scripts, markdown documents, web
pages, and many other types of files, such as configuration files.

The code editor is more advanced than a plain text editor. It supports
features such as syntax highlighting, autocompletion of R code, and debugging
with breakpoints. You may also use the following shortcut keys:

Ctrl + Enter – Execute the current line or the selected code
Ctrl + Shift + S – Source the current document, that is, evaluate all the
expressions in the current document
Tab or Ctrl + Space – Show an autocompletion list of variables and functions,
matching as you type

Breakpoint – Click the left margin next to a line number to set a breakpoint.
When you execute the script, the program pauses at this line and waits for
you to debug.

The environment pane

The environment pane shows the variables and functions that have been
created and are available for repeated use. By default, it shows the variables
in the global environment, which is the user workspace where you are
working.

Whenever you create a new object, a new entry appears in the Environment
pane with the variable name and a short description of its value. When you
change the value of a symbol, the change is reflected in the Environment pane.

The history pane

The history pane shows the expressions that were previously evaluated in the
console. You can rerun an earlier task from the history pane, or recall it by
pressing the Up arrow key in the console.

The file pane

In the file pane, you can see the files in a folder, navigate between folders,
create new folders, and delete or rename folders and files. When you work on
an RStudio project, you can view and organize the project files in the file
pane.

The plots pane

You can use the plots pane to see the graphics produced by R code. If there is
more than one plot, previous plots are stored. You can view all the plots by
navigating back and forth.

The package pane

You can view all the installed packages in the package pane. From this pane
you can install or update a package from CRAN, or remove an existing package
from your library.

The help pane

R provides detailed documentation, which you can find in the Help pane.
Using this documentation, you can learn how to use the functions.

Ways to view the documentation of a function are:

• Type the function name in the Search box and find it directly
• Type the function name in the console and press F1
• Type ? before the function name and execute it

In practice, you don't have to remember all of R's functions; you only need to
remember how to get help with a function you are not familiar with.
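The documentation can also be reached from code. In the sketch below, mean, sd, and read.csv are only example function names. help() opens the help page (in the Help pane when run in RStudio), so it is wrapped in interactive(), while args() and apropos() simply print to the console.

```r
# Open the help page of a function; typing ?mean does the same
if (interactive()) {
  help("mean")
}

# Show the arguments of a function without opening its help page
print(args(sd))

# List functions whose names match a pattern, which is useful when
# you remember only part of a name
matches <- apropos("^read\\.csv")
print(matches)
```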

The viewer pane

The Viewer pane is a new feature; it was introduced as an increasing number
of R packages combine the functionality of both R and existing JavaScript
libraries to make rich and interactive presentations of data.

2. R Workspace

This chapter covers some basic yet important skills that are required to
manage the R workspace.

The important topics are:

• R’s Working Directory
• Inspecting the environment
• Modifying global options
• Managing the library of packages

R’s Working Directory

Whether you launch an R session in a terminal or in RStudio, it always starts
in a directory. This directory is called the working directory. To access other
files on the hard disk, you can use either an absolute path (for example,
D:\Workspaces\project\R-Class\data\house.csv) or a path relative to the
working directory (for example, data\house.csv when the working directory is
D:\Workspaces\project\R-Class\).

When you use a relative path, the file it points to does not change; only the
notation becomes shorter. Relative paths also make scripts more portable. If a
script uses absolute paths, other users who run the code on another machine
have to modify it to match the location of the data on their hard drive. If the
script uses relative paths and the data is stored in the same relative location,
there is no need to modify the code.
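A relative path can also be built in code. In the sketch below, file.path() joins path components with /, and prefixing the result with getwd() shows the absolute location it refers to. The data/house.csv location is the example path used above; the file does not need to exist for the code to run.

```r
# Build a relative path from its components
rel_path <- file.path("data", "house.csv")
print(rel_path)

# The same file expressed as an absolute path, based on the
# current working directory
abs_path <- file.path(getwd(), rel_path)
print(abs_path)

# Check whether the file is really there before reading it
print(file.exists(rel_path))
```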

You can check the current working directory of the running R session by
calling getwd(). By default, a new R session started from a terminal starts in
your user directory, while RStudio runs the R session in the background from
the user documents directory.

In RStudio, you can choose a directory and create an R project. Whenever you
open the project, the location of the project becomes the working directory.
This improves the portability of the project because files can be accessed
using relative paths.

Create R Project in RStudio

In order to create a new project, you can go to File | New Project or click the
Project drop-down menu in the top-right corner of the main window and
choose New Project. A window will appear, and you can create a new directory
or choose an existing directory on your hard drive as the project directory:

You have to choose a local directory, in which the project will be created. An
R project is represented by an .Rproj file, which stores some project settings.
When you open the .Rproj file, the settings stored in the file are applied and
the working directory is set to the directory in which the project file is
located.

When you use RStudio to work in a project, autocompletion makes writing file
paths much more efficient. If you type a string containing an absolute or
relative file path and press Tab, RStudio lists the files in that directory:

Absolute and relative path

You can access the working directory by using getwd() command

> getwd()
[1] "D:/Workspaces/R-programming"

In this example, notice that the path of the working directory uses / instead
of \. In Windows, \ is the default path separator, but in R the \ symbol is used
to write special characters (escape sequences). For example, while creating a
character vector, you can use \n to represent a new line:

> "Hello\nWorld"
[1] "Hello\nWorld"

In this example, the escape sequence is preserved when the character vector
is printed directly. That is why you do not see the effect of the newline
character.

If you want the special characters to be translated to the characters they
represent, you can use cat():
> cat("Hello\nWorld")
Hello
World

In the example above, the second word starts on a new line (\n). If you want
to write \ itself, you can use \\:

> cat("The string with '\\' is translated")
The string with '\' is translated
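An escape sequence is a single character even though it is typed as two. The following sketch counts characters with nchar() and writes the strings with writeLines(), which, like cat(), outputs the characters that the escape sequences stand for. The variable names are chosen only for illustration.

```r
s <- "Hello\nWorld"

# "\n" counts as one character: 5 + 1 + 5
print(nchar(s))      # 11

# "\\" in source code is a single backslash in the string
b <- "d:\\data"
print(nchar(b))      # 7

# writeLines() shows the characters the escapes stand for
writeLines(s)
writeLines(b)
```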

When specifying a path in Windows, you should use \\ or /. Both options are
supported. In Unix-like operating systems such as macOS and Linux, the path
separator is always /, so no escaping is needed. In Windows, if you use a single
\, you will get an error:

filename <- "d:\data\test.csv"
Error: '\d' is an unrecognized escape in character string
starting ""d:\d"

Instead, you can write the path with doubled backslashes:

filename <- "d:\\data\\test.csv"

In most cases, Windows users can use /. This lets the same code run on all
the major operating systems:

absolute_filename <- "d:/data/test.csv"
relative_filename <- "data/test.csv"

To set the working directory of the current R session, you can use
setwd(). This is not a recommended practice because it redirects all the
relative paths in the script to another directory, which can make everything
go wrong. That is why it is good practice to create an R project to start your
work.
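If a script really has to change the working directory, one common way to limit the damage is to restore the previous directory afterwards. The helper below is a hypothetical sketch (run_in_dir is not part of R). It relies on setwd() returning the previous directory and on on.exit() to restore it even if an error occurs.

```r
# Hypothetical helper: call a function inside another directory,
# then switch back to the original working directory
run_in_dir <- function(dir, fun) {
  old_wd <- setwd(dir)      # setwd() invisibly returns the old directory
  on.exit(setwd(old_wd))    # restore it when this function exits
  fun()
}

before <- getwd()
inside <- run_in_dir(tempdir(), getwd)
after <- getwd()

print(inside)                    # the temporary directory
print(identical(before, after))  # TRUE: the working directory is unchanged
```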

Managing the Project files

When we create a project in RStudio, an .Rproj file is created in the project
directory. Initially there is no other file.
A typical R project contains many R scripts for statistical computing
and programming tasks, data files (such as .csv files), documents (such as
markdown files), and output graphics.

If all the files are mixed up in the project directory, managing them becomes
difficult. So it is recommended to create subdirectories that contain different
types of files for different tasks.

Here is an example of a flat directory structure with all the files in the same
folder:

project/
- household.csv
- population.csv
- national-income.png
- population-density.png
- utils.R
- import-data.R
- check-data.R
- plot.R
- README.md
- NOTES.md

Here is an example of a cleaner directory structure:

project/
- data/
  - household.csv
  - population.csv
- graphics/
  - national-income.png
  - population-density.png
- R/
  - utils.R
  - import-data.R
  - check-data.R
  - plot.R
- README.md
- NOTES.md
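The subdirectories can be created from R itself. The sketch below builds the data, graphics, and R folders inside a temporary directory so that it can be run safely; in a real project you would use the project directory instead. The variable name project_dir is an assumption made for the example.

```r
# Use a temporary location so the example does not touch real projects
project_dir <- file.path(tempdir(), "project")

# Create the project directory and its subdirectories
for (sub in c("data", "graphics", "R")) {
  dir.create(file.path(project_dir, sub),
             recursive = TRUE, showWarnings = FALSE)
}

# List what was created
print(list.dirs(project_dir, full.names = FALSE, recursive = FALSE))
```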

Inspecting an environment

Every expression in R is evaluated in a specific environment. An environment
is a collection of objects (functions, variables, and so on). When you start the
R interpreter, the global environment is created. Any variable that you define
at the console is recorded in this environment, and any command that you
type at the RStudio console is evaluated in the global environment.

When you start a fresh R session, the global environment is empty. No object
has been defined in this environment. If you run the command x <- c(1,2,3),
the numeric vector c(1,2,3) is bound to the symbol x in the global
environment.
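The binding of a symbol to a value can be observed directly. The sketch below uses a separate environment created with new.env(), so that it does not touch your workspace, together with the base functions assign(), exists(), and get(). The name x mirrors the example above; the environment name e is an assumption.

```r
# The global environment is the one used by the console
print(environmentName(globalenv()))   # "R_GlobalEnv"

# Create a separate environment and bind a vector to the symbol x in it
e <- new.env()
assign("x", c(1, 2, 3), envir = e)

# The symbol now exists in e and can be looked up by name
print(exists("x", envir = e, inherits = FALSE))  # TRUE
print(get("x", envir = e))                       # 1 2 3
```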

Inspecting existing symbols

The most useful function for inspecting the collection of objects that we are
working with is objects(). This function returns a character vector of the
names of the existing objects in the current environment.

In a fresh R session, there are no symbols:

> objects()
character(0)

If we create the following objects, objects() returns their names:

> x <- c(1, 2, 3)
> y <- c("a","b","c")
> z <- list(m=1:5,n=c("x","y","z"))
> objects()
[1] "x" "y" "z"

You can also use ls(), which is an alias of objects():

> ls()
[1] "x" "y" "z"

If you are working in RStudio, there is no need to call ls() or objects(); you
can see all the symbols in the Environment pane.

The Environment pane shows all the symbols and their values in a compact
form. You can view the vectors inside a list or a data frame by expanding them.

The Environment pane has two views: List and Grid. The grid view shows not
only the names, types, and the value structures of existing objects, but also
their object sizes:

View the structure of object

The compact representation of an object that is shown in the Environment
pane can also be obtained with the str() function, which prints the structure
of a given object.

For a vector, str() shows the type, the index range, and a preview of the
values:

> x
[1] 1 2 3
> str(x)
num [1:3] 1 2 3

If the vector has more than 10 elements, str() will show only the first 10:

> str(1:40)
int [1:40] 1 2 3 4 5 6 7 8 9 10 ...
A list can be printed by evaluating it directly in the console. You can also print
it using print():

> z
$m
[1] 1 2 3 4 5

$n
[1] "x" "y" "z"

You can also use str() to show its type, length, and the structure preview of
the elements

> str(z)
List of 2
$ m: int [1:5] 1 2 3 4 5
$ n: chr [1:3] "x" "y" "z"

For a nested list such as the following:

> nest_list <- list(d=1:15,e=list("a",c(1,2,3)),
    f = list(x=1:10, y = c("g","h")),
    g = list(x=0:11,y=c("i","j")))

We can print the list directly. The output shows all its elements and tells us
how we can access them. However, it is long and unnecessary in most cases:

> nest_list
$d
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

$e
$e[[1]]
[1] "a"

$e[[2]]
[1] 1 2 3

$f
$f$x
[1] 1 2 3 4 5 6 7 8 9 10
$f$y
[1] "g" "h"

$g
$g$x
[1] 0 1 2 3 4 5 6 7 8 9 10 11

$g$y
[1] "i" "j"

We can use str() function to get the compact representation

> str(nest_list)
List of 4
$ d: int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
$ e:List of 2
..$ : chr "a"
..$ : num [1:3] 1 2 3
$ f:List of 2
..$ x: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ y: chr [1:2] "g" "h"
$ g:List of 2
..$ x: int [1:12] 0 1 2 3 4 5 6 7 8 9 ...
..$ y: chr [1:2] "i" "j"

You can use str() to show the structure of an object. You can use ls.str()
to show the structure of the current environment

> ls.str()
nest_list : List of 4
$ d: int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
$ e:List of 2
$ f:List of 2
$ g:List of 2
x : num [1:3] 1 2 3
y : chr [1:3] "a" "b" "c"
z : List of 2
$ m: int [1:5] 1 2 3 4 5
$ n: chr [1:3] "x" "y" "z"

You can filter the output of ls.str(). One of the filters is the mode argument.
For example, if you use ls.str(mode="list"), you can view the structure of
all the list objects:
> ls.str(mode="list")
nest_list : List of 4
$ d: int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
$ e:List of 2
$ f:List of 2
$ g:List of 2
z : List of 2
$ m: int [1:5] 1 2 3 4 5
$ n: chr [1:3] "x" "y" "z"

The other filter is the pattern argument, which specifies the pattern of the
names to match. The pattern is written as a regular expression. If you want
to show the structures of all variables whose names contain only one
character, you can run the following command:

> ls.str(pattern = "^\\w$")
x : num [1:3] 1 2 3
y : chr [1:3] "a" "b" "c"
z : List of 2
$ m: int [1:5] 1 2 3 4 5
$ n: chr [1:3] "x" "y" "z"

If you want to show the structures of all list objects whose names contain only
one character, you can use both pattern and mode at the same time:

> ls.str(pattern = "^\\w$", mode = "list")
z : List of 2
$ m: int [1:5] 1 2 3 4 5
$ n: chr [1:3] "x" "y" "z"

If you're put off by patterns such as ^\\w$, don't worry. This pattern
matches all strings of the form (string begin)(any one word character such as
a, b, c)(string end). We shall cover regular expressions in detail in the
following units.
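To check what the pattern matches without involving ls.str(), you can test it against a few names with grepl(). This is only an illustration that reuses the symbol names from this section; the variable names symbol_names and is_single are assumptions.

```r
symbol_names <- c("x", "y", "z", "nest_list")

# TRUE where the whole name is exactly one word character
is_single <- grepl("^\\w$", symbol_names)
print(is_single)                 # TRUE TRUE TRUE FALSE

# Keep only the matching names
print(symbol_names[is_single])   # "x" "y" "z"
```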

Removing Symbols

In many cases, removing symbols is not necessary, but it can be useful to
remove very large objects that occupy a lot of memory. Removing a symbol
only removes the binding. When R is under memory pressure, its garbage
collector cleans up unused objects that no longer have any bindings.
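To judge whether an object is worth removing, you can measure it first. The sketch below uses object.size() to report the memory used by a vector and gc() to request a garbage collection after the binding is removed. The object name big is chosen only for the example.

```r
# A numeric vector with one million elements (8 bytes each)
big <- numeric(1e6)

# Report the memory used by the object
size_bytes <- as.numeric(object.size(big))
print(size_bytes)
print(format(object.size(big), units = "MB"))

# Remove the binding, then ask the garbage collector to run
rm(big)
invisible(gc())
print(exists("big", inherits = FALSE))   # FALSE
```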

We can remove symbols from the environment using the remove() function
or its shorter form, rm().

The symbols in the current environment are as follows

> ls()
[1] "nest_list" "x" "y" "z"

Now we can remove x using rm()

> rm(x)
> ls()
[1] "nest_list" "y" "z"

We can also remove multiple symbols in one function call:

> rm(y, z)
> ls()
[1] "nest_list"

If the symbol to be removed does not exist in the environment, a warning will
appear:

> rm(x)
Warning message:
In rm(x) : object 'x' not found

If we want to clear all the bindings in an environment, we can combine rm()
and ls() and call the function like this:

> rm(list = ls())
> ls()
character(0)
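Clearing the whole workspace with rm(list = ls()) is rarely what a shared script should do. A more selective sketch is shown below: it removes only the symbols whose names match a pattern. It works in a separate environment e (an assumption made to keep the example self-contained); for the global environment you would drop the envir arguments.

```r
e <- new.env()
assign("tmp_a", 1, envir = e)
assign("tmp_b", 2, envir = e)
assign("result", 3, envir = e)

# Remove only the symbols whose names start with "tmp_"
to_remove <- ls(e, pattern = "^tmp_")
rm(list = to_remove, envir = e)

print(ls(e))   # "result"
```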

Modifying global options

Unlike creating, inspecting, and removing objects in the working
environment, R options take effect globally for the current R session. Options
allow the user to set and examine a variety of global settings that affect the
way in which R computes and displays its results. We can call getOption() to
see the value of a given option and call options() to modify one.

Modifying the number of digits to print

In RStudio, when you type getOption(<Tab>), you can see a list of available
options and their descriptions. A commonly used option is digits, the number
of digits to display. In an R session, the number of significant digits printed on
screen is controlled by this option. We can call getOption() to see the current
value of digits and call options() to set digits to a larger number.

When an R session starts, the default value of digits is 7:

> 1234567.1234567
[1] 1234567
> 123.12345678
[1] 123.1235

In the second example above, the 11-digit number is shown with only 7 digits.
The last few decimal digits are not displayed; the printer only displays the
number with 7 significant digits. To verify that no precision is lost because of
digits = 7, see the output of the following code:

> 0.1000002
[1] 0.1000002
> 0.10000002
[1] 0.1
> 0.10000002 -0.1
[1] 2e-08

If the numbers were actually rounded to 7 significant digits, 0.10000002
would be stored as 0.1 and the third expression would result in 0. This does
not happen, because digits = 7 only controls how many significant digits are
displayed; it does not round the stored value.
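More digits can also be requested for a single value, without changing the global option. The sketch below uses the digits argument of print() and the base function sprintf(); neither changes the stored value. The variable names are assumptions made for the example.

```r
x <- 0.10000002

# Default printing uses the digits option
print(x)

# Ask for more significant digits for this call only
print(x, digits = 10)

# sprintf() formats with a fixed number of decimal places
shown <- sprintf("%.8f", x)
print(shown)      # "0.10000002"

# The stored value is not equal to 0.1
print(x == 0.1)   # FALSE
```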

However, in some cases, the number before the decimal point can be large,
and we don't want to ignore digits following the decimal point. Without
modifying digits, the following number will only display the integer part:

> 1234567.12345678
[1] 1234567

If we want to see more digits printed, we need to increase digits from the
default value 7 to a higher number:

> getOption("digits")
[1] 7
> 1e10 + 0.5
[1] 1e+10
> options(digits=15)
> 1e10 + 0.5
[1] 10000000000.5

Note: when we call the options() function, the modified values take effect
immediately. They may affect the behaviour of all the subsequent commands.
To reset the option to its default value, we can use:

> options(digits=7)
> 1e10 + 0.5
[1] 1e+10
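Rather than remembering the default value, you can save the previous setting when you change an option. options() invisibly returns the old values of the options it changes, so they can be restored later. A minimal sketch, in which the names before and old are assumptions:

```r
before <- getOption("digits")

# Change digits and keep the previous setting
old <- options(digits = 15)
print(getOption("digits"))   # 15
print(1e10 + 0.5)

# Restore whatever value was in effect before
options(old)
print(getOption("digits"))
```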

Modifying the warning level

We can manage the warning level by specifying the value of the warn option

> getOption("warn")
[1] 0

By default, the warning level is 0, which means a warning is a warning and an
error is an error. In this state, a warning is displayed but does not stop the
code, while an error terminates the code immediately. If multiple warnings
occur, they are combined and displayed together. For example, the
following conversion from a string to a numeric vector produces a warning
and results in a missing value:

> as.numeric("Program")
[1] NA
Warning message:
NAs introduced by coercion

By setting warn to a negative value, we can execute the same code without
any warning and still get a missing value from the unsuccessful conversion:

> options(warn=-1)
> as.numeric("Program")
[1] NA

Now there is no warning. However, it is not a good idea to always suppress
warning messages, because doing so also hides potential errors. You would
only notice them in the final result and would then have to spend time
debugging the code. If you want to achieve good results and spend less time
on debugging, we recommend that you be strict in your code.

If you set warn to 1 or 2, buggy code fails fast. When warn is set to 0, the
value is returned first and all the warning messages are then displayed
together:

> options(warn=0)
> f <- function (x,y){ as.numeric(x)+as.numeric(y)}
> f("Learn","R")
[1] NA
Warning messages:
1: In f("Learn", "R") : NAs introduced by coercion
2: In f("Learn", "R") : NAs introduced by coercion

The function coerces its two input arguments to numeric vectors. As both
input arguments are strings that cannot be converted, we get two warning
messages, but they appear only after the function returns. As a result, if the
function takes a considerable amount of time to complete, you would not see
any warning message before you get the final result, even though the
intermediate computation had been off track for some time.

If you want to print the warning messages as soon as the warning is produced,
you can use warn = 1:

> options(warn=1)
> f("Learn","R")
Warning in f("Learn", "R") : NAs introduced by coercion
Warning in f("Learn", "R") : NAs introduced by coercion
[1] NA

In this case, we get the same results but warning messages appear before the
result. If the function is time-consuming, you can see the warning messages
first and you may decide to stop the code and debug.

If you set warn = 2, all warnings are treated as errors. This is the strictest
warning level:

> options(warn=2)
> f("Learn","R")
Error in f("Learn", "R") :
(converted from warning) NAs introduced by coercion
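The warn option affects the whole session. To handle warnings for a single expression only, base R provides suppressWarnings() and tryCatch(). The sketch below shows one possible approach and is not a replacement for the warn option; to_number is a hypothetical helper name.

```r
# Silence the warning for one call only; other code is unaffected
value <- suppressWarnings(as.numeric("Program"))
print(value)                  # NA

# Hypothetical helper: turn a warning into a controlled result
to_number <- function(x) {
  tryCatch(
    as.numeric(x),
    warning = function(w) {
      message("Could not convert: ", conditionMessage(w))
      NA_real_
    }
  )
}

print(to_number("42"))        # 42
print(to_number("Program"))   # NA, with a message
```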

Managing the library of packages

Packages play an important role in data analysis and visualization in R. R itself
is built on several basic packages. A package contains predefined functions
that are designed to solve a range of problems. By using a package, we do not
have to focus on writing the code to solve the problem; we can focus on the
problem we are trying to solve.

R offers not only a rich collection of packages but also a well-maintained
package archive system called the Comprehensive R Archive Network, or CRAN
(http://cran.r-project.org/). CRAN is an archive of the source code of R and
thousands of packages. At the time of writing this content, there are 17077
active packages on the CRAN system. Every week more than 100 packages are
updated. You can check out the list of packages at
https://cran.rstudio.com/web/packages/.

Getting to know a package

A package is a collection of functions to solve a certain range of problems. It
can be an implementation of a family of statistical estimators, data-mining
methods, database interfaces, or optimization tools. To learn more about a
package, for example ggplot2, a very powerful graphics package, several
information sources are useful:

• Package description page (https://cran.rstudio.com/web/packages/ggplot2/):
You can find basic information about the package such as name, description,
version, dependencies, imports, suggests, enhances, publication date, author,
maintainer, BugReports, License, URL, citation, and so on. In addition to CRAN,
you may find package information on other websites such as METACRAN, which
provides a description of ggplot2 at http://www.r-pkg.org/pkg/ggplot2
• Package website (https://ggplot2.tidyverse.org/): You will find the
description and related resources for the package such as installation,
cheatsheet, usage, lifecycle, and so on. Not every package has a website. If
such a website exists, it is the official starting point for learning about the
package.
• Package source code (https://github.com/tidyverse/ggplot2/): The source
code of the package is hosted on GitHub. If you are interested in the
implementation of the package functions, you can check out the source code
of the package.

Installing package from CRAN

CRAN archives R packages and distributes them to more than 120 mirrors
around the world. You can visit CRAN Mirrors
(https://cran.r-project.org/mirrors.html) and look for a nearby mirror. If you
find one, you can go to Tools | Global Options and open the following dialog:

You can change the CRAN mirror to a nearby one or keep the default mirror.
A nearby mirror makes downloads faster.

Once you have chosen the mirror, installing an R package is easy. For example,
install.packages("ggplot2") downloads the package, compiles it if necessary,
and installs it.

You can also install a package using RStudio. Go to the Tools | Install
Packages menu option.
As stated in its description, a package may have some dependencies.
install.packages() takes care of the dependencies and installs them before
installing the package.
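A script that depends on a package often checks whether the package is installed before trying to install it. The sketch below uses requireNamespace(), which returns TRUE or FALSE. is_installed is a hypothetical helper name, and the install step is left as a comment because it needs network access.

```r
# Hypothetical helper: TRUE if the package can be loaded
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

print(is_installed("stats"))     # TRUE: stats ships with R
print(is_installed("ggplot2"))   # depends on your library

# Install only when the package is missing
# if (!is_installed("ggplot2")) install.packages("ggplot2")
```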

Update package from CRAN

By default, install.packages() installs the latest version of the specified
package. However, after some time we may need an updated version to fix
bugs or take advantage of new features. An older version may also contain
deprecated functions that produce warnings.

RStudio provides an Update button next to Install in the package pane.
Alternatively, we can use the update.packages() command to update the
packages.

Both RStudio and the command line scan for newer versions and install the
updated packages together with their dependencies.

Install package from online repositories

Many package authors publish their packages on GitHub, where version control
and community development are easy. On several occasions, the latest version
of a package is available first on GitHub. If you want to try the latest
development version of a package, you can install it directly from the online
repository using the devtools package.

For this, you need to install the devtools package if it is not already installed.

install.packages("devtools")

Then you can use install_github() in the devtools package to install the
latest development version of a package

library(devtools)
install_github("tidyverse/ggplot2")

The devtools package will download the source code from GitHub and build it
into a package in your library. If the package already exists in the library,
the installation will replace it without asking.

If, for some reason, you want to return to the latest CRAN version, you can
do so by running the following command

install.packages("ggplot2")

Package Functions

If you want to use a function that is part of a package, you can do so using
either library() or package::function(). The second option uses the function
without attaching the whole package to the environment.

For example, if we want to calculate the skewness of a numeric vector x, we
can use the skewness() function that is part of the moments package. We can
call it in the following ways

library(moments)
skewness(x)

Alternatively, we can use the function without attaching the package, using ::

moments::skewness(x)

We receive the same output from both methods. However, they have a different
impact on the environment. The first method (using library()) modifies the
search path for symbols, whereas the second method (using ::) does not. When
you call library(moments), the package is attached to the search path and its
functions can be used directly in the subsequent code.

We can call sessionInfo() to see the packages that we are using. Its output
shows the R version and the lists of attached and loaded packages. When we use
:: to access a function in a package, the package is not attached, but it is
loaded into memory.

After a call such as moments::skewness(x), sessionInfo() lists moments among
the loaded packages but not among the attached ones. We can attach the package
by calling library(moments). In this case, skewness() and the other functions
in the moments package are directly available.
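The difference between a loaded and an attached package can be checked directly. The sketch below is an illustration under one assumption: it uses the tools package, which ships with R and is normally not attached in a fresh session, as a stand-in for a contributed package such as moments.

```r
# 'tools' ships with R but is not attached in a default session, so it
# stands in here for a contributed package such as moments.
tools::toTitleCase("hello world")               # call through :: without attaching

loaded_only <- "tools" %in% loadedNamespaces()  # the namespace is now loaded
loaded_only

library(tools)                                  # attach the package
attached <- "package:tools" %in% search()       # it now appears on the search path
attached
```

loadedNamespaces() lists everything held in memory, while search() lists only what has been attached.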

We can see the attached packages using search()

> search()
[1] ".GlobalEnv" "tools:rstudio" "package:stats"
[4] "package:graphics" "package:grDevices" "package:utils"
[7] "package:datasets" "package:methods" "Autoloads"
[10] "package:base"

To attach a package, we can also use require(), which is similar to
library(), but it returns a logical value to indicate whether the package is
successfully attached

> loaded <- require(moments)
> loaded
[1] TRUE

You can use require(moments) in your code. However, there is a difference
between require() and library(). If the package to be attached is not
installed or does not exist at all (maybe because of a typo), require()
produces a warning whereas library() produces an error.

> require(textPkg)
Loading required package: textPkg
Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE,
logical.return = TRUE, :
there is no package called 'textPkg'
> library(textPkg)
Error in library(textPkg) : there is no package called
'textPkg'

This difference matters when your R script is long and time consuming. If the
script uses require() and the required package is not installed on the
machine, the script continues after the warning and fails only later, when a
function from the missing package is first called. With library(), the script
stops immediately at the library() call.
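The behaviour described above can be observed without stopping the session. This is a minimal sketch: the package name is hypothetical and assumed not to be installed, and tryCatch() is used only to capture the error for display.

```r
pkg <- "notARealPackage123"   # hypothetical name, assumed not to be installed

# require() signals failure through its return value (plus a warning)
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
ok

# library() raises an error, which tryCatch() intercepts here
res <- tryCatch(
  library(pkg, character.only = TRUE),
  error = function(e) "library() raised an error"
)
res
```

If you prefer require(), test its return value and stop the script yourself when it is FALSE.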

Masking and name conflicts

When you start a fresh R session, the basic packages are automatically
attached. The base packages include base, stats, graphics, and so on. You can
directly use the functions of these base packages. For example, if you want to
calculate the average of a numeric vector, you can directly use mean()
without using base::mean().

Thousands of functions are available as soon as a fresh R session starts.
Therefore, two different packages may have functions with the same name, and
these may conflict with each other. For example, suppose two packages A and B
both have a function named X. In this case, if you attach A and then attach B,
the function A::X will be masked by the function B::X. In other words, if you
attach A and call X(), then A's X is called. If you then attach B and call
X(), B's X is called. This mechanism is known as masking.
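Masking can be imitated without installing any package, because the global environment comes before all attached packages on the search path. The following sketch is only an illustration: the user-defined filter() is a made-up function that masks stats::filter().

```r
# A function defined in the global environment masks stats::filter(),
# because .GlobalEnv comes first on the search path.
filter <- function(x, ...) "user-defined filter"

masked   <- filter(1:3)                     # finds the global definition
original <- stats::filter(1:3, rep(1, 2))   # :: still reaches the stats version

masked
class(original)

rm(filter)                                  # remove the mask again
```

The :: form always reaches the intended function, no matter what has been attached or defined later.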

R has a powerful data manipulation package, dplyr. It contains a family of
functions that ease the manipulation of tabular data. When we attach the
package, the R console shows messages that some existing functions have been
masked by package functions with the same name.

library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

filter, lag

The following objects are masked from 'package:base':

intersect, setdiff, setequal, union

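The dplyr versions of intersect(), setdiff(), setequal(), and union() do not
change the meaning and usage of the base functions; they generalize them and
remain compatible with the masked versions. The dplyr functions filter() and
lag(), in contrast, serve a different purpose from the stats functions of the
same name. In either case the masked functions are not broken: they remain
available through ::, for example stats::filter().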

Package functions that mask base functions generally generalize them rather
than replace them. However, if you need to use two packages whose functions
share the same name, you should not attach the packages. Instead, you can
extract the functions from both packages as shown below

fun1 <- package1::a_function
fun2 <- package2::a_function

If you want to detach an already attached package, you can call
unloadNamespace() as follows

unloadNamespace("moments")

As soon as the package is detached, the package functions are no longer
directly available:

skewness(c(1, 2, 3, 2, 1))
Error in eval(expr, envir, enclos): could not find function
"skewness"

However, you can still use :: to call the function:

moments::skewness(c(1, 2, 3, 2, 1))
[1] 0.3436216

3. Basic Objects

To learn any programming language, the first step is to get familiar with its
basic objects and their behaviour. In this chapter you will learn to

• Create and subset atomic vectors such as numeric vectors, character
vectors, and logical vectors. You will also learn about matrices, arrays,
lists, and data frames.
• Define and work with functions

We need different types of objects while solving any problem. Each object has
its own properties and behaviour. To solve real-world business problems, you
need to understand how the basic objects work. This helps you solve problems
with more elegant code and fewer steps. A concrete understanding of object
behaviour lets you spend more time on solving the problem and less time on
fixing countless minor issues.

We will study a variety of basic objects in R that represent different types of
data and make it easy to analyse and visualize datasets. At the end of this
section, you will have a basic understanding of how these objects work and
how they interact with each other.

Vector

A vector is one of the building blocks of all R objects. It contains primitive
values of the same type. A vector can be a group of numbers, text strings,
true/false values, or values of some other type. Several types of vector
exist in R. The most commonly used are numeric vectors, logical vectors, and
character vectors.

Numeric Vector

A vector of numeric values is a numeric vector. The simplest numeric vector is
a scalar number. For example

> 1.5
[1] 1.5

The most frequently used data type in R is the numeric vector. It is the
foundation of every data analysis process. In other programming languages
such as C, C++, and Java, you handle a variety of scalar types such as
integer, double, and string. In R, we formally do not have such scalar types.
A scalar number is a special case of a numeric vector with length 1.

After creating a value, we can store it in a variable for future use. We can
use the equal operator (=), the leftward operator (<-), or the rightward
operator (->). We can create a variable in the following ways

> # equal operator
> x = 1.5
> x
[1] 1.5
> # leftward operator
> y <- 2.5
> y
[1] 2.5
> # rightward operator
> 3.5 -> z
> z
[1] 3.5

Once the variable is created and the value is stored, we can use the variable
to represent the value from then on.

We can create a numeric vector in multiple ways, such as calling numeric()
to create a zero vector of a given length:

> numeric(10)
[1] 0 0 0 0 0 0 0 0 0 0

By using c(), we can combine several vectors to make one vector. For
example, we can combine several single-element vectors to create a multi-
element vector.

> c(1,2,3,4)
[1] 1 2 3 4

We can use the : operator to create a series of consecutive integers. The :
operator creates an integer vector rather than a double-precision numeric
vector

> 1:5
[1] 1 2 3 4 5
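The distinction between an integer vector and a numeric vector created with c() can be inspected with class(). The short sketch below is an illustration only. It also shows that the two kinds of vector hold equal values even though they are stored differently.

```r
class(1:5)                    # "integer"
class(c(1, 2, 3))             # "numeric" (stored as double)
is.numeric(1:5)               # TRUE: integers still count as numeric

same_object <- identical(1:3, c(1, 2, 3))   # FALSE: the storage types differ
same_values <- all(1:3 == c(1, 2, 3))       # TRUE: the values are equal
same_object
same_values
```

In everyday arithmetic the difference rarely matters, because R converts integers to doubles when needed.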

We should be careful while using the : operator. Consider the following
example

> 1+1:5
[1] 2 3 4 5 6

In this case, 1+1:5 does not mean the sequence from 2 to 5; it means the
sequence from 2 to 6. The : operator has a higher priority than +. Hence, 1:5
is evaluated first and then 1 is added to each entry.
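Parentheses make the intended order explicit. The sketch below contrasts the two readings; the variable n is introduced only for illustration.

```r
n <- 5

a <- 1 + 1:n      # : binds first -> 2 3 4 5 6
b <- (1 + 1):n    # parentheses first -> 2 3 4 5

c1 <- 1:n - 1     # 0 1 2 3 4
c2 <- 1:(n - 1)   # 1 2 3 4

a; b; c1; c2
```

The 1:n - 1 versus 1:(n - 1) pair is a frequent source of off-by-one mistakes in loops, so the parentheses are worth writing.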

To create a numeric sequence, we can use seq(). The following code produces
a numeric vector of a sequence from 1 to 10 with an increment of 2

> seq(1,10,2)
[1] 1 3 5 7 9

Functions such as seq() have many parameters. When calling a function, we can
provide values for these parameters, but in most cases the function simply
uses its default values. We need to pass a parameter only when we want to
change its default value.

We can create a numeric vector that starts from 2 with length 7 by specifying
the length.out parameter.

> seq(2,length.out=7)
[1] 2 3 4 5 6 7 8
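Naming the parameters makes calls to seq() easier to read. The following sketch shows a few combinations of the from, to, by, and length.out parameters of seq(), together with seq_len(), a related base function not discussed above.

```r
s1 <- seq(from = 1, to = 10, by = 2)   # 1 3 5 7 9
s2 <- seq(0, 1, length.out = 5)        # 0.00 0.25 0.50 0.75 1.00
s3 <- seq(10, 1, by = -3)              # 10 7 4 1 (a negative step counts down)
s4 <- seq_len(4)                       # 1 2 3 4

s1; s2; s3; s4
```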

Logical vector

A logical vector stores a group of TRUE or FALSE values. They represent yes
or no answers to a group of logical questions.

The simplest logical vectors are TRUE and FALSE themselves:

> TRUE
[1] TRUE

We can obtain the logical vector by asking logical questions about R object. For
example, we can ask whether 1 is greater than 2 by using the following:

> 1 > 2
[1] FALSE
> 1 < 2
[1] TRUE

The answer yes is represented by TRUE and no is represented by FALSE.

If we want to perform multiple comparisons at the same time, we can directly
use numeric vectors in the question:

> c(1,2)>2
[1] FALSE FALSE
> c(1,2)>1
[1] FALSE TRUE

R performs element-wise comparison between c(1,2) and 2. In other words,
the first expression is equivalent to c(1>2,2>2).

We can also compare two multi-element numeric vectors. For this comparison,
the length of the longer vector should be a multiple of the length of the
shorter one

> c(1,2) > c(2,1)
[1] FALSE TRUE

This expression is equivalent to c(1>2,2>1). We can consider another
example to demonstrate how we can compare two vectors of different lengths.

> c(2,3) > c(1,-2,3,-2)
[1] TRUE TRUE FALSE TRUE

In this case, the shorter vector is recycled. Hence the vector c(2,3) becomes
c(2,3,2,3) and the comparison becomes c(2>1,3>-2,2>3,3>-2). More
specifically, the shorter vector is repeated until there is a comparison for
each element in the longer vector.
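Recycling can be made explicit with rep(). The sketch below reproduces the comparison above and also shows what happens when the longer length is not a multiple of the shorter one. The variable names are illustrative, and tryCatch() is used only to capture the warning text.

```r
x <- c(2, 3)
y <- c(1, -2, 3, -2)

# rep() spells out what recycling does implicitly
recycled <- rep(x, length.out = length(y))   # 2 3 2 3
same <- identical(x > y, recycled > y)
same

# Lengths 3 and 2: R still recycles, but it issues a warning
msg <- tryCatch(c(1, 2, 3) > c(2, 1),
                warning = function(w) conditionMessage(w))
msg
```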

In R programming, we have several binary comparison operators: == for
equality, > for greater than, >= for greater than or equal to, < for less
than, and <= for less than or equal to.

R also provides the %in% operator. It tells whether each element in the
left-hand side vector is contained in the right-hand side vector:

> 1 %in% c(1,2,3)
[1] TRUE
> 1 %in% c(2,3,4)
[1] FALSE

> c(1,2) %in% c(1, 3, 4)
[1] TRUE FALSE

The %in% operator does not recycle. It iterates over the elements of the
vector on the left-hand side and performs c(1 %in% c(1, 3, 4), 2 %in%
c(1, 3, 4))

Character Vector

A character vector is a group of strings. Here, character does not literally
mean a single letter or symbol in a language; it means a string such as
"this is a string". We can use either double quotation marks or single
quotation marks to create a character vector, as follows:

> "hello world"
[1] "hello world"
> 'hello world'
[1] "hello world"

To construct a multi-element character vector, the combine function c() can
also be used:

> c('hello','world')
[1] "hello" "world"

To check whether two vectors have equal values in corresponding positions,
we can use ==

> c("Hello","World") == c('Hello','World')
[1] TRUE TRUE

The character vectors are equal because " and ' both work to create a string
and do not affect its value. However, the quotes at the beginning and end of
a string must both be double quotes or both be single quotes. They cannot be
mixed.

> c("Hello","World") == "Hello,World"
[1] FALSE FALSE

We get FALSE for both elements, because neither Hello nor World equals
Hello,World.

While working with a string (a single-element character vector), keep the
following in mind
1. You can insert double quotes into a string that starts and ends with
single quotes.

> cat('You are attending "R Programming" Class')
You are attending "R Programming" Class

2. You can insert single quotes into a string that starts and ends with
double quotes.

> cat("You are attending 'R Programming' Class")
You are attending 'R Programming' Class

3. You cannot directly insert double quotes into a string that starts and
ends with double quotes

> cat("You are attending "R Programming" Class")
Error: unexpected symbol in "cat("You are attending "R"

4. You cannot directly insert single quotes into a string that starts and
ends with single quotes

> cat('You are attending 'R Programming' Class')
Error: unexpected symbol in "cat('You are attending 'R"

5. You can use the escape character (\) to insert double quotes into a string
that starts and ends with double quotes

> cat("You are attending \"R Programming\" Class")
You are attending "R Programming" Class
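The escape character is part of how the string is written, not part of the string itself. The following sketch illustrates this by comparing print() and cat() and by counting characters with nchar(). The variable names are chosen only for illustration.

```r
s <- "You are attending \"R Programming\" Class"

print(s)        # print() displays the escapes as they are typed
cat(s, "\n")    # cat() displays the string as it is stored

quote_length <- nchar("\"")   # 1: the backslash is not stored in the string
quote_length

# The same value written with single quotes needs no escaping
same <- identical(s, 'You are attending "R Programming" Class')
same
```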

Subsetting Vectors

You can access specific entries of a vector using indexing. We use [ ]
brackets for indexing. The indexing starts at position 1. We can drop an
element by using a negative index.

To start with, we create a vector using the c() function.

> m <- c("Jan","Feb","Mar","Apr","May","Jun")
> m
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun"

Now we can access an element in the vector using the position

> m[2]
[1] "Feb"

We can access a range of elements using :. In this case we get the elements
from the 2nd to the 4th position

> m[2:4]
[1] "Feb" "Mar" "Apr"

We can access several elements using a vector of positions. In the following
example, we get the elements in the 2nd, 4th, and 5th positions.

> m[c(2,4,5)]
[1] "Feb" "Apr" "May"

We can also use logical indexing. In this case, wherever the value is TRUE, we
get the corresponding element

> m[c(TRUE,FALSE,TRUE,TRUE,FALSE,FALSE)]
[1] "Jan" "Mar" "Apr"

We can also use negative indexing. When we use negative indexing, the
elements at the given positions are dropped.

> m[c(-2,-6)]
[1] "Jan" "Mar" "Apr" "May"

But we cannot use positive and negative numbers together

> m[c(2,-4)]
Error in m[c(2, -4)] : only 0's may be mixed with negative
subscripts

If we subset the vector using positions beyond the range of the vector, the
non-existent positions are returned as NA. In the following example, we
subset the vector using the non-existent 7th position, and the missing value
is represented by NA

> m[c(2,7)]
[1] "Feb" NA
> m[c(2:7)]
[1] "Feb" "Mar" "Apr" "May" "Jun" NA

We can overwrite a specific element of the vector as follows

> m[1] <- "Jul"
> m
[1] "Jul" "Feb" "Mar" "Apr" "May" "Jun"

We can also overwrite multiple elements at different positions as follows

> m[2:4] <- c("Aug","Sep","Oct")
> m
[1] "Jul" "Aug" "Sep" "Oct" "May" "Jun"

We can also use logical selectors to overwrite multiple elements at
different positions

> m[c(FALSE, FALSE,FALSE,FALSE,TRUE,TRUE)] <- c("Nov","Dec")
> m
[1] "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

Suppose we have a numeric vector as follows

> num <- c(1,2,3,4,5,6)
> num
[1] 1 2 3 4 5 6

We can select values by logical criteria. For example, the following code
picks out all the elements that are not greater than 2 in num

> num[num <= 2]
[1] 1 2

We can use a more complex selection criterion. For example, suppose you want
to pick all the elements of num that satisfy 𝑥² − 𝑥 + 1 ≥ 2

> num[num^2 - num + 1 >=2]
[1] 2 3 4 5 6
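It can help to look at the logical vector that drives such a selection. The sketch below stores the criterion in a variable, named keep here for illustration, and also combines two conditions with the & operator, which is not covered above.

```r
num <- c(1, 2, 3, 4, 5, 6)

keep <- num^2 - num + 1 >= 2   # one TRUE/FALSE answer per element
keep
num[keep]                      # the elements where keep is TRUE
which(keep)                    # the positions where keep is TRUE

mid <- num[num >= 2 & num <= 4]   # both conditions must hold
mid
```

Selecting by a logical vector is the same mechanism as the logical indexing shown earlier; the comparison merely generates the TRUE and FALSE values.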

We can replace all the elements that satisfy 𝑥 ≤ 2 with 0

> num[num <= 2] <- 0
> num
[1] 0 0 3 4 5 6

We can also assign to a non-existing position of the vector. In this case, the
vector expands automatically and NA is assigned to the unassigned positions

> num[8] <- 8
> num
[1] 0 0 3 4 5 6 NA 8

Named Vector

We can assign names to the elements of a vector. We can assign the names
when we create the vector

> n = c("First" = "Mary", "Last" = "John")
> n
First Last
"Mary" "John"

We can also create the named vector without quoting the names

> n = c(First = "Mary", Last = "John")
> n
First Last
"Mary" "John"

Now we can access the elements as

> n["First"]
First
"Mary"

We can also get multiple elements. But in this case, we have to pass the name
of the elements as vector

> n[c("First","Last")]
First Last
"Mary" "John"

We can also reverse the order with a character string index vector

> n[c("Last","First")]
Last First
"John" "Mary"

If the character index vector has duplicate elements, the selection will
contain the corresponding elements more than once

> n[c("First","First","Last")]
First First Last
"Mary" "Mary" "John"

We can get the names of the vector using names()

> names(n)
[1] "First" "Last"

We can change the names of the vector elements by assigning another character
vector to its names

> names(n) <- c("GivenName","SurName")
> names(n)
[1] "GivenName" "SurName"
> n
GivenName SurName
"Mary" "John"

If we try to access an element that does not exist in the vector, we get a
vector with a single missing value and a missing name

> n["Last"]
<NA>
NA

If we provide a character vector in which some names exist in the vector
but others do not, the length of the selection vector is preserved

> n[c("Last","SurName")]
<NA> SurName
NA "John"
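To avoid missing values in the result, the requested names can be checked against names() first. This is a minimal sketch using the %in% operator introduced earlier. The variable names wanted, present, and found are illustrative.

```r
n <- c(GivenName = "Mary", SurName = "John")
wanted <- c("Last", "SurName")

present <- wanted %in% names(n)   # FALSE TRUE
present

found <- n[wanted[present]]       # keep only the names that exist
found
```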

Exercise

1. Create a numeric vector of length 5

> vec <- 21:25
> vec
[1] 21 22 23 24 25

2. Named Numeric Vector – Create names for the R object created in the
above step

> names(vec) <- c('sum1', 'num1', 'dum1', 'rum1', 'lum1')
> vec
sum1 num1 dum1 rum1 lum1
21 22 23 24 25

3. Print Vector – We can print the vector in the following ways

a. > vec
sum1 num1 dum1 rum1 lum1
21 22 23 24 25
b. > (vec)
sum1 num1 dum1 rum1 lum1
21 22 23 24 25
c. > print(vec)
sum1 num1 dum1 rum1 lum1
21 22 23 24 25
d. > show(vec)
sum1 num1 dum1 rum1 lum1
21 22 23 24 25

4. Find the length of the vector

> length(vec)
[1] 5

5. Find the structure of the vector

> str(vec)
Named int [1:5] 21 22 23 24 25
- attr(*, "names")= chr [1:5] "sum1" "num1" "dum1" "rum1" ...

6. Access the values without names

> unname(vec)
[1] 21 22 23 24 25

7. Access the names of the vector

> names(vec)
[1] "sum1" "num1" "dum1" "rum1" "lum1"

8. Access the elements of the vector based on names

> vec["sum1"]
sum1
21

9. Access multiple elements of the vector based on names

> vec[c('num1','rum1','lum1')]
num1 rum1 lum1
22 24 25

10. Access the elements based on positive indexes or positions – Access just
the 3rd element with names

> vec[3]
dum1
23

11. Access 2nd and 4th element with names

> vec[c(2,4)]
num1 rum1
22 24

12. Access consecutive elements with names based on indexes/position

> vec[3:5]
dum1 rum1 lum1
23 24 25

13. Access elements based on negative indexes/position – Access all
elements with names except the 2nd element

> vec[-2]
sum1 dum1 rum1 lum1
21 23 24 25

14. Access multiple elements based on negative indexes/position – Access
all the elements except the 3rd and 4th element with names

> vec[c(-3,-4)]
sum1 num1 lum1
21 22 25

15. Arrange elements in specific order based on indexes/position

> vec[c(3,2,4,1,5)]
dum1 num1 rum1 sum1 lum1
23 22 24 21 25

16. Arrange elements in specific order based on names

> vec[c('lum1', 'num1','sum1','dum1','rum1')]
lum1 num1 sum1 dum1 rum1
25 22 21 23 24

17. Arrange the vector in reverse order based on position

> rev(vec)
lum1 rum1 dum1 num1 sum1
25 24 23 22 21

18. Replace the names of a few elements

> names(vec)[c(2,4)] <- c('Lion','Tiger')
> vec
sum1 Lion dum1 Tiger lum1
21 22 23 24 25

19. Replace the values of a few elements

> vec[c(3,5)] <- c(42, 11)
> vec
sum1 Lion dum1 Tiger lum1
21 22 42 24 11

20. Arrange in the ascending order of values of elements

> vec[order(vec, decreasing = FALSE)]
lum1 sum1 Lion Tiger dum1
11 21 22 24 42

21. Arrange in the descending order of values of elements

> vec[order(vec, decreasing = TRUE)]
dum1 Tiger Lion sum1 lum1
42 24 22 21 11

22. Arrange in alphabetical order of names

> vec[sort(names(vec))]
dum1 Lion lum1 sum1 Tiger
42 22 11 21 24

23. Arrange in alphabetical order of names (reverse)

> vec[sort(names(vec), decreasing = TRUE)]
Tiger sum1 lum1 Lion dum1
24 21 11 22 42

24. Replace values with missing values

> vec[c(2,4)] <- NA
> vec
sum1 Lion dum1 Tiger lum1
21 NA 42 NA 11

25. Number of NA’s in vector

> table(is.na(vec))

FALSE TRUE
3 2

Extracting Element

We use [ ] to create a subset of a vector. But to extract an element from the
vector, we use [[ ]]. We get different results if we subset a named vector
using one entry and if we extract an element from it

> x <- c(a=1,b=2,c=3)
> x["a"]
a
1
> x[["a"]]
[1] 1

You can use [[ ]] to extract one element only. You cannot extract more than
one element

> x[[c("a","b")]]
Error in x[[c("a", "b")]] :
attempt to select more than one element in vectorIndex

You cannot use negative integers as well

> x[[-1]]
Error in x[[-1]] : invalid negative subscript in get1index
<real>

Subsetting a vector with a non-existing position or name produces missing
values. But [[ ]] raises an error if we try to extract an element that is
beyond the range.

> x[["d"]]
Error in x[["d"]] : subscript out of bounds
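The two operators can be compared side by side. The sketch below is an illustration only: it shows that [ ] keeps the name while [[ ]] drops it, and it uses tryCatch() to capture the error from [[ ]] so that the script keeps running.

```r
x <- c(a = 1, b = 2, c = 3)

names(x["a"])      # "a": [ ] returns a named sub-vector
names(x[["a"]])    # NULL: [[ ]] returns the bare value

missing_single <- x["d"]    # NA with a missing name, no error
missing_double <- tryCatch(x[["d"]], error = function(e) "error")

missing_single
missing_double
```

Because [[ ]] fails loudly, it is the safer choice when exactly one existing element is expected.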

Class of the vector

On several occasions, we need to know the kind of vector we are dealing with
before we can use it. We can use class() function to know about the class of
any R object.

> class(c(1,2,3))
[1] "numeric"
> class(c(TRUE, FALSE))
[1] "logical"
> class(c("Tiger","Snake"))
[1] "character"

To check whether an object is a vector of a specific class, we can use
is.numeric, is.logical, and is.character

> is.numeric(c(1,2,3))
[1] TRUE
> is.logical(c(TRUE,FALSE))
[1] TRUE
> is.character(c("Tiger","Snake"))
[1] TRUE

> is.numeric(c("Tiger","Snake"))
[1] FALSE

Converting Vectors

We can coerce vectors of one class to another class. For example, numbers
such as 1 and 20 can also be represented as strings. However, we cannot
perform numeric calculations on the string representation of numbers. Hence,
we need to convert them to numeric values.

To demonstrate such a conversion, let’s create a character vector:

> st <- c('1','2','3')
> st
[1] "1" "2" "3"
> class(st)
[1] "character"

We cannot perform the mathematical operations on this vector

> st + 10
Error in st + 10 : non-numeric argument to binary operator

Hence, we can use as.numeric() to convert the vector to a numeric vector

> num <- as.numeric(st)
> num
[1] 1 2 3
> class(num)
[1] "numeric"

Now we can perform the mathematical operations

> num + 1
[1] 2 3 4
> num + 10
[1] 11 12 13

In the previous section, we used the is.* functions, such as is.numeric,
is.logical, and is.character, to check the class of a given object. Similarly,
we can use the as.* function family to convert a vector from one class to
another

> as.numeric(c('1','2','3'))
[1] 1 2 3
> as.numeric(c('1','2','3','a'))
[1] 1 2 3 NA
Warning message:
NAs introduced by coercion
> as.logical(c(-1,0,1,2))
[1] TRUE FALSE TRUE TRUE
> as.character(c(1,2,3))
[1] "1" "2" "3"
> as.character(c(TRUE,FALSE))
[1] "TRUE" "FALSE"

Each type of vector can be converted to other types, but the conversion
follows a set of rules.

The second command in the previous code block tries to convert a character
vector to a numeric vector, as we did in the first command. However, the last
element, a, cannot be converted to a number. The conversion of the character
representations of numeric values was successful, but the conversion of the
character value a produced a missing value.

In the third command, we convert a numeric vector to a logical vector. The 0
value produces FALSE and non-zero values produce TRUE.
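Conversion also happens implicitly when values of different classes are combined with c(). The sketch below goes slightly beyond the text above and is offered only as an illustration of the same rules. The variable names are illustrative.

```r
class(c(1, TRUE))         # "numeric": TRUE becomes 1
class(c(1, "a", TRUE))    # "character": everything becomes a string

total <- sum(c(TRUE, FALSE, TRUE))            # TRUE counts as 1 -> 2
lg    <- as.logical(c("TRUE", "false", "abc"))  # TRUE FALSE NA
int   <- as.integer(3.9)                      # truncates toward zero -> 3

total; lg; int
```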

Arithmetic Operators

The arithmetic operations can be performed easily. They follow two rules
1. Computing in an element-wise manner
2. Recycling the shorter vector

> c(10,11,12,13) + 20
[1] 30 31 32 33
> c(20,21,22,23) - c(10,11,12,13)
[1] 10 10 10 10
> c(10,11,12,13) * c(1,2,3,4)
[1] 10 22 36 52
> c(10,15,20,25) / c(2,3,4,5)
[1] 5 5 5 5
> c(1,2,3,4)^2
[1] 1 4 9 16
> c(1,2,3,4) ^ c(1,2,3,4)
[1] 1 4 27 256
> c(2,3,4,5)%%2
[1] 0 1 0 1
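The examples above recycle only a single number. The second rule applies equally when the shorter vector has more than one element, as the following sketch shows. The variable names are illustrative.

```r
r1 <- c(1, 2, 3, 4) + c(10, 20)   # c(10, 20) is recycled to 10 20 10 20
r1                                 # 11 22 13 24

r2 <- c(1, 2, 3, 4) * c(1, -1)    # alternate the sign
r2                                 # 1 -2 3 -4

# The explicit equivalent of the first expression
explicit <- c(1, 2, 3, 4) + rep(c(10, 20), times = 2)
identical(r1, explicit)
```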

For named vectors, arithmetic operations do not match elements by name. The
operations are performed on the values by position. The names of the vector
on the left-hand side are kept and the names on the right-hand side are
ignored

> c(a=1,b=2,c=3)+c(d=1,e=2,f=3)
a b c
2 4 6
> c(a=1, b=2, 3)+c(d=1,e=2,f=3)
a b
2 4 6

Matrix

A matrix is a collection of data elements arranged in a two-dimensional
rectangular layout. An example of a matrix with 2 rows and 3 columns is

A = \begin{bmatrix} 1 & 2 & 5 \\ 2 & 3 & 7 \end{bmatrix}

In the R programming language, we can create a matrix using the matrix()
function. All data elements of a matrix must be of the same data type.

> A = matrix(
+ c(1, 3, 2, 4, 6, 7), #the data elements
+ nrow = 2, #desired number of rows
+ ncol = 3 #desired number of columns
+ )
> A #Print the matrix
[,1] [,2] [,3]
[1,] 1 2 6
[2,] 3 4 7

In this case, if you omit either nrow or ncol, R calculates the omitted value
from the other one and the number of data elements.

You may also fill the values by columns. In the case the values are populated
for column wise. By default, R fills the values by columns. You can specify the
option by giving byrow = FALSE

> A = matrix(
+ c(1, 3, 2, 4, 6, 7), #the data elements
+ nrow = 2, #desired number of rows
+ ncol = 3, #desired number of columns
+ byrow = FALSE #Fill the matrix column by column (default)
+ )
> A #Print the value of A
[,1] [,2] [,3]
[1,] 1 2 6
[2,] 3 4 7

To fill the matrix row by row instead, set the parameter byrow = TRUE

> A = matrix(
+ c(1, 3, 2, 4, 6, 7), #the data elements
+ nrow = 2, #desired number of rows
+ ncol = 3, #desired number of columns
+ byrow = TRUE #Fill the matrix row by row
+ )
> A #Print the value of A
[,1] [,2] [,3]
[1,] 1 3 2
[2,] 4 6 7
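The two fill orders can be compared side by side. This is only a sketch with illustrative variable names (x, by_col, by_row); it reuses the same six values as the examples above.

```r
x <- c(1, 3, 2, 4, 6, 7)

by_col <- matrix(x, nrow = 2, ncol = 3)                # default: column by column
by_row <- matrix(x, nrow = 2, ncol = 3, byrow = TRUE)  # row by row

by_col[1, ]   # 1 2 6
by_row[1, ]   # 1 3 2

# Filling a 3 x 2 matrix by column and transposing it
# gives the same values as the row-filled 2 x 3 matrix.
all(t(matrix(x, nrow = 3, ncol = 2)) == by_row)        # TRUE
```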

You can create a diagonal matrix with the diag() function.

> diag(1,nrow=5)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 0 1 0 0 0
[3,] 0 0 1 0 0
[4,] 0 0 0 1 0
[5,] 0 0 0 0 1

Here diag() returns a matrix with an equal number of rows and columns. Because
only nrow is given, R sets the number of columns equal to the number of rows.

Naming Rows and Columns

When we create a matrix, by default, no names are given to the rows and
columns. We can provide the row and column names while creating the matrix
by passing a list to the dimnames parameter

> A = matrix(
+ c(1, 2, 3, 4, 5, 6), #the data elements
+ nrow = 3, #desired number of rows
+ byrow = TRUE, #Fill the matrix row by row
+ dimnames = list( #To give names of rows and columns
+ c('r1','r2','r3'), #row name
+ c('c1','c2') #column name
+ ))
> A #Print the value of A
c1 c2
r1 1 2
r2 3 4
r3 5 6

We can also provide the names after creating the matrix

> B = matrix(
+ c(1, 2, 3, 4, 5, 6), #the data elements
+ nrow = 3) #desired number of rows
> B #Print the values of B
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> rownames(B) <- c('r1','r2','r3') #Specify row name
> colnames(B) <- c('c1','c2') #Specify col name
> B #Print the value of B
c1 c2
r1 1 4
r2 2 5
r3 3 6

Subsetting a Matrix

We often need to extract part of the data from a matrix. We can do so with
matrix subsetting operations.

As we saw in the previous section, a matrix is a two-dimensional rectangular
layout. To access a value in a matrix, we need a two-dimensional accessor
[ , ]. It works like the one-dimensional accessor [ ] for vectors.

To take a subset of a matrix, we supply one vector for each dimension,
separated by a comma. The first vector selects the rows and the second vector
selects the columns.

To extract only one element in the first row and the second column

> B[1,2]
[1] 4

We can subset it with a range of row and column position

> B
c1 c2
r1 1 4
r2 2 5
r3 3 6
> B[2:3,1]
r2 r3
2 3
> B[2:3,1:2]
c1 c2
r2 2 5
r3 3 6

If we leave one dimension blank, all the values in that dimension will be
returned

> B[,1] #select all rows, 1st column


r1 r2 r3
1 2 3
> B[,1:2] #select all rows, columns 1 and 2
c1 c2
r1 1 4
r2 2 5
r3 3 6
> B[1,] #select 1st row all cols
c1 c2
1 4
> B[1:2,] #select rows 1 and 2, all columns
c1 c2
r1 1 4
r2 2 5

We can use negative numbers to exclude positions

> B[,-1] #select all rows, exclude 1st column


r1 r2 r3
4 5 6
> B[-1,] #Exclude 1st row select all cols
c1 c2
r2 2 5
r3 3 6

For a named matrix, we can use character vector to subset it

> B[c('r1','r2'), #select rows named r1 and r2


+ c('c1')] #select cols named c1
r1 r2
1 2

Although a matrix can be represented and accessed in two dimensions, it is
still a vector whose values are stored column by column. Hence, we can also
use the one-dimensional accessor for vectors

> B[1]
[1] 1
> B[5]
[1] 5

Like a vector, a matrix contains entries of a single type. If you apply a
comparison to a matrix, you get a logical matrix of the same size

> B > 2
c1 c2
r1 FALSE TRUE
r2 FALSE TRUE
r3 TRUE TRUE
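A logical matrix of this kind can itself be used as a selector. The sketch below rebuilds B so that it runs on its own; the names mask and capped are illustrative and not part of the original example.

```r
B <- matrix(1:6, nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2")))

mask <- B > 2          # logical matrix with the same shape as B
B[mask]                # 3 4 5 6 (values are returned in column order)
sum(mask)              # 4 entries satisfy the condition

# A logical selector can also appear on the left-hand side of an assignment.
capped <- B
capped[capped > 4] <- 4L
capped[, "c2"]         # r1 = 4, r2 = 4, r3 = 4
```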

Matrix Operators

We can apply all the arithmetic operators to matrices as well. The operators
work element-wise, except for the matrix product operator, %*%.

> B
c1 c2
r1 1 4
r2 2 5
r3 3 6
> B + B #Addition
c1 c2
r1 2 8
r2 4 10
r3 6 12
> B - 0.5*B #Subtraction
c1 c2
r1 0.5 2.0
r2 1.0 2.5
r3 1.5 3.0
> B * B #Multiplication
c1 c2
r1 1 16
r2 4 25
r3 9 36
> B / 0.5*B #Division, evaluated left to right as (B / 0.5) * B
c1 c2
r1 2 32
r2 8 50
r3 18 72
> B^2 #Power
c1 c2
r1 1 16
r2 4 25
r3 9 36
> t(B) %*% B #Matrix Multiplication
c1 c2
c1 14 32
c2 32 77

The last example uses t(), which returns the transpose of a matrix.

> t(B)
r1 r2 r3
c1 1 2 3
c2 4 5 6
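Two points from the operator examples are easy to verify: the matrix product needs compatible dimensions, and `*` and `/` are evaluated from left to right. The following is a sketch under those assumptions; product_ok is a made-up helper variable.

```r
B <- matrix(1:6, nrow = 3)

dim(t(B) %*% B)   # 2 2  (2 x 3 times 3 x 2)
dim(B %*% t(B))   # 3 3  (3 x 2 times 2 x 3)

# B %*% B is not defined: a 3 x 2 matrix cannot be multiplied by a 3 x 2 matrix.
product_ok <- tryCatch({ B %*% B; TRUE }, error = function(e) FALSE)
product_ok        # FALSE

# * and / share the same precedence and are evaluated from left to right.
all(B / 0.5 * B == (B / 0.5) * B)   # TRUE
all(B / (0.5 * B) == 2)             # TRUE: element-wise division by 0.5 * B
```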

Arrays

Unlike matrices, arrays can have more than two dimensions. To create an
array, we can use the array() function and specify the dimensions with the
dim parameter.

> mularray = array( #Call array() function


+ c(1:24), #1 dimension vector with 24 values
+ dim=c(4,3,2) #specify 4x3x2 dimensional array
+ )
> mularray
, , 1

[,1] [,2] [,3]


[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12

, , 2

[,1] [,2] [,3]


[1,] 13 17 21
[2,] 14 18 22
[3,] 15 19 23
[4,] 16 20 24

We can create the array with names for these dimensions using dimnames

> mularray = array( #Call array() function


+ c(1:24), #1 dimension vector with 24 values
+ dim=c(4,3,2), #specify 4x3x2 dimensional array
+ dimnames = list(
+ c('x1','x2','x3','x4'),#specify the names of 1st dim
+ c('y1','y2','y3'), #specify the names of 2nd dim
+ c('z1','z2') #specify the names of 3rd dim
+ )
+ )
> mularray
, , z1

y1 y2 y3
x1 1 5 9
x2 2 6 10
x3 3 7 11
x4 4 8 12

, , z2

y1 y2 y3
x1 13 17 21
x2 14 18 22
x3 15 19 23
x4 16 20 24

For an array that already exists, we can set the names of each dimension
using dimnames(x) <-

> mularray1 = array( #Call array() function


+ c(1:24), #1 dimension vector with 24 values
+ dim=c(4,3,2) #specify 4x3x2 dimensional array
+ )
>
> dimnames(mularray1) <- list(
+ c('x1','x2','x3','x4'),#specify the names of 1st dim
+ c('y1','y2','y3'), #specify the names of 2nd dim
+ c('z1','z2') #specify the names of 3rd dim
+ )
> mularray1
, , z1

y1 y2 y3
x1 1 5 9
x2 2 6 10
x3 3 7 11
x4 4 8 12

, , z2

y1 y2 y3
x1 13 17 21
x2 14 18 22
x3 15 19 23
x4 16 20 24

Subsetting an array

We can subset an array exactly the same way as we subset a matrix

> mularray1[1,,]
z1 z2
y1 1 13
y2 5 17
y3 9 21
> mularray1[,1,]
z1 z2
x1 1 13
x2 2 14
x3 3 15
x4 4 16
> mularray1[,,1]
y1 y2 y3
x1 1 5 9
x2 2 6 10
x3 3 7 11
x4 4 8 12
> mularray1[3,2,1]
[1] 7
> mularray1[1:2,2:3,1]
y2 y3
x1 5 9
x2 6 10
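Named dimensions can be used as selectors too, in the same way as for a named matrix. The sketch below rebuilds the array under the illustrative name arr so that it runs on its own.

```r
arr <- array(1:24, dim = c(4, 3, 2),
             dimnames = list(c("x1", "x2", "x3", "x4"),
                             c("y1", "y2", "y3"),
                             c("z1", "z2")))

arr["x3", "y2", "z1"]            # 7, same as arr[3, 2, 1]
dim(arr[, , "z2"])               # 4 3: a single layer is an ordinary matrix
dim(arr[1, , , drop = FALSE])    # 1 3 2: drop = FALSE keeps all three dimensions
```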

As you may notice, atomic vectors, matrices, and arrays share almost the same
set of behaviours. A fundamental common feature they share is that they are
all homogeneous data types, that is, the type of elements they store must be
the same. However, there are also heterogeneous data types in R, that is, they
can store different types of elements, which makes them much more flexible
but they are less memory efficient and slower to operate.

List

A list in R can contain many different data types. A list is an ordered and
changeable collection of data. Lists are flexible: members of any type can be
stored together and extracted with the same small set of operators.

We can use list() to create a list and put different type of objects into one
list.

For example, the following variable x is a list of three vectors and a numeric
value
> n <- c(2,3,5) #Numeric Vector
> p <- c(TRUE, FALSE) #Logical Vector
> q <- c('a','b','c') #Character Vector
> x <- list(n,p,q,3) #Heterogeneous members
> x #Print List x
[[1]]
[1] 2 3 5

[[2]]
[1] TRUE FALSE

[[3]]
[1] "a" "b" "c"

[[4]]
[1] 3

Extracting an element from list

We can use $ sign to extract the value of a list element by name

> x <- list(


+ n=c(2,3,5), #Numeric Vector
+ p=c(TRUE, FALSE), #Logical Vector
+ q=c('a','b','c'), #Character Vector
+ 3)
> x$n
[1] 2 3 5
> x$p
[1] TRUE FALSE
> x$q
[1] "a" "b" "c"

We can also use double square bracket to extract the value of a list member

> x[[1]]
[1] 2 3 5
> x[[2]]
[1] TRUE FALSE
> x[[3]]
[1] "a" "b" "c"
> x[[4]]
[1] 3

You can also provide the name to extract the list member with that name

> x[["n"]]
[1] 2 3 5

Looking up a name that does not exist returns NULL, whereas using [[ ]] with
a position beyond the end of the list raises an error

> x$a
NULL
> x[[5]]
Error in x[[5]] : subscript out of bounds
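Because a missing name silently yields NULL, it can help to test for a member before using it. The following is one possible sketch; the variable fifth and the choice of NA as a fallback are assumptions made for illustration.

```r
x <- list(n = c(2, 3, 5), p = c(TRUE, FALSE), q = c("a", "b", "c"), 3)

is.null(x$a)          # TRUE: no member named "a"
is.null(x[["a"]])     # TRUE
"a" %in% names(x)     # FALSE: an explicit test for the name
length(x)             # 4, so position 5 does not exist

# Guard a positional lookup instead of letting it stop the script.
fifth <- if (length(x) >= 5) x[[5]] else NA
fifth                 # NA
```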

Subset a list

On several occasions, we need to extract multiple elements from a list. We can
use the single square bracket operator to subset a list. This notation is
consistent with the notation for vectors and matrices, and the result is
always a list.

> x["n"] #list item n


$n
[1] 2 3 5

> x[c("p","q")] #list item p and q


$p
[1] TRUE FALSE

$q
[1] "a" "b" "c"

> x[1] #list index 1


$n
[1] 2 3 5

> x[c(1,2)] #list index 1 and 2


$n
[1] 2 3 5

$p
[1] TRUE FALSE

> x[c(TRUE, FALSE, TRUE, FALSE)] #Logical Index


$n
[1] 2 3 5

$q
[1] "a" "b" "c"
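The difference between the single and double bracket operators shows up in the class of the result. A short sketch, rebuilding the list x so that it is self-contained:

```r
x <- list(n = c(2, 3, 5), p = c(TRUE, FALSE), q = c("a", "b", "c"), 3)

class(x["n"])            # "list": a sub-list with one member
class(x[["n"]])          # "numeric": the member itself
length(x[c("p", "q")])   # 2: a sub-list with two members
```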

Named Lists

Even if we named the members while creating the list, we can always name or
rename them later with names(). Here only three names are supplied for four
members, so the name of the fourth member becomes NA.

> names(x) <- c('Num','Pum','Dum')


> x
$Num
[1] 2 3 5

$Pum
[1] TRUE FALSE

$Dum
[1] "a" "b" "c"

$<NA>
[1] 3

We can remove the names of the list members by assigning NULL

> names(x) <- NULL


> x
[[1]]
[1] 2 3 5

[[2]]
[1] TRUE FALSE

[[3]]
[1] "a" "b" "c"

[[4]]
[1] 3

Once we remove the names of the list member, we cannot access the list
members by name anymore. We can still access them by position and logical
criterion.

Setting Values

We can modify value of the list member by assigning a new value.

> y <- list(


+ a = 1,
+ b = c(TRUE, FALSE),
+ c = c("aa","bb")
+ )
> y
$a
[1] 1

$b
[1] TRUE FALSE

$c
[1] "aa" "bb"

>
> y$a <- 2
> y
$a
[1] 2

$b
[1] TRUE FALSE

$c
[1] "aa" "bb"

If we assign a value to a non-existent member, a new member would be added


to the list with given name or position

> y$d <- c(1,2)


> y
$a
[1] 2

$b
[1] TRUE FALSE

$c
[1] "aa" "bb"

$d
[1] 1 2

We can even set multiple values at the same time

> y[c("b","c")] <- list(b="Updated Values", c=c(1,4))


> y
$a
[1] 2

$b
[1] "Updated Values"

$c
[1] 1 4

$d
[1] 1 2

We can remove one or more members from the list by assigning NULL to them

> y[c("c","d")] <- NULL


> y
$a
[1] 2

$b
[1] "Updated Values"

Other List Operations

To find out whether an R object is list or not, we can use is.list() function

> z <- list(


+ a=c(1:3),
+ b=c("car","jar")
+ )
> is.list(z)
[1] TRUE
> is.list(z$a)
[1] FALSE

In this case z is a list but z$a is a vector not a list.

We can also convert a vector into a list by using as.list() function


> zz <- as.list(c(a=1, b=2, c =3))
> zz
$a
[1] 1

$b
[1] 2

$c
[1] 3

We can convert a list to a vector by calling the unlist() function. If all the
members are of the same type, the result is a vector of that type.

> zz <- list(a=1, b=2, c =3)


> unlist(zz)
a b c
1 2 3

If we unlist a list that mixes numbers and text, all members are converted to
the closest type that every member can be converted to:

> zz <- list(a=1, b=2, c ="abc")


> unlist(zz)
a b c
"1" "2" "abc"

Here, zz$a and zz$b are numbers and can be converted to character values;
however, zz$c is a character vector and cannot be converted to a numeric
value. Therefore, the closest type that is compatible with all elements is a
character vector.
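The same rule applies to other mixtures. As a brief sketch, typeof() reveals which type unlist() settles on; the lists used here are illustrative.

```r
typeof(unlist(list(TRUE, 2L)))     # "integer": logical is promoted to integer
typeof(unlist(list(TRUE, 2.5)))    # "double"
typeof(unlist(list(1, "abc")))     # "character"

mixed <- unlist(list(a = TRUE, b = 2))
mixed                              # a = 1, b = 2
```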

Data Frame

A data frame is used for storing data tables. It is a list of vectors of equal
length.

A data frame presents data in the form of a table. Different columns can hold
different types of data: while the first column can be character, the second
and third can be numeric or logical. However, all values within one column
must have the same type.
The following table can be fully represented by a data frame

Name    Gender   Age   Major
John    Male     20    Statistics
Nancy   Female   19    Mathematics
Kate    Female   21    Computer Science

Create a Data Frame

We can create a data frame using data.frame() and give the data of each
column using a vector of the corresponding type

> batch <- data.frame(


+ Name = c("John","Nancy","Kate"),
+ Gender = c("Male","Female","Female"),
+ Age = c(20, 19, 21),
+ Major = c("Statistics","Mathematics","Computer Science")
+ )
> batch
Name Gender Age Major
1 John Male 20 Statistics
2 Nancy Female 19 Mathematics
3 Kate Female 21 Computer Science

We can also create the data frame from a list either by calling data.frame()
or as.data.frame().

> lst <- list (x=c(1,2,3), y=c("a","b","c"))


> data.frame(lst)
x y
1 1 a
2 2 b
3 3 c
> as.data.frame(lst)
x y
1 1 a
2 2 b
3 3 c

We can also create a data frame from a matrix

> mtx <- matrix(c(1:9),nrow = 3, byrow = FALSE)


> data.frame(mtx)
X1 X2 X3
1 1 4 7
2 2 5 8
3 3 6 9
> as.data.frame(mtx)
V1 V2 V3
1 1 4 7
2 2 5 8
3 3 6 9

Note that the conversion automatically assigns column names to the data
frame. However, if the columns or the rows of the matrix have already been
named, the names are preserved in the conversion.
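To see the names being preserved, a named matrix can be converted and inspected. This is a sketch; the objects named and df_named are made up for the illustration.

```r
named <- matrix(1:6, nrow = 2,
                dimnames = list(c("r1", "r2"), c("a", "b", "c")))

df_named <- as.data.frame(named)
names(df_named)       # "a" "b" "c"
rownames(df_named)    # "r1" "r2"
is.list(df_named)     # TRUE: a data frame is a list of columns
```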

Naming rows and columns

The data frame is a list that looks like a matrix. Hence, we can apply the
methods to access list and matrix on data frame also.

> df1 <- data.frame(id=1:5, x=c(-1, 0, 1, 2, 3),
+ y=c(0.76, 0.45, 0.56, 0.63, 0.71))
> df1
id x y
1 1 -1 0.76
2 2 0 0.45
3 3 1 0.56
4 4 2 0.63
5 5 3 0.71

We can rename the rows and columns in the same way as we do in the case of
a matrix.

> colnames(df1) <- c("id","x-value","y-value")


> rownames(df1) <- letters[1:5]
> df1
id x-value y-value
a 1 -1 0.76
b 2 0 0.45
c 3 1 0.56

d 4 2 0.63
e 5 3 0.71

Subset of a Data frame

Since a data frame is a matrix-like list of column vectors, we can use both sets
of notations to access the elements and subsets in a data frame.

Subset of a data frame as a list

Since the data frame can be regarded as a list of vectors, we can use the list
notations to extract a value. We can either use $ or [[ ]] operators to do so.

> df1$id
[1] 1 2 3 4 5
> df1[[1]]
[1] 1 2 3 4 5
> df1[["id"]]
[1] 1 2 3 4 5

The subset operator ([) allows us to use a numeric vector to extract columns
by position, a character vector to extract columns by name, or a logical vector
to extract columns by TRUE and FALSE selection:

> df1[1] #1st Column


id
a 1
b 2
c 3
d 4
e 5
> df1[1:2] #Column No 1 and 2
id x-value
a 1 -1
b 2 0
c 3 1
d 4 2
e 5 3
> df1["x-value"] #Column with Name “x-value”
x-value
a -1
b 0
c 1

d 2
e 3
> df1[c("x-value","y-value")] #Columns named "x-value" and "y-value"
x-value y-value
a -1 0.76
b 0 0.45
c 1 0.56
d 2 0.63
e 3 0.71
> df1[c(TRUE,FALSE, TRUE)] #using logical vector
id y-value
a 1 0.76
b 2 0.45
c 3 0.56
d 4 0.63
e 5 0.71

Subset a data frame as matrix

The list notation does not support row selection, whereas the matrix notation
supports both row selection and column selection. We can use the [row,
column] notation to subset a data frame by specifying a row selector and a
column selector, each of which can be a numeric, character, or logical vector.

The examples of column selector are:

> df1[,1]
[1] 1 2 3 4 5
> df1[,"x-value"]
[1] -1 0 1 2 3
> df1[,c("x-value","y-value")]
x-value y-value
a -1 0.76
b 0 0.45
c 1 0.56
d 2 0.63
e 3 0.71
> df1[,c(1:2)]
id x-value
a 1 -1
b 2 0
c 3 1
d 4 2
e 5 3
The examples of row selectors are

> df1[2:4,] #select rows 2,3,4 and all columns


id x-value y-value
b 2 0 0.45
c 3 1 0.56
d 4 2 0.63
> df1[c('a','e'),] #select rows names a, e and all cols
id x-value y-value
a 1 -1 0.76
e 5 3 0.71

We can use both column selector as well as row selector

> df1[c('a','e'),1:2]
id x-value
a 1 -1
e 5 3

Note that the matrix notation automatically simplifies the output. That is, if
only one column is selected, the result won't be a data frame but the values of
that column

> df1[1:4,'id']
[1] 1 2 3 4

To always keep the result as a data frame, even if it only has a single column,
we can use both notations together:

> df1[1:4,]['id']
id
a 1
b 2
c 3
d 4

In this case, the first group of brackets subsets the data frame as the matrix
with first four rows and all the columns. The second group of brackets subsets
the resultant data frame as list with only one column selected.

We can specify drop = FALSE to avoid simplification of the results

> df1[1:4,'id',drop=FALSE]
id
a 1
b 2
c 3
d 4
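The three forms discussed above can be compared with is.data.frame(). The sketch uses a reduced version of df1 with only two columns so that it runs on its own.

```r
df1 <- data.frame(id = 1:5, x = c(-1, 0, 1, 2, 3))

is.data.frame(df1[1:4, "id"])                 # FALSE: simplified to a vector
is.data.frame(df1[1:4, ]["id"])               # TRUE: matrix then list notation
is.data.frame(df1[1:4, "id", drop = FALSE])   # TRUE: simplification switched off
```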

Filtering Data

We can filter the rows of the data frame based on a criterion and then select
the desired columns. For example, we select the rows of df1 with y-value >=
0.6 and then select the columns id and x-value

> df1$`y-value` >= 0.6


[1] TRUE FALSE FALSE TRUE TRUE
> df1[df1$`y-value`>=0.6,c('id','x-value')]
id x-value
a 1 -1
d 4 2
e 5 3

The following code filters the rows of df1 by the criterion that the row name
must be among a, d, or e, and selects the id and x-value columns:

> rownames(df1) %in% c('a','d','e')


[1] TRUE FALSE FALSE TRUE TRUE
> df1[rownames(df1) %in% c('a','d','e'),c('id','x-value')]
id x-value
a 1 -1
d 4 2
e 5 3
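Several criteria can be combined with the element-wise operators & (and) and | (or). The sketch below rebuilds df1 with its original values; check.names = FALSE is used so that data.frame() keeps the hyphenated column names, and keep and result are illustrative names.

```r
df1 <- data.frame(id = 1:5,
                  `x-value` = c(-1, 0, 1, 2, 3),
                  `y-value` = c(0.76, 0.45, 0.56, 0.63, 0.71),
                  check.names = FALSE)
rownames(df1) <- letters[1:5]

# Keep rows where y-value is at least 0.6 AND x-value is positive.
keep <- df1$`y-value` >= 0.6 & df1$`x-value` > 0
keep                                    # FALSE FALSE FALSE TRUE TRUE

result <- df1[keep, c("id", "x-value")]
rownames(result)                        # "d" "e"
```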

Setting Values as a list

We can use $ and <- operators to assign a value

> df1$'y-value' <- c(0.52,0.67,0.75,0.61,0.49)


> df1
id x-value y-value
a 1 -1 0.52
b 2 0 0.67
c 3 1 0.75
d 4 2 0.61
e 5 3 0.49

We can use the [ ] operator as well as the [[ ]] operator. The [ ] operator
can change the values of multiple columns in one expression, whereas [[ ]]
changes the values of one column at a time.

> df1['y-value']<- c(0.6,0.7,0.8,0.9,0.75) #using [ ]


> df1
id x-value y-value
a 1 -1 0.60
b 2 0 0.70
c 3 1 0.80
d 4 2 0.90
e 5 3 0.75

> df1[['y-value']]<- c(0.3,0.4,0.5,0.6,0.7) #using [[ ]]


> df1
id x-value y-value
a 1 -1 0.3
b 2 0 0.4
c 3 1 0.5
d 4 2 0.6
e 5 3 0.7

> df1[c('x-value','y-value')] <-


+ list('x-value'=c(1,2,3,4,5),
+ 'y-value'=c(0.1,0.2,0.3,0.4,0.5)
+ )
> df1
id x-value y-value
a 1 1 0.1
b 2 2 0.2
c 3 3 0.3
d 4 4 0.4
e 5 5 0.5

Factors

In R programming, factors are used to represent the categorical data. These


data structures are used for predefined, finite number of values (categorical
data). For example, a data field such as marital status may contain only values
from single, married, divorced, or widowed. In such cases, we are aware of the
possible values beforehand. These predefined and distinct values are called
levels.

> person <- data.frame(


+ Name=c("Kate","Mate","Late","Jate"),
+ Age=c(24, 25, 35, 26),
+ Gender=c("Female","Male","Female","Female"),
+ MaritalStatus =c("Single","Single","Married","Single"),
+ stringsAsFactors = TRUE)
> str(person)
'data.frame': 4 obs. of 4 variables:
$ Name : Factor w/ 4 levels "Jate","Kate",..: 2 4 3 1
$ Age : num 24 25 35 26
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 1
$ MaritalStatus: Factor w/ 2 levels "Married","Single": 2 2 1 2

In the example above, Name, Gender, and MaritalStatus are not character
vectors but factors. They represent categorical data. For example, Gender has
the two levels Female and Male, and each value is stored as the integer
position of its level. The class of Name, Gender, and MaritalStatus is factor,
whereas the class of Age is numeric. We can confirm this as follows

> class(person$Name)
[1] "factor"
> class(person$Age)
[1] "numeric"

In this case, we can clearly see the levels (the unique values in the column)
and number of observations.

> str(person$MaritalStatus)
Factor w/ 2 levels "Married","Single": 2 2 1 2

It is reasonable to store categorical data as factors if the number of
distinct possible values is small, as in the case of Gender. But storing
free-text columns such as Name, where nearly every value is distinct, as
factors is not efficient.

A factor column cannot be assigned a value that is not one of its levels

> person[1,"Name"] <- "Fate"

Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "Fate") :
invalid factor level, NA generated

> person$Name
[1] <NA> Mate Late Jate
Levels: Jate Kate Late Mate

The reason for the warning message is that Fate is not among the levels of
Name. The levels were fixed when the data frame was created, using the unique
values of the original character vector, so the new value is replaced by NA.

This behavior is sometimes very annoying and does not really help much,
especially as memory is cheap today. The simplest way to avoid this behavior is
to set stringsAsFactors = FALSE when we create a data frame using
data.frame():

> person <- data.frame(


+ Name=c("Kate","Mate","Late","Jate"),
+ Age=c(24, 25, 35, 26),
+ Gender=c("Female","Male","Female","Female"),
+ MaritalStatus =c("Single","Single","Married","Single"),
+ stringsAsFactors = FALSE)
> str(person)
'data.frame': 4 obs. of 4 variables:
$ Name : chr "Kate" "Mate" "Late" "Jate"
$ Age : num 24 25 35 26
$ Gender : chr "Female" "Male" "Female" "Female"
$ MaritalStatus: chr "Single" "Single" "Married" "Single"

In versions of R before 4.0.0, the default was stringsAsFactors = TRUE. Since
R 4.0.0, the default is stringsAsFactors = FALSE.
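When a factor is the right choice, a new value can still be stored by registering it as a level first. The following is a sketch of two possible approaches; status, status2, and allowed are illustrative names that do not come from the original example.

```r
status <- factor(c("Single", "Single", "Married", "Single"))
levels(status)        # "Married" "Single"
as.integer(status)    # 2 2 1 2: positions of the levels

# Option 1: register the new level, then assign it.
levels(status) <- c(levels(status), "Divorced")
status[1] <- "Divorced"
as.character(status)  # "Divorced" "Single" "Married" "Single"

# Option 2: declare every allowed value up front.
allowed <- c("Single", "Married", "Divorced", "Widowed")
status2 <- factor(c("Single", "Married"), levels = allowed)
nlevels(status2)      # 4
```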

Useful functions for Data Frame

The summary() function provides summary statistics for each column. For a
numeric column, it shows the minimum, the quartiles, the mean, and the
maximum. For a character column, it shows the length, class, and mode; for a
factor column, it shows the count of each level. The summary of text columns
therefore depends on the value of stringsAsFactors.

> person <- data.frame(
+ Name=c("Kate","Mate","Late","Jate"),
+ Age=c(24, 25, 35, 26),
+ Gender=c("Female","Male","Female","Female"),
+ MaritalStatus =c("Single","Single","Married","Single"),
+ stringsAsFactors = FALSE)
> summary(person)
Name Age Gender MaritalStatus
Length:4 Min. :24.00 Length:4 Length:4
Class :character 1st Qu.:24.75 Class :character Class :character
Mode :character Median :25.50 Mode :character Mode :character
Mean :27.50
3rd Qu.:28.25
Max. :35.00

> person <- data.frame(


+ Name=c("Kate","Mate","Late","Jate"),
+ Age=c(24, 25, 35, 26),
+ Gender=c("Female","Male","Female","Female"),
+ MaritalStatus =c("Single","Single","Married","Single"),
+ stringsAsFactors = TRUE)
> summary(person)
Name Age Gender MaritalStatus
Jate:1 Min. :24.00 Female:3 Married:1
Kate:1 1st Qu.:24.75 Male :1 Single :3
Late:1 Median :25.50
Mate:1 Mean :27.50
3rd Qu.:28.25
Max. :35.00

We can bind multiple data frames either by row or by column using rbind() and
cbind(). As their names suggest, they perform row binding and column binding
respectively

For example, if we want to add a new record of a person, we can use rbind()

> rbind(person, data.frame(Name = "Bate", Age = 20,
+ Gender = "Female", MaritalStatus = "Single"))
Name Age Gender MaritalStatus
1 Kate 24 Female Single
2 Mate 25 Male Single
3 Late 35 Female Married
4 Jate 26 Female Single
5 Bate 20 Female Single

Similarly, if we want to add two new columns to indicate the nationality and
education level, we can use cbind()

> cbind(person, data.frame(
+ Nationality = c("USA","UK","France","Australia"),
+ Education = c("Graduate","High School","Post Graduate","Graduate")))
Name Age Gender MaritalStatus Nationality Education
1 Kate 24 Female Single USA Graduate
2 Mate 25 Male Single UK High School
3 Late 35 Female Married France Post Graduate
4 Jate 26 Female Single Australia Graduate

Note that rbind() and cbind() do not modify the original data but create a
new data frame with given rows or columns appended.
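In practice this means the result has to be assigned to a variable to keep it. A minimal sketch with a reduced two-column version of person; new_row and extended are illustrative names.

```r
person <- data.frame(Name = c("Kate", "Mate"), Age = c(24, 25),
                     stringsAsFactors = FALSE)
new_row <- data.frame(Name = "Bate", Age = 20, stringsAsFactors = FALSE)

extended <- rbind(person, new_row)
nrow(person)     # 2: the original data frame is unchanged
nrow(extended)   # 3

# Assign the result back to the same name to keep the change.
person <- cbind(person, Nationality = c("USA", "UK"))
names(person)    # "Name" "Age" "Nationality"
```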

Loading and Writing data on the disk

R provides a number of functions to read a table from a file or write a data


frame to a file.

To read data into the R environment, we only need to call read.csv(file),
where file is the path of the file. A relative path is resolved against the
working directory, so place the data folder directly in your working
directory; call getwd() to find out which directory that is.

If we need to save a data frame to a CSV file, we may call write.csv() with
the data frame, the file path, and some additional arguments.

write.csv(person, "data/persons.csv", row.names = FALSE,
          quote = FALSE)

The argument row.names = FALSE avoids storing the row names which are
not necessary, and the argument quote = FALSE avoids quoting text in the
output
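A write followed by a read can be tried without touching the project folder by using a temporary file. This is a sketch only: the path comes from tempfile() rather than the data folder mentioned above, and the small person data frame and the name loaded are made up for the example.

```r
person <- data.frame(Name = c("Kate", "Mate"), Age = c(24, 25),
                     stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")              # throwaway file location
write.csv(person, path, row.names = FALSE)      # write without row names

loaded <- read.csv(path, stringsAsFactors = FALSE)
names(loaded)                                   # "Name" "Age"
nrow(loaded)                                    # 2

unlink(path)                                    # remove the temporary file
```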

Functions

A function is an object that contains internal logic. It takes a group of
inputs (parameters or arguments) and returns a value as output.

In previous sections, we used several built-in functions of R such as
is.numeric(). This function takes any R object as an input and returns a
logical value (indicating whether the object is a numeric vector) as output.
Similarly, we can use is.function() to determine whether a given object is a
function.

For typical interactive data analysis, the functions built into R and those
provided by thousands of packages are sufficient. However, if you want to
repeat a piece of logic or a process in your data manipulation or analysis
project, the existing functions may not fully serve your purpose, because they
are not designed to meet the specific needs of your task. In that case, you
need to create your own functions targeting a specific set of demands.

Creating a function

You can easily create a function. For example, suppose you need a function
that simply adds two objects x and y.

> add <- function(x, y){x + y}

The syntax function(x,y) specifies two arguments named x and y. The function
body consists of a series of expressions in terms of x and y and other
symbols. In this case the function body is {x + y}. The function returns the
value of the last expression unless return() is called inside the function.
The function is assigned to a variable add that can be used to call the
function later on.

In R Programming language, the functions act like other objects. We can see
the function by typing add in the console.

> add
function(x, y){x + y}

Calling a function

Once the function is defined, we can call the function as follows

> add (2,3)


[1] 5

When we call the function add, R looks for a function named add in the
environment. If it finds one, it creates a local environment in which x takes
the value 2 and y takes the value 3. The expression in the function body is
then evaluated with these values, and the function returns 5.
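Arguments can also be given default values and matched by name, and return() can be called explicitly. The function add_default below is a hypothetical variant written for this sketch; it is not part of the original example.

```r
# A variant of add with a default value for y and an explicit return().
add_default <- function(x, y = 1) {
  result <- x + y
  return(result)    # without return(), the value of the last expression is returned
}

add_default(2, 3)            # 5
add_default(2)               # 3: y falls back to its default value 1
add_default(y = 10, x = 1)   # 11: arguments matched by name, order does not matter
```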

Dynamic Typing

Functions in R are dynamically typed: the types of the inputs are not fixed
before the function is called. The function can work with any type of vector
as long as the + operation can be performed on it. For example, we can
execute the following code without changing the function.

> add(c(1,2),1)
[1] 2 3

> add(as.Date("2020-12-01"),1)
[1] "2020-12-02"

The function passes the two arguments into the expression without any type
checking. In the example above, as.Date() creates a Date object, and the
function works with it because + is defined for dates. If the + operation is
not possible on the two values passed as arguments, the function fails.

> add("a","b")
Error in x + y : non-numeric argument to binary operator
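
If a clearer error message is wanted, the inputs can be checked explicitly before the addition. The following is only a sketch: add_checked is a hypothetical name, and its stricter check also rejects Date objects, which the original add accepts.

```r
# add_checked is a hypothetical variant of add that validates its inputs.
add_checked <- function(x, y) {
  if (!is.numeric(x) || !is.numeric(y)) {
    stop("Both x and y must be numeric")
  }
  x + y
}

print(add_checked(c(1, 2), 1))

# tryCatch() turns the error into a value so that the script can continue.
outcome <- tryCatch(
  add_checked("a", "b"),
  error = function(e) conditionMessage(e)
)
print(outcome)
```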

Generalizing a function

In the previous section, we developed a function add to add two objects. We
can also generalize a function so that it can perform a wider range of
operations. For example, we will define another function calc(). This function
accepts three arguments: two numeric vectors x and y, and a character vector
type that specifies the kind of operation the user wants to perform.

> calc <- function(x, y, type) {


+ if (type == "add") {
+ x + y
+ } else if (type == "minus") {
+ x - y
+ } else if (type == "multiply") {
+ x * y
+ } else if (type == "divide") {
+ x / y
+ } else {
+ stop("Unknown type of operation")
+ }
+ }

> calc(3, 2, "minus")
[1] 1
> calc(c(4,6,8,9),c(2,3),"divide")
[1] 2 2 4 3
> calc(as.Date("2020-12-01"),31,"add")
[1] "2021-01-01"

If we pass some other value for type, we get the predefined error message

> calc(as.Date("2020-12-01"),31,"Addition")
Error in calc(as.Date("2020-12-01"), 31, "Addition") :
Unknown type of operation

In this case, no conditions are satisfied, so the expression in the last else block
will be evaluated. The stop() call yields an error message and terminates the
whole evaluation immediately.

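The function seems to work well, but it gives an unclear message if we pass a
character vector with more than one element as the type argument. (The output
below comes from an older version of R; from R 4.2.0 onwards, a condition of
length greater than one in if is an error rather than a warning.)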

> calc(1,2,c("add","minus"))
[1] 3
Warning message:
In if (type == "add") { :
the condition has length > 1 and only the first element will
be used

We can further refine the function to avoid such ambiguity. We can add a
condition to check whether the vector has length 1.

> calc <- function(x, y, type) {
+ if (length(type) > 1) stop("More than one element in Type")
+ if (type == "add") {
+ x + y
+ } else if (type == "minus") {
+ x - y
+ } else if (type == "multiply") {
+ x * y
+ } else if (type == "divide") {
+ x / y
+ } else {
+ stop("Unknown type of operation")
+ }
+ }
> calc(1,2,c("add","minus"))
Error in calc(1, 2, c("add", "minus")) : More than one element
in Type
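
The length check and the chain of else if branches can also be expressed with two base R helpers, match.arg() and switch(). The sketch below is an alternative, not the version used in this chapter; the name calc2 and the vector of choices are assumptions made for the example.

```r
# calc2 is a hypothetical rewrite of calc using match.arg() and switch().
calc2 <- function(x, y, type) {
  # match.arg() stops with an error unless type is a single valid choice.
  type <- match.arg(type, c("add", "minus", "multiply", "divide"))
  switch(type,
    add = x + y,
    minus = x - y,
    multiply = x * y,
    divide = x / y
  )
}

print(calc2(3, 2, "minus"))
print(calc2(c(4, 6, 8, 9), c(2, 3), "divide"))
```

Note that match.arg() also accepts unambiguous abbreviations such as "div", which may or may not be desirable.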

Default value for function argument

Some functions accept many arguments so that they can handle a wide range of
inputs and meet a variety of demands. It would be cumbersome to specify all of
these arguments every time we call the function. By setting default values for
the arguments, we can simplify the calls.

We can use arg = value in the function definition to set the default value of
an argument and make the argument optional. If a value for the argument is
provided in the call, it overrides the default value.

> increment <- function(x, y=1){x+y}


> increment(2)
[1] 3
> increment(2,3)
[1] 5
> increment(c(1,2,3))
[1] 2 3 4
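
A default value can also be overridden by name, which makes the call easier to read when a function has several optional arguments. The function power below is a hypothetical example, not part of the original text.

```r
# power is a hypothetical function; exponent defaults to 2.
power <- function(base, exponent = 2) {
  base ^ exponent
}

squared <- power(3)              # uses the default exponent
cubed <- power(2, exponent = 3)  # overrides the default by name

print(squared)  # 9
print(cubed)    # 8
```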

4. Basic Expressions

We have studied functions. The building blocks of a function are expressions.
In R, an expression is typically either a symbol or a function call.

In this section, we will study the following fundamental expressions:

• Assignment expressions
• Conditional expressions
• Loop expressions

Assignment Expressions

As we have seen earlier, R provides the left assignment (<-), right assignment
(->), and equals (=) operators for assignment. In this section, we study
assignment expressions in more detail.

We can have a chain of assignments so that all the symbols take the same
value

> a <- b <- c <- 0


> c(a, b, c)
[1] 0 0 0

We can also use the following

> a <- 10 -> b


> a
[1] 10
> b
[1] 10

For assignment, both = and <- are allowed and have the same effect at the top
level, but by convention <- is preferred in R. Inside a function call,
however, the two behave differently.

Consider the following example, in which we define a function and assign it to
abc. The function abc() takes two arguments and prints their values:

> abc = function(xvalue, yvalue){
+ print(paste("Value of x:", xvalue))
+ print(paste("Value of y:", yvalue))
+ }

In the first two lines below, we use <- as the assignment operator, whereas in
the third line we use = to match the function arguments by name in the call to
abc().

> x <- 1
> y <- 0.5

> abc(xvalue=x, yvalue=y)


[1] "Value of x: 1"
[1] "Value of y: 0.5"

Now we change all the operators to = and get the same results from the
function abc(). In this case, we use = both for assignment and for naming the
arguments.

> x = 1
> y = 0.5

> abc(xvalue=x, yvalue=y)


[1] "Value of x: 1"
[1] "Value of y: 0.5"

If we try to check the value of variable xvalue, we get an error.

> xvalue
Error: object 'xvalue' not found

Now we replace all the operators, including those inside the call, with <-. We
get the same results from the function abc(), but new variables xvalue and
yvalue are created in the environment. The variable xvalue gets the value 1
and the variable yvalue gets the value 0.5.

> x <- 1
> y <- 0.5

> abc(xvalue <- x, yvalue <- y)


[1] "Value of x: 1"
[1] "Value of y: 0.5"

> xvalue
[1] 1

> yvalue
[1] 0.5

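When we call the function in this manner, the expressions xvalue <- x and
yvalue <- y are evaluated in the calling environment, which creates the
variables xvalue and yvalue there. The values of these expressions (the values
of x and y) are then passed to the function. Hence the arguments are matched
not by name but by position.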

To understand this further, we can conduct more experiments.

First we use = operator and check the results

> abc(xvalue = x, yvalue = y)


[1] "Value of x: 1"
[1] "Value of y: 0.5"

Now we exchange the positions of the two arguments. The result is still the same:

> abc(yvalue = y, xvalue = x)


[1] "Value of x: 1"
[1] "Value of y: 0.5"

Next we use the <- operator and check the results. The results are still the same:

> abc(xvalue <- x, yvalue <- y)


[1] "Value of x: 1"
[1] "Value of y: 0.5"

Then we use the <- operator and exchange the positions of the two expressions.
The results are different, because in this case the function takes its
arguments by position, not by name:

> abc(yvalue <- y, xvalue <- x)


[1] "Value of x: 0.5"
[1] "Value of y: 1"

Hence using the <- operator where a named argument is intended not only
creates new variables in the environment but also makes the call equivalent to
abc(y, x), with the arguments matched by position.

Therefore we can use either <- or = for assignment, but for naming arguments
in a function call we should use only the = operator.
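
The difference between the two operators inside a call can be condensed into one short script. This is a sketch: describe is a hypothetical stand-in for abc() that returns a string instead of printing, so that the two results can be compared directly.

```r
# describe is a hypothetical function that reports its two arguments.
describe <- function(xvalue, yvalue) {
  paste("x =", xvalue, "y =", yvalue)
}

x <- 1
y <- 0.5

# '=' names the arguments, so their order in the call does not matter.
by_name <- describe(yvalue = y, xvalue = x)

# '<-' performs an assignment in the calling environment and passes the
# assigned value by position, so here y is matched to xvalue.
by_position <- describe(yvalue <- y, xvalue <- x)

print(by_name)
print(by_position)
```

After the second call, the variables yvalue and xvalue exist in the calling environment as a side effect.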

Using backticks

When creating a variable name or symbol in R, we can use the letters a to z
and A to Z (R is case sensitive), digits, the underscore (_) and the dot (.).
A name must not start with a digit or an underscore and must not contain
spaces.

The following names are valid

> students <- data.frame()


> us_population <- data.frame()
> sales.2015 <- data.frame()

The following names are invalid

> new data <- data.frame()


Error: unexpected symbol in "new data"

> _data <- data.frame()


Error: unexpected input in "_"

> population(data) <- data.frame()


Error in population(data) <- data.frame() :
could not find function "population<-"

The name new data contains a space, whereas _data starts with an underscore.
The expression population(data) is not a symbol but a function call. However,
while working on data science problems, we may encounter column names like
these in a data table.

We can use backticks to work with such otherwise invalid names:


> `new data` <- c(1, 2, 3)
> `_data` <- c('a','b','c')
> `population(data)` <- c(0.5, 0.6, 0.7)

When referring to these names, we must use the backticks again:

> `new data`


[1] 1 2 3
> `_data`
[1] "a" "b" "c"
> `population(data)`
[1] 0.5 0.6 0.7

We can use backticks when creating a function. The backticks must then also be
used when calling the function:

> `R Programming Class` <- function(x, y){x/y}


> `R Programming Class`(6,2)
[1] 3

We can also use backticks when creating a list. The backticks must then be used
to refer to the elements:

> li <- list(`Sec(Name)`=c('Pat','Kat','Mat'), `Sec(Marks)` =
+ c(60, 70, 80))

> li$`Sec(Name)`
[1] "Pat" "Kat" "Mat"

A data frame, however, behaves differently:

> result <- data.frame(`Sec(Name)`=c('Pat','Kat','Mat'),
+ `Sec(Marks)` = c(60, 70, 80))

> result
  Sec.Name. Sec.Marks.
1       Pat         60
2       Kat         70
3       Mat         80

> result$`Sec(Name)`
NULL

In this case, even though we used backticks around the unusual variable names,
data.frame() replaces the invalid characters with dots, giving names such as
Sec.Name.. We can access the columns using the names with dots, not the
backtick-quoted originals:

> colnames(result)
[1] "Sec.Name." "Sec.Marks."

We can disable this behaviour by passing check.names = FALSE when creating the
data frame.

> result <- data.frame(`Sec(Name)`=c('Pat','Kat','Mat'),
+ `Sec(Marks)` = c(60, 70, 80),
+ check.names = FALSE)

> result
  Sec(Name) Sec(Marks)
1       Pat         60
2       Kat         70
3       Mat         80

> result$`Sec(Name)`
[1] "Pat" "Kat" "Mat"
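
The renaming that data.frame() applies by default is done by the base function make.names(), which can be called directly to preview the result. The sketch below also shows that [[ ]] accepts the exact column name as a string, so backticks are not needed there; the variable names fixed and names_col are chosen only for this example.

```r
# make.names() converts arbitrary strings into syntactically valid names.
fixed <- make.names(c("Sec(Name)", "new data"))
print(fixed)

# With check.names = FALSE the original names are kept, and [[ ]] can
# access a column by its exact name without backticks.
result <- data.frame(`Sec(Name)` = c("Pat", "Kat", "Mat"),
                     check.names = FALSE,
                     stringsAsFactors = FALSE)
names_col <- result[["Sec(Name)"]]
print(names_col)
```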

Conditional expressions

Many programs are not purely sequential; they contain branches that depend on
certain conditions. Most programming languages provide conditional expressions
to code such branches. In R, we use if to branch the logic flow on logical
conditions.

Using if as a statement

The if expression works with a logical condition, which in R must be a
single-element logical vector. For example, we can write a function
test_positive that returns 1 if a positive number is provided and returns
nothing otherwise.

> test_positive = function(x){


+ if (x > 0){
+ return(1)
+ }
+ }
> test_positive(1)
[1] 1
> test_positive(0)

In this function, we check the condition x > 0, if the condition is satisfied, the
function returns 1. We have tested the function by passing values.

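We can generalize the function by adding else if and else branches to check
further conditions. Now the function returns 1 for positive input, -1 for
negative input and 0 for 0. (Note that assigning this function to the name
sign masks R's built-in sign() function for the rest of the session.)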

> sign = function(x){


+ if (x > 0){
+ return(1)
+ } else if (x < 0) {
+ return(-1)
+ }
+ else{
+ return(0)
+ }
+ }
> sign (2)
[1] 1
> sign (-4)
[1] -1
> sign (0)
[1] 0

A function is not required to return a meaningful value. It may return nothing
useful (NULL) under some conditions, or be called only for its side effects.
The following function prints a message describing the sign of the number that
is passed as an argument:

> print_sign = function(x){


+ if (x > 0){
+ print("Number is greater than 0")
+ } else if (x < 0) {
+ print("Number is less than 0")
+ }
+ else{
+ print("Number is equal to 0")
+ }
+ }
> print_sign(0)
[1] "Number is equal to 0"

> print_sign(4)
[1] "Number is greater than 0"
> print_sign(-2)
[1] "Number is less than 0"

The branch conditions may or may not be related to each other. For example, in
the following grading policy, the branch conditions slice the score range:

> assign_grade <- function(marks) {


+ if (marks >= 90) {
+ return("A")
+ } else if (marks >= 80) {
+ return("B")
+ } else if (marks >= 70) {
+ return("C")
+ } else if (marks >= 60) {
+ return("D")
+ } else {
+ return("F")
+ }
+ }
> c(assign_grade (95), assign_grade(83), assign_grade(78),
assign_grade(61), assign_grade(54))
[1] "A" "B" "C" "D" "F"

In this case, each branch condition in else if assumes that the previous
conditions do not hold. When we specify marks >= 80, we actually mean marks
< 90 and marks >= 80, which depends on the previous condition. Hence we
cannot change the order of the branches without also rewriting the conditions.

To see this, we can try changing the order of the branches:

> assign_grade <- function(marks) {


+ if (marks >= 60) {
+ return("D")
+ } else if (marks >= 70) {
+ return("C")
+ } else if (marks >= 80) {
+ return("B")
+ } else if (marks >= 90) {
+ return("A")
+ } else {
+ return("F")
+ }
+ }
> c(assign_grade (95), assign_grade(83), assign_grade(78),
assign_grade(61), assign_grade(54))
[1] "D" "D" "D" "D" "F"

In this case only assign_grade(61) and assign_grade(54) got the correct grade;
the rest are wrong, because any score of 60 or more satisfies the first
condition. We can rewrite the conditions so that they do not depend on the
order of the branches.

> assign_grade <- function(marks) {


+ if (marks >= 60 && marks < 70) {
+ return("D")
+ } else if (marks >= 70 && marks < 80) {
+ return("C")
+ } else if (marks >= 80 && marks < 90) {
+ return("B")
+ } else if (marks >= 90) {
+ return("A")
+ } else {
+ return("F")
+ }
+ }
> c(assign_grade (95), assign_grade(83), assign_grade(78),
assign_grade(61), assign_grade(54))
[1] "A" "B" "C" "D" "F"

However in this case, the function is more verbose than the first correct
version. Therefore, we should figure out the correct order of the branch
conditions and be careful about the dependency of each branch.
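
When the branches simply slice a numeric range, as in the grading policy above, the base function cut() offers an alternative that avoids ordering problems altogether and also works on whole vectors. This is a sketch under the same grade boundaries; grade_with_cut is a hypothetical name.

```r
# grade_with_cut is a hypothetical vectorized alternative to assign_grade.
grade_with_cut <- function(marks) {
  # right = FALSE makes each interval closed on the left: [60, 70), [70, 80), ...
  bins <- cut(marks,
              breaks = c(-Inf, 60, 70, 80, 90, Inf),
              labels = c("F", "D", "C", "B", "A"),
              right = FALSE)
  as.character(bins)
}

grades <- grade_with_cut(c(95, 83, 78, 61, 54))
print(grades)
```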

Using if as an expression

if can also be used as an inline expression. Instead of calling return() inside
each branch, we can return the value of the if expression itself from the
function body.

> test_positive = function(x){


+ return(if (x > 0){
+ 1
+ })
+ }

We can rewrite this expression syntax in one line by removing the curly
brackets

> test_positive = function(x){


+ return(if (x > 0) 1)
+ }

Since the return value of a function is the value of its last expression in the
function body, return() can be removed in this case:

> test_positive = function(x){


+ if (x > 0) 1
+ }

Using similar logic, we can rewrite the sign() function:

> sign = function(x){


+ if (x > 0) 1 else if (x < 0) -1 else 0
+ }
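
Two details of if as an expression are worth seeing in isolation: the value of the whole expression is the value of the branch that runs, and an if without else whose condition is FALSE evaluates to NULL. The names abs_value and no_else below are illustrative only.

```r
# abs_value is a hypothetical one-line function built on an if expression.
abs_value <- function(x) {
  if (x < 0) -x else x
}

print(abs_value(-3))  # 3
print(abs_value(2))   # 2

# Without an else branch, a FALSE condition yields NULL.
no_else <- if (FALSE) 1
print(is.null(no_else))  # TRUE
```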

We can also assign the value of an if expression to a variable. In the
following example, we implement a grade-reporting function that reports the
student's name together with the grade computed from their marks:

> assign_grade <- function(name, marks) {


+ grade <- if (marks >= 90) "A"
+ else if (marks >= 80) "B"
+ else if (marks >= 70) "C"
+ else if (marks >= 60) "D"
+ else "F"
+ cat("The student", name, "scored", grade, "\n")
+ }
> assign_grade("Kate",76)
The student Kate scored C

Using if as an expression is more compact and less verbose. In practice,
however, the branch conditions may be more complex and the branches may return
more complex objects. In such situations, we should use curly brackets to avoid
syntax errors and improve readability.

> assign_grade <- function(name, marks) {


+ if (marks >= 90){
+ grade <- "A"
+ cat("Congratulations for Good Score \n")
+ } else if (marks >= 80){
+ grade <- "B"
+ } else if (marks >= 70) {
+ grade <- "C"
+ } else if (marks >= 60) {
+ grade <- "D"
+ } else {
+ grade <-"F"
+ cat("Sorry, you cannot be promoted \n")
+ }
+ cat("The student", name, "scored", grade, "\n")
+ }
> assign_grade("Jate",96)
Congratulations for Good Score
The student Jate scored A

Using if with vector

The functions we created earlier work only with single-value input. If we
supply a multi-element vector, they produce a warning, because if does not
work with multi-element conditions:

> test_positive(c(-1, 0, 1))


Warning message:
In if (x > 0) 1 :
the condition has length > 1 and only the first element will
be used

In this example we can see that the if statement ignores all but the first
element, if a multi-element logical vector is supplied.

Similarly we can see the following example

> num <- c(1, 2, 3)


> if (num > 2) {
+ cat("num > 2!")
+ }
Warning message:
In if (num > 2) { :
the condition has length > 1 and only the first element will
be used

We get this warning because the logic is ambiguous: when the condition is a
multi-element logical vector, its values can be a mix of TRUE and FALSE.

We should avoid this ambiguity. One way of doing so is the any() function,
which returns TRUE if at least one element in the given vector is TRUE:

> any(c(TRUE, FALSE, FALSE))


[1] TRUE

> any(c(FALSE, FALSE))


[1] FALSE

Now we can try the previous example to print a message if any single value is
greater than 2

> num <- c(1, 2, 3)


> if (any(num > 2)) {
+ cat("num > 2!")
+ }
num > 2!

Similarly, if we want to print a message if all the values are greater than 2, we
should use all():

> num <- c(1, 2, 3)


> if (all(num > 2)) {
+ cat("num > 2!")
+ } else {
+ cat("not all values are greater than 2")
+ }
not all values are greater than 2
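
The two functions combine naturally in a single chain of branches. The following sketch classifies a numeric vector using all() and any(); summarize_positive is a hypothetical name introduced for this example.

```r
# summarize_positive is a hypothetical helper combining all() and any().
summarize_positive <- function(x) {
  if (all(x > 0)) {
    "all positive"
  } else if (any(x > 0)) {
    "some positive"
  } else {
    "none positive"
  }
}

print(summarize_positive(c(1, 2, 3)))
print(summarize_positive(c(-1, 0, 1)))
print(summarize_positive(c(-1, -2)))
```

Each condition passed to if is now a single logical value, so no warning or error about the length of the condition can occur.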

Vectorized if: ifelse

We have seen that vectors are the basic building blocks of R. Many functions in
R take vectors as input and return a vector as output. Operations applied to a
whole vector at once are generally more efficient than the same operations
applied to each element in turn.

R provides a vectorized counterpart of the if…else statement in the form of
the ifelse() function.

The syntax of the ifelse() function is

ifelse(test_expression, x, y)

Here test_expression must be a logical vector (or an object that can be
coerced to one). The return value is a vector of the same length as
test_expression.

The returned vector takes its element from x if the corresponding value of
test_expression is TRUE, or from y if the corresponding value of
test_expression is FALSE. The vectors x and y are recycled whenever
necessary.

In other words, ifelse() is the vectorized version of if, as shown below:

> ifelse(c(TRUE, FALSE, FALSE), c(1, 2, 3), c(4, 5, 6))


[1] 1 5 6
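
Because ifelse() works element by element, calls can be nested to handle more than two cases. As a sketch, the following vectorized counterpart of the sign function from the previous section works on a whole vector at once; sign_vec is a hypothetical name.

```r
# sign_vec is a hypothetical vectorized version of the earlier sign function.
sign_vec <- function(x) {
  ifelse(x > 0, 1, ifelse(x < 0, -1, 0))
}

signs <- sign_vec(c(-2, 0, 3))
print(signs)  # -1 0 1
```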

With ifelse(), it is mandatory to provide both x (the value for TRUE) and y
(the value for FALSE). This differs from the if…else statement, where the else
branch is optional.

We can also use more complex expressions

> a = c(4, 5, 6, 7)
> ifelse(a %% 2 == 0, "Even", "Odd")
[1] "Even" "Odd" "Even" "Odd"

In the above example, the test_expression is a %% 2 == 0, which results in
the vector (TRUE, FALSE, TRUE, FALSE).

Similarly, the other two arguments are recycled to
("Even", "Even", "Even", "Even") and ("Odd", "Odd", "Odd", "Odd")
respectively, and each element of the result is picked accordingly.

Using switch function

In R, we can test an expression against the elements of a list of alternatives
using the switch() function. If the value of the expression matches an item in
the list, the corresponding value is returned.

The syntax of switch function is

switch (expression, list)

Here the expression is evaluated first, and the result determines which item
of the list is returned.

If more than one item in the list matches the expression, switch() returns
the first matching item.

If the evaluated value is a number, the item at that position in the list is
returned:

> switch(1, "red", "blue", "green")


[1] "red"

> switch(2, "red", "blue", "green")


[1] "blue"

If the evaluated number is out of bounds, NULL is returned invisibly:

> switch(4, "red", "blue", "green")

If the evaluated value is a string, it returns the value of the first argument that
matches with the evaluated value.

> switch("color", color="red", shape="square", fill="not filled")
[1] "red"

> switch("fill", color="red", shape="square", fill="not filled")
[1] "not filled"

Here too, if the evaluated value matches none of the names, NULL is returned
invisibly:

> switch("frame", color="red", shape="square", fill="not filled")

To cover all the possibilities, we can add the last argument without an
argument name that captures all other inputs:

> switch("frame", color="red", shape="square", fill="not filled",
+ "photo frame")
[1] "photo frame"

Compared to ifelse(), switch() behaves more like if: it accepts only a single
value as input, but it can return any value.
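
With a character input, switch() also supports fall-through: a name followed by an empty argument uses the value of the next non-empty argument. The sketch below uses this to let two names share one result and adds an unnamed final argument that raises an error; op_symbol and its set of names are assumptions made for the example.

```r
# op_symbol is a hypothetical helper mapping operation names to symbols.
op_symbol <- function(type) {
  switch(type,
    add = ,        # empty argument: falls through to the next value
    plus = "+",
    minus = "-",
    stop("Unknown type of operation")  # default for all other inputs
  )
}

print(op_symbol("add"))
print(op_symbol("plus"))
print(op_symbol("minus"))
```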

Loop expressions

In programming, we use a loop to repeat a specific block of code.

For loop

We use a for loop to evaluate an expression by iterating over a given vector or
list.

for (var in vector){
expr
}

Here the expression expr is executed in each iteration, while var takes the
value of each element of the vector in turn.

If the vector contains n elements, the loop is equivalent to the following
statement block:

var <- vector[[1]]
expr
var <- vector[[2]]
expr
...
var <- vector[[n]]
expr

For example, suppose we want to evaluate an expression 3 times by iterating
over 1:3 using the variable i. In each iteration, we display a text containing
the current value of i.

> for (i in 1:3){


+ cat("Printing value of i",i,"\n")
+ }
Printing value of i 1
Printing value of i 2
Printing value of i 3

We can use the iterator not only with numeric vectors but with any kind of
vector. In the example below, we have replaced the numeric vector 1:3 with a
character vector:

> for (word in c("I", "am", "Learning", "R")){


+ cat("Printing current word",word,"\n")
+ }
Printing current word I
Printing current word am
Printing current word Learning
Printing current word R

We can also use a list

> listloop <- list(


+ a = c(1,2,3),
+ b = c('p','q','r','s')
+ )
> for (item in listloop){
+ cat("item:\n length:", length(item), "\n class:", class(item), "\n")
+ }
item:
length: 3
class: numeric
item:
length: 4
class: character
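
When the position of each element is needed as well as its value, a common idiom is to loop over seq_along() of the vector. The sketch below also shows why seq_along() is usually safer than 1:length(x): for an empty vector it produces no iterations at all. The variable names are illustrative.

```r
words <- c("I", "am", "learning", "R")

# Preallocate the result, then fill it by position.
labelled <- character(length(words))
for (i in seq_along(words)) {
  labelled[i] <- paste(i, words[i], sep = ": ")
}
print(labelled)

# With an empty vector, seq_along() yields nothing, so the body never runs.
# In contrast, 1:length(empty) would be the sequence 1, 0.
empty <- character()
count <- 0
for (i in seq_along(empty)) {
  count <- count + 1
}
print(count)  # 0
```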

We can also use data frame

> df <- data.frame(


+ x = c(1, 2, 3),
+ y = c("A", "B", "C"),
+ stringsAsFactors = FALSE)
> for (col in df) {
+ str(col)
+ }
num [1:3] 1 2 3
chr [1:3] "A" "B" "C"

A data frame is a list in which all the elements have the same length.
Therefore the behaviour of the for loop is the same for a list and a data
frame, as we have seen in the previous two examples.

However, we can also iterate over a data frame row by row. For that, we iterate
over the integer sequence from 1 to the number of rows of the data frame:

> for (i in 1:nrow(df)) {
+ cat("row", i, "\n")
+ str(df[i,])
+ }
row 1
'data.frame': 1 obs. of 2 variables:
$ x: num 1
$ y: chr "A"
row 2
'data.frame': 1 obs. of 2 variables:
$ x: num 2
$ y: chr "B"
row 3
'data.frame': 1 obs. of 2 variables:
$ x: num 3
$ y: chr "C"

The iteration over a data frame row by row is not a good idea. It is slow and
verbose. We will discuss the better option in the next chapter.

Managing the flow of a for loop

Sometimes we need to intervene in a for loop. In each iteration, we can choose
to interrupt the loop, to skip the rest of the current iteration, or to do
nothing and let the loop finish.

We can use break to terminate a for loop.

> for (i in 1:5) {
+ if (i == 3) break
+ cat("message ", i, "\n")
+ }
message 1
message 2

This is very helpful when searching for a solution to a problem. For example,
suppose we want to find the numbers between 1000 and 1100 that satisfy
(i ^ 2) %% 11 equals (i ^ 3) %% 17, where ^ is the power operator and %% (the
modulo operator) returns the remainder of a division:

> m <- integer()


> for (i in 1000:1100) {
+ if ((i ^ 2) %% 11 == (i ^ 3) %% 17) {
+ m <- c(m, i)
+ }
+ }
> m
[1] 1055 1061 1082 1086 1095

If you only need the first number that satisfies the condition, you can
replace the expression that records each match with break:

> for (i in 1000:1100) {


+ if ((i ^ 2) %% 11 == (i ^ 3) %% 17) break
+ }
> i
[1] 1055

Once the program finds the solution, the for loop breaks and the last value of
the iterator i is preserved.
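
The same search can be written without an explicit loop by applying the condition to the whole range at once and using the resulting logical vector as an index. This is a sketch of the vectorized alternative; the names candidates, matches and first_match are introduced here for illustration.

```r
candidates <- 1000:1100

# The comparison is evaluated for every candidate at once and the
# resulting logical vector selects the matching numbers.
matches <- candidates[(candidates ^ 2) %% 11 == (candidates ^ 3) %% 17]

# The first match corresponds to the value of i after break in the loop.
first_match <- matches[1]

print(matches)
print(first_match)
```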

You can also use the next keyword to skip the rest of the expressions in the
current iteration and jump directly to the next iteration of the loop.

> for (i in 1:5) {


+ if (i == 3) next
+ cat("message ", i, "\n")
+ }
message 1
message 2
message 4
message 5

Creating nested for loop

We can include a for loop inside another for loop. If we want to list all
ordered pairs of the elements in a vector (that is, all two-element
permutations with repetition), we can use a two-level nested for loop.

> input <- c("a", "b", "c")


> perm <- character()
> for (x in input) {
+ for (y in input) {
+ perm <- c(perm, paste(x, y, sep = ","))
+ }
+ }
> perm
[1] "a,a" "a,b" "a,c" "b,a" "b,b" "b,c" "c,a" "c,b" "c,c"

If we want only the permutations of distinct items, we can use a test condition
and the next expression inside the inner for loop.

> input <- c("a", "b", "c")


> perm <- character()
> for (x in input) {
+ for (y in input) {
+ if (x == y) next
+ perm <- c(perm, paste(x, y, sep = ","))
+ }
+ }
> perm
[1] "a,b" "a,c" "b,a" "b,c" "c,a" "c,b"

We have shown how for loops and nested for loops work, but they are not always
the best solution. R offers several built-in functions for such tasks. For
example, we can use the combn() function to produce a matrix of combinations of
vector elements:

> combn(c("a","b","c"),2)
     [,1] [,2] [,3]
[1,] "a"  "a"  "b"
[2,] "b"  "c"  "c"

Similarly, we can use expand.grid() to produce a data frame that contains all
combinations of the elements of multiple vectors:

> expand.grid(num = c(1, 2, 3), char = c("a", "b"))
  num char
1   1    a
2   2    a
3   3    a
4   1    b
5   2    b
6   3    b
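
As a sketch of how expand.grid() can replace the nested loop from the previous section, the pairs of distinct items can be obtained by building all pairs and then filtering out those with equal elements. The column names x and y are chosen for this example, and the pairs come out in a different order than in the loop version because expand.grid() varies its first argument fastest.

```r
input <- c("a", "b", "c")

# All ordered pairs, one row per pair.
pairs <- expand.grid(x = input, y = input, stringsAsFactors = FALSE)

# Keep only the pairs whose two elements differ.
pairs <- pairs[pairs$x != pairs$y, ]

perm <- paste(pairs$x, pairs$y, sep = ",")
print(perm)
```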

While loop

The while loop keeps running as long as a specific condition holds.

while (test_expression)
{
expr
}

In the following example, the while loop starts with x = 0. Each time, the
test_expression, which in this case is x <= 5, is evaluated. If it evaluates
to TRUE, body of the loop is executed, else the while loop terminates.

> x <- 0
> while (x <= 5) {
+ cat(x, " ", sep = "")
+ x <- x + 1
+ }
0 1 2 3 4 5

If we remove the expression x <- x + 1, the value of x will not be
incremented and the code will run forever. Hence, we should be careful when
implementing a while loop.

We can use the flow control statements, break and next

> x <- 0
> while (TRUE) {
+ x <- x + 1
+ if (x == 4) break
+ else if (x == 2) next
+ else cat(x, '\n')
+ }
1
3

In practice, we prefer while loop when the number of iterations is unknown.
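
As a small illustration of a loop whose number of iterations is not fixed in advance, the following sketch doubles a value until it exceeds a limit and counts the steps. The limit of 1000 and the variable names are arbitrary choices for the example.

```r
x <- 1
steps <- 0

# Keep doubling until x exceeds 1000; the loop decides when to stop.
while (x <= 1000) {
  x <- x * 2
  steps <- steps + 1
}

print(x)      # 1024
print(steps)  # 10
```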

5. Working with Basic Objects

R provides a very large number of built-in functions. These functions not only
save your time but also boost productivity.

In this chapter, we shall cover the following groups of built-in functions:

• Object functions
• Logical functions
• Math functions
• Numeric methods
• Statistical functions
• Apply-family functions

Object functions

In this section, we will study some basic functions that can be applied to
objects. We have already met some of these functions in previous chapters.
Here, we will learn more functions for inspecting the type and dimensions of a
data object.

Testing object types

In R, everything is an object, and objects come in different types.

Suppose we have a user-defined object and, as a project requirement, we need to
develop a function that behaves differently according to the type of the input
object. We need to define a function obj_type that returns the first element if
the input object is an atomic vector, such as a numeric, character, or logical
vector. But if the input object is a list containing a vector vec and a
position index, the function should return the element of vec at that position.

For example, if we pass the numeric vector c(1, 2, 3) as an input argument,
our function should return the first element 1. If the input argument is a
character vector c("a","b","c"), our function should return the first
element "a". But if our input argument is the list
list(vec = c("a", "b", "c"), index = 2), the function should return the
element of vec at index = 2, that is "b".

In order to meet this project requirement, we need to develop the logic and
program flow. First, we need to check the object type of the input argument,
because the output of the function depends on it. We can use the is.*
functions to check the type of the object. The function returns different
values for different input object types, so we will use the conditional
expression if else. Finally, we will use the element-extraction operator
[[ ]] because the function needs to return an element of the input object.

> obj_type <- function(s) {
+ if (is.atomic(s)) {
+ s[[1]]
+ } else if (is.list(s)) {
+ s$vec[[s$index]]
+ } else {
+ stop("The input type is not supported")
+ }
+ }

The function returns different values based on the input object s. If s takes an
atomic vector such as a numeric vector, we get the first element of the vector.
If s takes a list of vec and index, we get the element with the index of index
from s$vec.

> obj_type(c(10,11,12))
[1] 10
> obj_type(list(vec=c("Cat","Bat","Rat"), index = 2))
[1] "Bat"

If we pass some other input type, such as a function, obj_type should not
return any value. Rather, we should get an error message. If we pass the
function mean, obj_type falls into the else branch and stops.

> obj_type(mean)
Error in obj_type(mean) : The input type is not supported

Now we need to test our function for other possibilities. What if the input is
a list but its elements are not vec and index? We can test this by passing a
list that has a single element lst and no index element.

> obj_type(list(lst=c("Cat","Bat","Rat")))
NULL

It returns NULL instead of an error message or a value. We get NULL because
s$vec is NULL, and extracting any value from NULL gives NULL.

> NULL[[NULL]]
NULL
> NULL[[1]]
NULL

Another possibility is passing the vec element correctly but missing the index.

> obj_type(list(vec=c("Cat","Bat","Rat")))
Error in s$vec[[s$index]] :
attempt to select less than one element in get1index

This time, we get an error because s$index is NULL. If we extract a value from
a vector using NULL as the index, we get an error.

> c("Cat","Bat","Rat")[[NULL]]
Error in c("Cat", "Bat", "Rat")[[NULL]] :
attempt to select less than one element in get1index

Another possibility is that the list contains only one element, index = 2. In
this case too, s$vec is NULL, so we get NULL.

> obj_type(list(index = 2))
NULL

From the experiments above, we observe that the results are not informative:
we get either NULL or a low-level error message. Hence, we should check the
input ourselves in the implementation of the function.

> obj_type1 <- function(s) {
+ if (is.atomic(s)) {
+ s[[1]]
+ } else if (is.list(s)) {
+ if (!is.null(s$vec) && is.atomic(s$vec)) {
+ if (is.numeric(s$index) && length(s$index) == 1) {
+ s$vec[[s$index]]
+ } else {
+ stop("Index is Invalid")
+ }
+ } else {
+ stop("Data is Invalid")
+ }
+ } else {
+ stop("Input type is not supported")
+ }
+ }

In case s is a list, we check if s$vec is not null and is an atomic vector. If this
condition is TRUE, we check whether s$index is properly defined as a single-
element numeric vector. If any of the conditions are violated, the program
stops and displays an informative error message.

> obj_type1(list(lst=c("Cat","Bat","Rat")))
Error in obj_type1(list(lst = c("Cat", "Bat", "Rat"))) : Data
is Invalid
> obj_type1(list(index = 2))
Error in obj_type1(list(index = 2)) : Data is Invalid

Accessing Object Classes and Types

In previous sections, we have shown that we can use the is.* functions to test
the class and type of an object. In addition, we can also use class() and
typeof().

The function typeof() returns the low-level internal type of an object, while
class() returns the high-level class of an object.

For a numeric vector

> x <- c(1, 2, 3)
> class(x)
[1] "numeric"
> typeof(x)
[1] "double"
> str(x)
num [1:3] 1 2 3

For an integer vector

> x <- 1:3
> class(x)
[1] "integer"
> typeof(x)
[1] "integer"
> str(x)
int [1:3] 1 2 3

For a character vector

> x <- c("a","b","c")
> class(x)
[1] "character"
> typeof(x)
[1] "character"
> str(x)
chr [1:3] "a" "b" "c"

For a list

> x <- list(one = c("a","b","c"), two = c(TRUE, FALSE))
> class(x)
[1] "list"
> typeof(x)
[1] "list"
> str(x)
List of 2
$ one: chr [1:3] "a" "b" "c"
$ two: logi [1:2] TRUE FALSE

For a data frame

> x <- data.frame(one = c("a","b","c"), two = c(1, 2, 3))
> class(x)
[1] "data.frame"
> typeof(x)
[1] "list"
> str(x)
'data.frame': 3 obs. of 2 variables:
$ one: chr "a" "b" "c"
$ two: num 1 2 3

Notice in the last example that a data frame is essentially a list in which
all the columns have equal length. Even though class() returns "data.frame",
typeof() returns "list", the internal type.

Getting data dimensions

In R programming, a vector is a one-dimensional structure.

> s <- c(1, 2, 3, 4, 5, 6, 6, 5, 4, 3, 2, 1)
> class(s)
[1] "numeric"
> typeof(s)
[1] "double"

We can arrange the same underlying data in more dimensions, for example as a
matrix, and inspect the result with dim(), nrow(), and ncol().

> mat <- matrix(s, ncol=4)
> mat
[,1] [,2] [,3] [,4]
[1,] 1 4 6 3
[2,] 2 5 5 2
[3,] 3 6 4 1
> class(mat)
[1] "matrix" "array"
> typeof(mat)
[1] "double"
> dim(mat)
[1] 3 4
> nrow(mat)
[1] 3
> ncol(mat)
[1] 4

The first expression creates a four-column matrix from the numeric vector s.
The underlying type of s, as reported by typeof(), has been preserved, but the
class has changed to "matrix" "array". The dim() function shows the
dimensional structure as a vector. We have also used two shortcuts, nrow() and
ncol(), to access the number of rows and columns. The values of nrow() and
ncol() are the first and second elements of the dim() vector.

We can use an array to represent the higher dimensions.

> arr <- array(s, dim= c(3,2,2))
> class(arr)
[1] "array"
> typeof(arr)
[1] "double"
> dim(arr)
[1] 3 2 2
> nrow(arr)
[1] 3
> ncol(arr)
[1] 2

In this case, dim() shows the size of each of the three dimensions, while
nrow() and ncol() still return the first two of them.
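
As a small sketch of how the dim() vector of an array can be used, the
following lines rebuild the same array and read off the number of dimensions,
the size of one dimension, and a single element. Nothing here goes beyond base
R; the chosen element is only an example.

```r
s <- c(1, 2, 3, 4, 5, 6, 6, 5, 4, 3, 2, 1)
arr <- array(s, dim = c(3, 2, 2))

length(dim(arr))  # 3: the number of dimensions
dim(arr)[3]       # 2: the size of the third dimension
length(arr)       # 12: the total number of elements, prod(dim(arr))
arr[2, 1, 2]      # 5: row 2, column 1 of the second 3 x 2 slice
```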

Another data structure where the notion of dimension is used is the data
frame. The data frame is fundamentally different from the matrix. We derive a
matrix from a vector by adding a dimension attribute. Similarly, we derive a
data frame from a list; we just add the constraint that each list element
should be of the same length.

> df <- data.frame(a=c(1,2,3), b=c("Cat","Mat","Rat"))
> class(df)
[1] "data.frame"
> typeof(df)
[1] "list"
> dim(df)
[1] 3 2
> nrow(df)
[1] 3
> ncol(df)
[1] 2

Reshaping Data Structures

We can use dim() to reshape a data structure by assigning a new value to the
dimensions of the underlying data.

> dat <- s
> dim(dat) <- c(3,4)
> dat
[,1] [,2] [,3] [,4]
[1,] 1 4 6 3
[2,] 2 5 5 2
[3,] 3 6 4 1
> class(dat)
[1] "matrix" "array"
> typeof(dat)
[1] "double"

The class of the object changes from numeric to matrix, whereas the type of
the object does not change.

We can reshape the matrix again, as long as the total number of elements stays
the same.

> dim(dat) <- c(2,6)
> class(dat)
[1] "matrix" "array"
> typeof(dat)
[1] "double"
> dim(dat)
[1] 2 6

We can also reshape the matrix into an array. This is possible because
assigning to dim() only changes the representation; the underlying stored data
does not change.

> dim(dat) <- c(2,3,2)
> class(dat)
[1] "array"
> typeof(dat)
[1] "double"
> dim(dat)
[1] 2 3 2
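
The claim that only the representation changes can be checked directly. The
sketch below, which uses only base R, compares the stored values before and
after reshaping and shows that an incompatible shape is rejected. The variable
name res is an assumption of this example.

```r
s <- c(1, 2, 3, 4, 5, 6, 6, 5, 4, 3, 2, 1)
dat <- s
dim(dat) <- c(3, 4)           # view the 12 values as a 3 x 4 matrix
identical(as.vector(dat), s)  # TRUE: the stored values are unchanged

dim(dat) <- NULL              # removing the dimensions gives back a plain vector
identical(dat, s)             # TRUE

# The product of the new dimensions must equal the number of elements.
res <- tryCatch({dim(dat) <- c(5, 2); "ok"}, error = function(e) "error")
res                           # "error", because 5 * 2 is not 12
```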

Iterating over one dimension

We have already created a data frame where each row represents a record. We
can iterate over all the records stored in the data frame. Let’s consider the
data frame df.

> df
a b
1 1 Cat
2 2 Mat
3 3 Rat

We can iterate using a for loop over 1:nrow(df).

> for (i in 1:nrow(df)){
+ cat("row number",i,"column a",df[i,"a"],"column b",df[i,"b"],"\n")
+ }
row number 1 column a 1 column b Cat
row number 2 column a 2 column b Mat
row number 3 column a 3 column b Rat
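
One caveat with 1:nrow(df) is a data frame that has no rows. The following
sketch shows the difference between 1:nrow() and seq_len(nrow()) in that case.
The names empty and visited are assumptions made for this example.

```r
df <- data.frame(a = c(1, 2, 3), b = c("Cat", "Mat", "Rat"))
empty <- df[0, ]            # same columns as df, but no rows

1:nrow(empty)               # 1 0: a loop over this would run twice
seq_len(nrow(empty))        # integer(0): a loop over this does not run

visited <- 0
for (i in seq_len(nrow(empty))) visited <- visited + 1
visited                     # 0

# The same loop as in the text, written with seq_len()
for (i in seq_len(nrow(df))) {
  cat("row number", i, "column a", df[i, "a"], "column b", df[i, "b"], "\n")
}
```

For data frames that always contain at least one row, both forms behave the
same way.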

Using logical functions

Logical vectors are used to filter data. Their elements can take the values
TRUE or FALSE. To solve problems, we often need to create joint conditions
that involve multiple logical vectors.

Logical operators

The logical operators that help us do basic logical calculations are:

Symbol  Description     Example                      Result
&       Vectorized AND  c(T, T) & c(T, F)            c(TRUE, FALSE)
|       Vectorized OR   c(T, T) | c(T, F)            c(TRUE, TRUE)
&&      Univariate AND  TRUE && FALSE                FALSE
||      Univariate OR   TRUE || FALSE                TRUE
!       Vectorized NOT  !c(T, F)                     c(FALSE, TRUE)
%in%    Vectorized IN   c(1, 5) %in% c(1, 2, 3, 4)   c(TRUE, FALSE)

In an if expression, the condition must be a single-element logical vector,
which is what && and || are designed to produce. In older versions of R, &&
applied to multi-element vectors silently ignored all but the first element on
both sides; since R 4.3.0, this is an error.

The following R function checks whether the values i, j, and k are
monotonically increasing. If they are increasing, the function returns 1; if
they are decreasing, it returns -1; otherwise it returns 0.

direction <- function(i, j, k) {
if (i < j & j < k) 1
else if (i > j & j > k) -1
else 0
}

We have seen that & performs the vectorized calculation and returns a multi-
element vector if one of the arguments has more than one element. But the if
statement only works with a single-value logical vector.

> direction(1, 2, 3)
[1] 1

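Otherwise, it gives a warning in the R version used here. Since R 4.2.0, a
condition with more than one element is an error instead.

> direction(c(1, 2),c(2, 3),c(3, 4))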
[1] 1
Warning message:
In if (i < j & j < k) 1 else if (i > j & j > k) -1 else 0 :
the condition has length > 1 and only the first element will
be used

For our experiment, we create a new function direction2 by replacing & with
&&.

direction2 <- function(i, j, k) {
if (i < j && j < k) 1
else if (i > j && j > k) -1
else 0
}

For the scalar input, the behaviour of the two versions is the same.

> direction2(1, 2, 3)
[1] 1

But for multiple-value input, direction2 ignores the second element of each
input vector without producing any warning. (Since R 4.3.0, && raises an error
for such input.)

> direction2(c(1, 2),c(2, 3),c(3, 4))
[1] 1

Now the question arises: which of the two, & or &&, is the better option? It
depends on the requirement. But if the requirement is to compare all the
elements in the same position of each input vector, then neither option is
correct.
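
If an answer is wanted for every position of the input vectors, one possible
approach is to replace if with the vectorized ifelse(). The sketch below is
only an illustration; the function name direction_each is hypothetical and
does not appear elsewhere in this chapter.

```r
# direction_each is a hypothetical name used only in this sketch.
# It returns 1, -1 or 0 for each position instead of a single value.
direction_each <- function(i, j, k) {
  ifelse(i < j & j < k, 1,
         ifelse(i > j & j > k, -1, 0))
}

direction_each(c(1, 3, 2), c(2, 2, 2), c(3, 1, 2))
# [1]  1 -1  0
```

The first position is increasing (1 < 2 < 3), the second is decreasing
(3 > 2 > 1), and the third is neither.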

Logical functions

In previous chapters, we saw that a few logical aggregation functions are very
useful. The two most commonly used logical aggregation functions are any() and
all(). The any() function returns TRUE if at least one of the elements of the
given logical vector is TRUE. Otherwise, it returns FALSE. The all() function
returns TRUE if all the elements of the given logical vector are TRUE;
otherwise it returns FALSE.

> x <- c(-2, -1, 0, 1, 2, 3)
> any(x > 1)
[1] TRUE
> all(x <= 0)
[1] FALSE

While dealing with any() and all() functions, we should remember that they
only return a single TRUE or FALSE value. They never return a multi-element
logical vector. We can modify the function direction, to include both all()
and & together in the if condition.

updated_direction_all <- function(i, j, k) {
if (all(i < j & j < k)) 1
else if (all(i > j & j > k)) -1
else 0
}

The function gives the same results as direction() and direction2() for scalar
values.

> updated_direction_all(1, 2, 3)
[1] 1

But for multi-element vector input, we have to test whether the function
reports the monotonicity correctly.

> updated_direction_all(c(1, 2),c(2, 3), c(3, 4))
[1] 1
> updated_direction_all(c(4,3),c(3, 2), c(2,1))
[1] -1

The function has returned meaningful results now.

We can use several other variations. You can try these functions to test the
behaviour they provide.

updated_direction_any <- function(i, j, k) {
if (any(i < j & j < k)) 1
else if (any(i > j & j > k)) -1
else 0
}
updated_direction_all2 <- function(i, j, k) {
if (all(i < j) && all(j < k)) 1
else if (all(i > j) && all(j > k)) -1
else 0
}
updated_direction_any2 <- function(i, j, k) {
if (any(i < j) && any(j < k)) 1
else if (any(i > j) && any(j > k)) -1
else 0
}
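
To see how the all() and any() variants differ, the following sketch applies
compact copies of two of them to an input whose positions disagree. The short
names dir_all and dir_any are assumptions used only to keep the example
self-contained.

```r
# Compact copies of two of the variants above, so the block runs on its own.
dir_all <- function(i, j, k) {
  if (all(i < j & j < k)) 1 else if (all(i > j & j > k)) -1 else 0
}
dir_any <- function(i, j, k) {
  if (any(i < j & j < k)) 1 else if (any(i > j & j > k)) -1 else 0
}

# First position is increasing (1 < 2 < 3), second is decreasing (6 > 5 > 4).
i <- c(1, 6); j <- c(2, 5); k <- c(3, 4)
dir_all(i, j, k)  # 0: neither condition holds at every position
dir_any(i, j, k)  # 1: the increasing test is checked first and holds somewhere
```

With any(), the order of the if branches decides the result when some
positions increase and others decrease.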

Which elements are TRUE

The logical operations introduced so far just return whether a certain
condition is TRUE or FALSE. They do not tell us which elements are TRUE. The
which() function can be used to get the positions of the TRUE elements in a
logical vector.

> x
[1] -2 -1 0 1 2 3
> abs(x) >= 1.5
[1] TRUE FALSE FALSE FALSE TRUE TRUE
> which(abs(x) >= 1.5)
[1] 1 5 6

We can also use logical conditions to filter elements from a vector or a list.

> x[abs(x) >= 1.5]
[1] -2 2 3

In this example, abs(x) >= 1.5 is evaluated to a logical vector. It is then
used to select the elements of x that correspond to TRUE values.

If we use a logical vector in which all the values are FALSE, a zero-length
numeric vector is returned.

> x[x>=10]
numeric(0)

Dealing with missing values

Real-world data may contain several data issues. It may contain missing
values, represented by NA. For example:

> x <- c(-3, NA, -2, -1, 0, NA, 1, NA, 3)

Any arithmetic calculations on the missing values will also return missing
values:

> x + 2
[1] -1 NA 0 1 2 NA 3 NA 5

The logical vector will also contain missing values

> x > 2
[1] FALSE NA FALSE FALSE FALSE NA FALSE NA TRUE

Hence any() and all() have to deal with missing values too.

> x
[1] -3 NA -2 -1 0 NA 1 NA 3
> any(x > 1)
[1] TRUE
> any(x < -2)
[1] TRUE
> any(x < -3)
[1] NA

If any element of the logical vector is TRUE, the function returns TRUE. If no
element is TRUE but at least one is NA, the function returns NA, because the
missing value might have been TRUE. If all the elements are FALSE, it returns
FALSE, as demonstrated below.

> any(c(TRUE, FALSE, NA))
[1] TRUE
> any(c(FALSE, FALSE, NA))
[1] NA
> any(c(FALSE, FALSE))
[1] FALSE

We can use na.rm=TRUE to ignore the missing values.


> any(x < -3, na.rm=TRUE)
[1] FALSE

A similar but opposite logic applies to all(). If any element of the input
vector is FALSE, the function returns FALSE even when missing values are
present. But if all the non-missing elements are TRUE and there are missing
values, the function returns NA.

> all(c(TRUE, FALSE, NA))
[1] FALSE
> all(c(TRUE, TRUE, NA))
[1] NA
> all(c(TRUE, TRUE))
[1] TRUE

In this case also, we can use na.rm=TRUE to ignore the missing values.

> all(c(TRUE, TRUE, NA), na.rm=TRUE)
[1] TRUE

Data filtering also behaves differently when missing values are involved. The
following code preserves the missing values at the corresponding positions of
the logical vector produced by x > 0.

> x[x > 0]
[1] NA NA 1 NA 3

We can use which(), which does not preserve the missing values.

> x[which(x > 0)]
[1] 1 3
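
Besides which(), the base R function is.na() can be used to locate and exclude
missing values explicitly. The lines below are a minimal sketch using the same
vector x; they only use base R functions.

```r
x <- c(-3, NA, -2, -1, 0, NA, 1, NA, 3)

is.na(x)               # TRUE at positions 2, 6 and 8
sum(is.na(x))          # 3 missing values
x[!is.na(x) & x > 0]   # 1 3: same result as x[which(x > 0)]
mean(x, na.rm = TRUE)  # -0.3333333: mean of the six observed values
```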

Logical Coercion

We can also use numeric vectors in place of logical vectors as input for some
functions. The non-logical vectors are coerced to logical values.

For example, we can put a numeric value in the if condition. The numeric value
will be coerced in such cases.

> if (2) cat("True") else cat("False")
True
> if (0) cat("True") else cat("False")
False

All non-zero values in a numeric or integer vector are coerced to TRUE; only
zero values are coerced to FALSE. Character strings that do not spell a
logical value, such as "a", cannot be interpreted as logical values.

> if("a") 1 else 2
Error in if ("a") 1 else 2 : argument is not interpretable as
logical
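
The coercion rules can also be explored directly with as.logical() and
as.numeric(). The following sketch lists a few representative cases; the
chosen input values are only examples.

```r
as.logical(c(2, 0, -1.5))            # TRUE FALSE TRUE
as.logical(c("TRUE", "false", "T"))  # TRUE FALSE TRUE
as.logical("a")                      # NA: not a recognised spelling
as.numeric(c(TRUE, FALSE))           # 1 0: coercion in the other direction
sum(c(TRUE, FALSE, TRUE))            # 2: counts the TRUE values
```

Note that as.logical("a") quietly returns NA, whereas using "a" as an if
condition raises the error shown above.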

Math Functions

R provides several groups of basic math functions. The basic functions include
square root, exponential, and logarithm functions.

You can use sqrt() with real numbers. For a negative number, NaN will be
returned followed by a warning message.

> sqrt(4)
[1] 2
> sqrt(-1)
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced

In R, numeric values can be finite, infinite (Inf and -Inf), or NaN. The
following code produces infinite values (Inf and -Inf):

> 1/0
[1] Inf
> log(0)
[1] -Inf

You can use the function is.finite() or is.infinite() to check whether a
numeric value is finite or infinite.

> is.finite(1/0)
[1] FALSE
> is.infinite(log(0))
[1] TRUE

We can use an inequality to check the sign of an infinite value.

> 1/0 < 0
[1] FALSE
> 1/0 > 0
[1] TRUE
> log(0) < 0
[1] TRUE
> log(0) > 0
[1] FALSE

We can create the custom functions is.pos.infinite() and is.neg.infinite(),
which tell whether a number is Inf or -Inf.

is.pos.infinite <- function(x){
is.infinite(x) & x > 0
}

is.neg.infinite <- function(x){
is.infinite(x) & x < 0
}

> is.pos.infinite(1/0)
[1] TRUE
> is.neg.infinite(log(0))
[1] TRUE

Number rounding functions

The major number rounding functions are described below.

Use ceiling() to round the values passed as an argument up to the nearest
integer.

> ceiling(c(-1.3,-1.7, 1.3, 1.7))
[1] -1 -1 2 2

Use floor() to round the values passed as an argument down to the nearest
integer.

> floor(c(-1.3,-1.7, 1.3, 1.7))
[1] -2 -2 1 1

Use trunc() to truncate the values passed as an argument towards 0.

> trunc(c(-1.3,-1.7, 1.3, 1.7))
[1] -1 -1 1 1

Use round() to round the values in its first argument to the specified number
of decimal places (default 0).

> round(c(-1.3,-1.7, 1.3, 1.7))
[1] -1 -2 1 2
> round(pi, 3)
[1] 3.142

Use signif() to round the values in its first argument to the specified number
of significant digits.

> signif(pi, 4)
[1] 3.142
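
Two details of round() are easy to overlook: values that lie exactly halfway
between two integers are rounded to the even neighbour, and the number of
digits may be negative. The sketch below illustrates both with a few example
values.

```r
round(c(0.5, 1.5, 2.5))   # 0 2 2: exact halves go to the even neighbour
round(1234.567, -2)       # 1200: negative digits round to tens, hundreds, ...
signif(1234.567, 2)       # 1200: two significant digits
round(-1.7)               # -2: round() goes to the nearest integer
trunc(-1.7)               # -1: trunc() goes towards 0
```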

Trigonometric functions

Commonly used trigonometric functions are shown below. Angles are measured in
radians.

> sin(0)
[1] 0
> cos(0)
[1] 1
> tan(0)
[1] 0
> asin(1)
[1] 1.570796
> acos(1)
[1] 0
> atan(1)
[1] 0.7853982

R also provides a numeric version of π as the built-in constant pi.

In mathematics, sin(π) = 0 holds exactly. But in R, the expression does not
evaluate to exactly 0 because of the limited precision of floating-point
numbers.

> sin(pi)
[1] 1.224647e-16

Hence, the comparison sin(pi) == 0 returns FALSE.

> sin(pi) == 0
[1] FALSE

We can use all.equal(), which compares numbers within a default tolerance of
about 1.5e-8.

> all.equal(sin(pi),0)
[1] TRUE
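
When the values are not equal, all.equal() returns a character description of
the difference instead of FALSE, so it is usually wrapped in isTRUE() inside a
condition. The sketch below shows this; the helper near_zero and its tolerance
are assumptions made for this example.

```r
all.equal(sin(pi), 0)        # TRUE
all.equal(1, 1.1)            # a character string describing the difference
isTRUE(all.equal(1, 1.1))    # FALSE: safe to use inside if ()

# near_zero is a hypothetical helper; the tolerance is an assumption.
near_zero <- function(x, tol = 1e-8) abs(x) < tol
near_zero(sin(pi))           # TRUE
```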

Hyperbolic Functions

R provides hyperbolic functions and their inverses, as shown below.

> x <- 1
> sinh(x)
[1] 1.175201
> cosh(x)
[1] 1.543081
> tanh(x)
[1] 0.7615942
> asinh(x)
[1] 0.8813736
> acosh(x)
[1] 0
> atanh(0)
[1] 0

Extreme Functions

It is a very common requirement to calculate the minimum and maximum of some
numbers. We can use min() and max().

> min(c(1, 2, 3))
[1] 1
> max(c(1, 2, 3))
[1] 3

We can also pass multiple vectors as input.

> min(c(1, 2, 3), c(3, 2, 1), c(2, 3, 4))
[1] 1
> max(c(1, 2, 3), c(3, 2, 1), c(2, 3, 4))
[1] 4

We can see that min() returns the minimal value among all the input vectors.
On the other hand, max() returns the maximal value.

If we want to obtain the maximal or minimal value at each position across the
vectors, we should use pmax() or pmin().

> pmin(c(1, 2, 3), c(3, 2, 1), c(2, 3, 4))
[1] 1 2 1
> pmax(c(1, 2, 3), c(3, 2, 1), c(2, 3, 4))
[1] 3 3 4

In the first example above, the pmin() function gives the minimal value among
all the elements at the 1st position, then the 2nd position, and finally the
3rd position of the vectors. These are called the parallel minima.

The twin function pmax() is used to find the parallel maxima. If any of the
vectors contains fewer elements than the others, the elements of the shorter
vector are recycled.

> pmin(c(1, 2, 3), c(3, 2, 2), c(0, 1))
[1] 0 1 0
Warning message:
In pmin(c(1, 2, 3), c(3, 2, 2), c(0, 1)) :
an argument will be fractionally recycled

One use case is given below.

Suppose you need to write a function that returns -5 if the input is less than
-5. If the input is between -5 and 5, it should return the input itself. If
the input is greater than 5, it should return 5.

new_func <- function(x){
pmin(5, pmax(-5, x))
}

> new_func(seq(-8,8))
[1] -5 -5 -5 -5 -4 -3 -2 -1 0 1 2 3 4 5 5 5 5

Finding Roots

One of the most commonly encountered tasks is finding the roots of an
equation. Suppose we want to find the roots of the following equation:

$x^2 + x - 2 = 0$

The roots are $x = -2$ and $x = 1$.

Using R, we can find the roots of a polynomial equation of the form

$p(x) = z_1 + z_2 x + \cdots + z_n x^{n-1}$

In R, we need to specify the polynomial coefficient vector from the zero-order
term to the term of the highest order present in the equation. In this case,
the vector c(-2, 1, 1) represents the coefficients in increasing order of
power. Hence, to find the roots of the equation $x^2 + x - 2 = 0$, we can use
the following.

> polyroot(c(-2, 1, 1))
[1] 1-0i -2+0i

The function always returns a complex vector in which each element has the
form a + bi. If we want the real roots only, we can use Re() to extract the
real parts of the complex roots:

> Re(polyroot(c(-2, 1, 1)))
[1] 1 -2

We can try a more complex equation, $x^3 - x^2 - 2x - 1 = 0$.

> r <- polyroot(c(-1, -2, -1, 1))
> r
[1] -0.5739495+0.3689894i -0.5739495-0.3689894i  2.1478990-0.0000000i

If we substitute the roots r for x in the polynomial:

> r ^ 3 - r ^ 2 - 2 * r - 1
[1] 8.881784e-16+1.110223e-16i 8.881784e-16+2.220446e-16i
8.881784e-16-4.188101e-16i

You may notice that the results are not exactly zero, but they are very close
to zero. If we are only interested in 8 digits of precision, we can use
round() to check whether the roots are valid.

> round(r ^ 3 - r ^ 2 - 2 * r - 1, 8)
[1] 0+0i 0+0i 0+0i
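
Because Re() also returns the real parts of genuinely complex roots, it can
help to keep only those roots whose imaginary part is numerically zero. The
following sketch does this for the cubic above; the tolerance 1e-8 and the
names real_roots and p are assumptions of this example.

```r
r <- polyroot(c(-1, -2, -1, 1))      # roots of x^3 - x^2 - 2x - 1

# Keep roots whose imaginary part is numerically zero.
# The tolerance 1e-8 is an assumption of this sketch.
real_roots <- Re(r[abs(Im(r)) < 1e-8])
real_roots                            # approximately 2.147899

# Evaluate the polynomial at the real root; the result should be close to 0.
p <- function(x) x^3 - x^2 - 2 * x - 1
abs(p(real_roots)) < 1e-8             # TRUE
```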

Derivatives

We can use D() to compute the derivative of an expression symbolically with
respect to a given variable.

For example, to find $\frac{d}{dx} x^2$, we can use the following.

> D(quote (x ^ 2), "x")
2 * x

Similarly, we can find $\frac{d}{dx} \sin(x) \cos(xy)$ as follows.

> D(quote(sin(x) * cos(x * y)), "x")
cos(x) * cos(x * y) - sin(x) * (sin(x * y) * y)

We have used the quote() function. This function keeps the expression
unevaluated, which lets us access the symbols as they are written.

The derivative is also an unevaluated expression. We can evaluate it by using
the function eval().

> z <- D(quote(sin(x) * cos(x * y)), "x")
> eval(z, list(x=1, y=2))
[1] -1.75514

In the example above, we have used both quote() and eval(). The quote()
function creates an expression object, whereas eval() evaluates a given
expression with the specified values of its symbols.
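
A symbolic derivative can be wrapped in an ordinary function and cross-checked
numerically. The sketch below does this with a central finite difference; the
names dfdx, dfdx_fun, f, and numeric_slope and the step size h are assumptions
made for this example.

```r
dfdx <- D(quote(sin(x) * cos(x * y)), "x")   # symbolic derivative, as in the text

# Wrap the expression in an ordinary function of x and y.
dfdx_fun <- function(x, y) eval(dfdx, list(x = x, y = y))
dfdx_fun(1, 2)                                # -1.75514, as above

# Cross-check with a central finite difference (step size h is an assumption).
f <- function(x, y) sin(x) * cos(x * y)
h <- 1e-6
numeric_slope <- (f(1 + h, 2) - f(1 - h, 2)) / (2 * h)
all.equal(dfdx_fun(1, 2), numeric_slope, tolerance = 1e-6)  # TRUE
```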

Integration

R supports numeric integration. In this case, we do not write an expression;
we have to provide a function, since this is not symbolic computation. R
provides a built-in function, integrate(), to solve such problems. For
example, to evaluate the following integral:

$\int_0^{\pi/2} \sin(x) \, dx$

> integ <- integrate(function(x) sin(x), 0, pi / 2)
> integ
1 with absolute error < 1.1e-14

The result looks like a numeric value. But it contains some other information
too because it is a list.

> str(integ)
List of 5
$ value : num 1
$ abs.error : num 1.11e-14
$ subdivisions: int 1
$ message : chr "OK"
$ call : language integrate(f = function(x) sin(x),
lower = 0, upper = pi/2)
- attr(*, "class")= chr "integrate"
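
Since the result is a list, its components can be accessed with $. The sketch
below extracts the estimate and the error bound, and also integrates the
standard normal density over the whole real line as a second example. The name
total is an assumption of this sketch.

```r
integ <- integrate(function(x) sin(x), 0, pi / 2)
integ$value          # 1: the numeric estimate on its own
integ$abs.error      # estimated upper bound on the absolute error

# Infinite limits are allowed: the standard normal density integrates to 1.
total <- integrate(dnorm, -Inf, Inf)
total$value          # approximately 1
```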

Using statistical functions

R is a very useful language for statistical computing and modeling. It
provides a variety of statistical functions, from statistical testing to
random sampling. In this section, we shall study some of the important
statistical functions.

Sampling from a vector

In statistics, to study a population, we need a random sample. We can use the
sample() function to draw a random sample from a given vector or list. By
default, sample() draws a sample without replacement.

The following example demonstrates how to draw a sample of five from a numeric
vector without replacement. Because the sample is random, the output differs
from run to run.

> sample(1:10, size=5)
[1] 10 8 9 2 3

If we want sampling with replacement, we can use replace = TRUE.

> sample(1:10, size=5, replace = TRUE)
[1] 9 1 2 8 1

We can also sample from character vectors.

> sample(letters, size=5)
[1] "y" "q" "h" "m" "r"

We can also sample from a list.

> sample(list(a=c(1,2,3,4,5), b=c('x','y','z'), c= c(TRUE,
+ FALSE), d=c(10.5, 12.4, 15.1, 11.7)), size=2)
$c
[1] TRUE FALSE

$b
[1] "x" "y" "z"

We can also draw a sample from any object using sample(), provided that it
supports subsetting with [ ].

R also supports weighted sampling. We can specify the probability of each
element with the prob argument.

> grade <- sample(c("A","B","C"), size =24, replace = TRUE,
+ prob = c(0.25, 0.5, 0.25))
> grade
[1] "B" "A" "A" "B" "C" "B" "B" "B" "C" "B" "B" "B" "A" "B"
"C" "C" "B" "B" "C" "A" "C" "B"
[23] "B" "B"

By using table(), we can find the number of occurrences of each value.

> table(grade)
grade
A B C
4 14 6
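
Random samples change on every run, which makes results hard to reproduce.
Setting a seed with set.seed() fixes the sequence of random draws. The sketch
below shows this and also checks the weighted sampling on a larger sample; the
seed value 42 and the sample size 10000 are arbitrary choices for this
example.

```r
set.seed(42)                       # 42 is an arbitrary seed chosen for this sketch
first  <- sample(1:10, size = 5)
set.seed(42)                       # resetting the seed replays the same draws
second <- sample(1:10, size = 5)
identical(first, second)           # TRUE

# Observed proportions of a weighted sample approach the given probabilities.
grade <- sample(c("A", "B", "C"), size = 10000, replace = TRUE,
                prob = c(0.25, 0.5, 0.25))
prop.table(table(grade))           # roughly 0.25, 0.50, 0.25
```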

Probability Distributions

On various occasions, we need to draw a sample from a probability distribution
instead of a vector. R provides a variety of built-in functions for
probability distributions. In this topic, we will learn basic statistical
tools for sampling from probability distributions. These tools work mainly
with numeric vectors.

Two of the most common probability distributions are the uniform distribution
and the normal distribution.

For the uniform distribution over [0, 1], we can use runif(n) to generate n
random numbers.

> runif(5)
[1] 0.6488628 0.7354542 0.6606280 0.8576631 0.9273368

To generate random numbers over a non-default interval, we can specify min and
max.

> runif(5, min=-1, max=1)
[1] 0.9350436 -0.8466224 0.4442535 -0.3423099 0.9919320

Another commonly used distribution is the normal distribution. We can use
rnorm() to generate random numbers from the standard normal distribution.

> rnorm(5)
[1] -1.53022446  1.77021591 -1.53184603 -0.73656058 -0.07438508

The interface of both random generator functions is the same. The first
argument of both runif() and rnorm() is n, the number of values to generate.
The rest of the arguments are the parameters of the distribution. The
parameters of a normal distribution are the mean and the standard deviation
(sd).

> rnorm(5, mean=5, sd = 0.5)
[1] 4.800469 6.216979 5.111336 4.445852 4.976657

R has functions for many probability distributions. Every probability


distribution has four functions. The root name such as norm in the case of the
normal distribution is prefixed by one of the letters.
134
This is a confidential publication. All rights reserved. This document may not, in a whole
or in part, be copied, reproduced, translated, photocopied, or reduced to any medium
without prior and express written consent from Samatrix Consulting Pvt Ltd.
p for probability, the cumulative distribution function (c.d.f.)
q for quantile, the inverse of the c.d.f.
d for density, the density function (p.d.f.)
r for random, which generates random values from the distribution

For a continuous distribution, the most useful functions are the p and q
functions. For a discrete distribution, we use the d function to calculate the
density, which in this case is a probability.

Distribution   Functions
Beta           pbeta    qbeta    dbeta    rbeta
Binomial       pbinom   qbinom   dbinom   rbinom
Chi-Square     pchisq   qchisq   dchisq   rchisq
Exponential    pexp     qexp     dexp     rexp
Gamma          pgamma   qgamma   dgamma   rgamma
Normal         pnorm    qnorm    dnorm    rnorm
Poisson        ppois    qpois    dpois    rpois
Student t      pt       qt       dt       rt
Uniform        punif    qunif    dunif    runif
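The four prefixes can be tried out on the normal and binomial distributions. This is a short sketch; the variable names p_val, q_val, d_val and r_val are chosen here only for illustration.

```r
# p: cumulative probability P(Z <= 1.96) for a standard normal variable
p_val <- pnorm(1.96)

# q: the quantile below which 97.5% of the distribution lies (inverse of p)
q_val <- qnorm(0.975)

# d: for a discrete distribution, the density is a probability;
# here, the probability of exactly 3 successes in 10 fair coin tosses
d_val <- dbinom(3, size = 10, prob = 0.5)

# r: five random draws from the same binomial distribution
r_val <- rbinom(5, size = 10, prob = 0.5)

c(p = p_val, q = q_val, d = d_val)
```

Note that pnorm() and qnorm() undo each other: qnorm(pnorm(1.96)) gives back 1.96.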

Summary Statistics

R provides a set of functions to calculate summary statistics such as the mean,
standard deviation, variance, median, range, minimum, maximum, and
quantiles. For multiple numeric vectors, we can also calculate the covariance
matrix and the correlation matrix.

To start with, let’s generate a random numeric vector of length 100. We will
use the standard normal distribution.

> x <- rnorm(100)

To calculate mean

> mean(x)
[1] -0.06842303

We can also calculate mean as follows

> sum(x)/length(x)
[1] -0.06842303

We can also calculate a trimmed mean by trimming a fraction of the
observations from each end of the input data.

> mean(x, trim=0.05)
[1] -0.07933614

A trimmed mean removes a small percentage of the largest and smallest
values before calculating the mean. Because it omits the most extreme values,
the trimmed mean is more robust to outliers than the ordinary mean.
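The effect of trimming is easiest to see on a small vector with one extreme value. The vector v below is a made-up example, not data from this chapter.

```r
# Nine ordinary values and one extreme value
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

mean(v)              # 14.5, pulled upwards by the value 100
mean(v, trim = 0.1)  # 5.5, the mean of 2, 3, ..., 9 after dropping 1 and 100
median(v)            # 5.5
```

With trim = 0.1 and ten observations, one value is dropped from each end before the mean is computed.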

To calculate median use median()

> median(x)
[1] -0.05230469

To calculate the standard deviation and variance, use sd() and var()

> sd(x)
[1] 1.085102

> var(x)
[1] 1.177447

To calculate extreme values, use min() and max()

> c(min(x), max(x))
[1] -2.486898  4.007746

We can also use range() to calculate extreme values

> range(x)
[1] -2.486898 4.007746

We can calculate the critical quantiles using quantile()

> quantile(x)
0% 25% 50% 75% 100%
-2.48689824 -0.85248359 -0.05230469 0.63106073 4.00774583

We can calculate more quantiles using the probs argument

> quantile(x, probs=seq(0,1,0.1))
         0%         10%         20%         30%         40%         50%         60%
-2.48689824 -1.26102234 -1.00358444 -0.71149067 -0.40031959 -0.05230469  0.18725574
        70%         80%         90%        100%
 0.53641061  0.72404199  1.44893509  4.00774583

To get the most commonly used summary statistics, we can use summary()

> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.48690 -0.85248 -0.05230 -0.06842 0.63106 4.00775

We can also use summary() for data frames.

> df <- data.frame(num = rnorm(100, mean=80, sd=10),
                   alph = sample(letters[1:4], 100, replace=TRUE))

> summary(df)
num alph
Min. : 50.91 Length:100
1st Qu.: 73.19 Class :character
Median : 79.66 Mode :character
Mean : 80.23
3rd Qu.: 87.11
Max. :115.23

Covariance and Correlation Matrix

Using two or more vectors, we can compute the covariance and correlation
matrix.

First, we create another vector that correlates with the vector x

> y <- 1.5 * x + 0.75 * rnorm(length(x))

To compute covariance between x and y

> cov(x,y)
[1] 1.630218

To compute the correlation between x and y

> cor(x,y)
[1] 0.8961332

We can also use the two functions with more than two vectors. For this
exercise, we create a new vector z of the same length as x by sampling from a
uniform distribution. The vector z does not depend on x or y. We use
cbind() to create a three-column matrix and then compute its covariance
matrix.

> z = runif(length(x))
> comb = cbind(x, y, z)
> cov(comb)
x y z
x 1.177446803 1.6302178 0.003800914
y 1.630217841 2.8106373 0.034256899
z 0.003800914 0.0342569 0.091151886

To compute the correlation matrix, we can use cor()

> cor(comb)
x y z
x 1.00000000 0.89613325 0.01160204
y 0.89613325 1.00000000 0.06768038
z 0.01160204 0.06768038 1.00000000
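Covariance and correlation are closely related: the correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on freshly generated vectors; the names a and b and the seed value are chosen only for this illustration.

```r
set.seed(1)
a <- rnorm(100)
b <- 1.5 * a + 0.75 * rnorm(100)

# Correlation computed by hand from the covariance
manual_cor <- cov(a, b) / (sd(a) * sd(b))
all.equal(manual_cor, cor(a, b))

# The covariance of a vector with itself is its variance
all.equal(cov(a, a), var(a))
```

This also explains why the diagonal of a correlation matrix is always 1, while the diagonal of a covariance matrix holds the variances.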

6. Working with Strings

String-related functions are very important in data analysis. In
this chapter, we will study

• Strings and Character vectors

• Manipulation of date/time objects and string representations
• Regular expressions to extract information from text

Strings and Character Vectors

We use character vectors to store text data. In the R programming language, a
character vector is not a vector of single characters, letters, or alphabet
symbols. It is a vector of strings. R has a variety of built-in functions to
deal with character vectors. We can perform vectorized operations on them so
that numerous string values can be processed in one step.

Printing Strings

The most basic string operation is to print the string. There are several ways in
which we can view the text in the console.

The simplest way is to type the string by using the quotation marks:

> "hello"
[1] "hello"

As we have seen in the previous sections, a character vector is a vector of
character values, or strings.

We can also store the string in a variable and print it by evaluating the
variable.

> str1 <- "hello"
> str1
[1] "hello"

But if we write a character value in a loop, it does not print anything at all

for (i in 1:3) {
"hello"
}

If an expression is typed in the console, its value is printed. But a for loop
does not return the values of the expressions evaluated inside it, so a value
inside a for loop is not printed automatically. The same applies inside a
function, as the following example shows

test1 <- function(x){
"hello"
x
}
> test1("world")
[1] "world"

In the example above, the function does not print hello but prints world.
When we call test1("world"), the function returns the value of its last
expression, x, which is "world". If we remove x from the function:

test2 <- function(x){
"hello"
}

> test2("world")
[1] "hello"

In this case, test2() will always return hello irrespective of the value of x.

But our objective is to print both values. We can use print() to solve this
problem.

> print(str1)
[1] "hello"

In this case, the character vector is printed with the index [1]. We can use
print() in a loop too.

for (i in 1:3){
print(str1)
}

[1] "hello"
[1] "hello"
[1] "hello"

This also works well inside a function.

test3 <- function(x){
print("hello")
x
}
> test3("world")
[1] "hello"
[1] "world"

If we want to print the text as a message not as a character vector with indices,
we can call cat() or message().

> cat("hello")
hello

The example below demonstrates how to print a statement.

> name <- "John"
> language <- "R"
> cat("Hello,", name, "- learner of", language)
Hello, John - learner of R

We can print another statement

> cat("Hello,", name, ", you are learning", language, ".")
Hello, John , you are learning R .

In the second statement, we can notice unnecessary spaces between the
arguments. The function cat() inserts a space character as a separator
between the input strings by default. We can change this by specifying the
sep= argument. In the example below, we have removed the default separator
and inserted the spaces manually.

> cat("Hello, ", name, ", you are learning ", language, ".", sep="")
Hello, John, you are learning R.

Alternatively, we can use the message() function, which does not insert space
separators by default. We need to write the spaces manually.

> message("Hello, ", name, ", you are learning ", language, ".")
Hello, John, you are learning R.

The message() function also ends the text with a new line, while cat() does
not. We can run two experiments to understand this further.

> for (i in 1:3) cat(letters[i])
abc

In the example above, the cat() function prints the input strings without
appending a new line. Due to this, all three letters are shown on the same
line.

> for (i in 1:3) message(letters[i])
a
b
c

In contrast, the message() function appends a new line to the input string.
Hence the three letters are printed on three separate lines.

If we want cat() to print each letter on a new line, we should explicitly add
a newline character to the input.

> for (i in 1:3) cat(letters[i],"\n")
a
b
c
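Since cat() accepts whole vectors, the loop is not strictly needed. A minimal sketch of the same output, using the sep= argument to place a newline between the elements:

```r
# cat() is vectorized: print every element, separated by newlines
cat(letters[1:3], sep = "\n")

# sep is placed only between elements, so end the last line explicitly
cat("\n")
```

Unlike the loop version with cat(letters[i], "\n"), this form does not leave a trailing space before each line break.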

Concatenating Strings

We use the paste() function to concatenate several character vectors. In this
function, a space is the default separator.

> paste("hello","world")
[1] "hello world"
> paste("hello","world", sep="-")
[1] "hello-world"

To avoid the separator, we can set sep="" or alternatively call paste0():

> paste0("hello","world")
[1] "helloworld"

A question arises: what is the difference between paste() and cat(), if
both of them concatenate strings? The function cat() only prints the result
to the console, whereas paste() returns the result as a value that can be
assigned to a variable. We can study the following examples.

> new <- cat("hello","world")
hello world
> new
NULL

In the example above, we can see that cat() prints the concatenated strings
but returns NULL.

> new1 <- paste("hello","world")
> new1
[1] "hello world"

In the example above, we can see that paste() returns the concatenated
string, which can then be assigned to a variable.

The difference between cat() and paste() is more visible while working with
multi-element character vectors.

> cat(c("A","B"), c("C","D"))
A B C D

The function cat() prints the elements of both vectors sequentially on one
line. The function paste(), in contrast, concatenates element-wise, as shown
below.

> paste(c("A","B"), c("C","D"))
[1] "A C" "B D"

The function paste() returns a character vector of two elements. However, if
we want to combine the result into one string, we can use collapse= to specify
how the elements are joined.

> paste(c("A","B"), c("C","D"), collapse=", ")
[1] "A C, B D"
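The roles of sep= and collapse= are easy to mix up. In the sketch below (the vector fruit_names is made up for illustration), sep= joins the corresponding elements of different arguments, and collapse= then joins the resulting elements into a single string.

```r
fruit_names <- c("apple", "banana", "pear")

# collapse= joins the elements of one vector into a single string
one_string <- paste(fruit_names, collapse = ", ")
one_string   # "apple, banana, pear"

# sep= joins across arguments first; collapse= then joins the results
labels <- paste("x", 1:3, sep = "_", collapse = " + ")
labels       # "x_1 + x_2 + x_3"
```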

Transforming Text

Changing case

On several occasions, changing the case becomes essential. We can use the
tolower() function to change text to lower case and toupper() to change it to
upper case.

Changing the case becomes important while writing functions. If a function
accepts a character input, the user may pass the input in any case. If we
change the case before evaluating the input parameter, the parameter
becomes case-insensitive.

calc <- function(type, x, y) {
  type <- tolower(type)
  if (type == "add") {
    x + y
  } else if (type == "times") {
    x * y
  } else {
    stop("Not supported type of command")
  }
}

> c(calc("add", 2, 3), calc("Add", 2, 3), calc("TIMES", 2, 3))
[1] 5 5 6
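When the number of supported commands grows, a chain of if/else branches becomes long. One possible alternative, sketched here under the hypothetical name calc2(), is to dispatch on the lower-cased input with switch(). The final unnamed argument acts as the default branch.

```r
calc2 <- function(type, x, y) {
  # Normalize the case once, then pick the matching branch
  switch(tolower(type),
         add   = x + y,
         times = x * y,
         stop("Not supported type of command"))
}

c(calc2("add", 2, 3), calc2("Add", 2, 3), calc2("TIMES", 2, 3))
```

The behaviour is intended to match calc() above: any spelling of "add" or "times" works, and every other input raises an error.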

Counting characters

We can use nchar() to count the number of characters in each element of a
character vector.

> nchar("Programming")
[1] 11

The function nchar() is also vectorized

> nchar(c("Learn","R","Programming"))
[1] 5 1 11

Trimming leading and trailing whitespace

We can use trimws() to trim the leading and trailing whitespace (including
spaces and tabs).

> trimws(c(" Learn "," R "," Programming "))
[1] "Learn" "R" "Programming"

By default, the function trims whitespace from both sides of the string.
We can use which= to specify which side of the string we want to trim.

> trimws(c(" Learn "," R "," Programming "), which="left")
[1] "Learn " "R " "Programming "
> trimws(c(" Learn "," R "," Programming "), which="right")
[1] " Learn" " R" " Programming"

Substring

We can extract parts of the texts in a character vector using substr().

Suppose we have a vector that includes several dates where months are
represented by three-letter abbreviations.

> dates <- c("Jan 6","Jun 30", "Sep 15")
> dates
[1] "Jan 6" "Jun 30" "Sep 15"

We can use the substr() function to extract the months

> substr(dates, 1, 3)
[1] "Jan" "Jun" "Sep"

We can use substr() together with nchar() to extract the day.

> substr(dates, 5, nchar(dates))
[1] "6" "30" "15"

We can also assign to substr() to replace the extracted part with a given
character vector

> substr(dates, 1, 3) <- c("Mar", "Jul", "Oct")
> dates
[1] "Mar 6" "Jul 30" "Oct 15"

Splitting Texts

The function substr() works very well if the lengths of the parts of the
strings are fixed. However, in many cases, such as person names, the parts are
not of fixed length, for example “Mary John” and “Tim Johnson”. In such
cases, we can use the function strsplit() to split texts by a separator such
as a space or a comma.

> class <- strsplit(c("Tim,34,USA","Travis,45,Germany","Pascal,23,France"),
                    split=",")
> class
[[1]]
[1] "Tim" "34" "USA"

[[2]]
[1] "Travis" "45" "Germany"

[[3]]
[1] "Pascal" "23" "France"
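The result of strsplit() is a list with one character vector per input string. A minimal sketch of how the pieces might be collected afterwards is given below; the names records, first_names, ages and people are hypothetical and used only for this illustration.

```r
records <- strsplit(c("Tim,34,USA", "Travis,45,Germany", "Pascal,23,France"),
                    split = ",")

# Pick the first and second piece of every record
first_names <- sapply(records, `[`, 1)
ages <- as.integer(sapply(records, `[`, 2))

# Stack the records row by row and convert to a data frame
people <- as.data.frame(do.call(rbind, records), stringsAsFactors = FALSE)
colnames(people) <- c("name", "age", "country")
people$age <- as.integer(people$age)
people
```

This works here because every record has exactly three parts. Records with a varying number of parts would need extra handling.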

We can use strsplit() to split the whole string into individual characters.
For this, we have to pass an empty string as the split argument

> strsplit("Analytics", split="")
[[1]]
[1] "A" "n" "a" "l" "y" "t" "i" "c" "s"

Formatting Text

To build a formatted string from a set of values, we use the sprintf()
function.

The syntax of the sprintf() function is as follows.

sprintf(format, values)

Here, format is a string that describes how the values should be formatted,
and values are the values to be inserted.

To format a number with the default number of decimal places (six
digits after the decimal point), we can use the following.

> x <- 123.456
> sprintf("%f",x)
[1] "123.456000"

In the example above, %f is the placeholder for a numeric value in fixed-point
notation.

We can add a point and a number between the percentage sign and the f. To
round the numeric input value to two digits after the decimal point, use the
following.

> sprintf("%.2f",x)
[1] "123.46"
We can also print only the digits before the decimal point, without any
decimal places

> sprintf("%1.0f",x)
[1] "123"

The number before the point sets the minimum width of the output. With a
width of 10 and no decimal places, the number is padded with leading blanks
to a total of 10 characters

> sprintf("%10.0f",x)
[1] "       123"

Format                    Output
sprintf("%s", "A")        "A"
sprintf("%d", 10)         "10"
sprintf("%04d", 10)       "0010"
sprintf("%f", pi)         "3.141593"
sprintf("%.2f", pi)       "3.14"
sprintf("%1.0f", pi)      "3"
sprintf("%8.2f", pi)      "    3.14"
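Like most string functions in R, sprintf() is vectorized, and several placeholders can be combined in one format string. The sketch below uses made-up vectors students and scores. A literal percent sign is written as %%.

```r
students <- c("A", "B")
scores <- c(91.256, 78)

# One output string per element; %% prints a literal percent sign
report <- sprintf("%s scored %.1f%% (id %03d)", students, scores, 1:2)
report
```

The result is a character vector with one formatted string for each student.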

Formatting date/time
In data analysis, we encounter date and time data types very often. The
simplest function for dates is Sys.Date() and for times is Sys.time(). The
function Sys.Date() returns the current date. The function
Sys.time() returns the current date and time

> Sys.Date()
[1] "2020-01-07"

> Sys.time()
[1] "2020-01-07 16:58:39 IST"

The output above may suggest that dates and times are character vectors, but
they are not. This can be verified with the following command

> as.numeric(Sys.Date())
[1] 18268

The above statement gives the numeric value of the date relative to
1970-01-01, that is, the number of days elapsed since 1970-01-01.

Parsing text as date/time

We can create a date from a standard text representation

> my_date <- as.Date("2020-01-07")
> my_date
[1] "2020-01-07"

A question arises: if a date can be represented as a string, why do we need
a Date object? The Date object supports date arithmetic. We can add or
subtract a number of days from a date to get a new date.

> my_date + 7
[1] "2020-01-14"
> my_date - 80
[1] "2019-10-19"

We can subtract one date from another to get the number of days between
the two dates

> date1 <- as.Date('2019-01-01')
> date2 <- as.Date('2018-07-01')
> date1 - date2
Time difference of 184 days

Although this looks like a message, the result is a difftime object that holds
a numeric value. We can get the numeric value explicitly using as.numeric().

> as.numeric(date1-date2)
[1] 184
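The difference between two dates can also be expressed in other units. A short sketch using difftime() with the same two dates; the name gap_weeks is chosen only for illustration.

```r
date1 <- as.Date("2019-01-01")
date2 <- as.Date("2018-07-01")

# Express the same gap in weeks instead of days
gap_weeks <- difftime(date1, date2, units = "weeks")
as.numeric(gap_weeks)    # 184 / 7

# as.numeric() can convert a difftime to another unit directly
gap_hours <- as.numeric(date1 - date2, units = "hours")
gap_hours                # 184 * 24
```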

Date/time values work in a similar way. However, R does not have a function
called as.Time(). We can use either as.POSIXct() or as.POSIXlt() to create a
date/time from its text representation. The two functions are different
implementations of date/time. An example using as.POSIXlt() is given
below.

> my_time <- as.POSIXlt("2020-01-07 14:45:17")
> my_time
[1] "2020-01-07 14:45:17 IST"

We can perform addition and subtraction on date/time values. The unit is one
second.

> my_time + 10
[1] "2020-01-07 14:45:27 IST"

> my_time + 12345
[1] "2020-01-07 18:11:02 IST"

> my_time - 123456
[1] "2020-01-06 04:27:41 IST"
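The difference between the two implementations can be inspected directly. In the sketch below, the time zone is fixed to "UTC" (an assumption made here so that the result does not depend on the local time zone), and the names ct and lt are chosen only for illustration. A POSIXct value is stored as a single number of seconds since 1970-01-01, whereas a POSIXlt value is stored as a list of components.

```r
ct <- as.POSIXct("2020-01-07 14:45:17", tz = "UTC")
lt <- as.POSIXlt(ct)

# POSIXct: one number (seconds since 1970-01-01 00:00:00 UTC)
is.numeric(unclass(ct))

# POSIXlt: individual components that can be accessed by name
lt$hour          # 14
lt$min           # 45
lt$sec           # 17
lt$year + 1900   # years are stored relative to 1900
lt$mon + 1       # months are stored as 0-11
```

POSIXct is usually the more compact choice for columns of a data frame, while POSIXlt is convenient when individual components such as the hour are needed.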

We have seen that to perform calculations, we need to convert the string
representation of a date/time to a Date or date/time object. However,
sometimes the date/time in raw data is not in a format that as.Date() or
as.POSIXlt() can recognize. For example, if the input is 2017.05.21, the
as.Date() function will return an error.

> as.Date('2017.05.21')
Error in charToDate(x) :
character string is not in a standard unambiguous format

In such a case, a format string can be used to tell the as.Date() function
how to parse the string to a date.

> as.Date('2017.05.21', format='%Y.%m.%d')
[1] "2017-05-21"

Similarly, we can use a format string as a template for as.POSIXlt().

> as.POSIXlt('21/05/2017 10:09:16',format='%d/%m/%Y %H:%M:%S')
[1] "2017-05-21 10:09:16 IST"

We can also use strptime(), which is a more direct method to convert a string
to a date/time.

> strptime('21/05/2017 10:09:16',format='%d/%m/%Y %H:%M:%S')
[1] "2017-05-21 10:09:16 IST"

For character input, as.POSIXlt() relies on strptime() to do the parsing. In
the case of strptime(), however, we have to supply the format string every
time, whereas as.POSIXlt() does not need a format string for standard
formats.

Dates and date/times are also vectors. We can pass a character vector and
get a vector of dates.

> as.Date(c('2019-12-01', '2020-01-01'))
[1] "2019-12-01" "2020-01-01"

The arithmetic is also vectorized. We can add consecutive integers to a
date to get consecutive dates.

> as.Date('2020-01-01') + 1:3
[1] "2020-01-02" "2020-01-03" "2020-01-04"

We can apply the same feature to date/time objects.

> strptime('21/05/2017 10:09:16',format='%d/%m/%Y %H:%M:%S') + 1:3
[1] "2017-05-21 10:09:17 IST" "2017-05-21 10:09:18 IST"
[3] "2017-05-21 10:09:19 IST"

On several occasions, dates are stored in a compact numerical representation.
To parse 20190610, we use the following code.

> as.Date('20190610','%Y%m%d')
[1] "2019-06-10"

We can parse the numerical representation of the date/time object as follows.

> strptime('20190610042949','%Y%m%d%H%M%S')
[1] "2019-06-10 04:29:49 IST"

Formatting date/time to strings

In this section, we will learn the functions that convert date and date/time
objects to strings. These functions use a format template.

To convert a date to a string in the standard representation, we can use
as.character():

> txt_date <- as.character(my_date)
> txt_date
[1] "2020-01-07"

Even though the output looks the same, it is plain text. It does not support
date calculations.

> txt_date + 1
Error in txt_date + 1 : non-numeric argument to binary operator

We can also format the date in a non-standard way.

> as.character(my_date, format='%Y.%m.%d')
[1] "2020.01.07"

We can also get the same result using format(). The function
as.character() calls the format() function behind the scenes.
Hence it is recommended to use the format() function directly.

> format(my_date, '%Y.%m.%d')
[1] "2020.01.07"
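The same conversion specifications that are used for parsing can be used for output. A few more hedged examples follow; the variable d is introduced only for this sketch, and specifications that print month or weekday names are avoided because their output depends on the locale.

```r
d <- as.Date("2020-01-07")

format(d, "%d/%m/%Y")        # day/month/year: "07/01/2020"
format(d, "%Y")              # four-digit year: "2020"
format(d, "%j")              # day of the year, zero-padded: "007"

# format() returns text, so convert explicitly when a number is needed
as.integer(format(d, "%m"))  # 1
```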

Using regular expressions

While working on a data analysis problem, you may get data in various
formats. Most of the time, the data is well organized. An example is given
below

id,name,score
1,A,20
2,B,30
3,C,25

In R, we use read.csv() to import a CSV file as a data frame with the
right header and data types.

However, not every data file is well organized. It is challenging to deal with
poorly organized data. We can use built-in functions such as read.table()
and read.csv(), but they may not give us the desired results for such
unstructured data.

For example, suppose we want to analyze raw data (fruits.txt) that describes
the number or status of some fruits.

apple: 20
orange: missing
banana: 30
pear: sent to Jerry
watermelon: 2
blueberry: 12
strawberry: sent to James

Our requirement is to pick out all the fruits with a number instead of status
information. First of all, we should distinguish between fruits with numbers
and fruits without numbers. We need to distinguish the text that matches a
pattern from the ones that do not. We can use regular expressions for this
problem.

We can use regular expressions by following two steps. The first step is to find
a pattern to match the text. The second step is to group the patterns to extract
the required information.

Finding a string pattern

To solve the fruits problem, we need to find a pattern that identifies the
required lines. In this case, we need to extract all the lines that start
with a word followed by a colon and a space, and that end with an integer
rather than words or other symbols.

Regular expressions provide a set of symbols to describe such patterns. We
can describe the preceding pattern using ^\w+:\s\d+$. In this pattern, we
have used meta-symbols to represent classes of characters.

• ^: Matches the beginning of the line
• \w: Matches a word character
• \s: Matches a whitespace character
• \d: Matches a digit character
• $: Matches the end of the line
• \w+: Matches one or more word characters
• :: The literal symbol that we want to see after the word
• \d+: Matches one or more digit characters

We need to select the lines that match a pattern like abc: 123 while ignoring
the others. We can use the function grep() to get the indices of the lines
that match the pattern.

> fruits <- readLines("data/fruits.txt")
> pat_match <- grep("^\\w+:\\s\\d+$", fruits)
> pat_match
[1] 1 3 5 6

Please note that in an R string, the backslash is itself an escape character,
so we write \\ to represent a single \ in the pattern. Now we can filter
fruits by pat_match

> fruits[pat_match]
[1] "apple: 20"     "banana: 30"    "watermelon: 2" "blueberry: 12"

Thus we have segregated the desirable lines from undesirable ones.

In the above example, we have specified a pattern that starts with ^ and ends
with $ in order to avoid partial matching. By default, a regular expression
performs partial matching. This means that if any part of the string matches
the pattern, the whole string is considered to match. For example, the
following code determines which strings match each of two patterns.

> grep("\\d", c("xyz", "d657", "345", "9"))
[1] 2 3 4

> grep("^\\d$", c("xyz", "d657", "345", "9"))
[1] 4

The first pattern is an example of partial matching: it matches every string
that contains at least one digit. The second pattern, which uses ^ and $,
matches only the string that consists of exactly one digit.
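When a logical result is more convenient than a vector of indices, grepl() can be used with the same patterns. The sketch below also adds the quantifier + inside the anchors, so that strings made up entirely of digits match regardless of their length; the vector strings is the same test input as above.

```r
strings <- c("xyz", "d657", "345", "9")

# Partial match: TRUE for every string containing at least one digit
has_digit <- grepl("\\d", strings)
has_digit    # FALSE  TRUE  TRUE  TRUE

# Anchored match: TRUE only if the whole string consists of digits
all_digits <- grepl("^\\d+$", strings)
all_digits   # FALSE FALSE  TRUE  TRUE

strings[all_digits]
```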

Using group to extract data

In the pattern strings, we can mark groups using parentheses. In the fruits
problem, we can mark the groups by modifying the pattern to
(\w+):\s(\d+). In this case, we have marked two groups. One is fruit name
(\w+) and another one is the number of fruits (\d+).

For such problems, we will use the stringr package. Even though R has
built-in functions to solve such problems, the stringr package is easier to
use and more efficient. We will call the function str_match() with the updated
group pattern.

> library(stringr)
> match <- str_match(fruits, "^(\\w+):\\s(\\d+)$")
> match
[,1] [,2] [,3]
[1,] "apple: 20" "apple" "20"
[2,] NA NA NA
[3,] "banana: 30" "banana" "30"
[4,] NA NA NA
[5,] "watermelon: 2" "watermelon" "2"
[6,] "blueberry: 12" "blueberry" "12"
[7,] NA NA NA

This time, we get a matrix with more than one column. The groups in the
parentheses have been extracted from the text and placed in columns 2 and 3.
Now we can transform the character matrix to a data frame with the right
header and data types.

# transform to data frame
df_fruits <- data.frame(na.omit(match[, -1]), stringsAsFactors=FALSE)
# add a header
colnames(df_fruits) <- c("fruit","quantity")
# convert type of quantity from character to integer
df_fruits$quantity <- as.integer(df_fruits$quantity)

Now we get the data frame df_fruits that has the right header and data types.

> df_fruits
fruit quantity
1 apple 20
2 banana 30
3 watermelon 2
4 blueberry 12
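For comparison, the same extraction can be sketched with the built-in functions regexec() and regmatches(), without the stringr package. The fruit lines are typed in directly here instead of being read from data/fruits.txt, and the names m and df_base are hypothetical.

```r
fruits <- c("apple: 20", "orange: missing", "banana: 30",
            "pear: sent to Jerry", "watermelon: 2", "blueberry: 12",
            "strawberry: sent to James")

# For each line: the full match followed by the two groups,
# or an empty character vector if the line does not match
m <- regmatches(fruits, regexec("^(\\w+):\\s(\\d+)$", fruits))

# Keep only the lines that matched
m <- m[lengths(m) > 0]

df_base <- data.frame(
  fruit = vapply(m, `[`, character(1), 2),
  quantity = as.integer(vapply(m, `[`, character(1), 3)),
  stringsAsFactors = FALSE
)
df_base
```

The result should contain the same four rows as df_fruits above.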

7. Working with Data

In this chapter, we will cover the following topics:

• Reading and Writing data
• Visualizing data with plot functions
• Analysing data with simple statistical models and data modelling tools

Reading and writing data

A typical data analysis project starts with loading the data. It means that we
need to import a data set into the R environment. Before we load the data, we
need to check the type of the data file and then use an appropriate tool to
read the data.

Reading and writing data to text format file

The most commonly used data file type is the CSV file. The first line of a
typical CSV file is the header with the column names. Each subsequent line
represents a data record with columns separated by commas. Here is an example
of a CSV file.

Name,Gender,Age,Major
John,Male,24,Finance
Amily,Female,25,Statistics
Jessie,Female,23,Computer Science

Importing data via RStudio

You can import the data using RStudio by navigating to File | Import Dataset |
From Text (base). You then choose a local file in a text format, such as .csv
or .txt.

You should check Strings as factors only if you want to convert the string
columns to factors.

The file importer translates the file path and options to R code. After
setting the parameters, you can click on Import. It will call the read.csv()

function. The interactive tool is very handy and helps you avoid several
mistakes.

Importing data using built-in Functions

The function readLines() can be used to read a text file. It returns the
lines of the file as a character vector

> readLines("data/student.txt")
[1] "Name,Gender,Age,Major"             "John,Male,24,Finance"
[3] "Amily,Female,25,Statistics"        "Jessie,Female,23,Computer Science"

By default, the function reads all the lines of the file. We can also preview
only the first two lines.

> readLines("data/student.txt", n=2)
[1] "Name,Gender,Age,Major" "John,Male,24,Finance"

The function readLines() is often too simple because it reads each line as a
string instead of parsing the lines into a data frame. To import the data from
a CSV file, we can call read.csv().

> student <- read.csv("data/student.csv")
> str(student)
'data.frame': 3 obs. of 4 variables:
$ Name : chr "John" "Amily" "Jessie"
$ Gender: chr "Male" "Female" "Female"
$ Age : int 24 25 23
$ Major : chr "Finance" "Statistics" "Computer Science"

We can customize the import using several arguments. We can use
colClasses= to explicitly specify the types of the columns.

> student <- read.csv("data/student.csv", colClasses =
                      c("character", "factor", "integer", "character"))
> str(student)
'data.frame': 3 obs. of 4 variables:
$ Name : chr "John" "Amily" "Jessie"
$ Gender: Factor w/ 2 levels "Female","Male": 2 1 1
$ Age : int 24 25 23
$ Major : chr "Finance" "Statistics" "Computer Science"

We can use col.names= to explicitly specify the names of the columns.

> student <- read.csv("data/student.csv", col.names =
    c("FirstName", "Sex", "Age", "Subject"))
> str(student)
'data.frame': 3 obs. of 4 variables:
$ FirstName: chr "John" "Amily" "Jessie"
$ Sex : chr "Male" "Female" "Female"
$ Age : int 24 25 23
$ Subject : chr "Finance" "Statistics" "Computer Science"

CSV files use a comma (,) to separate columns and a new line to separate rows.
If the file you are trying to import is in tab-delimited format, you can use
read.table() with sep = "\t", or the read.delim() wrapper.
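As a minimal sketch of reading a tab-delimited file, the example below writes a tiny file to a temporary location and reads it back in two ways. The file contents and the names tsv_path, student_tab, and student_tab2 are made up for illustration.

```r
# Write a small tab-delimited file to a temporary location (illustrative data).
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("Name\tGender\tAge",
             "John\tMale\t24",
             "Amily\tFemale\t25"),
           tsv_path)

# read.table() needs to be told about the header row and the separator.
student_tab <- read.table(tsv_path, header = TRUE, sep = "\t")

# read.delim() is a wrapper that uses header = TRUE and sep = "\t" by default.
student_tab2 <- read.delim(tsv_path)

str(student_tab)
all.equal(student_tab, student_tab2)
```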

Importing data using the readr package

The functions read.* have some inconsistencies. Instead, we can use the readr
package to import tabular data in a fast and consistent manner.

We can install the package by running install.packages("readr"). Then
you can use a family of read_* functions to import tabular data.

The readr package provides read_table() and read_csv(). These are
analogous to the built-in R functions read.table() and read.csv(). The readr
functions are often much faster than their base R analogues.

A typical call to readr is as follows:

> library(readr)
> student1 <- read_csv("data/student.csv")

── Column specification ──────────────────────────────────────────────
cols(
  Name = col_character(),
  Gender = col_character(),
  Age = col_double(),
  Major = col_character()
)

> student1
# A tibble: 3 x 4
  Name   Gender   Age Major
  <chr>  <chr>  <dbl> <chr>
1 John   Male      24 Finance
2 Amily  Female    25 Statistics
3 Jessie Female    23 Computer Science

By default, the read_csv() function opens the CSV file and reads it line by
line. It also inspects the first rows of the table (up to 1,000 by default) in
order to guess the type (integer, character, and so on) of each column. You can
specify the type of each column with the col_types argument.

It is recommended to specify the column types explicitly to rule out any
possible guessing errors on the part of read_csv(). Moreover, when the column
types are specified, an unexpected change in the format of the dataset shows up
as a parsing warning instead of going unnoticed.

> student2 <- read_csv("data/student.csv", col_types = 'ccdc')
> student2
# A tibble: 3 x 4
Name Gender Age Major
<chr> <chr> <dbl> <chr>
1 John Male 24 Finance
2 Amily Female 25 Statistics
3 Jessie Female 23 Computer Science

Here col_types = 'ccdc' indicates that the data type of the first, second,
and fourth columns is character, and the data type of the third column is
double.
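The compact string is convenient, but the column types can also be spelled out by name with cols(). The sketch below is one way to do it, assuming the readr package is installed (it is skipped otherwise). The temporary file and the name student_spec are illustrative.

```r
# Sketch only: the body runs when the readr package is installed.
if (requireNamespace("readr", quietly = TRUE)) {
  csv_path <- tempfile(fileext = ".csv")
  writeLines(c("Name,Gender,Age,Major",
               "John,Male,24,Finance",
               "Amily,Female,25,Statistics"),
             csv_path)

  # Same idea as col_types = 'ccdc', but with named column specifications;
  # Age is read as an integer instead of a double here.
  student_spec <- readr::read_csv(
    csv_path,
    col_types = readr::cols(
      Name   = readr::col_character(),
      Gender = readr::col_character(),
      Age    = readr::col_integer(),
      Major  = readr::col_character()
    )
  )
  print(sapply(student_spec, class))
}
```

Naming the columns makes the specification independent of the column order in the file.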

We can limit the number of rows to import using the n_max argument.

> student3 <- read_csv("data/student.csv", col_types = 'ccdc',
    n_max=1)
> student3
# A tibble: 1 x 4
Name Gender Age Major
<chr> <chr> <dbl> <chr>
1 John Male 24 Finance

The read_csv() function can also read compressed files (for example, files
ending in .gz, .bz2, .xz, or .zip) automatically.

Reading and writing Excel worksheets

An Excel workbook is another common format for storing tabular data. R does not
provide any built-in function to read an Excel workbook, but several R packages,
such as readxl (https://github.com/hadley/readxl), are available to work with
Excel worksheets. You can install the readxl package from CRAN using
install.packages("readxl").

> library(readxl)
> price <- read_excel("data/price.xlsx")
> price
# A tibble: 6 x 3
Date Price Growth
<dttm> <dbl> <dbl>
1 2020-01-03 00:00:00 136 NA
2 2020-02-03 00:00:00 138 0.0147
3 2020-03-03 00:00:00 137 -0.00725
4 2020-04-03 00:00:00 130 -0.0511
5 2020-05-03 00:00:00 139 0.0692
6 2020-06-03 00:00:00 140 0.00719

We can see that read_excel() translates the dates in Excel to dates in R
automatically. We can also notice that read_excel() preserves the
missing value in the Growth column.

Another package that we can use while working with Excel is openxlsx. We
can use this package to read, write, and edit XLSX files. Hence openxlsx is
more comprehensive than readxl. You can install this package using the
install.packages("openxlsx") command.

With openxlsx, we can use read.xlsx() to read data in the XLSX files into a
data frame just like read_excel() from readxl.

> library(openxlsx)
> price1 <- read.xlsx("data/price.xlsx", detectDates = TRUE)
> price1
Date Price Growth
1 2020-01-03 136 NA
2 2020-02-03 138 0.014705882
3 2020-03-03 137 -0.007246377
4 2020-04-03 130 -0.051094891
5 2020-05-03 139 0.069230769
6 2020-06-03 140 0.007194245
We used detectDates = TRUE to ensure that date values are imported
correctly. Otherwise, the dates are imported as numbers. We can also use
write.xlsx() to write the data frame to a workbook.

openxlsx::write.xlsx(price1, "data/price1.xlsx")

Reading and writing native data files

CSV files and Excel workbooks are non-native data formats to R. Hence, there
is a gap between the original data object and the output file. If we export a
data frame with columns of different data types to a CSV file, the information
about the column types is discarded. Numeric, string, and date columns
are all represented as text.

If the portability of the data is not an issue and you want to use only R to
work with the data, you can use the native formats to read and write data. The
native formats let you save an object in a file and recover exactly the same
object without worrying about data type issues.

R has its own data file format that uses the .rds extension. We can use the
saveRDS() function to write a single R object to a file and the readRDS()
function to read it back.

dat <- readRDS("ACS.rds")

An .rds file is usually smaller than the equivalent text file and hence takes up
less storage space. The .rds file format also preserves data types and classes
such as factors and dates, eliminating the need to redefine data types after
loading the file.
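The difference between the native and the text format can be checked with a small round trip. This is a sketch on made-up data. The data frame df and the temporary paths are assumptions for illustration only.

```r
# A small data frame with a factor and a Date column (illustrative data).
df <- data.frame(
  id     = 1:3,
  group  = factor(c("a", "b", "a")),
  joined = as.Date(c("2020-01-03", "2020-02-03", "2020-03-03"))
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

# Native round trip: the column classes survive.
saveRDS(df, rds_path)
from_rds <- readRDS(rds_path)
sapply(from_rds, class)

# CSV round trip: the factor and Date columns come back as plain text.
write.csv(df, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
sapply(from_csv, class)
```

After the CSV round trip, the group and joined columns would have to be converted again with factor() and as.Date().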

Loading built-in datasets

R has a great number of built-in datasets. We can load and use them
easily. The built-in datasets are mostly data frames and come with detailed
documentation.

The most famous built-in R datasets are iris and mtcars. You can use ?iris
and ?mtcars to read the description of the datasets. You can get more
information about the dataset from the description.

We can use the built-in datasets directly because these datasets are
immediately available once R is ready.

You can view the first six rows of iris.

> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

We can access the structure as follows.

> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

You can also print iris to see the whole data frame. We can also use
View(iris) to view data in a grid pane.
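Besides head() and str(), a few one-line checks give a quick feel for a built-in dataset. The following is a small sketch that uses only base R functions on iris.

```r
# To list all datasets shipped with the 'datasets' package interactively:
# data(package = "datasets")

dim(iris)                   # number of rows and columns
levels(iris$Species)        # the species stored as a factor
table(iris$Species)         # observations per species
summary(iris$Sepal.Length)  # five-number summary plus the mean
```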

Similarly, we can view the first six rows of mtcars and see its structure.

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

> str(mtcars)
'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17 18.6 19.4 17 ...
 $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

Visualizing the Data

In the previous chapter, we discussed a number of methods to import data,
which is the first step in most data analysis. Before we feed the data to any
model, we need to look at it. Each machine learning model has its own
strengths, and no single model works best for every dataset. Hence,
before fitting a model to any data, we need to visualize the data to analyse its
patterns.
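One way to see why plotting comes before modelling is Anscombe's quartet, which ships with R as the anscombe data frame. This example is an addition for illustration and is not part of the course data. The names y_means and xy_cor are arbitrary. The four x/y pairs have nearly the same summary statistics but look very different when plotted.

```r
# anscombe ships with R: four x/y pairs (x1..x4, y1..y4).
y_means <- sapply(anscombe[, c("y1", "y2", "y3", "y4")], mean)
xy_cor  <- mapply(cor, anscombe[, 1:4], anscombe[, 5:8])
round(y_means, 2)
round(xy_cor, 2)

# The summaries are nearly identical, but the plots are not.
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Anscombe set", i))
}
par(old_par)
```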

For this chapter, we will use the nycflights13 package. We can install the
package using the following command:

install.packages("nycflights13")

Creating scatter plots

The function plot() is the basic R function to visualize data. By providing a
numeric or integer vector to plot(), we can produce a scatter plot of value by
index. We can draw a scatter plot of 10 points in increasing order as follows:

plot(1:10)

We can generate two linearly correlated random numeric vectors to create a
more realistic scatter plot.

x <- rnorm(200)
y <- 2*x + rnorm(200)
plot(x,y)

As a result, we get the following plot.

Customize Chart Elements

We can customize several chart elements such as the title (main or title()), the
label of the x-axis (xlab), the label of the y-axis (ylab), the range of the
x-axis (xlim), and the range of the y-axis (ylim).

plot(x, y,
main = "Correlated Random numbers",
xlab = "x", ylab = "2x + noise",
xlim = c(-4, 4), ylim = c(-6, 6))

We can specify the chart title by either the main argument or a separate
title() function call. The following code will plot the same chart as given
above.

plot(x, y,
     xlab = "x", ylab = "2x + noise",
     xlim = c(-4, 4), ylim = c(-6, 6))
title("Correlated Random numbers")

Customize point style

For a scatter plot, the default point style is a circle. We can specify the pch
argument (plotting character) to change the point style. R provides 26 point
styles, numbered 0 to 25.

plot(0:25, 0:25, pch = 0:25,
     xlim = c(-1, 26), ylim = c(-1, 26),
     main = "Point styles (pch)")
text(0:25+1, 0:25-1, 0:25)

In the preceding code, we created a scatter plot that includes all the point
styles and printed the corresponding pch number beside each one. First, we
created a simple scatter plot using plot(), then printed the pch numbers using
text().

We can draw a scatter plot with a non-default point style by setting
pch = 17.

x <- rnorm(200)
y <- 2*x + rnorm(200)
plot(x,y, pch = 17,
main = "Scatter plot with pch = 17")

We can also distinguish two groups of points by a logical condition. Because
pch is vectorized, we can use ifelse() to specify the point style of
each observation based on a condition. The following example applies
pch = 17 to the points satisfying x * y > 1 and pch = 1 to all other points.

x <- rnorm(200)
y <- 2*x + rnorm(200)
plot(x,y,
pch = ifelse(x * y > 1, 17, 1),
main = "Scatter plot with conditional pch")

A plot containing two separate datasets sharing the same x-axis can be drawn
using plot() and points(). In the previous example, a normally distributed
vector x and a linearly correlated random vector y were generated. For this
example, we will generate another random vector, z, that has a non-linear
relationship with x. We then plot both y and z against x, using a different
point style for each:

x <- rnorm(75)
y <- 1.5*x + rnorm(75)
z <- sqrt(1 + x ^ 2) + rnorm(75)
plot(x, y, pch = 1,
xlim = range(x), ylim = range(y, z),
xlab = "x", ylab = "value")
points(x, z, pch = 17)
title("Scatter plot with two datasets")

In the preceding example, we first created the datasets x, y, and z. Then we
created a plot of x and y, and added another group of points, z, with a
different pch. We specified ylim = range(y, z) to ensure that
the plot covers the range of both y and z. The points() function does not
extend the axes created by plot(), so any point beyond the axis
range is not drawn. By specifying ylim = range(y, z), we have ensured
that all the points in y and z are shown in the plot area.
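The effect of ylim = range(y, z) can be checked numerically before drawing anything. The sketch below counts how many values of z lie outside the range of y. The seed and the name outside are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
x <- rnorm(75)
y <- 1.5 * x + rnorm(75)
z <- sqrt(1 + x ^ 2) + rnorm(75)

# Values of z beyond the range of y may fall outside the plotting region
# if the axes are built from y alone.
outside <- z < min(y) | z > max(y)
sum(outside)

# range() of several vectors returns the overall minimum and maximum.
range(y, z)

plot(x, y, pch = 1, ylim = range(y, z), xlab = "x", ylab = "value")
points(x, z, pch = 17)
```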

Customizing the point colors

We can specify the point color by setting the col argument of plot():

x <- rnorm(75)
y <- 1.5*x + rnorm(75)
plot(x, y, pch = 15, col = "blue", main = "Blue Color Scatter Plot")

Different colors can be applied to points that belong to different
categories or satisfy certain conditions.

plot(x, y, pch = 16,
     col = ifelse(y >= mean(y), "red", "green"),
     main = "Scatter plot - conditional colors")

We can use col to distinguish different groups of points while plotting two
different datasets using plot() and points().

# z is regenerated here so that it matches the current x
z <- sqrt(1 + x ^ 2) + rnorm(75)
plot(x, y, col = "blue", pch = 0,
     xlim = range(x), ylim = range(y, z),
     xlab = "x", ylab = "value")
points(x, z, col = "red", pch = 1)
title("Scatter plot with two datasets in different color")

R supports 657 colors in total. You can call the function colors() to get the
list of all the colors supported by R.
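The list returned by colors() can be searched like any other character vector. The following sketch looks up the named colors that contain "blue" and shows how a color name relates to its RGB components. The choice of "steelblue" is only an example.

```r
all_colors <- colors()
length(all_colors)

# Find named colors that contain "blue".
blue_names <- grep("blue", all_colors, value = TRUE)
head(blue_names)

# Convert a color name to its red/green/blue components.
col2rgb("steelblue")

# Colors can also be given as hexadecimal strings built with rgb().
rgb(70, 130, 180, maxColorValue = 255)
```

Both the color names and the hexadecimal strings are accepted by the col argument.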

Creating line plots

In several data analysis problems, such as time series analysis, we use line
plots to show the trend and variation over time. To draw a line plot, we
set type = "l" when calling plot().

t <- 1:50
y <- 2.5 * sin(t * pi / 60) + rnorm(t)
plot(t, y, type = "l", main = "Line plot")

Line Type and Width

For a line plot, we can use lty to specify the line type. It is
similar to pch for a scatter plot. A preview of the six line types that R
supports is produced by the following code.

lty_val <- 1:6
plot(lty_val, axes = FALSE, ann = FALSE, type = "n")
abline(h = lty_val, lwd = 2, lty = lty_val)
mtext(lty_val, at = lty_val, side = 2)
title("Line types (lty)")

In the preceding code, we have used the parameter type = "n" to create an
empty canvas. The value "n" signifies no plotting. The parameters axes =
FALSE, ann = FALSE are used to turn off axes and annotation.

We used the abline() function to add straight lines through the current plot.
The parameter h = lty_val is used to draw six horizontal lines, one for each
value of lty_val. The line width has been set by lwd = 2. The different line
types are specified by lty = lty_val.

We have used the function mtext() to draw the text in the margin. Please
note that abline() and mtext() are vectorized with respect to their
arguments.

In the following example, we draw auxiliary lines in a plot using the
function abline(). First, we create a plot of y against
time, t. We then show the mean value and the range (minimum and
maximum values) of y, along with the times at which the minimum and maximum
occur. We can draw these auxiliary lines easily by using different line types
and colors.

plot(t, y, lwd = 2, type = "l")
abline(h = mean(y), col = "red", lty = 2)
abline(h = range(y), col = "blue", lty = 3)
abline(v = t[c(which.min(y), which.max(y))], col = "brown",
       lty = 3)
title("Line plot with auxiliary lines")

Multi-period line plot

In a multi-period line plot, we mix different line types. An example is a time
series dataset in which the first period is historic data and the second period
is predictions.

In the example below, the first 40 observations of y represent the historic
data and the remaining points represent the predictions based on the historic
data. We use a solid line to plot the historic data and a dashed line
to plot the predictions. We first plot the data in the first period
and then use lines() to add a dashed line for the data in the second period.
Just as we used points() to add points to a scatter plot, we can use lines() to
add lines to a line plot.

p <- 40
plot(t[t <= p], y[t <= p], type = "l",
xlim = range(t), xlab = "t", ylab = "y")
lines(t[t >= p], y[t >= p], lty = 2)
title("Two period Line Plot")

Line plot with points

We can plot both the lines and points in the same chart. This can be done
easily by first plotting a line chart and then adding points() of the same data
to the plot.

plot(y, type = "l")
points(y, pch = 16)
title("Line plot with points")

Alternatively, we can first draw a scatter plot using the plot() function and
then add lines using the lines() function.

plot(y, pch = 16)
lines(y)
title("Line plot with points")

Multi-Series Chart with a Legend

In the following code, we generate two series, y1 and y2, and
create a chart of the two series with respect to time t.

t <- 1:30
y1 <- 1.5 * t + 6 * rnorm(30)
y2 <- 2.5 * sqrt(t) + 8 * rnorm(30)
plot(t, y1, type = "l", col = "black",
     ylim = range(y1, y2), ylab = "y1, y2")
points(y1, pch = 15)
lines(y2, col = "blue", lty = 2)
points(y2, col = "blue", pch = 16)
title("Plot of two series")
legend("topleft",
       legend = c("y1", "y2"),
       col = c("black", "blue"),
       lty = c(1, 2), pch = c(15, 16),
       cex = 0.8, x.intersp = 0.5, y.intersp = 0.8)

In the above example, we have added a legend() on the top left. It shows the
line and point styles of y1 and y2 respectively. We have also used cex to scale
the font sizes of the legend and x.intersp and y.intersp to make some
minor adjustments to the legend.

Bar charts

Bar charts are one of the most commonly used charts. We use bar charts
to visualize qualitative data by category frequency.

To plot a bar chart, we use the barplot() function instead of the plot()
function. The function draws either vertical or horizontal bars that are
separated by white space. Although we often display raw frequencies, we can
also use barplot() to visualize other quantities, such as means or proportions,
that are computed for each category.

The basic syntax to create a barplot in R is:

barplot(H, xlab, ylab, main, names.arg, col)

H: a vector or matrix containing numeric values
xlab: label for the x-axis
ylab: label for the y-axis
main: title of the bar chart
names.arg: vector of names appearing under each bar
col: color for the bars in the graph

Let’s plot a simple bar chart

barplot(1:10, names.arg = LETTERS[1:10])

If the numeric vector is a named vector, the names will automatically be used
as the labels on the x-axis. Hence, the following code gives the same result
as the previous code.

ints <- 1:10
names(ints) <- LETTERS[1:10]
barplot(ints)
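The earlier remark that barplot() can also show proportions or means can be tried on the built-in mtcars data. This is a sketch only. The grouping by number of cylinders and the names cyl_prop and mpg_mean are chosen for illustration.

```r
# Proportion of cars for each number of cylinders.
cyl_prop <- prop.table(table(mtcars$cyl))
barplot(cyl_prop, xlab = "Cylinders", ylab = "Proportion",
        main = "Share of cars by cylinder count")

# Mean miles per gallon for each number of cylinders.
mpg_mean <- tapply(mtcars$mpg, mtcars$cyl, mean)
barplot(mpg_mean, xlab = "Cylinders", ylab = "Mean mpg",
        main = "Mean mpg by cylinder count")
```

Both table() and tapply() return named results, so the bar labels are filled in automatically.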

Now we will draw a bar plot using the flights dataset in nycflights13. The
data table flights contains information about all 336,776 flights that departed
from NYC in 2013.

In this example, we will create a bar plot of the top eight carriers with the most
flights in the record. Before we can start using the dataset, we need to install
the package with install.packages("nycflights13").

data("flights", package = "nycflights13")
carriers <- table(flights$carrier)
carriers

   9E    AA    AS    B6    DL    EV    F9    FL    HA    MQ    OO    UA    US    VX    WN    YV
18460 32729   714 54635 48110 54173   685  3260   342 26397    32 58665 20536  5162 12275   601

In the previous code, we used table() to count the number of flights in
the record for each carrier. Now we sort the carriers in decreasing order.

carriers_sort <- sort(carriers, decreasing = TRUE)
carriers_sort

   UA    B6    EV    DL    AA    MQ    US    9E    WN    VX    FL    AS    F9    YV    HA    OO
58665 54635 54173 48110 32729 26397 20536 18460 12275  5162  3260   714   685   601   342    32

Now we can take the first 8 elements from the table and draw a bar plot:

barplot(head(carriers_sort, 8),
ylim = c(0, max(carriers_sort) * 1.1),
xlab = "Carrier", ylab = "Flights",
main ="Top 8 carriers ordered by number of flights")

Pie Charts

Pie charts are also useful for data analysis. We can use the pie()
function to create a pie chart. A pie chart represents values as
slices of a circle with different colors.

The basic syntax for plotting a pie chart in R is as follows.

pie(x, labels, radius, main, col, clockwise)

x: vector that contains the numeric values that are used in the pie chart
labels: descriptions of the slices
radius: radius of the circle of the pie chart (value between -1 and +1)
main: title of the chart
col: color palette for the slices
clockwise: logical value that indicates whether the slices are drawn clockwise
or anti-clockwise

The following code is an example of the pie() function.

grades <- c(A = 2, B = 10, C = 12, D = 8)
pie(grades, main = "Grades", radius = 1)
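The remaining arguments from the syntax above (labels, col, and clockwise) can be combined as in the following sketch. The percentage labels and the color names are illustrative choices.

```r
grades <- c(A = 2, B = 10, C = 12, D = 8)

# Build labels such as "A (6%)" from the share of each slice.
pct <- round(100 * grades / sum(grades))
slice_labels <- paste0(names(grades), " (", pct, "%)")

pie(grades, labels = slice_labels,
    col = c("tomato", "gold", "skyblue", "lightgreen"),
    clockwise = TRUE, main = "Grades")
```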

Histogram and density plots

We use a histogram to represent the frequencies of the values of a variable
bucketed into ranges. A histogram groups the values into continuous ranges.
Each bar in the histogram shows the number of observations that fall
in that range.

We can create a histogram using the hist() function. The function accepts a
vector as input along with some more parameters that control the plot.

The basic syntax for creating a histogram is as follows:

hist(v, main, xlab, xlim, ylim, breaks, col, border)

The parameters are described below:

v: a vector containing the numeric values that are used in the histogram
main: title of the chart
xlab: description of the x-axis
xlim: range of values on the x-axis
ylim: range of values on the y-axis
breaks: the breakpoints between the bars, or a suggested number of bars
col: color of the bars
border: border color of each bar
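The breaks parameter is the one that most changes the look of a histogram. The sketch below contrasts a suggested number of cells with explicit cell boundaries and inspects the object that hist() returns. The data, the seed, and the names edges, h_suggested, and h_fixed are assumptions for illustration.

```r
set.seed(42)  # arbitrary seed for reproducibility
v <- rnorm(500)

# breaks as a single number is a suggestion for the number of cells.
h_suggested <- hist(v, breaks = 20, plot = FALSE)

# breaks as a vector fixes the exact cell boundaries.
edges <- seq(floor(min(v)), ceiling(max(v)), by = 0.5)
h_fixed <- hist(v, breaks = edges, plot = FALSE)

# hist() returns the cell boundaries and the counts per cell.
h_fixed$breaks
h_fixed$counts
sum(h_fixed$counts) == length(v)

hist(v, breaks = edges, col = "lightgray", border = "white",
     main = "Histogram with fixed breaks", xlab = "v")
```

With plot = FALSE, hist() only computes the cells, which is handy for checking the binning before drawing.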

In the following example, we demonstrate how to use hist() to
plot a histogram of a normally distributed random numeric vector, and then
overlay the density function of the normal distribution.

random_norm <- rnorm(10000)
hist(random_norm)

We can overlay the curve of the probability density function of the standard
normal distribution by using the dnorm() function. To do so, the y-axis of the
histogram must represent probability density rather than counts, which we
request with probability = TRUE. We can then add the curve to the
histogram.

hist(random_norm, probability = TRUE, col = "lightgray",
     main = "Histogram - Normally Distributed Data")
curve(dnorm, add = TRUE, lwd = 2, col = "blue")

In this case, we have used the curve() function. We have used the parameter
add = TRUE to add the curve to the existing plot.
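When the data are not standard normal, the same overlay works if curve() is given an expression in x with the estimated mean and standard deviation. The following is a sketch on simulated data. The variable heights and its parameters are made up for illustration.

```r
set.seed(7)  # arbitrary seed for reproducibility
heights <- rnorm(5000, mean = 170, sd = 8)  # illustrative data

m <- mean(heights)
s <- sd(heights)

hist(heights, probability = TRUE, col = "lightgray",
     main = "Histogram with fitted normal density")
# curve() evaluates the expression over a grid of x values;
# the variable must be called x.
curve(dnorm(x, mean = m, sd = s), add = TRUE, lwd = 2, col = "blue")
```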

Now we can make a histogram of the speed of an aircraft from the
nycflights13 dataset. We can calculate the speed of an aircraft by dividing
the distance of the trip (distance) by the air time (air_time). Because
distance is recorded in miles and air_time in minutes, the result is in miles
per minute.

ft_speed <- flights$distance / flights$air_time
hist(ft_speed, xlab = "Flight Speed", main = "Flight speed - Histogram")

We observe that the distribution is different from a normal distribution. So, we
can use the density() function to estimate the empirical distribution of the
speed and plot a smooth probability density curve. We also add a
vertical line to indicate the global average of all the observations.

plot(density(ft_speed, from = 2, to = 10, na.rm = TRUE),
     main = "Empirical distribution of flight speed",
     xlab = "Flight Speed")
abline(v = mean(ft_speed, na.rm = TRUE),
       col = "blue", lty = 2)
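The object returned by density() can also be inspected directly. The sketch below uses the built-in faithful dataset as a stand-in for the flight data, so it runs without nycflights13. The names d and area are illustrative.

```r
# faithful ships with R; eruptions is a clearly non-normal variable.
d <- density(faithful$eruptions)

# The result holds the grid of x values and the estimated density y.
length(d$x)
d$bw          # bandwidth chosen by the default rule

# The area under the estimated curve is close to 1 (trapezoidal rule).
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area

plot(d, main = "Density of eruption times", xlab = "Eruption time (min)")
abline(v = mean(faithful$eruptions), col = "blue", lty = 2)
```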

We can combine both plots to get a better understanding of the data.

hist(ft_speed,
probability = TRUE, ylim = c(0, 0.5),
main ="Histogram & distribution of flight speed",
xlab = "Flight Speed",
border ="gray", col = "lightgray")
lines(density(ft_speed, from = 2, na.rm = TRUE),
col ="darkgray", lwd = 2)
abline(v = mean(ft_speed, na.rm = TRUE),
col ="blue", lty =2)

Boxplot

A boxplot is used to visualize the distribution of the data in a data set. The
boxplot represents the minimum, maximum, median, first quartile, and
third quartile of the data set. You can compare the distribution of data across
data sets by drawing a boxplot for each one of them.

The basic syntax to create a box plot is as follows:

boxplot(x, data, notch, varwidth, names, main)

x: vector or a formula
data: data frame
notch: logical value. Draws a notch if set to TRUE
varwidth: logical value. If TRUE, the width of each box is proportional to the
square root of the sample size
names: group labels that are printed under each boxplot
main: title of the graph

We can plot a simple box plot as follows

x <- rnorm(1000)
boxplot(x)

We can draw a box plot of the flight speed for each carrier. In this example, we
have 16 boxplots in one chart, which helps us compare the distribution across
carriers. We have used the formula distance / air_time ~ carrier to
indicate that the x-axis denotes the carrier and the y-axis denotes the flight
speed (distance / air_time).

boxplot(distance / air_time ~ carrier, data = flights,
        main = "Box plot - flight speed by carrier")
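The same formula interface can be tried on the built-in mtcars data without installing any package. The sketch below also looks at the statistics that boxplot() returns invisibly. The grouping by cylinders and the name bp are chosen for illustration.

```r
# One box per cylinder count, using the same formula interface.
bp <- boxplot(mpg ~ cyl, data = mtcars, varwidth = TRUE,
              xlab = "Cylinders", ylab = "Miles per gallon",
              main = "Box plot - mpg by cylinders")

# boxplot() invisibly returns what it draws: each column of bp$stats holds
# the lower whisker, lower hinge, median, upper hinge, and upper whisker.
bp$stats
bp$n       # group sizes

# The median row matches medians computed directly.
tapply(mtcars$mpg, mtcars$cyl, median)
```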

8. Analysing Data

Linear model

The linear model is one of the simplest models in R. In a linear model, we use
a linear function to describe the relationship between two random variables. In
the following example, we first generate a normally distributed random numeric
vector x. Then we define a function f(x) = 3 + 2 * x. Finally, we generate y
by adding some independent noise to f(x).

x <- rnorm(100)
f <- function(x) 3 + 2 * x
y <- f(x) + 0.5 * rnorm(100)

Let’s assume that we do not know the underlying relationship between x and
y. We can use a linear model to explore the relationship between the two
variables. To do so, we need to estimate the coefficients of the linear function.

In the following code, we use lm() to fit x and y with a simple linear model.
The formula y ~ x denotes the linear regression
of the dependent variable y on the independent variable x.

linear_model <- lm(y ~ x)
linear_model

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
3.036 1.995

The fitted coefficients are 3.036 (intercept) and 1.995 (slope), which are
close to the true coefficients 3 (intercept) and 2 (slope).
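
For a simple linear regression with one input variable, the least-squares coefficients have a closed form: the slope is cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). The following sketch checks this against lm(). The seed is an assumption added for reproducibility, so the numbers will differ slightly from the output shown above.

```r
# Assumption: a fixed seed, so the sketch gives the same result on every run.
set.seed(42)
x <- rnorm(100)
y <- 3 + 2 * x + 0.5 * rnorm(100)

# Closed-form least-squares estimates for one input variable.
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)
manual_coefs <- c(intercept, slope)

# The same estimates obtained from lm().
fitted_coefs <- coef(lm(y ~ x))

print(manual_coefs)
print(fitted_coefs)
print(all.equal(unname(fitted_coefs), manual_coefs))
```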

If we want to access the coefficients of the model, we can use the following
code.
coef(linear_model)
(Intercept) x
3.036328 1.994926

The linear_model object is essentially a list, so we can also access the coefficients with the $ operator.

> linear_model$coefficients
(Intercept) x
3.036328 1.994926

We can also use summary(linear_model) to access the statistical
properties of the linear model.

> summary(linear_model)

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-1.39226 -0.31731 0.01711 0.28940 1.26922

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.03633 0.04980 60.97 <2e-16 ***
x 1.99493 0.05589 35.69 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.497 on 98 degrees of freedom
Multiple R-squared: 0.9286, Adjusted R-squared: 0.9278
F-statistic: 1274 on 1 and 98 DF, p-value: < 2.2e-16

You can refer to the Machine Learning course to know more about the
interpretation of the summary.
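
As a brief illustration of where some of these numbers come from, the sketch below recomputes the residual standard error, the multiple R-squared and the adjusted R-squared by hand and compares them with the values stored in the summary object. The seed and the variable names are assumptions for illustration only.

```r
# Assumption: a fixed seed; the values differ slightly from the summary above.
set.seed(42)
x <- rnorm(100)
y <- 3 + 2 * x + 0.5 * rnorm(100)
model <- lm(y ~ x)
model_summary <- summary(model)

rss <- sum(residuals(model)^2)      # residual sum of squares
tss <- sum((y - mean(y))^2)         # total sum of squares
n <- length(y)                      # number of observations
p <- 1                              # number of input variables

r_squared <- 1 - rss / tss
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
residual_se <- sqrt(rss / df.residual(model))

# Each comparison should print TRUE.
print(all.equal(r_squared, model_summary$r.squared))
print(all.equal(adj_r_squared, model_summary$adj.r.squared))
print(all.equal(residual_se, model_summary$sigma))
```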

You can plot the data and the regression line using the following code.

plot(x, y, main = "A simple linear regression")
abline(coef(linear_model), col = "blue")

Now we can call the predict() function to make predictions using the fitted
model. We can predict y with standard errors when x = -1 and x = 0.5
using the following code.

> predict(linear_model, list(x = c(-1, 0.5)), se.fit = TRUE)


$fit
1 2
1.041402 4.033791

$se.fit
1 2
0.0772407 0.0554981

$df
[1] 98

$residual.scale
[1] 0.4969659
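
The fitted values are simply the intercept plus the slope times the new x, and the standard errors can be turned into confidence intervals for the mean of y. The sketch below shows both, and checks the manual interval against predict() with interval = "confidence". The seed and the names new_x, lower and upper are assumptions for illustration.

```r
# Assumption: a fixed seed; the values differ slightly from the output above.
set.seed(42)
x <- rnorm(100)
y <- 3 + 2 * x + 0.5 * rnorm(100)
model <- lm(y ~ x)

new_x <- c(-1, 0.5)
pred <- predict(model, newdata = data.frame(x = new_x), se.fit = TRUE)

# The fitted values computed by hand from the coefficients.
manual_fit <- unname(coef(model)[1] + coef(model)[2] * new_x)

# 95% confidence interval for the mean response: fit +/- t * se.fit.
t_crit <- qt(0.975, df = pred$df)
lower <- pred$fit - t_crit * pred$se.fit
upper <- pred$fit + t_crit * pred$se.fit

# predict() can compute the same interval directly.
ci <- predict(model, newdata = data.frame(x = new_x),
              interval = "confidence", level = 0.95)
print(ci)
```

The columns lwr and upr of ci should match lower and upper computed by hand.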

Now we can look at the real-world data set nycflights13. We can analyse
the air time of a flight using linear models with different sets of input
variables. We start with distance because it is the variable we expect to
influence air time the most.

To start with, after loading the data set, we make a scatter plot of
air_time against distance. Since the data set contains a large number of
records, we use pch = "." so that each point is drawn as a very small dot.

data("flights", package = "nycflights13")
plot(air_time ~ distance, data = flights,
pch = ".",
main = "Plot - Flight Speed",
ylab = "Air Time",
xlab = "Distance")

The plot suggests that there is a positive correlation between the two
variables. Hence, we can use a linear model to fit the data.

lm_model <- lm(air_time ~ distance, data = flights)

> summary(lm_model)

Call:
lm(formula = air_time ~ distance, data = flights)

Residuals:
Min 1Q Median 3Q Max
-82.397 -7.334 -1.320 6.513 145.389

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.847e+01 3.888e-02 474.9 <2e-16 ***
distance 1.261e-01 3.036e-05 4154.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.78 on 327344 degrees of freedom
(9430 observations deleted due to missingness)
Multiple R-squared: 0.9814, Adjusted R-squared: 0.9814
F-statistic: 1.726e+07 on 1 and 327344 DF, p-value: < 2.2e-16

The value of Adjusted R-squared is 0.9814, which means that distance alone
explains about 98% of the variation in air time. Hence, we can say that the
model fits the data well.
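
Two details of this summary can be checked on a small example: lm() silently drops rows with missing values (the "observations deleted due to missingness" line), and for a model with a single input variable the multiple R-squared equals the squared correlation between the two variables. The sketch below uses a simulated data frame, toy, as a stand-in for flights; the coefficients used to simulate it only loosely mimic the fitted model above and are assumptions.

```r
# Assumption: toy is simulated data, not the real flights data set.
set.seed(7)
toy <- data.frame(distance = runif(200, min = 100, max = 2500))
toy$air_time <- 18 + 0.126 * toy$distance + rnorm(200, sd = 12)

# Introduce 15 missing air times, as happens for some real flights.
toy$air_time[sample(200, 15)] <- NA

toy_model <- lm(air_time ~ distance, data = toy)

# Rows with NA are dropped by default, so only 185 observations are used.
print(nobs(toy_model))

# With one input variable, R-squared is the squared correlation.
r <- cor(toy$distance, toy$air_time, use = "complete.obs")
r_squared <- summary(toy_model)$r.squared
print(all.equal(r^2, r_squared))
```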

Now we can plot the regression line.

plot(air_time ~ distance, data = flights,
    pch = ".",
    main = "Regression Plot - Flight Speed",
    ylab = "Air Time",
    xlab = "Distance")
abline(coef(lm_model), col = "blue")

Decision Tree

A decision tree is a graphical representation of choices and their results.
We use decision trees for tasks such as predicting whether an email is spam
or not, whether a tumour is cancerous or not, or whether a loan is good or
bad.

In R, we can use the party package to create decision trees.

First of all, we need to install the party package by executing the following
command in the R console.

install.packages("party")

We use the ctree() function to create and analyse the decision tree. The
basic syntax for creating a decision tree using the ctree() function is given below:

ctree(formula, data)

In this example, we will use the readingSkills data set, which is included in
the party package, to build a decision tree. The data set records the reading
skills score of several individuals, along with their age and shoe size. We
will try to predict whether an individual is a native speaker or not based on
age, shoeSize, and score.

Before we fit the data to the decision tree model, let’s review the data.

library(party)
print(head(readingSkills))

nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124

Let’s create the decision tree using the ctree() function, fitting it on the first 105 rows of the data set.

input_data <- readingSkills[c(1:105),]

decision_tree <- ctree(
    nativeSpeaker ~ age + shoeSize + score,
    data = input_data)

# Plot the tree.
plot(decision_tree)

From the decision tree, we can conclude, for example, that a person whose
reading skills score is less than 38.306 and who is older than 6 years is
classified as not a native speaker.
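
Since the tree is fitted on the first 105 rows only, the remaining rows can be used to check how well it predicts. The following is a minimal sketch, assuming the party package is installed. The names train_data, test_data, confusion and accuracy are our own and are not part of the package.

```r
library(party)

# Fit on the first 105 rows and keep the remaining rows for testing.
train_data <- readingSkills[1:105, ]
test_data <- readingSkills[-(1:105), ]

tree <- ctree(nativeSpeaker ~ age + shoeSize + score, data = train_data)

# Predict the class (yes/no) for the rows the tree has not seen.
predicted <- predict(tree, newdata = test_data)

# Cross-tabulate the predictions against the actual values.
confusion <- table(predicted = predicted, actual = test_data$nativeSpeaker)
print(confusion)

# Share of test rows that are classified correctly.
accuracy <- mean(predicted == test_data$nativeSpeaker)
print(accuracy)
```

The diagonal of the confusion table counts the correct predictions, and the off-diagonal cells count the two kinds of errors.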
