R Programming
Vishal Jain
Samatrix Consulting Pvt Ltd
This is a confidential publication. All rights reserved. This document may not, in a whole
or in part, be copied, reproduced, translated, photocopied, or reduced to any medium
without prior and express written consent from Samatrix Consulting Pvt Ltd.
Table of Contents
1. GETTING STARTED WITH R
INTRODUCTION TO R
R as a programming language
R as a computing environment
THE NEED FOR R
INSTALLING R
RSTUDIO
RSTUDIO’S USER INTERFACE
The console
The editor
The environment pane
The history pane
The file pane
The plots pane
The package pane
The help pane
The viewer pane
2. R WORKSPACE
R’S WORKING DIRECTORY
CREATE R PROJECT IN RSTUDIO
ABSOLUTE AND RELATIVE PATH
MANAGING THE PROJECT FILES
INSPECTING AN ENVIRONMENT
INSPECTING EXISTING SYMBOLS
VIEW THE STRUCTURE OF OBJECT
REMOVING SYMBOLS
MODIFYING GLOBAL OPTIONS
Modifying the number of digits to print
Modifying the warning level
MANAGING THE LIBRARY OF PACKAGES
Getting to know a package
Installing package from CRAN
Update package from CRAN
INSTALL PACKAGE FROM ONLINE REPOSITORIES
PACKAGE FUNCTIONS
Masking and name conflicts
3. BASIC OBJECTS
VECTOR
Numeric Vector
Logical vector
Character Vector
Subsetting Vectors
Named Vector
EXERCISE
EXTRACTING ELEMENT
CLASS OF THE VECTOR
Converting Vectors
ARITHMETIC OPERATORS
MATRIX
Naming Rows and Columns
Subsetting a Matrix
Matrix Operators
ARRAYS
Subsetting an array
LIST
Subset a list
Named Lists
Setting Values
Other List Operations
DATA FRAME
Create a Data Frame
Naming rows and columns
Subset of a Data frame
Subset of a data frame as a list
Subset a data frame as matrix
Filtering Data
Setting Values as a list
Factors
Useful functions for Data Frame
Loading and Writing data on the disk
FUNCTIONS
Creating a function
Calling a function
Dynamic Typing
Generalizing a function
Default value for function argument
4. BASIC EXPRESSIONS
ASSIGNMENT EXPRESSIONS
Using backticks
CONDITIONAL EXPRESSIONS
Using if as a statement
Using if as an expression
Using if with vector
Vectorized if: ifelse
USING SWITCH FUNCTION
LOOP EXPRESSIONS
For loop
Managing the flow of a for loop
Creating nested for loop
While Loop
5. WORKING WITH BASIC OBJECTS
OBJECT FUNCTIONS
Testing object types
Accessing Object Classes and Types
Getting data dimensions
Reshaping Data Structures
Iterating over one dimension
USING LOGICAL FUNCTION
Logical operators
Logical functions
Which elements are TRUE
Dealing with missing values
Logical Coercion
MATH FUNCTIONS
Number rounding functions
TRIGONOMETRIC FUNCTIONS
HYPERBOLIC FUNCTION
EXTREME FUNCTIONS
FINDING ROOTS
DERIVATIVES
INTEGRATION
USING STATISTICAL FUNCTION
Sampling from a vector
PROBABILITY DISTRIBUTIONS
SUMMARY STATISTICS
COVARIANCE AND CORRELATION MATRIX
6. WORKING WITH STRINGS
STRINGS AND CHARACTER VECTORS
Printing Strings
TRANSFORMING TEXT
Changing case
Counting characters
Trimming leading and trailing whitespace
Substring
Splitting Texts
Formatting Text
Parsing text as date/time
Formatting date/time to strings
USING REGULAR EXPRESSIONS
Finding a string pattern
Using group to extract data
7. WORKING WITH DATA
READING AND WRITING DATA
Reading and writing data to text format file
Importing data via RStudio
Importing data using built-in Functions
Importing data using the readr package
Reading and writing Excel worksheets
Reading and writing native data files
Loading built-in datasets
VISUALIZING THE DATA
Creating scatter plots
Customize Chart Elements
Customize point style
Customizing the point colors
Creating line plots
Line Type and Width
Multi-period line plot
Line plot with points
Multi-Series Chart with a Legend
Bar charts
Pie Charts
Histogram and density plots
Boxplot
8. ANALYSING DATA
LINEAR MODEL
DECISION TREE
1. Getting Started with R
Data analysis needs proper tools. Extracting patterns directly from a
large set of numbers arranged in rows and columns is almost
impossible. To work with data, we need tools such as R to boost
productivity.
Introduction to R
R as a programming language
The R programming language has been evolving and developing over the last 20
years. Its goal is to be both easy to use and flexible, so that complex
statistical computing, data exploration, and visualization operations can be
performed.
Ease of use and flexibility are conflicting goals. A tool can let you
finish a variety of statistical analysis tasks by clicking a few buttons,
but it won't be flexible when you need customization or automation, or when
your work needs to be reproducible. On the other hand, a programming language
can be flexible enough to transform data and make complicated graphs, but it
may not be easy to learn. R is known for striking a good balance between the two.
R as a computing environment
Free of charge: R is available free of charge. You need not buy a license to
install it or to use it commercially.
Open-source: R is open source. Thousands of developers around the globe
work constantly to add new packages, review the source code, and fix bugs.
Because the source code is available, you can dig into it to fix a bug or
improve the functionality of a package.
Rich online resources: R is known for its huge, rapidly increasing number of
online resources. There are more than 7,500 packages available on CRAN (short
for Comprehensive R Archive Network), a worldwide network of mirror servers
from which you can get identical, up-to-date R distributions and packages.
Strong community: The R community consists not only of R developers but
also, in the majority, of R users from a wide range of backgrounds such as
statistics, econometrics, finance, bioinformatics, mechanical engineering,
physics, medicine, and so on.
A great number of R developers actively contribute to open source projects or
packages written in R. The goal of the community is to make data analysis,
exploration, and visualization easier and more interesting.
Installing R
If you are a Windows user, you can download an installer for the latest version
and run it to install R. Even though the installation process is easy, many
users face issues during the installation.
The next step is to select additional tasks. Keep the default options.
The installer now starts copying the files to your hard drive.
R has now been installed on your system. You can use R either at the
command prompt or in the R GUI.
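As a quick check that the installation worked, commands such as the following can be typed at the R prompt. This is only a minimal sketch: the version string and installation folder it prints depend on the release and location you chose during setup.

```r
# Print the version of R that is currently running
print(R.version.string)

# Show the folder where R is installed on this machine
print(R.home())

# Confirm that R evaluates a simple expression
result <- 1 + 1
print(result)
```

If all three lines print without an error, the R engine is installed and working.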
Even though you can start using R directly, we recommend RStudio for editing
and debugging R scripts. R is the back end and RStudio is the front end.
RStudio
RStudio's features include a graphics viewport, package management, an
integrated help viewer, code formatting, version control, interactive
debugging, and many more.
Once you complete the installation of RStudio and start it, you see the
RStudio user interface.
The screenshot of the RStudio user interface for the Windows operating
system is given below. The main window consists of several parts, each
known as a pane. Each pane performs a different function. The panes have
been designed to help data analysts work with data.
The console
The R console is embedded in RStudio. It works like a command prompt or
terminal. RStudio submits the commands that you type at the console to the
R engine, which is responsible for executing them. In other words, RStudio
passes the user's input to the R engine and presents the results
back to the user.
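To illustrate, the lines below are the kind of commands one might type at the console one at a time. The vector name scores and its values are made up for this sketch.

```r
# Assign a numeric vector to the symbol `scores` (an illustrative name)
scores <- c(72, 85, 90)

# Each expression is sent to the R engine, which returns the result
total <- sum(scores)
average <- mean(scores)
print(total)
print(average)
```

At the interactive console, typing an expression such as sum(scores) on its own also prints the result, so the explicit print() calls are only needed when the same lines are run as a script.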
The editor
While working with data, we not only type commands at the console but also
write scripts, sets of commands that represent a logic flow, in the editor. The
editor is useful for editing R scripts, markdown documents, web pages, and
many other types of files, including configuration files.
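A script of this kind might look like the sketch below. The file name summary_script.R, the variable names, and the data are hypothetical and serve only to show a small logic flow.

```r
# Contents of a hypothetical script, e.g. saved as "summary_script.R"
heights_cm <- c(158, 172, 165, 181)   # illustrative data

# Step 1: convert centimetres to metres
heights_m <- heights_cm / 100

# Step 2: branch on a condition
if (mean(heights_m) > 1.7) {
  message_text <- "Average height is above 1.7 m"
} else {
  message_text <- "Average height is 1.7 m or below"
}

# Step 3: report the outcome
print(message_text)
```

Once saved, such a script can be run in one go from the console with source("summary_script.R"), assuming the file is in the current working directory.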
The code editor is more advanced than a plain text editor. It supports
advanced functionality such as syntax highlighting, autocompletion of R
code, and debugging with breakpoints. You may also use the following
shortcut keys:
Breakpoint - You can click in the margin to the left of a line number to set a
breakpoint. When you execute the script, the program pauses at this line
and waits for you to debug.
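Editor breakpoints are set with the mouse, but base R also provides browser(), which pauses execution from inside the code itself. The sketch below uses it as a stand-in for a breakpoint. It is a different mechanism from the editor breakpoint described above, and the function scale_values is hypothetical.

```r
# Hypothetical function used only to illustrate pausing execution
scale_values <- function(values, factor = 2) {
  # Pause here only in an interactive session, so that batch runs are not blocked
  if (interactive()) browser()
  values * factor
}

scaled <- scale_values(c(1, 2, 3))
print(scaled)
```

While paused in an interactive session, you can inspect values and factor at the prompt, then type c to continue or Q to quit the debugger.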
The environment pane
The environment pane shows the variables and functions that have been
created and that are available for repeated use. By default, it shows the
variables in the global environment, which is the user workspace where you are
working.
Whenever you create a new object, a new entry appears in the
environment pane. You can see the variable name and a short description of
its value. When you change the value of a symbol, the change is reflected in
the environment pane.
The history pane lists the expressions previously evaluated in the console.
You can also repeat a task that was performed previously by pressing the up
arrow key in the console.
The file pane
In the file pane, you can see the files in a folder, navigate between
folders, create new folders, and delete or rename folders and files. When
you work on an RStudio project, you can view and organize the project files
in the file pane.
You can use the plots pane to see the graphics produced by R code. If there is
more than one plot, previous plots are stored. You can view all the plots by
navigating back and forth.
The package pane
You can view all the installed packages in the package pane. From this pane
you can install or update a package from CRAN, or remove an existing package
from your library.
The help pane
• Type the function name in the Search box and find it directly
• Type the function name in the console and press F1
• Type ? before the function name and execute it
In practice, you don't have to remember all of R's functions; you only need to
remember how to get help with a function you are not familiar with.
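As a minimal sketch of looking up help from code, the following calls use mean() and sd() purely as example functions; any function name could be substituted.

```r
# help("mean") locates the same documentation page that ?mean opens.
# Printing h (or typing ?mean at the console) displays the page.
h <- help("mean")
class(h)

# Show the argument list of a function without opening its help page.
args(sd)
```

In RStudio, displaying the result shows the page in the help pane.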
2. R Workspace
This chapter will cover some basic yet important skills that are required to
manage the R Workspace
When you use a relative path, the file location does not change; only the
notation becomes shorter. Relative paths also make scripts more portable.
With absolute paths, anyone who runs the code on another machine has to
modify it to match the location of the data on their hard drive. If you have
used relative paths and the data is stored in the same relative location,
there is no need to modify the code.
You can check the current working directory of the running R session by
calling getwd() at the R console. By default, a new R session starts in your
user directory. RStudio runs the R session in the background, starting from
the user documents directory.
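The relationship between the working directory and a relative path can be sketched as follows. The file data/household.csv is a hypothetical example and does not need to exist for this code to run.

```r
# Current working directory of this R session.
wd <- getwd()

# A relative path; file.path() joins the parts with "/" on every platform.
# "data/household.csv" is a hypothetical file used only for illustration.
rel_path <- file.path("data", "household.csv")
rel_path

# The absolute location that the relative path refers to from this session.
abs_path <- file.path(wd, rel_path)
abs_path
```

Only abs_path changes from machine to machine; a script that uses rel_path can stay the same.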
In RStudio, you can choose a directory and create an R project. Whenever you
open the project, the location of the project becomes the working directory. It
improves the portability of the project by accessing the files using the relative
paths.
In order to create a new project, you can go to File | New Project or click the
Project drop-down menu in the top-right corner of the main window and
choose New Project. A window will appear, and you can create a new directory
or choose an existing directory on your hard drive as the project directory:
You have to choose a local directory. The project will be created in this
directory. An R project is represented by an .Rproj file, which stores some
project settings. Once you open the .Rproj file, the setting values stored in
the file are applied. As a result, the working directory is set to the
directory in which the project file is located.
When you use RStudio to work in a project, auto-completion makes writing file
paths much more efficient. If you type a string of either an absolute or relative
file path and press Tab, RStudio will list the files in that directory:
Absolute and relative path
> getwd()
[1] "D:/Workspaces/R-programming"
In this example, notice that the path of the working directory uses /
instead of \. In Windows, \ is the default path separator. In R, however,
the symbol \ is used to write special characters. For example, while creating a
character vector, you can use \n to represent a new line
> "Hello\nWorld"
[1] "Hello\nWorld"
In this example, the special character has been preserved when the character
vector is directly printed. That is why you do not see the effect of the newline
character in the previous example.
If you want the special characters to translate to the character they represent,
you can use cat()
> cat("Hello\nWorld")
Hello
World
In the example above, the second word starts with a new line (\n). If you want
to write \ itself, you can use \\:
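A minimal sketch of writing a literal backslash, reusing the working directory path shown earlier as the example string:

```r
# Two backslashes in the source code produce one backslash in the string.
p <- "D:\\Workspaces\\R-programming"

# cat() shows the string as it is actually stored.
cat(p, "\n")

# The escaped backslash counts as a single character.
nchar("\\")
```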
When specifying a path in Windows, you should use \\ or /. Both options
are supported. In Unix-like operating systems such as macOS and Linux, the
path separator is always /, so there is no ambiguity. In Windows, a single \
is treated as the start of an escape sequence, so if you use it you will
usually get an error
In most cases, Windows users can use /. This helps the same code run on all
the major operating systems
To set the working directory of the current R session, you can use
setwd(). This is not a recommended practice because it points all the relative
paths in the script to another directory, which can easily break the script.
That is why it is good practice to create an R project to start your work.
When we create a project in RStudio, a .Rproj file is also created. This file is
created in the project directory. Initially there is no other file.
For a typical R project, there would be many R scripts for statistical computing
and programming tasks, data files (such as .csv files), documents (such as
markdown files), and output graphics.
If all the files are mixed up in the project directory, managing the files would
be difficult. So it is recommended to create subdirectories to contain different
types of files for different tasks.
Here is an example of a flat directory structure with all the files in the
same folder
project/
- household.csv
- population.csv
- national-income.png
- population-density.png
- utils.R
- import-data.R
- check-data.R
- plot.R
- README.md
- NOTES.md
The same files organized into subdirectories by type:
project/
- data/
  - household.csv
  - population.csv
- graphics/
  - national-income.png
  - population-density.png
- R/
  - utils.R
  - import-data.R
  - check-data.R
  - plot.R
- README.md
- NOTES.md
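The subdirectories can also be created from R. The sketch below builds the layout under a temporary directory, which is an assumption made so that the example does not touch a real project; in practice the project directory itself would be used.

```r
# Use a temporary location so that nothing in a real project is modified.
project_dir <- file.path(tempdir(), "project")

# Create one subdirectory per type of file.
for (sub in c("data", "graphics", "R")) {
  dir.create(file.path(project_dir, sub), recursive = TRUE, showWarnings = FALSE)
}

# List what was created.
list.files(project_dir)
```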
Inspecting an environment
When you start a fresh R session, the global environment is empty. No object
has been defined in this environment. If you run a command x <- c(1,2,3),
the numeric vector c(1,2,3) is bound to the symbol x in the global
environment.
Inspecting existing symbols
The most useful function for inspecting the collection of objects that we are
working with is objects(). This function returns a character vector of the
names of the existing objects in the current environment
> objects()
character(0)
The Environment pane shows all the symbols and their values in a compact
form. You can view the vectors inside a list or a data frame by expanding them.
The Environment pane has two views: List and Grid. The grid view shows not
only the names, types, and value structures of existing objects, but also
their object sizes:
The str() function shows the type of a vector, its index range, and a preview of its values:
> x
[1] 1 2 3
> str(x)
num [1:3] 1 2 3
If a vector is long, str() shows only the first few elements (the first 10 for an integer vector by default):
> str(1:40)
int [1:40] 1 2 3 4 5 6 7 8 9 10 ...
A list can be directly evaluated in the console. You can evaluate the list using
print() as well
> z
$m
[1] 1 2 3 4 5
$n
[1] "x" "y" "z"
You can also use str() to show its type, length, and the structure preview of
the elements
> str(z)
List of 2
$ m: int [1:5] 1 2 3 4 5
$ n: chr [1:3] "x" "y" "z"
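The list z used above is not defined in the text. A definition consistent with the printed output, given here as an assumption, is:

```r
# Assumed definition of z, reconstructed from the output of print() and str().
z <- list(m = 1:5, n = c("x", "y", "z"))
str(z)
```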
We can also print a nested list directly. The output shows all its elements
and tells us how we can access them. However, it is long and unnecessary in
most cases
> nest_list
$d
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
$e
$e[[1]]
[1] "a"
$e[[2]]
[1] 1 2 3
$f
$f$x
[1] 1 2 3 4 5 6 7 8 9 10
$f$y
[1] "g" "h"
$g
$g$x
[1] 0 1 2 3 4 5 6 7 8 9 10 11
$g$y
[1] "i" "j"
> str(nest_list)
List of 4
$ d: int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
$ e:List of 2
..$ : chr "a"
..$ : num [1:3] 1 2 3
$ f:List of 2
..$ x: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ y: chr [1:2] "g" "h"
$ g:List of 2
..$ x: int [1:12] 0 1 2 3 4 5 6 7 8 9 ...
..$ y: chr [1:2] "i" "j"
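The nested list is likewise not defined in the text. The following definition is an assumption reconstructed from the str() output above; the max.level argument of str() limits how deep the preview goes.

```r
# Assumed definition of nest_list, reconstructed from the str() output.
nest_list <- list(
  d = 1:15,
  e = list("a", c(1, 2, 3)),
  f = list(x = 1:10, y = c("g", "h")),
  g = list(x = 0:11, y = c("i", "j"))
)

# Full structure preview.
str(nest_list)

# Preview of the top level only.
str(nest_list, max.level = 1)
```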
You can use str() to show the structure of an object. You can use ls.str()
to show the structure of the current environment
> ls.str()
nest_list : List of 4
$ d: int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
$ e:List of 2
$ f:List of 2
$ g:List of 2
x : num [1:3] 1 2 3
y : chr [1:3] "a" "b" "c"
z : List of 2
$ m: int [1:5] 1 2 3 4 5
$ n: chr [1:3] "x" "y" "z"
You can apply filters to ls.str(). One of the filters is the mode argument.
For example, if you use ls.str(mode="list"), you can view the structure of
all the list objects
> ls.str(mode="list")
nest_list : List of 4
$ d: int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
$ e:List of 2
$ f:List of 2
$ g:List of 2
z : List of 2
$ m: int [1:5] 1 2 3 4 5
$ n: chr [1:3] "x" "y" "z"
The other filter is the pattern argument, which specifies the pattern of the
names to match. The pattern is expressed in a regular expression. If you want
to show the structures of all variables whose names contain only one
character, you can run the following command:
If you want to show the structures of all list objects whose names contain only
one character, you can use both pattern and mode at the same time:
If you're put off by patterns such as ^\\w$, don't worry. This pattern
matches all strings of the form (string begin)(any one word character like a,
b, c)(string end). We shall cover regular expressions in detail in later units.
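The two filtered calls described above can be sketched as follows. The objects are recreated here so that the example is self-contained; nest_list is a simplified stand-in, and all four definitions are assumptions based on the earlier output.

```r
# Objects assumed to exist, matching the names shown in the earlier output.
x <- c(1, 2, 3)
y <- c("a", "b", "c")
z <- list(m = 1:5, n = c("x", "y", "z"))
nest_list <- list(d = 1:15)  # simplified stand-in

# Structures of all objects whose names consist of exactly one word character.
ls.str(pattern = "^\\w$")

# The same, restricted to list objects.
ls.str(pattern = "^\\w$", mode = "list")
```

The first call should report x, y, and z but not nest_list; the second should report only z.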
Removing Symbols
We can remove symbols from the environment using the remove() function or
its shorter form rm().
> ls()
[1] "nest_list" "x" "y" "z"
> rm(x)
> ls()
[1] "nest_list" "y" "z"
> rm(y, z)
> ls()
[1] "nest_list"
If the symbol to be removed does not exist in the environment, a warning will
appear:
> rm(x)
Warning message:
In rm(x) : object 'x' not found
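One way to avoid this warning is to check that the symbol exists before removing it. The symbol names in this sketch are hypothetical.

```r
# Hypothetical symbols created only for this example.
a <- 1
b <- 2
c_val <- 3

# Remove several symbols in one call.
rm(a, b)

# Remove by name, and only if the symbol exists, so no warning is raised.
if (exists("c_val", inherits = FALSE)) rm(list = "c_val")
if (exists("not_defined", inherits = FALSE)) rm(list = "not_defined")

exists("c_val", inherits = FALSE)
```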
Modifying the number of digits to print
In RStudio, when you type getOption(<Tab>), you can see a list of available
options and their descriptions. A commonly used option is the number of digits
to display. In an R session, the number of digits printed on screen is entirely
managed by digits. We can call getOption() to see the current value of digits
and call options() to set digits to a larger number:
> 1234567.1234567
[1] 1234567
> 123.12345678
[1] 123.1235
In the second example above, the 11-digit number is shown with only 7
significant digits. The last few decimal digits are not displayed, but they
are not lost; the printer only displays the number with 7 digits. To verify
that no precision is lost because of digits = 7, see the output of the
following code:
> 0.1000002
[1] 0.1000002
> 0.10000002
[1] 0.1
> 0.10000002 -0.1
[1] 2e-08
If numbers were rounded to the seventh decimal place by default, then
0.10000002 would be rounded to 0.1 and the third expression would result
in 0. This does not happen, because digits = 7 only controls the number of
significant digits displayed; it does not round the stored value.
However, in some cases, the number before the decimal point can be large,
and we don't want to ignore digits following the decimal point. Without
modifying digits, the following number will only display the integer part:
> 1234567.12345678
[1] 1234567
If we want to see more digits printed, we need to increase digits from the
default value 7 to a higher number:
> getOption("digits")
[1] 7
> 1e10 + 0.5
[1] 1e+10
> options(digits=15)
> 1e10 + 0.5
[1] 10000000000.5
Note: if we call the options() function, the modified values are effective
immediately. They may affect the behaviour of all the subsequent commands.
In order to reset the options, we can use
> options(digits=7)
> 1e10 + 0.5
[1] 1e+10
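Instead of hard-coding the value 7 when resetting, the previous setting can be saved and restored. This sketch relies on the fact that options() returns the old values of the options it changes; the variable names are illustrative.

```r
# Remember the value in effect before any change.
before <- getOption("digits")

# options() returns the previous values of the options it modifies.
old <- options(digits = 15)
print(1e10 + 0.5)

# Restore whatever was in effect before.
options(old)
getOption("digits")
```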
We can manage the warning level by specifying the value of the warn option
> getOption("warn")
[1] 0
> as.numeric("Program")
[1] NA
Warning message:
NAs introduced by coercion
We can execute the same code without warning and get a missing value from
unsuccessful conversion
> options(warn=-1)
> as.numeric("Program")
[1] NA
Now there is no warning. However, it is not a good idea to always suppress
warning messages, because doing so also hides potential errors. You would
only find out about them from a wrong final result, and would then have to
spend time debugging the code. If you want to achieve good results and spend
less time debugging, we recommend that you be strict in your
code.
If you set warn to 1 or 2, buggy code fails fast. When warn is set to 0 (the
default), warnings are deferred: the function returns its value first, and
then all the warning messages are displayed together.
> options(warn=0)
> f <- function (x,y){ as.numeric(x)+as.numeric(y)}
> f("Learn","R")
[1] NA
Warning messages:
1: In f("Learn", "R") : NAs introduced by coercion
2: In f("Learn", "R") : NAs introduced by coercion
The function coerces its two input arguments to numeric vectors. As the input
arguments are both strings, we get two warning messages, but they appear
only after the function returns. As a consequence, if the function takes a
considerable amount of time to complete, you do not see any warning message
until you get the final result, even though the intermediate computation may
have been off track for some time.
If you want to print the warning messages as soon as the warning is produced,
you can use warn = 1
> options(warn=1)
> f("Learn","R")
Warning in f("Learn", "R") : NAs introduced by coercion
Warning in f("Learn", "R") : NAs introduced by coercion
[1] NA
In this case, we get the same results but warning messages appear before the
result. If the function is time-consuming, you can see the warning messages
first and you may decide to stop the code and debug.
If you set warn = 2, all warnings are treated as errors. This is the
strictest warning level.
> options(warn=2)
> f("Learn","R")
Error in f("Learn", "R") :
(converted from warning) NAs introduced by coercion
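Changing warn affects the whole session. As a sketch of a more local approach, a warning can be silenced for a single expression with suppressWarnings(), and a stricter level can be applied temporarily and then restored. The variable names are illustrative.

```r
# Silence the warning for this one expression only.
res <- suppressWarnings(as.numeric("Program"))
res

# Temporarily treat warnings as errors, capture the error message,
# then restore the previous warning level.
old <- options(warn = 2)
msg <- tryCatch(as.numeric("Program"), error = function(e) conditionMessage(e))
options(old)
msg
```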
R has not only a rich collection of packages but also a well-maintained
package archive system called the Comprehensive R Archive Network, or CRAN
(http://cran.r-project.org/). CRAN is an archive of the source code of R and
thousands of packages. At the time of writing, there are 17077 active
packages on CRAN, and more than 100 packages are updated every week.
You can check out the list of packages at
https://cran.rstudio.com/web/packages/.
CRAN archives R packages and distributes them to more than 120 mirrors
around the world. You can visit CRAN Mirrors
(https://cran.r-project.org/mirrors.html) and check out a nearby mirror. If
you find one, you can go to Tools | Global Options and open the following
dialog:
You can change the CRAN mirror to a nearby one, or keep the default
mirror. A nearby mirror usually gives faster downloads.
Once you have chosen a mirror, installing an R package is easy. For example,
you can install a package using install.packages("ggplot2"). R will download
the package, compile it if necessary, and install it.
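The mirror can also be set from code, and installation can be skipped when a package is already available. In this sketch the mirror URL is only an example, and install_if_missing() is a hypothetical helper, not a function provided by R.

```r
# Choose a CRAN mirror for this session (the URL is an example).
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Hypothetical helper: install a package only when it is not yet available.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can be loaded now.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call returns without downloading anything.
ok <- install_if_missing("stats")
ok
```

Calling install_if_missing("ggplot2") would download the package only on a machine where it is not yet installed.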
You can also install a package using RStudio. Go to the Tools | Install
Packages menu option.
As stated in its description, a package may have some dependencies.
install.packages() takes care of the dependencies and installs
them before installing the package.
Both RStudio and the command line look for the newest available version and
install the package together with its dependencies.
Many package authors publish their packages on GitHub, where version control
and community development are easy. The latest version of a package is often
available first on GitHub. If you want to try the latest
development version of a package, you can install it directly from the
online repository using the devtools package.
For this, you need to install the devtools package if it is not already installed.
install.packages("devtools")
Then you can use install_github() in the devtools package to install the
latest development version of a package
library(devtools)
install_github("hadley/ggplot2")
The devtools package will download the source code from GitHub and build it
into a package in your library. If the package already exists in the library,
the installation will replace it without asking.
If, for some reason, you want to return to the latest CRAN version, you can
do so by running the following command
install.packages("ggplot2")
Package Functions
If you want to use a function that is part of a package, you can do so either
using library() or using package::function(). The second option uses the
function without attaching the whole package to the environment.
library(moments)
skewness(x)
Alternatively, we can use the function without attaching the package using ::
moments::skewness(x)
We receive the same output from both methods. However, they have a
different impact on the environment. The first method (using library())
modifies the search path of symbols, whereas the second method (using ::)
does not. When you call library(moments), the package is attached to the
search path and the package functions can be used directly in the subsequent
code.
Here you can see the R version and the list of attached and loaded packages.
When we use :: to access a function in a package, the package is not attached
but it is loaded in the memory.
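The difference between a loaded and an attached package can be checked from code. This sketch substitutes the tools package, which ships with R, for moments so that it runs without installing anything; that substitution is an assumption made for the example.

```r
# Call a function through :: without attaching its package.
ext <- tools::file_ext("household.csv")
ext

# The namespace is now loaded in memory ...
"tools" %in% loadedNamespaces()

# ... but the package is on the search path only if it was attached
# with library() at some point in the session.
"package:tools" %in% search()
```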
This shows that the package has been loaded but not attached. We can use the
following to attach the package
In this case the skewness() and other functions in moments package are
directly available.
> search()
[1] ".GlobalEnv" "tools:rstudio" "package:stats"
[4] "package:graphics" "package:grDevices" "package:utils"
[7] "package:datasets" "package:methods" "Autoloads"
[10] "package:base"
> require(textPkg)
Loading required package: textPkg
Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE,
logical.return = TRUE, :
there is no package called 'textPkg'
> library(textPkg)
Error in library(textPkg) : there is no package called
'textPkg'
Suppose your R script is long and time-consuming. If the script uses
require() and the required package is not installed on the machine, require()
only issues a warning and returns FALSE. The script keeps running and fails
later, when a function from the package is first called. With library(), the
script stops with an error as soon as library() is called for a package that
is not installed.
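The different return behaviour can be sketched as follows, reusing the package name textPkg from the example above and assuming it is not installed. The tryCatch() wrapper is only there so that the sketch can run to the end.

```r
# require() returns FALSE when the package cannot be loaded.
ok <- suppressWarnings(require("textPkg", character.only = TRUE, quietly = TRUE))
ok

# library() signals an error instead; tryCatch() captures the message here.
msg <- tryCatch(
  {
    library("textPkg", character.only = TRUE)
    "attached"
  },
  error = function(e) conditionMessage(e)
)
msg

# A script that uses require() can still fail fast by checking the result.
if (!ok) message("textPkg is not installed; a real script could stop here")
```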
Masking and name conflicts
When you start a fresh R session, the basic packages are automatically
attached. The base packages include base, stats, graphics, and so on. You can
directly use the functions of these base packages. For example, if you want to
calculate the average of a numeric vector, you can directly use mean()
without using base::mean().
library(dplyr)
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
The implementations of these functions in the dplyr package generally do not
change their meaning and usage; they generalize them. The functions in the
package are meant to be compatible with the masked versions, so in most cases
there is no need to worry. The masked functions are not broken.
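Masking can be reproduced without installing any package. In this sketch a function defined in the global environment masks stats::sd, because the global environment comes first on the search path; the replacement function is hypothetical and deliberately silly.

```r
# A global function named sd masks stats::sd (hypothetical definition).
sd <- function(x) "masked version"

# An unqualified call finds the global definition first.
masked_result <- sd(c(1, 2, 3))

# A qualified call always reaches the version in the stats package.
original_result <- stats::sd(c(1, 2, 3))

# Remove the mask; sd() refers to stats::sd again.
rm(sd)

masked_result
original_result
```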
Package functions that mask base functions generally generalize the base
functions rather than replace them. However, if you need to use two packages
that contain functions with the same name, you should not attach the
packages. Instead, you can access the functions from each package with ::,
as shown below
> unloadNamespace("moments")
> skewness(c(1, 2, 3, 2, 1))
Error in eval(expr, envir, enclos): could not find function "skewness"
> moments::skewness(c(1, 2, 3, 2, 1))
[1] 0.3436216
3. Basic Objects
To learn any programming language, the first step is to get familiar with basic
objects and their behaviour. In this chapter you will learn
We need different types of objects while solving any problem. Each object has
its own properties and behaviour. To solve real-world business problems,
you need to understand how the basic objects work. This helps you solve
problems with more elegant code and fewer steps. A concrete understanding of
object behaviour lets you spend more time on solving the problem and less
time on fixing countless minor issues.
Vector
The vector is one of the building blocks of all R objects. It contains
primitive values of the same type. A vector can be a group of numbers, texts,
true/false values, or values of some other type. Several types of vectors
exist in R. The most commonly used are numeric vectors, logical vectors, and
character vectors.
Numeric Vector
> 1.5
[1] 1.5
After creating a value, we can store it for future use. To do so, we can use
the equal operator (=), the leftward operator (<-), or the rightward operator
(->). We can create a variable in the following ways
Once the variable is created and the value is stored, we can use the variable
to represent the value from now on
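The three assignment operators can be sketched as follows; the variable names and values are arbitrary.

```r
x = 1.5     # equal operator
y <- 2.5    # leftward operator
3.5 -> z    # rightward operator

# The variables now stand for their values.
x + y + z
```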
We can create a numeric vector using multiple ways such as calling numeric()
to create a zero vector of a given length:
> numeric(10)
[1] 0 0 0 0 0 0 0 0 0 0
By using c(), we can combine several vectors to make one vector. For
example, we can combine several single-element vectors to create a multi-
element vector.
> c(1,2,3,4)
[1] 1 2 3 4
> 1:5
[1] 1 2 3 4 5
> 1+1:5
[1] 2 3 4 5 6
In this case, 1+1:5 does not produce a sequence from 2 to 5; it produces the
sequence from 2 to 6. The operator : has higher precedence than +. Hence, 1:5
is evaluated first and then 1 is added to each element.
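Parentheses change the order of evaluation. A short comparison:

```r
# 1:5 is evaluated first, then 1 is added to each element.
a <- 1 + 1:5
a

# Parentheses force the addition first, giving the sequence from 2 to 5.
b <- (1 + 1):5
b
```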
To create a numeric sequence, we can use seq(). The following code produces
a numeric vector of a sequence from 1 to 10 with an increment of 2
> seq(1,10,2)
[1] 1 3 5 7 9
Functions such as seq() have many parameters. When calling a function, we
can supply values for these parameters. However, most parameters have default
values, so we only need to pass a parameter when we want to override its
default.
We can create a numeric vector that starts from 2 with length 7 by specifying
the length.out parameter.
> seq(2,length.out=7)
[1] 2 3 4 5 6 7 8
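Naming the parameters makes calls to seq() easier to read. A few variations, with arbitrary values chosen for illustration:

```r
# Same as seq(1, 10, 2), with the parameters named.
s1 <- seq(from = 1, to = 10, by = 2)

# Start at 2, step by 3, and stop after 4 elements.
s2 <- seq(from = 2, by = 3, length.out = 4)

# Five evenly spaced values between 0 and 1.
s3 <- seq(0, 1, length.out = 5)

s1
s2
s3
```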
Logical vector
A logical vector stores a group of TRUE or FALSE values. They represent
yes or no answers to a group of logical questions.
> TRUE
[1] TRUE
We can obtain the logical vector by asking logical questions about R object. For
example, we can ask whether 1 is greater than 2 by using the following:
> 1 > 2
[1] FALSE
> 1 < 2
[1] TRUE
> c(1,2)>2
[1] FALSE FALSE
> c(1,2)>1
[1] FALSE TRUE
We can also compare two multi-element numeric vectors. For this comparison,
the length of the longer vector should be a multiple of the length of the
shorter one; otherwise R still recycles the shorter vector but issues a
warning
This expression is equivalent to c(1>2,2>1). We can consider another
example to demonstrate how we can compare two vectors of different lengths.
In this case, the shorter vector is recycled, that is, repeated. Hence the
vector c(2,3) becomes c(2,3,2,3) and the comparison becomes
c(2>1,3>-2,2>3,3>-3). More specifically, the shorter vector is recycled
until there is a comparison for each element in the longer vector.
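The two comparisons discussed above are not shown in the text. The following sketch reconstructs them from the expansions c(1>2,2>1) and c(2>1,3>-2,2>3,3>-3), so the exact original calls are an assumption.

```r
# Element-wise comparison of two vectors of equal length.
r1 <- c(1, 2) > c(2, 1)
r1

# The shorter vector c(2, 3) is recycled to c(2, 3, 2, 3).
r2 <- c(2, 3) > c(1, -2, 3, -3)
r2
```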
R also provides the %in% operator. It tells whether each element in the
left-hand side vector is contained in the right-hand side vector:
The %in% operator does not recycle. It iterates over the elements of the
vector on the left-hand side and performs c(1 %in% c(1, 3, 4), 2 %in%
c(1, 3, 4))
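A short sketch of this behaviour, using the vectors from the expression above plus one longer left-hand side chosen for illustration:

```r
# One result per element of the left-hand side.
in1 <- c(1, 2) %in% c(1, 3, 4)
in1

# The result always has the length of the left-hand side vector.
in2 <- c(1, 2, 5, 6) %in% c(1, 3, 4)
in2
```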
Character Vector
A character vector is a group of strings. Here, character does not
mean a single letter or symbol in a language; it means a string such as
"this is a string". We can use either double quotation marks or
single quotation marks to create a character vector, as follows:
> c('hello','world')
[1] "hello" "world"
The character vectors are equal because " and ' both work to create a string
and do not affect its value. However, the quotes at the beginning and end of
a string must be both double quotes or both single quotes. They cannot be
mixed.
We get FALSE for both elements, because neither Hello nor World equals
Hello, World.
2. You can insert single quotes into a string that starts and ends with
double quotes.
3. You cannot directly insert double quotes into a string that starts and
ends with double quotes.
4. You cannot directly insert a single quote into a string that starts and
ends with single quotes.
5. You can use the escape character (\) to insert double quotes into a string
that starts and ends with double quotes.
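The quoting rules in the list above can be sketched as follows. The example strings are made up for illustration.

```r
# A single quote inside a double-quoted string needs no escaping.
single_inside_double <- "It's a string"

# Double quotes inside a single-quoted string need no escaping either.
double_inside_single <- 'She said "hello"'

# Inside a double-quoted string, double quotes must be escaped with \.
escaped_double <- "She said \"hello\""

# Both spellings produce the same string value.
print(identical(double_inside_single, escaped_double))  # TRUE

# cat() shows the text itself, without the backslashes.
cat(escaped_double, "\n")
```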
You can access specific entries using indexing. We use [ ] brackets for
indexing. Indexing starts at position 1. A negative index drops the element
at that position.
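The examples that follow index a vector m whose definition did not survive in this section. Judging by the printed results, it holds the first six month abbreviations, so a plausible reconstruction (an assumption, not the original code) is:

```r
# Assumed definition of m, inferred from the outputs of the later examples.
m <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")

first_month   <- m[1]    # positions start at 1
without_first <- m[-1]   # a negative index drops that position

print(first_month)
print(without_first)
```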
> m[2]
[1] "Feb"
We can access a range of elements using :. In this case we get the elements
from the 2nd to the 4th position.
> m[2:4]
[1] "Feb" "Mar" "Apr"
We can also pass a vector of positions. In the following example, we
get the elements in the 2nd, 4th, and 5th positions.
> m[c(2,4,5)]
[1] "Feb" "Apr" "May"
We can also use logical indexing. In this case, wherever the value is TRUE, we
get the corresponding element.
> m[c(TRUE,FALSE,TRUE,TRUE,FALSE,FALSE)]
[1] "Jan" "Mar" "Apr"
We can also use negative indexing. The elements at the positions given as
negative values are dropped. Positive and negative positions cannot be mixed
in the same index.
> m[c(-2,-6)]
[1] "Jan" "Mar" "Apr" "May"
> m[c(2,-4)]
Error in m[c(2, -4)] : only 0's may be mixed with negative
subscripts
If we subset the vector using positions beyond its range, the
non-existing positions are returned as NA, which represents a missing
value. In the following example, we subset the vector using the 7th
position, which does not exist.
> m[c(2,7)]
[1] "Feb" NA
> m[c(2:7)]
[1] "Feb" "Mar" "Apr" "May" "Jun" NA
> m[2:4] <- c("Aug","Sep","Oct")
> m
[1] "Jul" "Aug" "Sep" "Oct" "May" "Jun"
We can select values by a logical criterion. For example, the following code
picks out all the elements of num that are not greater than 2.
We can use a more complex selection criterion, for example to
pick all the elements of num that satisfy x^2 - x + 1 > 2.
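Both kinds of selection can be sketched as below. The contents of num are hypothetical, and the second criterion is read as x^2 - x + 1 > 2, which is an assumption about the formula intended in the text.

```r
# Hypothetical numeric vector used only for illustration.
num <- c(1, 2, 3, 4, 5)

# Elements that are not greater than 2.
not_greater_than_2 <- num[num <= 2]

# Elements that satisfy x^2 - x + 1 > 2 (assumed reading of the criterion).
complex_pick <- num[num^2 - num + 1 > 2]

print(not_greater_than_2)   # 1 2
print(complex_pick)         # 2 3 4 5
```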
We can also assign to an entry that does not exist yet. In this case, the
vector expands automatically and the unassigned positions are set to NA.
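A minimal sketch of this automatic expansion, using a made-up vector v:

```r
v <- c(1, 2, 3)

# Position 6 does not exist yet; positions 4 and 5 are filled with NA.
v[6] <- 10

print(v)          # 1 2 3 NA NA 10
print(length(v))  # 6
```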
Named Vector
We can assign names to the elements of a vector. We can assign the names
when we create the vector.
We can also create the named vector without quoting the names.
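The code that creates the vector n is missing here. Based on the outputs of the later examples, a plausible reconstruction (an assumption) of both spellings is:

```r
# Names given as quoted strings.
n <- c("First" = "Mary", "Last" = "John")

# The same vector with unquoted names.
n_unquoted <- c(First = "Mary", Last = "John")

print(n)
print(identical(n, n_unquoted))  # TRUE
```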
> n["First"]
First
"Mary"
We can also get multiple elements. In this case, we have to pass the names
of the elements as a character vector.
> n[c("First","Last")]
First Last
"Mary" "John"
We can also reverse the order with a character string index vector
> n[c("Last","First")]
Last First
"John" "Mary"
If the character string index vector has duplicate elements, the selection
will return the corresponding elements more than once.
> n[c("First","First","Last")]
First First Last
"Mary" "Mary" "John"
> names(n)
[1] "First" "Last"
We can change the names of the vector by assigning another character vector
to its names.
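A sketch of renaming through names(). The new names used here are hypothetical; only SurName appears in the surviving output.

```r
n <- c(First = "Mary", Last = "John")

# Replace the names; "FirstName" is a made-up name for illustration.
names(n) <- c("FirstName", "SurName")
print(n)

# The old name no longer exists, so looking it up gives a missing value.
old_lookup <- n["Last"]
print(old_lookup)
```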
If we try to access an element that does not exist in the vector, we get a
single missing value with a missing name.
> n["Last"]
<NA>
NA
> n[c("Last","SurName")]
<NA> SurName
NA "John"
Exercise
> names(vec) <- c('sum1', 'num1', 'dum1', 'rum1', 'lum1')
> vec
sum1 num1 dum1 rum1 lum1
21 22 23 24 25
a. > vec
sum1 num1 dum1 rum1 lum1
21 22 23 24 25
b. > (vec)
sum1 num1 dum1 rum1 lum1
21 22 23 24 25
c. > print(vec)
sum1 num1 dum1 rum1 lum1
21 22 23 24 25
d. > show(vec)
sum1 num1 dum1 rum1 lum1
21 22 23 24 25
> length(vec)
[1] 5
> str(vec)
Named int [1:5] 21 22 23 24 25
- attr(*, "names")= chr [1:5] "sum1" "num1" "dum1" "rum1" ...
> unname(vec)
[1] 21 22 23 24 25
> names(vec)
[1] "sum1" "num1" "dum1" "rum1" "lum1"
> vec["sum1"]
sum1
21
> vec[c('num1','rum1','lum1')]
num1 rum1 lum1
22 24 25
> vec[3]
dum1
23
> vec[c(2,4)]
num1 rum1
22 24
> vec[3:5]
dum1 rum1 lum1
23 24 25
> vec[-2]
sum1 dum1 rum1 lum1
21 23 24 25
> vec[c(-3,-4)]
sum1 num1 lum1
21 22 25
15. Arrange elements in a specific order based on indexes/positions
> vec[c(3,2,4,1,5)]
dum1 num1 rum1 sum1 lum1
23 22 24 21 25
> rev(vec)
lum1 rum1 dum1 num1 sum1
25 24 23 22 21
> vec[sort(names(vec))]
dum1 Lion lum1 sum1 Tiger
42 22 11 21 24
> table(is.na(vec))
FALSE TRUE
3 2
Extracting Element
You can use [[ ]] to extract exactly one element. You cannot extract more
than one element, use a negative index, or ask for a name that does not exist.
> x[[c("a","b")]]
Error in x[[c("a", "b")]] :
attempt to select more than one element in vectorIndex
> x[[-1]]
Error in x[[-1]] : invalid negative subscript in get1index
<real>
> x[["d"]]
Error in x[["d"]] : subscript out of bounds
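For contrast with the errors above, here is a minimal sketch of [[ ]] succeeding. The definition of x is not shown in this section, so a small named vector is assumed.

```r
# Assumed definition of x; the names a, b, c are hypothetical.
x <- c(a = 1, b = 2, c = 3)

by_name     <- x[["a"]]  # the bare value, without its name
by_position <- x[[2]]    # the second element
kept_name   <- x["a"]    # single brackets keep the name

print(by_name)
print(by_position)
print(kept_name)
```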
On several occasions, we need to know the kind of vector we are dealing with
before we can use it. We can use class() function to know about the class of
any R object.
> class(c(1,2,3))
[1] "numeric"
> class(c(TRUE, FALSE))
[1] "logical"
> class(c("Tiger","Snake"))
[1] "character"
To check whether an object is a vector of a specific class, we can use
is.numeric(), is.logical(), and is.character().
> is.numeric(c(1,2,3))
[1] TRUE
> is.logical(c(TRUE,FALSE))
[1] TRUE
> is.character(c("Tiger","Snake"))
[1] TRUE
> is.numeric(c("Tiger","Snake"))
[1] FALSE
Converting Vectors
> st + 10
Error in st + 10 : non-numeric argument to binary operator
> num + 1
[1] 2 3 4
> num + 10
[1] 11 12 13
We can use the as.* family of functions to convert a vector from one class
to another.
> as.numeric(c('1','2','3'))
[1] 1 2 3
> as.numeric(c('1','2','3','a'))
[1] 1 2 3 NA
Warning message:
NAs introduced by coercion
> as.logical(c(-1,0,1,2))
[1] TRUE FALSE TRUE TRUE
> as.character(c(1,2,3))
[1] "1" "2" "3"
> as.character(c(TRUE,FALSE))
[1] "TRUE" "FALSE"
Although each type of vector can be converted to the other types, the
conversion follows a set of rules.
The second command in the previous code block tries to convert a character
vector to a numeric vector, as the first command did. However, the last
element, a, cannot be converted to a number. The conversion succeeds for the
character representations of numeric values, but the character value a
produces a missing value (NA) and a warning.
Arithmetic Operators
Arithmetic operations on vectors are easy to perform. They follow two rules:
1. Computing in an element-wise manner
2. Recycling the shorter vector
> c(10,11,12,13) + 20
[1] 30 31 32 33
> c(20,21,22,23) - c(10,11,12,13)
[1] 10 10 10 10
> c(10,11,12,13) * c(1,2,3,4)
[1] 10 22 36 52
> c(10,15,20,25) / c(2,3,4,5)
[1] 5 5 5 5
> c(1,2,3,4)^2
[1] 1 4 9 16
> c(1,2,3,4) ^ c(1,2,3,4)
[1] 1 4 27 256
> c(2,3,4,5)%%2
[1] 0 1 0 1
> c(a=1,b=2,c=3)+c(d=1,e=2,f=3)
a b c
2 4 6
> c(a=1, b=2, 3)+c(d=1,e=2,f=3)
a b
2 4 6
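The examples above recycle only a single number. The sketch below, with made-up vectors, shows the same rule when the shorter vector has two elements.

```r
four_values <- c(1, 2, 3, 4)
two_values  <- c(10, 20)

# two_values is recycled to c(10, 20, 10, 20) before the addition.
recycled_sum <- four_values + two_values
print(recycled_sum)   # 11 22 13 24
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.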
Matrix
$A = \begin{bmatrix} 1 & 2 & 5 \\ 2 & 3 & 7 \end{bmatrix}$
> A = matrix(
+ c(1, 3, 2, 4, 6, 7), #the data elements
+ nrow = 2, #desired number of rows
+ ncol = 3 #desired number of columns
+ )
> A #Print the matrix
[,1] [,2] [,3]
[1,] 1 2 6
[2,] 3 4 7
In this case, if you omit either nrow or ncol, R computes the missing
value from the length of the data and the argument that was given.
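A short sketch of this behaviour, reusing the data from the example above (the variable names are illustrative):

```r
values <- c(1, 3, 2, 4, 6, 7)

# Only nrow is given: ncol is worked out as 6 / 2 = 3.
A_rows_only <- matrix(values, nrow = 2)

# Only ncol is given: nrow is worked out as 6 / 3 = 2.
A_cols_only <- matrix(values, ncol = 3)

print(dim(A_rows_only))                      # 2 3
print(identical(A_rows_only, A_cols_only))   # TRUE
```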
You may also state the fill order explicitly. By default, R fills the
values column by column. You can make this explicit by passing
byrow = FALSE; passing byrow = TRUE fills the matrix row by row instead.
> A = matrix(
+ c(1, 3, 2, 4, 6, 7), #the data elements
+ nrow = 2, #desired number of rows
+ ncol = 3, #desired number of columns
+ byrow = FALSE #Fill the matrix column by column
+ )
> A #Print the value of A
[,1] [,2] [,3]
[1,] 1 2 6
[2,] 3 4 7
> A = matrix(
+ c(1, 3, 2, 4, 6, 7), #the data elements
+ nrow = 2, #desired number of rows
+ ncol = 3, #desired number of columns
+ byrow = TRUE #Fill the matrix row by row
+ )
> A #Print the value of A
[,1] [,2] [,3]
[1,] 1 3 2
[2,] 4 6 7
> diag(1,nrow=5)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 0 1 0 0 0
[3,] 0 0 1 0 0
[4,] 0 0 0 1 0
[5,] 0 0 0 0 1
A diagonal matrix created this way has an equal number of rows and columns.
When only nrow is given, R uses the same value for the number of columns.
Naming Rows and Columns
> A = matrix(
+ c(1, 2, 3, 4, 5, 6), #the data elements
+ nrow = 3, #desired number of rows
+ byrow = TRUE, #Fill the matrix row by row
+ dimnames = list( #To give names of rows and columns
+ c('r1','r2','r3'), #row name
+ c('c1','c2') #column name
+ ))
> A #Print the value of A
c1 c2
r1 1 2
r2 3 4
r3 5 6
> B = matrix(
+ c(1, 2, 3, 4, 5, 6), #the data elements
+ nrow = 3) #desired number of rows
> B #Print the values of B
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> rownames(B) <- c('r1','r2','r3') #Specify row name
> colnames(B) <- c('c1','c2') #Specify col name
> B #Print the value of B
c1 c2
r1 1 4
r2 2 5
r3 3 6
Subsetting a Matrix
We have seen in the previous section that a matrix has a two-dimensional
rectangular layout. To access a value in a matrix we need the two-dimensional
accessor [ , ]. This is similar to the one-dimensional accessor [ ].
To select a subset of a matrix, we supply one vector for each
dimension, separated by a comma. The first vector is the row
selector and the second vector is the column selector.
To extract the single element in the first row and the second column:
> B[1,2]
[1] 4
> B
c1 c2
r1 1 4
r2 2 5
r3 3 6
> B[2:3,1]
r2 r3
2 3
> B[2:3,1:2]
c1 c2
r2 2 5
r3 3 6
If we leave one dimension blank, all the values in that dimension will be
returned
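A minimal sketch of leaving one dimension blank, rebuilding the matrix B from the earlier example so that the block runs on its own:

```r
B <- matrix(
  c(1, 2, 3, 4, 5, 6),
  nrow = 3,
  dimnames = list(c("r1", "r2", "r3"), c("c1", "c2"))
)

second_row <- B[2, ]   # all columns of row r2
second_col <- B[, 2]   # all rows of column c2

print(second_row)
print(second_col)
```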
Even though a matrix can be represented and accessed in two
dimensions, it is still a vector. Hence, we can also use the one-dimensional
accessor for vectors. The elements are counted column by column.
> B[1]
[1] 1
> B[5]
[1] 5
Like a vector, a matrix contains entries of a single type. If you
apply an inequality to it, you get a logical matrix of the same size.
> B > 2
c1 c2
r1 FALSE TRUE
r2 FALSE TRUE
r3 TRUE TRUE
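Such a logical matrix can itself be used as an index. The sketch below rebuilds B so that it runs on its own; using the logical matrix as a selector is an illustration added here, not something shown in the original examples.

```r
B <- matrix(
  c(1, 2, 3, 4, 5, 6),
  nrow = 3,
  dimnames = list(c("r1", "r2", "r3"), c("c1", "c2"))
)

is_large <- B > 2        # logical matrix of the same size as B
large    <- B[is_large]  # the selected values, as a plain vector

print(large)   # 3 4 5 6
```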
Matrix Operators
We can use all the arithmetic operators with matrices as well. The operators
work element-wise, except for the matrix product, %*%.
> B
c1 c2
r1 1 4
r2 2 5
r3 3 6
> B + B #Addition
c1 c2
r1 2 8
r2 4 10
r3 6 12
> B - 0.5*B #Subtraction
c1 c2
r1 0.5 2.0
r2 1.0 2.5
r3 1.5 3.0
> B * B #Multiplication
c1 c2
r1 1 16
r2 4 25
r3 9 36
> B / 0.5*B #Division, then multiplication: (B / 0.5) * B
c1 c2
r1 2 32
r2 8 50
r3 18 72
> B^2 #Power
c1 c2
r1 1 16
r2 4 25
r3 9 36
> t(B) %*% B #Matrix Multiplication
c1 c2
c1 14 32
c2 32 77
> t(B)
r1 r2 r3
c1 1 2 3
c2 4 5 6
Arrays
Compared to matrices, arrays can have more than two dimensions. To create
an array, we can use array() function. To specify the dimension, we can use
dim parameter.
, , 2
We can create the array with names for these dimensions using dimnames
y1 y2 y3
x1 1 5 9
x2 2 6 10
x3 3 7 11
x4 4 8 12
, , z2
y1 y2 y3
x1 13 17 21
x2 14 18 22
x3 15 19 23
x4 16 20 24
For an array that already exists, we can set the names of each dimension
with dimnames(x) <-.
y1 y2 y3
x1 1 5 9
x2 2 6 10
x3 3 7 11
x4 4 8 12
, , z2
y1 y2 y3
x1 13 17 21
x2 14 18 22
x3 15 19 23
x4 16 20 24
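The code that produced the printed arrays is missing. A sketch that is consistent with the outputs (the values 1 to 24 in a 4 x 3 x 2 array) could look like this; the exact original calls are an assumption.

```r
dim_names <- list(
  c("x1", "x2", "x3", "x4"),
  c("y1", "y2", "y3"),
  c("z1", "z2")
)

# Names supplied when the array is created.
named_at_creation <- array(1:24, dim = c(4, 3, 2), dimnames = dim_names)

# Names assigned to an array that already exists.
mularray1 <- array(1:24, dim = c(4, 3, 2))
dimnames(mularray1) <- dim_names

print(dim(mularray1))                             # 4 3 2
print(identical(mularray1, named_at_creation))    # TRUE
print(mularray1["x3", "y2", "z1"])                # 7
```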
Subsetting an array
> mularray1[1,,]
z1 z2
y1 1 13
y2 5 17
y3 9 21
> mularray1[,1,]
z1 z2
x1 1 13
x2 2 14
x3 3 15
x4 4 16
> mularray1[,,1]
y1 y2 y3
x1 1 5 9
x2 2 6 10
x3 3 7 11
x4 4 8 12
> mularray1[3,2,1]
[1] 7
> mularray1[1:2,2:3,1]
y2 y3
x1 5 9
x2 6 10
As you may notice, atomic vectors, matrices, and arrays share almost the same
set of behaviours. A fundamental common feature they share is that they are
all homogeneous data types, that is, the type of elements they store must be
the same. However, there are also heterogeneous data types in R, that is, they
can store different types of elements, which makes them much more flexible
but they are less memory efficient and slower to operate.
List
A list in R can contain many different data types. A list is an ordered
collection of data that can be changed after it is created. A list is known
for its flexibility: different kinds of information can be stored in, and
extracted from, one object without calling a different function each time.
We can use list() to create a list and put different types of objects into
one list.
For example, the following variable x is a list of three vectors and a numeric
value.
> n <- c(2,3,5) #Numeric Vector
> p <- c(TRUE, FALSE) #Logical Vector
> q <- c('a','b','c') #Character Vector
> x <- list(n,p,q,3) #Heterogeneous members
> x #Print List x
[[1]]
[1] 2 3 5
[[2]]
[1] TRUE FALSE
[[3]]
[1] "a" "b" "c"
[[4]]
[1] 3
We can also use double square bracket to extract the value of a list member
> x[[1]]
[1] 2 3 5
> x[[2]]
[1] TRUE FALSE
> x[[3]]
[1] "a" "b" "c"
> x[[4]]
[1] 3
You can also provide the name to extract the list member with that name
> x[["n"]]
[1] 2 3 5
> x$a
NULL
> x[[5]]
Error in x[[5]] : subscript out of bounds
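The lookups by name above presuppose a list whose members are named. The code that creates it is not shown, so the following is only a plausible sketch using the member names n, p, and q.

```r
# Assumed named version of the list x.
x <- list(
  n = c(2, 3, 5),
  p = c(TRUE, FALSE),
  q = c("a", "b", "c")
)

by_double_bracket <- x[["n"]]  # member named n
by_dollar         <- x$p       # same idea with the $ operator
missing_member    <- x$a       # no member named a, so NULL

print(by_double_bracket)
print(by_dollar)
print(is.null(missing_member))  # TRUE
```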
Subset a list
$q
[1] "a" "b" "c"
$p
[1] TRUE FALSE
$q
[1] "a" "b" "c"
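The outputs above look like the result of subsetting with single brackets, which returns a list rather than the member itself. The calls below are an assumption about what produced them.

```r
x <- list(
  n = c(2, 3, 5),
  p = c(TRUE, FALSE),
  q = c("a", "b", "c")
)

one_member  <- x["q"]            # a list of length 1
two_members <- x[c("q", "p")]    # a list of length 2, in the requested order
the_vector  <- x[["q"]]          # the character vector itself

print(class(one_member))    # "list"
print(names(two_members))   # "q" "p"
print(class(the_vector))    # "character"
```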
Named Lists
Even if the list members were named when the list was created, we can
always name or rename them afterwards.
$Pum
[1] TRUE FALSE
$Dum
[1] "a" "b" "c"
$<NA>
[1] 3
[[2]]
[1] TRUE FALSE
[[3]]
[1] "a" "b" "c"
[[4]]
[1] 3
Once we remove the names of the list members, we can no longer access the
members by name. We can still access them by position and by logical
criterion.
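Renaming and removing names can be sketched as follows. The new names are hypothetical and do not reproduce the ones in the printed output.

```r
x <- list(n = c(2, 3, 5), p = c(TRUE, FALSE), q = c("a", "b", "c"), 3)

# Rename every member; these names are made up for illustration.
names(x) <- c("first", "second", "third", "fourth")
renamed_lookup <- x$second

# Remove all names.
names(x) <- NULL

by_position <- x[[2]]                           # still works
by_logical  <- x[c(TRUE, FALSE, TRUE, FALSE)]   # still works

print(renamed_lookup)
print(is.null(names(x)))   # TRUE
print(length(by_logical))  # 2
```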
Setting Values
$b
[1] TRUE FALSE
$c
[1] "aa" "bb"
>
> y$a <- 2
> y
$a
[1] 2
$b
[1] TRUE FALSE
$c
[1] "aa" "bb"
$b
[1] TRUE FALSE
$c
[1] "aa" "bb"
$d
[1] 1 2
$b
[1] "Updated Values"
$c
[1] 1 4
$d
[1] 1 2
We can easily remove more than one member from the list.
$b
[1] "Updated Values"
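Most of the commands in this subsection are missing. The sketch below shows one way to set, add, update, and remove list members that is consistent with the surviving outputs; the exact original commands are an assumption.

```r
y <- list(a = 1, b = c(TRUE, FALSE), c = c("aa", "bb"))

y$a <- 2            # overwrite an existing member
y$d <- c(1, 2)      # add a new member

# Update several members at once.
y[c("b", "c")] <- list("Updated Values", c(1, 4))

# Assigning NULL removes members; here three are removed in one step.
y[c("a", "c", "d")] <- NULL

print(y)
```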
To find out whether an R object is a list, we can use the is.list() function.
$b
[1] 2
$c
[1] 3
We can convert a list to a vector by calling the unlist() function. If all
the members are of the same type, the result is a vector of that type.
Here, zz$a and zz$b are numbers and can be converted to characters,
but zz$c is a character vector and cannot be converted to numeric
values. Therefore, the closest type that is compatible with all elements is a
character vector.
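A sketch of both cases. The contents of zz are not shown in this section, so the values below are assumed.

```r
# All members numeric: the result is a numeric vector.
same_type <- list(a = 1, b = 2, c = 3)
numeric_result <- unlist(same_type)

# Mixed members (assumed contents of zz): everything becomes character.
zz <- list(a = 1, b = 2, c = "hello")
mixed_result <- unlist(zz)

print(is.list(zz))            # TRUE
print(class(numeric_result))  # "numeric"
print(mixed_result)
```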
Data Frame
A data frame is used for storing data tables. It is a list of vectors of equal
length.
A data frame displays data in the form of a table. Different columns can hold
different types of data: the first column can be character while the
second and third are numeric or logical. However, all values within one
column must have the same type.
The following table fully characterizes a data frame.
We can create a data frame using data.frame() and give the data of each
column using a vector of the corresponding type
We can also create the data frame from a list either by calling data.frame()
or as.data.frame().
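The creation code is missing here. The sketch below builds a data frame from vectors and from a list, using values that match the outputs shown later for df1; the column names x and y are placeholders.

```r
# From vectors, one per column.
df_from_vectors <- data.frame(
  id = 1:5,
  x  = c(-1, 0, 1, 2, 3),
  y  = c(0.76, 0.45, 0.56, 0.63, 0.71)
)

# From a list of equal-length vectors.
columns <- list(
  id = 1:5,
  x  = c(-1, 0, 1, 2, 3),
  y  = c(0.76, 0.45, 0.56, 0.63, 0.71)
)
df_from_list    <- data.frame(columns)
df_from_as_call <- as.data.frame(columns)

print(df_from_vectors)
print(dim(df_from_list))   # 5 3
```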
We can also create a data frame from a matrix
Please note that the conversion automatically assigns column names to the
data frame. However, if the columns or the rows have already been named,
the names are preserved in the conversion.
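A minimal sketch of both situations, with made-up matrices:

```r
# Matrix without names: as.data.frame() generates V1, V2, ...
plain_matrix <- matrix(1:6, nrow = 3)
df_plain <- as.data.frame(plain_matrix)
print(names(df_plain))      # "V1" "V2"

# Matrix with row and column names: the names are kept.
named_matrix <- matrix(
  1:6,
  nrow = 3,
  dimnames = list(c("r1", "r2", "r3"), c("c1", "c2"))
)
df_named <- as.data.frame(named_matrix)
print(names(df_named))      # "c1" "c2"
print(rownames(df_named))   # "r1" "r2" "r3"
```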
The data frame is a list that looks like a matrix. Hence, we can apply the
methods to access list and matrix on data frame also.
We can rename the rows and columns in the same way as we do in the case of
a matrix.
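The renaming step for df1 is not shown. One way to arrive at the row and column names that appear in the later outputs (an assumption about the original code) is:

```r
df1 <- data.frame(
  id = 1:5,
  x  = c(-1, 0, 1, 2, 3),
  y  = c(0.76, 0.45, 0.56, 0.63, 0.71)
)

# Same functions as for a matrix.
colnames(df1) <- c("id", "x-value", "y-value")
rownames(df1) <- c("a", "b", "c", "d", "e")

print(df1)
```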
d 4 2 0.63
e 5 3 0.71
Since a data frame is a matrix-like list of column vectors, we can use both sets
of notations to access the elements and subsets in a data frame.
Since the data frame can be regarded as a list of vectors, we can use the list
notations to extract a value. We can either use $ or [[ ]] operators to do so.
> df1$id
[1] 1 2 3 4 5
> df1[[1]]
[1] 1 2 3 4 5
> df1[["id"]]
[1] 1 2 3 4 5
The subset operator ([) allows us to use a numeric vector to extract columns
by position, a character vector to extract columns by name, or a logical vector
to extract columns by TRUE and FALSE selection:
d 2
e 3
> df1[c("x-value","y-value")] #columns named "x-value" and "y-value"
x-value y-value
a -1 0.76
b 0 0.45
c 1 0.56
d 2 0.63
e 3 0.71
> df1[c(TRUE,FALSE, TRUE)] #using logical vector
id y-value
a 1 0.76
b 2 0.45
c 3 0.56
d 4 0.63
e 5 0.71
The list notation does not support row selection, whereas the matrix
notation supports both row selection and column selection. We can use the
[row, column] notation to subset a data frame by specifying a row selector
and a column selector, each of which can be a numeric vector, a character
vector, or a logical vector.
> df1[,1]
[1] 1 2 3 4 5
> df1[,"x-value"]
[1] -1 0 1 2 3
> df1[,c("x-value","y-value")]
x-value y-value
a -1 0.76
b 0 0.45
c 1 0.56
d 2 0.63
e 3 0.71
> df1[,c(1:2)]
id x-value
a 1 -1
b 2 0
c 3 1
d 4 2
e 5 3
The examples of row selectors are
> df1[c('a','e'),1:2]
id x-value
a 1 -1
e 5 3
Note that the matrix notation automatically simplifies the output. That is, if
only one column is selected, the result won't be a data frame but the values of
that column
> df1[1:4,'id']
[1] 1 2 3 4
To always keep the result as a data frame, even if it only has a single column,
we can use both notations together:
> df1[1:4,]['id']
id
a 1
b 2
c 3
d 4
In this case, the first group of brackets subsets the data frame as a matrix,
keeping the first four rows and all the columns. The second group of brackets
subsets the resulting data frame as a list, with only one column selected.
Alternatively, we can pass drop = FALSE to avoid the simplification:
> df1[1:4,'id',drop=FALSE]
id
a 1
b 2
c 3
d 4
Filtering Data
We can filter the rows of a data frame by a criterion and then select the
desired columns. For example, we select the rows of df1 with y-value >=
0.6 and then select the columns id and x-value.
The following code filters the rows of df1 by the criterion that the row name
must be among a, d, or e, and selects the id and x-value columns:
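The filtering code itself is missing. A sketch of both filters, assuming df1 has the columns id, x-value, and y-value and the row names a to e seen in the earlier outputs:

```r
df1 <- data.frame(
  id = 1:5,
  x  = c(-1, 0, 1, 2, 3),
  y  = c(0.76, 0.45, 0.56, 0.63, 0.71)
)
names(df1)    <- c("id", "x-value", "y-value")
rownames(df1) <- c("a", "b", "c", "d", "e")

# Rows whose y-value is at least 0.6, keeping two columns.
high_y <- df1[df1[["y-value"]] >= 0.6, c("id", "x-value")]

# Rows whose name is among a, d, or e, keeping the same columns.
picked <- df1[rownames(df1) %in% c("a", "d", "e"), c("id", "x-value")]

print(high_y)
print(picked)
```

With these particular values both filters happen to select the same rows.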
Factors
In the example above, we can notice that Name, Gender, and MaritalStatus are
not character vectors but factors. Factors represent categorical
data, for example Gender.
We can see that the class of Name, Gender, and MaritalStatus is factor,
whereas the class of Age is numeric. We can confirm this as follows:
> class(person$Name)
[1] "factor"
> class(person$Age)
[1] "numeric"
In this case, we can clearly see the levels (the unique values in the column)
and number of observations.
> str(person$MaritalStatus)
Factor w/ 2 levels "Married","Single": 2 2 1 2
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "Fate") :
invalid factor level, NA generated
> person$Name
[1] <NA> Mate Late Jate
Levels: Jate Kate Late Mate
The reason for the warning message is that Fate was not among the levels
of the factor, which were fixed from the unique values of the character
vector when the data frame was initially created.
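The role of the levels can be sketched with a standalone factor. Adding the level before assigning is one way to avoid the warning; it is shown here as an illustration, not as the approach taken in the original text.

```r
name_factor <- factor(c("Kate", "Mate", "Late", "Jate"))

# The levels are the sorted unique values.
print(levels(name_factor))   # "Jate" "Kate" "Late" "Mate"

# Add the new level first, then the assignment is valid.
levels(name_factor) <- c(levels(name_factor), "Fate")
name_factor[1] <- "Fate"

print(as.character(name_factor))   # "Fate" "Mate" "Late" "Jate"
```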
This behavior can be inconvenient and rarely helps, especially now that
memory is cheap. Since R 4.0.0, data.frame() no longer converts strings to
factors by default. In earlier versions, the simplest way to avoid the
conversion is to set stringsAsFactors = FALSE when we create a data frame
using data.frame():
The summary() function provides summary statistics for each column. For
a numeric column, it shows the important quantiles and the mean. For
other types of columns, it shows their length, class, and mode. In the
case of character columns, the output depends on the value of
stringsAsFactors.
> person <- data.frame(
+ Name=c("Kate","Mate","Late","Jate"),
+ Age=c(24, 25, 35, 26),
+ Gender=c("Female","Male","Female","Female"),
+ MaritalStatus =c("Single","Single","Married","Single"),
+ stringsAsFactors = FALSE)
> summary(person)
Name Age Gender MaritalStatus
Length:4 Min. :24.00 Length:4 Length:4
Class :character 1st Qu.:24.75 Class :character Class :character
Mode :character Median :25.50 Mode :character Mode :character
Mean :27.50
3rd Qu.:28.25
Max. :35.00
We can bind multiple data frames either by row or by column using rbind() and
cbind(). As their names suggest, they perform row binding and column
binding respectively.
For example, if we want to add a new record of a person, we can use rbind().
Similarly, if we want to add two new columns to indicate the nationality and
education level, we can use cbind().
> cbind(person, data.frame(Nationality =
c("USA","UK","France","Australia"), Education =
c("Graduate","High School","Post Graduate","Graduate") ))
Name Age Gender MaritalStatus Nationality Education
1 Kate 24 Female Single USA Graduate
2 Mate 25 Male Single UK High School
3 Late 35 Female Married France Post Graduate
4 Jate 26 Female Single Australia Graduate
Note that rbind() and cbind() do not modify the original data but create a
new data frame with given rows or columns appended.
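The rbind() example is missing from this section. The sketch below adds a hypothetical record and confirms that the original data frame is left unchanged.

```r
person <- data.frame(
  Name          = c("Kate", "Mate", "Late", "Jate"),
  Age           = c(24, 25, 35, 26),
  Gender        = c("Female", "Male", "Female", "Female"),
  MaritalStatus = c("Single", "Single", "Married", "Single"),
  stringsAsFactors = FALSE
)

# Hypothetical new record with the same columns.
new_record <- data.frame(
  Name = "Pate", Age = 30, Gender = "Male", MaritalStatus = "Married",
  stringsAsFactors = FALSE
)

extended <- rbind(person, new_record)

print(nrow(person))     # 4, the original is not modified
print(nrow(extended))   # 5
```

To keep the added rows, assign the result back, for example person <- rbind(person, new_record).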
The argument row.names = FALSE avoids storing the row names which are
not necessary, and the argument quote = FALSE avoids quoting text in the
output
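The call these two arguments belong to is not shown. They are accepted by write.csv(), so the sketch below assumes that function and writes to a temporary file.

```r
person <- data.frame(
  Name = c("Kate", "Mate"),
  Age  = c(24, 25),
  stringsAsFactors = FALSE
)

# Temporary file used only for this illustration.
out_file <- tempfile(fileext = ".csv")

write.csv(person, out_file, row.names = FALSE, quote = FALSE)

written <- readLines(out_file)
print(written)   # "Name,Age" "Kate,24" "Mate,25"
```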
Functions
Creating a function
You can easily create a function. For example, suppose you need a function
that simply adds two objects, x and y.
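The definition matches the function body printed a few lines further on; the call with 2 and 3 is the one discussed under Calling a function.

```r
# A function is created with function() and assigned like any other object.
add <- function(x, y) {
  x + y
}

result <- add(2, 3)
print(result)   # 5
```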
In R Programming language, the functions act like other objects. We can see
the function by typing add in the console.
> add
function(x, y){x + y}
Calling a function
When we call the function add, R looks for a function named add in the
environment. If it finds one, it creates a local environment in which x takes
the value 2 and y takes the value 3. With these argument values, the expression
within the function is evaluated and returns the value 5.
Dynamic Typing
Functions in R are not strongly typed. The types of the inputs are not fixed
before the function is called. The function can work with any type of vector
as long as the + operation can be performed on it. For example, we can
execute the following code without changing the function.
> add(c(1,2),1)
[1] 2 3
> add(as.Date("2020-12-01"),1)
[1] "2020-12-02"
The function passes the two arguments into the expression without any type
checking. In the example above, as.Date() creates a Date object, and the
function works well with it. If the + operation is not possible on the
two values passed as arguments, the function fails.
> add("a","b")
Error in x + y : non-numeric argument to binary operator
Generalizing a function
+ } else if (type == "multiply") {
+ x * y
+ } else if (type == "divide") {
+ x / y
+ } else {
+ stop("Unknown type of operation")
+ }
+ }
If we pass some other value for type, we get the predefined error message
> calc(as.Date("2020-12-01"),31,"Addition")
Error in calc(as.Date("2020-12-01"), 31, "Addition") :
Unknown type of operation
In this case, no conditions are satisfied, so the expression in the last else block
will be evaluated. The stop() call yields an error message and terminates the
whole evaluation immediately.
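The function seems to work well, but its behavior is unclear if we pass a
character vector with more than one element as the type argument. Older
versions of R use only the first element and issue a warning, as shown here;
since R 4.2.0 this is an error.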
> calc(1,2,c("add","minus"))
[1] 3
Warning message:
In if (type == "add") { :
the condition has length > 1 and only the first element will
be used
We can further refine the function to avoid such ambiguity. We can add a
condition to check whether the vector has length 1.
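The refined function is not shown. The sketch below is one way to add the length check; the first branches are reconstructed from the surviving fragment and from the type values used in the examples, and the wording of the new error message is an assumption.

```r
calc <- function(x, y, type) {
  # Reject anything other than a single value for type.
  if (length(type) != 1L) {
    stop("type must be a single string")
  }
  if (type == "add") {
    x + y
  } else if (type == "minus") {
    x - y
  } else if (type == "multiply") {
    x * y
  } else if (type == "divide") {
    x / y
  } else {
    stop("Unknown type of operation")
  }
}

ok <- calc(1, 2, "add")
print(ok)   # 3

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  calc(1, 2, c("add", "minus")),
  error = function(e) conditionMessage(e)
)
print(msg)  # "type must be a single string"
```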
Some functions can take a wide range of inputs to meet a variety of
demands. But it would be cumbersome to specify that many arguments
whenever we call the function. By setting default values for the arguments,
we can simplify the calls.
We can use arg = value to set the default value of an argument and make the
argument optional. If a value for arg is provided, it overrides
the default value.
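A minimal sketch of a default argument. The function name increase and its default are hypothetical.

```r
# y is optional because it has a default value of 1.
increase <- function(x, y = 1) {
  x + y
}

with_default  <- increase(1)          # uses y = 1
with_override <- increase(1, y = 10)  # the supplied value replaces the default

print(with_default)    # 2
print(with_override)   # 11
```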
4. Basic Expressions
• Assignment expressions
• Conditional expressions
• Loop expressions
Assignment Expressions
As we have seen earlier, R uses the left assignment (<-), right assignment
(->), and equal (=) operators for assignment. In this section, we study
assignment expressions in more detail.
We can have a chain of assignments so that all the symbols take the same
value
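A short sketch of a chained assignment, with made-up symbol names:

```r
# The rightmost assignment is evaluated first, then its value is passed along.
x <- y <- z <- 5

print(c(x, y, z))   # 5 5 5
```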
For assignments, both = and <- are allowed and they have
the same effect, but by convention <- is preferred over = in R.
In the first two lines we used <- as the assignment operator, whereas in the
third line we used = to match function arguments by name for the function abc().
> x <- 1
> y <- 0.5
Now we change all the operators to = and get the same result from the
function abc(). In this case, we use = both for assignment and for naming
the arguments.
> x = 1
> y = 0.5
> xvalue
Error: object 'xvalue' not found
Now we change all the operators to <-. We still get the same result from
the function abc(), but new variables xvalue and yvalue have been
created in the environment. The variable xvalue gains the value 1 and the
variable yvalue gains the value 0.5.
> x <- 1
> y <- 0.5
> xvalue
[1] 1
> yvalue
[1] 0.5
When we call the function in this manner, the variables xvalue and yvalue are
created in the environment, and each <- expression evaluates to the value it
assigns, that is, the values of x and y. Hence the arguments are matched not
by name but by position.
If we use = and exchange the positions of the two named arguments, the result
is still the same.
If we use <- without exchanging the positions, the result is also the same.
But if we use <- and exchange the positions of the two arguments, the result
is different, because in this case the function takes the arguments by
position, not by name.
Hence using <- in place of = for a named argument not only creates new
variables in the environment but also turns the call into the positional call
abc(yvalue, xvalue).
Therefore we can use either <- or = for assignment, but to name arguments in a
function call we should use only the = operator.
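The difference can be reproduced with a small sketch. The definition of abc() is not shown in this section, so the one below is a hypothetical stand-in whose result depends on the argument order.

```r
# Hypothetical definition: the result depends on the order of the arguments
abc <- function(xvalue, yvalue) xvalue - yvalue

x <- 1
y <- 0.5

abc(xvalue = x, yvalue = y)     # matched by name:      1 - 0.5
abc(yvalue = y, xvalue = x)     # still matched by name: 1 - 0.5
abc(yvalue <- y, xvalue <- x)   # matched by position:  0.5 - 1
```

After the last call, xvalue and yvalue also exist as variables in the calling environment.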
Using backticks
The name new data contains a space, whereas _data starts with _. The expression
population(data) is not a symbol but a function call. However, while working on
data science problems, we may find such invalid names used as column names in
a data table. We can still refer to them by enclosing the name in back ticks.
We can use back ticks when creating a function with an unusual name. The back
ticks must also be used when calling the function
We can use back ticks when creating a list. The back ticks must also be used
to refer to the element by name
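The three uses of back ticks described above can be sketched as follows. The variable and function names are hypothetical; the list mirrors the element names and values that appear in the output of this section.

```r
# A variable whose name contains a space
`new data` <- c(1, 2, 3)
sum(`new data`)

# A function whose name is not a valid symbol
`add two` <- function(x) x + 2
`add two`(3)

# A list whose element names contain parentheses
li <- list(`Sec(Name)` = c("Pat", "Kat", "Mat"),
           `Sec(Marks)` = c(60, 70, 80))
li$`Sec(Marks)`
```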
> li$`Sec(Name)`
[1] "Pat" "Kat" "Mat"
> result
Sec.Name. Sec.Marks.
1 Pat 60
2 Kat 70
3 Mat 80
> result$`Sec(Name)`
NULL
In this case, even though we used back ticks around the unusual column names,
data.frame() replaces the invalid characters with dots, giving names such as
Sec.Name. (with a trailing dot). We can access such a column using the name
with dots, not the back-ticked original name
> colnames(result)
[1] "Sec.Name." "Sec.Marks."
> result
Sec(Name) Sec(Marks)
1 Pat 60
2 Kat 70
3 Mat 80
> result$`Sec(Name)`
[1] "Pat" "Kat" "Mat"
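The second result above, where the original names survive, is what data.frame() produces when name checking is switched off with check.names = FALSE. A minimal sketch, assuming the columns hold the values shown in the output (the variable names result_default and result_raw are hypothetical):

```r
# Default: invalid names are converted to syntactic names
result_default <- data.frame(`Sec(Name)` = c("Pat", "Kat", "Mat"),
                             `Sec(Marks)` = c(60, 70, 80))
colnames(result_default)

# check.names = FALSE keeps the original names
result_raw <- data.frame(`Sec(Name)` = c("Pat", "Kat", "Mat"),
                         `Sec(Marks)` = c(60, 70, 80),
                         check.names = FALSE)
colnames(result_raw)
result_raw$`Sec(Marks)`
```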
Conditional expressions
Many programs are not purely sequential; they contain several branches that
depend on certain conditions. In all programming languages, we use
conditional expressions to code such branches. In R, we use if to branch the
logic flow according to logical conditions
Using if as a statement
In this function, we check the condition x > 0. If the condition is satisfied,
the function returns 1. We have tested the function by passing values.
We can generalize the function by adding else if and else branches and
check branch conditions. Now the function returns 1 for positive input, -1 for
negative input and 0 for 0.
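A function with the three branches described above might look like the sketch below. The name check_sign is hypothetical.

```r
check_sign <- function(x) {
  if (x > 0) {
    return(1)
  } else if (x < 0) {
    return(-1)
  } else {
    return(0)
  }
}

c(check_sign(4), check_sign(-2), check_sign(0))
```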
> print_sign(4)
[1] "Number is greater than 0"
> print_sign(-2)
[1] "Number is less than 0"
The branch conditions may or may not be related. For example, in the
following grading policy, the branch conditions slice the score range
In this case, the branch condition in else if assumes that the previous
condition does not hold. When we specify marks >= 80, we mean that marks
< 90 and marks >= 80 which depends on the previous conditions. Hence we
can neither change the order of the branches nor make the branches
independent.
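A correctly ordered version of such a grading function can be sketched as below. The thresholds 90, 80, 70 and 60 are assumptions; only the conditions marks >= 80 and marks < 90 are named in the text.

```r
assign_grade <- function(marks) {
  if (marks >= 90) {
    "A"
  } else if (marks >= 80) {   # reached only when marks < 90
    "B"
  } else if (marks >= 70) {   # reached only when marks < 80
    "C"
  } else if (marks >= 60) {   # reached only when marks < 70
    "D"
  } else {
    "F"
  }
}

c(assign_grade(95), assign_grade(83), assign_grade(78),
  assign_grade(61), assign_grade(54))
```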
+ }
+ }
> c(assign_grade (95), assign_grade(83), assign_grade(78),
assign_grade(61), assign_grade(54))
[1] "D" "D" "D" "D" "F"
In this case only assign_grade(54) got the correct grade, but the rest of them
were wrong. We can rewrite the conditions so that they do not depend on the
order of the branches.
However in this case, the function is more verbose than the first correct
version. Therefore, we should figure out the correct order of the branch
conditions and be careful about the dependency of each branch.
Using if as an expression
We can rewrite this expression syntax in one line by removing the curly
brackets
Since the return value of a function is the value of its last expression in the
function body, return() can be removed in this case:
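The steps above can be sketched with a hypothetical function check_positive(). Each definition replaces the previous one and behaves the same.

```r
# if is an expression, so its value can be passed to return()
check_positive <- function(x) {
  return(if (x > 0) {
    1
  })
}

# One-line form without the curly brackets
check_positive <- function(x) {
  return(if (x > 0) 1)
}

# return() removed: the value of the last expression is returned
check_positive <- function(x) {
  if (x > 0) 1
}

check_positive(3)
```

When the condition is FALSE and there is no else branch, the if expression evaluates to NULL.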
In the previous chapter we saw that the functions created earlier work only
with single-value input. If we provide a multi-element vector, the functions
produce a warning (an error in R 4.2.0 and later), because if does not work
with multi-element conditions:
In this example we can see that the if statement ignores all but the first
element if a multi-element logical vector is supplied.
We get this message because the logic is not clear: when the condition is a
multi-element logical vector, its values can be a mix of TRUE and FALSE.
We should avoid this ambiguity. One way to do so is the any() function, which
returns TRUE if at least one element in the given vector is TRUE:
Now we can rework the previous example to print a message if any single value
is greater than 2
Similarly, if we want to print a message only if all the values are greater
than 2, we should use all():
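Both checks can be sketched as follows. The vector num and the messages are hypothetical.

```r
num <- c(1, 2, 3)   # hypothetical input vector

# any() collapses the logical vector to a single TRUE or FALSE
if (any(num > 2)) {
  cat("At least one value is greater than 2\n")
}

# all() is TRUE only when every element satisfies the condition
if (all(num > 2)) {
  cat("All values are greater than 2\n")
} else {
  cat("Not all values are greater than 2\n")
}
```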
Vectorized if: ifelse
We have studied that vectors are the basic building blocks of R programming.
Many functions in R take vectors as input and they output a resultant vector.
The mathematical operations on vectors are more efficient than those on each
element of the vector.
ifelse(test_expression, x, y)
In this case test_expression must be a logical vector. The return value is a
vector of the same length as test_expression, with elements taken from x where
the test is TRUE and from y where it is FALSE.
> a = c(4, 5, 6, 7)
> ifelse(a %% 2 == 0, "Even", "Odd")
[1] "Even" "Odd" "Even" "Odd"
Similarly, the other two vectors in the function argument are recycled to
("Even", "Even", "Even", "Even") and ("Odd", "Odd", "Odd", "Odd")
respectively. Hence the results are evaluated accordingly.
In R, we can test an expression against a list of alternatives using the
switch() function. If the value of the expression matches an item in the list,
the corresponding value is returned.
In this case, the expression is evaluated first, and the value it yields
selects the corresponding item in the list.
If more than one item in the list matches the expression, switch() returns
the first matching item.
If the evaluated value is a string, switch() returns the value of the first
argument whose name matches the evaluated value.
In this case also, if the evaluated value matches none of the items, an
invisible NULL is returned
> switch("frame", color="red", shape="square", fill="not filled")
To cover all the possibilities, we can add the last argument without an
argument name that captures all other inputs:
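The behaviours of switch() described in this section can be sketched together. The alternatives reuse the color, shape and fill items from the call above; the default text "no match" is hypothetical.

```r
# Numeric value: selects the alternative at that position
switch(2, "red", "green", "blue")

# Character value: selects the alternative with the matching name
switch("shape", color = "red", shape = "square", fill = "not filled")

# A final unnamed argument acts as the default for all other inputs
switch("frame", color = "red", shape = "square", fill = "not filled",
       "no match")
```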
Compared to ifelse(), switch() behaves more like if: it accepts only a single
value, but it can return any value
Loop expressions
For loop
If the vector contains n elements, the loop will be equivalent to the following
statement block
For example, we can evaluate an expression 3 times by iterating over 1:3 with
the variable i. In each iteration, we can display a text that includes the
value of i.
We can use the iterator not only with numeric vectors but with any vector. In
the example below, we have replaced the numeric vector 1:3 with a character
vector
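The two loops described above can be sketched as follows. The message texts and the character vector are hypothetical.

```r
# Iterate over a numeric vector
for (i in 1:3) {
  cat("The value of i is", i, "\n")
}

# Iterate over a character vector
for (word in c("hello", "new", "world")) {
  cat("The current word is", word, "\n")
}
```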
+ y = c("A", "B", "C"),
+ stringsAsFactors = FALSE)
> for (col in df) {
+ str(col)
+ }
num [1:3] 1 2 3
chr [1:3] "A" "B" "C"
Since a data frame is a list in which all the elements have the same length,
the behaviour of the for loop is the same for a list and a data frame, as we
have seen in the previous two examples.
However, we can also iterate over a data frame row by row. For that, we need to
iterate over the integer sequence from 1 to the number of rows of the data
frame
Iterating over a data frame row by row is not a good idea. It is slow and
verbose. We will discuss the better option in the next chapter.
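Row-by-row iteration can be sketched as below. The first column name x is an assumption, because the line that starts the definition of df is not shown above.

```r
df <- data.frame(x = c(1, 2, 3),
                 y = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

# seq_len(nrow(df)) gives 1, 2, ..., number of rows
for (i in seq_len(nrow(df))) {
  row <- df[i, ]   # a one-row data frame
  cat("row", i, ": x =", row$x, ", y =", row$y, "\n")
}
```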
> for (i in 1:5) {
+ if (i == 3) break
+ cat("message ", i, "\n")
+ }
message 1
message 2
You can use a break expression in place of record-tracking code if you only
need the first number that satisfies the condition
Once the program finds the solution, the for loop breaks and the last value of
the iterator i is preserved.
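As a sketch, the loop below searches for the first number with a given property. The condition used here is hypothetical and serves only as an illustration.

```r
# Find the first i in 1:100 whose square leaves remainder 5 when divided by 11
for (i in 1:100) {
  if (i^2 %% 11 == 5) break
}
i   # the iterator keeps the value at which the loop stopped
```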
You can also use the next keyword to skip the rest of the expressions in the
current iteration and jump directly to the next iteration of the loop.
message 5
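A loop that uses next can be sketched as follows. It mirrors the break example earlier in this section and is an assumption about the form of the original example.

```r
for (i in 1:5) {
  if (i == 3) next   # skip the rest of this iteration
  cat("message ", i, "\n")
}
```

The loop prints messages 1, 2, 4 and 5 and skips only the third iteration.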
We can include a for loop inside another for loop. If we want to print all the
ordered pairs of the elements in a vector, we can use a two-level nested for
loop.
If we want only the pairs of distinct items, we can use a test condition and
the next expression inside the inner for loop.
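A nested loop with such a test can be sketched as follows; the vector x is hypothetical.

```r
x <- c("a", "b", "c")

for (i in x) {
  for (j in x) {
    if (i == j) next   # skip pairs made of the same item
    cat(i, j, "\n")
  }
}
```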
We have shown how for loops and nested for loops work, but they may not be
the optimal solution. R offers several built-in functions for such tasks. For
example, we can use the combn() function to produce a matrix of combinations
of vector elements
> combn(c("a","b","c"),2)
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "b" "c" "c"
Similarly, we can use expand.grid() to produce a data frame that contains all
the combinations of the elements of multiple vectors
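A short sketch of expand.grid() with two hypothetical vectors:

```r
# One row for every combination of an element of n with an element of x
grid <- expand.grid(n = c(1, 2, 3), x = c("a", "b"))
grid
nrow(grid)   # 3 * 2 combinations
```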
While Loop
The while loop keeps running as long as its test expression evaluates to TRUE.
while (test_expression)
{
expr
}
In the following example, the while loop starts with x = 0. Before each
iteration, the test_expression, which in this case is x <= 5, is evaluated. If
it evaluates to TRUE, the body of the loop is executed; otherwise the while
loop terminates.
> x <- 0
> while (x <= 5) {
+ cat(x, " ", sep = "")
+ x <- x + 1
+ }
0 1 2 3 4 5
> x <- 0
> while (TRUE) {
+ x <- x + 1
+ if (x == 4) break
+ else if (x == 2) next
+ else cat(x, '\n')
+ }
1
3
R provides an enormous number of functions. These built-in functions not only
save your time but also boost productivity.
• Object functions
• Logical functions
• Math functions
• Numeric methods
• Statistical functions
• Apply-family functions
Object functions
In this section, we will study some basic functions that can be used for
objects. We have already studied some of these functions in previous chapters.
In this chapter, we will learn more functions to access the type and dimensions
of a data object.
The function returns different values based on the input object s. If s takes an
atomic vector such as a numeric vector, we get the first element of the vector.
If s takes a list of vec and index, we get the element with the index of index
from s$vec.
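The definition of obj_type() is not shown in this section. A sketch that is consistent with the description above and with the error message shown later might look like this:

```r
obj_type <- function(s) {
  if (is.atomic(s)) {
    s[[1]]              # first element of an atomic vector
  } else if (is.list(s)) {
    s$vec[[s$index]]    # element of s$vec at position s$index
  } else {
    stop("The input type is not supported")
  }
}
```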
> obj_type(c(10,11,12))
[1] 10
> obj_type(list(vec=c("Cat","Bat","Rat"), index = 2))
[1] "Bat"
If we pass some other input type, such as a function, obj_type should not
return any value; instead we should get an error message. If we pass mean,
which is a function, obj_type should reach the else branch and stop.
> obj_type(mean)
Error in obj_type(mean) : The input type is not supported
Now we need to test our function for other possibilities. What if the input is
a list but its elements are not vec and index? We can test it by passing a list
that contains lst without any index element.
> obj_type(list(lst=c("Cat","Bat","Rat")))
NULL
> NULL[[NULL]]
NULL
> NULL[[1]]
NULL
Another possibility is passing the vec element correctly but missing the index.
> obj_type(list(vec=c("Cat","Bat","Rat")))
Error in s$vec[[s$index]] :
attempt to select less than one element in get1index
This time, we got an error because s$index is NULL. If we extract a value from
a vector with a NULL index, we get an error.
> c("Cat","Bat","Rat")[[NULL]]
Error in c("Cat", "Bat", "Rat")[[NULL]] :
attempt to select less than one element in get1index
Another possibility is that the list only contains one element index = 2. In this
case, we only get NULL.
From the experiments above, we observe that the error messages are not very
informative. Hence, we should check the input ourselves in the implementation
of the function.
In case s is a list, we check if s$vec is not null and is an atomic vector. If this
condition is TRUE, we check whether s$index is properly defined as a single-
element numeric vector. If any of the conditions are violated, the program
stops and displays an informative error message.
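A sketch of such a checked version follows. The message "Data is Invalid" is taken from the output below; the message for an invalid index is hypothetical, since it is not shown in this section.

```r
obj_type1 <- function(s) {
  if (is.atomic(s)) {
    s[[1]]
  } else if (is.list(s)) {
    # s$vec must exist and be an atomic vector
    if (!is.null(s$vec) && is.atomic(s$vec)) {
      # s$index must be a single number
      if (is.numeric(s$index) && length(s$index) == 1) {
        s$vec[[s$index]]
      } else {
        stop("Index is Invalid")   # hypothetical message
      }
    } else {
      stop("Data is Invalid")
    }
  } else {
    stop("The input type is not supported")
  }
}
```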
> obj_type1(list(lst=c("Cat","Bat","Rat")))
Error in obj_type1(list(lst = c("Cat", "Bat", "Rat"))) : Data
is Invalid
> obj_type1(list(index = 2))
Error in obj_type1(list(index = 2)) : Data is Invalid
In previous sections, we have shown that we can use the is.* functions to check
the class and type of an object. In addition, we can also use class() and
typeof(). The function typeof() returns the low-level internal type of an
object, while class() returns the high-level class of an object.
For a list
We can notice in the last statement that a data.frame is essentially a list
whose columns all have equal length. Even though class() returns data.frame,
typeof() returns list, the internal type.
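The difference between class() and typeof() can be sketched with three hypothetical objects:

```r
x <- c(1, 2, 3)
l <- list(a = 1, b = "x")
df <- data.frame(a = 1:3, b = c("Cat", "Mat", "Rat"))

class(x);  typeof(x)    # "numeric"    and "double"
class(l);  typeof(l)    # "list"       and "list"
class(df); typeof(df)   # "data.frame" and "list"
```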
Getting data dimensions
For the above underlying data, we can also inspect the dimensions with dim(),
nrow(), and ncol()
The first expression creates a four-column matrix from the numeric vector s.
The underlying typeof() of s has been preserved, but class() has changed to
"matrix" "array". The dim() function shows the dimensional structure in vector
form. We have also used two shortcuts, nrow() and ncol(), to access the number
of rows and columns. The values of nrow() and ncol() are the first and second
elements of the dim() vector.
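These calls can be sketched as follows. The contents of s are hypothetical, since the original vector is not shown here.

```r
s <- c(1, 2, 3, 4, 5, 6, 7, 8)   # hypothetical numeric vector
m <- matrix(s, ncol = 4)         # four-column matrix

typeof(m)   # "double", the same as typeof(s)
class(m)    # "matrix" "array" in R 4.0.0 and later
dim(m)      # 2 4
nrow(m)     # 2, the first element of dim(m)
ncol(m)     # 4, the second element of dim(m)
```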
Another data structure that uses the notion of dimension is the data frame.
The data frame is fundamentally different from the matrix. We derive a matrix
from a vector by adding a dimension attribute. Similarly, we derive a data
frame from a list; we just add the constraint that each list element should be
of the same length.
The class of object changes from numeric to matrix whereas the type of the
object does not change.
We can reshape the matrix into an array. This is possible because the
dim() function only changes the representation. But the underlying data store
does not change.
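A small sketch of such a reshape, using a hypothetical matrix of 12 integers:

```r
m <- matrix(1:12, ncol = 4)   # 3 x 4 matrix
dim(m) <- c(2, 2, 3)          # the same 12 values viewed as a 2 x 2 x 3 array

class(m)    # "array"
typeof(m)   # "integer": the stored data is unchanged
```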
We have already created a data frame where each row represents a record.
We can iterate over all records that have been stored in the data frame. Let’s
consider the data frame df
> df
a b
1 1 Cat
2 2 Mat
3 3 Rat
Logical vectors are used to filter data. Their elements can be TRUE or FALSE.
To solve problems, we often need to create joint conditions that involve
multiple logical vectors.
Logical operators
The following R program checks whether the values x, y, and z are monotonically
increasing. If they are increasing, the function should return 1; if they are
decreasing, the function should return -1; else it should return 0.
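A sketch of such a function, written with the & operator and intended for single-value inputs, might be:

```r
direction <- function(x, y, z) {
  if (x < y & y < z) 1
  else if (x > y & y > z) -1
  else 0
}

c(direction(1, 2, 3), direction(3, 2, 1), direction(1, 3, 2))
```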
We have seen that & performs the vectorized calculation and returns a multi-
element vector if one of the arguments has more than one element. But in the
case of if statements, it only works with a single-value logical vector.
> direction(1, 2, 3)
[1] 1
For the scalar input, the behaviour of the two versions is the same.
> direction2(1, 2, 3)
[1] 1
But for multi-element input, direction2 ignores all but the first element of
each input vector without producing any warning (this applies to older versions
of R; from R 4.3.0 onwards, && signals an error when an operand has more than
one element).
Now the question arises: which of the two, & or &&, is the better option? It
depends on the requirement. But if the requirement is to compare all the
elements in the same position of each input vector, then neither option is
correct.
Logical functions
In previous chapters, we saw that a few logical aggregation functions are very
useful. The two most commonly used logical aggregation functions are any() and
all(). The any() function returns TRUE if at least one of the elements of the
given logical vector is TRUE; otherwise, it returns FALSE. The all() function
returns TRUE if all the elements of the given logical vector are TRUE;
otherwise, it returns FALSE.
While dealing with the any() and all() functions, we should remember that they
only return a single TRUE or FALSE value. They never return a multi-element
logical vector. We can modify the function direction to use both all()
and & together in the if condition.
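A sketch of the modified function follows. It uses the same pattern as the variants listed later in this section; the test vectors are hypothetical.

```r
updated_direction_all <- function(i, j, k) {
  if (all(i < j & j < k)) 1
  else if (all(i > j & j > k)) -1
  else 0
}

updated_direction_all(c(1, 2), c(2, 3), c(3, 4))   # every position is increasing
updated_direction_all(c(1, 5), c(2, 3), c(3, 4))   # the second position is not
```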
> updated_direction_all(1, 2, 3)
[1] 1
But for the multi-element vector input, we have to test whether the function
gives us the same monotonicity
We can use several other variations. You can try these functions to test the
functionality they provide
updated_direction_any <- function(i, j, k) {
if (any(i < j & j < k)) 1
else if (any(i > j & j > k)) -1
else 0
}
updated_direction_all2 <- function(i, j, k) {
if (all(i < j) && all(j < k)) 1
else if (all(i > j) && all(j > k)) -1
else 0
}
updated_direction_any2 <- function(i, j, k) {
if (any(i < j) && any(j < k)) 1
else if (any(i > j) && any(j > k)) -1
else 0
}
The logical operations that we have introduced so far only tell us whether a
certain condition is TRUE or FALSE. They do not tell us which elements are
TRUE. The which() function can be used to get the positions of the TRUE
elements in a logical vector.
> x
[1] -2 -1 0 1 2 3
> abs(x) >= 1.5
[1] TRUE FALSE FALSE FALSE TRUE TRUE
> which(abs(x) >= 1.5)
[1] 1 5 6
We can also use logical conditions to filter elements from a vector or a list
If the logical vector contains only FALSE values, a zero-length numeric vector
is returned.
> x[x>=10]
numeric(0)
Real-world data may contain several data issues, such as missing values
represented by NA. For example
Any arithmetic calculations on the missing values will also return missing
values:
> x + 2
[1] -1 NA 0 1 2 NA 3 NA 5
> x > 2
[1] FALSE NA FALSE FALSE FALSE NA FALSE NA TRUE
Hence any() and all() have to deal with missing values too.
> x
[1] -3 NA -2 -1 0 NA 1 NA 3
> any(x > 1)
[1] TRUE
> any(x < -2)
[1] TRUE
> any(x < -3)
[1] NA
If any element of the result is TRUE, the function returns TRUE. If no element
is TRUE but some elements are missing, the function returns NA. Only if all the
elements are FALSE does it return FALSE; we can get this behaviour by ignoring
the missing values with na.rm = TRUE, as demonstrated below.
A similar but opposite logic applies to all(). If any element of the input
vector is FALSE, the function returns FALSE even when there are missing values.
But if all the non-missing elements are TRUE and some elements are missing, the
function returns NA.
In this case also, we can use na.rm=TRUE, to ignore the missing values.
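The rules for any() and all() with missing values can be sketched with the vector x shown above:

```r
x <- c(-3, NA, -2, -1, 0, NA, 1, NA, 3)

any(x < -3)                 # NA: no TRUE, but some values are unknown
any(x < -3, na.rm = TRUE)   # FALSE once the missing values are ignored

all(x > -3)                 # FALSE: -3 > -3 is FALSE
all(x > -4)                 # NA: no FALSE, but some values are unknown
all(x > -4, na.rm = TRUE)   # TRUE once the missing values are ignored
```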
Data filtering also behaves differently when missing values are involved.
The following code preserves the missing values at the corresponding
positions of the logical vector that is produced by x >= 0.
We can use which(), which does not preserve the missing values.
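Both ways of filtering can be sketched with the vector x used earlier in this section:

```r
x <- c(-3, NA, -2, -1, 0, NA, 1, NA, 3)

x[x >= 0]          # NA  0 NA  1 NA  3: the NA positions are kept
x[which(x >= 0)]   # 0 1 3: which() returns only the TRUE positions
```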
Logical Coercion
We can also use numeric vectors in place of logical vectors as input for some
functions. The non-logical vectors are coerced to logical values.
For example, we can put a numeric vector in the if condition. The numeric
vector will be coerced in such cases.
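A short sketch of this coercion; the messages are hypothetical.

```r
# Non-zero numbers are coerced to TRUE, zero is coerced to FALSE
if (2) cat("2 is treated as TRUE\n")
if (0) cat("never printed\n") else cat("0 is treated as FALSE\n")

as.logical(c(-1, 0, 1, 2))   # TRUE FALSE TRUE TRUE
```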
Math Functions
R provides several groups of basic math functions. The basic functions include
square root, exponential, and logarithm functions.
You can use sqrt() with real numbers. For a negative number, NaN will be
returned followed by a warning message.
> sqrt(4)
[1] 2
> sqrt(-1)
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
In R, numeric values can be finite, infinite (Inf and -Inf), or NaN. The
following code will produce infinite values (Inf and -Inf):
> 1/0
[1] Inf
> log(0)
[1] -Inf
> is.finite(1/0)
[1] FALSE
> is.infinite(log(0))
[1] TRUE
We can use an inequality to check the sign of an infinite value.
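The functions is.pos.infinite() and is.neg.infinite() used in the calls that follow are not part of base R, so they must be defined before use. A minimal sketch, assuming they combine is.infinite() with an inequality:

```r
1 / 0 > 0     # TRUE: Inf is positive
log(0) < 0    # TRUE: -Inf is negative

# User-defined helpers (assumed definitions, not base R functions)
is.pos.infinite <- function(x) is.infinite(x) & x > 0
is.neg.infinite <- function(x) is.infinite(x) & x < 0
```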
> is.pos.infinite(1/0)
[1] TRUE
> is.neg.infinite(log(0))
[1] TRUE
The round() function rounds the values in its first argument to the specified
number of decimal places (default 0)
> round(c(-1.3,-1.7, 1.3, 1.7))
[1] -1 -2 1 2
> round(pi, 3)
[1] 3.142
The signif() function rounds the values in its first argument to the specified
number of significant digits
> signif(pi, 4)
[1] 3.142
Trigonometric functions
> sin(0)
[1] 0
> cos(0)
[1] 1
> tan(0)
[1] 0
> asin(1)
[1] 1.570796
> acos(1)
[1] 0
> atan(1)
[1] 0.7853982
In mathematics, sin(π) = 0 holds exactly. But in R the expression does not
evaluate to exactly 0 because of the limited precision of floating-point
numbers.
> sin(pi)
[1] 1.224647e-16
[1] FALSE
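The FALSE above is presumably the result of an exact comparison such as sin(pi) == 0; the command itself is missing here, so that is an assumption. A tolerance-based comparison with all.equal() can be sketched as:

```r
sin(pi) == 0                    # FALSE: the exact comparison fails
isTRUE(all.equal(sin(pi), 0))   # TRUE: equal within numerical tolerance
```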
Hyperbolic Functions
> x <- 1
> sinh(x)
[1] 1.175201
> cosh(x)
[1] 1.543081
> tanh(x)
[1] 0.7615942
> asinh(x)
[1] 0.8813736
> acosh(x)
[1] 0
> atanh(0)
[1] 0
Extreme Functions
We can see that min() returns the minimal value among all the input vectors.
On the other hand, max() returns the maximal value.
If we want the maximal or minimal value at each position across the vectors,
we should use pmax() or pmin().
In the first example above, the pmin() function gives the minimal value among
all the elements at the 1st position, then the 2nd position, and finally the
3rd position within the vectors. This is called the parallel minima.
The twin function pmax() is used to find parallel maxima. If any of the vectors
contains fewer elements than the others, the elements of the shorter vector are
recycled.
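The four functions can be compared in a short sketch. The vectors a, b and d are hypothetical.

```r
a <- c(1, 5, 9)
b <- c(4, 2, 8)
d <- c(7, 6, 3)

min(a, b, d)    # 1: the smallest value over all the vectors
max(a, b, d)    # 9: the largest value over all the vectors
pmin(a, b, d)   # 1 2 3: the smallest value at each position
pmax(a, b, d)   # 7 6 9: the largest value at each position
pmin(a, 4)      # 1 4 4: the shorter vector (4) is recycled
```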
Suppose you need to write a function that returns -5 if the input is less than
-5. If the input is between -5 and 5, it should return the value of the input.
If the input is greater than 5, it should return 5.
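One way to sketch such a function is to combine pmax() and pmin(). This is an assumed implementation that reproduces the output shown for new_func().

```r
# pmax() raises values below -5 to -5; pmin() lowers values above 5 to 5
new_func <- function(x) pmin(pmax(x, -5), 5)
```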
> new_func(seq(-8,8))
[1] -5 -5 -5 -5 -4 -3 -2 -1 0 1 2 3 4 5 5 5 5
Finding Roots
One of the most commonly encountered tasks is to find the roots. Suppose, we
want to find the roots of the following equation
$x^2 + x - 2 = 0$
We can find the roots as $x = -2$ and $x = 1$
The function always returns a complex vector in which each element has the
form a + bi. If we want to get the real roots only, we can use Re() to
extract the real parts of the complex roots:
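The call itself is not shown above. Assuming the function meant is base R's polyroot(), which takes the polynomial coefficients in increasing order of power, the roots of the quadratic can be sketched as:

```r
# Coefficients of -2 + 1*x + 1*x^2, lowest power first
roots <- polyroot(c(-2, 1, 1))
roots       # complex vector
Re(roots)   # real parts only
```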
If we replace 𝑥 with r
> r ^ 3 - r ^ 2 - 2 * r - 1
[1] 8.881784e-16+1.110223e-16i 8.881784e-16+2.220446e-16i
8.881784e-16-4.188101e-16i
You may notice that the result does not go to zero, but it is very close to zero.
If we are only interested in 8 digits of precision, we can use round() to check
whether roots are valid
> round(r ^ 3 - r ^ 2 - 2 * r - 1, 8)
[1] 0+0i 0+0i 0+0i
Derivatives
For example, if we want to find $\frac{d}{dx} x^2$, we can use the following
Similarly, we can find $\frac{d}{dx} \sin(x)\cos(xy)$ as follows
We have used the quote() function. This function keeps the expression
unevaluated, which lets us access the symbols as they are written.
In the example above, we have used both quote() and eval(). The quote()
function creates an expression object, whereas eval() evaluates a given
expression with specified symbols.
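The calls are not shown above. A sketch using base R's D() for symbolic differentiation together with quote() and eval() might look like this; the evaluation points are hypothetical.

```r
# Symbolic derivative of x^2 with respect to x
d1 <- D(quote(x ^ 2), "x")
d1                      # 2 * x
eval(d1, list(x = 3))   # 6

# Symbolic derivative of sin(x) * cos(x * y) with respect to x
d2 <- D(quote(sin(x) * cos(x * y)), "x")
eval(d2, list(x = 1, y = 2))
```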
Integration
$\int_0^{\pi/2} \sin(x)\,dx$
The result looks like a numeric value. But it contains some other information
too because it is a list.
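The call recorded in the structure below can be sketched as:

```r
integ <- integrate(function(x) sin(x), lower = 0, upper = pi / 2)
integ          # prints the value together with its absolute error
integ$value    # the numeric result alone
```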
> str(integ)
List of 5
$ value : num 1
$ abs.error : num 1.11e-14
$ subdivisions: int 1
$ message : chr "OK"
$ call : language integrate(f = function(x) sin(x),
lower = 0, upper = pi/2)
- attr(*, "class")= chr "integrate"
$b
[1] "x" "y" "z"
We can also draw a sample from any object using sample(), provided that it
supports subsetting with [ ].
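A sketch of both uses follows. The seed, the list and the sampling weights are hypothetical, so the counts differ from the table below.

```r
set.seed(1)   # hypothetical seed, only to make the draw repeatable

# Sampling from a list works because lists support [ ]
sample(list(a = 1, b = c("x", "y", "z"), c = TRUE), size = 2)

# Sampling with replacement and unequal weights
grade <- sample(c("A", "B", "C"), size = 24, replace = TRUE,
                prob = c(0.2, 0.5, 0.3))
table(grade)
```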
> table(grade)
grade
A B C
4 14 6
Probability Distributions
On various occasions, we need to draw a sample from a probability distribution
instead of a vector. R provides a variety of built-in functions for probability
distributions. In this topic, we will learn basic statistical tools for
sampling from probability distributions. These tools work mainly with numeric
vectors.
For the uniform distribution over [0, 1], we can use runif(n) to generate n
random numbers.
> runif(5)
[1] 0.6488628 0.7354542 0.6606280 0.8576631 0.9273368
> rnorm(5)
[1] -1.53022446 1.77021591 -1.53184603 -0.73656058 -0.07438508
The interface of both random generator functions is the same. The first
argument of both runif() and rnorm() is n, the number of values to
generate. The rest of the arguments are the parameters of the distribution.
The parameters of a normal distribution are the mean and the standard
deviation (sd).
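Passing distribution parameters can be sketched as follows. The seed and the parameter values are hypothetical.

```r
set.seed(42)                   # hypothetical seed for repeatability

runif(5, min = 10, max = 20)   # uniform on [10, 20]
rnorm(5, mean = 50, sd = 10)   # normal with mean 50 and sd 10
```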
For a continuous distribution, the most useful functions are the p (cumulative
probability) and q (quantile) functions. For a discrete distribution, we use
the d function to calculate the density, which in this case is a probability.
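A few calls illustrate the three kinds of functions; the distributions chosen are only examples.

```r
pnorm(0)                           # 0.5: P(X <= 0) for a standard normal
qnorm(0.5)                         # 0: the median of a standard normal
dbinom(3, size = 10, prob = 0.5)   # 0.1171875: P(X = 3) for Binomial(10, 0.5)
```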
Distribution   p (cumulative)   q (quantile)   d (density)   r (random)
Beta           pbeta            qbeta          dbeta         rbeta
Binomial       pbinom           qbinom         dbinom        rbinom
Chi-Square     pchisq           qchisq         dchisq        rchisq
Exponential    pexp             qexp           dexp          rexp
Gamma          pgamma           qgamma         dgamma        rgamma
Normal         pnorm            qnorm          dnorm         rnorm
Poisson        ppois            qpois          dpois         rpois
Student t      pt               qt             dt            rt
Uniform        punif            qunif          dunif         runif
Summary Statistics
To start with, let’s generate a random numeric vector of length 100. We will
use the standard normal distribution
To calculate mean
> mean(x)
[1] -0.06842303
> sum(x)/length(x)
[1] -0.06842303
> median(x)
[1] -0.05230469
> sd(x)
[1] 1.085102
> var(x)
[1] 1.177447
> range(x)
[1] -2.486898 4.007746
> quantile(x)
0% 25% 50% 75% 100%
-2.48689824 -0.85248359 -0.05230469 0.63106073 4.00774583
> quantile(x, probs=seq(0,1,0.1))
         0%         10%         20%         30%         40%         50%
-2.48689824 -1.26102234 -1.00358444 -0.71149067 -0.40031959 -0.05230469
        60%         70%         80%         90%        100%
 0.18725574  0.53641061  0.72404199  1.44893509  4.00774583
To get the most commonly used summary statistics, we can use summary()
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.48690 -0.85248 -0.05230 -0.06842 0.63106 4.00775
> summary(df)
num alph
Min. : 50.91 Length:100
1st Qu.: 73.19 Class :character
Median : 79.66 Mode :character
Mean : 80.23
3rd Qu.: 87.11
Max. :115.23
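The data frame df used above is not defined in the surviving text. A data frame with one numeric and one character column could be built as follows. This is only a sketch: the seed, the distribution parameters, and the letters are assumptions, so the summary values will not match the output above.

```r
set.seed(1)                                     # assumed seed
df <- data.frame(
  num  = rnorm(100, mean = 80, sd = 10),        # assumed parameters
  alph = sample(letters, 100, replace = TRUE),  # random lower-case letters
  stringsAsFactors = FALSE                      # keep alph as character
)
summary(df)
```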
Given two numeric vectors of the same length, we can compute their covariance
with cov() and their correlation with cor().
> cov(x,y)
[1] 1.630218
> cor(x,y)
[1] 0.8961332
We can also apply both functions to more than two vectors. For this exercise,
we create a new vector z of the same length as x from a uniform distribution.
The vector z does not depend on x or y. We use cbind() to create a
three-column matrix and then compute its covariance and correlation matrices.
> z = runif(length(x))
> comb = cbind(x, y, z)
> cov(comb)
x y z
x 1.177446803 1.6302178 0.003800914
y 1.630217841 2.8106373 0.034256899
z 0.003800914 0.0342569 0.091151886
> cor(comb)
x y z
x 1.00000000 0.89613325 0.01160204
y 0.89613325 1.00000000 0.06768038
z 0.01160204 0.06768038 1.00000000
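The whole calculation can be reproduced end to end with the sketch below. The seed and the way y is derived from x are assumptions, since the original definition of y is not shown, so the numbers will differ from the output above.

```r
set.seed(123)               # assumed seed
x <- rnorm(100)
y <- 2 * x + rnorm(100)     # assumed: y depends linearly on x, plus noise
z <- runif(length(x))       # independent of x and y

comb    <- cbind(x, y, z)   # three-column matrix
cov_mat <- cov(comb)        # 3 x 3 covariance matrix
cor_mat <- cor(comb)        # 3 x 3 correlation matrix
print(round(cor_mat, 3))
```

The diagonal of the correlation matrix is always 1, and the entries for z should be close to 0 because z is generated independently.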
String-related functions are very important in data analysis. In this chapter,
we will study:
• Manipulation of date/time objects and string representations
• Regular expressions to extract information from text
Printing Strings
The most basic string operation is to print the string. There are several ways in
which we can view the text in the console.
The simplest way is to type the string by using the quotation marks:
> "hello"
[1] "hello"
We can also store the value of the string in a variable and print by evaluating it.
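A minimal sketch of storing a string in a variable and evaluating it follows. The name str1 matches the variable used later in this section.

```r
str1 <- "hello"   # store the string in a variable
str1              # at the console, evaluating the variable prints its value
```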
But if we write a character value inside a loop, nothing is printed at all:
for (i in 1:3) {
"hello"
}
When an expression is typed at the console, R prints its value automatically.
This automatic printing applies only to top-level expressions. A for loop does
not return the values evaluated in its body, so a string evaluated inside the
loop is not printed. We can investigate this using the following example:
test1 <- function(x){
"hello"
x
}
> test1("world")
[1] "world"
In the example above, the function does not print hello but it prints world.
When we call the function test1("world"), the function returns the value of
the last expression x, which is world. If we remove x from the function:
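The definition of test2 is not preserved here. Based on the description, it is presumably test1 without the final x, as in this sketch:

```r
test2 <- function(x) {
  "hello"   # now the last expression, so this is the return value
}
test2("world")
```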
> test2("world")
[1] "hello"
In this case, test2() always returns hello irrespective of the value of x.
But our objective is to print both values. We can use print() to solve this
problem.
> print(str1)
[1] "hello"
for (i in 1:3){
print(str1)
}
[1] "hello"
[1] "hello"
[1] "hello"
If we want to print the text as a message not as a character vector with indices,
we can call cat() or message().
> cat("hello")
hello
Alternatively, we can use the message() function. Unlike cat(), it does not
insert a space between its arguments, so we need to write the separators
manually. The message() function also ends the text with a new line while
cat() does not. We can run two experiments to understand it further.
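The two experiments can be sketched as follows. The three letters are an assumption made to match the discussion that follows.

```r
# Experiment 1: cat() does not append a new line
for (ch in c("a", "b", "c")) {
  cat(ch)
}
cat("\n")   # finish the line by hand

# Experiment 2: message() appends a new line after every call
for (ch in c("a", "b", "c")) {
  message(ch)
}
```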
In the example above, the cat() function prints the input strings without
appending a new line, so all three letters appear on the same line. The
message() function appends a new line to each input string, so the three
letters are printed on three separate lines.
If we want to print each letter in a new line, we should explicitly add a new line
character in the input.
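With cat(), the new line character "\n" can be added explicitly, as in this short sketch (the letters are again only illustrative):

```r
for (ch in c("a", "b", "c")) {
  cat(ch, "\n", sep = "")   # "\n" is the new line character
}
```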
Concatenating Strings
> paste("hello","world")
[1] "hello world"
> paste("hello","world", sep="-")
[1] "hello-world"
> paste0("hello","world")
[1] "helloworld"
The question arises: what is the difference between paste() and cat(), if both
of them concatenate strings? The function cat() only prints the result to the
console, whereas paste() returns the result as a value that can be assigned to
a variable. We can study the following examples.
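The original listings are not preserved here. The difference in return values can be sketched as follows; the variable names value1 and value2 are hypothetical.

```r
value1 <- cat("hello", "world\n")   # prints the text to the console
print(value1)                       # NULL: cat() has no useful return value

value2 <- paste("hello", "world")   # prints nothing
print(value2)                       # the concatenated string was returned
```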
In the cat() example, we can see that cat() concatenates the strings but
returns NULL.
In the paste() example, we can see that paste() not only concatenates the
strings but also returns the result, which can be assigned to another variable.
The difference between cat() and paste() is more visible when working with
character vectors that have more than one element.
The function cat() concatenates both vectors into one string sequentially,
whereas paste() concatenates them element-wise.
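A small sketch with two hypothetical two-element vectors makes the contrast concrete:

```r
first  <- c("a", "b")          # hypothetical input vectors
second <- c("x", "y")

cat(first, second, "\n")       # all elements in sequence on one line
pasted <- paste(first, second) # element-wise: first[1] with second[1], etc.
pasted
```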
Transforming Text
Changing case
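The listing for this topic is not preserved here. In base R, the case of a string is changed with toupper() and tolower(); the input strings below are only illustrative.

```r
upper <- toupper("Hello World")         # all characters to upper case
lower <- tolower("Hello World")         # all characters to lower case
both  <- toupper(c("learn", "r"))       # both functions are vectorized
print(list(upper, lower, both))
```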
Counting characters
> nchar("Programming")
[1] 11
> nchar(c("Learn","R","Programming"))
[1] 5 1 11
We can use trimws() to trim leading and trailing whitespace (including spaces
and tabs).
By default, the function trims whitespace from both sides of the string. We
can use which= to specify the side of the string that we want to trim.
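The effect of the which= argument can be sketched as follows; the padded string is a hypothetical input.

```r
padded <- "  hello world\t"                 # two leading spaces, one trailing tab

both_sides <- trimws(padded)                  # default: which = "both"
left_only  <- trimws(padded, which = "left")  # keeps the trailing tab
right_only <- trimws(padded, which = "right") # keeps the leading spaces
print(c(both_sides, left_only, right_only))
```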
Substring
Suppose we have a vector that includes several dates where months are
represented by three-letter abbreviations.
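The definition of the dates vector is not preserved here. A vector of the kind described might look like this; the day numbers are assumptions, and only the month abbreviations are taken from the output below.

```r
dates <- c("Jan 3", "Jun 12", "Sep 5")   # hypothetical values
months <- substr(dates, 1, 3)            # characters 1 to 3 of each element
months
```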
substr(dates, 1, 3)
[1] "Jan" "Jun" "Sep"
We can also assign to substr() to replace that part of each string with a
given character vector.
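Replacement uses the assignment form of substr(). The following sketch uses a hypothetical dates vector and hypothetical replacement values.

```r
dates <- c("Jan 3", "Jun 12", "Sep 5")          # hypothetical values
substr(dates, 1, 3) <- c("Feb", "Dec", "Mar")   # overwrite characters 1 to 3
dates
```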
Splitting Texts
The function substr() works very well if the parts of the strings have fixed
lengths. However, in many cases, such as person names, the parts are not of
fixed length. For example, “Mary John” and “Tim Johnson”. In such cases, we
can use the function strsplit() to split texts by a separator such as a space
or a comma.
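Using the two names from the paragraph above, a minimal sketch of splitting by a space is:

```r
people <- c("Mary John", "Tim Johnson")
parts  <- strsplit(people, " ")   # returns a list, one element per input string
parts

first_names <- sapply(parts, `[`, 1)   # pick the first piece of each element
first_names
```

The result of strsplit() is a list because each input string may split into a different number of pieces.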
[[2]]
[1] "Travis" "45" "Germany"
[[3]]
[1] "Pascal" "23" "France"
We can also use strsplit() to split a whole string into individual characters.
For this, we pass an empty string as the split argument.
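For example, with a hypothetical input string:

```r
chars <- strsplit("hello", "")[[1]]   # [[1]] extracts the single list element
chars
```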
Formatting Text
To build a formatted string from the values that we provide, we use the
sprintf() function.
sprintf(format, values)
In this function, format is the template that describes how the values are
printed, and values are the values to insert into the template.
To format the numerical vector to the default number of decimal places (six
digits after the decimal point), we can use the following.
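The listing is not preserved here. In the sketch below, the value x = 123.456 is an assumption that is consistent with the outputs shown in the rest of this topic.

```r
x <- 123.456                # assumed value of x
default_fmt <- sprintf("%f", x)   # %f prints six digits after the decimal point
default_fmt
```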
We can add a point and a number between the percentage sign and the f. To
round the numeric input value to two digits after the decimal place, use the
following.
> sprintf("%.2f",x)
[1] "123.46"
We can also print the number without any decimal places by setting the
precision to 0:
> sprintf("%1.0f",x)
[1] "123"
We can also pad the output with leading blanks. The following format prints
the number without decimals, right-aligned in a field that is 10 characters
wide:
> sprintf("%10.0f",x)
[1] "       123"
Format                   Output
sprintf("%s", "A")       "A"
sprintf("%d", 10)        "10"
sprintf("%04d", 10)      "0010"
sprintf("%f", pi)        "3.141593"
sprintf("%.2f", pi)      "3.14"
sprintf("%1.0f", pi)     "3"
sprintf("%8.2f", pi)     "    3.14"
Formatting date/time
In data analysis, we encounter the date and time data types very often. The
simplest function for dates is Sys.Date() and for times is Sys.time(). The
function Sys.Date() returns the current date. The function Sys.time() returns
the current date and time.
> Sys.Date()
[1] "2020-01-07"
> Sys.time()
[1] "2020-01-07 16:58:39 IST"
The output above may suggest that the date and time are character vectors, but
they are not. We can verify this with the following command:
> as.numeric(Sys.Date())
[1] 18268
The statement above returns the number of days that have passed since
1970-01-01.
The question arises: if a date can be represented as a string, why do we need
a Date object? The Date object has good arithmetic properties. We can add
days to a date, or subtract days from it, to get a new date.
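The definition of my_date is not preserved here. The value below is inferred from the two outputs that follow, so treat it as an assumption.

```r
my_date <- as.Date("2020-01-07")   # inferred from the outputs in the text
week_later <- my_date + 7          # seven days later
earlier    <- my_date - 80         # eighty days earlier
print(c(week_later, earlier))
```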
> my_date + 7
[1] "2020-01-14"
> my_date - 80
[1] "2019-10-19"
We can subtract one date from another to get the number of days between
the two dates
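The original dates are not shown. The two dates below are assumptions, picked so that they are 184 days apart like the result printed further down.

```r
date1 <- as.Date("2020-01-07")   # assumed
date2 <- as.Date("2019-07-07")   # assumed
gap <- date1 - date2             # a difftime object
gap                              # prints: Time difference of 184 days
```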
The result looks like a message, but it is a difftime object that holds a
numeric value. We can get the numeric value explicitly using as.numeric().
> as.numeric(date1-date2)
[1] 184
Times work in a similar way. However, R does not have a function called
as.Time(). We can use either as.POSIXct() or as.POSIXlt() to create a
date/time from its text representation. The two functions are different
implementations of date/time. An example that uses as.POSIXlt() is given
below.
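The listing is not preserved here. In the sketch below, the timestamp is inferred from the output of my_time + 10 that follows, and the time zone is an assumption made to match the IST label in that output.

```r
# Timestamp and time zone are assumptions inferred from the printed output
my_time <- as.POSIXlt("2020-01-07 14:45:17", tz = "Asia/Kolkata")
my_time

later <- my_time + 10   # ten seconds later
later
```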
We can perform addition and subtraction on times as well. The unit is one
second.
> my_time + 10
[1] "2020-01-07 14:45:27 IST"
If the text is not in a standard format such as yyyy-mm-dd, as.Date() cannot
parse it directly:
> as.Date('2017.05.21')
Error in charToDate(x) :
character string is not in a standard unambiguous format
In such a case, a format string can be used to tell the as.Date() function how
to parse the string into a date.
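For the string that failed above, a matching format string could look like this:

```r
# %Y = four-digit year, %m = month, %d = day; the dots are matched literally
parsed <- as.Date("2017.05.21", format = "%Y.%m.%d")
parsed
```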
We can use strptime(), which is a more direct method to convert a string to
a date/time.
Dates and date/times are also vectors. We can pass a character vector and get
a vector of dates.
The arithmetic is also vectorized. We can add consecutive integers to a date
to get consecutive dates.
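Both points can be sketched with a few hypothetical dates:

```r
# A character vector in, a Date vector out
date_vec <- as.Date(c("2020-01-07", "2020-02-07", "2020-03-07"))
date_vec

# Adding 0, 1, 2 to one date gives three consecutive dates
consecutive <- as.Date("2020-01-07") + 0:2
consecutive
```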
> as.Date('20190610','%Y%m%d')
[1] "2019-06-10"
> strptime('20190610042949','%Y%m%d%H%M%S')
[1] "2019-06-10 04:29:49 IST"
Formatting date/time to strings
In this section, we will learn the functions to convert date and date/time
objects to strings. These functions use a certain template.
> as.character(my_date)
[1] "2020-01-07"
Even though the output looks the same, it is now plain text. It does not
support date calculations.
> txt_date + 1
Error in txt_date + 1 : non-numeric argument to binary operator
We can also get the same result using format(). The function as.character()
calls the format() function behind the scenes. Hence it is recommended to use
the format() function directly.
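format() also accepts a template string. The sketch below assumes my_date is 2020-01-07, as inferred earlier, and the two templates are only examples.

```r
my_date <- as.Date("2020-01-07")         # assumed value
plain   <- format(my_date)               # same text as as.character(my_date)
slashes <- format(my_date, "%Y/%m/%d")   # year/month/day
short   <- format(my_date, "%d.%m.%y")   # day.month.two-digit year
print(c(plain, slashes, short))
```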
While working on a data analysis problem, you may get data in various formats.
Most of the time, the data is well organized. An example is given below:
id,name,score
1,A,20
2,B,30
3,C,25
In R, we use read.csv() to import a CSV file as a data frame with the correct
header and data types.
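To keep the sketch self-contained, the CSV content above is passed through the text= argument of read.csv() instead of a file path.

```r
csv_text <- "id,name,score
1,A,20
2,B,30
3,C,25"

scores <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(scores)   # header row became column names, numeric columns were detected
```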
However, not every data file is well organized. It is challenging to deal with
poorly organized data. We can use built-in functions such as read.table() and
read.csv(), but they may not give us the desired results for such loosely
formatted data.
For example, suppose we want to analyze raw data (fruits.txt) that describes
the number or status of some fruits.
apple: 20
orange: missing
banana: 30
pear: sent to Jerry
watermelon: 2
blueberry: 12
strawberry: sent to James
Our requirement is to pick out all the fruits with a number instead of status
information. First of all, we should distinguish between fruits with numbers
and fruits without numbers. We need to distinguish the text that matches a
pattern from the ones that do not. We can use regular expressions for this
problem.
We can use regular expressions by following two steps. The first step is to find
a pattern to match the text. The second step is to group the patterns to extract
the required information.
To solve the fruits problem, we need to find a pattern to extract the required
information. In this case, we need to extract all the lines that start with a
word, which is followed by a colon and a space. The line should end with an
integer instead of words or other symbols.
Regular expressions provide a set of symbols that can represent such patterns.
We can describe the preceding pattern using ^\w+:\s\d+$. In this case, we have
used meta-symbols to represent classes of symbols.
• ^: Matches the beginning of the line
• \w: Matches a word character
• \s: Matches a whitespace character
• \d: Matches a digit character
• $: Matches the end of the line
• \w+: Matches one or more word characters
• :: The literal symbol that we want to see after the word
• \d+: Matches one or more digit characters
We need to select the lines that match the pattern abc: 123 while ignoring
others. We can use the function grep() to get the lines that match the
pattern.
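The listing is not preserved here. A minimal sketch follows; the lines of fruits.txt are typed in directly so that the code runs on its own, and the name pat_match matches the variable used below.

```r
# Stand-in for fruits <- readLines("fruits.txt")
fruits <- c("apple: 20", "orange: missing", "banana: 30",
            "pear: sent to Jerry", "watermelon: 2", "blueberry: 12",
            "strawberry: sent to James")

# grep() returns the indices of the elements that match the pattern
pat_match <- grep("^\\w+:\\s\\d+$", fruits)
pat_match
```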
Please note that in an R string literal the backslash itself must be escaped,
so we write \\ wherever the pattern needs \. Now we can filter fruits by
pat_match:
> fruits[pat_match]
[1] "apple: 20"     "banana: 30"    "watermelon: 2" "blueberry: 12"
In the above example, we have specified a pattern that starts with ^ and ends
with $ in order to avoid partial matching. By default, a regular expression
performs partial matching. It means that if any part of the string matches the
pattern, the whole string is considered to match the pattern. For example, the
following code determines which strings match each of two patterns.
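The listing is not preserved here. The comparison can be sketched with a hypothetical vector of test strings:

```r
strings <- c("abc", "a12", "123", "1")   # hypothetical test strings

partial <- grep("\\d", strings, value = TRUE)     # any string containing a digit
exact   <- grep("^\\d$", strings, value = TRUE)   # strings that are one digit only
print(partial)
print(exact)
```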
The first pattern is an example of partial matching. It matches every string
that includes any digit. The second pattern, where we used ^ and $, matches
only a string that consists of exactly one digit.
In the pattern strings, we can mark groups using parentheses. In the fruits
problem, we can mark the groups by modifying the pattern to (\w+):\s(\d+). In
this case, we have marked two groups. One is the fruit name (\w+) and the
other is the number of fruits (\d+).
For such problems, we will use the stringr package. Even though R has built-in
functions to solve such problems, the stringr package is easier to use and
more efficient. We will call the function str_match() with the updated group
pattern.
> library(stringr)
> match <- str_match(fruits, "^(\\w+):\\s(\\d+)$")
> match
[,1] [,2] [,3]
[1,] "apple: 20" "apple" "20"
[2,] NA NA NA
[3,] "banana: 30" "banana" "30"
[4,] NA NA NA
[5,] "watermelon: 2" "watermelon" "2"
[6,] "blueberry: 12" "blueberry" "12"
[7,] NA NA NA
This time, we get a matrix with more than one column. The groups in
parentheses have been extracted from the text and placed in columns 2 and 3.
Now we can transform the character matrix into a data frame with the right
header and data types.
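The conversion listing is not preserved here. The sketch below is one possible way to do it. So that it runs without extra packages, it uses base R's regexec() and regmatches() in place of str_match(); this is a swapped-in technique, not the text's own code. The column names follow the df_fruits output shown below.

```r
fruits <- c("apple: 20", "orange: missing", "banana: 30",
            "pear: sent to Jerry", "watermelon: 2", "blueberry: 12",
            "strawberry: sent to James")

# For each string: full match plus the two groups, or character(0) if no match
m <- regmatches(fruits, regexec("^(\\w+):\\s(\\d+)$", fruits))

# Keep only the matched lines and stack them into a character matrix
matched <- do.call(rbind, m[lengths(m) > 0])

df_fruits <- data.frame(
  fruit    = matched[, 2],               # first group: fruit name
  quantity = as.integer(matched[, 3]),   # second group: convert text to integer
  stringsAsFactors = FALSE
)
df_fruits
```

With the matrix returned by str_match(), the same idea applies: drop the rows that contain NA, then build the data frame from columns 2 and 3.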
Now we have the data frame df_fruits with the right header and data types.
> df_fruits
fruit quantity
1 apple 20
2 banana 30
3 watermelon 2
4 blueberry 12
A typical data analysis project starts with loading the data. It means that we
need to import a data set into the R environment. Before we load the data, we
need to check the type of the data file and then use an appropriate tool to
read the data.
The most commonly used data file type is the CSV file. The first line of a
typical CSV file is the header of the columns. Each subsequent line represents
a data record whose columns are separated by commas. Here is an example of a
CSV file.
Name,Gender,Age,Major
John,Male,24,Finance
Amily,Female,25,Statistics
Jessie,Female,23,Computer Science
You can import the data using RStudio by navigating to File | Import Dataset |
From Text (base). You then choose a local file in a text format, such as .csv
or .txt.
You should check the Strings as factors option only if you want to convert the
string columns to factors.
The file importer translates the file path and options to R code. After setting
different parameters, you can click on Import. It will call the read.csv()
function. The interactive tool is very handy and helps you avoid several
mistakes.
The function readLines() can be used to read a text file. It returns the lines
of the file as a character vector.
> readLines("data/student.txt")
[1] "Name,Gender,Age,Major"              "John,Male,24,Finance"
[3] "Amily,Female,25,Statistics"         "Jessie,Female,23,Computer Science"
By default, the function reads all the lines of the file. We can also preview the
first two lines.
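The n argument of readLines() limits how many lines are read. In this sketch, a temporary file stands in for the student data file so that the code runs on its own.

```r
tmp <- tempfile(fileext = ".csv")   # stand-in for the student data file
writeLines(c("Name,Gender,Age,Major",
             "John,Male,24,Finance",
             "Amily,Female,25,Statistics",
             "Jessie,Female,23,Computer Science"), tmp)

first_two <- readLines(tmp, n = 2)  # read only the first two lines
first_two
```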
When calling read.csv(), we can use col.names= to explicitly specify the names
of the columns.
CSV files use a comma (,) to separate columns and a new line to separate rows.
If the file you are trying to import is in tab-delimited format, you may use
read.table().
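A minimal sketch of col.names= follows. The student data is passed through text= to keep the example self-contained, and the lower-case column names are hypothetical.

```r
csv_text <- "Name,Gender,Age,Major
John,Male,24,Finance
Amily,Female,25,Statistics
Jessie,Female,23,Computer Science"

# The header line is still consumed, but its names are replaced by col.names
student <- read.csv(text = csv_text,
                    col.names = c("name", "gender", "age", "major"),
                    stringsAsFactors = FALSE)
student
```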
The functions read.* have some inconsistencies. Instead, we can use readr
package to import tabular data in a fast and consistent manner.
> library(readr)
> student1 <- read_csv("data/student.csv")
── Column specification ────────────────────────────────────────────────────
cols(
Name = col_character(),
Gender = col_character(),
Age = col_double(),
Major = col_character()
)
> student1
# A tibble: 3 x 4
Name Gender Age Major
<chr> <chr> <dbl> <chr>
1 John Male 24 Finance
2 Amily Female 25 Statistics
3 Jessie Female 23 Computer Science
The read_csv() function opens the CSV file and reads it line by line. By
default, it also inspects the first rows of the table in order to decide the
type (i.e. integer, character, etc.) of each column. You can specify the type
of each column yourself with the col_types argument.
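The listing is not preserved here. The call can be sketched as below. A temporary file stands in for data/student.csv, the name student2 is hypothetical, and the call is skipped when the readr package is not installed.

```r
tmp <- tempfile(fileext = ".csv")   # stand-in for data/student.csv
writeLines(c("Name,Gender,Age,Major",
             "John,Male,24,Finance",
             "Amily,Female,25,Statistics",
             "Jessie,Female,23,Computer Science"), tmp)

# readr is an add-on package, so skip the call if it is not installed
if (requireNamespace("readr", quietly = TRUE)) {
  # One letter per column: c = character, d = double
  student2 <- readr::read_csv(tmp, col_types = "ccdc")
  print(student2)
}
```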
Here col_types = 'ccdc' indicates that the data type of the first, second, and
fourth columns is character, and the data type of the third column is double.
The read_csv() function can also read the compressed files automatically.
Reading and writing Excel worksheets
An Excel workbook is another format for storing tabular data. R does not
provide any built-in function to read an Excel workbook, but several R
packages, such as readxl (https://github.com/hadley/readxl), are available to
work with Excel worksheets. You can install the readxl package from CRAN using
install.packages("readxl").
> library(readxl)
> price <- read_excel("data/price.xlsx")
> price
# A tibble: 6 x 3
Date Price Growth
<dttm> <dbl> <dbl>
1 2020-01-03 00:00:00 136 NA
2 2020-02-03 00:00:00 138 0.0147
3 2020-03-03 00:00:00 137 -0.00725
4 2020-04-03 00:00:00 130 -0.0511
5 2020-05-03 00:00:00 139 0.0692
6 2020-06-03 00:00:00 140 0.00719
Another package that we can use while working with Excel is openxlsx. We can
use this package to read, write, and edit XLSX files. Hence openxlsx is more
comprehensive than readxl. You can install this package using the
install.packages("openxlsx") command.
With openxlsx, we can use read.xlsx() to read the data in an XLSX file into a
data frame, just like read_excel() from readxl.
> library(openxlsx)
> price1 <- read.xlsx("data/price.xlsx", detectDates = TRUE)
> price1
Date Price Growth
1 2020-01-03 136 NA
2 2020-02-03 138 0.014705882
3 2020-03-03 137 -0.007246377
4 2020-04-03 130 -0.051094891
5 2020-05-03 139 0.069230769
6 2020-06-03 140 0.007194245
We used detectDates = TRUE to ensure that date values are imported correctly.
Otherwise the dates are imported as numbers. We can also use write.xlsx() to
write a data frame to a workbook.
openxlsx::write.xlsx(price1, "data/price1.xlsx")
CSV files and Excel workbooks are not native data formats of R. Hence, there
is a gap between the original data object and the output file. If we export a
data frame whose columns have different data types to a CSV file, the
information about the column types is discarded. Every numeric, string, or
date column is represented as text.
If the portability of the data is not an issue and you want to use only R to
work with the data, you can use the native formats to read and write data. The
native formats let you save an object in a file and recover exactly the same
object without worrying about data type issues.
R has its own data file format that uses the .rds extension. We can use the
saveRDS() function to write an object to an R data file and the readRDS()
function to read it back.
An .rds file is usually smaller than the equivalent text file and hence takes
up less storage space. The .rds format also preserves data types and classes
such as factors and dates, eliminating the need to redefine data types after
loading the file.
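A round trip through an .rds file can be sketched as follows. The data frame and the temporary file path are hypothetical.

```r
df <- data.frame(
  id    = 1:3,
  grade = factor(c("A", "B", "A")),
  day   = as.Date(c("2020-01-03", "2020-02-03", "2020-03-03"))
)

path <- tempfile(fileext = ".rds")   # temporary file, hypothetical location
saveRDS(df, path)                    # write the object in R's native format
df2 <- readRDS(path)                 # read it back into a new variable

all.equal(df, df2)                   # compares the restored object with the original
str(df2)                             # factor and Date columns keep their classes
```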
R has a great number of built-in datasets. We can load and use them easily.
The built-in datasets are mostly data frames and come with detailed
documentation.
The most famous built-in R datasets are iris and mtcars. You can use ?iris and
?mtcars to read the descriptions of these datasets.
We can use the built-in datasets directly because they are available as soon
as R starts.
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
You can also print iris to see the whole data frame. We can also use
View(iris) to view data in a grid pane.
Similarly, we can view the first six rows of mtcars and see its structure.
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
For this chapter, we will use the nycflights13 package. We can install the
package using the following command:
install.packages("nycflights13")
plot(1:10)
x <- rnorm(200)
y <- 2*x + rnorm(200)
plot(x,y)
Customize Chart Elements
We can customize several chart elements, such as the title (main or title()),
the label of the x-axis (xlab), the label of the y-axis (ylab), the range of
the x-axis (xlim), and the range of the y-axis (ylim).
plot(x, y,
main = "Correlated Random numbers",
xlab = "x", ylab = "2x + noise",
xlim = c(-4, 4), ylim = c(-6, 6))
We can specify the chart title by either the main argument or a separate
title() function call. The following code plots the same chart as the one
given above.
plot(x, y,
xlab = "x", ylab = "2x + noise",
xlim = c(-4, 4), ylim = c(-6, 6))
title("Correlated Random numbers")
For a scatter plot, the default point style is a circle. We can specify the
pch argument (plotting character) to change the point style. R provides 26
point styles, numbered from 0 to 25.
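The listing that previews the point styles is not preserved here. One possible sketch, which draws one point per style and labels it with its pch number, is given below; the layout and axis settings are assumptions.

```r
pch_vals <- 0:25                          # the 26 built-in point styles
plot(pch_vals, rep(1, length(pch_vals)),  # one point per style on a single row
     pch = pch_vals, ylim = c(0.5, 1.5),
     xlab = "pch", ylab = "", yaxt = "n",
     main = "Point styles")
text(pch_vals, 1.2, labels = pch_vals, cex = 0.7)   # pch number above each point
```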
In the preceding code, we have created a scatter plot that includes all the point
styles while printing the corresponding pch number beside it. First, we created
a simple scatter plot using plot, then printed the pch number using the
text().
We can draw a scatter plot with a non-default point style by setting pch = 17.
x <- rnorm(200)
y <- 2*x + rnorm(200)
plot(x,y, pch = 17,
main = "Scatter plot with pch = 17")
We can also distinguish two groups of points by a logical condition. Because
pch is vectorized, we can use ifelse() to choose the point style of each
observation based on a condition. The following example applies pch = 17 to
the points that satisfy x * y > 1 and pch = 1 to all other points:
x <- rnorm(200)
y <- 2*x + rnorm(200)
plot(x,y,
pch = ifelse(x * y > 1, 17, 1),
main = "Scatter plot with conditional pch")
A plot containing two separate datasets sharing the same x-axis can be drawn
using plot() and points(). In the previous example, a normally distributed
vector x and a linearly correlated random vector y were generated. For this
example, we will generate another random vector, z, that has a non-linear
relationship with x. We then plot both y and z against x, using a different
point style for each:
x <- rnorm(75)
y <- 1.5*x + rnorm(75)
z <- sqrt(1 + x ^ 2) + rnorm(75)
plot(x, y, pch = 1,
xlim = range(x), ylim = range(y, z),
xlab = "x", ylab = "value")
points(x, z, pch = 17)
title("Scatter plot with two datasets")
x <- rnorm(75)
y <- 1.5*x + rnorm(75)
plot(x, y, pch = 15, col = "blue", main = "Blue Color Scatter Plot")
We can use col to distinguish different groups of points when plotting two
different datasets using plot() and points().
R supports 657 named colors in total. You can call the function colors() to
get the list of all the color names supported by R.
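Both points can be sketched together. The seed, the data, and the two colors are assumptions made for illustration.

```r
n_colors <- length(colors())   # number of named colors
n_colors
head(colors())                 # first few color names

set.seed(7)                    # assumed seed
x <- rnorm(75)
y <- 1.5 * x + rnorm(75)
z <- sqrt(1 + x^2) + rnorm(75)

# First dataset in blue circles, second dataset in red triangles
plot(x, y, pch = 1, col = "blue",
     ylim = range(y, z), xlab = "x", ylab = "value")
points(x, z, pch = 17, col = "red")
title("Two datasets distinguished by color")
```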
In several data analysis problems, such as time series analysis, we use line
plots to show the trend and variation across time. To draw a line plot, we
use type = "l" when calling plot().
t <- 1:50
y <- 2.5 * sin(t * pi / 60) + rnorm(t)
plot(t, y, type = "l", main = "Line plot")
Line Type and Width
For a line plot, we can use lty to specify the line type. It is similar to pch
for a scatter plot. A preview of the six line types that R supports is shown
below.
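The preview listing is not preserved here. Based on the description that follows it, the code was presumably close to the sketch below; the mtext() arguments and the title are assumptions.

```r
lty_val <- 1:6                                         # the six line types
plot(lty_val, type = "n", axes = FALSE, ann = FALSE)   # empty canvas
abline(h = lty_val, lty = lty_val, lwd = 2)            # one horizontal line per type
mtext(lty_val, side = 2, at = lty_val, las = 1)        # label each line in the margin
title("Line types (lty)")
```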
In the preceding code, we have used the parameter type = "n" to create an
empty canvas. The value "n" signifies no plotting. The parameters axes =
FALSE, ann = FALSE are used to turn off axes and annotation.
We used the abline() function to add straight lines to the current plot.
The parameter h = lty_val draws six horizontal lines, one at each value of
lty_val. The line width is set by lwd = 2, and the line types are specified
by lty = lty_val.
We have used the function mtext() to draw text in the margin. Please
note that abline() and mtext() are vectorized with respect to their
arguments.
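The code that this passage describes is not reproduced in this part of the text. A minimal sketch consistent with the description might look like the following. The exact mtext() arguments (side, at, las) and the title are assumptions.

```r
# Assumed reconstruction of the line-type preview described above
lty_val <- 1:6

# Empty canvas: nothing plotted, no axes, no annotation
plot(lty_val, type = "n", axes = FALSE, ann = FALSE)

# One horizontal line per line type; h and lty are both vectors
abline(h = lty_val, lty = lty_val, lwd = 2)

# Label each line in the left margin with its lty value
mtext(lty_val, side = 2, at = lty_val, las = 1)
title("Line types (lty)")
```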
In the following example, we draw auxiliary lines in a plot using the
function abline(). First, we create a plot of y against time t. Then we show
the mean value and the range (minimum and maximum values) of y along the
time axis. We can draw these auxiliary lines easily by using different line
types and colors.
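A sketch of such a plot is given here, reusing the t and y definitions from the earlier line plot. The seed, colors, and line types are assumptions chosen for illustration.

```r
set.seed(1)  # assumption: fixed seed for reproducibility
t <- 1:50
y <- 2.5 * sin(t * pi / 60) + rnorm(length(t))

# Base line plot of y against time
plot(t, y, type = "l", lwd = 2, main = "Line plot with auxiliary lines")

# Auxiliary horizontal lines: mean (dashed) and min/max (dotted)
y_mean <- mean(y)
y_range <- range(y)
abline(h = y_mean, lty = 2, col = "blue")
abline(h = y_range, lty = 3, col = "red")
```

Because abline() is vectorized over h, the single call with h = y_range draws both the minimum and the maximum line.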
Multi-period line plot
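In a multi-period line plot, we mix different line types. For example, in a
time series dataset the first period may contain historic data and the second
period predictions. The following code draws the first period (t <= p) as a
solid line and the second period (t >= p) as a dashed line, reusing t and y
from the earlier line plot.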
p <- 40
plot(t[t <= p], y[t <= p], type = "l",
xlim = range(t), xlab = "t", ylab = "y")
lines(t[t >= p], y[t >= p], lty = 2)
title("Two period Line Plot")
We can plot both the lines and points in the same chart. This can be done
easily by first plotting a line chart and then adding points() of the same data
to the plot again.
plot(y, type = "l")
points(y, pch = 16)
title("Line plot with points")
Alternatively, first, we can plot a scatter plot using the plot() function and
then we can add lines using the lines() function.
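The alternative order can be sketched as follows. The series y here is hypothetical data generated only for this illustration.

```r
set.seed(1)             # assumption: fixed seed for reproducibility
y <- cumsum(rnorm(30))  # hypothetical series

# Scatter plot first, then connect the same points with a line
plot(y, pch = 16, main = "Scatter plot with lines")
lines(y)
```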
In the following code, we have generated two series, y1 and y2, with time t
and created a chart with the two series with respect to time t.
t <- 1:30
y1 <- 1.5 * t + 6 * rnorm(30)
y2 <- 2.5 * sqrt(t) + 8 * rnorm(30)
plot(t, y1, type = "l", col = "black",
ylim = range(y1, y2), ylab ="y1, y2")
points(t, y1, pch = 15)
lines(t, y2, col = "blue", lty = 2)
points(t, y2, col = "blue", pch = 16)
title("Plot of two series")
legend("topleft",
legend = c("y1", "y2"),
col = c("black", "blue"),
lty = c(1, 2), pch = c(15, 16),
cex = 0.8, x.intersp = 0.5, y.intersp = 0.8)
In the above example, we have added a legend() on the top left. It shows the
line and point styles of y1 and y2 respectively. We have also used cex to scale
the font sizes of the legend and x.intersp and y.intersp to make some
minor adjustments to the legend.
Bar charts
Bar charts are among the most commonly used charts. We use them to
visualize qualitative data by category frequency.
To plot a bar chart, we use the barplot() function instead of plot().
The function draws either vertical or horizontal bars separated by white
space. Although bar charts usually display raw frequencies, we can also use
barplot() to visualize other quantities, such as means or proportions, that
depend directly on these frequencies.
main: title of the bar chart
names.arg: vector of names appearing under each bar
col: color for the bars in the graph
If the numeric vector is a named vector, the names will automatically be the
names on the x-axis. Hence, we get the same results from the following code,
as we received from the previous code.
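The two code listings referred to here are not reproduced in this part of the text. The sketch below illustrates the same idea with hypothetical counts: the first call passes the labels through names.arg, the second relies on the names of the vector.

```r
# Hypothetical counts for three categories
counts <- c(12, 7, 9)

# Labels supplied explicitly through names.arg
mid_unnamed <- barplot(counts, names.arg = c("A", "B", "C"),
                       main = "Counts by category", col = "lightblue")

# Same chart from a named vector: the names become the bar labels
named_counts <- c(A = 12, B = 7, C = 9)
mid_named <- barplot(named_counts,
                     main = "Counts by category", col = "lightblue")
```

barplot() invisibly returns the bar midpoints, which is handy when text or points need to be added on top of the bars.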
Now we will draw a bar plot using the flights dataset in the nycflights13
package. The data table flights contains information about all 336,776
flights that departed from NYC in 2013.
In this example, we will create a bar plot of the eight carriers with the most
flights in the record. Before we can use the dataset, we need to install the
package with the command install.packages("nycflights13").
data("flights", package = "nycflights13")
carriers <- table(flights$carrier)
carriers
   9E    AA    AS    B6    DL    EV    F9    FL    HA    MQ    OO
18460 32729   714 54635 48110 54173   685  3260   342 26397    32
   UA    US    VX    WN    YV
58665 20536  5162 12275   601
In the previous code, we have used table() to count the number of flights in
the record for each carrier. Now sort the carriers in decreasing order.
   UA    B6    EV    DL    AA    MQ    US    9E    WN    VX    FL
58665 54635 54173 48110 32729 26397 20536 18460 12275  5162  3260
   AS    F9    YV    HA    OO
  714   685   601   342    32
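The call that produces the sorted table is not shown in this part of the text. One way to obtain an object such as carriers_sort is sort() with decreasing = TRUE. The sketch below uses a small hypothetical carrier vector in place of flights$carrier so that it runs without the package.

```r
# Hypothetical stand-in for flights$carrier
carrier <- rep(c("AA", "B6", "OO", "UA"), times = c(3, 5, 1, 6))

# Count records per carrier, then sort the counts in decreasing order
carriers <- table(carrier)
carriers_sort <- sort(carriers, decreasing = TRUE)
carriers_sort

# The first elements are the carriers with the most records
head(carriers_sort, 2)
```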
Now we can take the first 8 elements from the table and draw a bar plot:
barplot(head(carriers_sort, 8),
ylim = c(0, max(carriers_sort) * 1.1),
xlab = "Carrier", ylab = "Flights",
main ="Top 8 carriers ordered by number of flights")
Pie Charts
Pie charts are also useful charts for data analysis. We can use the pie()
function to create a pie chart. The pie-chart is a representation of values as
slices of a circle with different colors.
x: vector that contains the numeric values that are used in the pie chart
labels: to provide the description of the slices
radius: to provide the radius of the circle of the pie chart (value between -1
and +1)
main: to provide the title of the chart
col: indicates the color palette
clockwise: indicates whether the slices are drawn clockwise or anti-clockwise
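The parameters listed above can be combined as in the following sketch. The shares and labels are hypothetical values chosen for illustration.

```r
# Hypothetical data: four categories and their shares
shares <- c(40, 30, 20, 10)
slice_labels <- c("A", "B", "C", "D")

# Percentage of the whole for each slice, shown in the labels
pct <- round(100 * shares / sum(shares))

pie(shares,
    labels = paste0(slice_labels, " (", pct, "%)"),
    radius = 0.9,
    main = "Share by category",
    col = rainbow(length(shares)),
    clockwise = TRUE)
```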
Histogram and density plots
We can create the histogram using hist() function. The function accepts a
vector as an input along with some more parameters to plot histograms.
hist(v,main,xlab,xlim,ylim,breaks,col,border)
v: a vector containing the numeric values that are used in the histogram
main: title of the chart
xlab: description of the x-axis
xlim: range of values on the x-axis
ylim: range of values on the y-axis
breaks: the breakpoints between the bars, or a single number giving the suggested number of bars
col: color of the bars
border: border-color of each bar
In the following example, we have demonstrated how we can use hist() to
plot a histogram using a normally distributed random numeric vector and the
density function of the normal distribution.
In this case, we have used the curve() function. We have used the parameter
add = TRUE to add the curve to the existing plot.
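The example that this passage describes is not reproduced in this part of the text. A minimal sketch of the same idea follows: a histogram on the density scale with the normal density drawn on top. The sample size, seed, number of breaks, and colors are assumptions.

```r
set.seed(42)      # assumption: fixed seed for reproducibility
x <- rnorm(1000)  # normally distributed random vector

# probability = TRUE puts the y-axis on the density scale,
# so the histogram and the density curve are comparable
h <- hist(x, probability = TRUE, breaks = 30,
          main = "Histogram of normal random numbers",
          xlab = "x", col = "lightgray", border = "gray")

# add = TRUE draws the curve on the existing plot
curve(dnorm(x), add = TRUE, lwd = 2, col = "blue")
```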
We observe that the distribution is different from a normal distribution. So, we
can use density() function to estimate the empirical distribution of the
speed and plot a smooth probability distribution curve. We have also added a
vertical line to indicate the global average of all the observations.
hist(ft_speed,
probability = TRUE, ylim = c(0, 0.5),
main ="Histogram & distribution of flight speed",
xlab = "Flight Speed",
border ="gray", col = "lightgray")
lines(density(ft_speed, from = 2, na.rm = TRUE),
col ="darkgray", lwd = 2)
abline(v = mean(ft_speed, na.rm = TRUE),
col ="blue", lty =2)
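The code above relies on ft_speed, whose definition does not appear in this part of the text. Judging by the formula distance/air_time used later for the boxplot, one plausible definition is the ratio of the two columns of flights. Treat this as an assumption. The helper name flight_speed and the sample values are hypothetical.

```r
# Hypothetical helper: speed as distance divided by air time
flight_speed <- function(distance, air_time) distance / air_time

# Small made-up sample; the NA mimics a record with missing air_time
sample_speed <- flight_speed(c(1400, 1089, 762), c(227, 160, NA))
sample_speed

# With the real data (runs only if nycflights13 is installed)
if (requireNamespace("nycflights13", quietly = TRUE)) {
  flights <- nycflights13::flights
  ft_speed <- flight_speed(flights$distance, flights$air_time)
  print(summary(ft_speed))
}
```

Records with a missing air_time produce NA speeds, which is why the calls to density() and mean() above pass na.rm = TRUE.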
Boxplot
Boxplot is used to visualize the distribution of the data in a data set. The
boxplot represents the minimum, maximum, median, first quartile, and the
third quartile in the data set. You can compare the distribution of data across
data sets by drawing the boxplot for each one of them.
x: vector or a formula
data: data frame
notch: logical value. Draws a notch if set to TRUE
varwidth: logical value. If TRUE, the width of each box is proportional to the
square root of its sample size
names: group label that can be printed under each boxplot
main: provides the title to the graph
x <- rnorm(1000)
boxplot(x)
We can draw a box plot of the flight speed for each carrier. In this example, we
have 16 boxplots in one chart. It helps us compare the distribution of different
carriers. We have used the formula distance/air_time ~ carrier to
indicate that the x-axis denotes the carrier and the y-axis denotes the flight
speed (distance/air_time).
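The formula interface can be sketched on a small hypothetical data frame that has the same column names as flights. The data are made up, and the commented call at the end shows how the same formula could be applied to the real dataset.

```r
set.seed(7)  # assumption: fixed seed for reproducibility

# Hypothetical stand-in for flights with three carriers
df <- data.frame(
  carrier  = rep(c("AA", "B6", "UA"), each = 50),
  distance = runif(150, 200, 2500),
  air_time = runif(150, 40, 350)
)

# Left side of ~ is the value to summarize, right side is the grouping variable
b <- boxplot(distance / air_time ~ carrier, data = df,
             xlab = "Carrier", ylab = "Speed",
             main = "Speed by carrier")

# With the real data:
# boxplot(distance / air_time ~ carrier, data = nycflights13::flights)
```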
8. Analysing Data
Linear model
The linear model is one of the simplest models to fit in R. In such a model,
we use a linear function to describe the relationship between two random
variables. In the following example, we first generate a normally distributed
random numeric vector x. Then we define the function f(x) = 3 + 2 * x.
Finally, we generate y by adding some independent noise to f(x).
x <- rnorm(100)
f <- function(x) 3 + 2 * x
y <- f(x) + 0.5 * rnorm(100)
Let’s assume that we do not know the underlying relationship between x and
y. We can use a linear model to explore the relationship between the two
variables, which means we need to estimate the coefficients of the linear
function. To do so, we use lm() to fit a simple linear model to x and y.
The formula y ~ x denotes the linear regression of the dependent variable y
on the independent variable x. The output of the fit is shown below.
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
      3.036        1.995
The coefficients obtained by fitting the model are 3.036 (intercept) and
1.995 (slope), which are close to the true coefficients 3 (intercept) and
2 (slope).
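The fitting call itself is not reproduced in this part of the text. A minimal sketch of the fit follows, assuming the model object is stored under the name linear_model. The seed is an assumption, so the estimates will differ slightly from the values quoted above.

```r
set.seed(123)  # assumption: fixed seed, not the one used for the printed output

# Simulated data: a linear signal plus independent noise
x <- rnorm(100)
f <- function(x) 3 + 2 * x
y <- f(x) + 0.5 * rnorm(100)

# Fit y as a linear function of x and print the estimated coefficients
linear_model <- lm(y ~ x)
linear_model
```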
If we want to access the coefficients of the model, we can use the following
code.
coef(linear_model)
(Intercept) x
3.036328 1.994926
linear_model is a list, so we can also access the coefficients with the $ operator.
> linear_model$coefficients
(Intercept) x
3.036328 1.994926
> summary(linear_model)
Call:
lm(formula = y ~ x)
Residuals:
     Min       1Q   Median       3Q      Max
-1.39226 -0.31731  0.01711  0.28940  1.26922
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.03633    0.04980   60.97   <2e-16 ***
x            1.99493    0.05589   35.69   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
You can refer to the Machine Learning course to learn more about how to
interpret the summary.
You can plot the data and the regression line using the following code.
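The plotting code is not reproduced in this part of the text. One common way to draw the data together with the fitted line is to pass the model to abline(), as sketched here. The seed, title, and line color are assumptions.

```r
set.seed(123)  # assumption: fixed seed for reproducibility
x <- rnorm(100)
y <- 3 + 2 * x + 0.5 * rnorm(100)
linear_model <- lm(y ~ x)

# Scatter plot of the data
plot(x, y, main = "Simple linear regression")

# abline() accepts a fitted lm object and draws its regression line
abline(linear_model, col = "blue", lwd = 2)
```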
Now we can call the predict() function to make predictions using the fitted
model. We can predict y with standard errors when x = -1 and x = 0.5
using the following code.
$se.fit
1 2
0.0772407 0.0554981
$df
[1] 98
$residual.scale
[1] 0.4969659
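The prediction call is not reproduced in this part of the text. A sketch of how such output can be produced is given below: predict() takes the new x values as a data frame, and se.fit = TRUE requests the standard errors. The seed is an assumption, so the numbers will differ from those printed above.

```r
set.seed(123)  # assumption: fixed seed, not the one used for the printed output
x <- rnorm(100)
y <- 3 + 2 * x + 0.5 * rnorm(100)
linear_model <- lm(y ~ x)

# New inputs must use the same variable name as in the formula
new_x <- data.frame(x = c(-1, 0.5))

# Returns a list with fit, se.fit, df and residual.scale
pred <- predict(linear_model, newdata = new_x, se.fit = TRUE)
pred$fit
pred$se.fit
```

With 100 observations and two estimated coefficients, the df element is 98, which matches the output above.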
Now we can look into the real-world data set nycflights13. We can analyze
the air time of a flight with linear models built on different sets of input
variables. We start with distance because it is the most important variable
for explaining air time.
After loading the data set, we first make a scatter plot of air_time against
distance. Since the number of records in the data set is large, we use
pch = "." so that each record is drawn as a very small dot.
data("flights", package = "nycflights13")
plot(air_time ~ distance, data = flights,
pch = ".",
main = "Plot - Flight Speed",
ylab = "Air Time",
xlab = "Distance")
The plot suggests that there is a positive correlation between the two
variables. Hence, we can use a linear model to fit the data.
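The fitting call is not reproduced in this part of the text. The sketch below shows one way to fit such a model, assuming the result is stored as lm_model. The helper fit_air_time and the toy data frame are hypothetical and are included only so that the sketch runs without the package.

```r
# Hypothetical helper wrapping the fit
fit_air_time <- function(df) lm(air_time ~ distance, data = df)

# Made-up rows; the NA shows that incomplete records are dropped by lm()
toy <- data.frame(distance = c(200, 500, 1000, 1500, 2500),
                  air_time = c(45, 80, NA, 210, 330))
toy_model <- fit_air_time(toy)
nobs(toy_model)  # number of rows actually used in the fit

# With the real data (runs only if nycflights13 is installed)
if (requireNamespace("nycflights13", quietly = TRUE)) {
  lm_model <- fit_air_time(nycflights13::flights)
  print(summary(lm_model))
}
```

By default lm() omits rows with missing values, which is what the note about observations deleted due to missingness in the summary refers to.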
> summary(lm_model)
Call:
lm(formula = air_time ~ distance, data = flights)
Residuals:
    Min      1Q  Median      3Q     Max
-82.397  -7.334  -1.320   6.513 145.389
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.847e+01  3.888e-02   474.9   <2e-16 ***
distance    1.261e-01  3.036e-05  4154.4   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.78 on 327344 degrees of freedom
(9430 observations deleted due to missingness)
Multiple R-squared: 0.9814, Adjusted R-squared: 0.9814
F-statistic: 1.726e+07 on 1 and 327344 DF, p-value: < 2.2e-16
Decision Tree
A decision tree is a graphical representation of choices and their results.
We use decision trees for tasks such as predicting whether an email is spam
or not, whether a tumour is cancerous or not, or whether a loan is good or
bad.
First of all, we need to install the party package by executing the following
command in the R console.
install.packages("party")
We use the ctree() function to create and analyse the decision tree. The
basic syntax for creating a decision tree with ctree() is given below:
ctree(formula, data)
In this example, we will use the readingSkills data set, which is available
once the party package is loaded, to build a decision tree. The data set
records the reading skill score of several individuals together with their
age, shoe size, and whether they are native speakers. We will try to predict
whether an individual is a native speaker based on age, shoeSize, and score.
Before we fit the data to the decision tree model, let’s review the data.
library(party)
print(head(readingSkills))
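The code that fits and plots the tree is not reproduced in this part of the text. A minimal sketch, assuming the response column is nativeSpeaker and that all rows are used for fitting, could look like this. The object name output_tree is hypothetical, and the block runs only if the party package is installed.

```r
if (requireNamespace("party", quietly = TRUE)) {
  library(party)

  # Predict nativeSpeaker from the three other columns
  output_tree <- ctree(nativeSpeaker ~ age + shoeSize + score,
                       data = readingSkills)

  # Draw the fitted tree
  plot(output_tree)

  # Compare fitted classes with the observed ones
  predicted <- predict(output_tree)
  print(table(predicted, observed = readingSkills$nativeSpeaker))
}
```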
From the decision tree, we can conclude that a person whose score is less
than 38.306 and who is older than 6 years is not a native speaker.