
MIT School of Computing

Department of Computer Science & Engineering

Third Year Engineering

21BTCS007-Data Modelling & Visualization

Class - T.Y. (SEM-II)

Unit - II

Data Visualization using R


AY 2024-2025 SEM-II


Syllabus
1. Introduction to R Programming
1.1 Basics of R Programming
1.2 Data Types and Structures in R
1.3 Essential R Programming Constructs
1.4 Functions in R
1.5 Importing and Exporting Data
2. Visualization Using R
2.1 Overview of Visualization in R
2.2 Creating Basic Visualizations
2.3 Advanced Visualizations Using ggplot2
2.4 Interactive Visualizations
3. Transformation Using R
3.1 Data Transformation Basics
3.2 Aggregation and Summarization
3.3 Reshaping and Pivoting Data
3.4 Joining and Merging Datasets
3.5 String Manipulations and Regular Expressions
4. Exploratory Data Analysis (EDA)
4.1 Fundamentals of EDA
4.2 Summary Statistics
4.3 Detecting Missing Data and Outliers
4.4 Data Visualization for EDA
4.5 Correlation and Relationships

1. Introduction to R Programming
1.1 Basics of R Programming:
• The R language is a powerful tool for statistical computing and data analysis.
• It is widely used by statisticians, data scientists, and researchers.
• It offers an extensive suite of packages and libraries tailored for data manipulation, statistical modelling, and visualization.
• It is an implementation of the S programming language.
• It is a leading tool for machine learning, statistics, and data analysis, and makes it easy to create objects, functions, and packages.


• R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland and is developed by the R Development Core Team.
• Beyond its capabilities as a statistical package, R integrates with other languages such as C and C++, which facilitates interaction with various data sources and statistical tools.


There are several reasons why professionals across various fields prefer R:
1. Comprehensive Statistical Analysis:
• R is specifically designed for statistical analysis and provides a vast array of statistical techniques and tests, making it ideal for data-driven research.
2. Extensive Packages and Libraries:
• R has a rich ecosystem of packages and libraries that extend its capabilities, allowing users to perform advanced data manipulation, visualization, and machine learning tasks with ease.
3. Strong Data Visualization Capabilities:
• R excels in data visualization, offering powerful tools like ggplot2 and plotly, which enable the creation of detailed and aesthetically pleasing graphs and plots.

4. Open Source and Free:


• As an open-source language, R is free to use, which makes it
accessible to everyone, from individual researchers to large
organizations, without the need for costly licenses.

5. Platform Independence:
• The R Language is platform-independent, meaning it can run on
various operating systems, including Windows, macOS, and
Linux, providing flexibility in development environments.


6. Integration with Other Languages:


• R can easily integrate with other programming languages such as C,
C++, Python, and Java, allowing for seamless interaction with different
data sources and statistical packages.

7. Growing Community and Support:


• R has a large and active community of users and developers
who contribute to its continuous improvement and provide extensive
support through forums, mailing lists, and online resources.

8. High Demand in Data Science:


• R is one of the most requested programming languages in the Data
Science job market, making it a valuable skill for professionals looking
to advance their careers in this field.

Features of R Programming Language


The R Language is renowned for its extensive features that make it a
powerful tool for data analysis, statistical computing, and visualization.
Here are some of the key features of R:
1. Comprehensive Statistical Analysis:
• R language provides a wide array of statistical techniques, including
linear and nonlinear modelling, classical statistical tests, time-series
analysis, classification, and clustering.
2. Advanced Data Visualization:
• With packages like ggplot2, plotly, and lattice, R excels at creating
complex and aesthetically pleasing data visualizations, including plots,
graphs, and charts.
3. Extensive Packages and Libraries:
• The Comprehensive R Archive Network (CRAN) hosts thousands of
packages that extend R’s capabilities in areas such as machine
learning, data manipulation, bioinformatics, and more.

4. Open Source and Free:


• R is free to download and use, making it accessible to everyone. Its
open-source nature encourages community contributions and
continuous improvement.
5. Platform Independence:
• R is platform-independent, running on various operating systems,
including Windows, macOS, and Linux, which ensures flexibility and
ease of use across different environments.
6. Integration with Other Languages:
• R language can integrate with other programming languages such as C,
C++, Python, Java, and SQL, allowing for seamless interaction with
various data sources and computational processes.
7. Powerful Data Handling and Storage:
• R efficiently handles and stores data, supporting various data types and
structures, including vectors, matrices, data frames, and lists.

8. Robust Community and Support:


• R has a vibrant and active community that provides extensive support
through forums, mailing lists, and online resources, contributing to its
rich ecosystem of packages and documentation.

9. Interactive Development Environment (IDE):


• RStudio, the most popular IDE for R, offers a user-friendly interface
with features like syntax highlighting, code completion, and integrated
tools for plotting, history, and debugging.

10. Reproducible Research:


• R supports reproducible research practices with tools like R Markdown
and Knitr, enabling users to create dynamic reports, presentations, and
documents that combine code, text, and visualizations.

Advantages of R language
• R is one of the most comprehensive statistical analysis packages. New techniques and concepts often appear first in R.
• R is open source, so you can run it anywhere and at any time.
• R is suitable for GNU/Linux and Windows operating systems.
• R is cross-platform and runs on all major operating systems.
• Everyone is welcome to contribute new packages, bug fixes, and code enhancements to R.


Disadvantages of R language

• The quality of some R packages is less than perfect.
• R commands give little thought to memory management, so R may consume all available memory.
• R has no vendor to complain to if something does not work.
• R is often slower than other programming languages such as Python and MATLAB.


Applications of R language

• R is used for data science. It provides a broad variety of libraries related to statistics, as well as an environment for statistical computing and design.
• Many quantitative analysts use R as their programming tool, where it helps with data importing and cleaning.
• R is widely used by data analysts and research programmers, and it serves as a fundamental tool in finance.
• Companies such as Google, Facebook, Bing, Twitter, Accenture, Wipro, and many more use R.


1.2 Data Types and Structures in R


• Variables are reserved memory locations used to store values. When we create a variable in a program, some space is reserved in memory.
• Variables can hold data of different types, such as integer, string, etc. Memory is allocated based on the data type of the variable, which also decides what can be stored in the reserved memory.
Data type | Example | Description
Logical | TRUE, FALSE | A special data type for data with only two possible values, which can be construed as true/false.
Numeric | 12, 32, 112, 5432 | A decimal value is called numeric in R, and it is the default computational data type.
Integer | 3L, 66L, 2346L | Here, L tells R to store the value as an integer.
Complex | z = 1+2i, t = 7+3i | A complex value in R has a real part and an imaginary part, written with the pure imaginary unit i.
Character | 'a', "good", "TRUE", '35.4' | In R programming, a character is used to represent string values. We convert objects into character values with the help of the as.character() function.
Raw | | A raw data type is used to hold raw bytes.
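The data types in the table can be checked directly with class(). The following is a minimal sketch; the sample values and the variable names (values, type_names, as_text) are chosen here for illustration and do not come from the slides.

```r
# Inspect the class of one value of each basic data type.
values <- list(
  logical   = TRUE,
  numeric   = 12.5,
  integer   = 3L,
  complex   = 1 + 2i,
  character = "good",
  raw       = charToRaw("A")   # a single raw byte
)

# sapply() applies class() to every element and returns a named character vector.
type_names <- sapply(values, class)
print(type_names)

# as.character() converts a value of another type into a character value.
as_text <- as.character(35.4)
print(as_text)
```

Note that a plain number such as 12 is stored as numeric; only the L suffix produces an integer.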

The most essential data structures used in R include:

• Vectors
• Lists
• Dataframes
• Matrices
• Arrays
• Factors
• Tibbles


Vectors
• A vector is an ordered sequence of items that are all of the same type.
• To combine items into a vector, use the c() function and separate the items with commas.
Examples:
# Vector of strings
fruits <- c("banana", "apple", "orange")

# Print fruits
fruits

In this example, we create a vector that combines numerical values:


Example
# Vector of numerical values
numbers <- c(1, 2, 3)

# Print numbers
numbers
Output:
[1] 1 2 3

Vector Length
To find out how many items a vector has, use the length() function:
Example
fruits <- c("banana", "apple", "orange")

length(fruits)
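Because a vector holds items of a single type, c() converts mixed inputs to one common type. The sketch below illustrates this, along with positional indexing; the variable names mixed, first_fruit, and last_fruit are illustrative.

```r
# Mixing types in c() coerces every element to one common type.
mixed <- c(1, "apple", TRUE)
print(class(mixed))

# Elements are accessed by position, and positions start at 1.
fruits <- c("banana", "apple", "orange")
first_fruit <- fruits[1]
last_fruit  <- fruits[length(fruits)]
print(first_fruit)
print(last_fruit)
```

Here the number and the logical value are both turned into strings, so mixed is a character vector.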


Sort a Vector
To sort items in a vector alphabetically or numerically, use
the sort() function:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
numbers <- c(13, 3, 5, 7, 20, 2)

sort(fruits) # Sort a string
sort(numbers) # Sort numbers

Output:
[1] "apple"  "banana" "lemon"  "mango"  "orange"
[1]  2  3  5  7 13 20
Lists
A list in R can contain many different data types. A list is a collection of data that is ordered and changeable.
Lists are also one-dimensional data structures. A list can be a list of vectors, a list of matrices, a list of characters, a list of functions, and so on.
To create a list, use the list() function:
# R program to create a list

# The first attribute is a numeric vector
# containing the employee IDs
empId <- c(1, 2, 3, 4)

# The second attribute is a character vector
# containing the employee names
empName <- c("Debi", "Sandeep", "Subham", "Shiba")

# The third attribute is the number of employees,
# which is a single numeric value
numberOfEmp <- 4

# Combine these three different data types
# into a list containing the details of employees
empList <- list(empId, empName, numberOfEmp)

empList

Output
[[1]]
[1] 1 2 3 4

[[2]]
[1] "Debi"    "Sandeep" "Subham"  "Shiba"

[[3]]
[1] 4

Example 2: Creating a list with different data types

list_data <- list("Shubham", "Arpita", c(1, 2, 3, 4, 5), TRUE, FALSE, 22.5, 12L)
list_data

In the above example, the list() function creates a list with character, numeric vector, logical, numeric, and integer elements. It gives the following output:
[[1]]
[1] "Shubham"

[[2]]
[1] "Arpita"

[[3]]
[1] 1 2 3 4 5

[[4]]
[1] TRUE

[[5]]
[1] FALSE

[[6]]
[1] 22.5

[[7]]
[1] 12
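Once a list exists, its elements are usually read back by position or by name. The following is a small sketch of that, assuming a named list; the element names (name, scores, active) and the variable student are hypothetical.

```r
# A named list; the element names are illustrative.
student <- list(name = "Shubham", scores = c(1, 2, 3, 4, 5), active = TRUE)

# [[ ]] or $ extracts the element itself.
student_name <- student[["name"]]
mean_score   <- mean(student$scores)

# [ ] returns a shorter list, not the element.
sub_list <- student["scores"]

print(student_name)
print(mean_score)
print(class(sub_list))
```

The difference between [[ ]] and [ ] matters in practice: mean(student["scores"]) would not work as intended, because a list is passed instead of a numeric vector.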


R – Array
• Arrays are data storage structures defined by a fixed number of dimensions. Arrays are used for the allocation of space at contiguous memory locations.
• In R, one-dimensional arrays are called vectors, with the length being their only dimension. Two-dimensional arrays are called matrices, consisting of fixed numbers of rows and columns.
• An R array can be created with the array() function. A vector of elements is passed to array() along with the required dimensions.
• Syntax:
array(data, dim = c(nrow, ncol, nmat), dimnames = names)
where
nrow: Number of rows
ncol: Number of columns
nmat: Number of matrices, each with nrow rows and ncol columns
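As a minimal sketch of the array() syntax, the example below builds a three-dimensional array and reads one element from it. The data, the dimension names, and the variable names arr and element are illustrative.

```r
# Build a 2 x 3 x 2 array from the values 1 to 12.
# The dimension names below are illustrative.
arr <- array(
  1:12,
  dim = c(2, 3, 2),
  dimnames = list(c("r1", "r2"), c("c1", "c2", "c3"), c("m1", "m2"))
)

print(dim(arr))

# Index as [row, column, matrix]: row 2, column 3 of the second matrix.
element <- arr[2, 3, 2]
print(element)
```

The values fill the first matrix column by column before moving on to the second matrix.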

R – Matrices
A matrix in R is a two-dimensional arrangement of data in rows and columns.
In a matrix, rows run horizontally and columns run vertically. In R programming, matrices are two-dimensional, homogeneous data structures.

Creating a Matrix in R
To create a matrix in R, use the matrix() function. Its first argument is the vector of elements. You also pass the number of rows and the number of columns you want in the matrix.

Syntax to Create an R Matrix
matrix(data, nrow, ncol, byrow, dimnames)
Parameters:
• data – values you want to enter
• nrow – number of rows
• ncol – number of columns
• byrow – logical flag; if TRUE, values are assigned by rows
• dimnames – names of rows and columns

# R program to create a matrix
A = matrix(
  # Taking sequence of elements
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),

  # No of rows
  nrow = 3,

  # No of columns
  ncol = 3,

  # By default matrices are in column-wise order
  # So this parameter decides how to arrange the matrix
  byrow = TRUE
)

# Naming rows
rownames(A) = c("a", "b", "c")

# Naming columns
colnames(A) = c("c", "d", "e")

cat("The 3x3 matrix:\n")
print(A)
Output:
The 3x3 matrix:
  c d e
a 1 2 3
b 4 5 6
c 7 8 9
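For comparison, the sketch below fills the same values column by column (the default when byrow is not set) and shows how elements are indexed. The names B, b_12, second_row, and A_like are illustrative.

```r
# Same values as the matrix above, but filled column by column (the default).
B <- matrix(1:9, nrow = 3, ncol = 3)

# Index as [row, column].
b_12 <- B[1, 2]        # first row, second column
second_row <- B[2, ]   # whole second row as a vector

# t() transposes, which here reproduces the row-wise layout.
A_like <- t(B)

print(B)
print(b_12)
print(second_row)
```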

R Factors
• Factors in R are data structures used to represent categorical data and store it as a set of levels.
• They are stored as integers, with a corresponding label for every unique integer.
• Although R factors may look similar to character vectors, they are integers, and care must be taken when using them as strings.
• An R factor accepts only a restricted number of distinct values. For example, a data field such as gender may contain values only from female and male.
• Arguments of factor() in R
• x: The vector that needs to be converted into a factor.
• levels: The set of distinct values that the input vector x may take.
• labels: A character vector of labels for the levels.
• exclude: The values you want to exclude when forming the levels.
• ordered: A logical value that decides whether the levels are ordered.
• nmax: An upper limit for the maximum number of levels.
Creating a Factor in R Programming Language


The command used to create or modify a factor in R is factor(), with a vector as input.
The two steps to creating an R factor:
• Creating a vector
• Converting the vector into a factor using the factor() function
Example: Let us create a factor gender with levels female and male.
# Creating a vector
x <- c("female", "male", "male", "female")
print(x)

# Converting the vector x into a factor
# named gender
gender <- factor(x)
print(gender)
Output
[1] "female" "male"   "male"   "female"
[1] female male   male   female
Levels: female male

# Create a factor
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

# Print the factor
music_genre
Result:
[1] Jazz    Rock    Classic Classic Pop     Jazz    Rock    Jazz
Levels: Classic Jazz Pop Rock

You can see from the example above that the factor has four levels
(categories): Classic, Jazz, Pop and Rock.
To print only the levels, use the levels() function.
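A short sketch of levels() on the music_genre factor follows, together with an ordered factor built with the levels and ordered arguments. The sizes data and the variable names genre_levels and genre_counts are illustrative and not taken from the slides.

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

# levels() lists the categories; table() counts the items in each.
genre_levels <- levels(music_genre)
genre_counts <- table(music_genre)
print(genre_levels)
print(genre_counts)

# An ordered factor: the values here are illustrative.
sizes <- factor(c("low", "high", "medium", "low"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)

# Ordered factors support comparisons between levels.
print(sizes < "high")
```

Without the levels argument, the levels are sorted alphabetically, which is why Classic comes first in music_genre.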

Data Frames
• Data frames in R are generic data objects used to store tabular data.
• A data frame displays data in the format of a table.
• A data frame can hold different types of data. While the first column can be character, the second and third can be numeric or logical. However, all values within a single column must have the same type.
• Use the data.frame() function to create a data frame:
Example
# Create a data frame
Data_Frame <- data.frame (
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)
Data_Frame
Output:
  Training Pulse Duration
1 Strength   100       60
2  Stamina   150       30
3    Other   120       45
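The columns and rows of a data frame can be read back with $ and [ , ]. The sketch below reuses the Data_Frame example; the variable names mean_pulse and long_sessions, and the threshold of 40, are illustrative choices.

```r
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse    = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# $ extracts one column as a vector.
mean_pulse <- mean(Data_Frame$Pulse)

# [rows, columns] selects a subset; here, rows whose Duration exceeds 40.
long_sessions <- Data_Frame[Data_Frame$Duration > 40, ]

# str() shows the type of each column.
str(Data_Frame)
print(mean_pulse)
print(long_sessions)
```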
1.3 Essential R Programming Constructs

Conditions and If Statements

Operator | Name | Example
== | Equal | x == y
!= | Not equal | x != y
> | Greater than | x > y
< | Less than | x < y
>= | Greater than or equal to | x >= y
<= | Less than or equal to | x <= y
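Each comparison operator returns a logical value, which is what an if statement tests. A minimal sketch, with x, y, and the element names of results chosen for illustration:

```r
x <- 5
y <- 10

# Each comparison returns TRUE or FALSE.
results <- c(
  equal         = x == y,
  not_equal     = x != y,
  greater       = x > y,
  less          = x < y,
  greater_equal = x >= y,
  less_equal    = x <= y
)
print(results)

# Comparisons are vectorised: each element is compared in turn.
print(c(1, 5, 10) >= x)
```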

The if Statement
• An "if statement" is written with the if keyword, and it is used to
specify a block of code to be executed if a condition is TRUE:
Example
a <- 33
b <- 200
if (b > a) {
print("b is greater than a")
}


Working of if-else statements in R Programming:
Syntax of if-else statement in R Language
if (condition) {
  # code to be executed if condition is TRUE
} else {
  # code to be executed if condition is FALSE
}

x <- 5

# Check whether the value is greater than 10
if (x > 10) {
  print(paste(x, "is greater than 10"))
} else {
  print(paste(x, "is less than 10"))
}
Output
[1] "5 is less than 10"

# define a variable
x <- 15

# check the value of x using nested if-else statements
if (x < 10) {
  # if x is less than 10
  print("x is less than 10")
} else {
  # if x is greater than or equal to 10
  if (x < 20) {
    # if x is less than 20
    print("x is between 10 and 20")
  } else {
    # if x is greater than or equal to 20
    print("x is greater than or equal to 20")
  }
}
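The same nested check can also be written as a flat else if chain, which is often easier to read. The sketch below wraps it in a function so several values can be tried; the function name describe_x is hypothetical.

```r
# Wrapping the checks in a function (describe_x is a hypothetical name)
# makes it easy to try several values.
describe_x <- function(x) {
  if (x < 10) {
    "x is less than 10"
  } else if (x < 20) {
    "x is between 10 and 20"
  } else {
    "x is greater than or equal to 20"
  }
}

print(describe_x(15))
```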

There are three types of loops in R programming:
•For Loop
•While Loop
•Repeat Loop
For Loop in R
A for loop is a control statement that runs a statement or a set of statements multiple times. It is commonly used to iterate over the items of a sequence.
R – For loop Syntax:
for (value in sequence) {
  statement
}
Example:
# assigning strings to the vector
week <- c('Sunday',
          'Monday',
          'Tuesday',
          'Wednesday',
          'Thursday',
          'Friday',
          'Saturday')
# using for loop to iterate
# over each string in the vector
for (day in week) {
  # displaying each string in the vector
  print(day)
}
R – While loop Syntax:
while (condition) { statement }
# R program to demonstrate the use of while loop
val = 1

# using while loop
while (val <= 5) {
  # statements
  print(val)
  val = val + 1
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Repeat Loop in R
• It is a simple loop that runs the same statement or group of statements repeatedly until a stop condition is encountered.
• A repeat loop has no built-in condition to terminate the loop. The programmer must place a condition within the loop’s body and use a break statement to terminate the loop.
• If no such condition is present in the body of the repeat loop, it iterates infinitely.
R – Repeat loop Syntax:
repeat {
  statement
  if (condition) {
    break
  }
}
# R program to demonstrate the use of repeat loop
val = 1
# using repeat loop
repeat {
  # statements
  print(val)
  val = val + 1
  # checking stop condition
  if (val > 5) {
    # using break statement
    # to terminate the loop
    break
  }
}
R Functions
• A function is a set of statements organized together to perform a specific task. R provides a series of built-in functions and also allows users to create their own. Functions are used to perform tasks in a modular way.
• An R function is created by using the keyword function. The syntax of an R function is:
func_name <- function(arg_1, arg_2, ...) {
  Function body
}
• Information can be passed into functions as arguments.
• Arguments are specified after the function name, inside the parentheses. You can add as many arguments as you want; just separate them with commas.

Example
my_function <- function(fname) {
  paste(fname, "Griffin")
}

my_function("Peter")
my_function("Lois")
my_function("Stewie")

Output
[1] "Peter Griffin"
[1] "Lois Griffin"
[1] "Stewie Griffin"
Function Types
• As in other languages, R has two types of functions: built-in functions and user-defined functions.
• R has many built-in functions that we can call directly in a program without defining them.
• R also allows us to create our own functions.
Built-in function
• Functions that are already created or defined in the programming framework are known as built-in functions.
• Users do not need to create these functions, because they are built into R.
• Users can access these functions by simply calling them. R has many built-in functions, such as seq(), mean(), max(), and sum().
# Creating sequence of numbers from 32 to 46.
print(seq(32, 46))

# Finding the mean of numbers from 22 to 80.
print(mean(22:80))

# Finding the sum of numbers from 41 to 70.
print(sum(41:70))
User-defined function
• R allows us to create our own functions. A user-defined function is written by the user to meet a specific requirement. Once created, it can be used like a built-in function.
# function to add 2 numbers
add_num <- function(a, b) {
  sum_result <- a + b
  return(sum_result)
}
# calling add_num function
# (the result is named total to avoid masking the built-in sum())
total <- add_num(35, 34)
# printing result
print(total)
2. Visualization Using R
2.1 Overview of Visualization in R
2.2 Creating Basic Visualizations
2.3 Advanced Visualizations Using ggplot2
2.4 Interactive Visualizations
Data visualization
• Data visualization is the technique used to deliver insights from data using visual cues such as graphs, charts, maps, and many others.
• It is useful because it supports an intuitive and easy understanding of large quantities of data, and thereby better decisions about that data.
• Popular data visualization tools include Tableau, Plotly, R, Google Charts, Infogram, and Kibana.
• The various data visualization platforms have different capabilities, functionality, and use cases.
• They also require different skill sets. This unit discusses the use of R for data visualization.

• Data visualization serves as an indispensable tool in data exploration, inference making, and results presentation.
• It transforms complex data sets into intuitive graphical representations, facilitating a deeper understanding of the data and enabling the communication of insights in a universally comprehensible manner.
• R is usually preferred for data visualization because it offers flexibility.
Histogram
• Histograms are graphical representations of data distribution, with
vertical rectangles depicting the frequencies of different value
ranges.
• They are drawn on a natural scale, making it easy to interpret the
central tendency, such as the mode, of the data.
• Despite their simplicity and ease of understanding, histograms have
a limitation: they can only represent one data distribution per axis.

Q. Given a dataset containing the heights of girls in class XII, construct a histogram to visualize the distribution of heights.
141, 145, 142, 147, 144, 148, 141, 142, 149, 144, 143, 149, 146, 141, 147, 142, 143
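One possible way to approach this question is with the base R hist() function. This is only a sketch: the bin edges, title, axis labels, and colour are assumptions made here, and other choices are equally valid.

```r
heights <- c(141, 145, 142, 147, 144, 148, 141, 142, 149,
             144, 143, 149, 146, 141, 147, 142, 143)

h <- hist(heights,
          breaks = seq(140, 150, by = 2),  # bin edges: an assumed choice
          main = "Heights of girls in class XII",
          xlab = "Height",
          ylab = "Number of students",
          col = "lightblue")

# hist() also returns the bin counts that it plotted.
print(h$counts)
```

Changing the breaks argument changes the bin width, which can noticeably alter the shape of the histogram.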
Commonly used graphical parameters include:
•xlim and ylim: These parameters set the limits of the x and y axes, respectively. They take a vector of two values: the minimum and maximum values of the axis.
•xlab and ylab: These parameters set the labels of the x and y axes, respectively. They take a string value.
•main: This parameter sets the main title of the plot. It takes a string value.
•col: This parameter sets the color of points or lines in a plot. It takes a string value specifying a color name or a code in the format "#RRGGBB", where RR, GG, and BB are the red, green, and blue components of the color, respectively.
•lwd: This parameter sets the width of lines in a plot. It takes a numeric value.
•sub: This parameter sets the sub-title of the plot.
•pch: This parameter sets the plotting character for points in a plot. It takes an integer value between 0 and 25.
•lty: This parameter changes the line type of the plot.
•font: This parameter sets the font style of text in the plot. We can make the text bold, italic, bold italic, etc.
•cex: It is a short form for character expansion. This parameter sets the size of elements in the plot, such as points or text. cex takes a numeric value, with 1 being the default size.


•bty: This parameter customizes the box drawn around the plot. Several letters are possible; the shape of the letter represents the boundaries that are drawn:
•o: complete box (default)
•n: no box
•7: top + right
•l: bottom + left
•c: top + left + bottom
•u: left + bottom + right
plot(x, y, bty = "c")
plot(x, y, bty = "u")
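The parameters above can be combined in a single plot() call. The sketch below uses made-up data, since x and y are not defined in the slides, and the specific values for each parameter are arbitrary illustrative choices.

```r
# Illustrative data; x and y are not defined in the slides.
x <- 1:10
y <- x^2

plot(x, y,
     type = "b",               # draw both points and lines
     main = "Graphical parameters",
     sub  = "Sub-title",
     xlab = "x", ylab = "y = x^2",
     xlim = c(0, 12), ylim = c(0, 110),
     col  = "#1F77B4",         # colour given as "#RRGGBB"
     pch  = 17,                # filled triangle
     lty  = 2,                 # dashed line
     lwd  = 2,                 # line width
     cex  = 1.2,               # symbols 20% larger than the default
     font.main = 2,            # bold main title
     bty  = "l")               # box on the left and bottom only
```

Changing one parameter at a time and re-running the call is a quick way to see what each one controls.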
barplot(H, xlab, ylab, main, names.arg, col)

Parameters:
•H: This parameter is a vector or matrix containing the numeric values used in the bar chart.
•xlab: This parameter is the label for the x axis in the bar chart.
•ylab: This parameter is the label for the y axis in the bar chart.
•main: This parameter is the title of the bar chart.
•names.arg: This parameter is a vector of names appearing under each bar in the bar chart.
•col: This parameter is used to give colors to the bars in the graph.
# plotting a bar-graph with title and color
barplot(c(1,3), main="Main title",
xlab="X axis title",
ylab="Y axis title",
col.main="red",
col.lab="blue")
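The names.arg and col parameters are not used in the example above, so here is a small sketch that does use them. The rainfall values are made up purely for illustration, and the names rainfall and mids are hypothetical.

```r
# Made-up values for illustration only; not real measurements.
rainfall <- c(Jan = 10, Feb = 25, Mar = 40)

mids <- barplot(rainfall,
                names.arg = names(rainfall),
                main = "Rainfall by month (illustrative)",
                xlab = "Month",
                ylab = "Rainfall",
                col  = c("steelblue", "orange", "darkgreen"))

# barplot() returns the x positions of the bar centres,
# which is handy for adding text near the top of each bar.
text(mids, rainfall, labels = rainfall, pos = 1)
```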
1. Write a program to draw a line chart using the plot() function.
2. Write a program to draw a bar chart to visualize the comparative rainfall data for 12 months.
3. Write a program to create a bar graph to illustrate the distribution of students from various schools who attended a seminar on “Deep Learning”. The total number of students from each school is provided below.
• A boxplot shows how the data in a data set are distributed.
• It is also useful for comparing the distribution of data across data sets by drawing a boxplot for each of them.
• Boxplots are created in R by using the boxplot() function.

Syntax
boxplot(x, data, notch, varwidth, names, main)

•x is a vector or a formula.
•data is the data frame.
•notch is a logical value. Set it to TRUE to draw a notch.
•varwidth is a logical value. Set it to TRUE to draw the width of the box proportionate to the sample size.
•names are the group labels which will be printed under each boxplot.
•main is used to give a title to the graph.
boxplot(mpg ~ cyl, data = mtcars,
xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data",
notch= TRUE, varwidth = TRUE,
col = c("green","yellow","purple"),
names = c("High","Medium","Low"))
4. Scatter Plot
• A scatter plot is a set of points, each representing an individual
observation, positioned along the horizontal and vertical axes.
• The values of two variables are plotted along the X-axis and
Y-axis, and the pattern of the resulting points reveals any
correlation between them.
• Scatter plots reveal correlations, whether positive or negative, within
paired data, showcasing trends and patterns.
• Scatter plots illustrate connections between variables through
ordered pairs, making them useful for analyzing paired
numerical data and situations where the dependent variable
varies across different values of the independent variable.
• Their strength lies in their ability to clearly depict trends,
clusters, and relationships within datasets.
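The correlation that a scatter plot shows visually can also be quantified. A minimal sketch, assuming the built-in mtcars data set and base R's cor(); the choice of the wt and mpg columns is only an illustration:

```r
# Pearson correlation between car weight and mileage (built-in mtcars data)
r <- cor(mtcars$wt, mtcars$mpg)
print(r)   # negative: heavier cars tend to have lower mileage

# The matching scatter plot, with a fitted straight line for reference
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Mileage",
     main = "Weight vs Mileage")
abline(lm(mpg ~ wt, data = mtcars), col = "red")
```

A value of r close to -1 or +1 corresponds to points lying close to a straight line, while a value near 0 corresponds to a shapeless cloud.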
Parameters of plot()
• Syntax: plot(x, y, main, xlab, ylab, xlim, ylim, axes)
• Parameters:
• x: This parameter sets the horizontal coordinates.

• y: This parameter sets the vertical coordinates.

• xlab: This parameter is the label for horizontal axis.

• ylab: This parameter is the label for vertical axis.

• main: This parameter is the title of the chart.

• xlim: This parameter sets the limits (range) of the x values shown on the plot.

• ylim: This parameter sets the limits (range) of the y values shown on the plot.

• axes: This parameter indicates whether both axes should be drawn
on the plot.
# Get the input values.
input <- mtcars[, c('wt', 'mpg')]

# Plot the chart for cars with
# weight between 1.5 and 4 and
# mileage between 10 and 25.
plot(x = input$wt, y = input$mpg,
     xlab = "Weight",
     ylab = "Mileage",
     xlim = c(1.5, 4),
     ylim = c(10, 25),
     main = "Weight vs Mileage"
)
# Plot the scatterplot matrix between
# 4 variables, giving 12 plots:
# each variable against the 3 others.
pairs(~wt + mpg + disp + cyl, data = mtcars,
      main = "Scatterplot Matrix")
Q. A student had a hypothesis for a science project. He believed that the
more the students studied Math, the better their math scores would be. He
took a poll in which he asked students the average number of hours that
they studied per week during a given semester. He then found out the
overall percentage that they received in their Math classes. His data is
shown in the table below:

• Write a program in R to make a scatter plot. The independent variable,
or input data, is the study time because the hypothesis is that the Math
grade depends on the study time. That means that the Math grade is
the dependent variable, or the output data. The input data is plotted on
the x-axis and the output data is plotted on the y-axis.
Pie Chart
• A pie chart is a circular graph divided into segments or sections,
each representing a relative proportion or percentage of the total.
• Each segment resembles a slice of pie, hence the name.
• Pie charts are commonly used to visualize data from a small table,
but it is recommended to limit the number of categories to seven
to maintain clarity.
• However, zero values cannot be depicted in pie charts.
• While useful for illustrating compositions or comparing parts of a
whole, pie charts can be challenging to interpret and compare
with data from other charts.
• They are not suitable for showing changes over time.
• Pie charts find applications in various domains such as business,
education, and personal finance.
• In business, they can indicate the success or failure of products or
services. In education, they can depict time allocations for
different subjects. At home, pie charts can help visualize monthly
expenses relative to income.
Create a Pie chart for the periods allotted for each subject in a week.

Write a program to draw a pie chart to visualize the
comparative rainfall data for 12 months in Nicobar using the
CSV file "district wise rainfall normal.csv".
• R Programming Language uses the function pie() to create pie charts. It takes
positive numbers as a vector input.
• Syntax: pie(x, labels, radius, main, col, clockwise)

• Parameters:
• x: This parameter is a vector that contains the numeric values
which are used in the pie chart.
• labels: This parameter gives the description to the slices in pie
chart.
• radius: This parameter is used to indicate the radius of the circle
of the pie chart (value between -1 and +1).
• main: This parameter represents the title of the pie chart.
• clockwise: This parameter contains the logical value which
indicates whether the slices are drawn clockwise or in the
anticlockwise direction.
• col: This parameter gives colors to the slices of the pie.
# Create data for the graph.
s <- c(23, 56, 20, 63)
labels <- c("Mumbai", "Pune", "Chennai", "Bangalore")

# Plot the chart.
pie(s, labels)

# Plot the chart again, labelling the slices with percentages.
piepercent <- round(100 * s / sum(s), 1)
pie(s, labels = piepercent, main = "City pie chart", col = rainbow(length(s)))
legend("topright", c("Mumbai", "Pune", "Chennai", "Bangalore"), cex = 0.5,
       fill = rainbow(length(s)))
Advanced Visualizations using ggplot2
• ggplot2, an implementation of the Grammar of Graphics, is a free,
open-source, and easy-to-use visualization package widely used in the
R Programming Language.
• Building blocks (layers) of the grammar of graphics:
• Data: The data set itself
• Aesthetics: The data is mapped onto aesthetic attributes such as
x-axis, y-axis, color, fill, size, labels, alpha, shape, line width, line type
• Geometries: How the data is displayed, using point, line,
histogram, bar, boxplot
• Facets: Display subsets of the data using columns and rows
• Statistics: Binning, smoothing, descriptive, intermediate
• Coordinates: The space between data and display, using Cartesian,
fixed, polar, limits
• Themes: Non-data ink
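To see how the building blocks combine, here is a minimal sketch that uses one function for each layer in a single plot. It assumes the ggplot2 package is installed and uses the built-in mtcars data; the particular columns, limits, and theme are illustrative choices only.

```r
library(ggplot2)

p <- ggplot(data = mtcars,                                    # Data
            aes(x = hp, y = mpg, colour = factor(cyl))) +     # Aesthetics
  geom_point() +                                              # Geometries
  facet_grid(. ~ am) +                                        # Facets
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +   # Statistics
  coord_cartesian(ylim = c(10, 35)) +                         # Coordinates
  theme_minimal() +                                           # Themes
  labs(title = "Miles per Gallon vs Horsepower")
p
```

Each `+` adds one layer, so any line after the first can be removed or swapped without touching the others.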
Structure of ggplot

Data Layer:
In the data layer we define the source of the information to be
visualized.

library(ggplot2)
library(dplyr)
ggplot(data = mtcars) +
  labs(title = "MTCars Data Plot")

Aesthetic Layer
Here we map the columns of the dataset onto certain aesthetics.
# Aesthetic Layer
ggplot(data = mtcars, aes(x = hp, y = mpg, col = disp)) +
  labs(title = "MTCars Data Plot")

Geometric layer:

The geometric layer controls the essential elements: how the data is
displayed, using point, line, histogram, bar, boxplot.
# Geometric layer
ggplot(data = mtcars, aes(x = hp, y = mpg, col = disp)) +
  geom_point() +
  labs(title = "Miles per Gallon vs Horsepower",
       x = "Horsepower",
       y = "Miles per Gallon")

• Geometric layer: Adding size, color, and
shape, and then plotting a histogram

# Adding size
ggplot(data = mtcars, aes(x = hp, y = mpg, size = disp)) +
  geom_point() +
  labs(title = "Miles per Gallon vs Horsepower",
       x = "Horsepower",
       y = "Miles per Gallon")

# Adding shape and color
ggplot(data = mtcars, aes(x = hp, y = mpg, col = factor(cyl),
                          shape = factor(am))) +
  geom_point() +
  labs(title = "Miles per Gallon vs Horsepower",
       x = "Horsepower",
       y = "Miles per Gallon")

# Histogram plot
ggplot(data = mtcars, aes(x = hp)) +
  geom_histogram(binwidth = 5) +
  labs(title = "Histogram of Horsepower",
       x = "Horsepower",
       y = "Count")
Facet Layer:
• The facet layer is used to split the data up into
subsets of the entire dataset, and it allows the subsets
to be visualized on the same plot. Here we separate
rows according to transmission type and separate
columns according to cylinders.
# Facet Layer
# Separate rows according to transmission type
p <- ggplot(data = mtcars, aes(x = hp, y = mpg, shape = factor(cyl))) +
  geom_point()

p + facet_grid(am ~ .) +
  labs(title = "Miles per Gallon vs Horsepower",
       x = "Horsepower",
       y = "Miles per Gallon")

# Separate columns according to cylinders
p <- ggplot(data = mtcars, aes(x = hp, y = mpg, shape = factor(cyl))) +
  geom_point()

p + facet_grid(. ~ cyl) +
  labs(title = "Miles per Gallon vs Horsepower",
       x = "Horsepower",
       y = "Miles per Gallon")
For the other layers, refer to https://www.geeksforgeeks.org/data-visualization-with-r-and-ggplot2/
Advanced Visualization using ggplot2

Heatmap
• A heatmap depicts the relationship between two
attributes of a data frame as color-coded tiles.
• A heatmap produces a grid over multiple
attributes of the data frame, representing the
relationship between the attributes taken two at
a time.
• In both data analysis and visualization, heatmaps
are a common visualization tool.
• They are especially beneficial for displaying and
examining relationships and patterns in tabular
data.
• Reference:
https://r-charts.com/correlation/heat-map-ggplot2/

• A heat map in ggplot2 can be created with geom_tile, passing
the categorical variables to the x and y arguments and the
continuous variable to the fill argument of aes.
• Depending on the size of the plotting window, the tiles might not be
square. If you want to keep them square, use coord_fixed.
• Border customization
• You can customize the border color, line width and line style of
the tiles with color, lwd and linetype, respectively.

# install.packages("ggplot2")
library(ggplot2)

ggplot(df, aes(x = x, y = y, fill = value)) +
  geom_tile(color = "black") +
  geom_text(aes(label = value), color = "white", size = 4) +
  coord_fixed()
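The heat map code refers to a data frame `df` with columns `x`, `y` and `value` that is not defined on the slide. A minimal sketch of one way to build such a data frame follows; the category names and the random values are hypothetical and serve only to make the example runnable.

```r
library(ggplot2)

# Hypothetical example data: every combination of two categorical
# variables, with one numeric value per cell
set.seed(1)
df <- expand.grid(x = LETTERS[1:4], y = paste0("var", 1:4))
df$value <- round(runif(nrow(df), min = 0, max = 10), 1)

heat <- ggplot(df, aes(x = x, y = y, fill = value)) +
  geom_tile(color = "black") +                               # one tile per (x, y) pair
  geom_text(aes(label = value), color = "white", size = 4) + # print the value on each tile
  coord_fixed()                                              # keep the tiles square
heat
```

expand.grid() produces the long format (one row per tile) that geom_tile expects, which is why a wide matrix would first have to be reshaped.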
• Adding the values
In addition, you can add the values over the tiles with geom_text,
passing the numerical variable to the label argument of the aes function.
• Color palette
There are three ways to change the default color palette used when creating
the heat map: using scale_fill_gradient, scale_fill_gradient2 or
scale_fill_gradientn. For example, with scale_fill_gradient:
scale_fill_gradient(low = "white", high = "red")


Mosaic Map
• Mosaic plots are used to show symmetries for tables that are
divided into two or more conditional distributions.
• Mosaic plots are a great way to visualize hierarchical data.
• A collection of rectangles of different sizes and colors represents
all the elements to be visualized, forming a table. What makes
mosaic charts unique is the arrangement of the elements: where
there is a hierarchy, those elements are collected and labeled
together, perhaps even with subcategories.
• So mosaic plots can be used for plotting categorical data very
effectively, with the area of each rectangle showing the relative
proportions.
• The package that is used for this is vcd.

Syntax:
mosaic(x, shade=NULL, legend=NULL, main=NULL, ...)
Parameters:
• x: The variable that holds the dataset/table.
We pass our dataset name here.
• shade: A boolean value; if it is set to TRUE then we
will get a colored plot. Its default value is NULL.
• legend: A boolean value; if it is set to TRUE then
we will be able to see a legend alongside our mosaic plot. Its
default value is NULL.
• main: A string; here we pass the title of our
mosaic plot.
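As a concrete illustration of the mosaic() syntax, here is a minimal sketch assuming the vcd package is installed. It uses the built-in Titanic contingency table; the choice of the Class and Survived variables and the title are illustrative only.

```r
library(vcd)

# Titanic is a built-in 4-way contingency table (Class, Sex, Age, Survived)
mosaic(~ Class + Survived, data = Titanic,
       shade = TRUE, legend = TRUE,
       main = "Survival on the Titanic by Class")

# The two-way table of counts that the plot above is drawn from
tab <- margin.table(Titanic, c(1, 4))
print(tab)
```

The area of each rectangle is proportional to the matching count in `tab`, so printing the table is a quick way to check the reading of the plot.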

library(ggmosaic)
data("fly")
head(fly)
ggplot(data = fly) +
  geom_mosaic(aes(x = product(rude_to_recline), fill = do_you_recline)) +
  theme_mosaic()

• Designed to create visualizations of categorical data, geom_mosaic() has
the capability to produce bar charts, stacked bar charts, mosaic plots,
and double decker plots, and therefore offers a wide range of potential plots.
In geom_mosaic(), the following aesthetics can be specified:
• weight: select a weighting variable
• x: select variables to add to the formula
  - declared as x = product(var2, var1, ...)
• alpha: add an alpha transparency to the selected variable
  - unless the variable is called in x, it will be added to the formula in the first position
• fill: select a variable to be filled
  - unless the variable is called in x, it will be added to the formula in the first position
    after the optional alpha variable.
• conds: select a variable to condition on
  - declared as conds = product(cond1, cond2, ...)
3D Graphs
3D scatterplots
• The scatterplot3d package can create 3D scatterplots in R
using its scatterplot3d() function.
• The example here uses plot_ly() from the plotly package
instead, which draws an interactive 3D scatterplot.
# 3D Scatterplot
library(plotly)
plot_ly(data = mtcars, x = ~mpg, y = ~hp, z = ~cyl, color = ~gear)
Reference: https://rpubs.com/oox/graphs (3D Scatterplots)
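Since the slide names scatterplot3d() but shows plotly code, here is a minimal sketch of the static alternative, assuming the scatterplot3d package is installed. The mtcars columns, color, and labels are illustrative choices.

```r
library(scatterplot3d)

# Static 3D scatterplot; type = "h" drops a vertical line from each point
s3d <- scatterplot3d(x = mtcars$wt, y = mtcars$disp, z = mtcars$mpg,
                     color = "steelblue", pch = 16, type = "h",
                     main = "3D Scatterplot of mtcars",
                     xlab = "Weight", ylab = "Displacement",
                     zlab = "Miles per Gallon")
```

The returned object holds helper functions such as `s3d$points3d()` for adding further points to the same plot.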
3D Pie Chart

• To create a 3D pie chart, load the plotrix
package and use its pie3D() function.
# Get the library.
library(plotrix)
# Create data for the graph.
geeks <- c(23, 56, 20, 63)
labels <- c("Mumbai", "Pune", "Chennai", "Bangalore")
piepercent <- round(100 * geeks / sum(geeks), 1)
# Plot the chart.
pie3D(geeks, labels = piepercent, main = "City pie chart",
      col = rainbow(length(geeks)))
legend("topright", c("Mumbai", "Pune", "Chennai", "Bangalore"),
       cex = 0.5, fill = rainbow(length(geeks)))
Map Visualizations
• R provides packages that make it easy to create maps and
visualize geospatial data.
• To create a map with R, you need to have
data that includes geospatial information,
such as latitude and longitude
coordinates or shapefiles. You can then
use packages like sf (or the older rgdal) to read
and manipulate these data, and finally
use one of the mapping packages to
visualize the data.
• The example uses the “sf” and “RColorBrewer” libraries.
• It loads the shapefile “nc”, which contains information on the
counties in North Carolina.
• It demonstrates how to plot a simple feature using the “plot”
function from the “sf” package.
• It plots the “AREA” attribute of the “nc” dataset, with the main
title “AREA”.
• The plot is styled with a color palette from the “brewer.pal”
function of the “RColorBrewer” library, which generates a
palette of 9 colors in a yellow-orange-red gradient
(“YlOrRd”).
• Finally, the values are divided into 9 bins using the “quantile”
option in the “breaks” argument, and the palette is applied
to color the bins with the “pal” argument.
library(sf)
library(RColorBrewer)
# Load data
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
# Plotting simple features (sf) with plot
plot(nc["AREA"], main = "AREA",
     breaks = "quantile", nbreaks = 9,
     pal = brewer.pal(9, "YlOrRd"))

Interactive Visualizations
• Interactive visualizations can be built with the Plotly and Leaflet
packages in combination with HTML widgets.
• These packages allow you to create interactive plots
directly from R code, without the need for a
separate web application framework.
• In the example, we create a scatter plot using the plot_ly() function,
specifying the iris dataset, the x-axis (Sepal.Length) and y-axis
(Petal.Length) variables, and the type of plot (scatter) with mode
= "markers".
• We also add color and size to the markers using
the add_markers() function.
• Finally, we add labels and a title to the plot using the layout()
function and display the plot by printing the plot object.
# Install packages
install.packages("plotly")
install.packages("leaflet")
# Load packages
library(plotly)
library(leaflet)
data(iris)
# Create a scatter plot
plot <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Petal.Length,
                type = "scatter", mode = "markers")
# Add color and size to markers
plot <- plot %>% add_markers(color = ~Species, size = ~Sepal.Width * 2)
# Add labels and title
plot <- plot %>% layout(xaxis = list(title = "Sepal Length"),
                        yaxis = list(title = "Petal Length"),
                        title = "Iris Dataset")
# Display the plot
plot
Reference: https://www.geeksforgeeks.org/interactive-charts-using
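The leaflet package is loaded in the example but not otherwise used. A minimal sketch of an interactive map with it follows, assuming the leaflet package is installed; the city coordinates are approximate and are given only for illustration.

```r
library(leaflet)

# Approximate coordinates, chosen only for illustration
cities <- data.frame(
  name = c("Mumbai", "Pune"),
  lat  = c(19.08, 18.52),
  lng  = c(72.88, 73.86)
)

m <- leaflet(data = cities) %>%
  addTiles() %>%                                     # default OpenStreetMap tiles
  addMarkers(lng = ~lng, lat = ~lat, popup = ~name)  # one clickable marker per row
m
```

Printing `m` opens the map as an HTML widget that can be panned and zoomed; clicking a marker shows the city name.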
library(plotly)

# Sample data for 3D scatter plot
scatter3d_data <- data.frame(
  X = rnorm(100),
  Y = rnorm(100),
  Z = rnorm(100),
  Group = rep(c("Group A", "Group B"), each = 50)
)

# Create an interactive 3D scatter plot
# (the pipe %>% must end a line, not start one)
scatter3d_chart <- plot_ly(scatter3d_data, x = ~X, y = ~Y, z = ~Z,
                           color = ~Group,
                           type = 'scatter3d', mode = 'markers') %>%
  layout(title = "Interactive 3D Scatter Plot",
         scene = list(xaxis = list(title = "X-axis"),
                      yaxis = list(title = "Y-axis"),
                      zaxis = list(title = "Z-axis")))

# Display the interactive 3D scatter plot
scatter3d_chart

3. Transformation Using R

• 3.1 Data Transformation Basics
• 3.2 Aggregation and Summarization
• 3.3 Reshaping and Pivoting Data
• 3.4 Joining and Merging Datasets
• 3.5 String Manipulations and Regular Expressions
Data Transformation Basics
• Data transformation is one of the key aspects of business
data analysis, data science, and even the preparatory work for artificial
intelligence.
• It is the process of converting, cleansing, and structuring data into a
usable format that can be analyzed to support decision-making
processes and to propel the growth of an organization.
• It is used when data needs to be converted to match the format of the
destination system.
• It converts raw data into a usable format by removing duplicates,
converting data types, and enriching the dataset.
• The process involves defining the structure, mapping the data, extracting
the data from the source system, performing the transformations, and then
storing the transformed data in the appropriate dataset.
• It ensures the compatibility of data with other types when combining it
with other information or migrating it into a dataset.
• Through data transformations, organizations can gain valuable insights
into their operational and informational functions.
• Data transformation involves converting raw data into a suitable format
for analysis.
• R provides functions like scale(), log() or sqrt() to normalize or
transform skewed data distributions.
• These transformations help meet the assumptions of statistical models
and improve interpretability.
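The three functions named above can be compared side by side. A minimal sketch in base R, using a small made-up vector of right-skewed values (the numbers are hypothetical):

```r
# Right-skewed example values (hypothetical, for illustration only)
x <- c(1, 2, 2, 3, 4, 5, 8, 13, 40, 120)

log_x  <- log(x)                 # natural log compresses large values strongly
sqrt_x <- sqrt(x)                # square root compresses them more mildly
z      <- as.numeric(scale(x))   # centre to mean 0 and rescale to sd 1

print(round(c(mean = mean(z), sd = sd(z)), 3))

# Compare the shapes before and after the log transformation
par(mfrow = c(1, 2))
hist(x, main = "Original")
hist(log_x, main = "Log-transformed")
par(mfrow = c(1, 1))
```

Note that scale() only shifts and rescales the values, so it does not change the shape of the distribution, whereas log() and sqrt() do.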
Some of the transformation types, depending on the data involved,
include:
1. Filtering which helps in selecting certain columns that require
transformation
2. Enriching which fills out the basic gaps in the data set
3. Splitting where a single column is split into multiple or vice versa
4. Removal of duplicate data
5. Join or combine data from different sources
6. Reorder Data
7. Transform Data
Data Aggregation and Summarization
• The aggregate() function is used to get summary
statistics of the data by group.
• The statistics include mean, min, sum, max, etc.
• Syntax: aggregate(dataframe$aggregate_column,
list(dataframe$group_column), FUN)
• dataframe is the input dataframe.
• aggregate_column is the column to be aggregated in
the dataframe.
• group_column is the column whose values define the groups.
• FUN is the function applied to each group, such as sum, mean, min, or max.
# create a dataframe with 4 columns
data = data.frame(subjects = c("java", "python", "java", "java", "php", "php"),
                  id = c(1, 2, 3, 4, 5, 6),
                  names = c("manoj", "sai", "mounika", "durga",
                            "deepika", "roshan"),
                  marks = c(89, 89, 76, 89, 90, 67))
print(data)
# aggregate sum of marks with subjects
print(aggregate(data$marks, list(data$subjects), FUN = sum))

# aggregate minimum of marks with subjects
print(aggregate(data$marks, list(data$subjects), FUN = min))
# aggregate maximum of marks with subjects
print(aggregate(data$marks, list(data$subjects), FUN = max))
• The summarise() function (also spelled summarize()) comes from the dplyr package.
• Summarizing a data set by group gives a better indication of the distribution of the data.
library(dplyr)
df <- iris
df1 <- summarise(df, mean(Sepal.Length))
> df1
  mean(Sepal.Length)
1           5.843333
df2 <- summarise(df, Mean = mean(Sepal.Length),
                 SD = sd(Sepal.Length))
> df2
      Mean        SD
1 5.843333 0.8280661
df3 <- summarise(group_by(df, Species), Mean = mean(Sepal.Length),
                 SD = sd(Sepal.Length))
> df3
# A tibble: 3 × 3
  Species  Mean    SD
  <fct>   <dbl> <dbl>
1 setosa   5.01 0.352

library(magrittr)
df4 <- df %>%
  group_by(Species) %>%
  summarise(Mean = mean(Sepal.Length),
            SD = sd(Sepal.Length))
df6 <- df %>%
  group_by(Species) %>%
  summarise(Min = min(Sepal.Length),
            Max = max(Sepal.Length))
> df6
  Species      Min   Max
1 setosa       4.3   5.8
2 versicolor   4.9   7
3 virginica    4.9   7.9

df7 <- df %>%
  group_by(Species) %>%
  summarise(Sepal.Length = n()) %>%
  arrange(desc(Sepal.Length))
> df7
  Species    Sepal.Length
1 setosa               50
2 versicolor           50
3 virginica            50
Filtering, Reshaping
• The filter() function is a powerful tool for
subsetting data frames based on specified
conditions.
• It allows you to extract rows that meet specific
criteria, providing a flexible and efficient way
to manipulate data.
• The filter() function is part of the dplyr package.
• Syntax: filter(data_frame, condition)
• Parameters:
-data_frame: The input data frame or tibble.
-condition: The logical condition used to filter rows.
# Load necessary library
library(dplyr)
# Create a simple dataset
employees <- data.frame(
  ID = 1:10,
  Name = c("John", "Jane", "Bill", "Anna", "Tom", "Sue", "Mike", "Sara",
           "Alex", "Nina"),
  Department = c("HR", "Finance", "IT", "Finance", "IT", "HR", "IT", "Finance",
                 "HR", "Finance"),
  Salary = c(50000, 60000, 70000, 65000, 72000, 48000, 75000, 67000, 52000,
             69000))
# Print the dataset
print(employees)
# Filter employees in the IT department
it <- filter(employees, Department == "IT")
# Print the result
print(it)
# Filter employees in the Finance department with salary greater than 65000
high_paid_finance_employees <- filter(employees, Department == "Finance" &
                                        Salary > 65000)
print(high_paid_finance_employees)
Use of | (or) and %in% in filter()
# Filter employees in the HR or IT department using the | (or) operator
hr_it_employees <- employees %>% filter(Department == "HR" |
                                          Department == "IT")

# Print the result
print(hr_it_employees)

# Filter employees in HR or Finance department using %in% operator
hr_finance_employees <- employees %>% filter(Department %in%
                                               c("HR", "Finance"))

# Print the result
print(hr_finance_employees)
Reshaping data in a data frame

• Transpose of a Matrix
• Joining Rows and Columns
• Merging of Data Frames
• Melting and Casting
Reshaping
Joining Rows and Columns in Data Frame
• cbind()
• rbind()
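A minimal sketch of joining columns and rows in base R follows. The data frames and the names in them are hypothetical and serve only to illustrate the two functions.

```r
# Hypothetical example data
students <- data.frame(id = 1:3, name = c("Asha", "Ravi", "Meera"))
marks    <- data.frame(marks = c(82, 75, 91))

# cbind() joins columns: both objects must have the same number of rows
with_marks <- cbind(students, marks)

# rbind() joins rows: both data frames must have the same column names
new_student  <- data.frame(id = 4, name = "Kiran", marks = 68)
all_students <- rbind(with_marks, new_student)
print(all_students)
```

cbind() pairs rows purely by position, so it is only safe when both objects are already in the same order; matching on a key column calls for merge() instead.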
Merging two Data frames
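Two data frames that share a key column can be merged with base R's merge(). A minimal sketch, using hypothetical data frames keyed on an `id` column:

```r
# Hypothetical example data sharing the key column "id"
students <- data.frame(id = c(1, 2, 3), name = c("Asha", "Ravi", "Meera"))
marks    <- data.frame(id = c(2, 3, 4), marks = c(75, 91, 68))

inner <- merge(students, marks, by = "id")                # ids present in both: 2, 3
left  <- merge(students, marks, by = "id", all.x = TRUE)  # every student, NA where no marks
full  <- merge(students, marks, by = "id", all = TRUE)    # every id from either side
print(full)
```

The `all`, `all.x` and `all.y` arguments choose between inner, left, right and full joins; rows without a match are filled with NA.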
Melting and Casting
• Data reshaping can involve many steps to obtain the desired
format. One popular method is melting the data, which
converts each row into unique id-variable combinations, and then
casting it. The two functions used for this process (from the reshape2 package):
• melt(): It is used to convert a data frame into a molten data frame.
• Syntax: melt(data, …, na.rm=FALSE, value.name="value")
• where, data: data to be melted
• … : further arguments
• na.rm: whether to remove NA values, which converts explicit missings into implicit missings
• value.name: name of the column that stores the values
• dcast(): It reshapes the molten data into a data frame, taking a
formula and an aggregate function to aggregate the data.
• Syntax: dcast(data, formula, fun.aggregate)
• where, data: molten data to be cast
• formula: formula that defines how to cast
• fun.aggregate: aggregation function, used if the casting requires the data to be aggregated
library(reshape2)
a <- data.frame(id = c("1", "1", "2", "2"),
                points = c("1", "2", "1", "2"),
                x1 = c("5", "3", "6", "2"),
                x2 = c("6", "5", "1", "4"))
# Convert numeric columns to actual numeric values
a$x1 <- as.numeric(as.character(a$x1))
a$x2 <- as.numeric(as.character(a$x2))
print("Melting")
m <- melt(a, id = c("id", "points"))
print(m)
print("Casting")
idmn <- dcast(m, id ~ variable, mean)
print(idmn)
Log Transformation
• Transform the response variable from y
to log(y).
Cube Root Transformation in R
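Both the log transformation of the response and the cube root transformation can be sketched in base R. The example below assumes the built-in mtcars data; the regression of mpg on hp and the helper name `cube_root` are illustrative choices, not part of the slides.

```r
# Log transformation of the response: model log(mpg) instead of mpg
fit_log <- lm(log(mpg) ~ hp, data = mtcars)
print(coef(fit_log))

# Cube root transformation; unlike log(), it is defined for zero and
# negative values. sign() and abs() avoid NaN for negative inputs.
cube_root <- function(x) sign(x) * abs(x)^(1/3)
print(cube_root(c(-27, 0, 8, 64)))

# Apply it to a column
mtcars_cbrt_disp <- cube_root(mtcars$disp)
```

Writing `x^(1/3)` directly would return NaN for negative x in R, which is why the helper goes through abs().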
Benefits of Data Transformation
• Data Utilization - If the data being collected isn’t in an appropriate format, it often ends
up not being utilized at all. With the help of data transformation tools, organizations can
finally realize the true potential of the data they have amassed since the transformation
process standardizes the data and improves its usability and accessibility.

• Data Consistency - Data is continuously being collected from a range of sources, which
increases the inconsistencies in metadata. This makes organizing and understanding
data a huge challenge. Data transformation makes it simpler to understand and
organize data sets.

• Better Quality Data - The transformation process also enhances the quality of data, which
can then be utilized to acquire business intelligence.

• Compatibility Across Platforms - Data transformation also supports compatibility
between types of data, applications and systems.

• Faster Data Access - It is quicker and easier to retrieve data that has been transformed
into a standardized format.
4. Exploratory Data Analysis (EDA)

EDA is a method of analyzing data sets to identify
their main characteristics.
• 4.1 Fundamentals of EDA
• 4.2 Summary Statistics
• 4.3 Data Visualization for EDA
• 4.4 Detecting Missing Data and Outliers
• 4.5 Correlation and Relationships

Fundamentals of EDA
• EDA helps to understand the dataset, showing how many features
there are, the type of data in each feature, and how the data is
spread out, which helps in choosing the right methods for
analysis.
• It helps you identify unusual data points.
• EDA helps to identify hidden patterns and relationships between
different data points, which helps in model building.
• It helps you discover how different parts of the data are
connected.

• https://www.analyticsvidhya.com/blog/2021/11/fundamentals-of-exploratory-data-analysis/

• It helps you prepare the data for more detailed analysis or
building models.
• Insights that you obtain from EDA help you decide which
features are most important for building models and how to
prepare them to improve performance.

• By understanding the data, EDA helps us in choosing the
best modeling techniques and adjusting them for better
results.

Types of Exploratory Data Analysis

There are three main types of EDA:
1. Univariate
2. Bivariate
3. Multivariate


Univariate analysis
• Univariate analysis focuses on studying one variable to understand its
characteristics.
• It helps describe the data and find patterns within a
single feature.
• Common methods include histograms to show data
distribution, box plots to detect outliers and
understand data spread, and bar charts for
categorical data.
• Summary statistics like mean, median, mode,
variance, and standard deviation help describe the
central tendency and spread of the data.
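The univariate methods listed above map onto a handful of base R calls. A minimal sketch, assuming the built-in mtcars data; `stat_mode` is a hypothetical helper, since base R has no built-in function for the statistical mode.

```r
x <- mtcars$mpg

print(summary(x))                        # min, quartiles, median, mean, max
print(c(variance = var(x), sd = sd(x)))  # measures of spread

# Hypothetical helper: most frequent value(s) of a vector
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
print(stat_mode(c(4, 6, 6, 8, 8, 8)))

hist(x, main = "Histogram of mpg", xlab = "Miles per Gallon")  # distribution
boxplot(x, main = "Boxplot of mpg")                            # spread and outliers
barplot(table(mtcars$cyl), main = "Cars per Cylinder Count")   # categorical data
```

Note that mode() in R returns the storage type of an object, not the statistical mode, which is why a helper is needed.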

Univariate graphical
• Non-graphical methods are quantitative and
objective, but they do not give a complete
picture of the data. Graphical methods, which
involve a degree of subjective analysis, are
therefore also required.
• Histogram: The most basic graph is a histogram, which is
a bar plot in which each bar represents the frequency
(count) or proportion (count/total count) of cases for a range
of values. Histograms are one of the simplest ways to quickly
learn a lot about your data, including central tendency, spread,
modality, shape and outliers.
• Stem-and-leaf plots: A simple substitute for a histogram is the
stem-and-leaf plot. It shows all data values and the
shape of the distribution.
A stem and leaf plot
• A stem and leaf plot is a graphical representation used to
organize and display quantitative data in a semi-tabular
form. It helps in visualizing the distribution of the data set
and retains the original data values, making it easy to
identify the shape, central tendency, and variability of the
data.
• A stem and leaf plot splits each data point into a "stem" and
a "leaf." The "stem" represents the leading digits, while the
"leaf" represents the trailing digit. This separation makes it
easy to organize data and see patterns.
• For example, for the data set 23, 25, 27, 32, 34, 35, 41, 42:
• Stems: 2, 3, 4
• Leaves: 3, 5, 7 | 2, 4, 5 | 1, 2

Stem   Leaves
2      3, 5, 7
3      2, 4, 5
4      1, 2
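The same example can be reproduced in R. A minimal sketch in base R: the split into stems and leaves is done by hand with integer division, and stem() then draws the plot.

```r
scores <- c(23, 25, 27, 32, 34, 35, 41, 42)

# Stems are the tens digits, leaves are the units digits
leaves <- split(scores %% 10, scores %/% 10)
print(leaves)

# Base R's built-in stem-and-leaf display
stem(scores)
```

stem() chooses the stem width itself, so its printed layout may differ from the hand-drawn table; its `scale` argument can be used to adjust the number of stems.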
• Boxplots: Boxplots are excellent at presenting
information about central tendency. They show robust
measures of location and spread and also provide
information about symmetry and outliers, although
they can be misleading about aspects like
multimodality.
• Quantile-normal plots: Also called the QN plot,
or more generally the quantile-quantile or QQ plot. It
allows detection of non-normality and diagnosis of
skewness and kurtosis.

Univariate Non-graphical:
• The aim is to understand the underlying sample distribution of the
data and make observations about the population.
• Central tendency: Commonly used measures of central
tendency are the mean, median, and sometimes the
mode. For skewed distributions, or when there is concern about
outliers, the median may be preferred.
• Spread: Spread indicates how far from the center the data
values tend to lie. The standard deviation and variance are two
useful measures of spread.
• Skewness and kurtosis: Skewness is a measure of
asymmetry, and kurtosis is a more subtle measure of
peakedness compared to a normal distribution.
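These non-graphical measures can be computed directly. A minimal sketch in base R, assuming the built-in mtcars data; `skewness` and `kurtosis` are hypothetical helpers using the simple moment-based definitions, since base R does not provide them.

```r
# Moment-based skewness: 0 for symmetric data, > 0 for a long right tail
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Moment-based kurtosis: about 3 for a normal distribution
kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

x <- mtcars$mpg
print(c(mean = mean(x), median = median(x), sd = sd(x),
        skewness = skewness(x), kurtosis = kurtosis(x)))
```

Packages such as moments or e1071 offer similar functions; their results can differ slightly because of small-sample corrections.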

Bivariate analysis graphical and non-graphical

• Bivariate analysis focuses on exploring the relationship between two
variables to find connections, correlations, and dependencies.
• Some key techniques used in bivariate analysis include
1. scatter plots, which visualize the relationship between two
continuous variables;
2. correlation coefficient, which measures how strongly two
variables are related, commonly using Pearson’s correlation
for linear relationships;
3. cross-tabulation, or contingency tables, which show the
frequency distribution of two categorical variables and help
understand their relationship.


Bivariate analysis

• Cross tabulation is a data analysis method that
displays the relationship between two or more
variables in a table. It's also known as a contingency
table, pivot table, or two-way table.
• A contingency table displays frequencies for two categorical
variables. Use two-way tables to see relationships between
the variables.
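A contingency table can be built in base R with table(). A minimal sketch, assuming the built-in mtcars data and treating the cyl and am columns as categorical:

```r
# Cross-tabulate number of cylinders against transmission type
# (am: 0 = automatic, 1 = manual)
ct <- table(cyl = mtcars$cyl, am = mtcars$am)
print(ct)

print(addmargins(ct))               # add row and column totals
print(prop.table(ct, margin = 1))   # proportions within each row
```

Row proportions make groups of different sizes comparable, which raw counts alone do not.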


Bivariate analysis

• Line graphs are useful for comparing two variables over
time, especially in time series data, to identify trends or
patterns.
• Covariance measures how two variables change together,
though it’s often supplemented by the correlation coefficient
for a clearer, more standardized view of the relationship.
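The relationship between covariance and correlation can be checked numerically. A minimal sketch in base R, assuming the built-in mtcars data; the factor of 1000 is an arbitrary change of units chosen for illustration.

```r
x <- mtcars$wt
y <- mtcars$mpg

cov_xy <- cov(x, y)
cor_xy <- cor(x, y)
print(c(covariance = cov_xy, correlation = cor_xy))

# Correlation is covariance rescaled by both standard deviations
print(cov_xy / (sd(x) * sd(y)))

# Changing the units of x changes the covariance but not the correlation
print(c(covariance = cov(x * 1000, y), correlation = cor(x * 1000, y)))
```

This is why correlation, which always lies between -1 and +1, is easier to compare across pairs of variables than covariance.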


Multivariate analysis
• Multivariate analysis examines the relationships between two or more
variables in the dataset.
• It aims to understand how variables interact with one
another, which is crucial for most statistical modeling
techniques.
• It includes techniques like pair plots, which show the
relationships between multiple variables at once,
helping to see how they interact.
• It also includes Principal Component Analysis (PCA), which reduces
the complexity of large datasets by simplifying them,
while keeping the most important information.
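Both techniques are available in base R. A minimal sketch, assuming the built-in mtcars data; the four columns are an arbitrary illustrative selection.

```r
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Pair plot: every variable plotted against every other
pairs(vars, main = "Pair Plot of Selected mtcars Variables")

# PCA on standardized variables (mean 0, sd 1), so that no variable
# dominates only because of its units
pca <- prcomp(vars, center = TRUE, scale. = TRUE)
print(summary(pca))    # proportion of variance explained per component
head(pca$x[, 1:2])     # data projected onto the first two components
```

If the first one or two components explain most of the variance, the data can be plotted in two dimensions with little loss of information.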

Multivariate graphical:
• Scatterplot: For two quantitative variables, the essential
graphical EDA technique is the scatterplot, which has
one variable on the x-axis, one on the y-axis, and
a point for every case in your dataset.
• Run chart: It’s a line graph of data plotted over time.
• Heat map: It’s a graphical representation of data where
values are depicted by color.
• Multivariate chart: It’s a graphical representation of
the relationships between factors and response.
• Bubble chart: It’s a data visualization that displays
multiple circles (bubbles) in a two-dimensional plot, with
the bubble size encoding a third variable.
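Two of these charts can be sketched with base R graphics. The example assumes the built-in mtcars dataset and an arbitrary selection of four columns; neither comes from the slides.

```r
# Illustrative data: a few numeric columns of built-in mtcars
data(mtcars)
vars <- mtcars[, c("mpg", "wt", "hp", "disp")]

# Heat map: correlation values depicted by color
corr <- cor(vars)
heatmap(corr, Rowv = NA, Colv = NA, symm = TRUE, scale = "none",
        main = "Correlation heat map")

# Bubble chart: x and y positions plus bubble size for a third variable
# (square root of hp so that bubble area tracks horsepower)
symbols(mtcars$wt, mtcars$mpg, circles = sqrt(mtcars$hp), inches = 0.2,
        xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
        main = "Bubble size = horsepower")
```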
Multivariate Non-graphical:
1. correlation coefficient, which measures how strongly two
variables are related, commonly using Pearson’s
correlation for linear relationships;
2. cross-tabulation, or contingency tables, which show the
frequency distribution of two categorical variables and
help understand their relationship.
• For one categorical variable and one quantitative
variable, compute statistics for the quantitative variable
separately for each level of the categorical variable, then
compare the statistics across the levels of the categorical
variable.
• Comparing the means is an informal version of one-way
ANOVA, and comparing the medians is a robust informal
version of one-way ANOVA.
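The per-level comparison can be sketched in R with tapply(). The built-in iris dataset is assumed for illustration, with Species as the categorical variable and Sepal.Length as the quantitative one; the formal aov() call is included only for comparison.

```r
data(iris)

# Statistics of the quantitative variable for each level
group_means   <- tapply(iris$Sepal.Length, iris$Species, mean)
group_medians <- tapply(iris$Sepal.Length, iris$Species, median)
print(group_means)
print(group_medians)

# Formal one-way ANOVA on the same variables, for comparison
fit <- aov(Sepal.Length ~ Species, data = iris)
print(summary(fit))
```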

How it works
• Generate questions: Formulate questions
about the data
• Inspect the data: Examine the data from
multiple perspectives
• Visualize the data: Use charts and graphs to
represent the data
• Transform the data: Apply statistical and
mathematical methods to the data
• Identify patterns: Look for relationships
between different parts of the data
• Identify outliers: Find any unusual data points
• Refine questions: Use what you've learned to
generate new questions or refine existing ones

Steps for Performing Exploratory Data Analysis
Step 1: Understand the Problem and the Data

The first step in any data analysis project is to clearly understand the problem
you’re trying to solve and the data you have. This involves asking key questions
such as:
• What is the business goal or research question?

• What are the variables in the data and what do they represent?

• What types of data (numerical, categorical, text, etc.) do you have?

• Are there any known data quality issues or limitations?

• Are there any domain-specific concerns or restrictions?


Step 2: Import and Inspect the Data

• Load the data into your environment carefully to avoid errors or
truncations.
• Examine the size of the data (number of rows and columns) to
understand its complexity.
• Check for missing values and see how they are distributed across
variables, since missing data can impact the quality of your analysis.
• Identify data types for each variable (like numerical, categorical, etc.),
which will help in the next steps of data manipulation and analysis.
• Look for errors or inconsistencies, such as invalid values, mismatched
units, or outliers, which could signal deeper issues with the data.
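These inspection steps could look roughly as follows in R. To keep the sketch self-contained it first writes a small hypothetical CSV file to a temporary location; the column names and values are invented for illustration.

```r
# Write a small hypothetical CSV so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,age,city,income",
  "1,34,Pune,52000",
  "2,NA,Mumbai,61000",
  "3,29,,48000",
  "4,41,Pune,NA"
), csv_path)

# Load the data, treating empty fields and "NA" as missing
df <- read.csv(csv_path, na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Size of the data: rows and columns
print(dim(df))

# Data type of each variable
str(df)

# Number of missing values per variable
missing_per_column <- colSums(is.na(df))
print(missing_per_column)

# Quick range check to spot invalid values
print(summary(df))
```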

Step 3: Handle Missing Data


• It is important to identify and handle missing data properly to
avoid biased or misleading results.
• Decide whether to remove missing data (listwise deletion)
or impute (fill in) the missing values. Removing data can
lead to biased outcomes.
• Imputing values helps preserve data but should be done
carefully.
• Use appropriate imputation methods like mean/median
imputation, regression imputation, or machine learning
techniques like KNN or decision trees based on the data’s
characteristics.
• Consider the impact of missing data. Even after imputing,
missing data can cause uncertainty and bias, so interpret
the results with caution.
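A minimal sketch of the two simplest options, listwise deletion and mean/median imputation, is shown below in base R. The data frame is hypothetical, and choosing the mean for age and the median for income is only an illustrative assumption.

```r
# Hypothetical data with missing values
df <- data.frame(
  age    = c(34, NA, 29, 41, 38),
  income = c(52000, 61000, NA, 48000, 150000)
)

# Option 1: listwise deletion drops every row with any missing value
complete_rows <- na.omit(df)
print(nrow(complete_rows))

# Option 2: simple imputation keeps all rows
imputed <- df
imputed$age[is.na(imputed$age)] <- mean(df$age, na.rm = TRUE)
# The median is less sensitive to the very large income value
imputed$income[is.na(imputed$income)] <- median(df$income, na.rm = TRUE)
print(imputed)
```

Deletion shrinks the data from five rows to three here, which shows how quickly information can be lost.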

Step 4: Explore Data Characteristics


• After addressing missing data, the next step is to explore the
characteristics of your data by examining the distribution,
central tendency, and variability of your variables, as well
as identifying any outliers or anomalies.
• This helps in selecting appropriate analysis methods and
spotting potential data issues.
• You should calculate summary statistics like mean,
median, mode, standard deviation, skewness, and
kurtosis for numerical variables.
• These provide an overview of the data’s distribution and
help identify any irregular patterns or issues.
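Base R provides mean(), median(), and sd() directly, but has no built-in functions for the statistical mode, skewness, or kurtosis. The sketch below defines small helper functions for these; the helper names, the moment-based formulas, and the sample vector are assumptions made for illustration.

```r
# Hypothetical numerical variable with one large value
x <- c(2, 3, 3, 4, 5, 5, 5, 6, 7, 20)

# Most frequent value (first one if there are ties)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

# Moment-based skewness: positive means a longer right tail
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Moment-based kurtosis: a normal distribution gives about 3
kurtosis <- function(v) {
  m <- mean(v)
  mean((v - m)^4) / mean((v - m)^2)^2
}

stats <- c(mean = mean(x), median = median(x), mode = stat_mode(x),
           sd = sd(x), skewness = skewness(x), kurtosis = kurtosis(x))
print(round(stats, 3))
```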


Step 5: Perform Data Transformation


• Data transformation prepares your data for accurate analysis and
modeling.
• Depending on your data’s characteristics and analysis needs, you
may need to transform it to ensure it’s in the right format.

Common transformation techniques include:

• Scaling or normalizing numerical variables (e.g., min-max scaling
or standardization).
• Encoding categorical variables for machine learning (e.g., one-hot
encoding or label encoding).
• Applying mathematical transformations (e.g., logarithmic or square
root) to correct skewness or non-linearity.
• Creating new variables from existing ones (e.g., calculating ratios
or combining variables).
• Aggregating or grouping data based on specific variables or
conditions.
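Each of the techniques listed above can be sketched in a line or two of base R. The data frame and its column names are hypothetical.

```r
# Hypothetical data
df <- data.frame(
  city     = c("Pune", "Mumbai", "Pune", "Delhi"),
  income   = c(40000, 60000, 50000, 100000),
  expenses = c(30000, 45000, 20000, 50000)
)

# Min-max scaling to the range [0, 1]
df$income_minmax <- (df$income - min(df$income)) /
  (max(df$income) - min(df$income))

# Standardization: mean 0, standard deviation 1
df$income_z <- as.numeric(scale(df$income))

# Logarithmic transformation to reduce right skew
df$income_log <- log(df$income)

# New variable created from existing ones
df$expense_ratio <- df$expenses / df$income

# One-hot encoding of a categorical variable (one column per level)
one_hot <- model.matrix(~ city - 1, data = df)
print(one_hot)

# Aggregation: mean income per city
avg_income_by_city <- aggregate(income ~ city, data = df, FUN = mean)
print(avg_income_by_city)
```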

Step 6: Visualize Data Relationships

• Visualization is a powerful tool in the EDA process.
• It helps to uncover relationships between variables and
identify patterns or trends that may not be obvious from
summary statistics alone.
• For categorical variables, create frequency tables, bar plots,
and pie charts to understand the distribution of categories
and identify imbalances or unusual patterns.
• For numerical variables, generate histograms, box plots,
violin plots, and density plots to visualize distribution, shape,
spread, and potential outliers.
• To explore relationships between variables, use scatter plots,
correlation matrices, or statistical measures like Pearson’s
correlation coefficient or Spearman’s rank correlation.
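The plots and measures mentioned in this step can be sketched with base R graphics. The built-in mtcars dataset and the chosen columns are assumptions for illustration; violin plots are omitted because base R has no function for them.

```r
data(mtcars)

# Categorical variable: frequency table and bar plot
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)
barplot(cyl_counts, xlab = "Cylinders", ylab = "Number of cars")

# Numerical variable: histogram and density plot
hist(mtcars$mpg, xlab = "Miles per gallon", main = "Distribution of mpg")
plot(density(mtcars$mpg), main = "Density of mpg")

# Numerical variable split by a categorical one: box plot
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Cylinders", ylab = "Miles per gallon")

# Relationship between two numerical variables
plot(mtcars$hp, mtcars$mpg, xlab = "Horsepower", ylab = "Miles per gallon")
pearson  <- cor(mtcars$hp, mtcars$mpg, method = "pearson")
spearman <- cor(mtcars$hp, mtcars$mpg, method = "spearman")
print(c(pearson = pearson, spearman = spearman))
```

Spearman's rank correlation is based on ranks, so it can capture monotonic relationships that are not strictly linear.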

Step 7: Handling Outliers

• Outliers are data points that significantly differ from the
rest of the data, often caused by errors in measurement
or data entry.
• Detecting and handling outliers is important because they
can skew your analysis and affect model performance.
• Identify outliers using methods like interquartile range
(IQR), Z-scores, or domain-specific rules.
• Once identified, outliers can be removed or adjusted
depending on the context.
• Properly managing outliers ensures your analysis is
accurate and reliable.
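The IQR and Z-score methods can be sketched as follows. The sample vector is hypothetical, the 1.5 × IQR fences follow the common box plot convention, and the Z-score cutoff of 2 is an assumption chosen because, in a sample of only seven values, no Z-score can exceed roughly 2.27, so the often-quoted cutoff of 3 would never flag anything.

```r
# Hypothetical measurements with one suspicious value
x <- c(10, 12, 11, 13, 12, 11, 95)

# IQR method: flag values beyond 1.5 * IQR from the quartiles
q     <- unname(quantile(x, c(0.25, 0.75)))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
iqr_outliers <- x[x < lower | x > upper]
print(iqr_outliers)

# Z-score method: flag values far from the mean in sd units
# (cutoff of 2 is an assumption suited to this tiny sample)
z <- (x - mean(x)) / sd(x)
z_outliers <- x[abs(z) > 2]
print(z_outliers)

# One way of handling: remove the flagged values
cleaned <- x[!(x < lower | x > upper)]
print(cleaned)
```

Whether to remove, cap, or keep a flagged value remains a context-dependent decision.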


Step 8: Communicate Findings and Insights

• This involves summarizing your analysis, pointing out key
discoveries, and presenting your results in a clear and
engaging way.
• Clearly state the goals and scope of your analysis.
• Provide context and background to help others understand
your approach.
• Use visualizations to support your findings and make them
easier to understand.
• Highlight key insights, patterns, or anomalies discovered.
• Mention any limitations or challenges faced during the
analysis.
• Suggest next steps or areas that need further investigation.
Use of Tilde ~ in R
• The tilde symbol (~) is used within formulas of
statistical models, mainly to define the relationship
between the dependent variable and the independent
variables in the statistical model formula in the R
programming language.
• The left side of the tilde symbol specifies the
target variable (dependent variable or outcome)
and the right side of the tilde specifies the
predictor variables (independent variables).
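As a short sketch of the tilde in use, the code below builds a formula and passes it to lm(). The built-in mtcars dataset and the choice of mpg as outcome with wt and hp as predictors are assumptions made for illustration.

```r
data(mtcars)

# Left of ~ : outcome (mpg); right of ~ : predictors (wt and hp)
f <- mpg ~ wt + hp
print(class(f))

# Variables named in the formula, outcome first
print(all.vars(f))

# Fit a linear model described by the formula
fit <- lm(f, data = mtcars)
print(coef(fit))
```

The same formula syntax is accepted by many other R functions, for example boxplot() and aggregate().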

Thank You