DMV Unit-2
Class - T.Y.
PLD (SEM-II)
Unit - II
MIT School of Computing
Department of Computer Science & Engineering
Syllabus
1. Introduction to R Programming
1.1 Basics of R Programming
1.2 Data Types and Structures in R
1.3 Essential R Programming Constructs
1.4 Functions in R
1.5 Importing and Exporting Data
2. Visualization Using R
2.1 Overview of Visualization in R
2.2 Creating Basic Visualizations
2.3 Advanced Visualizations Using ggplot2
2.4 Interactive Visualizations
3. Transformation Using R
3.1 Data Transformation Basics
3.2 Aggregation and Summarization
3.3 Reshaping and Pivoting Data
3.4 Joining and Merging Datasets
3.5 String Manipulations and Regular Expressions
4. Exploratory Data Analysis (EDA)
4.1 Fundamentals of EDA
4.2 Summary Statistics
4.3 Detecting Missing Data and Outliers
4.4 Data Visualization for EDA
4.5 Correlation and Relationships
1. Introduction to R Programming
1.1 Basics of R Programming:
• The R language stands out as a powerful tool in the modern era of statistical
computing and data analysis.
• It is widely embraced by statisticians, data scientists, and researchers.
• It offers an extensive suite of packages and libraries tailored for data
manipulation, statistical modelling, and visualization.
• It is an implementation of the S programming language.
• It is a leading tool for machine learning, statistics, and data analysis,
allowing for the easy creation of objects, functions, and packages.
• R language is specifically designed for statistical analysis and provides a vast array of
• The R Language boasts a rich ecosystem of packages and libraries that extend its
• R language excels in data visualization, offering powerful tools like ggplot2 and plotly,
which enable the creation of detailed and aesthetically pleasing graphs and plots.
5. Platform Independence:
• The R Language is platform-independent, meaning it can run on
various operating systems, including Windows, macOS, and
Linux, providing flexibility in development environments.
Advantages of R language
• R is one of the most comprehensive statistical analysis
packages; new technology and concepts often appear
first in R.
• R is open source, so you can run it anywhere and at
any time.
• R is suitable for GNU/Linux and Windows operating
systems.
• R is cross-platform and runs on all major operating
systems.
• In R, everyone is welcome to provide new packages,
bug fixes, and code enhancements.
Disadvantages of R language
Applications of R language
Data type | Example | Description
Logical | TRUE, FALSE | A special data type for data with only two possible values, which can be construed as true/false.
Numeric | 12, 32, 112, 5432 | A decimal value is called numeric in R, and it is the default computational data type.
Integer | 3L, 66L, 2346L | Here, L tells R to store the value as an integer.
Complex | z = 1+2i, t = 7+3i | A complex value in R is defined using the pure imaginary value i.
Character | 'a', "good", "TRUE", '35.4' | In R programming, a character is used to represent string values. We convert objects into character values with the help of the as.character() function.
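The data types in the table can be checked with class(). The following is a minimal sketch; the variable names are chosen here only for illustration.

```r
# One value of each basic data type (variable names are illustrative)
logical_value   <- TRUE
numeric_value   <- 12.5
integer_value   <- 3L
complex_value   <- 1 + 2i
character_value <- "good"

# class() reports the data type of each value
class(logical_value)    # "logical"
class(numeric_value)    # "numeric"
class(integer_value)    # "integer"
class(complex_value)    # "complex"
class(character_value)  # "character"

# as.character() converts another object into a character value
as.character(35.4)      # "35.4"
```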
• Vectors
• Lists
• Dataframes
• Matrices
• Arrays
• Factors
• Tibbles
Vectors
• A vector is simply a list of items that are of the
same type.
• To combine items into a vector, use the c() function and
separate the items with commas.
Examples:
# Vector of strings
fruits <- c("banana", "apple", "orange")
# Print fruits
fruits
# Vector of numerical values
numbers <- c(1, 2, 3)
# Print numbers
numbers
Output:
[1] 1 2 3
Vector Length
To find out how many items a vector has, use the length() function:
Example
fruits <- c("banana", "apple", "orange")
length(fruits)
Sort a Vector
To sort items in a vector alphabetically or numerically, use
the sort() function:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
numbers <- c(13, 3, 5, 7, 20, 2)
sort(fruits) # Sort strings alphabetically
sort(numbers) # Sort numbers in ascending order
Output:
[1] "apple"  "banana" "lemon"  "mango"  "orange"
[1]  2  3  5  7 13 20
Lists
A list in R can contain many different data types. A list is a collection of data
that is ordered and changeable.
Lists are also one-dimensional data structures. A list can be a list of vectors, a list
of matrices, a list of characters, a list of functions, and so on.
To create a list, use the list() function:
# R program to create a list (values reconstructed from the output below)
empList <- list(c(1, 2, 3, 4), c("Debi", "Sandeep", "Subham", "Shiba"), 4)
empList
Output
[[1]]
[1] 1 2 3 4
[[2]]
[1] "Debi" "Sandeep" "Subham" "Shiba"
[[3]]
[1] 4
In this example, the list() function creates a list with character, logical,
numeric, and vector elements. It gives the following output:
[[1]] [1] "Shubham"
[[2]] [1] "Arpita"
[[3]] [1] 1 2 3 4 5
[[4]] [1] TRUE
[[5]] [1] FALSE
[[6]] [1] 22.5
[[7]] [1] 12
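A list that prints like the output above could be built as follows. This is a sketch: the variable name list_data is an assumption, and the values are taken from the output shown.

```r
# A list mixing character, vector, logical, numeric, and integer elements
# (list_data is an illustrative name)
list_data <- list("Shubham", "Arpita", c(1, 2, 3, 4, 5), TRUE, FALSE, 22.5, 12L)
print(list_data)

# Elements are accessed by position with double square brackets
list_data[[3]]   # the numeric vector 1 2 3 4 5
```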
R – Array
• Arrays are essential data storage structures defined by a fixed number
of dimensions. Arrays are used for the allocation of space at
contiguous memory locations.
• In the R programming language, uni-dimensional arrays are called
vectors, with the length being their only dimension. Two-dimensional
arrays are called matrices, consisting of fixed numbers of rows and
columns.
• An R array can be created with the array() function. A vector of
elements is passed to the array() function along with the required
dimensions.
• Syntax:
array(data, dim = c(nrow, ncol, nmat), dimnames = names)
where
nrow: Number of rows
ncol: Number of columns
nmat: Number of matrices of dimension nrow x ncol
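The array() syntax can be tried with a small example. The sketch below uses made-up dimension names (r1, c1, m1, and so on) purely for illustration.

```r
# Build a 2 x 3 x 2 array: 2 rows, 3 columns, 2 matrices
arr <- array(
  1:12,
  dim = c(2, 3, 2),
  dimnames = list(c("r1", "r2"), c("c1", "c2", "c3"), c("m1", "m2"))
)
print(arr)

# dim() returns the dimensions as a vector
dim(arr)

# Index by row, column, and matrix: row 2, column 3 of the first matrix
arr["r2", "c3", "m1"]
```

R fills arrays column by column, so the first matrix holds the values 1 to 6 and the second holds 7 to 12.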
R – Matrices
An R matrix is a two-dimensional arrangement of data in rows and
columns.
In a matrix, rows run horizontally and columns run vertically. In R
programming, matrices are two-dimensional, homogeneous data
structures.
Creating a Matrix in R
To create a matrix in R, use the matrix() function.
Its first argument is the vector of elements. You also
pass the number of rows and the number of columns you want
the matrix to have.
A = matrix(
# Taking sequence of elements (reconstructed from the output below)
c(1, 2, 3, 4, 5, 6, 7, 8, 9),
# No of rows
nrow = 3,
# No of columns
ncol = 3,
# Fill the matrix row by row, as the output below requires
byrow = TRUE
)
# Naming rows
rownames(A) = c("a", "b", "c")
# Naming columns
colnames(A) = c("c", "d", "e")
cat("The 3x3 matrix:\n")
print(A)
Output:
The 3x3 matrix:
  c d e
a 1 2 3
b 4 5 6
c 7 8 9
R Factors
• Factors in the R programming language are data structures used to
categorize data, or represent categorical data, and store it on
multiple levels.
• They are stored as integers with a corresponding label for every
unique integer.
• Although R factors may look similar to character vectors, they are
integers, and care must be taken when using them as strings.
• An R factor accepts only a restricted number of distinct values. For
example, a data field such as gender may contain values only from
female, male.
• Attributes of Factors in R Language (arguments of the factor() function)
• x: The vector that needs to be converted into a factor.
• levels: The set of distinct values that the input vector x
may take.
• labels: A character vector of labels corresponding to the
levels.
• exclude: The values you want to exclude.
• ordered: A logical value that decides whether the levels are
ordered.
• nmax: An upper limit on the maximum number of levels.
Output
[1] "female" "male"   "male"   "female"
[1] female male male female
Levels: female male
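Output of this shape is what printing a character vector and then its factor version produces. A minimal sketch, assuming the variable names x and gender:

```r
# Character vector of categories (names x and gender are illustrative)
x <- c("female", "male", "male", "female")
print(x)

# Convert the character vector into a factor and print it
gender <- factor(x)
print(gender)

# Internally the factor stores integer codes, one per level
as.integer(gender)
```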
# Create a factor
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
# Print the factor
music_genre
Result:
[1] Jazz Rock Classic Classic Pop Jazz Rock Jazz
Levels: Classic Jazz Pop Rock
You can see from the example above that the factor has four levels
(categories): Classic, Jazz, Pop and Rock.
To print only the levels, use the levels() function, for example levels(music_genre).
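A short sketch of levels() on the same factor; nlevels() is shown as an additional base R helper that counts the levels.

```r
# Re-create the factor from the example above
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

# Print only the levels (sorted alphabetically by default)
levels(music_genre)   # "Classic" "Jazz" "Pop" "Rock"

# Count the levels
nlevels(music_genre)  # 4
```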
Data Frames
• Data Frames in R are generic data objects that are
used to store tabular data.
• Data Frames are data displayed in the format of a table.
• Data Frames can hold different types of data. While the first
column can be character, the second and third can
be numeric or logical. However, all values within a single column
must have the same type.
• Use the data.frame() function to create a data frame:
Example
# Create a data frame
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
# Print the data frame
Data_Frame
Output:
  Training Pulse Duration
1 Strength   100       60
2  Stamina   150       30
3    Other   120       45
Essential R Programming Constructs
The if Statement
• An "if statement" is written with the if keyword, and it is used to
specify a block of code to be executed if a condition is TRUE:
Example
a <- 33
b <- 200
if (b > a) {
print("b is greater than a")
}
x <- 5
# define a variable
x <- 15
Example:
# assigning strings to the vector
week <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday',
          'Thursday', 'Friday', 'Saturday')
# using for loop to iterate
# over each string in the vector
for (day in week) {
  # displaying each string in the vector
  print(day)
}
val = 1
Repeat Loop in R
• It is a simple loop that runs the same statement or
group of statements repeatedly until a stop condition is
encountered.
• A repeat loop has no condition of its own to terminate the
loop. The programmer must explicitly place a condition
within the loop's body and use a break statement to
terminate the loop.
• If no such condition is present in the body of the repeat loop,
it will iterate infinitely.
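The points above can be sketched as a small repeat loop. The counter name val and the limit of 5 are chosen here only for illustration.

```r
# Start the counter (illustrative values)
val <- 1

repeat {
  print(val)
  val <- val + 1

  # Stop condition: leave the loop once val exceeds 5
  if (val > 5) {
    break
  }
}
```

Without the if block and its break statement, this loop would never stop.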
R Functions
• A set of statements organized together to perform a specific task is
known as a function. R provides a series of built-in functions and also
allows users to create their own. Functions are used to perform tasks in
a modular way.
• An R function is created by using the keyword function. The syntax of
an R function is as follows:
func_name <- function(arg_1, arg_2, ...)
{
Function body
}
• Information can be passed into functions as arguments.
• Arguments are specified after the function name, inside the parentheses. You
can add as many arguments as you want, just separate them with a comma.
Example
my_function <- function(fname) {
paste(fname, "Griffin")
}
my_function("Peter")
my_function("Lois")
my_function("Stewie")
Output
[1] "Peter Griffin"
[1] "Lois Griffin"
[1] "Stewie Griffin"
Function Types
• Similar to the other languages, R also has two types of function,
i.e. Built-in Function and User-defined Function.
• In R, there are lots of built-in functions which we can directly call in
the program without defining them.
• R also allows us to create our own functions.
Built-in function
• Functions that are already created or defined in the
programming framework are known as built-in functions.
• Users do not need to create these functions; they are
built into the language.
• End users can access these functions by simply calling them. R has
many built-in functions, such as seq(), mean(), max(), and
sum().
# Creating sequence of numbers from 32 to 46.
print(seq(32,46))
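The other built-in functions named above can be called in the same way. A minimal sketch on a made-up vector:

```r
# Illustrative data
values <- c(4, 8, 15, 16, 23, 42)

mean(values)          # arithmetic mean: 18
max(values)           # largest value: 42
sum(values)           # total: 108

# seq() also accepts a step size through the by argument
seq(32, 46, by = 2)
```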
User-defined function
• R allows us to create our own functions in a program. A user-defined
function is written by the user to fulfill a specific requirement. Once
created, it can be used just like a built-in
function.
# function to add 2 numbers
add_num <- function(a, b) {
  a + b
}
2. Visualization Using R
Data visualization
• Data visualization is the technique used to deliver insights from
data using visual cues such as graphs, charts, maps, and many
others.
• It is useful because it helps in the intuitive and easy understanding
of large quantities of data, and thereby in making better decisions
about it.
• Popular data visualization tools include
Tableau, Plotly, R, Google Charts, Infogram, and Kibana.
• The various data visualization platforms have different
capabilities, functionality, and use cases.
• They also require different skill sets. This unit discusses the
use of R for data visualization.
• xlab and ylab: These parameters set the labels of the x and y axes, respectively. They
take a string value.
• main: This parameter sets the main title of the plot. It takes a string value.
• col: This parameter sets the color of points or lines in a plot. It takes a string value
specifying a color name or a code in the format "#RRGGBB", where RR, GG, and BB
are the red, green, and blue components of the color, respectively.
• lwd: This parameter sets the width of lines in a plot. It takes a numeric value.
• sub: This parameter sets the sub-title/label of the plot.
• pch: This parameter sets the plotting character for points in a plot.
• lty: This parameter can be used to change the line types of the plot.
• font: This parameter sets the font style of text in the plot.
• cex: It is a short form for character expansion. This parameter sets the
size of elements in the plot, such as points or text. cex takes a numeric value.
plot(x,y,bty="U")
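The parameters above can be combined in a single plot() call. This is a sketch on made-up data; the titles, color code, and values are assumptions chosen for illustration.

```r
# Illustrative data
x <- 1:10
y <- x^2

plot(x, y,
     type = "b",              # draw both points and lines
     main = "Main title",     # main title
     sub  = "Sub-title",      # sub-title below the x axis
     xlab = "X axis title",   # x axis label
     ylab = "Y axis title",   # y axis label
     col  = "#1F77B4",        # color as a "#RRGGBB" code
     lwd  = 2,                # line width
     pch  = 19,               # plotting character (solid circle)
     lty  = 2,                # line type (dashed)
     cex  = 1.2,              # size of the points
     font = 2,                # font style (2 = bold)
     bty  = "l")              # box type drawn around the plot
```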
barplot(H, xlab, ylab, main, names.arg, col)
Parameters:
• H: This parameter is a vector or matrix containing numeric values which are
used in the bar chart.
• col: This parameter is used to give colors to the bars in the graph.
# plotting a bar-graph with title and color
barplot(c(1,3), main="Main title",
xlab="X axis title",
ylab="Y axis title",
col.main="red",
col.lab="blue")
1. Write a program to draw a line chart using the plot()
function.
2. Write a program to draw a bar chart to visualize the
comparative rainfall data for 12 months.
3. Write a program to create a bar graph to illustrate the
distribution of students from various schools who attended
a seminar on “Deep Learning”. The total number of
students from each school is provided below.
• Boxplots show how the data in a data set is
distributed.
• They are also useful for comparing the distribution of data across
data sets, by drawing a boxplot for each of them.
• Boxplots are created in R by using the boxplot() function.
Syntax
boxplot(x, data, notch, varwidth, names, main)
• x is a vector or a formula.
• data is the data frame.
• notch is a logical value. Set it to TRUE to draw a notch.
• varwidth is a logical value. Set it to TRUE to draw the width of the
box proportionate to the sample size.
• names are the group labels which will be printed under each
boxplot.
• main is used to give a title to the graph.
boxplot(mpg ~ cyl, data = mtcars,
xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data",
notch= TRUE, varwidth = TRUE,
col = c("green","yellow","purple"),
names = c("High","Medium","Low"))
4. Scatter Plot
• A scatter plot is a set of points, each representing an individual
data item positioned along the horizontal and vertical axes.
• In a graph where the values of two variables are plotted
along the X-axis and Y-axis, the pattern of the resulting points
can reveal a correlation between them.
• Scatter plots reveal correlations, whether positive or negative,
within paired data, showcasing trends and patterns.
• Scatter plots illustrate connections between variables through
ordered pairs, making them useful for analyzing paired
numerical data and situations where the dependent variable
varies across different values of the independent variable.
• Their strength lies in their ability to clearly depict trends,
clusters, and relationships within datasets.
Parameters of plot()
• Syntax: plot(x, y, main, xlab, ylab, xlim, ylim, axes)
• Parameters:
• x: This parameter sets the horizontal coordinates.
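A scatter plot with the plot() syntax above can be sketched on the built-in mtcars data set. The choice of columns, titles, and axis limits is an assumption made for illustration.

```r
# Scatter plot of car weight against mileage from the built-in mtcars data
plot(x = mtcars$wt, y = mtcars$mpg,
     main = "Weight vs Mileage",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles Per Gallon",
     xlim = c(1, 6),     # range shown on the x axis
     ylim = c(10, 35),   # range shown on the y axis
     axes = TRUE)        # draw both axes

# The correlation coefficient quantifies the trend seen in the plot
cor(mtcars$wt, mtcars$mpg)
```

The points fall from the top left to the bottom right, which matches the negative correlation returned by cor().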
Parameters of pie()
• Parameters:
• x: This parameter is a vector that contains the numeric values
which are used in the pie chart.
• labels: This parameter gives the description of the slices in the pie
chart.
• radius: This parameter is used to indicate the radius of the circle
of the pie chart (a value between -1 and +1).
• main: This parameter represents the title of the pie chart.
• clockwise: This parameter contains a logical value which
indicates whether the slices are drawn clockwise or
anticlockwise.
• col: This parameter gives colors to the slices of the pie.
# Create data for the graph.
s <- c(23, 56, 20, 63)
labels <- c("Mumbai", "Pune", "Chennai", "Bangalore")
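With this data, a pie chart could be drawn as follows. This is a sketch: the title, radius, and colors are assumptions, since the original plotting call is not shown.

```r
# Data for the graph
s <- c(23, 56, 20, 63)
labels <- c("Mumbai", "Pune", "Chennai", "Bangalore")

# Draw the pie chart (title, radius, and colors are illustrative)
pie(s,
    labels = labels,
    radius = 0.8,
    main = "City pie chart",
    clockwise = TRUE,
    col = rainbow(length(s)))
```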
Structure of ggplot
Data Layer:
In the data layer of ggplot2, we define the
source of the information to be visualized.
library(ggplot2)
library(dplyr)
ggplot(data = mtcars) +
  labs(title = "MTCars Data Plot")
Aesthetic Layer
In the aesthetic layer of ggplot2, we map the
variables of the dataset to certain aesthetics.
# Aesthetic Layer
ggplot(data = mtcars, aes(x = hp, y = mpg, col = disp)) +
  labs(title = "MTCars Data Plot")
Geometric layer:
• In the geometric layer, we control how the data is drawn: here we
add size, color, and shape to a point plot, and then plot a histogram.
# Adding size
ggplot(data = mtcars, aes(x = hp, y = mpg, size =
disp)) +
geom_point() +
labs(title = "Miles per Gallon vs Horsepower",
x = "Horsepower",
y = "Miles per Gallon")
# Histogram plot
ggplot(data = mtcars, aes(x = hp)) +
geom_histogram(binwidth = 5) +
labs(title = "Histogram of Horsepower",
       x = "Horsepower",
       y = "Count")
Facet Layer:
• The facet layer of ggplot2 is used to split the data into
subsets of the entire dataset, and it allows the subsets
to be visualized on the same plot. Here we separate
rows according to transmission type and separate
columns according to cylinders.
# Facet Layer
p <- ggplot(data = mtcars, aes(x = hp, y = mpg, shape = factor(cyl))) +
  geom_point()
# Separate rows according to transmission type
p + facet_grid(am ~ .) +
  labs(title = "Miles per Gallon vs Horsepower",
       x = "Horsepower",
       y = "Miles per Gallon")
# Separate columns according to cylinders
p + facet_grid(. ~ cyl) +
  labs(title = "Miles per Gallon vs Horsepower",
       x = "Horsepower",
       y = "Miles per Gallon")
Refer to https://www.geeksforgeeks.org/data-visualization-with-r-and-ggplot2/ for the other layers.
Advanced Visualization using ggplot2
Heatmap
• A heatmap depicts the relationship between two
attributes of a data frame as a color-coded tile.
• A heatmap produces a grid covering multiple
attributes of the data frame, representing the
relationship between two attributes at
a time.
• Heatmaps are a common visualization tool in both
data analysis and visualization.
• They are especially beneficial for displaying and
examining relationships and patterns in tabular
data.
• Reference:
https://r-charts.com/correlation/heat-map-ggplot2/
# install.packages("ggplot2")
library(ggplot2)
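One possible heatmap, sketched with geom_tile() on the correlation matrix of the built-in mtcars data. The data set, column names, and title are assumptions chosen for illustration, and the plot is drawn only if ggplot2 is installed.

```r
# Correlation matrix reshaped to long format: one row per pair of variables
corr_long <- as.data.frame(as.table(cor(mtcars)))
names(corr_long) <- c("var1", "var2", "correlation")
head(corr_long)

# Draw one color-coded tile per pair of variables
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(corr_long, aes(x = var1, y = var2, fill = correlation)) +
    geom_tile(color = "white") +
    labs(title = "Correlation heatmap of mtcars", x = NULL, y = NULL)
  print(p)
}
```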
Mosaic Map
• Mosaic plots are used to show symmetries for tables that are
divided into two or more conditional distributions.
• Mosaic plots are a great way to visualize hierarchical data.
• The elements to be visualized are represented by a collection of
rectangles of different sizes and colors that together make up a
table. What makes mosaic charts unique is the arrangement of the
elements: where there is a hierarchy, those elements are collected
and labeled together, perhaps even with subcategories.
• Mosaic plots can therefore be used to plot categorical data very
effectively, with the area of each rectangle showing the relative
proportions.
• The package that is used for this is vcd.
Syntax:
mosaic(x,shade=NULL,legend=NULL, main = NULL,..)
Parameters:
• x: The variable that holds the dataset/table. We pass the name of
our dataset here.
• shade: A boolean value. If it is set to TRUE, we get a colored
plot. Its default value is NULL.
• legend: A boolean value. If it is set to TRUE, a legend is shown
alongside the mosaic plot. Its default value is NULL.
• main: A string giving the title of the mosaic plot.
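The mosaic() syntax can be sketched on HairEyeColor, a contingency table that ships with R. The choice of data set and title is an assumption, and the plot is drawn only if the vcd package is installed.

```r
# HairEyeColor is a built-in three-way contingency table
str(HairEyeColor)

if (requireNamespace("vcd", quietly = TRUE)) {
  # shade = TRUE colors the tiles, legend = TRUE adds a legend
  vcd::mosaic(HairEyeColor,
              shade = TRUE,
              legend = TRUE,
              main = "Hair and eye color by sex")
}
```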
library(ggmosaic)
data("fly")
head(fly)
ggplot(data = fly) +
  geom_mosaic(aes(x = product(rude_to_recline), fill = do_you_recline)) +
  theme_mosaic()
For categorical data, geom_mosaic() has the capability to produce bar charts, stacked bar charts, mosaic plots, and double decker plots.
3D Graphs
3D scatterplots
• Use the scatterplot3d package to create 3D scatterplots;
it can plot R scatterplots in 3D using the
scatterplot3d() function.
• The example below uses the plotly package instead.
# 3D Scatterplot
library(plotly)
plot_ly(data = mtcars, x = ~mpg, y = ~hp, z = ~cyl, color = ~gear)
Reference: https://rpubs.com/oox/graphs (3D Scatterplots)
3D Pie Chart
Interactive Visualizations
• Use the Plotly and Leaflet packages in combination with HTML
widgets.
• These packages allow you to create interactive plots
directly from R code, without the need for a
separate web application framework.
• In the example, we create a scatter plot using the plot_ly() function,
specifying the iris dataset, the x-axis (Sepal.Length) and y-axis
(Petal.Length) variables, and the type of plot (scatter) with mode
= "markers".
• We also add color and size to the markers using
the add_markers() function.
• Finally, we add labels and a title to the plot using the layout()
function and display the plot using the plot object.
# Install packages
install.packages("plotly")
install.packages("leaflet")
# Load packages
library(plotly)
library(leaflet)
data(iris)
# Create a scatter plot
plot <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Petal.Length,
                type = "scatter", mode = "markers")
# Add color and size to markers
plot <- plot %>% add_markers(color = ~Species, size = ~Sepal.Width * 2)
# Add labels and title
plot <- plot %>% layout(xaxis = list(title = "Sepal Length"),
                        yaxis = list(title = "Petal Length"),
                        title = "Iris Dataset")
# Display the plot
plot
Reference:
https://www.geeksforgeeks.org/interactive-charts-using
3. Transformation Using R
# df is assumed to hold the iris data set
df7 <- df %>%
  group_by(Species) %>%
  summarise(Sepal.Length = n()) %>%
  arrange(desc(Sepal.Length))
Output:
  Species    Sepal.Length
1 setosa               50
2 versicolor           50
3 virginica            50
Filtering, Reshaping
• The filter() function is a powerful tool for
subsetting data frames based on specified
conditions.
• It allows you to extract rows that meet specific
criteria, providing a flexible and efficient way
to manipulate data.
• The filter() function is part of the dplyr package.
• Syntax: filter(data_frame, condition)
• Parameters:
- data_frame: The input data frame or tibble.
- condition: The logical condition used to filter rows.
# Load necessary library
library(dplyr)
# Create a simple dataset
employees <- data.frame(
  ID = 1:10,
  Name = c("John", "Jane", "Bill", "Anna", "Tom", "Sue", "Mike", "Sara",
           "Alex", "Nina"),
  Department = c("HR", "Finance", "IT", "Finance", "IT", "HR", "IT", "Finance",
                 "HR", "Finance"),
  Salary = c(50000, 60000, 70000, 65000, 72000, 48000, 75000, 67000, 52000,
             69000))
# Print the dataset
print(employees)
# Filter employees in the IT department
it <- filter(employees, Department == "IT")
# Print the result
print(it)
# Filter employees in the Finance department with salary greater than 65000
high_paid_finance_employees <- filter(employees, Department == "Finance" &
                                        Salary > 65000)
print(high_paid_finance_employees)
Use of | (or) and %in% in filter()
hr_it_employees <- employees %>% filter(Department == "HR" |
                                          Department == "IT")
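A condition that joins several == tests on the same column with | can also be written with %in%. The sketch below uses a shortened, made-up version of the employees data; the base R line works without dplyr, and the dplyr call runs only if the package is installed.

```r
# Shortened illustrative data set
employees <- data.frame(
  Name = c("John", "Jane", "Bill", "Anna"),
  Department = c("HR", "Finance", "IT", "Finance")
)

# Base R: keep rows whose Department is one of the listed values
hr_it <- employees[employees$Department %in% c("HR", "IT"), ]
print(hr_it)

# The same condition inside dplyr::filter()
if (requireNamespace("dplyr", quietly = TRUE)) {
  hr_it_dplyr <- dplyr::filter(employees, Department %in% c("HR", "IT"))
  print(hr_it_dplyr)
}
```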
Reshaping
• Transpose of a Matrix
• Joining Rows and Columns
• Merging of Data Frames
• Melting and Casting
Joining Rows and Columns in Data Frame
• cbind()
• rbind()
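A minimal sketch of joining columns and rows, followed by a matrix transpose with t(). All data values and variable names are made up for illustration.

```r
# Illustrative data frames
df1 <- data.frame(ID = 1:3, Name = c("John", "Jane", "Bill"))
extra_col <- data.frame(Salary = c(50000, 60000, 70000))

# cbind() joins columns: the number of rows must match
by_col <- cbind(df1, extra_col)

# rbind() joins rows: the column names must match
extra_row <- data.frame(ID = 4L, Name = "Anna", Salary = 65000)
by_row <- rbind(by_col, extra_row)
print(by_row)

# t() transposes a matrix: a 2 x 3 matrix becomes 3 x 2
m <- matrix(1:6, nrow = 2)
t(m)
```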
Merging two Data frames
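Two data frames can be merged on a shared key column with the base R merge() function. The sketch below uses made-up data; the all, all.x, and by arguments are standard merge() arguments.

```r
# Illustrative data frames sharing the key column ID
employees <- data.frame(ID = c(1, 2, 3, 4),
                        Name = c("John", "Jane", "Bill", "Anna"))
departments <- data.frame(ID = c(2, 3, 4, 5),
                          Department = c("Finance", "IT", "Finance", "HR"))

# Inner join: only IDs present in both data frames
inner <- merge(employees, departments, by = "ID")

# Left join: every row of employees, NA where no department matches
left <- merge(employees, departments, by = "ID", all.x = TRUE)

# Full join: every ID from either data frame
full <- merge(employees, departments, by = "ID", all = TRUE)

print(inner)
print(left)
print(full)
```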
Melting and Casting
• Data reshaping involves many steps to obtain the desired or
required format. One popular method is melting the data, which
converts each row into a unique id-variable combination, and then
casting it. The two functions used for this process are:
• melt(): It is used to convert a data frame into a molten data frame.
• Syntax: melt(data, …, na.rm=FALSE, value.name="value")
• where, data: data to be melted
• … : arguments
• na.rm: converts explicit missings into implicit missings
• value.name: name of the variable used to store the values
• cast(): It is used to reshape the molten data.
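Melting and casting can be sketched with the reshape2 package, which is an assumption here: its melt() matches the syntax above, and its dcast() is the data frame counterpart of cast(). The data is made up, and the calls run only if reshape2 is installed.

```r
# Illustrative wide data: one row per student, one column per subject
wide <- data.frame(id = 1:2, math = c(90, 75), science = c(85, 80))

if (requireNamespace("reshape2", quietly = TRUE)) {
  # Melt: one row per id-variable combination
  molten <- reshape2::melt(wide, id.vars = "id", value.name = "score")
  print(molten)

  # Cast: rebuild the wide layout from the molten data
  recast <- reshape2::dcast(molten, id ~ variable, value.var = "score")
  print(recast)
}
```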
• Data Consistency - Data is continuously collected from a range of sources, which
increases the inconsistencies in metadata. This makes organizing and understanding
data a huge challenge. Data transformation makes data sets simpler to understand and
organize.
• Better Quality Data - The transformation process also enhances the quality of data, which
can then be used to acquire business intelligence.
• Faster Data Access - It is quicker and easier to retrieve data that has been transformed
into a standardized format.
4. Exploratory Data Analysis (EDA)
Fundamentals of EDA
• EDA helps to understand the dataset, showing how many features
there are, the type of data in each feature, and how the data is
spread out, which helps in choosing the right methods for
analysis.
• It helps you identify unusual data points.
• EDA helps to identify hidden patterns and relationships between
different data points, which helps us in model building.
• It helps you discover how different parts of the data are
connected.
• https://www.analyticsvidhya.com/blog/2021/11/fundamentals-of-exploratory-data-analysis/
Univariate analysis
• Univariate analysis focuses on studying one variable to
understand its characteristics.
• It helps describe the data and find patterns within a
single feature.
• Common methods include histograms to show data
distribution, box plots to detect outliers and
understand data spread, and bar charts for
categorical data.
• Summary statistics like mean, median, mode,
variance, and standard deviation help describe the
central tendency and spread of the data.
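The summary statistics named above can be computed with base R. A sketch on the mpg column of the built-in mtcars data; the mode is computed by hand because R has no built-in function for the statistical mode.

```r
# One numeric variable from the built-in mtcars data
x <- mtcars$mpg

mean(x)      # arithmetic mean
median(x)    # middle value
var(x)       # sample variance
sd(x)        # sample standard deviation

# Mode: the most frequent value (the first one if several values tie)
mode_value <- as.numeric(names(which.max(table(x))))
mode_value

# summary() gives minimum, quartiles, median, mean, and maximum at once
summary(x)
```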
Univariate graphical
• Non-graphical methods are quantitative and
objective, but they cannot give the complete
picture of the data. Graphical methods, which
involve a degree of subjective analysis, are
therefore also required.
• Histogram: The most basic graph is the histogram, a bar plot
in which each bar represents the frequency
(count) or proportion (count/total count) of cases for a range
of values. Histograms are one of the simplest ways to quickly
learn a lot about your data, including central tendency, spread,
modality, shape and outliers.
• Stem-and-leaf plots: A simple substitute for a histogram is the
stem-and-leaf plot. It shows all data values and the
shape of the distribution.
A stem and leaf plot
• A stem and leaf plot is a graphical representation used to
organize and display quantitative data in a semi-tabular
form. It helps in visualizing the distribution of the data set
and retains the original data values, making it easy to
identify the shape, central tendency, and variability of the
data.
• A stem and leaf plot splits each data point into a "stem" and
a "leaf." The "stem" represents the leading digits, while the
"leaf" represents the trailing digit. This separation makes it
easy to organize data and see patterns.
• For example: For the data set: 23, 25, 27, 32, 34, 35, 41, 42
• Stems: 2, 3, 4
• Leaves: 3, 5, 7 | 2, 4, 5 | 1, 2
Stem | Leaves
2 | 3, 5, 7
3 | 2, 4, 5
4 | 1, 2
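Base R can draw both plots for this data set with stem() and hist(). A minimal sketch; the variable name and titles are illustrative, and the exact text layout printed by stem() may differ from the table above.

```r
# The data set from the example above
scores <- c(23, 25, 27, 32, 34, 35, 41, 42)

# Stem-and-leaf plot printed as text in the console
stem(scores)

# Histogram of the same values for comparison
hist(scores,
     main = "Histogram of scores",
     xlab = "Score")
```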
• Boxplots: Boxplots are excellent at presenting
information about central tendency. They show robust
measures of location and spread and provide
information about symmetry and outliers, although
they can be misleading about aspects like
multimodality.
• Quantile-normal plots: Also called the QN plot, or more
generally the quantile-quantile (QQ) plot. It
allows detection of non-normality and diagnosis of
skewness and kurtosis.
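Both plots are available in base R. A sketch on the mpg column of the built-in mtcars data; the titles are illustrative.

```r
# Boxplot: median, quartiles, whiskers, and any outliers
boxplot(mtcars$mpg,
        main = "Boxplot of mpg",
        ylab = "Miles Per Gallon")

# Values that the boxplot rule flags as outliers (may be empty)
outliers <- boxplot.stats(mtcars$mpg)$out
outliers

# Quantile-normal plot with a reference line
qqnorm(mtcars$mpg, main = "Normal Q-Q plot of mpg")
qqline(mtcars$mpg)
```

Points that stay close to the reference line suggest the data is roughly normal; systematic curvature suggests skewness or heavy tails.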
Univariate Non-graphical:
• The goal is to understand the underlying sample distribution of
the data and make observations about the population.
• Central tendency: Commonly used measures of central
tendency are the mean, the median, and sometimes the
mode. For skewed distributions, or when there is concern about
outliers, the median may be preferred.
• Spread: Spread indicates how far the data values lie
from the center. Standard deviation and variance are two useful
measures of spread.
• Skewness and kurtosis: Skewness is the measure of
asymmetry, and kurtosis is a more subtle measure of
peakedness compared to a normal distribution.
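Base R has no built-in skewness or kurtosis function, so the sketch below computes the moment-based versions by hand. This is one common definition; other packages may apply small-sample corrections and give slightly different values.

```r
# One numeric variable from the built-in mtcars data
x <- mtcars$mpg

# Central moments of order 2, 3, and 4
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
m4 <- mean((x - mean(x))^4)

# Skewness: 0 for a symmetric distribution
skewness <- m3 / m2^1.5

# Excess kurtosis: 0 for a normal distribution
excess_kurtosis <- m4 / m2^2 - 3

skewness
excess_kurtosis
```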
Bivariate analysis
Multivariate analysis
• Multivariate analysis examines the relationships between two
or more variables in the dataset.
• It aims to understand how variables interact with one
another, which is crucial for most statistical modeling
techniques.
• It includes techniques like pair plots, which show the
relationships between multiple variables at once,
helping to see how they interact.
• Another technique is Principal Component Analysis (PCA), which
reduces the complexity of large datasets by simplifying them
while keeping the most important information.
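Both techniques are available in base R through pairs() and prcomp(). A sketch on four columns of the built-in mtcars data; the choice of columns is an assumption made for illustration.

```r
# A few numeric variables from the built-in mtcars data
vars <- mtcars[, c("mpg", "hp", "wt", "disp")]

# Pair plot: a scatter plot for every pair of variables
pairs(vars, main = "Pair plot of selected mtcars variables")

# PCA on standardized variables (scale. = TRUE)
pca <- prcomp(vars, scale. = TRUE)

# Proportion of variance explained by each principal component
summary(pca)
```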
Multivariate graphical:
• Scatterplot: For two quantitative variables, the essential
graphical EDA technique is the scatterplot, which has
one variable on the x-axis, one on the y-axis, and
a point for every case in your dataset.
• Run chart: A line graph of data plotted over time.
• Heat map: A graphical representation of data where
values are depicted by color.
• Multivariate chart: A graphical representation of
the relationships between factors and a response.
• Bubble chart: A data visualization that displays
multiple circles (bubbles) in a two-dimensional plot.
Multivariate Non-graphical:
1. Correlation coefficient, which measures how strongly two
variables are related, commonly using Pearson's
correlation for linear relationships.
2. Cross-tabulation, or contingency tables, which show the
frequency distribution of two categorical variables and
help understand their relationship.
• For one categorical variable and one quantitative
variable, we compute statistics for the quantitative variable
separately for every level of the categorical variable, then
compare the statistics across the levels of the categorical
variable.
• Comparing the means is an informal version of ANOVA,
and comparing medians is a robust version of one-way
ANOVA.
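These non-graphical techniques map onto the base R functions cor(), table(), and tapply(). A sketch on the built-in mtcars data; the choice of columns is an assumption made for illustration.

```r
# 1. Pearson correlation between two quantitative variables
cor(mtcars$mpg, mtcars$wt, method = "pearson")

# 2. Cross-tabulation of two categorical variables
cross_tab <- table(cylinders = mtcars$cyl, gears = mtcars$gear)
cross_tab

# Statistics of a quantitative variable for each level of a categorical one
mean_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
median_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, median)
mean_by_cyl
median_by_cyl
```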
How it works
• Generate questions: Formulate questions
about the data
• Inspect the data: Examine the data from
multiple perspectives
• Visualize the data: Use charts and graphs to
represent the data
• Transform the data: Apply statistical and
mathematical methods to the data
• Identify patterns: Look for relationships
between different parts of the data
• Identify outliers: Find any unusual data points
• Refine questions: Use what you've learned to
generate new questions or refine existing ones
The first step in any data analysis project is to clearly understand the problem
you’re trying to solve and the data you have. This involves asking key questions
such as:
• What is the business goal or research question?
• What are the variables in the data and what do they represent?
Use of Tilde ~ in R
• The tilde symbol (~) is used within the formulas of
statistical models. It defines the relationship between
the dependent variable and the independent
variables in a statistical model formula in the R
programming language.
• The left side of the tilde specifies the
target variable (dependent variable or outcome),
and the right side of the tilde specifies the
predictor variables (independent variables).
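A formula with a tilde can be passed to a model function such as lm(). A minimal sketch on the built-in mtcars data; the choice of mpg as the target and wt and hp as predictors is an assumption made for illustration.

```r
# mpg is the target; wt and hp are the predictors
f <- mpg ~ wt + hp
class(f)      # "formula"
all.vars(f)   # variables named in the formula

# Fit a linear model described by the formula
model <- lm(f, data = mtcars)
coef(model)
```

The same notation appears in plotting functions, for example the formula mpg ~ cyl used in boxplot().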
Thank You