
Data Manipulation Using R

Module 3 of the CSE1006 course covers data manipulation techniques in R, focusing on data sorting, identifying and removing duplicates, and cleaning data. Key functions discussed include sort() and order() for sorting vectors and data frames, as well as duplicated() and unique() for handling duplicate data. The module also emphasizes the importance of data deduplication to maintain data integrity and reduce redundancy.


Course code : CSE1006

Course title : Foundations of Data Analytics

Module-3
Data manipulation

22-02-2025 Dr. V. Srilakshmi 1


Module-3
• Data manipulation:
➢Data sorting
➢Finding and removing duplicate records
➢Cleaning data
➢Recoding data
➢Merging data


Data Sorting
• R provides several ways to sort data in either ascending or
descending order.
• Data analysts and data scientists use the order() and sort()
functions to sort data, depending on the structure of the data.
• sort() returns the sorted values of a vector, while order() returns
the index positions that put the values in sorted order. This makes
order() suitable for sorting vectors as well as the rows of a matrix or
data frame, in ascending or descending order.


• Syntax of sort():
sort(x, decreasing = FALSE, na.last = NA)
Where,
• x: vector to be sorted
• decreasing: logical value; TRUE sorts in descending order (default FALSE)
• na.last: controls the treatment of NA values; TRUE puts them at the end,
FALSE puts them at the beginning, and NA (the default) removes them


• Example 1:
># Creating a vector
>x <- c(7, 4, 3, 9, 1.2, -4, -5, -8, 6, NA)

># Calling sort() function (NA is dropped by default)
>sort(x)

OUTPUT:
[1] -8.0 -5.0 -4.0 1.2 3.0 4.0 6.0 7.0 9.0


• Example 2:
># Creating a vector
>x <- c(7, 4, 3, 9, 1.2, -4, -5, -8, 6, NA)

># Calling sort() function to print in decreasing order
>sort(x, decreasing = TRUE)
[1] 9.0 7.0 6.0 4.0 3.0 1.2 -4.0 -5.0 -8.0

># Calling sort() function to print NA at the end
>sort(x, na.last = TRUE)
[1] -8.0 -5.0 -4.0 1.2 3.0 4.0 6.0 7.0 9.0 NA


• Example 3:
># Create a character vector
>names <- c("Vipul", "Raj", "Singh", "Jhala")

># Sort the vector in alphabetical order
>sorted_names <- sort(names)
>print(sorted_names)
[1] "Jhala" "Raj" "Singh" "Vipul"


• Syntax of order():
>order(x, decreasing = FALSE, na.last = TRUE,
method = c("auto", "shell", "radix"))
• The first argument x is the vector to sort by, for example a column of a
data frame. Further vectors of the same length can be supplied to break ties.
• The second argument decreasing is a logical value that determines
whether the sorting should be in decreasing order (TRUE) or
increasing order (FALSE, the default).
• The third argument na.last determines where missing values are placed:
at the end (TRUE, the default) or at the beginning (FALSE). If it is NA,
missing values are removed.
• The fourth argument method is an optional argument that specifies
the sorting algorithm to be used.
• The available options for order() are "auto" (default), "shell" and "radix".
The additional option "quick" is available only in sort().
• The "auto" option selects "radix" for short numeric vectors, integer
vectors, logical vectors and factors, and "shell" otherwise.
• Overall, order() returns the index positions that sort a vector, with
options for sorting order, handling of missing values, and sorting
algorithm.


• Example:
>y = c(4,12,6,7,2,9,5)
>order(y)
• In this case, the output would be: 5 1 7 3 4 6 2, which
indicates that the smallest value is at index 5, the next
smallest is at index 1, and so on.
• Note that the original vector "y" is not modified by this
function.


• Example:
>y = c(4,12,6,7,2,9,5)
>y[order(y)]
• The order() function returns the indices of the sorted
values, so when we use these indices to subset "y" with
square brackets, we get the sorted values of "y".
• Therefore, the output of this code will be:
2 4 5 6 7 9 12.


• Example:
>x <- c(8,2,4,1,-4,NA,46,8,9,5,3)
>order(x,na.last = TRUE)
• The "na.last = TRUE" argument specifies that missing values
(NA) should be placed at the end of the sorted vector.
• So, the output of this code would be:
5 4 2 11 3 10 1 8 9 7 6.


• Example:
>x <- c(8,2,4,1,-4,NA,46,8,9,5,3)
>order(x,decreasing=TRUE,na.last=TRUE)
• The decreasing=TRUE argument specifies that the values should be
sorted in descending order instead of ascending order.
• The na.last=TRUE argument specifies that missing values should be
placed at the end of the sorted vector.
• Overall, this code returns the indices that would sort x in
descending order; x itself is not changed.
• So, the output of this code would be: 7 9 1 8 10 3 11 2 4 5 6


Sorting a data frame by using order()
• The order() function is used to sort a data frame based on a particular
column of the data frame.
• Syntax:
dataframe[order(dataframe$column_name, decreasing = TRUE), ]
• where
• dataframe is the input data frame
• column_name is the column on which the data frame is sorted
• decreasing specifies the sort order: if it is TRUE, the data frame is
sorted in descending order; otherwise, in increasing order
• order() returns the index positions of the elements in sorted order;
using them as row indices returns the sorted data frame.
• Example 1: R program to create a data frame with 2 columns and display it
sorted in decreasing order, first by subjects and then by roll no.
# create dataframe with roll no and subjects columns
data = data.frame(rollno = c(1, 5, 4, 2, 3),
                  subjects = c("java", "python", "php", "sql", "c"))
print(data)
print("sort the data in decreasing order based on subjects ")
print(data[order(data$subjects, decreasing = TRUE), ] )
print("sort the data in decreasing order based on rollno ")
print(data[order(data$rollno, decreasing = TRUE), ] )


• Example 2: R program to create a data frame with 3 columns
named roll no, names, and subjects, and display it sorted in
increasing order by subjects, then by roll no, and then by names.
# create dataframe with roll no, names and subjects columns
data=data.frame(rollno = c(1, 5, 4, 2, 3),
                names = c("sravan", "bobby","pinkey", "rohith","ganesh"),
                subjects = c("java", "python","php", "sql", "c"))

print(data)
print("sort the data in increasing order based on subjects")
print(data[order(data$subjects, decreasing = FALSE), ] )
print("sort the data in increasing order based on rollno.")
print(data[order(data$rollno, decreasing = FALSE), ] )
print("sort the data in increasing order based on names")
print(data[order(data$names, decreasing = FALSE), ] )
• Example 3: R program to create a vector with 10 elements
(positive, negative and NA). Display the sorted vector in
a) Increasing order using sort()
b) Decreasing order using order()
c) Also display the element indexes in sorted order
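One possible way to approach Example 3 is sketched below. The vector values are made up for illustration, and the variable names (v, asc, desc, idx) are not from the slides.

```r
# Hypothetical vector with 10 elements: positive, negative and NA values
v <- c(12, -3, 7, NA, 0.5, -8, 25, NA, 4, -1)

# a) Increasing order using sort(); NA values are dropped by default
asc <- sort(v)
print(asc)

# b) Decreasing order using order(); na.last = NA drops the NA positions
desc <- v[order(v, decreasing = TRUE, na.last = NA)]
print(desc)

# c) Element indexes in increasing order; NA positions are placed last
idx <- order(v)
print(idx)
```

Here order(v) keeps the two NA positions (4 and 8) at the end because na.last defaults to TRUE for order(), whereas sort() drops them because its default is na.last = NA.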


• To list all the built-in data sets in R:
>data()

• Open the mtcars data set and perform the following operations:
>mtcars

• To print the structure of the data set:
>str(mtcars)


Exercise:
• Sort the observations of the dataset "mtcars" in increasing order based on the
values in the column "mpg"
• Sort the observations of the dataset "mtcars" in decreasing order based on the
values in the column "cyl"
• Sort the observations of the dataset "mtcars" in increasing order based on the
values in both columns "mpg" and "cyl"
• Sort the observations of the dataset "mtcars" in decreasing order based on the
values in both columns "mpg" and "cyl"
• Sort the observations of the dataset "mtcars" by column "mpg" in increasing order
and column "cyl" in decreasing order
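Sorting by more than one column works by passing several vectors to order(); later vectors only break ties in earlier ones. The sketch below shows one possible approach to the mixed-direction case (mpg increasing, cyl decreasing). Negating a numeric column to reverse its direction is a common idiom, not something prescribed by the slides.

```r
# mtcars ships with base R (datasets package)
# Sort by mpg ascending; among rows with equal mpg, sort by cyl descending.
# Negating a numeric column reverses its direction inside order().
sorted <- mtcars[order(mtcars$mpg, -mtcars$cyl), ]
print(head(sorted[, c("mpg", "cyl")]))
```

Negation only works for numeric columns. For character columns, order() also accepts a vector for decreasing (one value per sort key) when method = "radix" is used.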


Identify and Remove Duplicate Data in R
• Storage administrators struggle to handle rapidly growing volumes of
documents, audio, video, images and large email attachments.
• Adding storage is not always the best solution.
• Many companies are turning to data reduction technologies such as
data deduplication.
• Duplicate data typically arises when:
➢A system user adds the same entry multiple times, for example by
re-registering after forgetting their details.
➢The same data is stored at multiple locations or in multiple tables.
• Duplication is one of the problems that causes inconsistency in databases.
• Data redundancy is costly to address, as it requires:
1. Additional storage
2. Synchronization between databases
3. Design work to align the information represented by different
presentations of the same data


Problems with Data Redundancy & Duplication
• Storing the same information several times leads to:
1. Wasted storage space
2. More difficult database updates
3. The possibility of inconsistent data


Data Deduplication:
• Data deduplication, also called intelligent compression
or single-instance storage, is a technique that eliminates redundant
copies of data so that only one instance of each item is stored.


Identify and Remove Duplicate Data
• A dataset can contain duplicate values; to keep it accurate and free of
redundancy, duplicate rows need to be identified and removed.
• First we check whether duplicate data is present in our data; if it is,
we remove it.


Identifying Duplicate Data in vector: duplicated()
• We can use the duplicated() function to find which values of a vector are
duplicates, and how many duplicates there are.
• Syntax : duplicated(vector_name, fromLast = FALSE)
• The R function duplicated() returns a logical vector where TRUE marks
the elements of a vector (or rows of a data frame) that duplicate an earlier one.
• fromLast is FALSE by default; if TRUE, duplicated() searches from the end,
so the last occurrence is treated as the original.
• Example 1:
# Create a sample vector with duplicate elements
vector1 <- c(1, 2, 3, 4, 4, 5)
# Identify duplicate elements
duplicated(vector1)
# OUTPUT: [1] FALSE FALSE FALSE FALSE TRUE FALSE
# count of duplicated data
sum(duplicated(vector1)) #1
• Example 2:
# Create a sample vector with duplicate elements
vector1 <- c(1, 2, 3, 4, 4, 5)
# Identify duplicate elements, searching from the end
duplicated(vector1,fromLast=TRUE)
# OUTPUT: [1] FALSE FALSE FALSE TRUE FALSE FALSE

# count of duplicated data
sum(duplicated(vector1)) #1
# identify location or index of the duplicate element
which(duplicated(vector1)) #5
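Because duplicated() marks only the later (or, with fromLast = TRUE, only the earlier) copies, combining both directions flags every element whose value occurs more than once. The following is a small sketch of that idea; the name all_copies is illustrative.

```r
vector1 <- c(1, 2, 3, 4, 4, 5)

# TRUE for every element whose value occurs more than once,
# including the first occurrence
all_copies <- duplicated(vector1) | duplicated(vector1, fromLast = TRUE)
print(all_copies)

# Positions of all copies
print(which(all_copies))

# Values that occur exactly once
print(vector1[!all_copies])
```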


Removing Duplicate Data in vector: unique()
• We can remove duplicate data from a vector by using the unique()
function, which returns each distinct value only once.
• Syntax : unique(vector_name)
• Example:
# Create a sample vector with duplicate elements
vector1 <- c(1, 2, 3, 4, 4, 5)
# Remove duplicate elements
unique(vector1) #1 2 3 4 5


• Identifying Duplicate Data in a data frame:
• For identification, we will use the duplicated() function.
• Syntax : duplicated(dataframe)
• Approach:
• Create the data frame
• Pass it to the duplicated() function
• The function returns a logical vector with TRUE for each row that
duplicates an earlier row
• Apply the sum() function to get the number of duplicates.

• Removing Duplicate Data in a data frame:
• We use the unique() function from base R or the distinct() function
from the dplyr package.
• Approach:
• Create the data frame
• Select the rows which are unique
• Retrieve those rows
• Display the result


• Identifying Duplicate Data in a data frame:
# Creating a sample data frame of students and their marks in respective subjects.
student=data.frame(name=c("Ram","Geeta","John","Paul","Cassie","Geeta","Paul"),
                   maths=c(7,8,8,9,10,8,9),
                   science=c(5,7,6,8,9,7,8),
                   history=c(7,7,7,7,7,7,7))

# Printing data
print(student)
# Rows 6 and 7 repeat rows 2 and 4
print(duplicated(student))      # FALSE FALSE FALSE FALSE FALSE TRUE TRUE
print(sum(duplicated(student))) # 2
• Removing rows that are duplicated in a single column:
# create dataframe
data=data.frame(names=c("manoj","bobby","sravan", "deepu","manoj","bobby") ,
                id=c(1,2,3,1,1,2),
                subject=c("java","python","php","html","java","python"))
print(data)
# remove rows with a duplicate value in the subject column
print(data[!duplicated(data$subject), ])
# remove rows with a duplicate value in the names column
print(data[!duplicated(data$names), ])
# remove rows with a duplicate value in the id column
print(data[!duplicated(data$id), ])
• Removing Duplicate Data in a data frame:
• Method 1: Using unique()
• We use unique() to get the rows having unique values in our data.
• Syntax: unique(dataframe)
• Example:
# Creating a sample data frame of students and their marks in respective subjects.
student=data.frame(name=c("Ram","Geeta","John","Paul","Cassie","Geeta", "Paul"),
                   maths=c(7,8,8,9,10,8,9),
                   science=c(5,7,6,8,9,7,8),
                   history=c(7,7,7,7,7,7,7))
# Printing data
print(student)
# Printing data without duplicates using unique
print(unique(student))
• Method 2 below needs the dplyr package; install it once with:
>install.packages("dplyr")
• Method 2: Using distinct()
• This function is available in the dplyr package and is used to get the
unique rows of a data frame.
• We can remove rows that are duplicated across all columns, and we can
also remove rows that are duplicated in particular columns.
• Syntax: distinct(dataframe, column_names, .keep_all = FALSE)
• Where dataframe -> data in use, column_names -> optional columns used to
decide uniqueness (all columns if omitted), and .keep_all -> if TRUE, keeps
all other columns of the first row for each distinct value; if FALSE (the
default), only the named columns are returned


# Creating a sample data frame of students and their marks in respective subjects.
student=data.frame(name=c("Ram","Geeta","John","Paul","Cassie","Geeta", "Paul"),
                   maths=c(7,8,8,9,10,8,9),
                   science=c(5,7,6,8,9,7,8),
                   history=c(7,7,7,7,7,7,7))

# Printing data
print(student)

# Printing data without duplicates using distinct
print(dplyr::distinct(student))
# load the package
library(dplyr)
# create dataframe
data=data.frame(names=c("manoj","bobby","sravan","deepu","manoj","bobby") ,
                id=c(1,2,3,4,1,2),
                subjects=c("java","python","php","html","java","python"))
# remove all duplicate rows
print(dplyr::distinct(data))
# distinct values of the subjects column (only that column is returned)
print(dplyr::distinct(data,subjects))
# distinct values of the names column (only that column is returned)
print(dplyr::distinct(data,names))
• Example: keeping all columns while removing rows with a duplicate maths value
# Creating a sample data frame of students and their marks in respective subjects.
student=data.frame(name=c("Ram","Geeta","John","Paul", "Cassie","Geeta","Paul"),
                   maths=c(7,8,8,9,10,8,9),
                   science=c(5,7,6,8,9,7,8),
                   history=c(7,7,7,7,7,7,7))

# Printing data
print(student)
# Keep the first row for each distinct maths value, with all columns
print(dplyr::distinct(student,maths,.keep_all = TRUE))
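For comparison, the same selection can be sketched in base R without dplyr by negating duplicated() on the column of interest. This should pick the same rows as distinct(student, maths, .keep_all = TRUE), although the row names are kept from the original data frame rather than renumbered. The name first_per_maths is illustrative.

```r
student <- data.frame(name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
                      maths = c(7, 8, 8, 9, 10, 8, 9),
                      science = c(5, 7, 6, 8, 9, 7, 8),
                      history = c(7, 7, 7, 7, 7, 7, 7))

# Keep the first row for each distinct maths value, with all columns
first_per_maths <- student[!duplicated(student$maths), ]
print(first_per_maths)
```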


Data Cleaning

• Data cleaning in R is the process of transforming raw data into
consistent data that can be easily analysed.
• It aims to improve the reliability of the statistical statements that
are derived from the data.
• It therefore influences those statistical statements, and improves
data quality and overall productivity.


• Purpose of Data Cleaning:
• The following are the various purposes of data cleaning in R:
• Eliminate errors
• Eliminate redundancy
• Increase data reliability
• Deliver accuracy
• Ensure consistency
• Assure completeness
• Standardize your approach


• Overview of a typical data analysis chain: [figure not reproduced]


• For a better understanding, let us work through one example:
1) Creation of Example Data (Data Frame)
2) Modify Column Names
3) Format Missing Values
4) Remove Empty Rows & Columns
5) Remove Rows with Missing Values
6) Remove Duplicates
7) Modify Classes of Columns
8) Detect & Remove Outliers
9) Remove Spaces in Character Strings
1) Creation of Example Data:
# Create example data frame
# (data.frame() makes the repeated name x1 unique: x1, x1.1, x1.2)
data <- data.frame(x1 = c(1:4, 99999, 1, NA, 1, 1, NA),
                   x1 = c(1:5, 1, "NA", 1, 1, "NA"),
                   x1 = c(letters[c(1:3)], "x x", "x", " y y y", "x", "a", "a", NA),
                   x4 = "",
                   x5 = NA)
print(data) # Printing data frame


• is.na() function: finds the NAs in the given data set.
• It can be used to identify the NAs in a specific data frame column or in
a single row. For example, to print the NAs in row 3 of the data frame:
>is.na(data[3,])
• It can also be used to identify the location of the NAs in a data frame.
• Arithmetic functions on missing values yield missing values.
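The uses of is.na() listed above can be sketched on a small data frame. The data frame toy and its values are made up for illustration. which(..., arr.ind = TRUE) is one way to locate the NAs, not necessarily the call used on the original slide.

```r
# Hypothetical data frame with a few missing values
toy <- data.frame(x1 = c(1, 2, NA, 4),
                  x2 = c("a", NA, "c", NA))

print(is.na(toy))      # logical matrix, TRUE where a value is missing
print(is.na(toy$x1))   # NAs in one specific column

# Row/column positions of every NA
na_pos <- which(is.na(toy), arr.ind = TRUE)
print(na_pos)

# Number of NAs per column
na_per_col <- colSums(is.na(toy))
print(na_per_col)

# Arithmetic on missing values yields a missing value ...
total_with_na <- sum(toy$x1)
print(total_with_na)
# ... unless the NAs are removed first
total_without_na <- sum(toy$x1, na.rm = TRUE)
print(total_without_na)
```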


2) Modify Column Names:
# Create example data frame
data <- data.frame(x1 = c(1:4, 99999, 1, NA, 1, 1, NA),
                   x1 = c(1:5, 1, "NA", 1, 1, "NA"),
                   x1 = c(letters[c(1:3)], "x x", "x", " y y y", "x", "a", "a", NA),
                   x4 = "",
                   x5 = NA)
print(colnames(data))
print(ncol(data))
• The colnames() function returns or sets the names
of the columns in a data frame.
• The ncol() function returns the number of columns of the object.
• Let's assume that we want to change these column names to a consecutive range with the
prefix "col". Then, we can apply the colnames, paste0, and ncol functions as shown below.
colnames(data) <- paste0("col", 1:ncol(data)) # Modify all column names
print(data) # Print updated data frame


3) Format Missing Values:
• In the R programming language, missing values are usually represented by NA. For
that reason, it is useful to convert all missing values to this NA format.
• In our specific example data frame, we have the problem that some missing values
are represented by blank character strings and others by the character string "NA".
• If we want to assign NA values to those cells, we can use the following syntax:
data[data == ""] <- NA
data[data == "NA"] <- NA
print(data)


4) Remove Empty Rows & Columns:
• The syntax below demonstrates how to use the rowSums, is.na, and ncol functions to
remove only-NA rows.
data <- data[rowSums(is.na(data)) != ncol(data), ] # Drop empty rows
data # Print updated data frame
• Similarly, we can also exclude columns that contain only NA values.
data <- data[ , colSums(is.na(data)) != nrow(data)] # Drop empty cols
print(data) # Print updated data frame


5) Remove Rows with Missing Values:
• If you have decided to remove all rows with one or more NA values,
you can use either the na.omit() or the na.exclude() function, as shown below.
data <- na.omit(data) # Drop rows with missing vals
print(data) # Print updated data frame
• Or
data <- na.exclude(data) # Drop rows with missing vals
print(data) # Print updated data frame
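na.omit() and na.exclude() drop the same rows from a data frame; they differ only in how some modelling functions later pad residuals and predictions. When only some columns should be checked for NAs, a logical filter gives finer control. The sketch below uses a made-up data frame toy and illustrative variable names.

```r
# Hypothetical data frame with missing values in two columns
toy <- data.frame(id = 1:4,
                  score = c(10, NA, 30, NA),
                  grade = c("A", "B", NA, "D"))

# TRUE for rows that contain no NA in any column
keep <- complete.cases(toy)
print(keep)

# Same rows that na.omit(toy) would keep
cleaned <- toy[keep, ]
print(cleaned)

# Drop rows only when a chosen column (here: score) is missing
cleaned_score <- toy[!is.na(toy$score), ]
print(cleaned_score)
```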


6) Remove Duplicates:
• We can apply the unique function to our data frame to remove duplicate rows.
data <- unique(data) # delete duplicate rows
data # Print updated data frame


7) Modify Classes of Columns:
• The class of the columns of a data frame is another critical topic when it comes to
data cleaning.
• This example explains how to format each column to the most appropriate data
type automatically.
• Let's first check the current classes of our data frame columns.
>sapply(data, class) # Print classes of all columns
# col1 col2 col3
# "numeric" "character" "character"
• We can now use the type.convert() function to change the
column classes wherever it is appropriate.
data <- type.convert(data, as.is = TRUE)
sapply(data, class) # Print classes of all columns
# col1 col2 col3
# "integer" "integer" "character"


8) Detect & Remove Outliers:
• One method to detect outliers is provided by the boxplot.stats function. The
following R code demonstrates how to test for outliers in our data frame column
col1.
# Identify outliers in column
data$col1[data$col1 %in% boxplot.stats(data$col1)$out]
# [1] 99999
# This value is obviously much higher than the other values in this column.
• Let's assume that we have confirmed theoretically that the observation containing
this outlier should be removed. Then, we can apply the R code below.
# Remove rows with outliers
data <- data[! data$col1 %in% boxplot.stats(data$col1)$out, ]
print(data)
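boxplot.stats() flags values that lie more than coef (1.5 by default) times the distance between the hinges beyond the lower or upper hinge. The sketch below recomputes these fences by hand to show where the outlier comes from. The vector col1 is a made-up stand-in for the data frame column, and the names bs, hinges, lower_fence and upper_fence are illustrative.

```r
# Hypothetical values, similar to col1 in the example above
col1 <- c(1, 2, 3, 4, 99999, 1, 1, 1)

bs <- boxplot.stats(col1)        # default coef = 1.5
hinges <- bs$stats[c(2, 4)]      # lower and upper hinge
spread <- diff(hinges)           # distance between the hinges

lower_fence <- hinges[1] - 1.5 * spread
upper_fence <- hinges[2] + 1.5 * spread

# Values outside the fences are reported as outliers
manual_out <- col1[col1 < lower_fence | col1 > upper_fence]
print(manual_out)
print(bs$out)
```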


9) Remove Spaces in Character Strings:
• The manipulation of character strings is another important aspect of the data
cleaning process.
• This example demonstrates how to remove blank spaces from the character strings of a
certain variable.
• For this task, we can use the gsub function as demonstrated below:
Syntax: gsub(pattern, replacement, x, ignore.case = FALSE)

# Delete white space in character strings
data$col3 <- gsub(" ", "", data$col3)
print(data)
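gsub(" ", "", x) removes every space, including those between words. If only the leading and trailing spaces should go, base R's trimws() is an alternative. The sketch below contrasts the two on made-up strings; the variable names are illustrative.

```r
# Hypothetical character values with inner, leading and trailing spaces
col3 <- c("a", "x x", " y y y", "x ")

no_spaces <- gsub(" ", "", col3)   # removes every space
print(no_spaces)

trimmed <- trimws(col3)            # removes leading/trailing whitespace only
print(trimmed)

# Trim the ends, then collapse runs of whitespace to a single space
collapsed <- gsub("\\s+", " ", trimws("  y   y  y "))
print(collapsed)
```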
Practice Exercise:
Create a data frame "sample" with the columns
A11 = c(10:14, 11111, 1, NA, 1, NA),
A12 = c(1:5, 1, "NA", 1, 1, "NA"),
A12 = c(letters[c(1:3)], "a a", "a", " b b b", "a", "x", "x", NA),
A11 = "",
A1 = NA
Write code to implement the following:
1) Modify Column Names
2) Format Missing Values
3) Remove Empty Rows & Columns
4) Remove Rows with Missing Values
5) Remove Duplicates
6) Modify Classes of Columns
7) Detect & Remove Outliers
8) Remove Spaces in Character Strings
Data Recoding
• Recoding allows you to create new variables and to replace existing values of a
variable based on a criterion.
• Example: Let us consider a data frame
df <- data.frame(player = c('P1', 'P2', 'P3', 'P4'),
                 points = c(124, 229, 313, 415),
                 result = c('Win', 'Loss', 'Win', 'Loss'))
print(df)
• Output:
  player points result
1     P1    124    Win
2     P2    229   Loss
3     P3    313    Win
4     P4    415   Loss


• The easiest way to recode is to use revalue() or mapvalues() from the plyr package.
• Example:
>df <- data.frame(player = c('P1', 'P2', 'P3', 'P4'),
                  points = c(124, 229, 313, 415),
                  result = c('Win', 'Loss', 'Win', 'Loss'))
>print(df)
>df$scode <- plyr::revalue(df$result, c("Win"="1", "Loss"="2")) #Creating new variable
>print(df)
• mapvalues() can be used in the same way, here to modify the existing variable:
>df$result <- plyr::mapvalues(df$result, from = c("Win","Loss"), to = c("1", "0"))
>print(df) #Modify the existing variable


• It is also possible to recode using ifelse.
• Example:
>df <- data.frame(player = c('P1', 'P2', 'P3', 'P4'),
                  points = c(124, 229, 313, 415),
                  result = c('Win', 'Loss', 'Win', 'Loss'))
>print(df)
>df$scode <- ifelse(df$result=="Win",1,2)
>print(df) #Creating new variable
>df$result <- ifelse(df$result=="Win",1,0)
>print(df) #Modify the existing variable
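When there are more than two categories, nested ifelse() calls become hard to read. One base-R alternative, sketched below, is a named lookup vector indexed by the old values. The name codes is illustrative, and this is only one possible approach rather than the method used on the slides.

```r
df <- data.frame(player = c("P1", "P2", "P3", "P4"),
                 points = c(124, 229, 313, 415),
                 result = c("Win", "Loss", "Win", "Loss"))

# Named lookup vector: names are the old values, elements are the new codes
codes <- c(Win = 1, Loss = 2)

# Index the lookup vector by the old values (as character, in case
# result is stored as a factor) and drop the names from the result
df$scode <- unname(codes[as.character(df$result)])
print(df)
```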


• Recoding is also known as replacing. When missing values are replaced
with substitute values, the process is called imputation.
Practice Exercise:
Consider a numeric vector x <- c(3,4,5,6,7,8)
• Write a command to recode the values less than 6 with zero in the vector x
• Write a command to recode the values between 4 and 8 with 100
• Write a command to recode the values that are less than 5 or greater than 6 with 50
• Write a command to recode the values less than 6 with NA in the vector x
• Write a command to recode the values between 4 and 8 with NA
• Write a command to recode the values that are less than 5 or greater than 6 with NA
• Count number of NA values after each operation
• Find mean of x (Hint: exclude NA values)
• Find median of x (Hint: exclude NA values)
• Write a command to recode the values less than 6 with "NA" (enclose NA in double quotes) in the vector x
• Write a command to recode the values between 4 and 8 with "NA"
• Write a command to recode the values that are less than 5 or greater than 6 with "NA"
• Count number of NA values after each operation
• Find mean of x (Hint: exclude NA values)
• Find median of x (Hint: exclude NA values)
• What is the difference between NA and "NA"?
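The difference between NA and "NA" can be illustrated on the first recoding task only. This is a sketch, not a full solution to the exercise, and the names x_na and x_chr are illustrative. NA is R's missing-value marker, while "NA" is an ordinary two-character string.

```r
x <- c(3, 4, 5, 6, 7, 8)

# Recode with the missing-value marker NA: the vector stays numeric
x_na <- x
x_na[x_na < 6] <- NA
n_missing <- sum(is.na(x_na))          # number of NA values
avg <- mean(x_na, na.rm = TRUE)        # NAs excluded from the mean
med <- median(x_na, na.rm = TRUE)      # NAs excluded from the median
print(c(n_missing, avg, med))

# Recode with the string "NA": the whole vector is coerced to character
x_chr <- x
x_chr[x_chr < 6] <- "NA"
print(class(x_chr))
n_missing_chr <- sum(is.na(x_chr))     # "NA" strings are not missing values
print(n_missing_chr)
```

Because x_chr is a character vector, mean() and median() cannot be applied to it meaningfully until the "NA" strings have been converted to real NA values and the vector has been converted back to numeric.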
Data Merging
• Merging data is a common task in data analysis, especially when
working with large datasets.
• The merge() function in R is a powerful tool that combines two
datasets at a time based on shared variables (that is, two
datasets that share at least one common column).
• In R there are various ways to merge data frames, including the
merge() function from base R and the join functions of the dplyr package.


Types of joins:
• There are mainly three types of joins:
1. Inner Join (or simply Join)
2. Outer Join
➢Full Outer Join or Full Join
➢Left Outer Join or Left Join
➢Right Outer Join or Right Join
3. Cross Join

• CROSS JOIN: A cross join, also known as a Cartesian join, results in
every row of one data frame being joined to every row of the
other data frame.

[Figure: diagrams of INNER JOIN, OUTER JOIN, LEFT JOIN and RIGHT JOIN]
• Using merge() from base R:
• The merge() function in base R helps us to combine two data
frames based on common columns.
• It performs various types of joins such as inner join, left join, right join, and
full join.
• Syntax:
merged_df <- merge(x, y, by = "common_column", ...)
• 'x' and 'y' are the data frames that you want to merge.
• 'by' specifies the common columns on which the merge will be performed.
• Additional arguments like 'all.x', 'all.y' and 'all' control the type of join that is
to be performed.
• Example:
• Consider two data frames 'df1' and 'df2'
df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(25, 30, 35, 40))

df2 <- data.frame(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(5000, 4000, 6000, 7000))

1. Inner join (default behavior):
ij <- merge(x = df1, y = df2, by = "ID")
print(ij)

The resulting data frame 'ij' includes only the rows whose 'ID' is present
in both 'df1' and 'df2'.
2. Left join ('all.x = TRUE'):
lj <- merge(x = df1, y = df2, by = "ID", all.x = TRUE)
print(lj)

The resulting data frame 'lj' includes all rows from 'df1' and the matching
rows from 'df2'. Rows of 'df1' without a match in 'df2' get 'NA' in the
columns that come from 'df2'.

3. Right join ('all.y = TRUE'):
rj <- merge(df1, df2, by = "ID", all.y = TRUE)
print(rj)

The resulting data frame 'rj' includes all rows from 'df2' and the matching
rows from 'df1'. Rows of 'df2' without a match in 'df1' get 'NA' in the
columns that come from 'df1'.

4. Full outer join ('all = TRUE'):
foj <- merge(df1, df2, by = "ID", all = TRUE)
print(foj)

The resulting data frame 'foj' includes all rows from both 'df1' and 'df2'.
Columns without a matching value are filled with 'NA'.

5. Cross join ('by = NULL'):
cj <- merge(df1, df2, by = NULL)
print(cj)

A cross join, also known as a Cartesian join, results in every row of one
data frame being joined to every row of the other data frame.
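A quick way to check the effect of each join type is to compare the number of rows each one returns. The sketch below reuses df1 and df2 from the example; the list joins and the vector row_counts are illustrative names.

```r
df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(25, 30, 35, 40))
df2 <- data.frame(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(5000, 4000, 6000, 7000))

# One merged data frame per join type
joins <- list(
  inner = merge(df1, df2, by = "ID"),
  left  = merge(df1, df2, by = "ID", all.x = TRUE),
  right = merge(df1, df2, by = "ID", all.y = TRUE),
  full  = merge(df1, df2, by = "ID", all = TRUE),
  cross = merge(df1, df2, by = NULL)
)

# Number of rows produced by each join
row_counts <- sapply(joins, nrow)
print(row_counts)
```

IDs 2, 3 and 4 occur in both data frames, so the inner join has 3 rows, the left and right joins have 4 rows each, the full join has 5 rows, and the cross join has 4 x 4 = 16 rows.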
• Using the 'dplyr' package:
• dplyr provides a separate function for each type of join: inner_join(), left_join(),
right_join() and full_join(). (A single join() function with a 'type' argument
exists in the older plyr package, not in dplyr.)
• Syntax :
merged_df <- dplyr::inner_join(x, y, by = "common_column")
• 'x' and 'y' are the data frames to be merged.
• 'by' specifies the common columns on which the merge is to be performed.
• Replace inner_join with left_join, right_join or full_join to choose the type of join.
• Example:
• Consider two data frames 'df1' and 'df2'
df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(25, 30, 35, 40))
df2 <- data.frame(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(5000, 4000, 6000, 7000))


• Inner join:
inner_join <- dplyr::inner_join(df1, df2, by = "ID")
print(inner_join)

• Left join:
left_join <- dplyr::left_join(df1, df2, by = "ID")
print(left_join)

• Right join:
right_join <- dplyr::right_join(df1, df2, by = "ID")
print(right_join)

• Full outer join:
full_join <- dplyr::full_join(df1, df2, by = "ID")
print(full_join)
