Data Manipulation Using R

Module 3 of the CSE1006 course covers data manipulation techniques in R, focusing on data sorting, identifying and removing duplicates, and cleaning data. Key functions discussed include sort() and order() for sorting vectors and data frames, as well as duplicated() and unique() for handling duplicate data. The module also emphasizes the importance of data deduplication to maintain data integrity and reduce redundancy.

Course code : CSE1006

Course title : Foundations of Data Analytics

Module-3
Data manipulation

22-02-2025 Dr. V. Srilakshmi 1

Module-3
• Data manipulation:
➢Data sorting
➢Finding and removing duplicate records
➢Cleaning data
➢Recoding data
➢Merging data

Data Sorting
• R provides several ways to sort data in either ascending or
descending order.
• Data analysts and data scientists use the order() and sort()
functions, depending on the structure of the data.
• sort() returns the sorted values of a vector, whereas order()
returns the index positions that would sort it. These indices can
be used to sort vectors, matrices, and data frames in ascending
or descending order.

• Syntax of sort():
sort(x, decreasing = FALSE, na.last = NA)
where
• x: the vector to be sorted
• decreasing: logical value; TRUE sorts in descending order (default FALSE)
• na.last: controls missing values; TRUE puts NAs at the end, FALSE puts
them at the beginning, and NA (the default) removes them

• Example 1:
> # Creating a vector
> x <- c(7, 4, 3, 9, 1.2, -4, -5, -8, 6, NA)
> # Calling sort() function (the NA is dropped by default)
> sort(x)

OUTPUT:
[1] -8.0 -5.0 -4.0  1.2  3.0  4.0  6.0  7.0  9.0

• Example 2:
> # Creating a vector
> x <- c(7, 4, 3, 9, 1.2, -4, -5, -8, 6, NA)
> # Calling sort() function to print in decreasing order
> sort(x, decreasing = TRUE)
[1]  9.0  7.0  6.0  4.0  3.0  1.2 -4.0 -5.0 -8.0
> # Calling sort() function to print NA at the end
> sort(x, na.last = TRUE)
[1] -8.0 -5.0 -4.0  1.2  3.0  4.0  6.0  7.0  9.0   NA

• Example 3:
> # Create a character vector
> names <- c("Vipul", "Raj", "Singh", "Jhala")
> # Sort the vector in alphabetical order
> sorted_names <- sort(names)
> print(sorted_names)
[1] "Jhala" "Raj"   "Singh" "Vipul"

• Syntax of order():
> order(..., na.last = TRUE, decreasing = FALSE,
        method = c("auto", "shell", "radix"))
• The first argument is the vector to be ordered. Additional vectors of the
same length may follow; they are used to break ties. To sort a data frame,
pass one or more of its columns.
• na.last is a logical value that determines whether missing values are
placed at the end (TRUE, the default) or at the beginning (FALSE) of the
ordering. If na.last = NA, missing values are removed.
• decreasing is a logical value that determines whether the ordering is
decreasing (TRUE) or increasing (FALSE, the default).
• method is an optional argument that specifies the sorting algorithm.
• The available options for order() are "auto" (default), "shell", and
"radix"; the "quick" method is available only in sort().
• The "auto" option selects an algorithm based on the size and type of the
input data.
• Unlike sort(), order() does not return the sorted values. It returns the
index positions that put the input in sorted order.

• Example:
> y <- c(4, 12, 6, 7, 2, 9, 5)
> order(y)
[1] 5 1 7 3 4 6 2
• The output indicates that the smallest value is at index 5, the next
smallest is at index 1, and so on.
• Note that the original vector y is not modified by this function.

• Example:
> y <- c(4, 12, 6, 7, 2, 9, 5)
> y[order(y)]
• The order() function returns the indices of the sorted values, so when
we use these indices to subset y with square brackets, we get the sorted
values of y.
• Therefore, the output of this code will be:
[1]  2  4  5  6  7  9 12

• Example:
> x <- c(8, 2, 4, 1, -4, NA, 46, 8, 9, 5, 3)
> order(x, na.last = TRUE)
• The na.last = TRUE argument specifies that missing values (NA) are
placed at the end of the ordering.
• So, the output of this code would be:
5 4 2 11 3 10 1 8 9 7 6

• Example:
> x <- c(8, 2, 4, 1, -4, NA, 46, 8, 9, 5, 3)
> order(x, decreasing = TRUE, na.last = TRUE)
• The decreasing = TRUE argument specifies that the values are ordered from
largest to smallest instead of smallest to largest.
• The na.last = TRUE argument specifies that missing values are placed at
the end.
• Overall, this code returns the indices that would sort x in descending
order; x itself is not changed.
• So, the output of this code would be: 7 9 1 8 10 3 11 2 4 5 6

Sorting a data frame by using order()
• The function order() is used to sort a data frame based on a particular
column.
• Syntax:
dataframe[order(dataframe$column_name, decreasing = TRUE), ]
• where
• dataframe is the input data frame
• column_name is the column on which the data frame is sorted
• decreasing specifies the sorting order: if TRUE, the data frame is sorted
in descending order; otherwise, in increasing order
• order() returns the index positions of the elements in sorted order; using
them as row indices returns the sorted data frame.
• Example 1: R program to create a data frame with 2 columns and display it
sorted in decreasing order, first by subjects and then by roll number.
# create data frame with roll no and subjects columns
data = data.frame(rollno = c(1, 5, 4, 2, 3),
                  subjects = c("java", "python", "php", "sql", "c"))
print(data)
print("sort the data in decreasing order based on subjects")
print(data[order(data$subjects, decreasing = TRUE), ])
print("sort the data in decreasing order based on rollno")
print(data[order(data$rollno, decreasing = TRUE), ])

• Example 2: R program to create a data frame with 3 columns named
rollno, names, and subjects, and display it sorted in increasing order
by subjects, then by roll number, and then by names.
# create data frame with roll no, names and subjects columns
data = data.frame(rollno = c(1, 5, 4, 2, 3),
                  names = c("sravan", "bobby", "pinkey", "rohith", "ganesh"),
                  subjects = c("java", "python", "php", "sql", "c"))

print(data)
print("sort the data in increasing order based on subjects")
print(data[order(data$subjects, decreasing = FALSE), ])
print("sort the data in increasing order based on rollno")
print(data[order(data$rollno, decreasing = FALSE), ])
print("sort the data in increasing order based on names")
print(data[order(data$names, decreasing = FALSE), ])
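The examples above sort by a single column. As a minimal sketch of sorting by more than one column, order() can be given several columns, where later columns break ties in earlier ones. The data frame scores and its column names are hypothetical and used only for illustration.

```r
# Hypothetical data frame, used only for illustration
scores <- data.frame(team   = c("b", "a", "b", "a", "c"),
                     points = c(10, 30, 20, 30, 10),
                     fouls  = c(3, 1, 2, 4, 5))

# Sort by points (increasing); ties are broken by fouls (increasing)
asc <- scores[order(scores$points, scores$fouls), ]

# Sort by points (increasing) and fouls (decreasing);
# negating a numeric column reverses its direction
mixed <- scores[order(scores$points, -scores$fouls), ]

print(asc)
print(mixed)
```

Negation works only for numeric columns. For other column types, one option is to pass a logical vector to decreasing together with method = "radix", for example order(a, b, decreasing = c(FALSE, TRUE), method = "radix").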
• Example 3: R program to create a vector with 10 elements (positive,
negative, and NA). Display the sorted vector in
a) Increasing order using sort()
b) Decreasing order using order()
c) Also display the element indexes in sorted order

• To list all predefined data sets in R
> data()

• Open the mtcars dataset and perform the following
> mtcars

• To print the structure of the dataset
> str(mtcars)

Exercise:
• Sort the observations of the dataset "mtcars" in increasing order based on the
values in the column "mpg"
• Sort the observations of the dataset "mtcars" in decreasing order based on the
values in the column "cyl"
• Sort the observations of the dataset "mtcars" in increasing order based on the
values in both columns "mpg" and "cyl"
• Sort the observations of the dataset "mtcars" in decreasing order based on the
values in both columns "mpg" and "cyl"
• Sort the observations of the dataset "mtcars" by column "mpg" in increasing order
and column "cyl" in decreasing order

Identify and Remove Duplicate Data in R
• Storage administrators struggle to handle spiraling volumes of
documents, audio, video, images, and large email attachments.
• Adding storage is not always the best solution.
• Many companies are turning to data reduction technologies such as
data deduplication.
• Duplicate data are entries that have been added by a system user multiple
times, for example by re-registering after forgetting login details.
• Duplication is one of the problems that cause inconsistency in databases.
• Data redundancy means that the same data is stored in multiple locations
or tables.
• Data redundancy is costly to address, as it requires
1. Additional storage
2. Synchronization between databases
3. Design work to align the information represented by different
presentations of the same data

Problems with Data Redundancy & Duplication
• Storing the same information several times
1. Wastes storage space
2. Makes database updates more difficult
3. Creates the possibility of inconsistent data

Data Deduplication:
• Data deduplication, also called intelligent compression or
single-instance storage, is a technique that eliminates redundant
copies of data so that only one instance is stored.

Identify and Remove Duplicate Data
• A dataset can contain duplicate values. To keep it redundancy-free
and accurate, duplicate rows need to be identified and removed.
• First we check whether duplicate data is present; if it is, we
remove it.

Identifying Duplicate Data in a vector: duplicated()
• We can use the duplicated() function to find which values in a vector are
duplicates, and sum() to count them.
• Syntax: duplicated(vector_name, fromLast = FALSE)
• duplicated() returns a logical vector in which TRUE marks the elements of a
vector (or rows of a data frame) that are duplicates of an earlier element.
• fromLast is FALSE by default; if TRUE, duplicates are searched from the
end, so the last occurrence is treated as the original.
• Example 1:
# Create a sample vector with duplicate elements
vector1 <- c(1, 2, 3, 4, 4, 5)
# Identify duplicate elements
duplicated(vector1)
OUTPUT: [1] FALSE FALSE FALSE FALSE TRUE FALSE
# Count of duplicated data
sum(duplicated(vector1)) # 1
• Example 2:
# Create a sample vector with duplicate elements
vector1 <- c(1, 2, 3, 4, 4, 5)
# Identify duplicate elements, searching from the end
duplicated(vector1, fromLast = TRUE)
OUTPUT: [1] FALSE FALSE FALSE TRUE FALSE FALSE
# Count of duplicated data
sum(duplicated(vector1)) # 1
# Identify the location (index) of the duplicate element
which(duplicated(vector1)) # 5
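Each call to duplicated() flags only one side of a repeated value. As a small sketch, the two search directions can be combined to flag every copy of a repeated value; the variable names below are illustrative.

```r
v <- c(1, 2, 3, 4, 4, 5)                      # same values as the examples above

first_pass <- duplicated(v)                   # flags the later copy (position 5)
last_pass  <- duplicated(v, fromLast = TRUE)  # flags the earlier copy (position 4)

# TRUE for every element whose value occurs more than once
all_copies <- first_pass | last_pass

print(which(all_copies))   # positions of all copies
print(v[!all_copies])      # values that occur exactly once
```

This is useful when both copies need to be inspected before deciding which one to keep.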


Removing Duplicate Data in a vector: unique()
• We can remove duplicate data from a vector by using the unique()
function, which returns only the distinct values.
• Syntax: unique(vector_name)
• Example:
# Create a sample vector with duplicate elements
vector1 <- c(1, 2, 3, 4, 4, 5)
# Remove duplicate elements
unique(vector1) # 1 2 3 4 5

Identify and Remove Duplicate Data in R
• Identifying Duplicate Data in a data frame:
• For identification, we will use the duplicated() function.
• Syntax: duplicated(dataframe)
• Approach:
• Create a data frame
• Pass it to the duplicated() function
• This function returns a logical vector with TRUE for each row that
duplicates an earlier row
• Apply the sum() function to get the number of duplicates.

• Removing Duplicate Data in a data frame:
• We use the unique() function from base R or the distinct() function
from the dplyr package.
• Approach:
• Create a data frame
• Select the rows that are unique
• Retrieve those rows
• Display the result

• Example: identifying duplicate data in a data frame
# Creating a sample data frame of students and their marks in respective subjects.
student = data.frame(name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
                     maths = c(7, 8, 8, 9, 10, 8, 9),
                     science = c(5, 7, 6, 8, 9, 7, 8),
                     history = c(7, 7, 7, 7, 7, 7, 7))

# Printing data
print(student)
# Logical vector marking the duplicated rows
print(duplicated(student))
# Number of duplicated rows
print(sum(duplicated(student)))
• Example: removing duplicates based on a single column
# create data frame
data = data.frame(names = c("manoj", "bobby", "sravan", "deepu", "manoj", "bobby"),
                  id = c(1, 2, 3, 1, 1, 2),
                  subject = c("java", "python", "php", "html", "java", "python"))
print(data)
# keep only the first row for each value of the subject column
print(data[!duplicated(data$subject), ])
# keep only the first row for each value of the names column
print(data[!duplicated(data$names), ])
# keep only the first row for each value of the id column
print(data[!duplicated(data$id), ])
Identify and Remove Duplicate Data in R
• Removing Duplicate Data in a data frame:
• Method 1: Using unique()
• We use unique() to get the rows having unique values in our data.
• Syntax: unique(dataframe)
• Example:
# Creating a sample data frame of students and their marks in respective subjects.
student = data.frame(name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
                     maths = c(7, 8, 8, 9, 10, 8, 9),
                     science = c(5, 7, 6, 8, 9, 7, 8),
                     history = c(7, 7, 7, 7, 7, 7, 7))
# Printing data
print(student)
# Printing data without duplicates using unique
print(unique(student))
• Install the dplyr package before using the distinct() function:
install.packages("dplyr")
Identify and Remove Duplicate Data in R
• Method 2: Using distinct()
• This function is available in the dplyr package and returns the
unique rows of a data frame.
• It can remove rows that are duplicated across all columns, or rows
that are duplicated in particular columns.
• Syntax: distinct(dataframe, ..., .keep_all = FALSE)
• where dataframe is the data in use, ... are optional columns used to
determine uniqueness, and .keep_all decides whether all other columns are
kept (TRUE) or only the listed columns (FALSE)

• Example:
# Creating a sample data frame of students and their marks in respective subjects.
student = data.frame(name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
                     maths = c(7, 8, 8, 9, 10, 8, 9),
                     science = c(5, 7, 6, 8, 9, 7, 8),
                     history = c(7, 7, 7, 7, 7, 7, 7))

# Printing data
print(student)

# Printing data without duplicates using distinct
print(dplyr::distinct(student))
• Example:
# load the package
library(dplyr)
# create data frame
data = data.frame(names = c("manoj", "bobby", "sravan", "deepu", "manoj", "bobby"),
                  id = c(1, 2, 3, 4, 1, 2),
                  subjects = c("java", "python", "php", "html", "java", "python"))
# remove all duplicate rows
print(dplyr::distinct(data))
# keep only the distinct values of the subjects column
print(dplyr::distinct(data, subjects))
# keep only the distinct values of the names column
print(dplyr::distinct(data, names))
• Example:
# Creating a sample data frame of students and their marks in respective subjects.
student = data.frame(name = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
                     maths = c(7, 8, 8, 9, 10, 8, 9),
                     science = c(5, 7, 6, 8, 9, 7, 8),
                     history = c(7, 7, 7, 7, 7, 7, 7))

# Printing data
print(student)
# Keep the first row for each distinct value of maths, retaining all columns
print(dplyr::distinct(student, maths, .keep_all = TRUE))

Data Cleaning

• Data cleaning in R is the process of transforming raw data into
consistent data that can be easily analysed.
• It affects both the content of statistical statements based on the
data and their reliability.
• It also improves data quality and overall productivity.

• Purpose of Data Cleaning:
• The following are the purposes of data cleaning in R
• Eliminate errors
• Eliminate redundancy
• Increase data reliability
• Deliver accuracy
• Ensure consistency
• Assure completeness
• Standardize your approach

Data Cleaning
• Overview of a typical data analysis chain:


Data Cleaning
• For better understanding, let us work through one example:
1) Creation of Example Data (Data Frame)
2) Modify Column Names
3) Format Missing Values
4) Remove Empty Rows & Columns
5) Remove Rows with Missing Values
6) Remove Duplicates
7) Modify Classes of Columns
8) Detect & Remove Outliers
9) Remove Spaces in Character Strings
Data Cleaning
1) Creation of Example Data:
# Create example data frame
# (data.frame() makes the repeated name x1 unique: x1, x1.1, x1.2)
data <- data.frame(x1 = c(1:4, 99999, 1, NA, 1, 1, NA),
                   x1 = c(1:5, 1, "NA", 1, 1, "NA"),
                   x1 = c(letters[c(1:3)], "x x", "x", " y y y", "x", "a", "a", NA),
                   x4 = "",
                   x5 = NA)
print(data) # Printing data frame

The is.na() function finds the NAs in a given data set.
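A minimal sketch of is.na() applied to a whole data frame follows. The data frame df is hypothetical and deliberately smaller than the example data used in this module.

```r
# Hypothetical data frame with a few missing values
df <- data.frame(a = c(1, NA, 3),
                 b = c("x", "y", NA),
                 c = c(NA, NA, 9))

na_mask <- is.na(df)                 # logical matrix, TRUE where a value is NA
print(na_mask)

na_per_column <- colSums(na_mask)    # number of NAs in each column
na_per_row    <- rowSums(na_mask)    # number of NAs in each row
has_na        <- anyNA(df)           # TRUE if at least one NA is present

print(na_per_column)
print(na_per_row)
print(has_na)
```

Counting with colSums() and rowSums() works because TRUE is treated as 1 and FALSE as 0 in arithmetic.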


• Identify NAs in a specific data frame column
> is.na(data$x1)

• Print NAs in row 3 of the given data frame
> is.na(data[3, ])

• Identify the location of NAs in a data frame
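One way to locate missing values, sketched here on a hypothetical data frame df, is to combine which() with arr.ind = TRUE, which reports the row and column position of each NA.

```r
# Hypothetical data frame, used only for illustration
df <- data.frame(a = c(1, NA, 3),
                 b = c("x", "y", NA))

# One line per NA, giving its row and column index
na_pos <- which(is.na(df), arr.ind = TRUE)
print(na_pos)
```

Here the NAs sit in row 2 of column 1 and row 3 of column 2. Without arr.ind = TRUE, which() would return single positions counted column by column.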


Arithmetic functions on missing values yield
missing values.
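A short sketch of this behaviour, using a hypothetical vector x: any arithmetic involving NA returns NA unless the missing values are excluded, which many summary functions allow through the na.rm argument.

```r
x <- c(10, NA, 30)          # hypothetical vector with one missing value

shifted   <- x + 1          # the NA element stays NA
total     <- sum(x)         # NA, because one input is unknown
average   <- mean(x)        # NA for the same reason

total_ok   <- sum(x, na.rm = TRUE)    # NA is dropped before summing
average_ok <- mean(x, na.rm = TRUE)   # mean of the remaining values

print(shifted)
print(c(total, average, total_ok, average_ok))
```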


Data Cleaning
2) Modify Column Names:
print(colnames(data)) # Current column names of the example data frame
print(ncol(data)) # Number of columns
• The colnames() function returns or sets the names of the columns in a
data frame.
• The ncol() function returns the number of columns of an object.
Data Cleaning
2) Modify Column Names:
• Let’s assume that we want to change these column names to a consecutive range with the
prefix “col”. Then, we can apply the colnames, paste0, and ncol functions as shown below.
colnames(data) <- paste0("col", 1:ncol(data)) # Modify all column names
print(data) # Print updated data frame


Data Cleaning
3) Format Missing Values:
• In the R programming language, missing values are usually represented by NA. For
that reason, it is useful to convert all missing values to this NA format.
• In our specific example data frame, we have the problem that some missing values
are represented by blank character strings.
• If we want to assign NA values to those blank cells, we can use the following syntax
data[data == ""] <- NA

data[data == "NA"] <- NA

print(data)


Data Cleaning
4) Remove Empty Rows & Columns:
• The syntax below demonstrates how to use the rowSums, is.na, and ncol functions to
remove only-NA rows.
data <- data[rowSums(is.na(data)) != ncol(data), ] # Drop empty rows
data # Print updated data frame


Data Cleaning
4) Remove Empty Rows & Columns:
• Similar to that, we can also exclude columns that contain only NA values.
data <- data[ , colSums(is.na(data)) != nrow(data)] # Drop empty cols
print(data) # Print updated data frame


Data Cleaning
5) Remove Rows with Missing Values:
• However, in case you have decided to remove all rows with one or more NA values,
you may use the na.omit() or na.exclude() function as shown below.
data <- na.omit(data) # Drop rows with missing vals
print(data) # Print updated data frame
• Or
data <- na.exclude(data) # Drop rows with missing vals
print(data) # Print updated data frame


Data Cleaning
6) Remove Duplicates:
• we can apply the unique function to our data frame to remove duplicates.
data <- unique(data) # delete duplicate rows
data # Print updated data frame


Data Cleaning
7) Modify Classes of Columns:
• The class of the columns of a data frame is another critical topic when it comes to
data cleaning.
• This example explains how to format each column to the most appropriate data
type automatically.
• Let's first check the current classes of our data frame columns.
> sapply(data, class) # Print classes of all columns
#        col1        col2        col3
#   "numeric" "character" "character"

• We can now use the type.convert() function to change the
column classes wherever it is appropriate.
data <- type.convert(data, as.is = TRUE)
sapply(data, class) # Print classes of all columns
#        col1        col2        col3
#   "integer"   "integer" "character"

Data Cleaning
8) Detect & Remove Outliers:
• One method to detect outliers is provided by the boxplot.stats() function. The
following R code demonstrates how to test for outliers in our data frame column
col1.
# Identify outliers in column
data$col1[data$col1 %in% boxplot.stats(data$col1)$out]
[1] 99999 # This value is obviously much higher than the other values in this column.
• Let's assume that we have confirmed on theoretical grounds that the observation
containing this outlier should be removed. Then, we can apply the R code below.
# Remove rows with outliers
data <- data[!data$col1 %in% boxplot.stats(data$col1)$out, ]
print(data)
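To show what boxplot.stats() flags, the sketch below recomputes its default rule by hand on a hypothetical vector x: a value counts as an outlier when it lies more than 1.5 times the hinge spread below the lower hinge or above the upper hinge. The variable names are illustrative.

```r
x <- c(1, 2, 3, 4, 99999, 1, 1)   # hypothetical values, one of them extreme

h   <- fivenum(x)                 # min, lower hinge, median, upper hinge, max
iqr <- h[4] - h[2]                # spread between the two hinges

lower_fence <- h[2] - 1.5 * iqr
upper_fence <- h[4] + 1.5 * iqr

manual_out <- x[x < lower_fence | x > upper_fence]
auto_out   <- boxplot.stats(x)$out   # default coef = 1.5

print(manual_out)
print(auto_out)
```

Both approaches flag the same value here. The multiplier can be changed through the coef argument of boxplot.stats() when a stricter or looser rule is wanted.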


Data Cleaning
9) Remove Spaces in Character Strings:
• The manipulation of character strings is another important aspect of the data
cleaning process.
• This example demonstrates how to remove blank spaces from the character strings
of a certain variable.
• For this task, we can use the gsub() function as demonstrated below:
Syntax: gsub(pattern, replacement, x, ignore.case = FALSE)

# Delete white space in character strings
data$col3 <- gsub(" ", "", data$col3)
print(data)
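gsub(" ", "", x) removes every space, including those inside a value. Where only leading and trailing spaces should go, base R's trimws() is an alternative. A small sketch on a hypothetical character vector s:

```r
s <- c("x x", " y y y", "a ")     # hypothetical strings with stray spaces

no_spaces <- gsub(" ", "", s)     # removes all spaces
trimmed   <- trimws(s)            # removes only leading and trailing spaces

print(no_spaces)
print(trimmed)
```

Which of the two is appropriate depends on whether the inner spaces carry meaning, as in names made of several words.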
Practice Exercise:
Create a data frame "sample" with the columns
A11 = c(10:13, 11111, 1, NA, 1, NA, NA),
A12 = c(1:5, 1, "NA", 1, 1, "NA"),
A12 = c(letters[c(1:3)], "a a", "a", " b b b", "a", "x", "x", NA),
A11 = "",
A1 = NA
Write code to implement the following
1) Modify Column Names
2) Format Missing Values
3) Remove Empty Rows & Columns
4) Remove Rows with Missing Values
5) Remove Duplicates
6) Modify Classes of Columns
7) Detect & Remove Outliers
8) Remove Spaces in Character Strings
Data Recoding
• Recoding allows you to create new variables and to replace existing values of a
variable based on a criterion.
• Example: Let us consider a data frame
df <- data.frame(player = c('P1', 'P2', 'P3', 'P4'),
                 points = c(124, 229, 313, 415),
                 result = c('Win', 'Loss', 'Win', 'Loss'))
print(df)
• Output:
  player points result
1     P1    124    Win
2     P2    229   Loss
3     P3    313    Win
4     P4    415   Loss

Data Recoding
• To recode, an easy way is to use the revalue() or mapvalues() functions, both of
which are defined in the plyr package.
• Example:
> df <- data.frame(player = c('P1', 'P2', 'P3', 'P4'),
                   points = c(124, 229, 313, 415),
                   result = c('Win', 'Loss', 'Win', 'Loss'))
> print(df)
> # Creating a new variable
> df$scode <- plyr::revalue(df$result, c("Win" = "1", "Loss" = "2"))
> print(df)
> # Modifying the existing variable
> df$result <- plyr::mapvalues(df$result, from = c("Win", "Loss"), to = c("1", "0"))
> print(df)

Data Recoding
• It is also possible to recode using ifelse().
• Example:
> df <- data.frame(player = c('P1', 'P2', 'P3', 'P4'),
                   points = c(124, 229, 313, 415),
                   result = c('Win', 'Loss', 'Win', 'Loss'))
> print(df)
> # Creating a new variable
> df$scode <- ifelse(df$result == "Win", 1, 2)
> print(df)
> # Modifying the existing variable
> df$result <- ifelse(df$result == "Win", 1, 0)
> print(df)

Data Recoding
• Recoding is also referred to as replacing values; when the values being
replaced are missing values, it is called imputation.
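Besides revalue(), mapvalues(), and ifelse(), values of a numeric vector can be recoded with logical indexing in base R. The sketch below uses a hypothetical vector marks and an assumed pass mark of 50, chosen only for illustration.

```r
marks <- c(35, 72, 58, 90, 41)      # hypothetical marks

# Replace existing values: everything below 50 becomes 0
recoded <- marks
recoded[marks < 50] <- 0

# Create a new variable from a criterion
grade <- ifelse(marks >= 50, "Pass", "Fail")

# Recode to NA, then count and skip the missing values
with_na <- marks
with_na[marks < 50] <- NA
na_count  <- sum(is.na(with_na))
mean_rest <- mean(with_na, na.rm = TRUE)

print(recoded)
print(grade)
print(na_count)
print(mean_rest)
```

Working on a copy (recoded, with_na) keeps the original vector available for comparison.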

Practice Exercise:
Consider a numeric vector x <- c(3,4,5,6,7,8)
• Write a command to recode the values less than 6 with zero in the vector x
• Write a command to recode the values between 4 and 8 with 100
• Write a command to recode the values that are less than 5 or greater than 6 with 50
• Write a command to recode the values less than 6 with NA in the vector x
• Write a command to recode the values between 4 and 8 with NA
• Write a command to recode the values that are less than 5 or greater than 6 with NA
• Count the number of NA values after each operation
• Find the mean of x (Hint: exclude NA values)
• Find the median of x (Hint: exclude NA values)
• Write a command to recode the values less than 6 with "NA" (enclose NA in double quotes) in the vector x
• Write a command to recode the values between 4 and 8 with "NA"
• Write a command to recode the values that are less than 5 or greater than 6 with "NA"
• Count the number of NA values after each operation
• Find the mean of x (Hint: exclude NA values)
• Find the median of x (Hint: exclude NA values)
• What is the difference between NA and "NA"?
Data Merging
• Merging data is a common task in data analysis, especially when
working with large datasets.
• The merge() function in R is a powerful tool that combines two
datasets based on shared variables, that is, two datasets that share
at least one common column. More than two datasets can be combined by
merging repeatedly.
• In R there are various ways to merge data frames: the merge()
function from base R and the join functions of the dplyr package.

Types of joins:
• Joins are of mainly three types
1. Inner Join or Join
2. Outer Join or Full Join
➢Left Outer Join or Left Join
➢Right Outer Join or Right Join
3. Cross Join


CROSS JOIN: A cross join, also known as a Cartesian join, joins
every row of one data frame to every row of another data frame.

[Figure: diagrams illustrating INNER JOIN, OUTER JOIN, LEFT JOIN, and RIGHT JOIN]

Data Merging
• Using ‘merge()’ from base R:
• The merge() function in base R combines two data frames based on
common columns.
• It performs various types of joins such as inner join, left join, right join, and
full join.
• Syntax:
merged_df <- merge(x, y, by = "common_column", ...)
• ‘x’ and ‘y’ are the data frames that you want to merge.
• ‘by’ specifies the common columns on which the merge will be performed.
• Additional arguments such as ‘all.x’, ‘all.y’, and ‘all’ control the type of join
that is performed.
Data Merging
• Example:
• Consider two data frames ‘df1’ and ‘df2’
df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(25, 30, 35, 40))

df2 <- data.frame(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(5000, 4000, 6000, 7000))

1. Inner join (default behavior):
ij <- merge(x = df1, y = df2, by = "ID")
print(ij)

The resulting data frame ‘ij’ includes only the rows whose ‘ID’ is present
in both ‘df1’ and ‘df2’.
2. Left join (‘all.x = TRUE’):
lj <- merge(x = df1, y = df2, by = "ID", all.x = TRUE)
print(lj)

The resulting data frame ‘lj’ includes all rows from ‘df1’ and the matching
rows from ‘df2’. Rows of ‘df1’ without a match in ‘df2’ have ‘NA’ in the
columns that come from ‘df2’.
3. Right join (‘all.y = TRUE’):
rj <- merge(df1, df2, by = "ID", all.y = TRUE)
print(rj)

The resulting data frame ‘rj’ includes all rows from ‘df2’ and the matching
rows from ‘df1’. Rows of ‘df2’ without a match in ‘df1’ have ‘NA’ in the
columns that come from ‘df1’.

4. Full outer join (‘all = TRUE’):
foj <- merge(df1, df2, by = "ID", all = TRUE)
print(foj)

The resulting data frame ‘foj’ includes all rows from both ‘df1’ and ‘df2’.
Rows without a match in the other data frame have ‘NA’ in the columns
that come from it.
5. Cross join (‘by = NULL’):
cj <- merge(df1, df2, by = NULL)
print(cj)

A cross join, also known as a Cartesian join, joins every row of one data
frame to every row of the other. Because both inputs have an ‘ID’ column,
the result contains the columns ‘ID.x’ and ‘ID.y’.
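As a quick check on the five join types, the sketch below counts the rows each merge() call returns for the df1 and df2 data frames defined in this example. The vector name rows is illustrative.

```r
df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(25, 30, 35, 40))
df2 <- data.frame(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(5000, 4000, 6000, 7000))

# Number of rows produced by each type of join
rows <- c(inner = nrow(merge(df1, df2, by = "ID")),
          left  = nrow(merge(df1, df2, by = "ID", all.x = TRUE)),
          right = nrow(merge(df1, df2, by = "ID", all.y = TRUE)),
          full  = nrow(merge(df1, df2, by = "ID", all = TRUE)),
          cross = nrow(merge(df1, df2, by = NULL)))
print(rows)
```

IDs 2, 3, and 4 appear in both data frames, so the inner join has 3 rows, the left and right joins have 4 rows each, the full join has 5 rows, and the cross join has 4 x 4 = 16 rows.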

Data Merging
• Using the ‘dplyr’ package:
• ‘dplyr’ provides one function per join type: inner_join(), left_join(), right_join(),
and full_join(). A single join() function with a ‘type’ argument exists in the older
‘plyr’ package, not in ‘dplyr’.
• Syntax:
merged_df <- dplyr::inner_join(x, y, by = "common_column")
• ‘x’ and ‘y’ are the data frames to be merged.
• ‘by’ specifies the common columns on which the merge is to be performed.
• Replace inner_join with left_join, right_join, or full_join to select the type of join.
• Example:
• Consider two data frames ‘df1’ and ‘df2’
df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(25, 30, 35, 40))
df2 <- data.frame(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(5000, 4000, 6000, 7000))

• Inner join:
inner_join <- dplyr::inner_join(df1, df2, by = "ID")
print(inner_join)

• Left join:
left_join <- dplyr::left_join(df1, df2, by = "ID")
print(left_join)

• Right join:
right_join <- dplyr::right_join(df1, df2, by = "ID")
print(right_join)

• Full outer join:
full_join <- dplyr::full_join(df1, df2, by = "ID")
print(full_join)

6 pages
Mydata - Read - CSV ("Nameofthedatafile - CSV") : Sorting A Data Frame
No ratings yet
Mydata - Read - CSV ("Nameofthedatafile - CSV") : Sorting A Data Frame
2 pages
R Programming Easy
No ratings yet
R Programming Easy
8 pages
Basics of R
No ratings yet
Basics of R
12 pages
Module 7 - (Data Analysis With R Programming)
No ratings yet
Module 7 - (Data Analysis With R Programming)
18 pages
Basics of R: Installation & Data Types
No ratings yet
Basics of R: Installation & Data Types
43 pages
Basic R Dplyr Session 4 Demonstration
No ratings yet
Basic R Dplyr Session 4 Demonstration
18 pages
A Crash Course in R - Intro To Statistical Programming
No ratings yet
A Crash Course in R - Intro To Statistical Programming
53 pages
R Topicscovered
No ratings yet
R Topicscovered
22 pages
Lab1 411 Eman Yahya 7773225
No ratings yet
Lab1 411 Eman Yahya 7773225
16 pages
R Vectors and Matrices Guide
No ratings yet
R Vectors and Matrices Guide
33 pages
Intro To Data Science Lecture 4
No ratings yet
Intro To Data Science Lecture 4
13 pages
Introduction To R Chap 2
No ratings yet
Introduction To R Chap 2
30 pages
People Analytics With R Part 3
No ratings yet
People Analytics With R Part 3
11 pages
Statistics With R Unit 1: Divya Arun Kumar
No ratings yet
Statistics With R Unit 1: Divya Arun Kumar
65 pages
Basic Network Commands
No ratings yet
Basic Network Commands
10 pages
Life Is Precious
No ratings yet
Life Is Precious
1 page
Binomial and Poisson Distribution
No ratings yet
Binomial and Poisson Distribution
43 pages
Aws Services Overview
No ratings yet
Aws Services Overview
4 pages
Ampere's Eye Review - 1
No ratings yet
Ampere's Eye Review - 1
11 pages
Fall 2025-26 Mock Registration: Vit - Ap Campus
No ratings yet
Fall 2025-26 Mock Registration: Vit - Ap Campus
1 page
Introduction To R in Data Analytics
No ratings yet
Introduction To R in Data Analytics
135 pages
Data Import, Export and Analysis Using R
No ratings yet
Data Import, Export and Analysis Using R
190 pages
Getting Started With NumPy in Data Analytics
No ratings yet
Getting Started With NumPy in Data Analytics
45 pages
87 Notification MathSC
No ratings yet
87 Notification MathSC
118 pages
Isilon Design Consideration For SMB Environment PDF
No ratings yet
Isilon Design Consideration For SMB Environment PDF
43 pages
4 Marks in Question Bank
No ratings yet
4 Marks in Question Bank
9 pages
Finemind Question Bank Linked List
No ratings yet
Finemind Question Bank Linked List
12 pages
Types of Database Indexing
No ratings yet
Types of Database Indexing
9 pages
Global Variables Declaration
No ratings yet
Global Variables Declaration
11 pages
Vsphere Esxi Vcenter Server 67 Storage Guide PDF
100% (1)
Vsphere Esxi Vcenter Server 67 Storage Guide PDF
357 pages
Willem EPROM Programmer Guide
100% (1)
Willem EPROM Programmer Guide
8 pages
NFA008 Examen Final 2021-2022 VEng Session 1
No ratings yet
NFA008 Examen Final 2021-2022 VEng Session 1
5 pages
C•CURE 9000 v2.40 Enhancements
No ratings yet
C•CURE 9000 v2.40 Enhancements
16 pages
Erd University Unit2
100% (1)
Erd University Unit2
6 pages
Communication Protocol of VL502 V1.0.7 - Eng
No ratings yet
Communication Protocol of VL502 V1.0.7 - Eng
66 pages
Prof David Marshall: Dave - Marshall@cs - Cardiff.ac - Uk
No ratings yet
Prof David Marshall: Dave - Marshall@cs - Cardiff.ac - Uk
27 pages
Association Rule Mining
No ratings yet
Association Rule Mining
20 pages
BMS Protocol
No ratings yet
BMS Protocol
8 pages
Druva Partner Marketing Starter Kit AMS PDF
No ratings yet
Druva Partner Marketing Starter Kit AMS PDF
19 pages
AnalysisReport 1709047325596
No ratings yet
AnalysisReport 1709047325596
45 pages
Computer Science Practical File XII
No ratings yet
Computer Science Practical File XII
15 pages
2026 2028 Syllabus
No ratings yet
2026 2028 Syllabus
56 pages
Huawei FusionSphere 5.1 Data Sheet (Server Virtualizaiton)
100% (1)
Huawei FusionSphere 5.1 Data Sheet (Server Virtualizaiton)
12 pages
Network Programming Lab Guide
No ratings yet
Network Programming Lab Guide
35 pages
ZCS5 Day 3
No ratings yet
ZCS5 Day 3
85 pages
Supermarket Management System Analysis
No ratings yet
Supermarket Management System Analysis
6 pages
Upgrade Oracle E-Business Suite From R12.1.2 To R12.13
No ratings yet
Upgrade Oracle E-Business Suite From R12.1.2 To R12.13
13 pages
Week 3: Assignment: Assignment Submitted On 2025-02-12, 12:17 IST
No ratings yet
Week 3: Assignment: Assignment Submitted On 2025-02-12, 12:17 IST
5 pages
Processing Cheat Sheet
No ratings yet
Processing Cheat Sheet
1 page
RAIDIX 5.1 Product Features
No ratings yet
RAIDIX 5.1 Product Features
15 pages
Exercise No. 7 Heap and Priority Queue
No ratings yet
Exercise No. 7 Heap and Priority Queue
4 pages
Cs601 Assignment No 1 Sp23
No ratings yet
Cs601 Assignment No 1 Sp23
2 pages
MongoDB & Atlas Fundamentals Guide
No ratings yet
MongoDB & Atlas Fundamentals Guide
50 pages