BDA Lab Manual

Vidarbha Youth Welfare Society’s

Prof. Ram Meghe Institute of Technology & Research,


Badnera-Amravati (M.S) 444701.

Practical Record
Semester VI

Subject: (6KS07) Emerging Technology (BDA) Lab-II

Department of Computer Science and Engineering


Prof. Ram Meghe Institute of Technology & Research Badnera,
Amravati (M.S) 444701.

CERTIFICATE

This is to certify that Mr/Miss ______________________ Enrolment
No. ____________ Roll No. ______ Section ____ of B.E.

Third year Semester VI, Department of Computer Science & Engineering has
satisfactorily completed the practical work of the subject Emerging Technology
Lab-II (BDA) prescribed by Sant Gadge Baba Amravati University, Amravati
during the academic term 2021-22.

Signature of the faculty Head of Department

Date:
1. Vision and Mission of the Institute and Programme

● Vision & Mission statement of the Institute

VISION
To become a pace-setting centre of excellence believing in three
universal values namely Synergy, Trust and Passion, with zeal to serve the Nation in the
global scenario.

MISSION
To dedicate ourselves to the highest standard of technical education
& research in core & emerging engineering disciplines
and strive for the overall personality development of students
so as to nurture not only quintessential technocrats
but also responsible citizens.

● Vision & Mission statement of the Department of Computer Science & Engineering

VISION
To ensure that the world saves time and other depletable resources and free it from
complexity by providing efficient computing services.

MISSION
To dedicate ourselves to the highest standard by providing knowledge, skills and wisdom
to the incumbent by imparting value based education to enable them to solve complex
system by simple algorithms and to make them innovative, research oriented to serve the
global society, while imbibing highest ethical values.
2. Program educational objective (PEO’s), program outcomes (PO’s) and
Program Specific Outcomes (PSO’s)
● Program educational objective (PEO’s)
PEO1. Preparation: To prepare students for successful careers in the software industry that meet the
needs of Indian and multinational companies or to excel in Higher studies.
PEO2. Core competence: To develop the ability among students to synthesize data and technical
concepts for software design and development.
PEO3. Breadth: To inculcate in students professional and ethical attitude, effective communication
skills, teamwork skills, multidisciplinary approach and an ability to relate engineering issues to
broader social context.
PEO4. Professionalism: To provide students with a sound foundation in the mathematical, scientific
and computer engineering fundamentals required to solve engineering problems and also pursue
higher studies.
PEO5. Learning Environment: To provide students with an academic environment of excellence,
leadership, written ethical codes and guidelines, and an awareness of the life-long learning needed
for a successful professional career.

● Program Outcomes (PO’s)


Engineering Graduate will be able to:
PO1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO2. Problem analysis: Identify, formulate, review research literature, and analyse complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
PO3. Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate consideration
for the public health and safety, and the cultural, societal, and environmental considerations.
PO4. Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.
PO5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modelling to complex engineering activities with an
understanding of the limitations.
PO6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.
PO7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for
sustainable development.
PO8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
PO9. Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
PO10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive clear
instructions.
PO11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and leader
in a team, to manage projects and in multidisciplinary environments.
PO12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.

● Program Specific Outcomes (PSO’s)


PSO1: Foundation of Computer System Development: Ability to use knowledge of computer
systems and design principles in building the hardware and software components, products in the
domain of embedded system, artificial intelligence, databases, networking, web technology and
mobile computing.
PSO2: Problem Solving Ability: Ability to apply knowledge in various problem domains and
implement innovative and suitable solutions to cater to needs of industry, business and e-governance
by imbibing highest ethical and economical values.

3. Syllabus, Course outcomes and Mapping of CO’s with PO’s


Course Number and Title : Emerging Technology Lab-II (BDA) (6KS07)
Faculty : Prof. N.M.Yawale, (nmyawale@mitra.ac.in)

Course Type : Theory & Laboratory


Compulsory / Elective : Elective
Teaching Method : Theory: 03 Hrs / Week
Laboratory: 02 Hrs / week (04 Batches)
Subject Credits : 03 Credit for Theory
03 Credit for Laboratory
Course Assessment : Laboratory: Lab Performance + Viva-Voce Semester
examination by SGBAU
Grading Policy : Laboratory: 50 % Internal + 50 % External.
25 Internal Marks + 25 External Marks (Viva-Voce
Semester exam by SGBAU)
Prerequisites:
Knowledge of basic computer science principles and skills, Basic knowledge of Linear Algebra
and Probability Theory, Basic knowledge of Data Base Management Systems

● Course Learning Objectives:


Throughout the course, students will be expected to demonstrate their understanding of Big Data
Analytics by being able to do each of the following:
1. To know the fundamental concepts of big data and analytics.
2. To explore tools and practices for working with big data.
3. To know about the research that requires the integration of large amounts of data.
● Course Outcomes (CO’s)
At the end of the course, Students will be able to:
Sr. No. Course Outcome

K607.1 Work with big data tools and their analysis techniques.
K607.2 Analyze data by utilizing clustering and classification algorithms.
K607.3 Learn and apply different algorithms and recommendation systems for large volumes of data.
K607.4 Perform analytics on data streams.
K607.5 Learn NoSQL databases and their management.

4. Practical Evaluation Guidelines (ACPV Guidelines)


Guidelines for Awarding Internal Marks for Practical:
At the end of the semester, internal assessment marks for practical shall be the average of
marks scored in all the experiments.

a. Attendance (05 Marks): These 05 marks will be given for the regularity of a student. If
the student is present on the scheduled day, he / she will be awarded 05 marks.
b. Competency (05 Marks): Here the basic aim is to check the competency of the student: the
skill to interpret the aim of the practical and provide a solution for it. The expectation from the
student is to improvise on the existing solution and implement the given modification. Marks will
be given according to the level of improvement made in the practical.
c. Performance (05 Marks): Here the basic aim is to check whether the student has
developed the skill to write the code on his own, and compile / debug it. If a student
correctly executes the practical on the scheduled day within the allocated time, he / she will be
awarded full marks. Otherwise, marks will be given according to the level of completion /
execution of the practical.
d. Viva-Voce (05 Marks): These 05 marks will be totally based on the performance in
viva-voce. There shall be a viva-voce on the completion of each practical, in the same practical
session. The student shall get the record checked before the next practical.
LIST OF PRACTICALS
Sr. No. Aim Page No. Signature
1 Installation of Hadoop & R
2 Study of R: Declaring Variable, Expression, Function and Executing R script.
3 Creating List in R – merging two lists, adding matrices in lists, adding vectors in list.
4 Manipulating & Processing Data in R – merging data sets, sorting data, plotting data, managing data using matrices & data frames.
5 Implementation of K-Means Clustering with R
6 Implementation of Apriori Algorithm with R
7 Text Analysis using R: analyzing minimum two different data sets
8 Twitter Data Analysis with R
9 Sentiment Analysis of WhatsApp data with R

Practical No. 1
Aim: Installation of Hadoop & R
Software Required: Hadoop, R, JDK8
Theory:
1. Installation Steps for Java
• Step 1- Download Java JDK 8. You can download Java 8 from Oracle's official Java
website (www.java.com).

• Step 2- Run the Installer.

• Step 3- Custom Setup.


• Step 4 – Installation begins.
• Step 5- Check the version of Java installed.
- Open Command Prompt
- Type "java -version" to check the version of your Java
2. Installation steps for Hadoop
Java is the main prerequisite for Hadoop. First of all, verify the existence of Java on
your system using the command "java -version". The syntax of the Java version command is given
below.
$ java -version

Step 1
Download java (JDK <latest version> - X64.tar.gz) by visiting the following
link www.oracle.com
Then jdk-7u71-linux-x64.tar.gz will be downloaded into your system.

Step 2
Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract
the jdk-7u71-linux-x64.gz file there. The remaining setup then proceeds as follows:
• Step 1: Download the Java 8 package.
• Step 2: Extract the Java tar file.
• Step 3: Download the Hadoop 2.7.3 package.
• Step 4: Extract the Hadoop tar file.
• Step 5: Add the Hadoop and Java paths in the bash file.
• Step 6: Edit the Hadoop configuration files.
• Step 7: Open core-site.xml.
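
Once these steps are done, you can sanity-check the setup from within R. A minimal sketch, assuming JAVA_HOME and HADOOP_HOME were exported in the bash file in Step 5:

# Environment variables assumed to have been set in Step 5
Sys.getenv("JAVA_HOME")    # should print the JDK installation path
Sys.getenv("HADOOP_HOME")  # should print the Hadoop installation path

# Call the installed tools through the shell to confirm their versions
system("java -version")
system("hadoop version")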

3. Viva Questions
1. What are the different data structures in R?
2. What is class()?
3. How do you install a package in R?
4. List some Advantages and disadvantages of R Language.

Conclusion:
Practical No.2
Aim: Study of R: Declaring Variable, Expression, Function and Executing R script
Software Required: R
Theory:
R Language:
● R is a programming language.
● R is often used for statistical computing and graphical presentation to analyze and visualize data.
● Examples of Expressions:
● print("hello")
● plot(1:10)
● 5+5
Why Use R
● It is a great resource for data analysis, data visualization, data science and machine learning
● It provides many statistical techniques (such as statistical tests, classification, clustering and data
reduction)
● It is easy to draw graphs in R, like pie charts, histograms, box plots, scatter plots, etc.
● It works on different platforms (Windows, Mac, Linux)
● It is open-source and free
● It has many packages (libraries of functions) that can be used to solve different problems
R Variables
● Creating Variables in R
● Variables are containers for storing data values.
● R does not have a command for declaring a variable. A variable is created the moment you first
assign a value to it. To assign a value to a variable, use the <- sign. To output (or print) the
variable value, just type the variable name.
● Compared to many other programming languages, you do not have to use a function to
print/output variables in R. You can just type the name of the variable
● Example
● name <- "John"
age <- 40
Output:
print(name) # output "John"
name # output "John"
age # output 40
Multiple Variables
● R allows you to assign the same value to multiple variables in one line
● Example
● # Assign the same value to multiple variables in one line
var1 <- var2 <- var3 <- "Orange"

# Print variable values


var1
var2
var3

Variable Names
● A variable can have a short name (like x and y) or a more descriptive name (age, carname,
total_volume).
● Rules for R variables are:
● A variable name must start with a letter and can be a combination of letters, digits,
period (.) and underscore (_). If it starts with a period (.), it cannot be followed by a digit.
● A variable name cannot start with a number or underscore (_)
● Variable names are case-sensitive (age, Age and AGE are three different variables)
● Reserved words cannot be used as variables (TRUE, FALSE, NULL, if...)
# Legal variable names:
myvar <- "John"
my_var <- "John"
myVar <- "John"
MYVAR <- "John"
myvar2 <- "John"
.myvar <- "John"

# Illegal variable names:


2myvar <- "John"
my-var <- "John"
my var <- "John"
_my_var <- "John"
my_v@ar <- "John"
TRUE <- "John"
R Functions
● A function is a block of code which only runs when it is called.
● You can pass data, known as parameters, into a function.
● A function can return data as a result.
Creating a Function
● To create a function, use the function() keyword
● Example
my_function <- function()
{ # create a function with the name my_function
print("Hello World!")
}
Call a Function
● To call a function, use the function name followed by parenthesis, like my_function():
● Example
my_function <- function()
{
print("Hello World!")
}
my_function()
Return values from Function
● To let a function return a result, use the return() function:
Example
my_function <- function(x)
{
return (5 * x)
}

print(my_function(3))
print(my_function(5))
print(my_function(9))
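
The aim also covers executing an R script. A minimal sketch, where myscript.R is a hypothetical file name for a script containing the code above:

# Execute a saved script from the R console
source("myscript.R")   # myscript.R is a hypothetical file name

# Or run it from the operating-system command line:
# Rscript myscript.R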

Program and Output:


Viva Questions:
1. How do you declare variables in R?
2. What is a data type? List the different data types available in R.
3. What is a global variable?
4. What is a function in R?

Conclusion:
Practical No. 3
Aim: Creating List in R – merging two lists, adding matrices in lists, adding vectors in list.
Software Required: R
Theory:
List
● A list in R can contain many different data types inside it. A list is a collection of data
which is ordered and changeable.
● To create a list, use the list() function:
● Example
● # List of strings
thislist <- list("apple", "banana", "cherry")

# Print the list


thislist
Access the List
● You can access the list items by referring to its index number, inside brackets. The first
item has index 1, the second item has index 2, and so on:
● Example
● thislist <- list("apple", "banana", "cherry")

thislist[1]
Add List Items
● To add an item to the end of the list, use the append() function:
● Example
● Add "orange" to the list:
● thislist <- list("apple", "banana", "cherry")

append(thislist, "orange")
● To add an item to the right of a specified index, add "after=index number" in
the append() function:
● Example
● Add "orange" to the list after "banana" (index 2):
● thislist <- list("apple", "banana", "cherry")

append(thislist, "orange", after = 2)


Remove List Item
● You can also remove list items. The following example creates a new, updated list
without an "apple" item:
● Example
● Remove "apple" from the list:
● thislist <- list("apple", "banana", "cherry")

newlist <- thislist[-1]

# Print the new list


newlist
Range of Indexes
● You can specify a range of indexes by specifying where to start and where to end the
range, by using the operator ( : )
● Example
● Return the second, third, fourth and fifth item:
● thislist <- list("apple", "banana", "cherry", "orange", "kiwi", "melon", "mango")

(thislist)[2:5]
Join Two Lists
● There are several ways to join, or concatenate, two or more lists in R.
● The most common way is to use the c() function, which combines two elements together:
● Example
● list1 <- list("a", "b", "c")
list2 <- list(1,2,3)
list3 <- c(list1,list2)

list3

Vectors
● A vector is simply a list of items that are of the same type.
● To combine the list of items to a vector, use the c() function and separate the items by a
comma.
● In the example below, we create a vector variable called fruits, that combine strings:
● Example 1
● # Vector of strings
fruits <- c("banana", "apple", "orange")

# Print fruits
fruits
● Example 2
● # Vector of numerical values
numbers <- c(1, 2, 3)

# Print numbers
numbers
● Example 3
● # Vector with numerical values in a sequence
numbers <- 1:10

numbers
Vector Length
● To find out how many items a vector has, use the length() function:
● Example
fruits <- c("banana", "apple", "orange")

length(fruits)
Access Vectors
● You can access the vector items by referring to its index number inside brackets [].
● The first item has index 1, the second item has index 2, and so on:
● Example
● fruits <- c("banana", "apple", "orange")
# Access the first item (banana)
fruits[1]
● You can also access multiple elements by referring to different index positions with
the c() function:
● Example
● fruits <- c("banana", "apple", "orange", "mango", "lemon")

# Access the first and third item (banana and orange)


fruits[c(1, 3)]
R Matrices
● A matrix is a two dimensional data set with columns and rows.
● A column is a vertical representation of data, while a row is a horizontal representation of
data.
● A matrix can be created with the matrix() function. Specify the nrow and ncol parameters
to set the number of rows and columns:
● Example
● # Create a matrix
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)

# Print the matrix


thismatrix
● You can also create a matrix with strings:
● Example
● thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)

thismatrix
Access Matrix Items
● You can access the items by using [ ] brackets. The first number "1" in the bracket
specifies the row-position, while the second number "2" specifies the column-position:
● Example
● thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)

thismatrix[1, 2]
● The whole row can be accessed if you specify a comma after the number in the bracket:
● Example
● thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)

thismatrix[2,]
● The whole column can be accessed if you specify a comma before the number in the
bracket:
● Example
● thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)

thismatrix[,2]
Access More Than One Row
● More than one row can be accessed if you use the c() function:
● Example
● thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)

thismatrix[c(1,2),]
Access More Than One Column
● More than one column can be accessed if you use the c() function:
● Example
● thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)

thismatrix[, c(1,2)]

Add Rows and Columns


● Use the cbind() function to add additional columns in a Matrix:
● Example
● thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)

newmatrix <- cbind(thismatrix, c("strawberry", "blueberry", "raspberry"))

# Print the new matrix


newmatrix
● Use the rbind() function to add additional rows in a Matrix:
● Example
● thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)

newmatrix <- rbind(thismatrix, c("strawberry", "blueberry", "raspberry"))

# Print the new matrix


newmatrix
Remove Rows and Columns
● Use the c() function to remove rows and columns in a Matrix:
● Example
● thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "mango", "pineapple"), nrow = 3, ncol = 2)

#Remove the first row and the first column


thismatrix <- thismatrix[-c(1), -c(1)]

thismatrix
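
The aim also covers adding matrices and vectors in lists, which is not shown above. A minimal sketch reusing the list(), matrix() and append() functions from this practical:

# A list holding a matrix and a vector as its elements
mymatrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
myvector <- c("apple", "banana", "cherry")
mylist <- list(mymatrix, myvector)

# Add another matrix to the list; wrapping it in list() keeps it as a
# single element instead of splitting it into individual values
mylist <- append(mylist, list(matrix(1:4, nrow = 2)))

# Add another vector to the list in the same way
mylist <- append(mylist, list(c(10, 20, 30)))

mylist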

Program and Output:


Viva Questions:
1. What is the use of a vector in the R language?
2. Explain rbind() and cbind().
3. What is a matrix? What is the use of matrices?
4. Differentiate between a list and a vector.
Conclusion:
Practical No. 4
Aim: Manipulating & Processing Data in R- Merging datasets, sorting data, plotting data,
Managing data using matrices and data frames.
Software Required: R
Theory:
Data Frame
● Data Frames are data displayed in a table format.
● Data Frames can have different types of data inside them. While the first column can
be character, the second and third can be numeric or logical. However, each column
should have the same type of data.
● Use the data.frame() function to create a data frame:
● Example
● # Create a data frame
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

# Print the data frame


Data_Frame
Summarize the Data
● Use the summary() function to summarize the data from a Data Frame:
● Example
● Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

Data_Frame

summary(Data_Frame)
Access Items
● We can use single brackets [ ], double brackets [[ ]] or $ to access columns from a data
frame:
● Example
● Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

Data_Frame[1]

Data_Frame[["Training"]]
Data_Frame$Training
Add Rows
● Use the rbind() function to add new rows in a Data Frame:
● Example
● Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

# Add a new row


New_row_DF <- rbind(Data_Frame, c("Strength", 110, 110))

# Print the new row


New_row_DF
Add Columns
● Use the cbind() function to add new columns in a Data Frame:
● Example
● Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

# Add a new column


New_col_DF <- cbind(Data_Frame, Steps = c(1000, 6000, 2000))

# Print the new column


New_col_DF

Remove Rows and Columns


● Use the c() function to remove rows and columns in a Data Frame:
● Example
● Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

# Remove the first row and column


Data_Frame_New <- Data_Frame[-c(1), -c(1)]

# Print the new data frame


Data_Frame_New
Combining Data Frames
● Use the rbind() function to combine two or more data frames in R vertically:
● Example
● Data_Frame1 <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

Data_Frame2 <- data.frame (


Training = c("Stamina", "Stamina", "Strength"),
Pulse = c(140, 150, 160),
Duration = c(30, 30, 20)
)

New_Data_Frame <- rbind(Data_Frame1, Data_Frame2)


New_Data_Frame
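
rbind() stacks data frames vertically; the aim also asks for merging data sets on a common column and sorting data, which the notes above do not show. A minimal sketch using the base merge() and order() functions (the ID column is a hypothetical key added for illustration):

df1 <- data.frame(ID = c(1, 2, 3), Training = c("Strength", "Stamina", "Other"))
df2 <- data.frame(ID = c(1, 2, 3), Pulse = c(100, 150, 120))

# Merge the two data frames on the shared ID column
merged <- merge(df1, df2, by = "ID")

# Sort the merged data by Pulse in descending order
merged[order(-merged$Pulse), ]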

Plot
● The plot() function is used to draw points (markers) in a diagram.
● The function takes parameters for specifying points in the diagram.
● Parameter 1 specifies points on the x-axis.
● Parameter 2 specifies points on the y-axis.
● At its simplest, you can use the plot() function to plot two numbers against each other:
● Example
● Draw one point in the diagram, at position x = 1, y = 3:
● plot(1, 3)
● Example
● Draw two points in the diagram, one at position (1, 3) and one in position (8, 10):
● plot(c(1, 8), c(3, 10))

Multiple Points:
● You can plot as many points as you like; just make sure you have the same number of
points on both axes:
● Example
● plot(c(1, 2, 3, 4, 5), c(3, 7, 8, 9, 12))
Sequences of Points
● If you want to draw dots in a sequence, on both the x-axis and the y-axis, use
the : operator:
● Example
● plot(1:10)

Draw a Line
● The plot() function also takes a type parameter with the value l to draw a line to connect
all the points in the diagram:
● Example
● plot(1:10, type="l")
Plot Labels
● The plot() function also accepts other parameters, such as main, xlab and ylab, if you want
to customize the graph with a main title and different labels for the x- and y-axis:
● Example
● plot(1:10, main="My Graph",
● xlab="The x-axis",
● ylab="The y axis")

Graph Appearance
● There are many other parameters you can use to change the appearance of the points.
● Colors
● Use col="color" to add a color to the points:
● Example
● plot(1:10, col="red")
Point Shape
● Use pch with a value from 0 to 25 to change the point shape format:
● Example
● plot(1:10, pch=25, cex=2)
● The value of the pch parameter ranges from 0 to 25, which means that we can choose from
26 different types of point shapes:

Program and Output:


Viva Questions:
1. What is the use of the pch parameter?
2. What is a dataset?
3. Which function is used for combining data frames? Explain with syntax.

Conclusion:
Practical No. 5
Aim: Implementing the K-Means Clustering Algorithm.
Software Required: R
Theory:
● Given a set of n distinct objects, the k-means clustering algorithm partitions the objects
into k clusters.
● In this algorithm, the user has to specify k, the number of clusters.
● Given a collection of objects each with n measurable attributes, k-means is an analytical
technique that, for a chosen value of k, identifies k clusters of objects based on the
objects’ proximity to the center of the k groups.
● The center is determined as the arithmetic average (mean) of each cluster’s n-dimensional
vector of attributes.
The k-means algorithm to find k clusters
Step 1: Choose the value of k and the k initial guesses for the centroids.
● In this example, k = 3, and the initial centroids are indicated by the points shaded
in red, green, and blue

Step 2: Compute the distance from each data point (Xi, Yi) to each centroid. Assign each point
to the closest centroid. This association defines the first k clusters.
● In two dimensions, the distance, d, between any two points, (x1, y1) and (x2, y2), in the
Cartesian plane is typically expressed using the Euclidean distance measure:
d = √((x1 − x2)² + (y1 − y2)²)
Step 3: Compute the centroid, the center of mass, of each newly defined cluster from Step 2.
● In two dimensions, the centroid (xc, yc) of the m points in a k-means cluster is
calculated as xc = (1/m) Σ xi and yc = (1/m) Σ yi.

Step 4: Repeat Steps 2 and 3 until the algorithm converges to an answer.


1. Assign each point to the closest centroid computed in Step 3.
2. Compute the centroid of newly defined clusters.
3. Repeat until the algorithm reaches the final answer.
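
Since the Program section is left to the student, the following is a minimal sketch using R's built-in kmeans() function; the iris data set and k = 3 are illustrative assumptions, not the prescribed program:

# Use the four numeric columns of the built-in iris data set
data <- iris[, 1:4]

# Run k-means with k = 3; nstart tries several random initial
# centroid sets and keeps the best clustering
set.seed(123)
km <- kmeans(data, centers = 3, nstart = 20)

km$cluster   # cluster assignment of each observation
km$centers   # final centroids (mean of each cluster)

# Visualize the clusters on the first two attributes
plot(data[, 1:2], col = km$cluster, main = "k-means clustering (k = 3)")
points(km$centers[, 1:2], col = 1:3, pch = 8, cex = 2)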
Program:
Output:
Viva Question:
1. What is the use of k-means clustering?
2. What is a cluster?
3. What are the applications of k-means clustering?

Conclusion:
Practical No. 6
Aim: Implementing the Apriori Algorithm.
Software Required: R
Theory:
● Apriori is one of the earliest and the most fundamental algorithms for generating
association rules.
● One major component of Apriori is support. Given an itemset L, the support of L is the
percentage of transactions that contain L.
● For example, if 80% of all transactions contain itemset {bread}, then the support of
{bread} is 0.8. Similarly, if 60% of all transactions contain itemset {bread,butter}, then
the support of {bread,butter} is 0.6.
● A frequent itemset has items that appear together often enough. The term “often enough”
is formally defined with a minimum support criterion.
● If the minimum support is set at 0.5, any itemset can be considered a frequent itemset if
at least 50% of the transactions contain this itemset.
● In other words, the support of a frequent itemset should be greater than or equal to the
minimum support.
● For the previous example, both {bread} and {bread,butter} are considered frequent
itemsets at the minimum support 0.5. If the minimum support is 0.7, only {bread} is
considered a frequent itemset.
● If an itemset is considered frequent, then any subset of the frequent itemset must also be
frequent.
● This is referred to as the Apriori property (or downward closure property).
● For example, if 60% of the transactions contain {bread,jam}, then at least 60% of all the
transactions will contain {bread} or {jam}.
● In other words, when the support of {bread,jam} is 0.6, the support of {bread} or {jam}
is at least 0.6.

● If itemset {B,C,D} is frequent, then all the subsets of this itemset must also be
frequent itemsets.

Apriori Algorithm:
● Let Ck be the set of candidate k-itemsets and Lk be the set of k-itemsets that satisfy
the minimum support.
● Given a transaction database D, a minimum support threshold δ, and an optional
parameter N indicating the maximum length an itemset could reach, Apriori
iteratively computes the frequent itemsets Lk+1 based on Lk.
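
As with the previous practical, the Program section is left to the student; the following is a minimal sketch using the apriori() function from the arules package (an assumed package choice) on a small hand-made transaction set:

# install.packages("arules")  # assumed to be installed
library(arules)

# A small transaction database: each element is one market basket
baskets <- list(
  c("bread", "butter", "jam"),
  c("bread", "butter"),
  c("bread", "jam"),
  c("butter", "milk"),
  c("bread", "butter", "milk")
)
trans <- as(baskets, "transactions")

# Mine association rules with minimum support 0.4 and confidence 0.6
rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.6))

# Inspect the generated rules with their support and confidence
inspect(rules)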

Program:
Output:
Viva Question:
1. What is the use of the Apriori algorithm?
2. List applications of the Apriori algorithm.
3. What is an association rule?

Conclusion:
Practical No.7

Aim: Text Analysis using R: analyzing minimum two different data sets
Software Required: R
Theory:

Example 1:
1. Loading the Data set
There are some data sets that come pre-installed in R. Here, we shall be using the Titanic
data set. While using any external data source, we can use the read functions to load the
files (Excel, CSV, HTML, text files, etc.). This data set is also available on Kaggle;
you may download both the train and test files.

titanic <- read.csv("C:/Users/Desktop/titanic.csv", header=TRUE, sep=",")

The above code reads the file titanic.csv into a dataframe titanic. With header=TRUE we
are specifying that the data includes a header (column names), and sep="," specifies that the
values in the data are comma separated.

2. Understanding the Data set


We have used the Titanic data set, which contains historical records of all the passengers who
boarded the Titanic. Below is a brief description of the 12 variables in the data set:
• PassengerId: Serial number
• Survived: Binary values 0 & 1 (0: passenger did not survive; 1: passenger survived)
• Pclass: Ticket class (1st, 2nd or 3rd class)
• Name: Name of the passenger
• Sex: Male or Female
• Age: Age in years (integer)
• SibSp: No. of siblings/spouses aboard (brothers, sisters and/or husband/wife)
• Parch: No. of parents/children aboard (mother/father and/or daughter/son)
• Ticket: Ticket serial number
• Fare: Passenger fare
• Cabin: Cabin number
• Embarked: Port of embarkation (C: Cherbourg, Q: Queenstown, S: Southampton)
2.1 Peek at your Data
Before we begin working on the dataset, let’s have a good look at the raw data.

View(titanic)

This helps us in familiarising with the data set.

head(titanic,n) | tail(titanic,n)

In order to have a quick look at the data, we often use the head()/tail().

(Output: the top 10 rows and the bottom 5 rows of the data set.)


In case we do not explicitly pass the value for n, it takes the default value of 6, and displays 6
rows.
names(titanic)

This helps us in checking out all the variables in the data set.

Familiarising with all the Variables/Column Names


str(titanic)
This helps in understanding the structure of the data set, data type of each attribute and number of
rows and columns present in the data.

summary(titanic)

A cursory look at the data


summary() is one of the most important functions that helps in summarising each attribute in the
dataset. It gives a set of descriptive statistics, depending on the type of variable:
• In case of a numerical variable -> gives the mean, median, minimum, maximum and quartiles.
• In case of a factor variable -> gives a table with the frequencies.
• In case of factor + numerical variables -> gives the number of missing values.
• In case of character variables -> gives the length and the class.
In case we just need the summary statistic for a particular variable in the dataset, we can use:

summary(datasetName$VariableName), e.g. summary(titanic$Pclass)


as.factor(dataset$ColumnName)

There are times when some of the variables in the data set are factors but might get
interpreted as numeric. For example, Pclass (passenger class) takes the values 1, 2 and 3;
however, we know that these are not to be considered numeric, as these are just levels. In order
to have such variables treated as factors and not as numbers, we need to explicitly convert them
to factors using the function as.factor().

3. Analysis & Visualisations

Data Visualisation is the art of turning data into insights that can be easily interpreted.

What was the survival rate?


When talking about the Titanic data set, the first question that comes up is “How many people
survived?”. Let’s draw a simple bar graph to answer it.
ggplot(titanic, aes(x=Survived)) + geom_bar()

On the X-axis we have the Survived variable, 0 representing the passengers that did not survive,
and 1 representing the passengers who survived. The Y-axis represents the number of passengers.
Here we see that over 550 passengers did not survive and around 340 passengers survived.

Let’s make this clearer by checking out the percentages:


prop.table(table(titanic$Survived))

Only 38.38% of the passengers who boarded the Titanic survived.

Survival rate basis Gender

It is believed that in rescue operations during disasters, women’s safety is prioritised. Did
the same happen back then?
We see that the survival rate amongst the women was significantly higher when compared to men.
The survival ratio amongst women was around 75%, whereas for men it was less than 20%.
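
A minimal sketch of how this comparison can be plotted with ggplot2, assuming the titanic data frame loaded in step 1:

# Survival proportion by gender; position = "fill" scales each bar to 1
ggplot(titanic, aes(x = Sex, fill = factor(Survived))) +
  geom_bar(position = "fill") +
  labs(y = "Proportion of passengers", fill = "Survived")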

Survival Rate basis Class of tickets (Pclass)


There were 3 segments of passengers, depending upon the class they were travelling in, namely,
1st class, 2nd class and 3rd class. We see that over 50% of the passengers were travelling in the
3rd class.

Survival Rate basis Passenger Class

1st and 2nd Class passengers disproportionately survived, with over 60% survival rate of the 1st
class passengers, around 45–50% of 2nd class, and less than 25% survival rate of those travelling
in 3rd class.
I’ll leave you with a thought… Was it because of preferential treatment for the passengers
travelling elite class, or proximity, as the 3rd class compartments were on the lower decks?
Survival Rate basis Class of tickets and Gender(pclass)
We see that the females in the 1st and 2nd class had a very high survival rate. The survival
rate for the females travelling in 1st and 2nd class was 96% and 92% respectively,
corresponding to 37% and 16% for men. The survival rate for men travelling 3rd class was
less than 15%.
Till now it is evident that the Gender and Passenger class had significant impact on the survival
rates. Let’s now check the impact of passenger’s Age on Survival Rate.
Survival rates basis age

Looking at the age < 10 years section of the graph, we see that the survival rate is high, while
the survival rate drops beyond the age of 45.
Example 2

Installing and loading R packages


● tm for text mining operations like removing numbers, special characters, punctuation and
stop words (stop words in any language are the most commonly occurring words that have
very little value for NLP and should be filtered out; examples of stop words in English are
“the”, “is”, “are”)
● SnowballC for stemming, which is the process of reducing words to their base or root form
(for example, a stemming algorithm would reduce the words “fishing”, “fished” and “fisher” to
the stem “fish”)
● wordcloud for generating the word cloud plot
● RColorBrewer for color palettes used in various plots
● syuzhet for sentiment scores and emotion classification
● ggplot2 for plotting graphs
# Install
install.packages("tm") # for text mining
install.packages("SnowballC") # for text stemming
install.packages("wordcloud") # word-cloud generator
install.packages("RColorBrewer") # color palettes
install.packages("syuzhet") # for sentiment analysis
install.packages("ggplot2") # for plotting graphs
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library("syuzhet")
library("ggplot2")

Reading file data into R


The R base function read.table() is generally used to read a file in table format and imports the
data as a data frame. Several variants of this function are available for importing different file
formats:

● read.csv() is used for reading comma-separated value (csv) files, where a comma “,” is used
as the field separator
● read.delim() is used for reading tab-separated value (.txt) files
# Read the text file from local machine , choose file interactively
text <- readLines(file.choose())
# Load the data as a corpus
TextDoc <- Corpus(VectorSource(text))
Cleaning up Text Data
Cleaning the text data starts with making transformations like removing special characters from the
text. This is done using the tm_map() function to replace special characters like /, @ and | with a
space. The next step is to remove the unnecessary whitespace and convert the text to lower case.

Then remove the stopwords. They are the most commonly occurring words in a language and have
very little value in terms of gaining useful information. They should be removed before performing
further analysis. Examples of stopwords in English are “the, is, at, on”. There is no single universal list
of stop words used by all NLP tools. The stopwords() function used with tm_map() supports several
languages, like English, French, German, Italian, and Spanish. Please note the language names are case
sensitive. I will also demonstrate how to add your own list of stopwords, which is useful in this Team
Health example for removing non-default stop words like “team”, “company”, “health”. Next, remove
numbers and punctuation.

#Replacing "/", "@" and "|" with space


toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
TextDoc <- tm_map(TextDoc, toSpace, "/")
TextDoc <- tm_map(TextDoc, toSpace, "@")
TextDoc <- tm_map(TextDoc, toSpace, "\\|")
# Convert the text to lower case
TextDoc <- tm_map(TextDoc, content_transformer(tolower))
# Remove numbers
TextDoc <- tm_map(TextDoc, removeNumbers)
# Remove english common stopwords
TextDoc <- tm_map(TextDoc, removeWords, stopwords("english"))
# Remove your own stop word
# specify your custom stopwords as a character vector
TextDoc <- tm_map(TextDoc, removeWords, c("s", "company", "team"))
# Remove punctuations
TextDoc <- tm_map(TextDoc, removePunctuation)
# Eliminate extra white spaces
TextDoc <- tm_map(TextDoc, stripWhitespace)
# Text stemming - which reduces words to their root form
TextDoc <- tm_map(TextDoc, stemDocument)

Building the term document matrix


After cleaning the text data, the next step is to count the occurrence of each word, to identify popular
or trending topics. Using the function TermDocumentMatrix() from the text mining package, you can
build a term-document matrix – a table containing the frequency of words.

# Build a term-document matrix


TextDoc_dtm <- TermDocumentMatrix(TextDoc)
dtm_m <- as.matrix(TextDoc_dtm)
# Sort by decreasing value of frequency
dtm_v <- sort(rowSums(dtm_m),decreasing=TRUE)
dtm_d <- data.frame(word = names(dtm_v),freq=dtm_v)
# Display the top 5 most frequent words
head(dtm_d, 5)

Plotting the top 5 most frequent words using a bar chart is a good basic way to visualize this word
frequency data. In your R script, add the following code and run it to generate a bar chart, which will
display in the Plots section of RStudio.

# Plot the most frequent words


barplot(dtm_d[1:5,]$freq, las = 2, names.arg = dtm_d[1:5,]$word,
col ="lightgreen", main ="Top 5 most frequent words",
ylab = "Word frequencies")
Generate the Word Cloud
A word cloud is one of the most popular ways to visualize and analyze qualitative data. It’s an image
composed of keywords found within a body of text, where the size of each word indicates its
frequency in that body of text. Use the word frequency data frame (table) created previously to
generate the word cloud. In your R script, add the following code and run it to generate the word cloud
and display it in the Plots section of RStudio.

#generate word cloud


set.seed(1234)
wordcloud(words = dtm_d$word, freq = dtm_d$freq, min.freq = 5,
max.words=100, random.order=FALSE, rot.per=0.40,
colors=brewer.pal(8, "Dark2"))

# Find associations
findAssocs(TextDoc_dtm, terms = c("good","work","health"), corlimit = 0.25)

# Find associations for words that occur at least 50 times


findAssocs(TextDoc_dtm, terms = findFreqTerms(TextDoc_dtm, lowfreq = 50), corlimit = 0.25)
# regular sentiment score using get_sentiment() function and method of your choice
# please note that different methods may have different scales
syuzhet_vector <- get_sentiment(text, method="syuzhet")
# see the first row of the vector
head(syuzhet_vector)
# see summary statistics of the vector
summary(syuzhet_vector)
# bing
bing_vector <- get_sentiment(text, method="bing")
head(bing_vector)
summary(bing_vector)
# afinn
afinn_vector <- get_sentiment(text, method="afinn")
head(afinn_vector)
summary(afinn_vector)
#compare the first row of each vector using sign function
rbind(
sign(head(syuzhet_vector)),
sign(head(bing_vector)),
sign(head(afinn_vector))
)
# run nrc sentiment analysis to return data frame with each row classified as one of the following
# emotions, rather than a score:
# anger, anticipation, disgust, fear, joy, sadness, surprise, trust
# It also counts the number of positive and negative emotions found in each row
d<-get_nrc_sentiment(text)
# head(d,10) - to see top 10 lines of the get_nrc_sentiment dataframe
head (d,10)
#Plot two - count of words associated with each sentiment, expressed as a percentage
barplot(
sort(colSums(prop.table(d[, 1:8]))),
horiz = TRUE,
cex.names = 0.7,
las = 1,
main = "Emotions in Text", xlab="Percentage"
)
Conclusion:
Practical No. 8
Aim: Twitter Data Analysis with R
Software Required: R
Theory:
Step 1: Load the required packages (including rtweet) in RStudio

Step 2: Authenticate using your credentials to Twitter’s API by creating an access token. Steps on
getting Twitter access tokens:
https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html

Step 3: Search tweets on the topic of your choice; narrow the number of tweets as you see fit and
decide on whether or not to include retweets. I decided to include 100 tweets each for Canada and
Scotland, plus decided not to include retweets, so as to avoid duplicate tweets impacting the
evaluation.
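
A minimal sketch of step 3 with rtweet, assuming the access token from step 2 has already been created and stored:

library(rtweet)

# Step 3: 100 recent tweets per topic, excluding retweets to avoid
# duplicate tweets impacting the evaluation
canada_tweets   <- search_tweets("Canada",   n = 100, include_rts = FALSE)
scotland_tweets <- search_tweets("Scotland", n = 100, include_rts = FALSE)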

Step 4: Process each set of tweets into tidy text or corpus objects.
Step 5: Use pre-processing text transformations to clean up the tweets; this
includes stemming words. An example of stemming is reducing the words “computer”,
“computational” and “computation” to the root “comput”.
Additional pre-processing involves converting all words to lower-case, removing links to web
pages (http elements), and deleting punctuation as well as stop words. The tidytext package
contains a list of over 1,000 stop words in the English language that are not helpful in determining
the overall sentiment of a text body; these are words such as “I”, “myself”, “themselves”,
“being” and “have”. We are using the tidytext package with an anti-join to remove the stop words
from the tweets that were extracted in step 3.

Step 6: Find the top 10 commonly used words in the set of tweets for both countries; this will
give an overall picture of what the populations are most concerned about, and the extent to which
they are engaged on these topics.
The output below shows the top 10 words plotted for both Canada and Scotland.
As you can see, the output for Scotland returned many more than 10 words, since many of these
top words occurred the same number of times.
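
A minimal sketch of steps 4-6 for one country with tidytext, assuming the canada_tweets data frame from step 3:

library(dplyr)
library(tidytext)

# Tokenize the tweet text into one word per row, drop stop words
# with an anti-join, then count the word frequencies
canada_words <- canada_tweets %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)

# Top 10 words; ties can return more than 10 rows, as noted above
canada_words %>% top_n(10, n)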

Step 7: Perform sentiment analysis using the Bing lexicon and get_sentiments function from the
tidytext package. There are many libraries, dictionaries and packages available in R to evaluate
the emotion prevalent in a text. The tidytext and textdata packages have such word-to-emotion
evaluation repositories. Three of the general purpose lexicons are Bing, AFINN and nrc (from the
textdata package).
To take a look at what each package contains, you can run the following commands in R:
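
For example (assuming the tidytext package, plus textdata for the AFINN and nrc lexicons, is installed):

library(tidytext)
get_sentiments("bing")    # words labelled positive / negative
get_sentiments("afinn")   # words with integer scores from -5 to 5
get_sentiments("nrc")     # words mapped to emotions such as joy, anger, trust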

The get_sentiments function returns a tibble, so to take a look at what is included as “positive”
and “negative” sentiment, you will need to filter accordingly. Since I wanted a general glimpse, I
didn’t need to extract the entire dataset, however depending on your needs, you may want to do
so.
In contrast to Bing, the AFINN lexicon assigns a numeric score to each word in
its lexicon; further sentiment analysis then adds up the scores to determine the overall
expression. A score greater than zero indicates positive sentiment, while a score less than zero
means negative overall emotion. A calculated score of zero indicates neutral sentiment
(neither positive nor negative).

Conclusion:
Practical No. 9
Aim: Sentiment Analysis of WhatsApp data with R
Software Required: R
Theory:
This practical examines the sentiment of each author over the course of the year. It is
calculated using the AFINN lexicon, which assigns each word a score between -5 and 5, with
negative scores indicating negative sentiment and positive scores indicating positive sentiment.
From this, we’re able to see the sentiment of each author throughout the year. This correlates with
the words used most frequently by each author. From this, we can see that Blake, John and
Michael tend to be more negative than the rest of the group.

bullring_sentiment_afinn <- chat_clean %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(author) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "Group Chat Sentiment")

bullring_sentiment_afinn %>%
  ggplot(aes(author, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 3, scales = "free_y")

bing_and_nrc <- bind_rows(
  chat_clean %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  chat_clean %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
Wordcounts
These are the most popular positive and negative words used throughout the chat. This is
calculated using the Bing lexicon, which categorises words in a binary fashion into positive and
negative categories (side note: ‘Dick’ is the nickname for Richard). The results are used to create
the wordcloud that follows.

bing_word_counts <- chat_clean %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()

library(wordcloud)

chat_clean <- chat %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
chat_clean <- chat_clean %>%
  na.omit()
chat_clean <- chat_clean %>%
  filter(!word %in% to_remove)

chat_clean %>%
  count(word) %>%
  with(wordcloud(word, n, colors = c("#D55E00", "#009E73"), max.words = 100))
# Wordcloud

chat_clean <- chat %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
chat_clean <- chat_clean %>%
  na.omit()
chat_clean <- chat_clean %>%
  filter(!word %in% to_remove)

library(reshape2)
chat_clean %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#D55E00", "#009E73"),
                   max.words = 100)

Conclusion:
