BDA Lab Manual
Practical Record
Semester VI
CERTIFICATE
This is to certify that the student of Third Year, Semester VI, Department of Computer Science &
Engineering has satisfactorily completed the practical work of the subject Emerging Technology
Lab-II (BDA) prescribed by Sant Gadge Baba Amravati University, Amravati,
during the academic term 2021-22.
Date:
Vision and Mission of the Institute and Programme
VISION
To become a pace-setting centre of excellence believing in three
universal values namely Synergy, Trust and Passion, with zeal to serve the Nation in the
global scenario.
MISSION
To dedicate ourselves to the highest standard of technical education
& research in core & emerging engineering disciplines
and strive for the overall personality development of students
so as to nurture not only quintessential technocrats
but also responsible citizens.
VISION
To ensure that the world saves time and other depletable resources and free it from
complexity by providing efficient computing services.
MISSION
To dedicate ourselves to the highest standard by providing knowledge, skills and wisdom
to the incumbents by imparting value-based education, to enable them to solve complex
systems with simple algorithms, and to make them innovative and research-oriented to serve the
global society, while imbibing the highest ethical values.
2. Program educational objective (PEO’s), program outcomes (PO’s) and
Program Specific Outcomes (PSO’s)
● Program educational objective (PEO’s)
PEO1. Preparation: To prepare students for successful careers in the software industry that meet the
needs of Indian and multinational companies, or to excel in higher studies.
PEO2. Core competence: To develop the ability among students to synthesize data and technical
concepts for software design and development.
PEO3. Breadth: To inculcate in students professional and ethical attitude, effective communication
skills, teamwork skills, multidisciplinary approach and an ability to relate engineering issues to
broader social context.
PEO4. Professionalism: To provide students with a sound foundation in the mathematical, scientific
and computer engineering fundamentals required to solve engineering problems and also pursue
higher studies.
PEO5. Learning Environment: To provide students with an academic environment that promotes
excellence, leadership, written ethical codes and guidelines, and the life-long learning needed for a
successful professional career.
Course Outcomes:
K607.1 Work with big data tools and their analysis techniques.
K607.2 Analyze data by utilizing clustering and classification algorithms.
K607.3 Learn and apply different algorithms and recommendation systems for large volumes of data.
K607.4 Perform analytics on data streams.
K607.5 Learn NoSQL databases and their management.
a. Attendance (05 Marks): These 05 marks will be given for the regularity of a student. If
the student is present on the scheduled day, he / she will be awarded 05 marks.
b. Competency (05 Marks): Here the basic aim is to check the competency of the student: the
skill to interpret the aim of the practical and provide a solution for it. The student is expected to
improve upon the existing solution with the given modification. The marks will be given
according to the level of improvement made in the practical.
c. Performance (05 Marks): Here the basic aim is to check whether the student has
developed the skill to write down the code on his own, and compile / debug. If a student
correctly executes the practical on the scheduled day within the allocated time, he / she will be
awarded full marks. Otherwise, marks will be given according to the level of completion /
execution of practical.
d. Viva-Voce (05 Marks): These 05 marks will be totally based on the performance in
viva-voce. There shall be viva-voce on the completion of each practical in the same practical
session. The student shall get the record checked before next practical.
LIST OF PRACTICALS
Sr. No.  Aim  Page No.  Signature
1. Installation of Hadoop & R
2. Study of R: Declaring Variable, Expression, Function and Executing R script.
3. Creating List in R – merging two lists, adding matrices in lists, adding vectors in list.
4. Manipulating & Processing Data in R – merging …

Practical No. 1
Aim: Installation of Hadoop & R
Step 1
Download java (JDK <latest version> - X64.tar.gz) by visiting the following
link www.oracle.com
Then jdk-7u71-linux-x64.tar.gz will be downloaded into your system.
Step 2
Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract
the jdk-7u71-linux-x64.tar.gz file using the following commands.
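On a typical Linux system the extraction looks like this (the exact path and the resulting directory name depend on where the archive was downloaded and on the JDK version):
cd ~/Downloads
tar -xzf jdk-7u71-linux-x64.tar.gz
ls    # a jdk1.7.0_71 directory should now be present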
• Step 1: Download the Java 8 Package.
• Step 2: Extract the Java Tar File.
• Step 3: Download the Hadoop 2.7.3 Package.
• Step 4: Extract the Hadoop tar File.
• Step 5: Add the Hadoop and Java paths in the bash file
• Step 6: Edit the Hadoop Configuration files.
• Step 7: Open core-site.xml and edit the configuration properties.
3. Viva Questions
1. What are the different data structures in R? ...
2. What is class()?
3. How do you install a package in R?
4. List some Advantages and disadvantages of R Language.
Conclusion:
Practical No.2
Aim: Study of R: Declaring Variable, Expression, Function and Executing R script
Software Required: R
Theory:
R Language:
● R is a programming language.
● R is often used for statistical computing and graphical presentation to analyze and visualize data.
● Examples of expressions:
● print("hello")
● plot(1:10)
● 5+5
Why Use R
● It is a great resource for data analysis, data visualization, data science and machine learning
● It provides many statistical techniques (such as statistical tests, classification, clustering and data
reduction)
● It is easy to draw graphs in R, like pie charts, histograms, box plots, scatter plots, etc.
● It works on different platforms (Windows, Mac, Linux)
● It is open-source and free
● It has many packages (libraries of functions) that can be used to solve different problems
R Variables
● Creating Variables in R
● Variables are containers for storing data values.
● R does not have a command for declaring a variable. A variable is created the moment you first
assign a value to it. To assign a value to a variable, use the <- sign. To output (or print) the
variable value, just type the variable name.
● Compared to many other programming languages, you do not have to use a function to
print/output variables in R. You can just type the name of the variable
● Example
● name <- "John"
age <- 40
name          # output "John"
age           # output 40
print(name)   # print() can also be used to output a variable
Multiple Variable
● R allows you to assign the same value to multiple variables in one line
● Example
● # Assign the same value to multiple variables in one line
var1 <- var2 <- var3 <- "Orange"
Variable Names
● A variable can have a short name (like x and y) or a more descriptive name (age, carname,
total_volume).
● Rules for R variables are:
● A variable name must start with a letter and can be a combination of letters, digits,
period(.)
and underscore(_). If it starts with period(.), it cannot be followed by a digit.
● A variable name cannot start with a number or underscore (_)
● Variable names are case-sensitive (age, Age and AGE are three different variables)
● Reserved words cannot be used as variables (TRUE, FALSE, NULL, if...)
# Legal variable names:
myvar <- "John"
my_var <- "John"
myVar <- "John"
MYVAR <- "John"
myvar2 <- "John"
.myvar <- "John"
print(my_function(3))
print(my_function(5))
print(my_function(9))
Conclusion:
Practical No. 3
Aim: Creating List in R – merging two lists, adding matrices in lists, adding vectors in list.
Software Required: R
Theory:
List
● A list in R can contain many different data types inside it. A list is a collection of data
which is ordered and changeable.
● To create a list, use the list() function:
● Example
● # List of strings
thislist <- list("apple", "banana", "cherry")
thislist[1]
Add List Items
● To add an item to the end of the list, use the append() function:
● Example
● Add "orange" to the list:
● thislist <- list("apple", "banana", "cherry")
append(thislist, "orange")
● To add an item to the right of a specified index, add "after=index number" in
the append() function:
● Example
● Add "orange" to the list after "banana" (index 2):
● thislist <- list("apple", "banana", "cherry")
append(thislist, "orange", after = 2)
Join Two Lists
● There are several ways to join, or concatenate, two or more lists in R.
● The most common way is to use the c() function, which combines two elements together:
● Example
● list1 <- list("a", "b", "c")
list2 <- list(1,2,3)
list3 <- c(list1,list2)
list3
Vectors
● A vector is simply a list of items that are of the same type.
● To combine the list of items to a vector, use the c() function and separate the items by a
comma.
● In the example below, we create a vector variable called fruits that combines strings:
● Example 1
● # Vector of strings
fruits <- c("banana", "apple", "orange")
# Print fruits
fruits
● Example 2
● # Vector of numerical values
numbers <- c(1, 2, 3)
# Print numbers
numbers
● Example 3
● # Vector with numerical values in a sequence
numbers <- 1:10
numbers
Vector Length
● To find out how many items a vector has, use the length() function:
● Example
fruits <- c("banana", "apple", "orange")
length(fruits)
Access Vectors
● You can access the vector items by referring to its index number inside brackets [].
● The first item has index 1, the second item has index 2, and so on:
● Example
● fruits <- c("banana", "apple", "orange")
# Access the first item (banana)
fruits[1]
● You can also access multiple elements by referring to different index positions with
the c() function:
● Example
● fruits <- c("banana", "apple", "orange", "mango", "lemon")
thismatrix
Access Matrix Items
● You can access the items by using [ ] brackets. The first number "1" in the bracket
specifies the row-position, while the second number "2" specifies the column-position:
● Example
● thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
thismatrix[1, 2]
● The whole row can be accessed if you specify a comma after the number in the bracket:
● Example
● thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
thismatrix[2,]
● The whole column can be accessed if you specify a comma before the number in the
bracket:
● Example
● thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
thismatrix[,2]
Access More Than One Row
● More than one row can be accessed if you use the c() function:
● Example
● thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"),
                       nrow = 3, ncol = 3)
thismatrix[c(1,2),]
Access More Than One Column
● More than one column can be accessed if you use the c() function:
● Example
● thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"),
                       nrow = 3, ncol = 3)
thismatrix[, c(1,2)]
Data Frames
● A data frame is data displayed in a table format. A data frame can contain different types of data, but each column must contain values of the same data type.
● To create a data frame, use the data.frame() function:
● Example
● Data_Frame <- data.frame (
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)
# Print and summarize the data frame
Data_Frame
summary(Data_Frame)
Access Items
● We can use single brackets [ ], double brackets [[ ]] or $ to access columns from a data
frame:
● Example
● Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Data_Frame[1]
Data_Frame[["Training"]]
Data_Frame$Training
Add Rows
● Use the rbind() function to add new rows in a Data Frame:
● Example
● Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
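A minimal sketch of then appending a new row with rbind() (the row values here are chosen only for illustration):
# Add a new row to Data_Frame
New_row_DF <- rbind(Data_Frame, c("Strength", 110, 110))
New_row_DF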
Plot
● The plot() function is used to draw points (markers) in a diagram.
● The function takes parameters for specifying points in the diagram.
● Parameter 1 specifies points on the x-axis.
● Parameter 2 specifies points on the y-axis.
● At its simplest, you can use the plot() function to plot two numbers against each other:
● Example
● Draw one point in the diagram, at x-position 1 and y-position 3:
● plot(1, 3)
● Example
● Draw two points in the diagram, one at position (1, 3) and one in position (8, 10):
● plot(c(1, 8), c(3, 10))
Multiple Points:
● You can plot as many points as you like; just make sure you have the same number of
points on both axes:
● Example
● plot(c(1, 2, 3, 4, 5), c(3, 7, 8, 9, 12))
Sequences of Points
● If you want to draw dots in a sequence, on both the x-axis and the y-axis, use
the : operator:
● Example
● plot(1:10)
Draw a Line
● The plot() function also takes a type parameter with the value l to draw a line to connect
all the points in the diagram:
● Example
● plot(1:10, type="l")
Plot Labels
● The plot() function also accepts other parameters, such as main, xlab and ylab, if you want
to customize the graph with a main title and different labels for the x-axis and y-axis:
● Example
● plot(1:10, main="My Graph",
       xlab="The x-axis",
       ylab="The y axis")
Graph Appearance
● There are many other parameters you can use to change the appearance of the points.
● Colors
● Use col="color" to add a color to the points:
● Example
● plot(1:10, col="red")
Point Shape
● Use pch with a value from 0 to 25 to change the point shape format:
● Example
● plot(1:10, pch=25, cex=2)
● The values of the pch parameter range from 0 to 25, which means that we can choose from
26 different types of point shapes:
Conclusion:
Practical No. 5
Aim: Implementing K-Mean Clustering Algorithm.
Software Required: R
Theory:
● Given a set of n distinct objects, the k-means clustering algorithm partitions the objects
into k clusters.
● In this algorithm, the user has to specify k, the number of clusters.
● Given a collection of objects each with n measurable attributes, k-means is an analytical
technique that, for a chosen value of k, identifies k clusters of objects based on the
objects’ proximity to the center of the k groups.
● The center is determined as the arithmetic average (mean) of each cluster’s n-dimensional
vector of attributes.
The k-means algorithm to find k clusters
Step 1: Choose the value of k and the k initial guesses for the centroids.
● In this example, k = 3, and the initial centroids are indicated by the points shaded
in red, green, and blue
Step 2: Compute the distance from each data point (Xi, Yi) to each centroid. Assign each point
to the closest centroid. This association defines the first k clusters.
● In two dimensions, the distance d between any two points (x1, y1) and (x2, y2) in the
Cartesian plane is typically expressed using the Euclidean distance measure:
d = sqrt((x1 − x2)^2 + (y1 − y2)^2)
Step 3: Compute the centroid, the center of mass, of each newly defined cluster from Step 2.
● In two dimensions, the centroid (xc, yc) of the m points in a k-means cluster is
calculated as xc = (1/m) Σ xi and yc = (1/m) Σ yi, where the sums run over the m points assigned to that cluster.
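A minimal R sketch of the algorithm described above, using the base kmeans() function on the built-in iris data set (the data set, the two columns and k = 3 are chosen only for illustration):
# k-means on two attributes of the iris data, with k = 3
data(iris)
set.seed(42)                       # k-means starts from random initial centroids
km <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3)
km$centers                         # the final cluster centroids
table(km$cluster, iris$Species)    # compare the clusters with the known species
plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster,
     main = "k-means clustering with k = 3")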
Conclusion:
Practical No. 6
Aim: Implementing apriori Algorithm.
Software Required: R
Theory:
● Apriori is one of the earliest and the most fundamental algorithms for generating
association rules.
● One major component of Apriori is support. Given an itemset L, the support of L is the
percentage of transactions that contain L.
● For example, if 80% of all transactions contain itemset {bread}, then the support of
{bread} is 0.8. Similarly, if 60% of all transactions contain itemset {bread,butter}, then
the support of {bread,butter} is 0.6.
● A frequent itemset has items that appear together often enough. The term “often enough”
is formally defined with a minimum support criterion.
● If the minimum support is set at 0.5, any itemset can be considered a frequent itemset if
at least 50% of the transactions contain this itemset.
● In other words, the support of a frequent itemset should be greater than or equal to the
minimum support.
● For the previous example, both {bread} and {bread,butter} are considered frequent
itemsets at the minimum support 0.5. If the minimum support is 0.7, only {bread} is
considered a frequent itemset.
● If an itemset is considered frequent, then any subset of the frequent itemset must also be
frequent.
● This is referred to as the Apriori property (or downward closure property).
● For example, if 60% of the transactions contain {bread,jam}, then at least 60% of all the
transactions will contain {bread} or {jam}.
● In other words, when the support of {bread,jam} is 0.6, the support of {bread} or {jam}
is at least 0.6.
● If itemset {B,C,D} is frequent, then all the subsets of this itemset must also be
frequent itemsets.
Apriori Algorithm:
● Let Ck be the set of candidate k-itemsets and Lk be the set of k-itemsets that satisfy
the minimum support.
● Given a transaction database D,
● a minimum support threshold δ, and
● an optional parameter N indicating the maximum length an itemset could reach, Apriori
iteratively computes the frequent itemsets Lk+1 based on Lk.
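A minimal R sketch of generating association rules with the Apriori algorithm, using the arules package and its built-in Groceries transactions (the package, data set and thresholds are assumptions chosen only for illustration):
# install.packages("arules")   # if not already installed
library(arules)
data("Groceries")
# minimum support 0.01, minimum confidence 0.5, rules with at least 2 items
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5, minlen = 2))
inspect(head(sort(rules, by = "lift"), 5))   # show the 5 rules with the highest lift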
Program:
Output:
Viva Question:
1. What is the use of apriori Algorithm?
2. List applications of apriori algorithm.
3. What is an association rule?
Conclusion:
Practical No.7
Aim: Text Analysis using R: analyzing minimum two different data sets
Software Required: R
Theory:
Example 1:
1. Loading the Data set
There are some data sets that are already pre-installed in R. Here, we shall be using the
Titanic data set, which comes built into R in the Titanic package. While using any external data
source, we can use the read commands to load files (Excel, CSV, HTML, text files, etc.).
This data set is also available on Kaggle; you may download both the train and test files.
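Assuming the downloaded file is named titanic.csv and is in the working directory, it can be loaded as follows:
titanic <- read.csv("titanic.csv", header = TRUE, sep = ",")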
The above code reads the file titanic.csv into a data frame named titanic. With header = TRUE we
are specifying that the data includes a header (column names), and sep = "," specifies that the values
in the data are comma separated.
View(titanic)
head(titanic,n) | tail(titanic,n)
In order to have a quick look at the data, we often use the head()/tail().
This helps us in checking out all the variables in the data set.
summary(titanic)
There are times when some of the variables in the data set are factors but might get
interpreted as numeric. For example, Pclass (Passenger Class) takes the values 1, 2 and 3;
however, we know that these are not to be considered as numeric, as these are just levels. In order
to have such variables treated as factors and not as numbers, we need to explicitly convert them
to factors using the function as.factor().
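A one-line sketch of this conversion for the Pclass column mentioned above:
titanic$Pclass <- as.factor(titanic$Pclass)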
Data Visualisation is an art of turning data into insights that can be easily interpreted.
On the X-axis we have the survived variable, 0 representing the passengers that did not survive,
and 1 representing the passengers who survived. The Y -axis represents the number of passengers.
Here we see that over 550 passengers did not survive and ~340 passengers survived.
Only 38.38% of the passengers who boarded the Titanic survived.
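A minimal sketch of the kind of plot described above, assuming the Survived column is coded 0/1 in the titanic data frame:
# Bar chart of survival counts
barplot(table(titanic$Survived),
        names.arg = c("Did not survive", "Survived"),
        xlab = "Survived", ylab = "Number of passengers")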
It is believed that in rescue operations during disasters, women's safety is prioritised. Did
the same happen back then?
We see that the survival rate amongst the women was significantly higher when compared to men.
The survival ratio amongst women was around 75%, whereas for men it was less than 20%.
1st and 2nd Class passengers disproportionately survived, with over 60% survival rate of the 1st
class passengers, around 45–50% of 2nd class, and less than 25% survival rate of those travelling
in 3rd class.
I'll leave you with the thought… Was it because of preferential treatment for the passengers
travelling in the elite classes, or the proximity, as the 3rd class compartments were in the lower deck?
Survival rate by class of ticket and gender (Pclass)
We see that the females in the 1st and 2nd class had a very high survival rate. The survival
rate for the females travelling in 1st and 2nd class was 96% and 92% respectively,
corresponding to 37% and 16% for men. The survival rate for men travelling 3rd class was
less than 15%.
Till now it is evident that the Gender and Passenger class had significant impact on the survival
rates. Let’s now check the impact of passenger’s Age on Survival Rate.
Survival rate by age
Looking at the age < 10 years section of the graph, we see that the survival rate is high, and that
the survival rate drops beyond the age of 45.
Example 2
read.csv() is used for reading comma-separated value (csv) files, where a comma "," is used
as the field separator.
read.delim() is used for reading tab-separated values (.txt) files
# Read the text file from local machine , choose file interactively
text <- readLines(file.choose())
# Load the data as a corpus
TextDoc <- Corpus(VectorSource(text))
Cleaning up Text Data
Cleaning the text data starts with making transformations like removing special characters from the
text. This is done using the tm_map() function to replace special characters like /, @ and | with a
space. The next step is to remove the unnecessary whitespace and convert the text to lower case.
Then remove the stopwords. They are the most commonly occurring words in a language and have
very little value in terms of gaining useful information. They should be removed before performing
further analysis. Examples of stopwords in English are “the, is, at, on”. There is no single universal list
of stop words used by all NLP tools. The stopwords() helper used with tm_map() supports several languages
like English, French, German, Italian, and Spanish. Please note that the language names are case
sensitive. I will also demonstrate how to add your own list of stopwords, which is useful in this Team
Health example for removing non-default stop words like "team", "company" and "health". Next, remove
numbers and punctuation.
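A sketch of the cleaning steps just described, assuming the tm package is loaded and TextDoc is the corpus created above:
# Replace special characters with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
TextDoc <- tm_map(TextDoc, toSpace, "/")
TextDoc <- tm_map(TextDoc, toSpace, "@")
TextDoc <- tm_map(TextDoc, toSpace, "\\|")
# Lower case, then remove numbers, default and custom stopwords, punctuation and extra whitespace
TextDoc <- tm_map(TextDoc, content_transformer(tolower))
TextDoc <- tm_map(TextDoc, removeNumbers)
TextDoc <- tm_map(TextDoc, removeWords, stopwords("english"))
TextDoc <- tm_map(TextDoc, removeWords, c("team", "company", "health"))
TextDoc <- tm_map(TextDoc, removePunctuation)
TextDoc <- tm_map(TextDoc, stripWhitespace)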
Plotting the top 5 most frequent words using a bar chart is a good basic way to visualize this word
frequency data. In your R script, add the following code and run it to generate a bar chart, which will
display in the Plots section of RStudio.
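A sketch of such code; it assumes TextDoc is the cleaned corpus, and the term-document matrix TextDoc_dtm built here is the same object used with findAssocs() below:
# Build a term-document matrix and sort words by frequency
TextDoc_dtm <- TermDocumentMatrix(TextDoc)
dtm_m <- as.matrix(TextDoc_dtm)
dtm_v <- sort(rowSums(dtm_m), decreasing = TRUE)
dtm_d <- data.frame(word = names(dtm_v), freq = dtm_v)
# Bar chart of the 5 most frequent words
barplot(dtm_d[1:5, ]$freq, names.arg = dtm_d[1:5, ]$word,
        main = "Top 5 most frequent words", ylab = "Word frequencies")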
# Find associations
findAssocs(TextDoc_dtm, terms = c("good","work","health"), corlimit = 0.25)
Step 2: Authenticate using your credentials to Twitter’s API by creating an access token. Steps on
getting Twitter access tokens:
https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html
Step 3: Search tweets on the topic of your choice; narrow the number of tweets as you see fit and
decide on whether or not to include retweets. I decided to include 100 tweets each for Canada and
Scotland, plus decided not to include retweets, so as to avoid duplicate tweets impacting the
evaluation.
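A sketch of this step with the rtweet package; the search topic, tweet counts and object names below are illustrative only:
library(rtweet)
# 100 tweets per country, excluding retweets to avoid duplicates
canada_tweets   <- search_tweets("climate change Canada",   n = 100, include_rts = FALSE)
scotland_tweets <- search_tweets("climate change Scotland", n = 100, include_rts = FALSE)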
Step 4: Process each set of tweets into tidy text or corpus objects.
Step 5: Use pre-processing text transformations to clean up the tweets; this
includes stemming words. An example of stemming is rolling the words “computer”,
“computational” and “computation” to the root “comput”.
Additional pre-processing involves converting all words to lower-case, removing links to web
pages (http elements), and deleting punctuation as well as stop words. The tidytext package
contains a list of over 1,000 stop words in the English language that are not helpful in determining
the overall sentiment of a text body; these are words such as “I”, “ myself”, “ themselves”,
“being” and “have”. We are using the tidytext package with an anti-join to remove the stop words
from the tweets that were extracted in step 3.
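A sketch of this pre-processing with the tidytext, dplyr and SnowballC packages, applied to the illustrative canada_tweets object from Step 3 (the object and column names are assumptions):
library(dplyr)
library(tidytext)
library(SnowballC)
tidy_canada <- canada_tweets %>%
  select(text) %>%
  unnest_tokens(word, text) %>%            # one lower-cased word per row, punctuation dropped
  anti_join(stop_words, by = "word") %>%   # remove tidytext's stop word list
  mutate(word = wordStem(word))            # stem words, e.g. "computation" -> "comput"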
Step 6: Find the top 10 commonly used words in the set of tweets for both countries; this will
give an overall picture of what the populations are most concerned about, and the extent to which
they are engaged on these topics.
The output below shows the top 10 words plotted for both Canada and Scotland.
As you can see, the output for Scotland returned many more than 10 words, since many of these
top words occurred the same number of times.
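A sketch of the kind of code behind such a plot, using the illustrative tidy_canada data from Step 5:
library(ggplot2)
tidy_canada %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 10) %>%                 # keep the 10 most frequent words (ties are kept too)
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Number of occurrences", title = "Top 10 words")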
Step 7: Perform sentiment analysis using the Bing lexicon and get_sentiments function from the
tidytext package. There are many libraries, dictionaries and packages available in R to evaluate
the emotion prevalent in a text. The tidytext and textdata packages have such word-to-emotion
evaluation repositories. Three of the general purpose lexicons are Bing, AFINN and nrc (from the
textdata package).
To take a look at what each package contains, you can run the following commands in R:
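For example (assuming the tidytext and textdata packages are installed):
library(tidytext)
get_sentiments("bing")     # positive / negative labels
get_sentiments("afinn")    # numeric scores from -5 to 5
get_sentiments("nrc")      # emotion categories such as joy, trust, anger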
The get_sentiments function returns a tibble, so to take a look at what is included as “positive”
and “negative” sentiment, you will need to filter accordingly. Since I wanted a general glimpse, I
didn’t need to extract the entire dataset, however depending on your needs, you may want to do
so.
In contrast to Bing, the AFINN lexicon assigns a “positive” or “negative” score to each word in
its lexicon; further sentiment analysis will then add up the emotion score to determine overall
expression. A score greater than zero indicates positive sentiment, while a score less than zero
would mean negative overall emotion. A calculated score of zero indicates neutral sentiment
(neither positive nor negative).
Conclusion:
Practical No. 9
Aim: Sentiment Analysis of WhatsApp data with R
Software Required: R
Theory:
This practical examines the sentiment of each author over the course of the year. This is calculated using the AFINN
lexicon, which assigns each word a score between -5 and 5, with negative scores
indicating negative sentiment and positive scores indicating positive sentiment.
From this, we’re able to see the sentiment of each author throughout the year. This correlates with
the words used most frequently by each author. From this, we can see that Blake, John and
Michael tend to be more negative than the rest of the group.
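A sketch of the per-author AFINN scoring described above; it assumes chat_clean is a tidy data frame with one word per row and author and day columns, prepared earlier from the exported chat (these names are assumptions for illustration):
library(dplyr)
library(tidytext)
# Attach the -5..5 AFINN score of every matching word and total it per author per day
author_sentiment <- chat_clean %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(author, day) %>%
  summarise(sentiment = sum(value), .groups = "drop")
author_sentiment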
library(ggplot2)   # dplyr and tidytext are loaded above

# Count how often each word contributes to positive / negative sentiment (Bing lexicon)
bing_word_counts <- chat_clean %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

# Plot the 10 words that contribute most to each sentiment
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
Conclusion: