Data Science Record
Index

Exp. No. | Date | Experiment Name | Page No. | Mark | Signature
1. Install, Configure and Run Hadoop and R
2. Implement word count / frequency programs using MapReduce
3. Implement an MR program that processes a Weather Dataset
4. Implement Linear and Logistic Regression
5a. Implement SVM Classification Techniques
5b. Implement Decision Tree Classification Techniques
6. Implement Clustering Techniques
7. Visualize data using any plotting framework
8. Implement an application that stores big data in HBase / MongoDB / Pig using Hadoop / R
Exp No: 1
Install, Configure and Run Hadoop and R
Date:

AIM:
To install, configure and run Hadoop and R.
PROCEDURE:
Hadoop Installation
Run the following commands in an Ubuntu terminal:
1. Install Java (OpenJDK 8) on Ubuntu.
a. sudo apt update
b. sudo apt install openjdk-8-jdk -y
2. Check that Java is installed
a. java -version
b. javac -version
3. Install SSH server
a. sudo apt install openssh-server openssh-client -y
4. Create a new user in Ubuntu
a. sudo adduser hdoop
b. sudo adduser hdoop sudo
c. su - hdoop
5. Generate an SSH key pair
a. ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
b. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
c. chmod 0600 ~/.ssh/authorized_keys
6. Verify SSH access to localhost
a. ssh localhost
7. Download Hadoop
a. wget https://downloads.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
b. tar xzf hadoop-3.2.3.tar.gz
8. Edit bashrc
a. sudo nano .bashrc
b. Add the following lines at the end of the file
i. export HADOOP_HOME=/home/hdoop/hadoop-3.2.3
ii. export HADOOP_INSTALL=$HADOOP_HOME
iii. export HADOOP_MAPRED_HOME=$HADOOP_HOME
iv. export HADOOP_COMMON_HOME=$HADOOP_HOME
v. export HADOOP_HDFS_HOME=$HADOOP_HOME
vi. export YARN_HOME=$HADOOP_HOME
vii. export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
viii. export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
ix. export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
c. source ~/.bashrc
9. Edit JAVA_HOME
a. sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
b. Add the following line at the end of the file
i. export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
10. Edit core-site
a. sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
b. Add the following lines
i. <property>
ii. <name>hadoop.tmp.dir</name>
iii. <value>/home/hdoop/tmpdata</value>
iv. <description>A base for other temporary directories.</description>
v. </property>
vi. <property>
vii. <name>fs.default.name</name>
viii. <value>hdfs://localhost:9000</value>
ix. <description>The name of the default file system.</description>
x. </property>
11. Edit hdfs-site
a. sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
b. Add the following lines
i. <property>
ii. <name>dfs.namenode.name.dir</name>
iii. <value>/home/hdoop/dfsdata/namenode</value>
iv. </property>
v. <property>
vi. <name>dfs.datanode.data.dir</name>
vii. <value>/home/hdoop/dfsdata/datanode</value>
viii. </property>
ix. <property>
x. <name>dfs.replication</name>
xi. <value>1</value>
xii. </property>
12. Edit mapred-site
a. sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
b. Add the following lines
i. <property>
ii. <name>mapreduce.framework.name</name>
iii. <value>yarn</value>
iv. </property>
13. Edit yarn-site
a. sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
b. Add the following lines
i. <property>
ii. <name>yarn.nodemanager.aux-services</name>
iii. <value>mapreduce_shuffle</value>
iv. </property>
v. <property>
vi. <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
vii. <value>org.apache.hadoop.mapred.ShuffleHandler</value>
viii. </property>
ix. <property>
x. <name>yarn.resourcemanager.hostname</name>
xi. <value>127.0.0.1</value>
xii. </property>
xiii. <property>
xiv. <name>yarn.acl.enable</name>
xv. <value>0</value>
xvi. </property>
xvii. <property>
xviii. <name>yarn.nodemanager.env-whitelist</name>
xix. <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
xx. </property>
14. Launch Hadoop
a. hdfs namenode -format
b. cd $HADOOP_HOME/sbin
c. start-all.sh
15. Open the Hadoop web interfaces in a browser
a. localhost:8088 (YARN Resource Manager)
b. localhost:9870 (NameNode web UI)
Install R on Windows
1. Download the R installer for Windows from https://cran.r-project.org.
2. Run the installer and accept the default options.
3. Launch RGui to verify that R starts correctly.
OUTPUT:
RESULT:
Thus, the installation and configuration of Hadoop and R have been completed successfully.
Exp No: 2
Implement word count / frequency programs using MapReduce
Date:
AIM:
To implement a word count / frequency program using MapReduce.
PROCEDURE:
Run the following commands in an Ubuntu terminal.
1. Create a directory on the Desktop named Lab, and inside it create two folders: one called
“Input” and the other called “tutorial_classes”.
a. cd Desktop
b. mkdir Lab
c. mkdir Lab/Input
d. mkdir Lab/tutorial_classes
2. Place the “WordCount.java” file attached with this document in the Lab directory.
3. Place the “input.txt” file attached with this document in the Lab/Input directory.
4. Type the following command to export the hadoop classpath into bash.
a. export HADOOP_CLASSPATH=$(hadoop classpath)
5. Make sure it is now exported.
a. echo $HADOOP_CLASSPATH
6. Now create the corresponding directories on HDFS (rather than locally) and upload the
input file. Type the following commands.
a. hadoop fs -mkdir /WordCountTutorial
b. hadoop fs -mkdir /WordCountTutorial/Input
c. hadoop fs -put Lab/Input/input.txt /WordCountTutorial/Input
7. Go to localhost:9870 in the browser, open “Utilities → Browse File System”, and you
should see the directories and files we placed in the file system.
8. Back on the local machine, compile WordCount.java (assuming we are currently in the
Desktop directory), then pack the compiled classes into one jar file (note the dot at the
end of the jar command).
a. cd Lab
b. javac -classpath $HADOOP_CLASSPATH -d tutorial_classes WordCount.java
c. jar -cvf WordCount.jar -C tutorial_classes .
9. Now, we run the jar file on Hadoop.
a. hadoop jar WordCount.jar WordCount /WordCountTutorial/Input /WordCountTutorial/Output
10. Output the result:
a. hadoop fs -cat /WordCountTutorial/Output/*
OUTPUT:
RESULT:
Thus, the implementation of word count using MapReduce has been executed successfully.
Exp No: 3
Implement an MR program that processes a Weather Dataset
Date:
AIM:
To implement a MapReduce program that processes a weather dataset.
PROCEDURE:
Type the following program, then build and run it using the steps below.
MyMaxMin.java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
public class MyMaxMin {

	// Mapper: reads one line of the weather record, extracts the date and the
	// maximum/minimum temperatures, and emits hot and cold days.
	public static class MaxTemperatureMapper extends
			Mapper<LongWritable, Text, Text, Text> {

		public static final int MISSING = 9999;

		@Override
		public void map(LongWritable arg0, Text Value, Context context)
				throws IOException, InterruptedException {
			String line = Value.toString();
			if (!(line.length() == 0)) {
				String date = line.substring(6, 14);
				float temp_Max = Float.parseFloat(line.substring(39, 45).trim());
				float temp_Min = Float.parseFloat(line.substring(47, 53).trim());
				if (temp_Max > 30.0) {
					// Hot day
					context.write(new Text("The Day is Hot Day :" + date),
							new Text(String.valueOf(temp_Max)));
				}
				if (temp_Min < 15) {
					// Cold day
					context.write(new Text("The Day is Cold Day :" + date),
							new Text(String.valueOf(temp_Min)));
				}
			}
		}
	}

	// Reducer: writes the temperature of each flagged day to the output.
	public static class MaxTemperatureReducer extends
			Reducer<Text, Text, Text, Text> {

		@Override
		public void reduce(Text Key, Iterable<Text> Values, Context context)
				throws IOException, InterruptedException {
			String temperature = Values.iterator().next().toString();
			context.write(Key, new Text(temperature));
		}
	}

	// Driver: configures the job and submits it to the cluster.
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf, "weather example");
		job.setJarByClass(MyMaxMin.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);
		job.setMapperClass(MaxTemperatureMapper.class);
		job.setReducerClass(MaxTemperatureReducer.class);
		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);
		Path outputPath = new Path(args[1]);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, outputPath);
		// Delete the output directory if it already exists
		outputPath.getFileSystem(conf).delete(outputPath, true);
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}
1. Open Eclipse and create a new Java project named MyProject, then add a class
MyMaxMin containing the code above.
2. Download the external Hadoop jar files required to build the project.
3. Now add these external jars to MyProject: right-click on MyProject, select Build Path ->
Configure Build Path, click Add External JARs…, add the jars from their download
location, then click Apply and Close.
4. Now export the project as a jar file: right-click on MyProject, choose Export…, go to
Java -> JAR file, click Next, choose the export destination, then click Next.
5. Choose the main class as MyMaxMin by clicking Browse, then click Finish -> OK.
6. Start Hadoop
a. start-all.sh
7. Move dataset to Hadoop HDFS
a. hdfs dfs -put /file_path /destination
b. hdfs dfs -put /home/hadoop/Downloads/CRND0103-2020-AK_Fairbanks_11_NE.txt /
c. hdfs dfs -ls /
8. Now run the jar file with the command below to produce the output in the MyOutput directory.
a. hadoop jar /home/hadoop/Documents/Project.jar /CRND0103-2020-AK_Fairbanks_11_NE.txt /MyOutput
9. Go to localhost:9870 in the browser.
OUTPUT:
RESULT:
Thus, the implementation of a MapReduce program that processes a weather dataset has been
executed successfully.
Exp No: 4
Implement Linear and Logistic Regression
Date:
AIM:
To implement linear and logistic regression in R.
PROCEDURE:
1. Open R on Windows.
2. Create a new workspace.
3. Create a new script file.
4. Type the code in the script file.
5. Run the script file.
6. Close R.
PROGRAM:
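The following is a minimal sketch of such a program, using the built-in mtcars dataset (an assumed choice of data): lm() fits the linear regression and glm() with a binomial family fits the logistic regression.

> # Linear regression: predict fuel efficiency (mpg) from car weight (wt)
> data(mtcars)
> linear_model <- lm(mpg ~ wt, data = mtcars)
> summary(linear_model)
> plot(mtcars$wt, mtcars$mpg)
> abline(linear_model, col = "red")
> # Logistic regression: predict transmission type (am, 0/1) from weight and horsepower
> logistic_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
> summary(logistic_model)
> # Predicted probabilities on the training data
> head(predict(logistic_model, type = "response"))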
OUTPUT:
RESULT:
Thus, the implementation of linear and logistic regression has been executed successfully.
Exp No: 5a
Implement SVM Classification Techniques
Date:
AIM:
To implement the SVM classification technique.
PROCEDURE:
1. Open R on Windows.
2. Create a new workspace.
3. Create a new script file.
4. Type the code in the script file.
5. Run the script file.
6. Close R.
PROGRAM:
> library(e1071)
> plot(iris)
> plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species)
> plot(iris$Petal.Length, iris$Petal.Width, col=iris$Species)
> s<-sample(150,100)
> col<- c("Petal.Length", "Petal.Width", "Species")
> iris_train<- iris[s,col]
> iris_test<- iris[-s,col]
> svmfit<- svm(Species ~., data = iris_train, kernel = "linear", cost = .1, scale = FALSE)
> print(svmfit)
> plot(svmfit, iris_train[,col])
> tuned <- tune(svm, Species~., data = iris_train, kernel = "linear", ranges = list(cost = c(0.001, 0.01, 0.1, 1, 10, 100)))
> summary(tuned)
> p<-predict(svmfit, iris_test[,col], type="class")
> plot(p)
> table(p,iris_test[,3] )
> mean(p== iris_test[,3])
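As an optional extension (not part of the recorded program), the best model found by tune() can be used for prediction instead of the hand-picked cost:

> best_model <- tuned$best.model
> p_best <- predict(best_model, iris_test[,col])
> mean(p_best == iris_test[,3])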
OUTPUT:
RESULT:
Thus, the implementation of the SVM classification technique has been executed successfully.
Exp No: 5b
Implement Decision Tree Classification Techniques
Date:
AIM:
To implement the decision tree classification technique.
PROCEDURE:
1. Open R on Windows.
2. Create a new workspace.
3. Create a new script file.
4. Type the code in the script file.
5. Run the script file.
6. Close R.
PROGRAM:
> library(MASS)
> library(rpart)
> head(birthwt)
> hist(birthwt$bwt)
> table(birthwt$low)
> cols <- c('low', 'race', 'smoke', 'ht', 'ui')
> birthwt[cols] <- lapply(birthwt[cols], as.factor)
> set.seed(1)
> train<- sample(1:nrow(birthwt), 0.75 * nrow(birthwt))
> birthwtTree<- rpart(low ~ . - bwt, data = birthwt[train, ], method = 'class')
> plot(birthwtTree)
> text(birthwtTree, pretty = 0)
> summary(birthwtTree)
> birthwtPred<- predict(birthwtTree, birthwt[-train, ], type = 'class')
> table(birthwtPred, birthwt[-train, ]$low)
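As an optional extension (not part of the recorded program), the classification accuracy on the held-out rows follows directly from the same predictions:

> mean(birthwtPred == birthwt[-train, ]$low)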
OUTPUT:
RESULT:
Thus, the implementation of the decision tree classification technique has been executed
successfully.
Exp No: 6
Implement Clustering Techniques
Date:
AIM:
To implement clustering techniques.
PROCEDURE:
1. Open R on Windows.
2. Create a new workspace.
3. Create a new script file.
4. Type the code in the script file.
5. Run the script file.
6. Close R.
PROGRAM:
> library(datasets)
> head(iris)
> library(ggplot2)
> ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
> set.seed(20)
> irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
> irisCluster
> table(irisCluster$cluster, iris$Species)
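As an optional extension (not part of the recorded program), the fitted cluster assignments can be plotted the same way as the species plot above for a visual comparison:

> irisCluster$cluster <- as.factor(irisCluster$cluster)
> ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()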
OUTPUT:
RESULT:
Thus, the implementation of clustering techniques has been executed successfully.
Exp No: 7
Visualize data using any plotting framework
Date:
AIM:
To visualize data using any plotting framework in R.
PROCEDURE:
1. Open R on Windows.
2. Create a new workspace.
3. Create a new script file.
4. Type the code in the script file.
5. Run the script file.
6. Close R.
PROGRAM:
1. Histogram
> library(RColorBrewer)
> data(VADeaths)
> par(mfrow=c(2,3))
> hist(VADeaths,breaks=10, col=brewer.pal(3,"Set3"),main="Set3 3 colors")
> hist(VADeaths,breaks=3 ,col=brewer.pal(3,"Set2"),main="Set2 3 colors")
> hist(VADeaths,breaks=7, col=brewer.pal(3,"Set1"),main="Set1 3 colors")
> hist(VADeaths, breaks = 2, col=brewer.pal(8,"Set3"), main="Set3 8 colors")
> hist(VADeaths,col=brewer.pal(8,"Greys"),main="Greys 8 colors")
> hist(VADeaths,col=brewer.pal(8,"Greens"),main="Greens 8 colors")\
2. Line Chart
> data(AirPassengers)
> plot(AirPassengers,type="l")
3. Bar Chart
> data("iris")
> barplot(iris$Petal.Length)
> barplot(iris$Sepal.Length,col = brewer.pal(3,"Set1"))
> barplot(table(iris$Species,iris$Sepal.Length),col = brewer.pal(3,"Set1"))
4. Box Plot
> data(iris)
> par(mfrow=c(2,2))
> boxplot(iris$Sepal.Length,col="red")
> boxplot(iris$Sepal.Length~iris$Species,col="red")
> boxplot(iris$Sepal.Length~iris$Species,col=heat.colors(3))
> boxplot(iris$Sepal.Length~iris$Species,col=topo.colors(3))
> boxplot(iris$Petal.Length~iris$Species)
5. Scatter Plot
> plot(x=iris$Petal.Length)
> plot(x=iris$Petal.Length,y=iris$Species)
6. Heat Map
> x <- rnorm(10,mean=rep(1:5,each=2),sd=0.7)
> y <- rnorm(10,mean=rep(c(1,9),each=5),sd=0.1)
> dataFrame<- data.frame(x=x,y=y)
> set.seed(143)
> dataMatrix<-as.matrix(dataFrame)[sample(1:10),]
> heatmap(dataMatrix)
7. Correlogram
> library("corrplot")
> data("mtcars")
> corr_matrix <- cor(mtcars)
> corrplot(corr_matrix)
> corrplot(corr_matrix,method = 'number',type = "lower")
8. Area Chart
> library(dplyr)
> library(ggplot2)
> airquality %>%
+   group_by(Day) %>%
+   summarise(mean_wind = mean(Wind)) %>%
+   ggplot() +
+   geom_area(aes(x = Day, y = mean_wind)) +
+   labs(title = "Area Chart of Average Wind per Day",
+        subtitle = "using airquality data",
+        y = "Mean Wind")
OUTPUT:
RESULT:
Thus, the visualization of data using a plotting framework has been executed successfully.
Exp No: 8
Implement an application that stores big data in HBase / MongoDB / Pig using Hadoop / R
Date:
AIM:
To implement an application that stores big data in MongoDB using R.
PROCEDURE:
1. Open R on Windows.
2. Create a new workspace.
3. Create a new script file.
4. Type the code in the script file.
5. Run the script file.
6. Close R.
PROGRAM:
> library(ggplot2)
> library(mongolite)
> library(dplyr)
> crimes=data.table::fread("crimes.csv")
> connection_string="mongodb://localhost:27017/?tls=false&readPreference=primary"
> my_collection = mongo(collection = "crimes", db = "chicago",url=connection_string)
> my_collection$insert(crimes)
> my_collection$count()
> my_collection$iterate()$one()
> df <- as.data.frame(my_collection$find())
> head(df)
> length(my_collection$distinct("Primary Type"))
> my_collection$aggregate('[{"$group":{"_id":"$Location Description", "Count":{"$sum":1}}}]') %>%
+   na.omit() %>%
+   arrange(desc(Count)) %>% head(10) %>%
+   ggplot(aes(x = reorder(`_id`, Count), y = Count)) +
+   geom_bar(stat = "identity", color = 'skyblue', fill = '#b35900') +
+   geom_text(aes(label = Count), color = "blue") +
+   coord_flip() + xlab("Location Description")
> crimes=my_collection$find('{}', fields = '{"_id":0, "Primary Type":1,"Year":1}')
> crimes %>% group_by(`Primary Type`) %>% summarize(Count = n()) %>% arrange(desc(Count)) %>% head(4)
OUTPUT:
RESULT:
Thus, the implementation of an application that stores big data in MongoDB using R has been
executed successfully.