Data Science Record


LIST OF EXPERIMENTS

1. Install, configure and run Hadoop/HDFS/Pig and R

2. Implement word count / frequency programs using MapReduce

3. Implement an MR program that processes a weather dataset

4. Implement Linear and logistic Regression

5. Implement SVM / Decision tree classification techniques

6. Implement clustering techniques

7. Visualize data using any plotting framework

8. Implement an application that stores big data in Hbase / MongoDB / Pig using Hadoop / R.
Index

Exp. No. | Date | Experiment Name | Page No. | Mark | Signature
1. |  | Install, configure and run Hadoop and R |  |  |
2. |  | Implement word count / frequency programs using MapReduce |  |  |
3. |  | Implement an MR program that processes a Weather Dataset |  |  |
4. |  | Implement Linear and Logistic Regression |  |  |
5. |  | Implement SVM / Decision Tree Classification Techniques |  |  |
6. |  | Implement Clustering Techniques |  |  |
7. |  | Visualize data using any plotting framework |  |  |
8. |  | Implement an application that stores big data in Hbase / MongoDB / Pig using Hadoop / R |  |  |
Exp No: 1
Install, configure and run Hadoop and R
Date:

AIM:

To install, configure and run Hadoop and R.

PROCEDURE:
 Hadoop Installation
Run the following commands in an Ubuntu terminal:
1. Update the package index and install Java (Hadoop requires Java; OpenJDK 8 is used here).
a. sudo apt update
b. sudo apt install openjdk-8-jdk -y
2. Check that Java is installed.
a. java -version
b. javac -version
3. Install the SSH server and client.
a. sudo apt install openssh-server openssh-client -y
4. Create a dedicated Hadoop user.
a. sudo adduser hdoop
b. sudo adduser hdoop sudo
c. su - hdoop
5. Generate an RSA key pair for passwordless SSH.
a. ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
b. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
c. chmod 0600 ~/.ssh/authorized_keys
6. Verify that SSH to localhost works.
a. ssh localhost
7. Download and extract Hadoop.
a. wget https://downloads.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
b. tar xzf hadoop-3.2.3.tar.gz
8. Edit bashrc
a. sudo nano .bashrc
b. Add the following lines at the end of the file
i. export HADOOP_HOME=/home/hdoop/hadoop-3.2.3
ii. export HADOOP_INSTALL=$HADOOP_HOME
iii. export HADOOP_MAPRED_HOME=$HADOOP_HOME
iv. export HADOOP_COMMON_HOME=$HADOOP_HOME
v. export HADOOP_HDFS_HOME=$HADOOP_HOME
vi. export YARN_HOME=$HADOOP_HOME

vii. export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
viii. export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
ix. export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
c. source ~/.bashrc
9. Edit JAVA_HOME
a. sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
b. Add the following line at the end of the file
i. export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
10. Edit core-site
a. sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
b. Add the following lines
i. <property>
ii. <name>hadoop.tmp.dir</name>
iii. <value>/home/hdoop/tmpdata</value>
iv. <description>A base for other temporary directories.</description>
v. </property>
vi. <property>
vii. <name>fs.default.name</name>
viii. <value>hdfs://localhost:9000</value>
ix. <description>The name of the default file system.</description>
x. </property>
11. Edit hdfs-site
a. sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
b. Add the following lines
i. <property>
ii. <name>dfs.namenode.name.dir</name>
iii. <value>/home/hdoop/dfsdata/namenode</value>
iv. </property>
v. <property>
vi. <name>dfs.datanode.data.dir</name>
vii. <value>/home/hdoop/dfsdata/datanode</value>
viii. </property>
ix. <property>
x. <name>dfs.replication</name>
xi. <value>1</value>
xii. </property>
12. Edit mapred-site
a. sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
b. Add the following lines
i. <property>
ii. <name>mapreduce.framework.name</name>
iii. <value>yarn</value>
iv. </property>
13. Edit yarn-site
3
a. sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
b. Add the following lines
i. <property>
ii. <name>yarn.nodemanager.aux-services</name>
iii. <value>mapreduce_shuffle</value>
iv. </property>
v. <property>
vi. <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
vii. <value>org.apache.hadoop.mapred.ShuffleHandler</value>
viii. </property>
ix. <property>
x. <name>yarn.resourcemanager.hostname</name>
xi. <value>127.0.0.1</value>
xii. </property>
xiii. <property>
xiv. <name>yarn.acl.enable</name>
xv. <value>0</value>
xvi. </property>
xvii. <property>
xviii. <name>yarn.nodemanager.env-whitelist</name>
xix. <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
xxi. </property>
14. Launch Hadoop
a. hdfs namenode -format
b. cd $HADOOP_HOME/sbin
c. start-all.sh
15. Verify in the browser.
a. localhost:8088 (YARN ResourceManager UI)
b. localhost:9870 (HDFS NameNode UI)
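16. Optionally, confirm from the terminal that all daemons are running. jps ships with the
JDK and lists running Java processes; after start-all.sh it should show NameNode,
DataNode, SecondaryNameNode, ResourceManager and NodeManager.
a. jps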

 Install R on Windows

Download and install R for Windows from https://cran.r-project.org/bin/windows/base/

OUTPUT:

RESULT:

Thus, the installation and configuration of Hadoop and R have been completed successfully.

Exp No: 2
Implement word count / frequency programs using
Date: MapReduce

AIM:

To implement a word count program using MapReduce.

PROCEDURE:
Run the following commands in an Ubuntu terminal.

1. Create a directory on the Desktop named Lab and, inside it, two folders: one called
“Input” and the other called “tutorial_classes”.
a. cd Desktop
b. mkdir Lab
c. mkdir Lab/Input
d. mkdir Lab/tutorial_classes
2. Place the file “WordCount.java” attached to this document (a sketch is given at the end
of this procedure) in the directory Lab.
3. Place the file “input.txt” attached to this document in the directory Lab/Input.
4. Type the following command to export the hadoop classpath into bash.
a. export HADOOP_CLASSPATH=$(hadoop classpath)
5. Make sure it is now exported.
a. echo $HADOOP_CLASSPATH
6. It is time to create these directories on HDFS rather than locally. Type the following
commands.
a. hadoop fs -mkdir /WordCountTutorial
b. hadoop fs -mkdir /WordCountTutorial/Input
c. hadoop fs -put Lab/Input/input.txt /WordCountTutorial/Input
7. Go to localhost:9870 in the browser and open “Utilities → Browse File System”; you
should see the directories and files we placed in the file system.
8. Back on the local machine, compile WordCount.java (assuming we are currently in the
Desktop directory), then pack the compiled classes into one jar file (note the dot at the
end of the jar command).
a. cd Lab
b. javac -classpath $HADOOP_CLASSPATH -d tutorial_classes WordCount.java
c. jar -cvf WordCount.jar -C tutorial_classes .
9. Now, we run the jar file on Hadoop.
a. hadoop jar WordCount.jar WordCount /WordCountTutorial/Input
/WordCountTutorial/Output
10. Output the result:
a. hdfs dfs -cat /WordCountTutorial/Output/*
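
The WordCount.java referred to in step 2 is not reproduced in this record. For reference, a
minimal sketch of such a program, following the standard Hadoop word-count example
(the mapper emits (word, 1) for every token; the reducer sums the counts per word), is:

WordCount.java

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: splits each input line into tokens and emits (token, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums all counts received for a word and emits (word, total)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner reuses the reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}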

OUTPUT:

RESULT:

Thus, the implementation of word count using MapReduce has been executed successfully.

Exp No: 3
Implement an MR program that processes a Weather
Date: Dataset

AIM:

To implement an MR program that processes a weather dataset.

PROCEDURE:
Run the following commands in an Ubuntu terminal.

1. Download the dataset from ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01


2. Create a Java class named MyMaxMin in the Eclipse IDE.

MyMaxMin.java

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyMaxMin {

    // Mapper: reads one line of the daily weather file, extracts the date and the
    // max/min temperature fields, and emits a labelled record for hot/cold days.
    public static class MaxTemperatureMapper extends
            Mapper<LongWritable, Text, Text, Text> {

        public static final int MISSING = 9999;

        @Override
        public void map(LongWritable arg0, Text Value, Context context)
                throws IOException, InterruptedException {
            String line = Value.toString();
            if (!(line.length() == 0)) {
                String date = line.substring(6, 14);
                float temp_Max = Float.parseFloat(line.substring(39, 45).trim());
                float temp_Min = Float.parseFloat(line.substring(47, 53).trim());
                if (temp_Max > 30.0) {
                    context.write(new Text("The Day is Hot Day :" + date),
                            new Text(String.valueOf(temp_Max)));
                }
                if (temp_Min < 15) {
                    context.write(new Text("The Day is Cold Day :" + date),
                            new Text(String.valueOf(temp_Min)));
                }
            }
        }
    }

    // Reducer: passes through the first temperature value recorded for each key.
    public static class MaxTemperatureReducer extends
            Reducer<Text, Text, Text, Text> {

        public void reduce(Text Key, Iterable<Text> Values, Context context)
                throws IOException, InterruptedException {
            String temperature = Values.iterator().next().toString();
            context.write(Key, new Text(temperature));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "weather example");
        job.setJarByClass(MyMaxMin.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        Path OutputPath = new Path(args[1]);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Delete any previous output so the job can be re-run
        OutputPath.getFileSystem(conf).delete(OutputPath, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

3. Add the Hadoop jars to the project as external jars. Right-click on MyProject -> Build
Path -> Configure Build Path, select Add External JARs..., add the jars from the Hadoop
download location, then click Apply and Close.
4. Export the project as a jar file. Right-click on MyProject, choose Export..., go to
Java -> JAR file, click Next, choose the export destination, then click Next.
5. Choose the Main Class as MyMaxMin by clicking Browse, then click Finish -> OK.
6. Start Hadoop
a. start-all.sh
7. Move dataset to Hadoop HDFS
a. hdfs dfs -put /file_path /destination
b. hdfs dfs -put /home/hadoop/Downloads/CRND0103-2020-AK_Fairbanks_11_NE.txt /
c. hdfs dfs -ls /
8. Now run the jar file with the command below; the output is produced in the MyOutput directory.
a. hadoop jar /home/hadoop/Documents/Project.jar /CRND0103-2020-AK_Fairbanks_11_NE.txt /MyOutput
9. Go to browser – localhost:9870
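10. The result can also be read directly from HDFS (using the output directory chosen above):
a. hdfs dfs -cat /MyOutput/*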

OUTPUT:

RESULT:
Thus, the implementation of an MR program that processes a weather dataset has been
executed successfully.

Exp No: 4
Implement Linear and Logistic Regression
Date:

AIM:

To implement linear and logistic regression.

PROCEDURE:

1. Open R on Windows.
2. Create a new workspace.
3. Create a new script file.
4. Type the code in the script file.
5. Run the script file.
6. Close R.

PROGRAM:

> dataset = read.csv("data-marketing-budget-12mo.csv", header=T,
+ colClasses = c("numeric", "numeric", "numeric"))
> head(dataset,5)
> simple.fit = lm(Sales~Spend, data=dataset)
> summary(simple.fit)
> multi.fit = lm(Sales~Spend+Month, data=dataset)
> summary(multi.fit)
> input <- mtcars[,c("am","cyl","hp","wt")]
> print(head(input))
> am.data = glm(formula = am ~ cyl+hp+wt, data = input, family = binomial)
> print(summary(am.data))
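
Optionally, the fitted models can be used for prediction; a minimal sketch reusing the
objects above (the spend value and the car specification are illustrative only):

> predict(simple.fit, newdata = data.frame(Spend = 1000))
> predict(am.data, newdata = data.frame(cyl = 4, hp = 100, wt = 2.5), type = "response")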

OUTPUT:

RESULT:

Thus, the implementation of linear and logistic regression has been executed successfully.

Exp No: 5 a
Implement SVM Classification Techniques
Date:

AIM:

To implement SVM Classification technique.

PROCEDURE:

1. Open R on Windows.
2. Create a new workspace.
3. Create a new script file.
4. Type the code in the script file.
5. Run the script file.
6. Close R.

PROGRAM:

> library(e1071)
> plot(iris)
> plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species)
> plot(iris$Petal.Length, iris$Petal.Width, col=iris$Species)
> s <- sample(150,100)
> col <- c("Petal.Length", "Petal.Width", "Species")
> iris_train <- iris[s,col]
> iris_test <- iris[-s,col]
> svmfit <- svm(Species ~ ., data = iris_train, kernel = "linear", cost = .1, scale = FALSE)
> print(svmfit)
> plot(svmfit, iris_train[,col])
> tuned <- tune(svm, Species ~ ., data = iris_train, kernel = "linear",
+ ranges = list(cost = c(0.001, 0.01, 0.1, 1, 10, 100)))
> summary(tuned)
> p <- predict(svmfit, iris_test[,col], type="class")
> plot(p)
> table(p, iris_test[,3])
> mean(p == iris_test[,3])
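
As a follow-up, the best model found by tune() can be evaluated the same way; a minimal sketch:

> best <- tuned$best.model
> p2 <- predict(best, iris_test[,col])
> mean(p2 == iris_test$Species)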

OUTPUT:

RESULT:

Thus, the implementation of SVM Classification technique has been executed successfully.

Exp No: 5 b
Implement Decision Tree Classification Techniques
Date:

AIM:
To implement decision tree classification technique.
PROCEDURE:

1. Open R on Windows.
2. Create a new workspace.
3. Create a new script file.
4. Type the code in the script file.
5. Run the script file.
6. Close R.

PROGRAM:

> library(MASS)
> library(rpart)
> head(birthwt)
> hist(birthwt$bwt)
> table(birthwt$low)
> cols <- c('low', 'race', 'smoke', 'ht', 'ui')
> birthwt[cols] <- lapply(birthwt[cols], as.factor)
> set.seed(1)
> train<- sample(1:nrow(birthwt), 0.75 * nrow(birthwt))
> birthwtTree<- rpart(low ~ . - bwt, data = birthwt[train, ], method = 'class')
> plot(birthwtTree)
> text(birthwtTree, pretty = 0)
> summary(birthwtTree)
> birthwtPred<- predict(birthwtTree, birthwt[-train, ], type = 'class')
> table(birthwtPred, birthwt[-train, ]$low)
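
The hold-out accuracy can be computed from the same objects; a minimal sketch:

> mean(birthwtPred == birthwt[-train, ]$low)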
OUTPUT:

RESULT
Thus, the implementation of decision tree classification technique has been executed
successfully.

Exp No: 6
Implement Clustering Techniques
Date:

AIM:
To implement clustering techniques.

PROCEDURE:
1. Open R on Windows.
2. Create a new workspace.
3. Create a new script file.
4. Type the code in the script file.
5. Run the script file.
6. Close R.
PROGRAM:
> library(datasets)
> head(iris)
> library(ggplot2)
> ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
> set.seed(20)
> irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
> irisCluster
> table(irisCluster$cluster, iris$Species)
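
Optionally, the cluster assignments can be plotted with the same ggplot2 call as above to
compare them visually with the species labels:

> ggplot(iris, aes(Petal.Length, Petal.Width, color = factor(irisCluster$cluster))) + geom_point()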
OUTPUT:

RESULT:
Thus, the implementation of clustering techniques has been executed successfully.

Exp No: 7
Visualize data using any plotting framework
Date:

AIM:
To visualize data using any plotting framework in R.
PROCEDURE:
1. Open R on Windows.
2. Create a new workspace.
3. Create a new script file.
4. Type the code in the script file.
5. Run the script file.
6. Close R.
PROGRAM:
1. Histogram
> library(RColorBrewer)
> data(VADeaths)
> par(mfrow=c(2,3))
> hist(VADeaths, breaks=10, col=brewer.pal(3,"Set3"), main="Set3 3 colors")
> hist(VADeaths, breaks=3, col=brewer.pal(3,"Set2"), main="Set2 3 colors")
> hist(VADeaths, breaks=7, col=brewer.pal(3,"Set1"), main="Set1 3 colors")
> hist(VADeaths, breaks=2, col=brewer.pal(8,"Set3"), main="Set3 8 colors")
> hist(VADeaths, col=brewer.pal(8,"Greys"), main="Greys 8 colors")
> hist(VADeaths, col=brewer.pal(8,"Greens"), main="Greens 8 colors")

2. Line Chart
> data(AirPassengers)
> plot(AirPassengers,type="l")

3. Bar Chart
> data("iris")
> barplot(iris$Petal.Length)
> barplot(iris$Sepal.Length,col = brewer.pal(3,"Set1"))
> barplot(table(iris$Species,iris$Sepal.Length),col = brewer.pal(3,"Set1"))

4. Box Plot
> data(iris)
> par(mfrow=c(2,2))
> boxplot(iris$Sepal.Length,col="red")
> boxplot(iris$Sepal.Length~iris$Species,col="red")
> boxplot(iris$Sepal.Length~iris$Species,col=heat.colors(3))
> boxplot(iris$Sepal.Length~iris$Species,col=topo.colors(3))
> boxplot(iris$Petal.Length~iris$Species)

5. Scatter Plot
> plot(x=iris$Petal.Length)
> plot(x=iris$Petal.Length,y=iris$Species)

6. Heat Map
> x <- rnorm(10,mean=rep(1:5,each=2),sd=0.7)
> y <- rnorm(10,mean=rep(c(1,9),each=5),sd=0.1)
> dataFrame<- data.frame(x=x,y=y)
> set.seed(143)
> dataMatrix<-as.matrix(dataFrame)[sample(1:10),]
> heatmap(dataMatrix)

7. Correlogram
> library("corrplot")
> data("mtcars")
> corr_matrix <- cor(mtcars)
> corrplot(corr_matrix)
> corrplot(corr_matrix,method = 'number',type = "lower")
8. Area Chart
> library(dplyr)
> library(ggplot2)
> airquality %>%
+ group_by(Day) %>%
+ summarise(mean_wind = mean(Wind)) %>%
+ ggplot() +
+ geom_area(aes(x = Day, y = mean_wind)) +
+ labs(title = "Area Chart of Average Wind per Day",
+ subtitle = "using airquality data",
+ y = "Mean Wind")
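
9. Saving a Plot to File
Any of the plots above can be written to disk with a base R graphics device; a minimal
sketch (the filename is illustrative only):
> png("airpassengers.png", width = 800, height = 600)
> plot(AirPassengers, type = "l")
> dev.off()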
OUTPUT:

RESULT:
Thus, the visualization of data using a plotting framework has been executed successfully.

Exp No: 8
Implement an application that stores big data in Hbase /
Date: MongoDB / Pig using Hadoop / R

AIM:
To implement an application that stores big data in MongoDB using R.
PROCEDURE:
1. Open R on Windows.
2. Create a new workspace.
3. Create a new script file.
4. Type the code in the script file.
5. Run the script file.
6. Close R.
PROGRAM:
> library(ggplot2)
> library(mongolite)
> library(dplyr)
> crimes=data.table::fread("crimes.csv")
> connection_string="mongodb://localhost:27017/?tls=false&readPreference=primary"
> my_collection = mongo(collection = "crimes", db = "chicago",url=connection_string)
> my_collection$insert(crimes)
> my_collection$count()
> my_collection$iterate()$one()
> df <- as.data.frame(my_collection$find())
> head(df)
> length(my_collection$distinct("Primary Type"))
> my_collection$aggregate('[{"$group":{"_id":"$Location Description", "Count":{"$sum":1}}}]') %>%
+ na.omit() %>%
+ arrange(desc(Count)) %>% head(10) %>%
+ ggplot(aes(x=reorder(`_id`,Count), y=Count)) +
+ geom_bar(stat="identity", color='skyblue', fill='#b35900') +
+ geom_text(aes(label = Count), color = "blue") +
+ coord_flip() + xlab("Location Description")
> crimes = my_collection$find('{}', fields = '{"_id":0, "Primary Type":1, "Year":1}')
> crimes %>% group_by(`Primary Type`) %>% summarize(Count = n()) %>%
+ arrange(desc(Count)) %>% head(4)
OUTPUT:

RESULT:
Thus, the implementation of an application that stores big data in MongoDB using R has
been executed successfully.

