0961771922 KINJAL BANSAL
Experiment 1
Aim: Install Apache Hadoop.
Theory:
Apache Hadoop is a powerful framework designed for the distributed storage and processing of large
datasets using clusters of commodity hardware. Its integration with Ubuntu provides a stable and efficient
environment for big data processing. Here’s how Hadoop relates to Ubuntu:
1. Installation – Apache Hadoop can be installed on Ubuntu by downloading the official Hadoop
binaries or using package managers like apt. Ubuntu’s robust package management system
simplifies the setup of Hadoop clusters.
2. Compatibility – Hadoop runs smoothly on Ubuntu servers and desktops without major
compatibility issues, allowing users to leverage Ubuntu’s stability and efficiency for big data
processing.
3. Resource Management – Ubuntu offers various tools for managing system resources, which is
crucial when running Hadoop clusters. Proper resource management ensures optimal performance
and efficient utilization of cluster resources.
4. Security – Ubuntu provides strong security features, including firewall configurations, user
permissions, and encryption, which help secure Hadoop clusters and protect stored and processed
data.
5. Maintenance – With its regular updates and long-term support (LTS) releases, Ubuntu ensures
the stability and security of Hadoop clusters over extended periods. Updates can be easily applied
to both Ubuntu and Hadoop components for seamless operation.
6. Community Support – Both Hadoop and Ubuntu have active communities that offer extensive
documentation, troubleshooting resources, and support, making it easier for users to resolve
issues and stay updated with new developments.
By using Ubuntu as the operating system for Hadoop, users can take advantage of its reliability, security,
and ease of maintenance to build scalable and efficient big data processing systems.
Steps:
1. Install Hadoop in the virtual machine.
2. Unzip the Hadoop archive.
3. Set up the environment variables.
4. Download the Java SE Development Kit (JDK).
5. Configure Hadoop.
6. Check the Hadoop version (sample terminal commands for steps 1-6 are sketched below).
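The following is a minimal sketch of the terminal commands behind steps 1-6, assuming a single-node setup on Ubuntu. The Hadoop version (3.3.6), the install path /usr/local/hadoop and the JAVA_HOME path are only illustrative and should be adapted to the actual download.
# Steps 1-2: download and unzip Hadoop (version and URL shown as an example; use the current release from hadoop.apache.org)
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop
# Step 3: environment variables, appended to ~/.bashrc (paths are assumptions)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# Step 4: install the JDK (Java 8 shown; Hadoop 3.x also runs on Java 11)
sudo apt update
sudo apt install openjdk-8-jdk
# Step 5: configure Hadoop by editing the files under $HADOOP_HOME/etc/hadoop/
# (core-site.xml, hdfs-site.xml, and JAVA_HOME in hadoop-env.sh)
# Step 6: verify the installation
hadoop version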
Learning Outcome
Experiment 2
Aim: Develop a MapReduce program to calculate the frequency of a given word in a given file.
Theory:
This experiment focuses on developing a MapReduce program using Apache Hadoop to efficiently
compute the frequency of a specified word within a given text file. MapReduce is a programming model
designed for processing large datasets in a distributed manner. By utilizing Hadoop's parallel processing
capabilities, this approach enables efficient text analysis and provides insights into word frequency.
Word frequency analysis is a key task in natural language processing (NLP), involving the determination
of how often each word appears in a document or corpus. Apache Hadoop offers a scalable framework for
distributed computing, making it well-suited for parallel processing tasks like MapReduce. This
experiment employs the MapReduce paradigm to distribute the workload across multiple nodes in a
Hadoop cluster, significantly enhancing the speed and efficiency of text data analysis.
The primary objective of this experiment is to develop and implement a MapReduce program that
calculates the frequency of a given word in a text file. By distributing computations across multiple
nodes, this experiment highlights the scalability and efficiency of Hadoop in handling large-scale text
processing tasks.
Ultimately, the experiment showcases the effectiveness of Apache Hadoop and the MapReduce
framework in processing big data. By leveraging distributed computing, Hadoop enables efficient and
scalable word frequency analysis, making it applicable to various domains requiring large-scale data
processing.
Steps:
Check Hadoop version
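For example, to confirm that the installation from Experiment 1 is on the PATH:
hadoop version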
Create folder named wordcount
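For example (the location is arbitrary; any working directory will do):
mkdir ~/wordcount
cd ~/wordcount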
Install JDK
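On Ubuntu this can be done through apt; Java 8 is shown as an example (Java 11 also works with Hadoop 3.x):
sudo apt install openjdk-8-jdk
javac -version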
Code of the WordCount.java file:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner does a local sum on each mapper's output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path passed on the command line
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Compile the Java program
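One way to compile against the Hadoop libraries, assuming WordCount.java sits in the current folder (the classes/ output directory is just a chosen name):
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes WordCount.java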
Create the JAR file
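For example (wordcount.jar is an arbitrary name):
jar cf wordcount.jar -C classes .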
Copy the input file to Hadoop's HDFS:
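Assuming HDFS is running and the text to analyse is in a local file named input.txt (both the file name and the HDFS path are assumptions for this example):
hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -put input.txt /wordcount/input/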
Run the MapReduce Job
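A sketch of the submission command, reusing the jar name and HDFS paths assumed above; the output directory must not already exist:
hadoop jar wordcount.jar WordCount /wordcount/input /wordcount/output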
Check the output directory, then retrieve and display the word count:
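For example (part-r-00000 is the standard name of the first reducer's output file):
hdfs dfs -ls /wordcount/output
hdfs dfs -cat /wordcount/output/part-r-00000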
Learning Outcome