Bad601 Lab
LAB MANUAL
Module 1: Classification of data, Characteristics, Evolution and definition of Big data, What is Big data, Why Big data, Traditional Business Intelligence Vs Big Data, Typical data warehouse and Hadoop environment. Big Data Analytics: What is Big data Analytics, Classification of Analytics, Importance of Big Data Analytics, Technologies used in Big data Environments, Few Top Analytical Tools, NoSQL, Hadoop.
Module 2: Introduction to Hadoop: Introducing Hadoop, Why Hadoop, Why not RDBMS, RDBMS Vs Hadoop, History of Hadoop, Hadoop overview, Use case of Hadoop, HDFS (Hadoop Distributed File System), Processing data with Hadoop, Managing resources and applications with Hadoop YARN (Yet Another Resource Negotiator). Introduction to Map Reduce Programming: Introduction, Mapper, Reducer, Combiner, Partitioner, Searching, Sorting, Compression.
Module 3: Introduction to MongoDB: What is MongoDB, Why MongoDB, Terms used in RDBMS
and MongoDB, Data Types in MongoDB, MongoDB Query Language.
Module 4: Introduction to Hive: What is Hive, Hive Architecture, Hive data types, Hive file formats,
Hive Query Language (HQL), RC File implementation, User Defined Function (UDF). Introduction
to Pig: What is Pig, Anatomy of Pig, Pig on Hadoop, Pig Philosophy, Use case for Pig, Pig Latin
Overview, Data types in Pig, Running Pig, Execution Modes of Pig, HDFS Commands, Relational
Operators, Eval Function, Complex Data Types, Piggy Bank, User Defined Function, Pig Vs Hive.
Module 5: Spark and Big Data Analytics: Spark, Introduction to Data Analysis with Spark.
| SL NO | EXPERIMENTS |
| --- | --- |
| 1 | Install Hadoop and implement the following file management tasks in Hadoop: adding files and directories, retrieving files, deleting files and directories. Hint: a typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS using one of the above command line utilities. |
| 3 | Develop a Map Reduce program that mines weather data and displays appropriate messages indicating the weather conditions of the day. |
| 7 | Use Hive to create, alter, and drop databases, tables, views, functions, and indexes. |
| 9 | Use CDH (Cloudera Distribution for Hadoop) and HUE (Hadoop User Interface) to analyze data and generate reports for sample datasets. |
| SL NO | EXPERIMENTS |
| --- | --- |
| 1 | Develop a MapReduce program to find the maximum electrical consumption in each year, given the electrical consumption for each month in each year. |
| 2 | Develop a MapReduce program to find the maximum temperature in each year. |
| 3 | Visualize data using basic plotting techniques in Python. |
| 4 | Implement a MapReduce program that processes a dataset. |
| 5 | Develop a MapReduce program to analyze the Uber dataset to find the days on which each base has more trips, using the following dataset. The Uber dataset consists of four columns. |
| 6 | Write queries to sort and aggregate the data in a table using HiveQL. |
| 7 | Develop a MapReduce program to find the number of products sold in each country by considering sales data containing fields like |
To install Hadoop and implement file management tasks (adding, retrieving, and deleting files and
directories) using Python, follow these steps:
---
### **Step 1: Install Hadoop**
1. **Prerequisites**:
- Install Java Development Kit (JDK) 8 or later.
- Ensure SSH is installed and configured.
2. **Download and Extract Hadoop**:
- Download a stable Hadoop release from the Apache website and extract it to a directory of your choice (this becomes `HADOOP_HOME`).
3. **Configure Hadoop**:
- Set environment variables in `~/.bashrc` or `~/.bash_profile`:
```bash
export HADOOP_HOME=/path/to/hadoop-<version>
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/path/to/java
```
- Reload the shell configuration:
```bash
source ~/.bashrc
```
Once HDFS is running, you can manage files from Python with the `hdfs` package (install it with `pip install hdfs`):
```python
from hdfs import InsecureClient

# Connect to HDFS over WebHDFS (port 9870 is the default in Hadoop 3.x)
client = InsecureClient('http://localhost:9870', user='your-username')

def create_directory_in_hdfs(hdfs_path):
    """Create a directory in HDFS."""
    client.makedirs(hdfs_path)
    print(f"Directory {hdfs_path} created in HDFS")

# Example usage
create_directory_in_hdfs('/user/your-username/lab1')
```
2. **Retrieving Files**:
- Use `client.download()` to download files from HDFS to the local filesystem.
3. **Deleting Files and Directories**:
- Use `client.delete()` (with `recursive=True` for directories) to remove paths from HDFS, as shown in the sketch below.
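A minimal sketch of the remaining tasks, reusing the `client` object and `create_directory_in_hdfs()` function defined above; all paths and file names are placeholders, so adjust them to your environment.
```python
def upload_file_to_hdfs(local_path, hdfs_path):
    """Add (upload) a local file to HDFS."""
    client.upload(hdfs_path, local_path)
    print(f"Uploaded {local_path} to {hdfs_path}")

def download_file_from_hdfs(hdfs_path, local_path):
    """Retrieve (download) a file from HDFS to the local filesystem."""
    client.download(hdfs_path, local_path)
    print(f"Downloaded {hdfs_path} to {local_path}")

def delete_from_hdfs(hdfs_path):
    """Delete a file or directory (recursively) from HDFS."""
    client.delete(hdfs_path, recursive=True)
    print(f"Deleted {hdfs_path} from HDFS")

# Example usage (placeholder paths)
upload_file_to_hdfs('app.log', '/user/your-username/lab1/app.log')
print(client.list('/user/your-username/lab1'))
download_file_from_hdfs('/user/your-username/lab1/app.log', 'app_copy.log')
delete_from_hdfs('/user/your-username/lab1')
```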
----------------------------------------------------------------------------------------------------------------------------------------
### **Notes**
- Replace `your-username` with your Hadoop username.
- Ensure Hadoop services (NameNode, DataNode, etc.) are running before executing the script.
Matrix multiplication is a classic problem that can be efficiently parallelized using the MapReduce
programming model. Below is a Python implementation of matrix multiplication using the MapReduce
paradigm. We'll use the mrjob library, which allows us to write MapReduce programs in Python.
-----------------------------------------------------------------------------------------------------------------------------
```python
from mrjob.job import MRJob

class MatrixMultiplication(MRJob):
    def configure_args(self):
        super(MatrixMultiplication, self).configure_args()
        # The matrix dimension N is read as self.options.size inside the mapper and reducer
        self.add_passthru_arg('--size', type=int, help='Size of the matrices (N x N)')

    # The mapper() and reducer() methods are sketched after the explanation below.

if __name__ == '__main__':
    MatrixMultiplication.run()
```
---
In the sample input, the first four lines represent matrix A, and the next four lines represent matrix B.
1. **Mapper**:
- Read each matrix element and emit it, tagged with its source matrix and its index along the shared dimension, once for every output cell `(i, j)` it contributes to.
2. **Reducer**:
- Collect all values for a given key `(i, j)`.
- Separate values from matrix A and matrix B.
- Compute the dot product of the corresponding row from matrix A and column from matrix B.
- Emit the result for the cell `(i, j)`.
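The mapper and reducer bodies are omitted from the skeleton above; the following is a minimal sketch. It assumes each input line carries a single matrix element in the form `matrix,row,col,value` (for example, `A,0,1,3.0`); if your input instead lists whole matrix rows as described above, adapt the parsing in the mapper.
```python
from mrjob.job import MRJob

class MatrixMultiplication(MRJob):
    def configure_args(self):
        super(MatrixMultiplication, self).configure_args()
        self.add_passthru_arg('--size', type=int, help='Size of the matrices (N x N)')

    def mapper(self, _, line):
        # Assumed input format: one element per line, "matrix,row,col,value"
        matrix, i, j, value = line.strip().split(',')
        i, j, value, n = int(i), int(j), float(value), self.options.size
        if matrix == 'A':
            # A[i][j] contributes to every output cell in row i
            for k in range(n):
                yield (i, k), ('A', j, value)
        else:
            # B[i][j] contributes to every output cell in column j
            for k in range(n):
                yield (k, j), ('B', i, value)

    def reducer(self, key, values):
        # Dot product of row i of A with column j of B for output cell (i, j)
        a, b = {}, {}
        for matrix, idx, value in values:
            (a if matrix == 'A' else b)[idx] = value
        yield key, sum(a[k] * b[k] for k in a if k in b)

if __name__ == '__main__':
    MatrixMultiplication.run()
```
Run it locally with, for example, `python matrix_multiplication.py --size 4 matrices.txt` (the file name is a placeholder), or on a Hadoop cluster as noted below.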
---
### **Output**
The output will be the resulting matrix in the format:
```
(i, j) value
```
---
### **Notes**
- The `--size` argument specifies the size of the matrices (N x N).
- This implementation assumes square matrices. For non-square matrices, additional adjustments are
needed.
- The `mrjob` library can run locally for testing or on a Hadoop cluster for distributed processing.
-------------------------------------------------------------------------------------------------------------------------------
3. Develop a Map Reduce program that mines weather data and displays
appropriate messages indicating the weather conditions of the day.
To develop a MapReduce program that mines weather data and displays appropriate messages
indicating the weather conditions of the day, we can use the `mrjob` library in Python. The program
will analyze weather data (e.g., temperature, humidity, precipitation) and generate messages like "Hot
day", "Rainy day", "Cold day", etc.
---
### **Step 1: Install `mrjob`**
Install the `mrjob` library using pip:
```bash
pip install mrjob
```
---
### **Step 2: Weather Data Format**
Assume the weather data is stored in a CSV file with the following columns:
- `Date`: The date of the weather record.
- `Temperature`: The temperature in Celsius.
- `Humidity`: The humidity percentage.
- `Precipitation`: The amount of precipitation in millimeters.
```python
from mrjob.job import MRJob

class WeatherAnalysis(MRJob):
    def mapper(self, _, line):
        # Parse date, temperature, humidity and precipitation from the CSV line
        # and set the base condition from the temperature (complete sketch below),
        # then append extra conditions:
        if precipitation > 0:
            condition += ", Rainy day"
        if humidity > 80:
            condition += ", Humid day"
        yield date, condition
```
---
### **Explanation of the Code**
1. **Mapper**:
- Reads each line of the input file and extracts the date, temperature, humidity, and precipitation.
- Determines the weather condition based on the following rules:
- If temperature > 30°C: "Hot day"
- If temperature < 10°C: "Cold day"
- Otherwise: "Moderate day"
- Adds additional conditions if precipitation > 0 ("Rainy day") or humidity > 80% ("Humid day").
- Emits the date as the key and the weather condition as the value.
2. **Reducer**:
- Combines all weather conditions for the same date.
- Emits the date and the combined weather conditions.
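Putting the mapper and reducer rules above together, a minimal complete version of `WeatherAnalysis` might look like this (assuming comma-separated records with no header row):
```python
from mrjob.job import MRJob

class WeatherAnalysis(MRJob):
    def mapper(self, _, line):
        # Assumes comma-separated records: Date,Temperature,Humidity,Precipitation
        date, temperature, humidity, precipitation = line.strip().split(',')
        temperature = float(temperature)
        humidity = float(humidity)
        precipitation = float(precipitation)
        if temperature > 30:
            condition = "Hot day"
        elif temperature < 10:
            condition = "Cold day"
        else:
            condition = "Moderate day"
        if precipitation > 0:
            condition += ", Rainy day"
        if humidity > 80:
            condition += ", Humid day"
        yield date, condition

    def reducer(self, date, conditions):
        # Combine all conditions reported for the same date
        yield date, "; ".join(conditions)

if __name__ == '__main__':
    WeatherAnalysis.run()
```
Run it locally with, for example, `python weather_analysis.py weather.csv` (the file name is a placeholder).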
---
4. Develop a MapReduce program to find the tags associated with each movie
by analyzing movie lens data.
To develop a MapReduce program to find the tags associated with each movie using MovieLens data, we
need to follow these steps:
1. **Understand the Data**: MovieLens dataset typically contains files like `movies.csv`, `tags.csv`, etc.
The `tags.csv` file contains user-generated tags for movies, with columns like `userId`, `movieId`, `tag`,
and `timestamp`.
2. **MapReduce Overview**:
- **Mapper**: The mapper will process each line of the `tags.csv` file and emit key-value pairs where
the key is the `movieId` and the value is the `tag`.
- **Reducer**: The reducer will collect all tags for each `movieId` and output the `movieId` along with
the list of associated tags.
3. **Implementation**:
Below is a Python implementation using the `mrjob` library, which is a Python package for writing
MapReduce jobs.
```python
from mrjob.job import MRJob
from mrjob.step import MRStep

class MovieTags(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer)
        ]

    def mapper(self, _, line):
        # tags.csv columns: userId,movieId,tag,timestamp
        fields = line.strip().split(',')
        if len(fields) >= 3 and fields[0] != 'userId':  # skip the header row
            yield fields[1], fields[2]

    def reducer(self, movie_id, tags):
        # Collect every tag emitted for this movieId into a single list
        yield movie_id, list(tags)

if __name__ == '__main__':
    MovieTags.run()
```
Save the above code in a file, say `movie_tags.py`, and run it using the following command:
```bash
python movie_tags.py tags.csv > output.txt
```
### Explanation:
- **Mapper**:
- The mapper reads each line from the `tags.csv` file.
- It splits the line into fields and extracts `movieId` and `tag`.
- It emits a key-value pair where the key is `movieId` and the value is `tag`.
- **Reducer**:
- The reducer receives all tags for a particular `movieId`.
- It collects these tags into a list and emits the `movieId` along with the list of tags.
In the output file, each line represents a movie (its `movieId`) and the list of tags associated with it.
This MapReduce program will efficiently process large datasets and provide the tags associated with each
movie in the MovieLens dataset.
To implement MongoDB functions like **Count**, **Sort**, **Limit**, **Skip**, and **Aggregate**
using Python, you can use the `pymongo` library. Below is a step-by-step guide with examples.
---
```python
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client["movielens"]        # example database name; adjust to your data
collection = db["movies"]       # example collection name; adjust to your data
```
---
```python
# Count all documents
total_movies = collection.count_documents({})
print(f"Total movies: {total_movies}")
---
```python
# Sort movies by release year in descending order
sorted_movies = collection.find().sort("year", -1)
```
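`find()` returns a cursor, so iterate over it to see the documents; the `title` field name here is an assumption about the sample schema:
```python
# Print the five most recent movies from the sorted cursor
for movie in sorted_movies.limit(5):
    print(movie.get("title"), movie.get("year"))
```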
---
```python
# Get the first 10 movies
limited_movies = collection.find().limit(10)
```
---
```python
# Skip the first 5 movies and return the rest
skipped_movies = collection.find().skip(5)
```
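Skip and limit are usually combined with sort for pagination; a small sketch (page size 10, second page):
```python
page_size = 10
page = 2  # 1-based page number
page_of_movies = (collection.find()
                  .sort("year", -1)
                  .skip((page - 1) * page_size)
                  .limit(page_size))
for movie in page_of_movies:
    print(movie.get("title"))
```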
---
```python
# Find the average rating for each movie and sort by average rating
pipeline = [
{
"$group": {
"_id": "$movieId",
"averageRating": {"$avg": "$rating"}
}
},
{
"$sort": {"averageRating": -1}
},
{
"$limit": 10 # Limit to top 10 results
}
]
```
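To execute any of these pipelines and inspect the results, pass the pipeline to `collection.aggregate()`:
```python
# Run the aggregation and print each result document
for doc in collection.aggregate(pipeline):
    print(doc)
```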
---
**Example**:
Find the top 5 highest-rated movies, skip the first 2, and return the rest.
```python
pipeline = [
    {
        "$group": {
            "_id": "$movieId",
            "averageRating": {"$avg": "$rating"}
        }
    },
    {"$sort": {"averageRating": -1}},  # highest average rating first
    {"$limit": 5},                     # keep the top 5
    {"$skip": 2}                       # skip the first 2 of those and return the rest
]
```
---
### **Step 5: Additional Aggregation Example**
```python
pipeline = [
{
"$unwind": "$genres" # Unwind the genres array
},
{
"$group": {
"_id": "$genres",
"count": {"$sum": 1}
}
},
{
"$sort": {"count": -1} # Sort by count in descending order
}
]
```
Another example: find movies that have more than 100 ratings, sorted by the number of ratings:
```python
pipeline = [
{
"$group": {
"_id": "$movieId",
"totalRatings": {"$sum": 1}
}
},
{
"$match": {
"totalRatings": {"$gt": 100} # Filter movies with more than 100 ratings
}
},
{
"$sort": {"totalRatings": -1} # Sort by totalRatings in descending order
}
]
```
These examples demonstrate how to use MongoDB functions in Python with the `pymongo` library. You
can adapt these examples to your specific dataset and requirements.
employees.csv
1, John Doe, 101, 50000
2, Jane Smith, 102, 60000
3, Jim Brown, 101, 55000
4, Jake White, 103, 70000
departments.csv
101, HR
102, Engineering
103, Marketing
7. Use Hive to create, alter, and drop databases, tables, views, functions, and
indexes.
Notes
• Make sure to replace the paths and class names with the actual paths and class names for your
UDFs.
• The CASCADE option in the DROP DATABASE command will drop all tables and views in the
database.
• You can run these commands in a Hive shell or through a Hive client that supports HiveQL.
Word count using Hadoop Streaming:
mapper.py
```python
#!/usr/bin/env python
import sys

# Emit each word with a count of 1
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```
reducer.py
```python
#!/usr/bin/env python
import sys

current_word = None
current_count = 0

# Input lines arrive sorted by word, as produced by the shuffle/sort phase
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Output the count for the previous word
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Output the count for the last word
if current_word:
    print(f"{current_word}\t{current_count}")
```
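Submit the streaming job with a command along these lines; the streaming jar path varies by Hadoop version and install location, and the HDFS input and output paths are placeholders:
```bash
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /user/your-username/input -output /user/your-username/output
```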
Word count using Spark:
word_count.py (excerpt; a complete sketch follows below)
```python
from pyspark import SparkContext

def main():
    # Create a SparkContext
    sc = SparkContext("local", "Word Count")
```
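A complete `word_count.py` might look like this; the input path and output directory are placeholders:
```python
#!/usr/bin/env python
from pyspark import SparkContext

def main():
    # Create a SparkContext
    sc = SparkContext("local", "Word Count")
    # Read the input file, split each line into words, and count each word
    counts = (sc.textFile("input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    # Save the results; the output directory must not already exist
    counts.saveAsTextFile("output")
    sc.stop()

if __name__ == '__main__':
    main()
```
Submit it with the `spark-submit` command shown below.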
```bash
spark-submit word_count.py
```
Notes
• For the Hadoop Streaming example, ensure you have Hadoop installed and configured properly.
• For the Spark example, ensure you have Apache Spark installed and configured.
• The input file for both examples should be a text file containing the text you want to analyze.
• The output will be saved in the specified output directory, which will contain the word counts in
text format. If the output directory already exists, you will need to delete it before running the job
again.
Sample Input
Hello world
Hello Hadoop
Hello Spark
Hadoop is great
Spark is great
Sample Output
Hadoop 2
Hello 3
Spark 2
is 2
great 2
world 1
9. Use CDH (Cloudera Distribution for Hadoop) and HUE (Hadoop User Interface) to analyze data and generate reports for sample datasets.
Prerequisites
1. CDH Installation: Ensure that you have CDH installed and running on your cluster.
2. Hue Installation: Ensure that Hue is installed and configured to connect to your Hadoop cluster.
3. Sample Dataset: For this example, we will use a sample dataset, such as a CSV file containing
employee data.
Sample Dataset
Let's assume we have a CSV file named employees.csv with the following content:
employee_id,name,department,salary
1,John Doe,Engineering,70000
2,Jane Smith,Marketing,60000
3,Jim Brown,Engineering,80000
4,Jake White,Sales,50000
5,Emily Davis,Marketing,75000
3. Run the Query: Click on the "Execute" button to run the query.
Using CDH and Hue, you can easily upload datasets, analyze them using Hive, and generate reports. The
Hue interface provides a user-friendly way to interact with Hadoop components, making it easier to perform
data analysis without needing to write complex code.
• Ensure that your CDH and Hue installations are properly configured and that you have the
necessary permissions to create tables and run queries.
• You can also explore other applications in Hue, such as Pig, Impala, and Oozie, for more advanced
data processing and scheduling tasks.
Viva Questions
7. **What is MapReduce?**
- Explain the MapReduce programming model and its phases (Map, Shuffle, Reduce).
13. **What is the difference between batch processing and stream processing?**
17. **What are some common machine learning algorithms used in Big Data?**
- Explain algorithms like linear regression, decision trees, clustering, etc.
22. **What challenges did you face during your project, and how did you overcome
them?**
- Discuss specific problems and your problem-solving approach.
24. **What are some best practices for working with Big Data?**
- Discuss practices related to data governance, security, and performance optimization.